Using the K Neighbors Classifier and Linear SVM from scikit-learn for word sense disambiguation

I am trying to use a linear SVM and the K Neighbors Classifier for word sense disambiguation (WSD). Here is a piece of the data that I use to train the classifier:

<corpus lang="English"> <lexelt item="activate.v"> <instance id="activate.v.bnc.00024693" docsrc="BNC"> <answer instance="activate.v.bnc.00024693" senseid="38201"/> <context> Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with . </context> </instance> <instance id="activate.v.bnc.00044852" docsrc="BNC"> <answer instance="activate.v.bnc.00044852" senseid="38201"/> <answer instance="activate.v.bnc.00044852" senseid="38202"/> <context> For neurophysiologists and neuropsychologists , the way forward in understanding perception has been to correlate these dimensions of experience with , firstly , the material properties of the experienced object or event ( usually regarded as the stimulus ) and , secondly , the patterns of discharges in the sensory system . Qualitative Aspects of Experience The quality or modality of the experience depends less upon the quality of energy reaching the nervous system than upon which parts of the sensory system are <head>activated</head> : stimulation of the retinal receptors causes an experience of light ; stimulation of the receptors in the inner ear gives rise to the experience of sound ; and so on . Muller nineteenth - century doctrine of specific energies formalized the ordinary observation that different sense organs are sensitive to different physical properties of the world and that when they are stimulated , sensations specific to those organs are experienced . It was proposed that there are endings ( or receptors ) within the nervous system which are attuned to specific types of energy , For example , retinal receptors in the eye respond to light energy , cochlear endings in the ear to vibrations in the air , and so on . </context> </instance> ..... 

The difference between the training and test data is that the test data has no answer tag. For each instance I built a dictionary storing the words that are neighbors of the <head> word, using a window size of 10. If there are several answer tags for one instance, I consider only the first. I also built a set holding the entire vocabulary of the training file, so that I can compute a vector for each instance. For example, if the global vocabulary is [a, b, c, d, e] and one instance contains the words [a, a, d, d, e], then the resulting vector for that instance is [2, 0, 0, 2, 1]. Here is a segment of the dictionary that I built for each word:

 { "activate.v": { "activate.v.bnc.00024693": { "instanceId": "activate.v.bnc.00024693", "senseId": "38201", "vocab": { "although": 1, "back": 1, "bend": 1, "bicycl": 1, "correct": 1, "dig": 1, "general": 1, "handlebar": 1, "hefti": 1, "lever": 1, "nt": 2, "quit": 1, "rear": 1, "spade": 1, "sprung": 1, "step": 1, "type": 1, "use": 1, "wo": 1 } }, "activate.v.bnc.00044852": { "instanceId": "activate.v.bnc.00044852", "senseId": "38201", "vocab": { "caus": 1, "ear": 1, "energi": 1, "experi": 1, "inner": 1, "light": 1, "nervous": 1, "part": 1, "qualiti": 1, "reach": 1, "receptor": 2, "retin": 1, "sensori": 1, "stimul": 2, "system": 2, "upon": 2 } }, ...... 

Now I just need to produce the input for the K Neighbors Classifier and Linear SVM from scikit-learn to train the classifiers. But I just don't know how to build the feature vector and label for each instance. My understanding is that the label should be a tuple of the instance id and the senseid from the answer tag, but I'm not sure beyond that. Should I group together all vectors of the same word that share the same instance id and senseid in the answer tag? There are about 100 words and hundreds of instances for each word, so how should I deal with this?

In addition, this vector is only one of the features; I will need to add more features later, for example synsets, hypernyms, hyponyms, etc. How am I supposed to do this?
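For reference, here is roughly what I have in mind for those extra features, using NLTK's WordNet interface (just a sketch of the idea, not working feature extraction yet):

    from nltk.corpus import wordnet as wn

    # Candidate extra features for the target lemma "activate" (a verb):
    # its synsets, plus the hypernyms and hyponyms of each synset.
    for synset in wn.synsets("activate", pos=wn.VERB):
        print(synset.name(),
              [h.name() for h in synset.hypernyms()],
              [h.name() for h in synset.hyponyms()])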

Thanks in advance!

2 answers

Machine learning problems are a kind of optimization problem where there is no predefined "best for all" algorithm; instead, you search for the best result by trying different approaches, parameters and data preprocessing. So you are absolutely right to start with the simplest task - taking only one word and a few of its senses.

But I just don't know how to build the feature vector and label for each instance.

You can use the word counts themselves as vector components. Enumerate all the words of the vocabulary and record how many times each word occurs in each text; if a word is absent, store a zero. I modified your example a bit to clarify the idea:

    vocab_38201 = {
        "although": 1, "back": 1, "bend": 1, "bicycl": 1, "correct": 1,
        "dig": 1, "general": 1, "handlebar": 1, "hefti": 1, "lever": 1,
        "nt": 2, "quit": 1, "rear": 1, "spade": 1, "sprung": 1,
        "step": 1, "type": 1, "use": 1, "wo": 1
    }

    vocab_38202 = {
        "caus": 1, "ear": 1, "energi": 1, "experi": 1, "inner": 1,
        "light": 1, "nervous": 1, "part": 1, "qualiti": 1, "reach": 1,
        "receptor": 2, "retin": 1, "sensori": 1, "stimul": 2, "system": 2,
        "upon": 2,
        "wo": 1  # added so the two vocabularies have at least one common word
    }

Now let's build the feature vectors: enumerate all the words and note how many times each word occurs in each dictionary.

    from collections import defaultdict

    words = []  # global word list; a word's position in it is its component index

    def get_components(vect_dict):
        """Map a {word: count} dict to {component_index: count}."""
        vect_components = defaultdict(int)
        for word, num in vect_dict.items():
            try:
                ind = words.index(word)
            except ValueError:      # first time we see this word: assign it a new index
                ind = len(words)
                words.append(word)
            vect_components[ind] += num
        return vect_components

    vect_comps_38201 = get_components(vocab_38201)
    vect_comps_38202 = get_components(vocab_38202)

Let's take a look:

    >>> print(vect_comps_38201)
    defaultdict(<class 'int'>, {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1})
    >>> print(vect_comps_38202)
    defaultdict(<class 'int'>, {32: 1, 33: 2, 34: 1, 7: 1, 19: 2, 20: 2, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1, 31: 1})
    >>> vect_38201 = [vect_comps_38201.get(i, 0) for i in range(len(words))]
    >>> print(vect_38201)
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    >>> vect_38202 = [vect_comps_38202.get(i, 0) for i in range(len(words))]
    >>> print(vect_38202)
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1]

These vect_38201 and vect_38202 are the vectors you can use to fit the model:

    from sklearn.svm import SVC

    X = [vect_38201, vect_38202]
    y = [38201, 38202]

    clf = SVC()  # for a linear SVM, use SVC(kernel='linear') or LinearSVC instead
    clf.fit(X, y)
    clf.predict([[0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 2, 1]])

Output:

 array([38202]) 
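The same X and y can be plugged into the K Neighbors Classifier, which the question also asks about; a minimal sketch (n_neighbors=1 only because there are just two training vectors here):

    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier(n_neighbors=1)  # only two training points, so one neighbor
    knn.fit(X, y)
    knn.predict([vect_38202])  # sanity check: returns array([38202])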

Of course, this is a very simple example that just demonstrates the concept.

What can you do to improve it?

  • Normalize vector coordinates.

  • Use the excellent Tf-Idf vectorizer to extract features from the text (see the sketch after this list).

  • Add additional data.
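As a rough illustration of the Tf-Idf point, a minimal sketch (the two context strings below are shortened placeholders for the real instance contexts):

    from sklearn.feature_extraction.text import TfidfVectorizer

    contexts = [
        "hefty spade bicycle type handlebars sprung lever step",
        "receptors nervous system stimulation retinal inner ear light",
    ]
    senses = [38201, 38202]

    vec = TfidfVectorizer()
    X = vec.fit_transform(contexts)  # sparse matrix: one row per instance, one column per term
    # X and senses can then be fed to SVC or KNeighborsClassifier just like the hand-built vectors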

Good luck


The next step is to implement a linear classifier over a high-dimensional feature space.

Unfortunately, I do not have access to this dataset, so this is a little theoretical. I can suggest the following approach:

Collect all the data into one CSV file like this:

    SenseId,Word,Text,IsHyponim,Properties,Attribute1,Attribute2, ...
    30821,"BNC","For neurophysiologists and ...","Hyponym sometype",1,1
    30822,"BNC","Do you know what it is ...","Antonym type",0,1
    ...

Then you can use sklearn tools:

    import pandas as pd
    df = pd.read_csv('file.csv')

    # Encode the categorical 'Properties' column as one-hot features
    from sklearn.feature_extraction import DictVectorizer
    enc = DictVectorizer()
    X_train_categ = enc.fit_transform(df[['Properties']].to_dict('records'))

    # Tf-Idf features from the raw text; min_df=5 throws out all terms that
    # appear in fewer than 5 documents - typos and so on
    from sklearn.feature_extraction.text import TfidfVectorizer
    vec = TfidfVectorizer(min_df=5)
    v = vec.fit_transform(df['Text'])

    # Join all data together as one sparse matrix; only numeric columns can go
    # into csr_matrix - the text columns are already handled by the vectorizers
    from scipy.sparse import csr_matrix, hstack
    train = hstack((csr_matrix(df.loc[:, 'Attribute1':'Attribute2']), X_train_categ, v))
    y = df['SenseId']

    # Here you have a matrix with really huge dimensionality - tens of
    # thousands of columns. You may use Ridge regression to deal with it:
    from sklearn.linear_model import Ridge
    r = Ridge(random_state=241, alpha=1.0)

    # prepare the test data the same way as the training data
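To round this off, a minimal sketch of how the fit/predict step might look, continuing the snippet above (the test matrix is hypothetical, built from the test CSV with the same fitted transformers):

    from sklearn.linear_model import RidgeClassifier

    # RidgeClassifier treats the sense ids as discrete class labels, which suits
    # this task better than running plain Ridge regression on numeric ids.
    clf = RidgeClassifier(alpha=1.0, random_state=241)
    clf.fit(train, y)

    # 'test' is assumed to be built from the test CSV with the SAME fitted
    # transformers (enc.transform / vec.transform), so the columns line up:
    # predicted_senses = clf.predict(test)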

Read more about Ridge and RidgeClassifier.

There are other methods for dealing with the high dimensionality as well.

There is also sample code for text classification using sparse feature matrices.

