Machine learning problems are a kind of optimization problem: there is no predefined "best for all" algorithm, so you look for the best result by trying different approaches, parameters and data preprocessing. So you are absolutely right to start with the simplest task: taking only one word and a few of its senses.
You wrote: "But I'm just not sure how I can create the feature vector and label for each."
You can use the word counts themselves as the vector components. Enumerate the words and, for each text, write down how many times each word occurs in it. If a word is absent from a text, put a zero. I modified your example a bit to clarify the idea:
    vocab_38201 = {
        "although": 1, "back": 1, "bend": 1, "bicycl": 1, "correct": 1,
        "dig": 1, "general": 1, "handlebar": 1, "hefti": 1, "lever": 1,
        "nt": 2, "quit": 1, "rear": 1, "spade": 1, "sprung": 1,
        "step": 1, "type": 1, "use": 1, "wo": 1
    }
    vocab_38202 = {
        "caus": 1, "ear": 1, "energi": 1, "experi": 1, "inner": 1,
        "light": 1, "nervous": 1, "part": 1, "qualiti": 1, "reach": 1,
        "receptor": 2, "retin": 1, "sensori": 1, "stimul": 2, "system": 2,
        "upon": 2, "wo": 1
    }
Let's wrap the feature vector construction in a function. It assigns an index to every word it sees and records how many times that word occurs in the given vocabulary.
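    from collections import defaultdict

    words = []  # global word list; a word's position is its vector index

    def get_components(vect_dict):
        # Map {word: count} to {word index: count}, extending `words` as needed.
        vect_components = defaultdict(int)
        for word, num in vect_dict.items():
            try:
                ind = words.index(word)
            except ValueError:
                # First occurrence of this word: append it and use the new index.
                ind = len(words)
                words.append(word)
            vect_components[ind] += num
        return vect_components

    vect_comps_38201 = get_components(vocab_38201)
    vect_comps_38202 = get_components(vocab_38202)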
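Let's take a look (the index assigned to each word depends on the dictionary iteration order of your Python version, so your numbers may differ):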
    >>> print(vect_comps_38201)
    defaultdict(<class 'int'>, {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1})
    >>> print(vect_comps_38202)
    defaultdict(<class 'int'>, {32: 1, 33: 2, 34: 1, 7: 1, 19: 2, 20: 2, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1, 31: 1})
    >>> vect_38201 = [vect_comps_38201.get(i, 0) for i in range(len(words))]
    >>> print(vect_38201)
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    >>> vect_38202 = [vect_comps_38202.get(i, 0) for i in range(len(words))]
    >>> print(vect_38202)
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1]
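As a variation on the function above, here is a minimal sketch that keeps a word-to-index dictionary instead of searching a list, and sorts the words so the column order does not depend on dictionary iteration order. The names `build_index`, `to_vector`, `doc_a` and `doc_b` are made up for illustration, and the toy vocabularies are not your real data.

```python
def build_index(*vocabs):
    """Give every distinct word a column index, in sorted order."""
    all_words = sorted(set().union(*vocabs))
    return {word: i for i, word in enumerate(all_words)}


def to_vector(vocab, index):
    """Turn a {word: count} dict into a dense list of counts."""
    vector = [0] * len(index)
    for word, count in vocab.items():
        if word in index:  # skip words that have no column
            vector[index[word]] += count
    return vector


# Hypothetical toy vocabularies, much smaller than the real ones.
doc_a = {"bicycl": 1, "lever": 1, "wo": 1}
doc_b = {"receptor": 2, "stimul": 2, "wo": 1}

index = build_index(doc_a, doc_b)
vec_a = to_vector(doc_a, index)
vec_b = to_vector(doc_b, index)
print(index)
print(vec_a)
print(vec_b)
```

The dictionary lookup is constant time per word, whereas `words.index(word)` scans the whole list, which starts to matter once the vocabulary grows.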
vect_38201 and vect_38202 are the vectors you can use to fit a model:
    from sklearn.svm import SVC

    X = [vect_38201, vect_38202]
    y = [38201, 38202]

    clf = SVC()
    clf.fit(X, y)
    clf.predict([[0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 2, 1]])

Output:
array([38202])
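The vector passed to `predict` above was written by hand. For a real text it has to be built with the same word order that was used for training. A minimal sketch, assuming you already have the stemmed tokens of the new text (`vectorize_tokens`, the short `words` list and the `tokens` here are hypothetical stand-ins):

```python
from collections import Counter


def vectorize_tokens(tokens, words):
    """Count the tokens and lay the counts out in the order of `words`."""
    counts = Counter(tokens)
    # Words never seen during training have no column, so they are dropped.
    return [counts.get(word, 0) for word in words]


# Hypothetical training word list and stemmed tokens of a new text.
words = ["bicycl", "lever", "receptor", "stimul", "wo"]
tokens = ["receptor", "stimul", "receptor", "retin"]

query = vectorize_tokens(tokens, words)
print(query)
```

The result has one entry per training word, so it can be passed as `clf.predict([query])` to a classifier fitted on vectors built from the same `words` list.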
Of course, this is a very simple example that just demonstrates the concept.
What can you do to improve it?
Good luck