I am working on something that will try to figure out the author of a column using my own dataset.
I plan on using the mlpy Python library. It has good documentation (about 100 pages in PDF format). I am also open to suggestions for another library.
The fact is that I got lost in the Data Mining and Machine Learning concepts. Too much work, too many algorithms and concepts.
I am asking for guidance on which algorithms and concepts I should study and search for, given my specific problem.
So far I have created a dataset that looks something like this.
| author | feature x | feature y | feature z | some more features |
|--------+-----------+-----------+-----------+--------------------|
| A      |         2 |         4 |         6 | ..                 |
| A      |         1 |         1 |         5 | ..                 |
| B      |        12 |        15 |         9 | ..                 |
| B      |        13 |        13 |        13 | ..                 |
Now, I will get a new column and analyze it, after which I will have all the features for that column, and my goal is to find out who its author is.
Since I'm not an ML guy, I can only think of computing the distance between the new column's features and the features of every row, and choosing the author of the closest row. But I'm sure this is not the right way to go.
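To make the distance idea concrete, here is a minimal sketch in plain Python (no mlpy calls). It uses the four rows from the table above with only the first three features. The feature values for the new column, and the helper names `euclidean` and `nearest_author`, are made up for illustration.

```python
import math

# Rows from the example table (first three features only).
training = [
    ("A", [2, 4, 6]),
    ("A", [1, 1, 5]),
    ("B", [12, 15, 9]),
    ("B", [13, 13, 13]),
]

def euclidean(u, v):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_author(features, rows):
    """Return the author of the row whose features are closest."""
    author, _ = min(rows, key=lambda row: euclidean(features, row[1]))
    return author

# Hypothetical feature values for a new, unlabeled column.
new_column = [3, 3, 6]
predicted = nearest_author(new_column, training)
print(predicted)
```

With real data the features would probably need to be put on a comparable scale first, since otherwise a feature with large values dominates the distance.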
I would be grateful for any pointers, links, references, etc.