Which algorithms / concepts should I study for author prediction?

I am working on something that will try to figure out the author of a column using my own dataset.

I plan to use the mlpy Python library; it has good documentation (about 100 pages in PDF form). I am also open to suggestions for other libraries.

The fact is that I got lost among all the Data Mining and Machine Learning material: too many algorithms, too many concepts.

I'm asking for guidance on which algorithms / concepts I should study, and for pointers relevant to my specific problem.

So far I have created a dataset that looks something like this.

| author | feature x | feature y | feature z | some more features |
|--------|-----------|-----------|-----------|--------------------|
| A      | 2         | 4         | 6         | ..                 |
| A      | 1         | 1         | 5         | ..                 |
| B      | 12        | 15        | 9         | ..                 |
| B      | 13        | 13        | 13        | ..                 |

Now, I will get a new column and analyze it, extracting the same features from it; my goal is then to work out who the author of this column is.

Since I'm not an ML person, the only approach I can think of is computing the distance between the new column's features and each row of the dataset, then picking the closest one. But I'm fairly sure that's not how I should go about it.
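For what it's worth, that nearest-row idea is a real method: it is exactly the 1-nearest-neighbor classifier. A minimal sketch with NumPy, using made-up numbers shaped like the table above:

```python
import numpy as np

# Toy training data shaped like the table above: one row per column/article.
authors = np.array(["A", "A", "B", "B"])
features = np.array([
    [2, 4, 6],
    [1, 1, 5],
    [12, 15, 9],
    [13, 13, 13],
], dtype=float)

def closest_author(new_features):
    """Return the author of the training row nearest to new_features."""
    distances = np.linalg.norm(features - new_features, axis=1)
    return authors[np.argmin(distances)]

print(closest_author(np.array([12.0, 14.0, 10.0])))  # -> B
```

In practice you would scale the features first (so one large-valued feature doesn't dominate the distance) and vote over the k nearest rows rather than just one.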

I would be grateful for any instructions, links, evidence, etc.

4 answers

If you have enough training data, you can use the kNN (k-Nearest Neighbor) classifier for your purpose. It is easy to understand, but powerful.

Check scikits.ann for a possible implementation.

This tutorial is a good reference for the kNN implementation found in scikit-learn.

Edit: here is the scikit-learn kNN page as well. You can pick it up easily from the example there.
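For illustration, a minimal kNN sketch with scikit-learn's `KNeighborsClassifier`; the feature values are invented to match the layout of the dataset in the question:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical rows matching the question's dataset layout.
X_train = [[2, 4, 6], [1, 1, 5], [12, 15, 9], [13, 13, 13]]
y_train = ["A", "A", "B", "B"]

# k is a free parameter; it should be tuned on held-out data.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[12, 14, 10]]))  # -> ['B']
```

With real stylometric features you would want many more training rows per author than the four shown here.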

And mlpy also seems to have kNN.


mlpy implements a wide selection of algorithms, so you should be fine on that front. I agree with Steve L that Support Vector Machines are great, but even though they are easy to use out of the box, their internals are not easy to understand, especially if you are new to ML.

In addition to kNN, you can consider Classification Trees (http://en.wikipedia.org/wiki/Decision_tree_learning) and Logistic Regression (http://en.wikipedia.org/wiki/Logistic_regression).

First, decision trees have the advantage of producing models that are easy to interpret, and therefore easier to debug.

Logistic regression, on the other hand, can give you good results and scales very well as you add more data.

I would say that in your case you should pick whichever algorithm feels most comfortable to work with once you have read up on it. In most cases, any of them will give you very decent results. Good luck!


As already mentioned, there are many algorithms you can use for authorship attribution. kNN is a good starting point. Beyond that, you can try several other algorithms, such as Logistic Regression, the Naive Bayes classifier, and neural networks, which are likely to give more accurate predictions.
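A Naive Bayes sketch with scikit-learn's `GaussianNB` (suitable here because the features are continuous counts; the values are invented). A nice side effect is that it gives per-author probabilities, not just a label:

```python
from sklearn.naive_bayes import GaussianNB

# Invented rows in the question's dataset layout.
X_train = [[2, 4, 6], [1, 1, 5], [12, 15, 9], [13, 13, 13]]
y_train = ["A", "A", "B", "B"]

nb = GaussianNB().fit(X_train, y_train)
print(nb.predict([[2, 3, 5]]))        # -> ['A']
print(nb.predict_proba([[2, 3, 5]]))  # probability per author, in nb.classes_ order
```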

I was also interested in authorship attribution and plagiarism detection; in fact, I used the methods above to attribute authorship of source code. You can learn more about this from the following research papers.

In addition, if you plan to use Python, you can also look at the scikit-learn library (http://scikit-learn.org/stable/). It is an extensive library that comes with good documentation.


Given that you are not familiar with ML, the first three algorithms I would recommend are the following:

1. Logistic Regression
2. Naive Bayes
3. Support Vector Machines

If you are only interested in predictive performance, have enough training data, and have no missing values, you will likely find that more sophisticated methodologies, such as Bayesian networks, do not give statistically significant improvements in predictive performance. In any case, you should start with these three (relatively) simple methodologies and use them as reference baselines.
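A sketch of using all three as baselines, with 5-fold cross-validated accuracy as the yardstick. Since the question's dataset is not available, `make_classification` generates a synthetic stand-in; with real data you would pass in your own feature matrix and author labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for an author-feature dataset: 200 "columns", 10 features.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=2, random_state=0)

baselines = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "SVM": SVC(),
}

for name, model in baselines.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    print(f"{name}: {scores.mean():.2f}")
```

Whichever model wins here becomes the reference standard that any fancier method has to beat.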

