How to use Liblinear

I am new to machine learning and text processing in general. I came across a Ruby library called Liblinear: https://github.com/tomz/liblinear-ruby-swig

What I want to do for now is write software that determines whether a text mentions anything related to bicycles.

Can someone please outline the steps I have to follow (for example, whether and how to preprocess the text), share resources, and ideally share a simple example to get me started?

Any help is appreciated, thanks!

1 answer

Classic approach:

  • Collect a representative sample of input texts, each labeled as related / unrelated.
  • Split the sample into a training set and a test set.
  • Extract all terms from all documents in the training set; call this the vocabulary, V.
  • Convert each document in the training set to a vector of Boolean elements, where the i-th element is true / 1 if the i-th term in the vocabulary occurs in the document, and false / 0 otherwise.
  • Feed the vectorized training set to the learning algorithm.
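The data-preparation steps above can be sketched in plain Ruby. This is only an illustration under stated assumptions: the sample texts, the 1/0 labels, the regex tokenizer, the 67% split, and all method names are made up for the example and are not part of Liblinear or its Ruby bindings. The resulting vectors and labels are what you would then hand to the learning algorithm.

```ruby
# Toy labeled sample (made-up data): [text, label], 1 = related to bicycles, 0 = unrelated.
SAMPLE = [
  ["I fixed the flat tire on my bicycle", 1],
  ["The bike chain needs new oil", 1],
  ["Stock markets fell sharply today", 0],
  ["She baked bread this morning", 0],
  ["New bicycle lanes opened downtown", 1],
  ["The election results are in", 0]
].freeze

# Lowercase the text and split it into alphabetic terms (a deliberately naive tokenizer).
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Shuffle with a fixed seed so the split is reproducible, then cut into train/test.
def split_sample(sample, train_fraction: 0.67, seed: 42)
  shuffled = sample.shuffle(random: Random.new(seed))
  cut = (shuffled.size * train_fraction).round
  [shuffled.first(cut), shuffled.drop(cut)]
end

# Vocabulary V: every distinct term in the training texts, mapped to its index.
def build_vocabulary(training_set)
  terms = training_set.flat_map { |text, _label| tokenize(text) }.uniq.sort
  terms.each_with_index.to_h
end

# Boolean vector: element i is 1 if vocabulary term i occurs in the text, else 0.
# Terms that are not in the vocabulary are ignored.
def vectorize(text, vocabulary)
  vector = Array.new(vocabulary.size, 0)
  tokenize(text).each do |term|
    index = vocabulary[term]
    vector[index] = 1 if index
  end
  vector
end

train_set, test_set = split_sample(SAMPLE)
vocabulary    = build_vocabulary(train_set)
train_vectors = train_set.map { |text, _label| vectorize(text, vocabulary) }
train_labels  = train_set.map { |_text, label| label }

puts "vocabulary size: #{vocabulary.size}"
puts "training documents: #{train_vectors.size}, test documents: #{test_set.size}"
```

Note that the vocabulary is built from the training set only, so a test document is vectorized against terms the classifier has already seen.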

Now, to classify a document, vectorize it in the same way as in step 4 (using the vocabulary built from the training set) and pass it to the classifier to get a related / unrelated label for it. Do this for every document in the test set and compare the predicted labels with the actual ones to measure how well the classifier works. This simple method can often reach roughly 80% accuracy.
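The comparison against the actual labels can be sketched like this. The held-out texts are made up, and `keyword_rule` is a hypothetical stand-in for the trained classifier (a simple keyword check, not a learned model and not a Liblinear call); only the `accuracy_of` computation is the point of the example.

```ruby
# Fraction of documents whose predicted label matches the actual label.
def accuracy_of(predicted, actual)
  raise ArgumentError, "label lists differ in length" unless predicted.size == actual.size
  return 0.0 if actual.empty?

  correct = predicted.zip(actual).count { |p, a| p == a }
  correct.to_f / actual.size
end

# Hypothetical stand-in for a trained classifier: labels a text as related (1)
# if it contains one of a few bicycle keywords, otherwise unrelated (0).
BIKE_KEYWORDS = %w[bicycle bike cycling].freeze
keyword_rule = lambda do |text|
  (text.downcase.scan(/[a-z]+/) & BIKE_KEYWORDS).empty? ? 0 : 1
end

# Made-up held-out documents with their actual labels.
HELD_OUT = [
  ["My bike has a flat tire", 1],
  ["Cycling to work is fun", 1],
  ["The soup was too salty", 0],
  ["She pedaled up the hill", 1]
].freeze

predicted = HELD_OUT.map { |text, _label| keyword_rule.call(text) }
actual    = HELD_OUT.map { |_text, label| label }

puts format("accuracy on held-out set: %.2f", accuracy_of(predicted, actual))
```

With a real model you would replace `keyword_rule` with a call that takes the document's vector and returns the predicted label.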

To improve this method, replace the Booleans with the term count normalized by the length of the document or, even better, with tf-idf weights.
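As a rough sketch of both weightings: term frequency here is the term count divided by the document length, and idf is the natural log of N / df, which is one common variant among several. The method names and the tokenized toy documents are assumptions made for the example, and `Array#tally` requires Ruby 2.7 or later.

```ruby
# Term frequency: count of each term divided by the number of tokens in the document.
def term_frequencies(tokens)
  return {} if tokens.empty?

  tokens.tally.transform_values { |count| count.to_f / tokens.size }
end

# Inverse document frequency: log(N / df), where N is the number of documents
# and df is the number of documents that contain the term.
def inverse_document_frequencies(tokenized_docs)
  doc_count = tokenized_docs.size
  df = Hash.new(0)
  tokenized_docs.each { |tokens| tokens.uniq.each { |term| df[term] += 1 } }
  df.transform_values { |n| Math.log(doc_count.to_f / n) }
end

# tf-idf vector over an ordered list of vocabulary terms.
# Terms missing from the document (or from the idf table) get weight 0.0.
def tfidf_vector(tokens, vocabulary_terms, idf)
  tf = term_frequencies(tokens)
  vocabulary_terms.map { |term| tf.fetch(term, 0.0) * idf.fetch(term, 0.0) }
end

# Made-up, already tokenized training documents.
docs = [
  %w[bike bike road],
  %w[road car]
]
vocabulary_terms = docs.flatten.uniq.sort
idf = inverse_document_frequencies(docs)

docs.each do |tokens|
  puts tfidf_vector(tokens, vocabulary_terms, idf).map { |w| w.round(3) }.inspect
end
```

A term that occurs in every document (here `road`) gets an idf of zero, so it no longer influences the classifier, while terms concentrated in a few documents are weighted up.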
