Classic approach:
- Collect a representative sample of input texts, each labeled as related / unrelated.
- Divide the sample into training and test sets.
- Extract every distinct term from the documents in the training set; call this set the vocabulary, V.
- Convert each document in the training set to a vector of Boolean elements, where the i-th element is true / 1 if the i-th term in the vocabulary occurs in the document.
- Feed the vectorized training set to a learning algorithm.
Now, to classify a document, vectorize it in the same way as in step 4 and pass it to the classifier to get a related / unrelated label for it. For documents in the test set, compare this with the actual label to check that the classifier did the right thing. This simple method usually gets you to about 80% accuracy or better.
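The steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the toy documents, the whitespace tokenizer, and the perceptron (standing in for "a learning algorithm", which the approach does not pin down) are my own choices, and the function names are hypothetical. The data is too small to hold out a test set, so two unseen strings are classified instead.

```python
def tokenize(text):
    # Assumption: lowercase + whitespace split is good enough for a demo.
    return text.lower().split()

def build_vocabulary(docs):
    # Step 3: every distinct term in the training set, in a fixed order.
    return sorted({term for doc in docs for term in tokenize(doc)})

def vectorize(doc, vocabulary):
    # Step 4: the i-th element is 1 if the i-th vocabulary term occurs in doc.
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocabulary]

def score(x, weights, bias):
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(x, weights, bias):
    # 1 = related, 0 = unrelated
    return 1 if score(x, weights, bias) > 0 else 0

def train_perceptron(vectors, labels, epochs=20):
    # Step 5: a simple perceptron, used here only as a stand-in learner.
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            error = y - predict(x, weights, bias)
            if error:
                weights = [w + error * xi for w, xi in zip(weights, x)]
                bias += error
    return weights, bias

# Toy training data (made up for illustration): 1 = related to Python, 0 = not.
train_docs = [
    "python list comprehension syntax",
    "best pizza recipe with cheese",
    "python dictionary lookup speed",
    "cheap flights to paris",
]
train_labels = [1, 0, 1, 0]

vocabulary = build_vocabulary(train_docs)
train_vectors = [vectorize(doc, vocabulary) for doc in train_docs]
weights, bias = train_perceptron(train_vectors, train_labels)

# Classify unseen documents the same way: vectorize, then predict.
for doc in ["python syntax question", "pizza with cheese"]:
    print(doc, "->", predict(vectorize(doc, vocabulary), weights, bias))
```

Terms that never appeared in the training set (such as "question" here) are simply ignored by `vectorize`, because they have no position in the vocabulary.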
To improve on this method, replace the Booleans with term counts normalized by the length of the document or, even better, with tf-idf weights.
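As a rough sketch of that weighting, the snippet below computes length-normalized term frequencies and multiplies them by an inverse document frequency. It assumes the plain `log(N / df)` form of idf, which is only one of several common variants, and the helper names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split, as a stand-in tokenizer.
    return text.lower().split()

def term_frequencies(doc):
    # Term count normalized by the length of the document.
    tokens = tokenize(doc)
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {term: n / len(tokens) for term, n in counts.items()}

def inverse_document_frequencies(docs):
    # idf(t) = log(N / df(t)), where df(t) is the number of docs containing t.
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(len(docs) / df) for term, df in doc_freq.items()}

def tfidf_vector(doc, vocabulary, idf):
    # Same positions as the Boolean vector, but weighted instead of 0 / 1.
    tf = term_frequencies(doc)
    return [tf.get(term, 0.0) * idf[term] for term in vocabulary]

docs = ["python list syntax", "python pizza", "pizza recipe"]
idf = inverse_document_frequencies(docs)
vocabulary = sorted(idf)

vector = tfidf_vector(docs[0], vocabulary, idf)
for term, weight in zip(vocabulary, vector):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus "python" appears in two of the three documents, so it gets a lower weight than "list" or "syntax", which appear in only one; that down-weighting of common terms is the point of the idf factor.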