Comparison of NLP / Machine Learning Text Messages

Question

I am currently developing a program that compares a short text (say, 250 characters) against a collection of similar texts (about 1000-2000 texts).

The goal is to determine whether text A is similar to one or more texts in the collection, and if so, to retrieve the matching text from the collection by its identifier. Each text will have a unique identifier.

I would like the result to be as follows:

Option 1: Text A matched text B with 90% similarity, text C with 70% similarity, etc.

Option 2: Text A matched text D with the highest similarity

I studied some machine learning at school, but I'm not sure which algorithm fits this problem best, or whether I should consider using NLP (I'm not familiar with the subject).

Does anyone have a suggestion on which algorithm to use, or where I can find scientific literature to solve my problem?

Thanks for any input!

+8

compare machine-learning nlp

RobertH Aug 26 '13 at 8:28

2 answers

This is not really a machine learning problem; you are just looking for a text similarity measure. Once you have chosen one, you simply sort your data according to the resulting scores.

Depending on your texts, you can use one of the following metrics (list from the wiki) or define your own:

Hamming distance
Levenshtein distance and Damerau-Levenshtein distance
Needleman-Wunsch distance or Sellers algorithm
Smith-Waterman distance
Gotoh distance or Smith-Waterman-Gotoh distance
Monge-Elkan distance
Block distance or L1 distance or city block distance
Jaro-Winkler distance
Soundex distance metric
Simple matching coefficient (SMC)
Dice's coefficient
Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
Tversky index
Overlap coefficient
Euclidean distance or L2 distance
Cosine similarity
Variational distance
Hellinger distance or Bhattacharyya distance
Information radius (Jensen-Shannon divergence)
Skew divergence
Confusion probability
Tau metric, an approximation of the Kullback-Leibler divergence
Fellegi and Sunter's metric (SFS)
Maximal matches
Lee distance
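
To illustrate how one of the measures listed above could be turned into a ranked result, here is a minimal sketch using Levenshtein distance. The function names (levenshtein, similarity, rank) and the normalization by the longer string's length are assumptions made for illustration, not part of the original answer; any other metric from the list could be plugged in the same way.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the distance to a 0..1 score (assumed normalization); 1.0 = identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def rank(query, collection):
    """Return (identifier, score) pairs for a dict of id -> text, best match first."""
    scores = [(text_id, similarity(query, text))
              for text_id, text in collection.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling rank(text_a, collection) gives the full list for "Option 1" in the question, and the first element of that list is the single best match for "Option 2". With 1000-2000 texts of about 250 characters each, a plain loop like this should be fast enough, though that is an estimate rather than a measurement.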

Some of the above (for example, cosine similarity) require converting your data into a vectorized format first. This can also be done in many ways, the simplest being bag of words / tf-idf.

The list itself is far from complete; it is just a rough outline of such methods. In particular, there are many string kernels that are also suitable for measuring text similarity. For example, the WordNet kernel can measure semantic similarity based on one of the most comprehensive semantic databases of the English language.
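
As a rough sketch of the vectorization step, the following builds bag-of-words tf-idf vectors and compares them with cosine similarity using only the standard library. The whitespace tokenizer, the smoothed idf formula, and the sample texts are assumptions chosen to keep the example short; a real system would likely use a proper tokenizer or an existing library.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(texts):
    """Build one sparse tf-idf vector (dict of word -> weight) per text."""
    tokenized = [tokenize(t) for t in texts]
    n = len(tokenized)
    # Document frequency: in how many texts does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed idf, so words occurring in every text keep a positive weight.
    idf = {word: math.log((1 + n) / (1 + df)) + 1
           for word, df in doc_freq.items()}
    return [{w: count * idf[w] for w, count in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


texts = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stock prices fell sharply",
]
vectors = tfidf_vectors(texts)
print(cosine(vectors[0], vectors[1]))  # high: most words are shared
print(cosine(vectors[0], vectors[2]))  # 0.0: no words in common
```

Unlike edit distances, this compares texts by the words they share rather than by character order, so it tolerates reordered sentences but treats synonyms as unrelated.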

+19

lejlot Aug 26 '13 at 8:45

RobertH · Accepted Answer · 2013-08-26T12:32:32+0000

I found a great article on measuring semantic similarity, which is perfect for my problem.

WordNet-based semantic similarity

Thanks for all the input!
