NLP / Machine Learning Text Comparison

I am currently developing a program that compares a short text (say, 250 characters) against a collection of similar texts (about 1000-2000 texts).

The goal is to determine whether text A is similar to one or more texts in the collection and, if so, to retrieve the matching texts by identifier. Each text will have a unique identifier.

I would like the result to be as follows:

Option 1: Text A matched text B with 90% similarity, text C with 70% similarity, etc.

Option 2: Text A matched text D with the highest similarity

I studied some machine learning at school, but I'm not sure which algorithm is best for this problem, or whether I should consider using NLP (I'm not familiar with the subject).

Does anyone have a suggestion on which algorithm to use, or where I can find scientific literature to solve my problem?

Thanks for any input!

+8
compare machine-learning nlp
2 answers

I found a great article on measuring semantic similarity, which is perfect for my problem.

WordNet-based semantic similarity

Thanks for all the input!

+4

This is not a machine learning problem; you are simply looking for a text similarity measure. Once you have chosen one, you just sort your data by the resulting scores.
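A minimal sketch of that "score, then sort" idea, assuming Python and using `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. The `rank_by_similarity` and `ratio` helpers and the sample texts are made up for illustration; any other measure that returns a number can be plugged in instead.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # SequenceMatcher.ratio() returns a float in [0, 1]; 1.0 means identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_by_similarity(query, collection, similarity):
    """Return (identifier, score) pairs, most similar first.

    `collection` maps each unique identifier to its text (hypothetical layout).
    """
    scored = [(text_id, similarity(query, text))
              for text_id, text in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up sample data, only to show the shape of the result.
collection = {
    "B": "the quick brown fox jumps over the lazy dog",
    "C": "a quick brown dog jumps over the fox",
    "D": "completely unrelated sentence about taxes",
}
text_a = "The quick brown fox jumped over the lazy dog"

ranking = rank_by_similarity(text_a, collection, ratio)

# Option 1: every text with its similarity.
for text_id, score in ranking:
    print(f"Text A matched text {text_id} with {score:.0%} similarity")

# Option 2: only the best match.
best_id, best_score = ranking[0]
print(f"Best match: text {best_id}")
```

With 1000-2000 short texts, a plain loop like this should be fast enough for most measures; a cutoff on the score can be added if only close matches are wanted.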

Depending on your texts, you can use one of the following measures (list from the wiki) or define your own:

  • Hamming distance
  • Levenshtein distance and Damerau-Levenshtein distance
  • Needleman-Wunsch distance or Sellers algorithm
  • Smith-Waterman distance
  • Gotoh distance or Smith-Waterman-Gotoh distance
  • Monge-Elkan distance
  • Block distance or L1 distance or city block distance
  • Jaro-Winkler distance
  • Soundex distance metric
  • Simple matching coefficient (SMC)
  • Dice's coefficient
  • Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
  • Tversky index
  • Overlap coefficient
  • Euclidean distance or L2 distance
  • Cosine similarity
  • Variational distance
  • Hellinger distance or Bhattacharyya distance
  • Information radius (Jensen-Shannon divergence)
  • Skew divergence
  • Confusion probability
  • Tau metric, an approximation of the Kullback-Leibler divergence
  • Fellegi-Sunter metric (SFS)
  • Maximal matches
  • Lee distance
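To make two of these concrete, here is a rough sketch of the Levenshtein distance (character level) and the Jaccard coefficient (word level) in plain Python. The function names are my own, and splitting on whitespace for Jaccard is an assumption; character n-grams would work as well.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Rescale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a, b):
    """Size of the intersection over size of the union of the word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


print(levenshtein("kitten", "sitting"))
print(jaccard("the cat sat", "the cat ran"))
```

Edit distances tend to suit texts that differ by typos or small edits, while set-based measures such as Jaccard ignore word order and suit texts that share vocabulary.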

Some of the above (for example, cosine similarity) require converting your data into a vector representation. This too can be done in many ways, the simplest being bag of words / tf-idf.
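As an illustration of the vector route, the sketch below builds tf-idf weighted bag-of-words vectors and compares them with cosine similarity, using only the standard library. The whitespace tokenizer, the smoothed idf formula and all function names are assumptions chosen for brevity; a library vectorizer would normally replace most of this.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_idf(documents):
    """Inverse document frequency for every term in the collection."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    # One common smoothed variant; other idf formulas exist.
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse vector as a dict; terms unseen in the collection are dropped."""
    counts = Counter(tokenize(text))
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


# Made-up sample collection.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply",
]
idf = build_idf(documents)
vectors = [tfidf_vector(doc, idf) for doc in documents]

query = tfidf_vector("the cat sat", idf)
scores = [cosine(query, vec) for vec in vectors]
print(scores)
```

The idf weights are computed once for the whole collection, so each new query only needs one vector and one pass over the stored vectors.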

The list itself is far from complete; it is only a rough draft of such methods. In particular, there are many string kernels that are also suitable for measuring text similarity. For example, the WordNet kernel can measure semantic similarity based on one of the most comprehensive semantic databases for the English language.

+19