PHP / MySQL - search for elements that have similar or corresponding properties

I am trying to develop a way to take an item with a number of properties and search the database for similar items (by matching as many properties as possible, in the correct order). The idea is that it would then return a percentage showing how similar each item is.

The order of the properties should also be taken into account, so properties at the beginning are more important than those at the end.

For example:

Item 1 - A, B, C, D, E

Item 2 - A, B, C, D, E

This would be a 100% match.

Item 1 - A, B, C, D, E

Item 2 - B, C, A, D, E

This would not be a perfect match, since the properties are in a different order.

Item 1 - A, B, C, D, E

Item 2 - F, G, H, I, A

This would be a low match, since only one property is the same, and it is in position 5.

This algorithm will run against thousands and thousands of records, so it needs to be fast and efficient. Any thoughts on how I can do this in PHP / MySQL quickly and efficiently?

I looked at levenshtein, but as far as I can tell, it would also take into account the spelling distance between two completely different words. It doesn't seem ideal for this scenario, unless I am just using it incorrectly.

Perhaps this can be done exclusively in MySQL, perhaps using full-text search or something like that.

This seems like a good solution, although it is not intended for this scenario. Maybe a binary comparison can be used in some way?

+8
php mysql compare
2 answers

What I would do is encode both the order and the value of each property into a number. Numbers have the advantage of quick comparisons.

This is a general idea and may still need some work, but I hope it helps in some way.

Compute a number (some form of hash) for each property and multiply it by a number representing the position at which the property appears in the item.

Say item1 has 3 properties: A, B and C.

hash(A) = 123, hash(B) = 345, hash(C) = 456

Then multiply each hash by a weight for its order of appearance, given that we know the number of properties:

(hash(A) * 1,000,000) + (hash(B) * 1,000) + (hash(C) * 1) = someval

The multiplier can be changed to suit your data set. You will have to pick the hash function. soundex, maybe?
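
A minimal sketch of this packing step, assuming each property has already been reduced to an integer code smaller than the multiplier base. The names `propertyCode` and `encodeItem` are hypothetical, and `crc32` is only a stand-in hash, so distinct properties can still collide:

```php
<?php
// Hypothetical helper: reduce a property to a code in the range 1..($base - 1).
// Assumes 64-bit PHP, where crc32() returns a non-negative integer.
function propertyCode(string $property, int $base = 1000): int
{
    return (crc32($property) % ($base - 1)) + 1;
}

// Pack the per-property codes into one integer, with the first property
// in the highest digits. With base 1000, a 64-bit integer has room for
// at most six properties before it overflows.
function encodeItem(array $codes, int $base = 1000): int
{
    $value = 0;
    foreach ($codes as $code) {
        $value = $value * $base + $code;
    }
    return $value;
}

// The hash values from the example above.
echo encodeItem([123, 345, 456]), "\n"; // 123345456
```

Two items with the same properties in the same order produce the same integer, so exact matches can be found with a plain equality test on an indexed integer column.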

The problem now comes down to uniqueness, because of hash collisions, but we can at least be certain about properties that do not match.

This has the advantage of making it relatively simple to check whether a property appears in another item at a different position, by using the multiplier to extract each property's hash from the generated number.

HTH.

Edit: match-checking example

Given item1 (a b c) and item2 (a b c): the computed hashes of the two items will be equal. This is the best case; no further calculation is required.

Given item1 (a b c) and item2 (d e a): the computed hashes of the two items are not equal, so move on to breaking them down into property hashes...

Say the hash table for the properties is a = 1, b = 2, c = 3, d = 4, e = 5, with 10^n as the multiplier. The computed hash for item1 is 123, and for item2 it is 451. Split each computed hash back into its per-property hashes, so that item1 becomes (1 2 3) and item2 becomes (4 5 1), compare all combinations of properties between the two items, and then calculate the score.

Another way to look at this is that you are comparing properties one by one, except that this time you are working with numbers instead of the actual string values.
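
A rough sketch of the split-and-compare step, using the single-digit codes from the example above. The scoring rule is not specified in this answer, so the one used here is an assumption: earlier positions in the first item get larger weights, and a property found at a different position in the second item earns partial credit that shrinks with the distance. The function names are hypothetical, and properties are assumed to be distinct within an item:

```php
<?php
// Split a packed number back into its per-property codes.
function decodeItem(int $value, int $count, int $base = 10): array
{
    $codes = [];
    for ($i = 0; $i < $count; $i++) {
        array_unshift($codes, $value % $base);
        $value = intdiv($value, $base);
    }
    return $codes;
}

// Assumed scoring rule: position $i of $a carries weight (n - $i);
// a code found in $b at position $j earns weight * (1 - |i - j| / span).
function similarity(array $a, array $b): float
{
    $n = count($a);
    if ($n === 0) {
        return 0.0;
    }
    $span = max($n, count($b));
    $total = 0.0;
    $earned = 0.0;
    foreach (array_values($a) as $i => $code) {
        $weight = $n - $i;
        $total += $weight;
        $j = array_search($code, array_values($b), true);
        if ($j !== false) {
            $earned += $weight * (1 - abs($i - $j) / $span);
        }
    }
    return 100.0 * $earned / $total;
}

$item1 = decodeItem(123, 3); // [1, 2, 3]
$item2 = decodeItem(451, 3); // [4, 5, 1]
printf("%.1f%%\n", similarity($item1, $item1)); // identical items score 100.0%
printf("%.1f%%\n", similarity($item1, $item2)); // only "a" is shared, and it has moved
```

Checking the packed integers for equality first, as described above, means this per-property comparison only has to run for items that are not exact matches.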

+2

You could draw inspiration (or flat-out borrow an algorithm) from the various sequence alignment algorithms, such as Smith-Waterman. In fact, what you are describing sounds like the very definition of sequence alignment. However, I am not sure whether this can be done as an SQL query.
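
As an illustration, here is a small sketch of Smith-Waterman local alignment applied to lists of properties in PHP. The scoring parameters (match 2, mismatch -1, linear gap penalty 1) and the normalisation to a percentage are assumptions chosen for the example, and the function names are hypothetical. Note that plain Smith-Waterman does not weight early positions more heavily than late ones:

```php
<?php
// Best local alignment score between two property lists (Smith-Waterman
// with a linear gap penalty), keeping only two rows of the matrix.
function alignmentScore(array $a, array $b, int $match = 2, int $mismatch = -1, int $gap = 1): int
{
    $a = array_values($a);
    $b = array_values($b);
    $cols = count($b);
    $prev = array_fill(0, $cols + 1, 0);
    $best = 0;
    for ($i = 1; $i <= count($a); $i++) {
        $curr = [0];
        for ($j = 1; $j <= $cols; $j++) {
            $diag = $prev[$j - 1] + ($a[$i - 1] === $b[$j - 1] ? $match : $mismatch);
            $curr[$j] = max(0, $diag, $prev[$j] - $gap, $curr[$j - 1] - $gap);
            $best = max($best, $curr[$j]);
        }
        $prev = $curr;
    }
    return $best;
}

// Assumed normalisation: divide by the score of a perfect match
// over the longer of the two lists.
function alignmentPercent(array $a, array $b, int $match = 2): float
{
    $longest = max(count($a), count($b));
    if ($longest === 0) {
        return 0.0;
    }
    return 100.0 * alignmentScore($a, $b, $match) / ($match * $longest);
}

$item = ['A', 'B', 'C', 'D', 'E'];
printf("%.0f%%\n", alignmentPercent($item, ['A', 'B', 'C', 'D', 'E'])); // identical lists: 100%
printf("%.0f%%\n", alignmentPercent($item, ['B', 'C', 'A', 'D', 'E'])); // reordered: below 100%
printf("%.0f%%\n", alignmentPercent($item, ['F', 'G', 'H', 'I', 'A'])); // one shared property: low
```

This runs in PHP after fetching candidate rows, since the dynamic-programming table is awkward to express in a single SQL query.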

+1