Unfortunately, doing this in PHP is overly expensive (high CPU and memory usage). However, you can apply the algorithm to small data sets.
To explain in detail how to create a server crisis, several PHP built-in functions will determine the "distance" between the lines: levenshtein and Similar_text .. p>
Dummy data: (pretend these are news headlines)
$ titles = <<< EOF
Apple
Apples
Orange
Oranges
Banana
EOF;
$ titles = explode ("\ n", $ headers);
At this point, $ titles should just be an array of strings. Now create a matrix and compare each heading with EVERY other heading for similarities. In other words, for 5 headers you get a 5 x 5 matrix (25 entries). Where the processor and memory are loading.
This is why this method (via PHP) cannot be applied to thousands of records. But if you want:
$ matches = array ();
foreach ($ titles as $ title) {
$ matches [$ title] = array ();
foreach ($ titles as $ compare_to) {
$ matches [$ title] [$ compare_to] = levenshtein ($ compare_to, $ title);
}
asort ($ matches [$ title], SORT_NUMERIC);
} At the moment, you basically have a matrix with "text distances". In the concept (not in real data), it looks something like the one in the table below. Please note that there is a set of 0 values ββthat go diagonally - this means that in the correspondence cycle two identical words are, well, identical.
Apple Apples Orange Oranges Banana
Apple 0 1 5 6 6
Apples 1 0 6 5 6
Orange 5 6 0 1 5
Oranges 6 5 1 0 5
Banana 6 6 5 5 0
The actual $ matches array looks like this (truncated):
Array
(
[Apple] => Array
(
[Apple] => 0
[Apples] => 1
[Orange] => 5
[Banana] => 6
[Oranges] => 6
)
[Apples] => Array
(
...
In any case, it is up to you (through experiments) to determine what a good numerical distance limit can basically coincide with - and then apply it. Otherwise, read on sphinx-search and use it - since it has PHP libraries.
Orange are you glad you asked about this?