SQL and fuzzy comparison

Suppose we have a table of people (first name, last name, address, SSN, etc.).

We want to find all rows that are “very similar” to a specified person A. I would like to implement some kind of fuzzy-logic comparison of A against all rows of the People table. There will be several fuzzy inference rules working separately on several columns (for example, 3 fuzzy rules for the first name, 2 rules for the last name, 5 rules for the address).

Question: Which of the following two approaches would be better and why?

  • Implement all the fuzzy rules as stored procedures and use one heavy SELECT statement to return all rows that are “very similar” to A. This approach may include using Soundex, SimMetrics, etc.

  • Implement one or more simpler SELECT statements that return less accurate results, i.e. rows that are “somewhat similar” to A, and then fuzzy-compare A with all returned rows (outside the database) to get the “very similar” rows. In this case the fuzzy comparison would be implemented in my favorite programming language.

The People table should have up to 500 thousand rows, and I would like to run about 500-1000 queries like this a day. I am using MySQL (but this is not yet final).

+8
sql mysql select fuzzy-logic fuzzy-comparison
4 answers

I really don't think there is a definitive answer, because it depends on information not available in the question. Anyway, this is too long for a comment.

DBMSs are good at retrieving information using indexes. It does not make sense for the db server to spend time on heavy computation unless it is designed for that specific purpose (as @Adrian replied).

Therefore, your client application should delegate to the DBMS the retrieval of the information that the rules require.

If the calculations are insignificant, everything can be done on the server. Otherwise, pull it into the client system.

The disadvantage of the second approach is the amount of data moving from the server to the client, and the number of connections to establish. So this is usually a trade-off between computing on the server and transmitting data. The right balance depends on the characteristics of the fuzzy rules.

Edit: I saw in a comment that you will almost certainly have to implement the code in the client. In that case, you should consider an additional criterion, code locality, for maintenance purposes, i.e. try to keep all related code together, rather than distributing it between systems (and languages).

+3

I would say that your best bet is to use simple selects to get the closest matches you can without hammering the database, and then do the heavy lifting at your application level. The reason I propose this solution is scalability: if you do the heavy lifting at the application level, your problem is an ideal fit for map-reduce, where you can distribute the similarity processing across nodes and get your results back much faster than if you put it through the database. In addition, this way you are not locking up your database or slowing down any other operations that may be running at the same time.
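A minimal sketch of that two-stage idea, under several assumptions that are not in the question: SQLite stands in for the real database, the table and column names (`people`, `first_name`, `last_name`) are hypothetical, the coarse filter is a deliberately crude first-letter `LIKE`, and `difflib.SequenceMatcher` stands in for the actual fuzzy rules.

```python
import sqlite3
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score (a stand-in for the real fuzzy rules)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_very_similar(conn, first, last, threshold=0.75):
    # Stage 1: cheap coarse filter in the database (assumed rule: same first
    # letter of the last name). It returns "somewhat similar" candidates only.
    rows = conn.execute(
        "SELECT id, first_name, last_name FROM people "
        "WHERE last_name LIKE ? ORDER BY id",
        (last[:1] + "%",),
    ).fetchall()
    # Stage 2: expensive comparison in application code, candidates only.
    matches = []
    for row_id, fn, ln in rows:
        score = min(similarity(first, fn), similarity(last, ln))
        if score >= threshold:
            matches.append((row_id, fn, ln))
    return matches


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO people (first_name, last_name) VALUES (?, ?)",
    [("John", "Smith"), ("Jon", "Smyth"), ("Mary", "Stone"), ("John", "Brown")],
)
result = find_very_similar(conn, "John", "Smith")
```

The second stage only touches the rows the first stage returned, so it can be split across workers without involving the database again. The coarse filter decides what can ever be found: a first-letter rule like this one misses rows whose last name starts with a different letter.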

+2

Since you are still considering which database to use: PostgreSQL has a fuzzystrmatch module that provides Levenshtein and Soundex functions. Alternatively, you can look at the pg_trgm module. You may also be able to put an index on the column using soundex(), so you don't have to calculate it every time. But you seem to be optimizing prematurely, so my advice would be to test with pg first and only then ask whether you need to optimize. The numbers you gave really don't look like much: you have almost two minutes to run one query.
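For reference, if the comparison ends up on the client side, the Levenshtein metric is easy to compute there too. The following is a plain-Python sketch of the textbook dynamic-programming algorithm, not the code fuzzystrmatch uses; the function name is just an illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows we store as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]
```

A fuzzy rule could then be something like `levenshtein(a, b) <= 2`, with the threshold being a tuning choice rather than anything given in the question.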

+1

I would consider adding a column to the People table that holds the SoundEx value for the person.

I have done joins using

SELECT [Column] FROM People P INNER JOIN TableA A ON SOUNDEX(A.ComparisonColumn) = P.SoundexColumn

This returns everything in TableA whose SoundEx value matches the SoundexColumn of the People table.

I have not used such a query on tables of this size, but I see no problem with trying it. You can also index SoundexColumn for better performance.
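A small runnable sketch of the precomputed-column idea, with assumptions stated up front: SQLite stands in for the real database, the schema (`people`, `last_name`, `soundex_col`) is hypothetical, and `soundex` below is a Python version of standard American Soundex registered as a SQL function named `py_soundex`. A database's built-in SOUNDEX may differ in details, so codes should be produced and compared by the same implementation.

```python
import sqlite3

# Letter groups of American Soundex; vowels and y/h/w carry no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name: str) -> str:
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


conn = sqlite3.connect(":memory:")
conn.create_function("py_soundex", 1, soundex)
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, last_name TEXT, soundex_col TEXT)"
)
# Index the precomputed code so lookups do not scan the whole table.
conn.execute("CREATE INDEX idx_people_soundex ON people (soundex_col)")
for name in ["Smith", "Smyth", "Schmidt", "Brown"]:
    conn.execute(
        "INSERT INTO people (last_name, soundex_col) VALUES (?, ?)",
        (name, soundex(name)),
    )

matches = [
    row[0]
    for row in conn.execute(
        "SELECT last_name FROM people WHERE soundex_col = py_soundex(?) ORDER BY id",
        ("Smithe",),
    )
]
```

The code for the search term is computed once per query, while the stored codes are computed once per row at insert time, which is what makes the index useful.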

0
