How to search for text fragments in a database

Are there any open source or commercial tools available that can index database content and be queried from Java?

The background is a large MySQL database table with several hundred thousand records, containing several VARCHAR columns. In these columns people would like to search for fragments of the content, so a full-text index (which is based on word boundaries) would not help.

EDIT: [Added to clarify why the obvious first suggestions do not solve the problem:]

This is why MySQL's built-in full-text index will not do the job, and neither will Lucene nor Sphinx, all of which were suggested in the answers. I already looked at both, but as far as I can tell they are based on indexing words, excluding stop words and doing all kinds of sensible things for a real full-text search. However, that does not fit here, because I may search for a term like "oison", which must match "Roisonic Street" as well as "Poison-Ivy". The key difference is that the search term is just a fragment of the column contents and does not have to be delimited by any special characters or whitespace.

EDIT2: [Additional information added:] The requested feature that should be implemented on top of this is a very loose search for item descriptions in a merchandise management system. Users often do not know the correct item number, but only part of the item's name. Unfortunately, the quality of these descriptions is rather poor; they come from a legacy system and cannot easily be changed. If, for example, people were looking for a sledge hammer, they would enter "sledge". With a word/token-based index this would not find matches stored as "sledgehammer", but only those listed as "sledge hammer". There are all kinds of weird variations like this that need to be covered, which makes a token-based approach impractical.

Currently, the only thing we can do is a LIKE '%searchterm%' query, effectively disabling any index usage and consuming a lot of resources and time.

Ideally, any such tool would create an index that allows me to get results for such queries very quickly, so that I can implement a spotlight-like search, only retrieving the "real" data from the MySQL table via the primary key when a user selects a result record.

If possible, the index should be updatable (without needing a full rebuild), because the data can change and should be searchable immediately by other clients.

I would be happy to get recommendations and/or experience reports.

EDIT3: [Commercial solution found that "just works":] Even though I got a lot of good answers for this question, I would like to note that in the end we went with a commercial product called "QuickFind", made and sold by a German company named "HMB Datentechnik". Please note that I am not affiliated with them in any way, even though it may look like that when I go on telling you what their product can do. Unfortunately, their website looks rather bad and is German only, but the product itself is really great. I currently have a trial version from them - you will have to contact them, there is no download - and I am very impressed.

As there is no comprehensive documentation available online, I will try to describe my experiences so far.

What they do is build a custom index file based on the database content. They can integrate via ODBC, but from what I was told, customers rarely do that. Instead - and this is what we will probably do - you generate a text export (such as CSV) from your primary database and feed that to their indexer. This makes you completely independent of the actual table structure (or of any SQL database at all); in fact, we export data joined together from several tables. Indexes can later be incrementally updated on the fly.

Based on that index, their server (a mere 250 KB or so, running as a console application or Windows service) listens for queries on a TCP port. The protocol is text-based and looks a bit "old", but it is simple and it works. Basically you just pass in which of the available indexes you want to query and the search terms (fragments), separated by spaces. Three output formats are available: an HTML/JavaScript array, XML, or CSV. I am currently working on a Java wrapper for the somewhat dated wire protocol.

The results are fantastic: I currently have a test data set of approximately 500,000 records with 8 columns indexed, and my test application triggers a search across all 8 columns for the content of a JTextField on every keystroke while editing, and can update the result display (a JTable) in real time! This happens without ever going back to the MySQL instance the data originally came from. Regardless of which columns are returned, you can ask for the "original" record by querying MySQL with the primary key of that row (which, of course, has to be included in the QuickFind index).
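Since I have not found public documentation for the wire protocol, the following is only a rough sketch of what such a Java wrapper could look like, assuming a hypothetical line-based request/response exchange; host, port, index name and the CSV-style response format are all assumptions for illustration, not QuickFind's actual protocol:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical wrapper around a line-based TCP search protocol.
    // The request/response format shown here is an assumption, not the real QuickFind protocol.
    public class FragmentSearchClient {

        private final String host;
        private final int port;

        public FragmentSearchClient(String host, int port) {
            this.host = host;
            this.port = port;
        }

        /** Sends the index name and space-separated fragments, returns the raw result lines. */
        public List<String> search(String indexName, String fragments) throws Exception {
            try (Socket socket = new Socket(host, port);
                 PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {

                out.println(indexName + " " + fragments);   // assumed request format
                List<String> rows = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null && !line.isEmpty()) {
                    rows.add(line);                          // e.g. CSV: primaryKey;col1;col2;...
                }
                return rows;
            }
        }
    }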

The index is about 30-40% of the size of the text export of the data. Indexing was mainly bound by disk I/O; my 500,000 records took about a minute or two to process.

It is hard to describe, because I found it hard to believe myself when I saw their product demo. They loaded an address database with 10 million rows, searched for fragments of names, addresses and phone numbers, and when the "Search" button was clicked, results came back in under a second - all done on a notebook! From what I was told, they often integrate with SAP or CRM systems to improve search times when call center agents only catch fragments of a caller's name or address.

Anyway, I probably cannot describe it much better. If you need something like this, you should definitely check it out. Google Translate does a reasonably good job of translating their website from German into English, so this may be a good start.

+5
10 answers

I have not had this specific requirement myself, but my experience tells me that Lucene can do the trick, though probably not standalone. I would definitely use it through Solr, as described by Michael Della Bitta in the first answer. The link he gave was on target - read it for more background.

In short, Solr lets you define custom FieldTypes. These consist of an index-time analyzer and a query-time analyzer. Analyzers figure out what to do with the text, and each consists of a Tokenizer and zero to many TokenFilters. The Tokenizer splits your text into chunks, and then each TokenFilter can add, subtract, or modify tokens.

The field can thus end up indexing something quite different from the original text, including multiple tokens if necessary. What you want is a multiple-token copy of your original text, which you query by sending Lucene something like "my_ngram_field:sledge". No wildcards involved :-)

Then you follow a model similar to the prefix search suggested in solrconfig.xml:

 <fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
     <analyzer type="index">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory" />
         <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
     </analyzer>
     <analyzer type="query">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory" />
     </analyzer>
 </fieldType>

EdgeNGramFilterFactory is how they implement prefix matching for search box autocomplete. It takes the tokens coming from the previous stages (single whitespace-delimited words made lowercase) and fans them out into every substring on the leading edge: sledgehammer = s, sl, sle, sled, sledg, sledge, sledgeh, etc.

You need to follow this pattern, but replace the EdgeNGramFilterFactory with your own filter that produces all NGrams in the field. The stock org.apache.solr.analysis.NGramFilterFactory is a good starting point, but it does letter permutations for spell checking. You could copy it and strip that out - it is a pretty simple class to implement.
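To make the difference concrete, here is a small illustrative Java sketch (not Lucene/Solr code) of the tokens an edge-n-gram filter emits versus the tokens an all-n-grams filter would have to emit:

    import java.util.LinkedHashSet;
    import java.util.Set;

    public class NGramDemo {

        /** Leading-edge n-grams, roughly what EdgeNGramFilterFactory emits. */
        static Set<String> edgeNGrams(String token, int min, int max) {
            Set<String> grams = new LinkedHashSet<>();
            for (int len = min; len <= Math.min(max, token.length()); len++) {
                grams.add(token.substring(0, len));
            }
            return grams;
        }

        /** All n-grams (every substring of the given lengths) - what the custom filter needs. */
        static Set<String> allNGrams(String token, int min, int max) {
            Set<String> grams = new LinkedHashSet<>();
            for (int len = min; len <= Math.min(max, token.length()); len++) {
                for (int start = 0; start + len <= token.length(); start++) {
                    grams.add(token.substring(start, start + len));
                }
            }
            return grams;
        }

        public static void main(String[] args) {
            System.out.println(edgeNGrams("sledgehammer", 1, 20));
            // [s, sl, sle, sled, sledg, sledge, ...] - prefixes only
            System.out.println(allNGrams("sledgehammer", 3, 20));
            // also contains "edg", "geh", "hammer", ... so infix fragments like "oison" can match
        }
    }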

Once you have your own FieldType (call it ngram_text) using your own MyNGramFilterFactory, just create your original field and the ngram field like so:

 <field name="title" type="text" indexed="true" stored="true"/>
 <field name="title_ngrams" type="ngram_text" indexed="true" stored="false"/>

Then tell it to copy the original field into the fancy one:

 <copyField source="title" dest="title_ngrams"/> 

Alright, now when you search for "title_ngrams:sledge", you should get back a list of documents that contain this. Then, in your field list for the query, you simply tell it to retrieve the field called title rather than the field title_ngrams.
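For the Java side, a minimal sketch of that query using the SolrJ client might look like this (the core URL is an assumption and the API shown is the SolrJ 7/8-era HttpSolrClient; the field names match the example above):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class NGramQueryExample {
        public static void main(String[] args) throws Exception {
            // Core URL is a placeholder; adjust it to your Solr installation.
            try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/items").build()) {
                SolrQuery query = new SolrQuery("title_ngrams:sledge");
                query.setFields("title");   // return the stored title, not the ngram copy
                query.setRows(20);
                QueryResponse response = solr.query(query);
                for (SolrDocument doc : response.getResults()) {
                    System.out.println(doc.getFieldValue("title"));
                }
            }
        }
    }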

That should be enough of a nudge for you to be able to fit things together and tune it to amazing performance levels rather easily. At an old job we had a database with over ten million products with large HTML descriptions, and we managed to get Lucene to do both the standard query and the spellcheck in under 200 ms on a mid-sized server handling several dozen simultaneous queries. When you have a lot of users, caching kicks in and makes it scream!

Oh, and incremental (though not real-time) indexing is a cinch. It can even do it under high load, since it creates and optimizes the new index in the background and auto-warms it before swapping it in. Very slick.

Good luck

+4

This may not be what you want to hear, because I assume you are trying to solve this with SQL code, but Lucene would be my first choice. You can also build fairly clever ranking and boosting techniques with additional tools. Lucene is written in Java, so it should give you exactly the interface you need.

If you were a Microsoft shop, most of what you are looking for is built into SQL Server, and wildcards can be enabled, which will give you the ability to do partial word matches.

In Lucene and Lucene.Net you can use wildcard matches if you like. However, wildcards as the first character of a search are not supported out of the box. If you want first-character wildcards, you will probably need to implement some sort of trie-based index on your own, since in many cases it is an expensive operation to filter the set of terms down to something reasonable for the kind of index most commonly needed for full-text search applications, where suffix stemming is generally more valuable.

Apparently you can alter the Query Parser instance in Lucene to override this rule by setting setAllowLeadingWildcard to true.
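A hedged sketch of what that looks like with the classic QueryParser (recent Lucene API assumed; the field name "title" is just an example), keeping in mind that leading-wildcard queries can be very slow:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    public class LeadingWildcardExample {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser("title", new StandardAnalyzer());
            // By default leading wildcards are rejected; this switch allows "*oison*".
            parser.setAllowLeadingWildcard(true);
            Query query = parser.parse("*oison*");
            System.out.println(query);   // prints the parsed wildcard query
            // Note: such queries enumerate many index terms and can be expensive.
        }
    }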

I am fairly sure that wildcard-on-both-ends searches are inherently inefficient. Skip lists are sometimes used to improve performance for such searches on plain text, but I think you are more likely to find such an implementation in something like grep than in a generalized text indexing tool.

There are other solutions for the problem you describe, where one word may occur spelled as two, or vice versa. Fuzzy queries are supported in Lucene, for instance. Orthographic and morphological variants can be handled either with a filter that offers suggestions based on some sort of Bayesian mechanism, or with indexing tricks, namely taking a corpus of frequent variants and stuffing the index with those terms. I have even seen knowledge from structured data stuffed into the full-text engine (e.g. adding the city name and the word "hotel" to records from the hotel table, to make it more likely that "Paris hotels" will include a record for the pension house Caisse des Dépôts). While not exactly a trivial problem, it is manageable without destroying the advantages of word-based searches.
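As an illustration of the fuzzy query part, here is a small self-contained sketch against an in-memory index (Lucene 8.x/9.x-style API assumed; field names and terms are made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;

    public class FuzzyQueryExample {
        public static void main(String[] args) throws Exception {
            ByteBuffersDirectory dir = new ByteBuffersDirectory();
            try (IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("title", "sledgehammer 5kg", Field.Store.YES));
                writer.addDocument(doc);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // An edit distance of up to 2 lets the misspelled "sledghammer" still match.
                FuzzyQuery query = new FuzzyQuery(new Term("title", "sledghammer"), 2);
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("title"));
                }
            }
        }
    }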

+10

If your table is MyISAM, you can use MySQL's full-text search capabilities: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

If not, then the "industry standard" is http://www.sphinxsearch.com/

Some ideas on what to do if you are using InnoDB: http://www.mysqlperformanceblog.com/2009/09/10/what-to-do-with-mysql-full-text-search-while-migrating-to-innodb/

Also a good presentation that introduces Sphinx and explains its architecture and usage: http://www.scribd.com/doc/2670976/Sphinx-High-Performance-Full-Text-Search-for-MySQL-Presentation

Update:
After reading your clarification of the question - Sphinx can do substring matching. You need to enable "enable-star" and create an infix index with an appropriate min_infix_len (1 will give you all possible substrings, but obviously the higher you set it, the smaller your index will be and the faster your searches). See http://sphinxsearch.com/docs/current.html for details.
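As a rough illustration (directive names as found in older Sphinx versions; the index name, source and paths are placeholders, so check the documentation linked above), the relevant part of sphinx.conf might look like this:

    index items
    {
        source          = items_src
        path            = /var/data/sphinx/items
        enable_star     = 1     # allow *fragment* style wildcard queries
        min_infix_len   = 1     # index all infixes; larger values shrink the index
        infix_fields    = title, description
    }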

+3

I would use Apache Solr. The indexing strategy is entirely tunable (see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters ), it can incrementally read directly from your database to populate the index (see DataImportHandler in the same wiki), and it can be queried from basically any language that speaks HTTP and XML or something like JSON.

+3

What you're trying to do is unlikely to ever be much faster than LIKE '%searchterm%' without a great deal of custom code. The equivalent of LIKE 'searchterm%' should be trivial, though. You could do what you ask by building an index of all possible partial words that are not covered by the trailing wildcard, but this would result in an unbelievably large index size, and it would be unusually slow for updates. Long tokens would result in Bad Things™. May I ask why you need this? Re: Spotlight... You do realize that Spotlight doesn't do this, right? It's token-based just like every other full-text index. Usually, query expansion is the appropriate method for getting inexact matches if that is your goal.

Edit:

I had a project much like this at one point: part numbers for all sorts of stuff. We finally settled on searchterm* in Xapian, but I believe Lucene also has an equivalent. You won't find a good solution that handles wildcard searches on both sides of a token, but a trailing wildcard is usually more than good enough for what you want, and I suspect you will find that users adapt to your system fairly quickly if they have any control over cleaning up the data. Combine it with query expansion (or even limited token expansion) and you should be pretty well set. Query expansion would convert a query for "sledgehammer" into "sledgehammer* OR (sledge* hammer*)" or something similar. Not every query will work, but people are already fairly well trained to try related queries when something doesn't work, and as long as at least one or two obvious queries come up with the results they expect, you should be OK.

Your best bet is still to clean up the data and organize it better. You would be surprised how easy this ends up being if you version everything and implement an egalitarian edit policy. Maybe let people add keywords to an entry and be sure to index them, but put limits on how many can be set. Too many and you may actually degrade the search results.
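A tiny, purely illustrative helper for that kind of query expansion might look like this (the output uses Lucene/Xapian-style syntax; adapt it to whatever engine you end up with):

    public class QueryExpander {

        /**
         * Expands "sledge hammer" into "(sledge* AND hammer*) OR sledgehammer*"
         * so that both the split and the concatenated spelling can match.
         * Purely illustrative; the syntax depends on your search engine.
         */
        static String expand(String userQuery) {
            String[] terms = userQuery.trim().toLowerCase().split("\\s+");
            if (terms.length == 1) {
                return terms[0] + "*";
            }
            StringBuilder split = new StringBuilder("(");
            StringBuilder joined = new StringBuilder();
            for (int i = 0; i < terms.length; i++) {
                if (i > 0) split.append(" AND ");
                split.append(terms[i]).append("*");
                joined.append(terms[i]);
            }
            split.append(")");
            return split + " OR " + joined + "*";
        }

        public static void main(String[] args) {
            System.out.println(expand("sledge hammer"));
            // (sledge* AND hammer*) OR sledgehammer*
        }
    }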

+2

How about using the tools suggested above (Lucene etc.) for full-text indexing and a LIKE search for the cases where nothing is found? (i.e. run LIKE only after the full-text indexed search has returned zero results)
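A rough sketch of that fallback, assuming some indexed search behind an interface and a JDBC connection (table and column names are placeholders):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class FallbackSearch {

        /** Placeholder for whatever indexed search you use (Lucene, Solr, Sphinx, ...). */
        interface IndexedSearch {
            List<Long> findIds(String searchTerm) throws Exception;
        }

        /** Tries the indexed search first and only falls back to LIKE when it finds nothing. */
        static List<Long> search(IndexedSearch index, Connection db, String term) throws Exception {
            List<Long> ids = index.findIds(term);
            if (!ids.isEmpty()) {
                return ids;
            }
            // Slow path: full table scan with LIKE '%term%'.
            List<Long> fallback = new ArrayList<>();
            String sql = "SELECT id FROM items WHERE description LIKE ?";
            try (PreparedStatement ps = db.prepareStatement(sql)) {
                ps.setString(1, "%" + term + "%");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        fallback.add(rs.getLong("id"));
                    }
                }
            }
            return fallback;
        }
    }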

+2

A shingle search could do the trick.

http://en.wikipedia.org/wiki/W-shingling

For example, if you use 3-character shingles, you can split "Roisonic" into "roi", "son", "ic" and store all three values, associating them with the original entry. When searching for "oison", you would search for "ois", "iso", "son". First you fuzzily match all entries by shingles (finding the one containing "son"), and then you can refine the search using exact string matching.

Note that a 3-character shingle requires the fragment in the query to be at least 5 characters long, a 4-character shingle requires a 7-character query, and so on.
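A small Java sketch of the scheme described above (non-overlapping shingles for indexing, overlapping shingles for the query, then an exact check); this is an illustration only, not a complete indexing solution:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ShingleDemo {

        /** Non-overlapping shingles used for indexing: "roisonic" -> [roi, son, ic]. */
        static List<String> indexShingles(String text, int size) {
            List<String> result = new ArrayList<>();
            String s = text.toLowerCase();
            for (int i = 0; i < s.length(); i += size) {
                result.add(s.substring(i, Math.min(i + size, s.length())));
            }
            return result;
        }

        /** Overlapping shingles used for querying: "oison" -> [ois, iso, son]. */
        static Set<String> queryShingles(String fragment, int size) {
            Set<String> result = new HashSet<>();
            String s = fragment.toLowerCase();
            for (int i = 0; i + size <= s.length(); i++) {
                result.add(s.substring(i, i + size));
            }
            return result;
        }

        public static void main(String[] args) {
            List<String> stored = indexShingles("Roisonic", 3);   // [roi, son, ic]
            Set<String> query = queryShingles("oison", 3);        // [ois, iso, son]

            // Fuzzy pre-filter: a record is a candidate if it shares a shingle with the query...
            boolean candidate = stored.stream().anyMatch(query::contains);
            // ...and candidates are then confirmed by an exact substring check.
            boolean match = candidate && "Roisonic".toLowerCase().contains("oison");
            System.out.println(match);   // true
        }
    }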

+2

The exact answer to your question is right here. Whether it will work well enough for the size of your data is another question.

+1

I am quite sure MySQL offers a full-text option, and it is probably also possible to use Lucene.

See here for relevant comments.

Best effective way to create full-text searches in MySQL

0

A "real" full-text index using parts of words would be many times larger than the source text, and while the search might be faster, any update or insert processing would be horribly slow.

Your only hope is if there is some sort of pattern to the "mistakes made". You could apply a set of "AI"-type rules to the incoming text and produce a canonical form of the text which you could then apply a full-text index to. An example of such a rule could be to split a word ending in "hammer" into two words, s/(\w?)(hammer)/\1 \2/g, or to change "sledg", "sled" and "schledge" into "sledge". You would have to apply the same set of rules to the query text. That way, a product described as a "sledgehammer" could be matched by a search for "sledge hammer".
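A minimal sketch of such a canonicalization step in Java, using the example rules above (the rules themselves are obviously just illustrations and would have to come from analyzing your own data):

    public class Canonicalizer {

        /** Applies a few illustrative rewrite rules to produce a canonical form. */
        static String canonicalize(String text) {
            String s = text.toLowerCase();
            // Split compound words ending in "hammer": "sledgehammer" -> "sledge hammer".
            s = s.replaceAll("(\\w+?)(hammer)", "$1 $2");
            // Normalize known misspellings/variants to one canonical spelling.
            s = s.replaceAll("\\b(sledg|sled|schledge)\\b", "sledge");
            return s;
        }

        public static void main(String[] args) {
            // The same rules are applied to stored descriptions and to the query text,
            // so "sledgehammer" and "sledg hammer" both canonicalize to "sledge hammer".
            System.out.println(canonicalize("sledgehammer"));   // sledge hammer
            System.out.println(canonicalize("sledg hammer"));   // sledge hammer
        }
    }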

0
