Fuzzy search string in Java

I am looking for a high-performance Java library for finding fuzzy strings.

There are many algorithms for finding similar strings, Levenshtein distance, Daitch-Mokotoff Soundex, n-grams, etc.

What Java implementations exist? Pros and cons for them? I know about Lucen, is there any other solution, or is Lucen better?

I found them, does anyone have any experience with them?

+65
java nlp fuzzy-search
Nov 29 '08 at 13:17
source share
8 answers

Commons Lang has a Levenshtein distance implementation.

Commons Codec has an implementation of soundex and metaphone .

+35
Nov 29 '08 at 14:58
source share

You can use Apache Lucene, but depending on the use case, this may be too heavy. For very simple fuzzy searches, it can be a little difficult to use and (fix me if I'm wrong), this requires you to build an index.

If you need a simple online algorithm (= not supporting an index), you can use the fuzzy Bitap algorithm . I found an implementation in Java here . This code is suitable for one relatively short method with an almost self-evident signature:

public static List<Integer> find(String doc, String pattern, int k) 

Apache Commons StringUtils has an implementation of the Levenshtein algorithm for fuzzy string matching. It can be considered as a fuzzy version of String.equals , the bit file is similar to a fuzzy version of String.indexOf and still uses the Levenshtein distance measure. Usually more efficient than naive, using Levenshtein to compare a search pattern with each substring that might match.

Notes :

  • The Bitap algorithm seems to be mostly useful for relatively small alphabets, for example. simple ASCII. In fact, the version of Simon Watiau that I am associated with throws an ArrayIndexOutOfBoundsException on non-ASCII characters (> = 128), so you have to filter them out.
  • I tried using Bimap in the application to search the list in the list of people by name. I found that Levenshtein 2 distance gives too many false positives. Levenshtein’s distance from 1 works is better, but he cannot detect a typo where you change two letters, for example. "William" and "William." I can come up with several ways to solve this problem, for example

    • perform a fuzzy search only if the exact search does not find matches (and displays a message to the user about this)
    • configure Bitap to use Damerau-Levenshtein distance, where the swap has a distance of 1 instead of 2. According to wikipedia , this is possible, but I could not find an existing implementation in Java.
    • instead of "contains" do "startsWith". fuzzy search tools contain a prefix version of Damerau-Levenshtein, but it gave me an ArrayIndexOutOfBoundsException
    • tune the algorithm to enter search results ranking where exact matches are ranked higher

    If you're going to do 2 or 4, it might be better to use the right full-text search library like Lucene anyway.

  • More information on fuzzy searches can be found on this blog . This author also created an implementation in Java called BitapOnlineSearcher , but you need to use java.io.Reader along with the alphabet class. This Javadoc is written in Russian.
+10
Oct. 14 '15 at 12:43
source share

If you mainly compare short strings and want something portable and lightweight, you can use the famous python fuzzywuzzy algorithm ported to Java .

You can read about it here.

+9
Sep 14 '16 at 17:54 on
source share

SimMetrics is what you need: http://sourceforge.net/projects/simmetrics/

It has several algorithms for calculating various flavors of distance editing.

Lucene is a very powerful full-text search engine, but FT search is not exactly the same as fuzzy string matching (for example, if the list of strings finds me the one that looks most like some candidate string).

+8
Dec 18 '08 at 17:18
source share
+2
Oct 28 '11 at 10:21
source share

You can try bitap. I played with a beat written in ANSI C, and the Java implementation at http://www.crosswire.org was pretty fast.

+1
Apr 6 2018-10-14T00:
source share

You can try the Completely library, it relies on text preprocessing to create an index in memory for an effective (fuzzy) answer in large data sets. Unlike Lucene and other full-featured text search libraries, the API is small and easy to get started.

+1
Sep 19 '16 at 16:18
source share

Apache Lucene is the only way I think. I do not know the best search lib.

Apache Lucene (TM) is a high-performance, full-featured, text-based search library written entirely in Java. This technology is suitable for almost any application that requires full-text search, especially cross-platform.

0
Nov 29 '08 at 13:40
source share



All Articles