How to match the most “similar” lines from one list to another in python?

Below are two lists containing strings.

  • One of them contains the name of organizations (mainly universities) around the world - not only written in English, but always using the Latin alphabet.

  • Another list contains mostly complete addresses where lines (organizations) from the first list can appear.

Example:

addresses = [ "Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium", "Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Tübingen, Germany 72076", "Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185", "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754", "Computer Science Department, University of California, Santa Barbara, USA 93106", "Fraunhofer IAIS, Sankt Augustin, Germany", "Department of Computer Science, Cornell University, Ithaca, NY", "University of Wisconsin-Madison" ] organisations = [ "Catholic University of Leuven" "Fraunhofer IAIS" "Cornell University of Ithaca" "Tübingener Max Plank Institut" ] 

As you can see, the desired display will be:

 "Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium", --> Catholic University of Leuven "Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Tübingen, Germany 72076", --> Max Plank Institut Tübingen "Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185", --> -- "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754", --> Fraunhofer IAIS "Computer Science Department, University of California, Santa Barbara, USA 93106", "Fraunhofer IAIS, Sankt Augustin, Germany", --> Fraunhofer IAIS "Department of Computer Science, Cornell University, Ithaca, NY" --> "Cornell University of Ithaca", "University of Wisconsin-Madison", --> -- 

My thinking was to use some kind of “discipline algorithm” to calculate string similarity. Since I can’t just search for an organization in an address, just by doing if address in organisation , because it can be written a little differently in different places. Therefore, I first intended to use the difflib module. Especially the difflib.get_close_matches() function for selecting for each address the closest line from the list of organizations. But I'm not quite sure that the results will be accurate enough. Although I do not know how high I have to establish a relationship, which seams are a measure of similarity.

Before spending too much time testing the difflib module, I thought about asking more experienced people here if this is the right approach or if there is a more suitable tool / way to solve my problem. Thanks!

PS: I do not need the optimal solution.

+8
python string-matching
source share
2 answers

Use the following function as a string distance function (instead of simply levenshtein distance):

 def strdist(s1, s2): words1 = set(w for w in s1.split() if len(w) > 3) words2 = set(w for w in s2.split() if len(w) > 3) scores = [min(levenshtein(w1, w2) for w2 in words2) for w1 in words1] n_shared_words = len([s for s in scores if s <= 3]) return -n_shared_words 

Then use the Munkres assignment algorithm shown here , as it appears to be a 1: 1 mapping between organizations and addresses.

+2
source share

You can use soundex or metaphone to translate a sentence into a phonems list, and then compare the most similar lists.

Here is a Python implementation of a double metaphone .

0
source share

All Articles