Below are two lists containing strings.
One of them contains the name of organizations (mainly universities) around the world - not only written in English, but always using the Latin alphabet.
Another list contains mostly complete addresses where lines (organizations) from the first list can appear.
Example:
addresses = [ "Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium", "Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Tübingen, Germany 72076", "Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185", "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754", "Computer Science Department, University of California, Santa Barbara, USA 93106", "Fraunhofer IAIS, Sankt Augustin, Germany", "Department of Computer Science, Cornell University, Ithaca, NY", "University of Wisconsin-Madison" ] organisations = [ "Catholic University of Leuven" "Fraunhofer IAIS" "Cornell University of Ithaca" "Tübingener Max Plank Institut" ]
As you can see, the desired display will be:
"Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium", --> Catholic University of Leuven "Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Tübingen, Germany 72076", --> Max Plank Institut Tübingen "Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185", --> -- "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754", --> Fraunhofer IAIS "Computer Science Department, University of California, Santa Barbara, USA 93106", "Fraunhofer IAIS, Sankt Augustin, Germany", --> Fraunhofer IAIS "Department of Computer Science, Cornell University, Ithaca, NY" --> "Cornell University of Ithaca", "University of Wisconsin-Madison", --> --
My thinking was to use some kind of “discipline algorithm” to calculate string similarity. Since I can’t just search for an organization in an address, just by doing if address in organisation , because it can be written a little differently in different places. Therefore, I first intended to use the difflib module. Especially the difflib.get_close_matches() function for selecting for each address the closest line from the list of organizations. But I'm not quite sure that the results will be accurate enough. Although I do not know how high I have to establish a relationship, which seams are a measure of similarity.
Before spending too much time testing the difflib module, I thought about asking more experienced people here if this is the right approach or if there is a more suitable tool / way to solve my problem. Thanks!
PS: I do not need the optimal solution.