I have a question that is somewhat general, so I will try to be as specific as possible.
I do a lot of work that involves merging disparate data sets with header information relating to the same entity, usually a company or a financial security. This record linkage typically involves header information in which the name is the only common identifier, but where some secondary information is often available (e.g. city and state, dates of operation, relative size, etc.). These matches are usually one-to-many, but can be one-to-one or even many-to-many. I have usually done this matching manually or with very simple text comparison of cleaned substrings. I have occasionally used a simple matching algorithm, such as a Levenshtein distance measure, but I never got much mileage out of it, in part because I lacked a good formal way of applying it.
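For concreteness, here is roughly the kind of ad-hoc comparison I have been doing (a stdlib-only sketch; the field names, weights, and bonus values are just illustrative, and their arbitrariness is exactly the informality I would like to replace):

```python
import difflib

def clean(s):
    """Normalize a name: lowercase, strip punctuation, collapse whitespace."""
    kept = "".join(c for c in s.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())

def name_similarity(a, b):
    """Edit-distance-style similarity in [0, 1] via difflib."""
    return difflib.SequenceMatcher(None, clean(a), clean(b)).ratio()

def match_score(rec_a, rec_b):
    """Blend name similarity with agreement on secondary header fields.
    The 0.7/0.15/0.15 weights are guesses, not a principled choice."""
    score = 0.7 * name_similarity(rec_a["name"], rec_b["name"])
    if rec_a.get("state") and rec_a.get("state") == rec_b.get("state"):
        score += 0.15
    if rec_a.get("city") and clean(rec_a["city"]) == clean(rec_b.get("city", "")):
        score += 0.15
    return score

a = {"name": "Acme Corp.", "city": "Springfield", "state": "IL"}
b = {"name": "ACME Corporation", "city": "Springfield", "state": "IL"}
print(match_score(a, b))  # ~0.8 here, but I have no formal cutoff for "a match"
```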
My guess is that this is a fairly common problem and that there must be some formalized processes developed for this type of thing. I have read several academic papers on the subject that deal with the theoretical appropriateness of given approaches, but I have not found a good source that walks through a recipe or at least a practical framework.
My question is this:
Does anyone know of a good source for implementing multidimensional fuzzy record matching, such as a book, a website, a published article, or a working paper?
I would prefer something that has practical examples and a well-defined approach.
The approach could be iterative, with human checks for improvement at intermediate stages.
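To illustrate what I mean by human checks, here is a sketch of the kind of workflow I have in mind (the score function is any pairwise scorer, such as the one above, and the 0.85/0.5 thresholds are placeholders of my own choosing):

```python
def triage(pairs, score_fn, accept=0.85, reject=0.5):
    """Auto-accept confident matches, auto-reject poor ones,
    and queue the ambiguous middle band for manual review."""
    accepted, review, rejected = [], [], []
    for a, b in pairs:
        s = score_fn(a, b)
        if s >= accept:
            accepted.append((a, b, s))
        elif s >= reject:
            review.append((a, b, s))   # goes to a human
        else:
            rejected.append((a, b, s))
    return accepted, review, rejected

# After each manual pass, the human decisions would feed back into the
# weights and thresholds, and the triage would be re-run -- that is the
# iteration I am describing. I would like a source that formalizes this.
```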
(Edit) My programming language of choice is Python.