Processing distorted text data using machine learning or NLP

I am trying to extract data from several large text files containing records about people. The problem is that I can’t control the data entry method.

Usually it is in this format:

LASTNAME, Firstname Middlename (Possibly a Nickname) Why is this text here January 25, 2012

Firstname Surname 2001 Some text that does not concern me

Surname, Firstname blah blah ... January 25, 2012

I am currently using a huge regex that splits apart all camelCased words, all words with a month name glued onto the end, and many special cases for names. Then I use more regexes to extract a large number of possible combinations for the name and the date.
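Roughly, the kind of splitting pass I mean looks like this (a stripped-down sketch; the real regex has many more special cases):

```python
import re

def pre_split(text):
    # Insert a space wherever a lowercase letter (or a closing paren)
    # runs straight into an uppercase letter:
    # "(Maybe a Nickname)FooBarJanuary" -> "(Maybe a Nickname) Foo Bar January"
    return re.sub(r'(?<=[a-z\)])(?=[A-Z])', ' ', text)

print(pre_split('(Maybe a Nickname)FooBarJanuary 25, 2012'))
# -> '(Maybe a Nickname) Foo Bar January 25, 2012'
```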

It seems suboptimal.

Are there any Python machine learning libraries that can parse this kind of malformed but somewhat structured data?

I tried NLTK, but it could not handle my dirty data. Right now I am tinkering with Orange and I like its OOP style, but I'm not sure whether I'm wasting my time.

Ideally, I would like to do something like this to train a parser (with many input/output pairs):

    training_data = (
        'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
        ['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
    )

Is something like this possible, or am I overestimating machine learning? Any suggestions would be appreciated since I would like to know more about this topic.

python parsing machine-learning nlp
5 answers

I ended up implementing a somewhat complex series of exhaustive regular expressions that covered every use case, using textual “filters” that were substituted with the corresponding regular expressions when the parser loaded.

If anyone is interested in the code, I will edit it into this answer.


Here is basically what I used. To build the regular expressions out of my “language”, I had to make a replacer class:

    class Replacer(object):
        def __call__(self, match):
            group = match.group(0)
            if group[1:].lower().endswith('_nm'):
                return '(?:' + Matcher(group).regex[1:]
            else:
                return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]

Then I created a generic Matcher class that builds the regular expression for a given placeholder based on the placeholder's name:

    class Matcher(object):
        # `months` and `months_short` are lists of month names defined elsewhere.
        name_component = r"([A-Z][A-Za-z'\-]+|[A-Z][a-z]{2,})"
        name_component_upper = r"([A-Z][A-Z'\-]+|[A-Z]{2,})"
        year = r'(1[89][0-9]{2}|20[0-9]{2})'
        year_upper = year
        age = r'([1-9][0-9]|1[01][0-9])'
        age_upper = age
        ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
        ordinal_upper = ordinal
        date = (r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}'
                r'|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}'
                r'|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})').format(
                    '|'.join(months + months_short) + '|' +
                    '|'.join(months + months_short).upper())
        date_upper = date

        matchers = [
            'name_component',
            'year',
            'age',
            'ordinal',
            'date',
        ]

        def __init__(self, match=''):
            capitalized = '_upper' if match.isupper() else ''
            match = match.lower()[1:]
            if match.endswith('_instant'):
                match = match[:-8]
            if match in self.matchers:
                self.regex = getattr(self, match + capitalized)
            elif len(match) == 1:
                self.regex = re.escape(match)  # single-character literal
            elif 'year' in match:
                self.regex = getattr(self, 'year')
            else:
                self.regex = getattr(self, 'name_component' + capitalized)

Finally, there is a generic Pattern object:

    class Pattern(object):
        def __init__(self, text='', escape=None):
            self.text = text
            self.matchers = []
            escape = not self.text.startswith('!') if escape is None else False
            if escape:
                self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
            else:
                self.regex = self.text[1:]
            self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))
            self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
            self.regex = re.sub(r'\s+', r'\\s+', self.regex)

        def search(self, text):
            return re.search(self.regex, text)

        def findall(self, text, max_depth=1.0):
            results = []
            length = float(len(text))
            for result in re.finditer(self.regex, text):
                if result.start() / length < max_depth:
                    results.extend(result.groups())
            return results

        def match(self, text):
            result = map(lambda x: (x.groupdict(), x.start()),
                         re.finditer(self.regex, text))
            if result:
                return result
            else:
                return []

It was pretty complicated, but it worked. I won't post all of the source code, but this should get someone started. In the end, it converts a pattern like:

 $LASTNAME, $FirstName $I. said on $date 

into a compiled regex with named capture groups.
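A stripped-down sketch of the same idea (not the full code above; the token table here is a simplified stand-in for the Matcher patterns):

```python
import re

# Simplified stand-ins for the Matcher patterns -- just enough
# to show the $placeholder -> named-group substitution.
TOKEN_REGEX = {
    'LASTNAME': r"[A-Z][A-Z'\-]+",
    'Firstname': r"[A-Z][a-z]+",
    'date': r"(?:January|February|March|April|May|June|July|August|"
            r"September|October|November|December) [0-9]{1,2}, [0-9]{4}",
}

def compile_pattern(pattern):
    """Turn e.g. '$LASTNAME, $Firstname said on $date' into a compiled
    regex with one named capture group per placeholder."""
    def replace(m):
        name = m.group(1)
        return '(?P<{}>{})'.format(name, TOKEN_REGEX[name])
    return re.compile(re.sub(r'\$([A-Za-z_]+)', replace, pattern))

m = compile_pattern(r'$LASTNAME, $Firstname said on $date').search(
    'SMITH, John said on January 25, 2012')
print(m.groupdict())
# -> {'LASTNAME': 'SMITH', 'Firstname': 'John', 'date': 'January 25, 2012'}
```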


I had a similar problem, mainly caused by a data-export issue in Microsoft Office 2010 where, at fairly regular intervals, two consecutive words end up merged together. The problem domain is a morphological operation similar to spell checking. You can go with a machine learning solution or build a heuristic one, as I did.

A simple first solution is to assume that a newly formed word is a combination of proper names (with uppercase first characters).

The second, complementary solution is to have a dictionary of real words and to try the set of split positions that produce two (or at least one) valid words. A further problem arises when one of the halves is a proper name, which by definition is not in that dictionary. One way around this is to use word-length statistics to decide whether a token is an erroneously merged word or actually legitimate.
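A minimal sketch of that dictionary approach (the word set here is a tiny stand-in for a real dictionary):

```python
def split_joined(word, dictionary):
    """Try every split point and return the first (left, right) pair where
    both halves look valid -- in the dictionary, or capitalized like a
    proper name -- and at least one half is a real dictionary word."""
    def plausible(part):
        return part.lower() in dictionary or part[:1].isupper()
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if (plausible(left) and plausible(right)
                and (left.lower() in dictionary or right.lower() in dictionary)):
            return left, right
    return None

words = {'nickname', 'why', 'text', 'here'}
print(split_joined('NicknameWhy', words))  # -> ('Nickname', 'Why')
```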

In my case, this is part of a manual correction of a large text corpus (human-in-the-loop verification), but the only part that can really be automated is selecting the probably-wrong words and recommending a correction.


As for the concatenated words, you can split them with a tokenizer:

The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation marks, numbers, etc.

For example:

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

is tokenized as:

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

OpenNLP has a “learnable tokenizer” that you can train. If that does not work, you can try the answers to: Detect the most likely words from the text without spaces / combined words .

Once the splitting is done, you can eliminate the punctuation and pass it to an NER system such as CoreNLP :

Johnson John Doe Maybe a Nickname Why is this text here January 25 2012

which outputs:

    Tokens
    Id  Word      Lemma     Char begin  Char end  POS  NER     Normalized NER
    1   Johnson   Johnson   0           7         NNP  PERSON
    2   John      John      8           12        NNP  PERSON
    3   Doe       Doe       13          16        NNP  PERSON
    4   Maybe     maybe     17          22        RB   O
    5   a         a         23          24        DT   O
    6   Nickname  nickname  25          33        NN   MISC
    7   Why       why       34          37        WRB  MISC
    8   is        be        38          40        VBZ  O
    9   this      this      41          45        DT   O
    10  text      text      46          50        NN   O
    11  here      here      51          55        RB   O
    12  January   January   56          63        NNP  DATE    2012-01-25
    13  25        25        64          66        CD   DATE    2012-01-25
    14  2012      2012      67          71        CD   DATE    2012-01-25

One part of your problem: “all words with a month name are tied to the end”.

If you have a date in the format Monthname 1-or-2-digit-day-number, yyyy at the end of the string, you should use a regex to peel that off first. The remainder of the input string is then much easier to work with.

Note: otherwise you may run into problems with given names that are also month names, e.g. April, May, June, August. Also, March is a surname that can appear as a “middle name”, e.g. SMITH, John March .
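A sketch of that peel-the-date-off-first idea; anchoring the match at the end of the string (`$`) is also what avoids the April/May/June given-name problem:

```python
import re

MONTHS = ('January|February|March|April|May|June|July|'
          'August|September|October|November|December')
# Anchored at end-of-string, so a mid-line "March" or "May" never matches.
TRAILING_DATE = re.compile(r'\s*(?:' + MONTHS + r') [0-9]{1,2}, [0-9]{4}\s*$')

def peel_date(line):
    """Split off a trailing 'Monthname d, yyyy' date, if present."""
    m = TRAILING_DATE.search(line)
    if m:
        return line[:m.start()], m.group(0).strip()
    return line, None

print(peel_date('SMITH, John March blah blah January 25, 2012'))
# -> ('SMITH, John March blah blah', 'January 25, 2012')
```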

Your use of the “last / first / middle” name terminology is “interesting”. There are potential problems if your data includes non-Anglo names, such as:

Mao Zedong aka Mao Ze Dong aka Mao Tse Tung
Sima Qian aka Ssu-ma Ch'ien
Saddam Hussein Abd al-Majid al-Tikriti
Noda Yoshihiko
Kossuth Lajos
José Luis Rodríguez Zapatero
Pedro Manuel Mamede Passos Coelho
Sukarno


A few pointers to get you started:

  • for parsing dates, you can start with a few regular expressions, and then move on to Chronic or jChronic
  • for names, these OpenNLP models should work

As for training a machine learning model yourself, that is not so simple, especially with respect to the training data ...
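Before reaching for Chronic or jChronic, a handful of `strptime` formats in plain Python already covers the common variants (the format list here is just an illustrative starting point):

```python
from datetime import date, datetime

# Formats to try, most specific first; extend as new variants show up.
FORMATS = ['%B %d, %Y', '%b %d, %Y', '%d %B %Y', '%m/%d/%Y', '%m-%d-%Y']

def parse_date(text):
    """Try each known format in turn; return a datetime.date or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

print(parse_date('January 25, 2012'))  # -> 2012-01-25
```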

