How do I find a person's name in a text? (Heuristic)

I have a huge list of people's names that I need to look for in a huge text.

Only part of a name may appear in the text, and it may be misspelled, mistyped, or abbreviated. The text is not tokenized, so I do not know where in the text a person's name begins. I also don't know whether a given name appears in the text at all.

Example:

I have “Barack Hussein Obama” on my list, so I have to check for this name in texts such as the following:

  • ... Candidate Barack Obama was elected President of the United States ... (incomplete)
  • ... Candidate Barack Hussein was elected President of the United States ... (incomplete)
  • ... Candidate Barack H. O .. was elected President of the United States ... (abbreviated)
  • ... Candidate Barack Oban was elected President of the United States ... (misspelled)
  • ... Candidate Barack Ovama was elected President of the United States ... (mistyped; B is next to V on the keyboard)
  • ... Candidate John McCain lost the election ... (no Obama name appears)

Clearly there is no deterministic solution for this, but ...

What is a good heuristic for such a search?

If you had to, how would you do it?

+4
8 answers

You said the text is about 200 pages.

Divide it into 200 one-page PDF files.

Put each page on Mechanical Turk along with the list of names. Offer a reward of about $5 per page.

+6

Split everything on whitespace, removing special characters (commas, periods, etc.). Then use something like Soundex to handle spelling errors. Or you could go with something like Lucene if you need to search a lot of documents.
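
A minimal sketch of the split-then-Soundex idea, in Python. The helper names (`soundex`, `tokenize`, `soundex_hits`) are made up for illustration, and the encoder follows the common American Soundex rules. Note that Soundex keeps the first letter and ignores vowels, so it catches "Ovama" and "Oban" but not an initial such as "H." or a typo in the first letter.

```python
import re

# Standard American Soundex digit groups.
_CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def tokenize(text):
    """Split on anything that is not a letter, which also drops punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def soundex_hits(name, text):
    """Return the tokens of text that sound like some part of name."""
    wanted = {soundex(part) for part in tokenize(name)}
    return [tok for tok in tokenize(text) if soundex(tok) in wanted]


text = "Candidate Barack Ovama was elected President of the United States"
print(soundex_hits("Barack Hussein Obama", text))  # ['Barack', 'Ovama']
```

Two adjacent hits are a much stronger signal than a single one, so in practice you would probably score a window of neighbouring tokens rather than individual words.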

+5

What you want is a Natural Language Processing library. You are trying to identify a subset of proper nouns. If names are the main source of proper nouns, it will be easy; if a decent number of other proper nouns is mixed in, it will be harder. If you are writing in Java, check out OpenNLP; for C#, there is SharpNLP. After extracting all the proper nouns, you can probably use WordNet to remove most of those that are not personal names. You can use WordNet to identify substrings like "John", and then look at the adjacent tokens to absorb the other parts of the name. You will have problems with something like "John Smith Industries". You will need to look at your underlying data to see if there are features you can use to help fix that problem.

An NLP solution is the only really reliable method I've seen for such problems. You may still have trouble, since 200 pages is actually quite a small amount of text. Ideally, you would have more text, so that you could use more statistical methods to help resolve the ambiguity between names and non-names.
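
To make the "find proper-noun candidates, then absorb adjacent tokens" idea concrete, here is a rough dependency-free Python sketch. It is not what OpenNLP or SharpNLP do internally: runs of capitalized tokens stand in for a real proper-noun tagger, and spelling similarity comes from the standard library's `difflib.SequenceMatcher`. The function names and the 0.75 threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# A run of capitalized words or initials, e.g. "Barack H. Obama".
CANDIDATE = re.compile(r"\b[A-Z][A-Za-z]*\.?(?:\s+[A-Z][A-Za-z]*\.?)*")


def candidates(text):
    """Return runs of capitalized tokens as crude proper-noun candidates."""
    return [m.group(0) for m in CANDIDATE.finditer(text)]


def part_matches(token, part, threshold=0.75):
    """True if token is an initial of part or is spelled similarly to it."""
    token = token.rstrip(".").lower()
    part = part.lower()
    if len(token) == 1:
        return part.startswith(token)
    return SequenceMatcher(None, token, part).ratio() >= threshold


def score(candidate, name):
    """Count how many parts of name are matched by some token of candidate."""
    tokens = candidate.split()
    return sum(any(part_matches(t, p) for t in tokens) for p in name.split())


def best_score(name, text):
    """Best score of name over all candidate runs found in text."""
    return max((score(c, name) for c in candidates(text)), default=0)


name = "Barack Hussein Obama"
print(best_score(name, "Candidate Barack H. O .. was elected President"))  # 3
print(best_score(name, "Candidate John McCain lost the election"))         # 0
```

A score of 2 or more out of 3 parts could be treated as a probable hit. The capitalization heuristic still runs into the "John Smith Industries" problem, which is where a real tagger and more data would help.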

+2

At first blush, I'd go for an indexing server: Lucene, FAST, or Microsoft Indexing Server.

+1

I would use C# and LINQ. I would tokenize all the words on whitespace, and then use LINQ to sort the text (possibly with the Distinct() function) to isolate all the text that interests me. While manipulating the text, I would keep track of the indexes (which you can do with LINQ) so that I could locate the text in the source document, if that is a requirement.
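
The same idea (distinct words plus their positions in the source) can be sketched outside of LINQ as well. Below is a small Python version; `index_words` and `locate` are hypothetical helper names, and it only finds exact, case-insensitive matches, so misspellings would need a fuzzy comparison on top of it.

```python
import re
from collections import defaultdict


def index_words(text):
    """Map each distinct lowercased word to the character offsets where it starts."""
    positions = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        positions[match.group(0).lower()].append(match.start())
    return dict(positions)


def locate(name, text):
    """Return sorted (offset, word) pairs for each exact occurrence of a part of name."""
    index = index_words(text)
    hits = []
    for part in name.lower().split():
        hits.extend((offset, part) for offset in index.get(part, []))
    return sorted(hits)


text = "Candidate Barack Obama was elected"
print(sorted(index_words(text)))             # distinct words, sorted
print(locate("Barack Hussein Obama", text))  # [(10, 'barack'), (17, 'obama')]
```

Because the offsets are kept, each hit can be mapped straight back to its place in the original document.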

+1

The best way I can think of is to define grammars in Python's NLTK. However, that can get quite complicated.

I would do it manually with regular expressions, generating a list of permutations with some programming.
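
As a sketch of what generating the permutation list could look like in Python: the hypothetical `name_pattern` below builds one regular expression from every ordered selection of two or more name parts, where each part may be written out or reduced to an initial. It covers the incomplete and abbreviated cases only; misspellings such as "Ovama" would still need a fuzzy step.

```python
import re
from itertools import permutations


def name_pattern(name):
    """Build a regex matching any ordered selection of two or more name parts,
    where each part is either spelled out or reduced to an initial."""
    parts = name.split()
    alternatives = []
    for size in range(len(parts), 1, -1):  # longest variants first
        for combo in permutations(parts, size):
            pieces = [
                rf"(?:{re.escape(p)}\b|{re.escape(p[0])}(?:\.|\b))" for p in combo
            ]
            alternatives.append(r"\s+".join(pieces))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")")


pattern = name_pattern("Barack Hussein Obama")
for text in (
    "Candidate Barack Obama was elected",
    "Candidate Barack H. O .. was elected",
    "Candidate John McCain lost the election",
):
    match = pattern.search(text)
    print(match.group(0) if match else None)
```

For a three-part name this yields 12 alternatives; the count grows quickly with longer names, so for a huge list it may be worth compiling the patterns once and reusing them.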

0

Both SQL Server and Oracle have built-in SOUNDEX functions.

In addition, there is a built-in SQL Server function called DIFFERENCE that can be used.

0

A plain old regular expression script will do the job.

Use Ruby, it's pretty fast. Read the lines and match the words.

greetings

-1
