I am developing an application that is supposed to extract the names of people from short texts.
What is the best way to do this? Is there a database of names that I can check the text against? Since the texts are short, the processing requirements should be modest.
Any ideas?
Thanks,
You can use a statistical Named Entity Recognizer (NER), such as Stanford NER or LingPipe. These are machine-learning-based recognizers that do not require huge dictionaries of names as input.
Alternatively, you can get a list of names from the Internet (there are many) and use the Aho-Corasick string search algorithm to efficiently find every name from that list in the text.
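To make the Aho-Corasick idea concrete, here is a minimal pure-Python sketch, not a tuned implementation. The function names `build_automaton` and `find_names` and the sample name list are my own illustrations. It assumes case-sensitive matching and only reports hits that sit on word boundaries, so "Ann" is not reported inside "Annapolis".

```python
from collections import deque


def build_automaton(names):
    """Build an Aho-Corasick automaton: a trie plus failure links."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # names recognised when a state is reached
    for name in names:
        state = 0
        for ch in name:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(name)

    # Breadth-first pass: depth-1 states keep failure link 0,
    # deeper states derive theirs from their parent's failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit names that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_names(text, automaton):
    """Return (start_index, name) for every whole-word occurrence in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for name in out[state]:
            start, end = i - len(name) + 1, i + 1
            # Assumption: a name only counts if it is not glued to other
            # letters or digits on either side.
            before_ok = start == 0 or not text[start - 1].isalnum()
            after_ok = end == len(text) or not text[end].isalnum()
            if before_ok and after_ok:
                hits.append((start, name))
    return hits


# Hypothetical name list and text, for illustration only.
automaton = build_automaton(["Ann", "Anna", "Mark", "Mary Ann"])
hits = find_names("Anna met Mark and Mary Ann in Annapolis.", automaton)
print(hits)
```

The automaton is built once and the text is scanned in a single pass, so the cost of a lookup does not grow with the number of names in the list. Overlapping hits (such as "Ann" inside "Mary Ann") are all reported; whether to keep only the longest one is a design choice left to the caller.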
If you are on *nix, try looking at /usr/share/dict/propernames. Mac OS X has this, and I think that at least Ubuntu does too.
You can use this with grep:
grep -f /usr/share/dict/propernames short_text.txt
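Note that grep -f prints whole matching lines and treats each name as a pattern that can match anywhere in a line, including inside longer words. If you want the names themselves, a rough Python equivalent could look like the sketch below. The helper names `load_names` and `names_in_text` are hypothetical, and the sketch assumes the word list holds one single-word name per line.

```python
import re
from pathlib import Path


def load_names(path="/usr/share/dict/propernames"):
    """Read a word list with one name per line into a set."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def names_in_text(text, names):
    """Return the words of text, in order, that appear in the name set."""
    # Runs of letters only, so "Bob's" yields "Bob" and "s".
    words = re.findall(r"[^\W\d_]+", text)
    return [word for word in words if word in names]


DICT_PATH = Path("/usr/share/dict/propernames")
if DICT_PATH.exists():  # the file is not present on every system
    print(names_in_text("Yesterday Alice emailed Bob.", load_names(DICT_PATH)))
```

The lookup is case-sensitive, which helps a little with ordinary words that double as names, but a plain word list will still produce false positives.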
I found this link: Extract people names from RSS feeds using WordNet
How about the US Census Bureau genealogy data?
Get a name dataset: I created a collection of datasets for such tasks. My datasets are available at https://mbejda.imtqy.com and are all in CSV format, with names classified by race and gender.
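As a sketch of how a CSV name list like this could be turned into a lookup table: the column names `name` and `gender` and the two sample rows below are made up for illustration, and the real files may be laid out differently.

```python
import csv
import io

# Made-up sample standing in for a downloaded file; real columns may differ.
SAMPLE_CSV = """name,gender
Alice,F
Bob,M
"""


def load_name_index(csv_file, name_column="name"):
    """Map each name to its full CSV row (a dict of column -> value)."""
    reader = csv.DictReader(csv_file)
    return {row[name_column].strip(): row for row in reader if row.get(name_column)}


index = load_name_index(io.StringIO(SAMPLE_CSV))
print(index["Alice"]["gender"])
```

With a real file you would pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` wrapper and set `name_column` to whatever the header actually says.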
Named entity recognizer: explore OpenNLP or StanfordNLP for recognizing and extracting names.