The best strategy for breaking English-style names into first and last names

I have a list of names, and I need to break them down into first and last names. Since some names contain 2-3 spaces, simply splitting on the space will not do.

What heuristics do people use to perform a split?

Please note that this is not a duplicate of the questions that effectively ask how to split on a space; I am looking for heuristics and algorithms, not actual code help.

Update: I am limiting the problem to English-style names. That is all I need to solve, and probably all that any one question on this topic can reasonably cover.

+6
4 answers

I read a very interesting and comprehensive post on this subject:

http://www.w3.org/International/questions/qa-personal-names

It even suggests asking yourself whether you really need separate fields for first and last name. That depends on the target region(s) of your application.

+5

Two approaches can help, although neither completely solves the problem.

  • Programmatically split the simple names, and put those that are not easy into another list of names "still to be split." Sort this list manually. As you sort by hand, some heuristics that can be coded up may occur to you, which further reduces the size of the remaining list. If this is a one-time job and the list is not super massive, this will get it done.
  • A related problem is when the name is already split, but you do not know which part is the first name and which is the last. Some systems work around this with fuzzy queries: if no match is found on the first attempt, swap the first and last names and try again. You did not say why you need to split the names. If you need to look up reference data, consider a fuzzy lookup heuristic that lets you try different splits instead of trying to get the split right up front.
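
As a rough illustration of the two approaches above, here is a minimal Python sketch. The function names `triage` and `lookup_with_swap`, the "exactly two words counts as simple" rule, and the dictionary used as reference data are all assumptions made for the example; the swap retry uses exact matching only and stands in for whatever fuzzy matching a real system would use.

```python
def triage(names):
    """Split the easy names automatically; queue the rest for manual review."""
    split, manual = [], []
    for raw in names:
        parts = raw.split()
        if len(parts) == 2:
            # Assumed "simple" case: exactly two words, taken as first and last.
            split.append((parts[0], parts[1]))
        else:
            # Single names and names with 3+ words go to the manual list.
            manual.append(raw)
    return split, manual


def lookup_with_swap(first, last, directory):
    """Look up (first, last); on a miss, retry with the two parts swapped."""
    for key in ((first, last), (last, first)):
        if key in directory:
            return directory[key]
    return None


names = ["John Smith", "Mary Ann Van Der Berg", "Cher"]
split, manual = triage(names)

# Hypothetical reference data keyed by (first, last).
directory = {("John", "Smith"): 42}
record = lookup_with_swap("Smith", "John", directory)
```

Here `split` holds only `("John", "Smith")`, the other two names land in `manual`, and the swapped lookup still finds the record. Any heuristic you discover while sorting `manual` by hand can be added as another branch in `triage`.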

Not quite an answer, but in this case there is no perfect answer.

+3

Different countries and regions have different name formats. For example, in many Asian countries the family name usually comes first, followed by the given names. In the West, you have a first and last name, but it gets complicated when people have double-barrelled surnames or include middle names. And in some regions people are given only one name.

Personally, I do not think there is any single algorithm that can give you 100% accurate results, I'm afraid.

+1

The following assumes English-style surnames. If that is not your case, please update your question.

It is usually safe to assume that the last space character marks the start of a person's last name. But since there are exceptions, one strategy would be to collect a large database of known multi-word surnames from some other source. You can then check names against these surnames and treat them as exceptions.
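
A minimal sketch of that strategy in Python might look like the following. The function name `split_name` and the tiny `MULTI_WORD_SURNAMES` set are made up for illustration; in practice the exception list would come from whatever external source you collect it from.

```python
# Hypothetical exception list; a real one would be loaded from an external source.
MULTI_WORD_SURNAMES = {"van der berg", "de la cruz", "st john"}


def split_name(full_name, exceptions=MULTI_WORD_SURNAMES):
    """Return (first, last), splitting on the last space unless the tail
    of the name matches a known multi-word surname."""
    parts = full_name.split()
    if len(parts) < 2:
        # A single name: nothing to split.
        return full_name.strip(), ""
    # Check the longest candidate surname first, always leaving at least
    # one word for the first name.
    for i in range(1, len(parts) - 1):
        candidate = " ".join(parts[i:])
        if candidate.lower() in exceptions:
            return " ".join(parts[:i]), candidate
    # Default rule: everything after the last space is the surname.
    return " ".join(parts[:-1]), parts[-1]


simple = split_name("John Quincy Adams")
exception = split_name("Mary Ann Van Der Berg")
```

With this list, `simple` falls through to the last-space rule, while `exception` keeps "Van Der Berg" together as the surname. Middle names end up in the first-name field, which may or may not suit your data.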

0
