Named entity recognition using openNLP (default model)

Can someone specify the algorithm(s) used by the openNLP NameFinder module? The code is complex and sparsely documented, and playing with it as a black box (with the default model provided) gives the impression that it is mostly heuristic. Here are some examples of input and output:

Input:

John Smith is upset.

John Smith is upset.

Barack Obama is upset.

Hugo Chavez is upset. (no more)

Jeff Atwood is upset.

Bing Liu is upset by the openNLP NER module.

Noam Chomsky is upset by the world.

Jayden Smith is upset.

Smith Jayden is disappointed.

Lady Gaga is disappointed.

Ms. Gaga is upset.

Ms. Gaga is upset.

Jayden is disappointed.

Mr. Liu is upset.

Output (I changed the angle brackets to square brackets):

[START:person] John Smith [END] is upset.

John Smith is upset.

[START:person] Barack Obama [END] is upset.

Hugo Chavez is upset. (no more)

[START:person] Jeff Atwood [END] is upset.

Bing Liu is upset by the openNLP NER module.

[START:person] Noam Chomsky [END] is upset by the world.

Jayden [START:person] Smith [END] is upset.

[START:person] Smith [END] [START:person] Jayden [END] is disappointed.

Lady Gaga is disappointed.

Ms. Gaga is upset.

Ms. Gaga is upset.

Jayden is disappointed.

Mr. Liu is upset.
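For anyone who wants to compare such runs programmatically, here is a minimal sketch that pulls the tagged spans out of bracketed output lines. It assumes the `[START:type] ... [END]` convention used in this post (square brackets substituted for the tool's angle brackets); the function names `extract_entities` and `strip_tags` and the `person` label in the demo are my own choices for illustration, not part of openNLP.

```python
import re

# One tagged span, e.g. "[START:person] John Smith [END]".
# Optional whitespace after "START:" is tolerated.
SPAN = re.compile(r"\[START:\s*(\w+)\]\s*(.*?)\s*\[END\]")


def extract_entities(line):
    """Return a list of (entity_type, text) pairs found in one tagged line."""
    return [(m.group(1), m.group(2)) for m in SPAN.finditer(line)]


def strip_tags(line):
    """Remove the span markers and return the plain sentence."""
    plain = SPAN.sub(lambda m: m.group(2), line)
    # Collapse any doubled spaces left behind by the removed markers.
    return re.sub(r"\s+", " ", plain).strip()


if __name__ == "__main__":
    samples = [
        "[START:person] John Smith [END] is upset.",
        "Jayden [START:person] Smith [END] is upset.",
        "[START:person] Smith [END] [START:person] Jayden [END] is disappointed.",
        "Mr. Liu is upset.",
    ]
    for line in samples:
        print(extract_entities(line), "|", strip_tags(line))
```

Counting the sentences for which `extract_entities` returns an empty list gives a quick tally of the misses.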

It seems that the model simply looks up a fixed list of names annotated in the training data and allows for some fragments and combinations. Two notable false negatives (FN):

  • Strong name indicators, such as Mr. and Mrs., are ignored.
  • Jayden (the fourth most popular name in the USA in 2011) was not identified on its own; in "Jayden Smith ..." only the following "Smith" was tagged. I suspect the model "thinks" that Jayden is capitalized merely because it starts the sentence and therefore should not be an NE. With the order reversed to Smith Jayden as a hint (condition 1), openNLP identifies two distinct NEs, unlike other full names such as John Smith, possibly suggesting that Smith is on the list of names.
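To make the two cases concrete, here is a minimal sketch of the kind of per-token contextual clues a statistical tagger might compute. I do not know which features the default model actually uses: the feature names, the `<BOS>`/`<EOS>` padding symbols, and the title list below are all assumptions for illustration, not openNLP's feature generators.

```python
# Hypothetical list of honorifics; not taken from openNLP.
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr."}


def token_features(tokens, i):
    """Describe token i by a few contextual clues (illustrative only)."""
    tok = tokens[i]
    prev = tokens[i - 1] if i > 0 else "<BOS>"
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "<EOS>"
    return {
        "word": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        # A capital letter carries less evidence in this position,
        # because every sentence starts with one.
        "sentence_initial": i == 0,
        "prev_word": prev.lower(),
        "next_word": nxt.lower(),
        "prev_is_title": prev in TITLES,
    }


if __name__ == "__main__":
    print(token_features(["Jayden", "is", "disappointed", "."], 0))
    print(token_features(["Mr.", "Liu", "is", "upset", "."], 1))
```

If the model saw anything like `prev_is_title`, "Mr. Liu" should be an easy case, which is why the misses above are surprising.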

-> I am puzzled and upset; if someone can point me to the algorithm (or confirm that it sucks), I will be grateful.

P.S. Both the Stanford and UIUC NER systems work much better, with some minor differences that are interesting but not relevant to the topic (this question is too long as it is).

+6
1 answer

As the "ME" suffix in its name suggests, NameFinderME uses a maximum entropy model. Here is an original article about ME.
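As a rough illustration of what a maximum entropy classifier computes, here is a minimal sketch: each label gets a score equal to the sum of the weights of the active (feature, label) pairs, and the scores are normalized with a softmax. The feature names, the two labels, and every weight below are made up for the example; they are not taken from the openNLP default model.

```python
import math


def maxent_probs(active_features, weights, labels):
    """Return p(label | features) = exp(score(label)) / Z for each label."""
    scores = {
        y: sum(weights.get((f, y), 0.0) for f in active_features)
        for y in labels
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())
    return {y: e / z for y, e in exp_scores.items()}


# Hypothetical labels and weights, chosen only to show the mechanics.
LABELS = ["name", "other"]
WEIGHTS = {
    ("is_capitalized", "name"): 1.0,
    ("sentence_initial", "other"): 0.8,
    ("prev_is_title", "name"): 2.0,
    ("bias", "other"): 0.5,
}

if __name__ == "__main__":
    # A capitalized token at the start of a sentence.
    print(maxent_probs(["bias", "is_capitalized", "sentence_initial"], WEIGHTS, LABELS))
    # A capitalized token preceded by a title such as "Mr.".
    print(maxent_probs(["bias", "is_capitalized", "prev_is_title"], WEIGHTS, LABELS))
```

With weights like these, a sentence-initial capital alone is not enough to tip the decision toward "name". Whether the trained model behaves that way depends entirely on its features and training data.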

If OpenNLP's performance does not meet your requirements and you cannot use the Stanford or UIUC NER, I recommend trying Mallet with a CRF. This sample code should get you started.

+5