Sentence Detection Using NLP

I am trying to extract sentences from a huge amount of text. Using Java, I started with NLP tools such as OpenNLP and the Stanford Parser.

But this is where I am stuck. Although both of these parsers are quite capable, they fail when it comes to non-uniform text.

For example, in my text most sentences are delimited by a period, but in some cases, such as bullet points, they are not. Here, both parsers fail.

I even tried setting the Stanford parser's option for multiple sentence terminators, but the result was not much better!

Any ideas?

Edit: To make things simpler, I am looking to split text where the delimiter is either a newline ("\n") or a period (".") ...

+7
5 answers

First, you need to clearly define the task. What, precisely, is your definition of "a sentence"? Until you have such a definition, you will just go around in circles.

Second, cleaning up dirty text is usually a rather different task from "sentence splitting". The various NLP sentence splitters assume relatively clean input text. Getting from HTML, extracted PowerPoint, or other noisy sources to clean text is a separate problem.

Third, Stanford and the other heavyweight tools are statistical, so they are guaranteed to have a non-zero error rate. The less your data looks like what they were trained on, the higher the error rate.

+6

Write your own sentence splitter. You could use something like the Stanford splitter as a first pass, and then write a rule-based post-processor to correct its mistakes.

I did something similar for biomedical text I was parsing. I used the GENIA splitter and then fixed things up after the fact.

EDIT: If you are taking HTML as input, you should pre-process it first, for example by handling bulleted lists and the like. Then apply your splitter.
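A minimal sketch of the two-pass idea in Java, under stated assumptions: `java.text.BreakIterator` from the standard library stands in for the first-pass splitter (it is not the Stanford or GENIA splitter), and the class name, the abbreviation list, and the two correction rules are hypothetical examples rather than rules taken from this answer.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TwoPassSplitter {
    // Hypothetical rule data: tokens after which a sentence break is treated as an error.
    private static final Set<String> ABBREVIATIONS = Set.of("Fig.", "Dr.", "e.g.", "i.e.", "vs.");

    // First pass: a generic splitter. BreakIterator is only a stand-in here.
    static List<String> firstPass(String text) {
        List<String> chunks = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String chunk = text.substring(start, end).trim();
            if (!chunk.isEmpty()) chunks.add(chunk);
        }
        return chunks;
    }

    // Rule 1: undo a break when the previous chunk ends with a known abbreviation.
    static List<String> mergeAfterAbbreviations(List<String> chunks) {
        List<String> merged = new ArrayList<>();
        for (String chunk : chunks) {
            int last = merged.size() - 1;
            if (last >= 0 && endsWithAbbreviation(merged.get(last))) {
                merged.set(last, merged.get(last) + " " + chunk);
            } else {
                merged.add(chunk);
            }
        }
        return merged;
    }

    // Rule 2: treat every line break as a sentence boundary, period or not.
    static List<String> splitOnLineBreaks(List<String> chunks) {
        List<String> result = new ArrayList<>();
        for (String chunk : chunks) {
            for (String line : chunk.split("\\R+")) {
                if (!line.trim().isEmpty()) result.add(line.trim());
            }
        }
        return result;
    }

    // True when the last space-separated token of s is in the abbreviation list.
    static boolean endsWithAbbreviation(String s) {
        return ABBREVIATIONS.contains(s.substring(s.lastIndexOf(' ') + 1));
    }

    static List<String> split(String text) {
        return splitOnLineBreaks(mergeAfterAbbreviations(firstPass(text)));
    }
}
```

Calling `TwoPassSplitter.split(text)` runs the first pass and then both rules. Keeping each rule in its own method makes it easy to add or reorder corrections as new error patterns show up in your data.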

+3

There is another great toolkit for natural language processing: GATE. It has a number of sentence splitters, including the standard ANNIE sentence splitter (which does not completely fit your needs) and the RegEx sentence splitter. Use the latter for any tricky splitting.

The exact pipeline for your purpose is:

  • Document Reset PR.
  • ANNIE English Tokeniser.
  • ANNIE RegEx Sentence Splitter.

You can also use GATE's JAPE rules for even more flexible pattern matching. (See Tao for the full GATE documentation.)

+1

If you want to stick with Stanford NLP or OpenNLP, you had better retrain the model. Almost all of the tools in these packages are based on machine learning. Only with customized training data can they give you a well-fitting model and good performance.

Here is my suggestion: manually split the sentences according to your criteria. I think a couple of thousand sentences is enough. Then call the API or the command line to retrain the sentence splitter. Then you are done!

But first of all, you need to work out what the previous answers already pointed out: "First, you need to clearly define the task. What, precisely, is your definition of 'a sentence'?"

I use Stanford NLP and OpenNLP in my project, "Food Map", a "Delicious Food Discovery Engine" based on NLP and machine learning. They work very well!

+1

In a similar case, what I did was first split the text into separate lines (separated by newlines) at the points where I wanted the text to be split. In your case, those are texts starting with bullets (or, more precisely, text with a line-break tag at the end). This also solves a similar problem that may occur if you are working with HTML. Once the text is split into separate lines, you can send each line to sentence detection individually, which gives more accurate results.
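A rough sketch of that line-first approach in Java, assuming the input may contain simple HTML. The class name, the regular expressions, and the choice of tags and bullet characters are illustrative assumptions, and `java.text.BreakIterator` is only a stand-in for whichever sentence detector you actually use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LineFirstSplitter {
    // Step 1: turn line-ending markup into newlines, drop other tags, strip bullets.
    // Note: the naive tag regex also removes literal "<...>" text; this is a sketch only.
    static List<String> toLines(String raw) {
        String text = raw
                .replaceAll("(?i)<br\\s*/?>|</li>|</p>", "\n") // tags assumed to end a line
                .replaceAll("<[^>]+>", "");                      // remove any remaining tags
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\\R")) {
            // Remove one leading bullet marker (bullet character, '*' or '-') if present.
            String cleaned = line.replaceFirst("^\\s*[\\u2022*-]\\s+", "").trim();
            if (!cleaned.isEmpty()) lines.add(cleaned);
        }
        return lines;
    }

    // Step 2: run sentence detection on a single line.
    static List<String> sentencesInLine(String line) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(line);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = line.substring(start, end).trim();
            if (!s.isEmpty()) sentences.add(s);
        }
        return sentences;
    }

    // Lines are hard boundaries; the detector only ever sees one line at a time.
    static List<String> split(String raw) {
        List<String> out = new ArrayList<>();
        for (String line : toLines(raw)) out.addAll(sentencesInLine(line));
        return out;
    }
}
```

Because every line is passed to the detector on its own, a bullet or heading without a final period can never be glued onto the following sentence.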

0