Lexers / parsers for (un) structured text documents

Question

There are many parsers and lexers for scripts (i.e. structured computer languages), but I'm looking for one that can break an (almost) unstructured text document into larger sections, e.g. chapters, paragraphs, etc.

It is relatively easy for a person to identify them: where the table of contents, the acknowledgements, or the start of the main body is. It is also possible to build rule-based systems that identify some of them (paragraphs, for example).
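As a rough illustration of what such a rule-based system could look like, here is a minimal Python sketch. It assumes blocks are separated by blank lines and that headings sit on a line of their own; the patterns in `HEADING_RULES` and the function names are hypothetical and would need tuning for real documents.

```python
import re

# Hypothetical heading patterns; adjust them to the documents at hand.
HEADING_RULES = [
    ("toc", re.compile(r"^(table of )?contents$", re.IGNORECASE)),
    ("acknowledgements", re.compile(r"^acknowledge?ments$", re.IGNORECASE)),
    ("chapter", re.compile(r"^chapter\s+(\d+|[ivxlc]+)\b", re.IGNORECASE)),
]


def split_blocks(text):
    """Split text into blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [b.strip() for b in blocks if b.strip()]


def classify_block(block):
    """Return a heading label for a block, or 'paragraph' if no rule fires."""
    # Only single-line blocks are treated as heading candidates.
    if "\n" not in block:
        for label, pattern in HEADING_RULES:
            if pattern.search(block.strip()):
                return label
    return "paragraph"


def segment(text):
    """Return (label, block) pairs in document order."""
    return [(classify_block(b), b) for b in split_blocks(text)]
```

Calling `segment(text)` on a plain-text document gives a flat list of labelled blocks. Anything the rules do not recognise falls back to "paragraph", which is where such an approach stops being reliable.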

I do not expect it to be perfect, but does anyone know of such a broad, "block-based" lexer / parser? Or could you point me to literature that might help?

+7

parsing document lexer

wilson32 Jan 18 '10 at 16:57

4 answers

Noufal ibrahim · Answer 1 · 2010-01-18T17:05:41+0000

Many lightweight markup languages, such as Markdown (which SO uses, by the way), reStructuredText, and (arguably) POD, are similar to what you are talking about. They have minimal syntax and break the input down into parseable syntactic pieces. You may be able to get some ideas by reading about their implementations.

ziya · Answer 2 · 2010-01-18T17:10:48+0000

Most lex / yacc style programs work with a well-defined grammar. If you can define your grammar in a BNF-like format (most parser generators accept a similar syntax), then you can use any of them. That may be stating the obvious. However, you can still be a little fuzzy about the "blocks" (tokens) of text that make up your grammar. After all, you define the rules for your tokens.
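To make the idea of fuzzy tokens under a strict grammar concrete, here is a small hand-written sketch in Python rather than lex / yacc. The token rules (a line is a HEADING if it is short and either all upper case or starts with "chapter", "part" or "section") are illustrative guesses, and the grammar in the comment is an assumption, not something taken from an existing tool.

```python
import re

# Assumed grammar over line tokens (BNF-like):
#   document  ::= BLANK* section*
#   section   ::= HEADING BLANK* paragraph* | paragraph+
#   paragraph ::= TEXT+ BLANK*

KEYWORD_RE = re.compile(r"^(chapter|part|section)\s+\S+", re.IGNORECASE)


def token_kind(line):
    """Classify one line as BLANK, HEADING or TEXT using loose rules."""
    stripped = line.strip()
    if not stripped:
        return "BLANK"
    is_short = len(stripped) <= 60
    if is_short and (KEYWORD_RE.match(stripped) or stripped.isupper()):
        return "HEADING"
    return "TEXT"


def parse(text):
    """Parse text into a list of {'title', 'paragraphs'} sections."""
    tokens = [(token_kind(l), l.strip()) for l in text.splitlines()]
    pos = 0
    sections = []

    def skip_blanks():
        nonlocal pos
        while pos < len(tokens) and tokens[pos][0] == "BLANK":
            pos += 1

    def paragraph():
        # Consume consecutive TEXT lines and join them into one paragraph.
        nonlocal pos
        lines = []
        while pos < len(tokens) and tokens[pos][0] == "TEXT":
            lines.append(tokens[pos][1])
            pos += 1
        skip_blanks()
        return " ".join(lines)

    skip_blanks()
    while pos < len(tokens):
        title = None
        if tokens[pos][0] == "HEADING":
            title = tokens[pos][1]
            pos += 1
            skip_blanks()
        paragraphs = []
        while pos < len(tokens) and tokens[pos][0] == "TEXT":
            paragraphs.append(paragraph())
        sections.append({"title": title, "paragraphs": paragraphs})
    return sections
```

The grammar itself stays rigid; all the fuzziness lives in `token_kind`, so loosening or tightening what counts as a heading does not require touching the parser.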

I have used the Parse::RecDescent Perl module in the past, with varying levels of success, for similar projects.

Sorry, this may not be the best answer; it is more a summary of my experience with such projects.

Joseph Turian · Answer 3 · 2010-01-22T05:23:08+0000

Define an annotation standard that specifies how you would like the text to be broken up.
Go to Amazon Mechanical Turk and ask people to label 10K documents using your annotation standard.
Train a CRF (conditional random field, which is like an HMM but better) on this labelled data.

If you really want to go this route, I can elaborate on the details. But it will be a lot of work.
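For the training step, each line of a document has to be turned into features before a sequence model can label it. The sketch below shows one plausible way to do that in plain Python; the feature names are made up for illustration, and the assumption that the model takes one feature dictionary per line should be checked against whichever CRF toolkit is used.

```python
def line_features(lines, i):
    """Build a feature dict for line i, including a little context."""
    line = lines[i].strip()
    features = {
        "bias": 1.0,
        "is_blank": not line,
        "n_words": len(line.split()),
        "is_upper": line.isupper(),
        "is_title": line.istitle(),
        "ends_with_period": line.endswith("."),
        "starts_with_digit": line[:1].isdigit(),
        # Position in the document, from 0.0 (first line) to 1.0 (last line).
        "relative_position": round(i / max(len(lines) - 1, 1), 2),
    }
    if i > 0:
        features["prev_is_blank"] = not lines[i - 1].strip()
    else:
        features["BOS"] = True  # beginning of sequence
    if i < len(lines) - 1:
        features["next_is_blank"] = not lines[i + 1].strip()
    else:
        features["EOS"] = True  # end of sequence
    return features


def document_features(text):
    """Return one feature dict per line of the document."""
    lines = text.splitlines()
    return [line_features(lines, i) for i in range(len(lines))]
```

Each document then becomes a sequence of feature dictionaries, paired with the sequence of labels the annotators assigned to the same lines.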

Naveen · Answer 4 · 2010-05-17T06:17:27+0000

Try: pygments, geshi, or prettify

They can handle just about anything you throw at them, and they are very forgiving of errors in your grammar as well as in your documents.

References:
Gitorious uses prettify,
GitHub uses pygments,
Rosetta Code uses geshi.
