There are many parsers and lexers for scripts (i.e. structured computer languages). But I'm looking for one that can break an (almost) unstructured text document into larger sections, for example chapters, paragraphs, etc.
It is relatively easy for a person to identify them: where the table of contents, the acknowledgements, or the beginning of the main body is. You can also build rule-based systems to detect some of them (for example, paragraphs).
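To make the rule-based idea concrete, here is a minimal sketch of the kind of thing I mean, not an existing library. The `split_blocks` function, the `HEADING` pattern, and the 60-character limit are all assumptions made up for illustration: blank lines separate blocks, and a short single-line block that looks like "Chapter N", "Table of Contents" or "Acknowledgements" is treated as a heading.

```python
import re

# Hypothetical heading rule (an assumption, not a standard): the block
# starts with "Chapter <word>" or a common front-matter title.
HEADING = re.compile(
    r"^(chapter\s+\w+|table of contents|contents|acknowledge?ments)\b",
    re.IGNORECASE,
)

def split_blocks(text):
    """Split text into (kind, content) blocks using two simple rules."""
    blocks = []
    # Rule 1: one or more blank lines separate blocks.
    for chunk in re.split(r"\n\s*\n", text.strip()):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Rule 2: a short single-line block matching HEADING is a heading;
        # everything else is treated as a paragraph.
        is_single_line = "\n" not in chunk
        if is_single_line and len(chunk) <= 60 and HEADING.match(chunk):
            blocks.append(("heading", chunk))
        else:
            blocks.append(("paragraph", chunk))
    return blocks

if __name__ == "__main__":
    sample = "Chapter 1\n\nFirst paragraph,\nsecond line.\n\nAnother paragraph."
    for kind, content in split_blocks(sample):
        print(kind, repr(content))
```

Rules like these are easy to write one at a time, but they are brittle, which is why I am hoping something more general already exists.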
I do not expect this to be perfect, but does anyone know of such a coarse-grained "block" lexer/parser? Or could you point me to a body of literature that might help?
parsing document lexer
wilson32