Fault-tolerant XML parsing in Scala

I would like to be able to parse XML that is not necessarily well-formed. I am looking for a fuzzy rather than a strict parser, one that can, for example, recover from badly nested tags. I could write my own, but it is worth asking here first.

Update:

What I'm trying to do is extract links and other information from HTML. For well-formed XML, I can use the Scala XML API. For badly formed XML, it would be nice to convert it to correct XML somehow and process it the same way; otherwise I would need two completely different sets of functions for working with documents.

Obviously, because the input is not well-formed and I'm trying to build a well-formed tree, I would have to use some kind of heuristic (for example, when you see <parent><child></parent>, you close <child> first, and when you later see the stray </child> you ignore it). But of course this is not a proper grammar, so there is no single right way to do it.

+2
8 answers

What you are looking for is not an XML parser. XML is very strict about nesting, closing tags, and so on. One of the other answers suggests Tag Soup. This is a good suggestion, although technically it is much closer to a lexer than to a parser. If all you want from the XML-ish content is an event stream without any validation, then it is almost trivial to roll your own solution. Just walk through the input, consuming content that matches regular expressions (this is essentially what Tag Soup does).

The problem is that a lexer cannot give you many of the features you want from a parser (for example, building a tree representation of the input). You have to implement that logic yourself, because there is no way for such a "lenient" parser to decide how to handle cases such as:
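To illustrate the "roll your own" idea, here is a minimal sketch of a regex-based lexer. The names `Event`, `Open`, `Close`, `Text` and `SoupLexer` are made up for this example and do not come from Tag Soup or any other library. It handles neither comments nor CDATA, and it treats self-closing tags such as `<br/>` as plain opening tags.

```scala
import scala.util.matching.Regex

// Hypothetical event types for this sketch; not part of any library.
sealed trait Event
final case class Open(name: String) extends Event
final case class Close(name: String) extends Event
final case class Text(content: String) extends Event

object SoupLexer {
  // Three alternatives: a closing tag, an opening tag (attributes skipped),
  // or a run of text up to the next '<'.
  private val Token: Regex =
    """</\s*([A-Za-z][\w:-]*)\s*>|<([A-Za-z][\w:-]*)[^>]*>|([^<]+)""".r

  // Emits events in document order without checking nesting at all.
  // Input that matches no alternative (e.g. a lone '<') is silently skipped.
  def lex(input: String): List[Event] =
    Token.findAllMatchIn(input).flatMap { m =>
      val event: Option[Event] =
        if (m.group(1) != null) Some(Close(m.group(1)))
        else if (m.group(2) != null) Some(Open(m.group(2)))
        else Option(m.group(3)).map(_.trim).filter(_.nonEmpty).map(Text(_))
      event
    }.toList
}
```

For example, `SoupLexer.lex("<a href='x'>link</b>")` yields `List(Open("a"), Text("link"), Close("b"))`: the mismatched closing tag is reported as-is, because a lexer has no notion of nesting.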

 <parent> <child> </parent> </child> 

Think about it: what tree would you expect from this? There really is no reasonable answer to that question, and that is why a parser will not be much help here.

That is not to say you cannot use Tag Soup (or your own hand-written lexer) to build some kind of tree structure from this input, but the implementation will be very fragile. With tree-oriented formats such as XML, you really have no choice but to be strict; otherwise it becomes almost impossible to get a reasonable result (this is part of why browser compatibility is so hard to get right).
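For what it is worth, the heuristic described in the question (close still-open children when an ancestor closes, ignore stray closing tags) can be sketched with a stack. This is only one of many possible policies, and all names here (`Tag`, `Start`, `End`, `Node`, `Balancer`) are hypothetical. The synthetic root element is an assumption of this sketch, added so that the result is always a single tree.

```scala
// Hypothetical token and tree types for this sketch.
sealed trait Tag
final case class Start(name: String) extends Tag
final case class End(name: String) extends Tag

final case class Node(name: String, children: List[Node])

object Balancer {
  // An element that is still open; its children are kept in reverse order.
  private final case class Frame(name: String, kids: List[Node])

  def toTree(tokens: List[Tag], rootName: String = "root"): Node = {
    // Close the top frame and attach the finished node to its parent.
    def pop(stack: List[Frame]): List[Frame] = stack match {
      case Frame(n, kids) :: Frame(pn, pkids) :: rest =>
        Frame(pn, Node(n, kids.reverse) :: pkids) :: rest
      case other => other // the synthetic root is never popped
    }

    val afterInput = tokens.foldLeft(List(Frame(rootName, Nil))) {
      case (stack, Start(n)) => Frame(n, Nil) :: stack
      case (stack, End(n)) =>
        // stack.init leaves out the synthetic root at the bottom.
        if (stack.init.exists(_.name == n)) {
          // Implicitly close everything above the match, then the match itself.
          var s = stack
          while (s.head.name != n) s = pop(s)
          pop(s)
        } else stack // stray closing tag: ignore it
    }

    // Close whatever is still open at end of input.
    var s = afterInput
    while (s.size > 1) s = pop(s)
    Node(s.head.name, s.head.kids.reverse)
  }
}
```

Feeding it the tokens for the example above, `List(Start("parent"), Start("child"), End("parent"), End("child"))`, gives `Node("root", List(Node("parent", List(Node("child", Nil)))))`. A different policy would be just as defensible, which is exactly the fragility mentioned above.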

+7

Try the parser on the XHtml object. It is much more lenient than the one for XML.

+2

Take a look at HtmlCleaner. I have used it successfully to convert "HTML from the wild" into valid XML.

+2

Try Tag Soup.

JTidy does something similar, but only for HTML.

+1

I basically agree with Daniel Spiewak's answer. This is just another approach to building your own parser.

While I don't know of any Scala-specific solution, you can try Woodstox, a Java library that implements the StAX API. (Since it is an event-based API, I assume it will be more error tolerant than a DOM parser.)

There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who created the Simple Build Tool for Scala.

I had mixed feelings about Frostbridge when I tried it, but maybe it is better suited to your purposes.
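As a rough sketch of what an event-based API buys you, the snippet below uses the standard `javax.xml.stream` (StAX) interfaces that ship with the JDK; with Woodstox on the classpath, `XMLInputFactory.newInstance()` should pick it up instead. The object and method names (`StaxLinks`, `hrefs`) are made up for this example. Note the caveat: the parser is still strict and stops at the first well-formedness error, but everything it saw before that point has already been delivered.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamException}
import scala.collection.mutable.ListBuffer

object StaxLinks {
  /** Returns the hrefs seen before the first error, plus whether the
    * whole document was read without a well-formedness error. */
  def hrefs(input: String): (List[String], Boolean) = {
    val found = ListBuffer.empty[String]
    val reader =
      XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(input))
    val clean =
      try {
        while (reader.hasNext) {
          reader.next()
          if (reader.isStartElement && reader.getLocalName == "a") {
            // A null namespace URI means the attribute is matched by local name only.
            Option(reader.getAttributeValue(null, "href")).foreach(found += _)
          }
        }
        true
      } catch {
        case _: XMLStreamException => false // the strict parser gave up here
      } finally reader.close()
    (found.toList, clean)
  }
}
```

So this is partial extraction rather than real recovery: links after the first error are lost unless you resynchronise yourself.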

+1

I agree with the answers saying that turning invalid XML into "correct" XML is not possible.

Why don't you just do a plain-text search for hrefs, if that's all you are interested in? One problem will be comments, but if the XML is not valid, it may not even be possible to tell what is meant to be commented out!
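A minimal sketch of such a plain-text search, assuming the attribute values are quoted (unquoted hrefs are not matched). `HrefSearch` is a made-up name for this example.

```scala
object HrefSearch {
  // Case-insensitive; accepts double- or single-quoted values only.
  private val Href = """(?i)href\s*=\s*(?:"([^"]*)"|'([^']*)')""".r

  // Returns every quoted href value in the text, in order of appearance.
  def find(text: String): List[String] =
    Href.findAllMatchIn(text)
      .map(m => Option(m.group(1)).getOrElse(m.group(2)))
      .toList
}
```

Note that `HrefSearch.find("""<a href="a.html">x</b> <!-- <A HREF='b.html'> -->""")` returns both `a.html` and `b.html`: the search does not care about bad nesting, but it also picks up the link inside the comment, which is the caveat mentioned above.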

+1

Caucho has a JAXP-compliant XML parser that is somewhat more forgiving than what you would usually expect. (Including support for things like references to undefined entities, AFAIK.)

Find JavaDoc for parsers here

0

A related question (with my solution) is linked below:

Scala and HTML parsing

0
