Working with Invalid XML

I am dealing with malformed XML in Perl, generated by an upstream process that I cannot modify (this seems to be a common problem here). As far as I can tell, the XML is wrong in only one specific way: it has attribute values that contain unescaped less-than characters, for example:

<tag v="< 2"> 

I use Perl with XML::LibXML to parse, and this of course produces parse errors. I tried the recover option, which does let me parse, but it simply stops at the first parse error, so I lose data that way.

I seem to have two general options:

  • Fix input XML before parsing it, possibly using regular expressions.
  • Find a more forgiving XML parser.

I am leaning toward option 1, since I would still like to catch any other errors in the XML. What would you suggest? If #1, can anyone guide me through the regex approach?

+4
xml perl
2 answers

One option is to catch the exceptions, find out where in the input they occurred, fix the input at that point, and try again.

The following is a quick, inefficient proof-of-concept script using XML::Twig, because I still haven't figured out how to build and install libxml2 from scratch on Windows.

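    #!/usr/bin/env perl

    use strict;
    use warnings;

    use XML::Twig;

    my $xml = q{ <tag v="< 2"/> };

    # Parse repeatedly; after each failure, escape the offending '<' and retry.
    while ( 1 ) {
        eval {
            my $twig = XML::Twig->new(
                twig_handlers => { tag => \&tag_handler },
            );
            $twig->parse( $xml );
            1;
        } and last;

        my $err = $@;

        # The error message is expected to report the byte offset of the failure.
        my ($i) = ($err =~ /byte ([0-9]+)/)
            or die $err;

        # Only repair a bare '<'; rethrow any other kind of error.
        substr($xml, $i, 1) eq '<'
            or die $err;

        $xml = substr($xml, 0, $i) . '&lt;' . substr($xml, $i + 1);
    }

    # Print the 'v' attribute of each <tag> element.
    sub tag_handler {
        my (undef, $elt) = @_;
        print $elt->att('v'), "\n";
    }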

I wrote about this on my blog.

+7

I know this is not the answer you want, but the XML specification is quite clear and strict.

Malformed XML is fatal.

If it does not pass a validator, your code should not even try to "fix" it, any more than you would try to automatically "fix" program code.

From the Annotated XML Specification:

fatal error [Definition:] An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way).

And, in particular, the commentary on why: Draconian error handling

We want XML to empower programmers to write code that can be transmitted across the Web and run on a large number of desktops. However, if this code must include error handling for all sorts of sloppy end-user practices, it will of necessity balloon in size to the point where it, like Netscape Navigator or Microsoft Internet Explorer, is tens of megabytes in size, thus defeating the purpose.

If you have ever tried to build an HTML parser, you will understand why it has to be this way: you end up writing SO MANY handlers for edge cases, bad tags, and implicit tag closure that your code is a mess right from the start.

And because this is my favorite post on Qaru, here is an example of why: RegEx match open tags except XHTML self-contained tags

Now I understand that this is not always an option, and you probably would not have come here to ask if getting your upstream to "fix their XML" were the path of least resistance. Nevertheless, I still urge you to report this as a defect in the application that generates the XML, and to resist as far as possible any pressure to "fix" it programmatically, because, as you have rightly figured out, you are creating a world of pain for yourself when the right answer is "fix the problem at the source."

If you are really stuck on this path, then, as Sinan Ünür points out, your only option is to trap the point where your parser failed, then inspect the input and try to repair it as you go. But you will not find an XML parser that does this for you, because one that did would, by definition, be broken.

I would suggest that you first:

  • Dig out a copy of the specification to show to the person who asked you to do this.
  • Point out that the whole reason we have standards is to promote interoperability.
  • Therefore, by doing something that deliberately violates the standard, you are taking on a business risk: you are creating code that may one day mysteriously break, because things like regular expressions or automatic fixing build in a set of assumptions that may not hold.
  • A useful concept here is technical debt: explain that you are incurring technical debt by fixing this automatically, for something that really is not your problem.
  • Then ask them if they want to accept this risk.
  • If they think the risk is acceptable, then just go ahead with it. You might even find it worthwhile to ignore the fact that your source data looks like XML and treat it as plain text, using regular expressions to extract the relevant lines of data, and so on.
  • Apologize in the comments to your future maintenance programmer, and explain who made the decision and why.
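If the risk is accepted, the regex pre-pass asked about in the question could look roughly like the sketch below. It is not a general solution, and it rests on exactly the kind of assumptions warned about above: every attribute value is double-quoted, no value contains a literal double quote, and the sequence ="..." never appears in text content, comments, or CDATA sections. The helper name escape_lt_in_attrs is made up for this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: escape bare '<' inside double-quoted attribute values.
# ASSUMPTIONS (not guaranteed by the upstream data): values are always
# double-quoted, never contain '"', and ="..." does not occur outside tags.
sub escape_lt_in_attrs {
    my ($xml) = @_;

    # Match  ="value"  and rewrite only the value part.
    $xml =~ s{ (= \s* ") ([^"]*) (") }{
        my ($open, $value, $close) = ($1, $2, $3);
        $value =~ s/</&lt;/g;          # '<' is the only character repaired here
        $open . $value . $close;
    }gex;

    return $xml;
}

my $broken = q{<root><tag v="< 2"/><tag v="a < b < c">text</tag></root>};
print escape_lt_in_attrs($broken), "\n";
```

Because an already escaped &lt; contains no '<', running the pre-pass twice leaves the text unchanged. The output should still be handed to a strict parser afterwards, so that any other well-formedness errors continue to be reported instead of being silently papered over.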

May also be useful as a reference point: Which characters should not be set as values in an XML file

+7