How to parse a DTD file in Ruby

I tried to convert a DTD file to a YAML file, and I tried to load it in both libXML and Nokogiri, but it seems that the DTD file is not a valid XML file. I am fine using any third-party gems if I can parse the DTD file.

My conversion attempt:

    wget "http://xml.evernote.com/pub/enml2.dtd"
    irb
    require 'nokogiri'
    xml = Nokogiri::XML::Document.parse('enml2.dtd')
    xml.to_yaml
    => "--- !ruby/object:Nokogiri::XML::Document\ndecorators: \nnode_cache: []\nerrors:\n- !ruby/exception:Nokogiri::XML::SyntaxError\n message: |\n Start tag expected, '<' not found\n domain: 1\n code: 4\n level: 3\n file: \n line: 1\n str1: \n str2: \n str3: \n int1: 0\n column: 1\n"

Online XML validators also return a "Start tag expected" error. I assumed that all valid XML documents start with <?xml, which seems to be missing here. This is what led me to conclude that DTD files are not valid XML files. However, it is strange that the syntax for defining XML was not itself defined as valid XML. Why?

The reason I am parsing a DTD file is that I want to remove invalid attributes from an XML file. To find out which attributes to keep and which to delete, I need a way to parse the DTD file.

And ultimately, this is just a step in an attempt to convert HTML to ENML (Evernote markup language). Stages include:

  • Convert HTML to valid XHTML
  • Convert body to en-note element
  • Remove tags and attributes that are invalid according to the DTD file
  • Validate the ENML file against the DTD

For now I am going to just copy the forbidden attributes and tags from "Understanding the Evernote Markup Language" and use them to validate my XHTML, but I would prefer to use the DTD as the source.

The Nokogiri DTD class is a Node class for holding a document's internal DTD node and validating against it. In my case, I have an external DTD file specified using the SYSTEM keyword, which Nokogiri does not seem to support. And even if it worked, all I would get is validation.

I actually got validation to work correctly using:

    #dtd = XML::Dtd.new File.read Rails.root.join('lib', 'assets','enml2.dtd')
    #enml_document = XML::Document.string enml
    #ret = enml_document.validate dtd

I have not tried REXML. I will give it a try and report back.

I am trying to convert an HTML document to an XML document that validates against this DTD. Most HTML elements and attributes are not allowed in the ENML schema, so I have to remove or sanitize them. I also need to know which attributes are allowed and which are not, so that I can process the XML correctly and remove or sanitize the offending elements and attributes.

For sanitizing I use Loofah, but to use it I need a list of tags and attributes (which attributes are allowed for each tag). Instead of making several passes and validating the document at the end of the cleanup, I just go through each XML tag and sanitize it. But in order to know how to sanitize them, I need to know which tags and attributes are supported in the actual schema. So I need to parse the DTD file.

From what I understand, XSLT is the right tool for the job, but I am not comfortable using it.

ruby xml dtd nokogiri evernote
1 answer

However, it seems strange to me that the syntax for defining XML was not itself defined as valid XML. I would like to know the reasons for this. (from the question)

DTDs are a holdover from SGML, the predecessor of XML, so it's actually not very strange that DTDs are not XML files. Keeping DTDs and their specific syntax was a deliberate decision when XML was created.

More modern schema languages, such as W3C XML Schema and RELAX NG, use XML syntax.


The reason I am parsing a DTD file is because I want to remove invalid attributes from the XML file. To find out which attributes to save and which to delete, I need a way to parse the DTD file. (from the question)

I'm just looking for a way to parse DTD files, rather than just validating against them, because I want to do custom sanitization and validation using the DTD. (from the bounty text)

I do not understand what you mean by "custom sanitization". I also do not see why you would need to parse the DTD in the first place.

To find out whether any elements or attributes in the XML file are invalid (that is, whether they violate the rules in the associated DTD), you need to parse the XML file with a validating XML parser. The parser will then tell you if there are any errors that must be fixed.

Nokogiri is based on libxml2, which provides a validating parser. It supports external DTDs that are specified using the syntax <!DOCTYPE foo SYSTEM "bar.dtd"> (how to make this work is shown in a comment on the issue you refer to: https://github.com/sparklemotion/nokogiri/issues/440#issuecomment-3031164).

Here is how to validate:

    require 'nokogiri'

    xml = File.read("yourfile.xml")
    options = Nokogiri::XML::ParseOptions::DTDLOAD   # Needed for the external DTD to be loaded
    doc = Nokogiri::XML::Document.parse(xml, nil, nil, options)
    puts doc.external_subset.validate(doc)

If this code reports no errors, then the XML document is valid according to the DTD.
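If what is needed is the list of attributes declared for each element, rather than a validity check, the DTD text can also be scanned directly. The following is only a rough sketch using plain Ruby and no gems. The method names allowed_attributes and expand_refs and the sample DTD are made up for illustration. It assumes the DTD uses only internal parameter entities and ordinary <!ATTLIST> declarations; external entities, conditional sections, and default values containing ">" are not handled, so the result should be checked by hand against the real DTD.

```ruby
# Hypothetical helper: replace %name; references with their declared text.
# Unknown (for example external) entities are dropped.
def expand_refs(text, entities, depth = 0)
  return text if depth > 20 # guard against self-referencing entities

  text.gsub(/%([\w.:-]+);/) do
    value = entities[Regexp.last_match(1)]
    value ? expand_refs(value, entities, depth + 1) : ''
  end
end

# Hypothetical helper: returns { "element" => ["attr", ...] } built from
# the <!ATTLIST> declarations found in dtd_text.
def allowed_attributes(dtd_text)
  text = dtd_text.gsub(/<!--.*?-->/m, '') # strip comments

  # Collect internal parameter entities: <!ENTITY % name "replacement text">
  entities = {}
  text.scan(/<!ENTITY\s+%\s+([\w.:-]+)\s+(?:"([^"]*)"|'([^']*)')\s*>/m) do |name, dq, sq|
    entities[name] ||= (dq || sq) # the first declaration wins
  end

  expanded = expand_refs(text, entities)

  result = Hash.new { |hash, key| hash[key] = [] }
  expanded.scan(/<!ATTLIST\s+([^\s>]+)\s*(.*?)>/m) do |element, body|
    # Tokens are enumerations "(a | b)", quoted defaults, or bare words.
    tokens = body.scan(/\([^)]*\)|"[^"]*"|'[^']*'|\S+/)
    until tokens.empty?
      name = tokens.shift
      type = tokens.shift
      tokens.shift if type == 'NOTATION'   # NOTATION is followed by an enumeration
      default = tokens.shift
      tokens.shift if default == '#FIXED'  # #FIXED is followed by a value
      result[element] |= [name]
    end
  end
  result
end

# A tiny made-up DTD, not the real enml2.dtd.
sample = <<~'DTD'
  <!ENTITY % coreattrs "style CDATA #IMPLIED title CDATA #IMPLIED">
  <!ELEMENT a (#PCDATA)>
  <!ATTLIST a
    %coreattrs;
    href CDATA #REQUIRED
    dir (ltr | rtl) #IMPLIED
  >
  <!ELEMENT br EMPTY>
  <!ATTLIST br %coreattrs;>
DTD

p allowed_attributes(sample)
```

The resulting hash could then serve as the allow-list when walking the document element by element and deleting every attribute that is not listed for that element.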
