I tried to convert a DTD file to a YAML file by loading it with both libxml and Nokogiri, but it seems that a DTD file is not a valid XML file. I am fine with using any third-party gem as long as it can parse the DTD file.
My conversion attempt:
    wget "http://xml.evernote.com/pub/enml2.dtd"
    irb
    require 'nokogiri'
    xml = Nokogiri::XML::Document.parse('enml2.dtd')
    xml.to_yaml
    => "--- !ruby/object:Nokogiri::XML::Document\ndecorators: \nnode_cache: []\nerrors:\n- !ruby/exception:Nokogiri::XML::SyntaxError\n message: |\n Start tag expected, '<' not found\n domain: 1\n code: 4\n level: 3\n file: \n line: 1\n str1: \n str2: \n str3: \n int1: 0\n column: 1\n"
Every online XML validator I tried also returns a start-tag error. I assume that all valid XML documents start with <?xml, which seems to be missing here. This is what led me to conclude that DTD files are not valid XML files. However, it seems strange that the syntax used to define XML documents is not itself valid XML. Why?
I want to parse the DTD to find out which attributes to keep and which to delete, so that I can remove invalid attributes from an XML file. So I need a way to parse the DTD file.
And ultimately, this is just a step in an attempt to convert HTML to ENML (Evernote markup language). Stages include:
- Convert HTML to valid XHTML
- Convert body to en-note element
- Remove the tags and attributes that the DTD does not allow
- Validate the ENML file against the DTD
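A minimal sketch of the tag-and-attribute removal step, assuming the allowed names are already available as a hash. The `ALLOWED` table and the `sanitize` helper are hypothetical names made up for illustration, and the entries in the table are placeholders rather than the real ENML rules. It uses REXML instead of Nokogiri or Loofah only to stay self-contained.

```ruby
require 'rexml/document'

# Hypothetical allowlist: element name => permitted attribute names.
# These entries are placeholders, not the real contents of enml2.dtd.
ALLOWED = {
  'en-note' => %w[style],
  'p'       => %w[style title],
  'a'       => %w[href title]
}.freeze

# Removes every element that is not in the allowlist (with its subtree)
# and every attribute that is not permitted on its element.
def sanitize(xml, allowed)
  doc = REXML::Document.new(xml)
  # Snapshot the element list first so removals do not disturb iteration.
  REXML::XPath.match(doc, '//*').each do |element|
    permitted = allowed[element.name]
    if permitted.nil?
      element.remove
    else
      element.attributes.keys.each do |name|
        element.delete_attribute(name) unless permitted.include?(name)
      end
    end
  end
  doc.to_s
end
```

Calling `sanitize(xhtml, ALLOWED)` returns the cleaned markup as a string. Dropping a disallowed element together with its children is a design choice of this sketch; keeping the children and removing only the wrapper may be preferable for some tags.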
For now I am going to just copy the forbidden attributes and tags from "Understanding the Evernote Markup Language" and use them to validate my XHTML, but I would prefer to use the DTD as the source.
Nokogiri's DTD class is a Node class for storing a document's internal DTD node and validating against it. In my case, I have an external DTD file specified using the SYSTEM identifier, which Nokogiri does not seem to support. And even if it worked, all I would get is validation.
I did get validation to work with libxml-ruby using:
    #dtd = XML::Dtd.new File.read Rails.root.join('lib', 'assets','enml2.dtd')
    #enml_document = XML::Document.string enml
    #ret = enml_document.validate dtd
I have not tried REXML. I will give it a try and report back.
I am trying to convert an HTML document to an XML document that validates against this DTD. Most HTML elements and attributes are not allowed in the ENML schema, so I have to strip them. I also need to know which attributes are allowed and which are not, so that I can walk the XML correctly and remove or sanitize the offending elements and attributes.
For the cleanup I use Loofah, but to use it I need a list of tags and their attributes (which attributes are available for each tag). Instead of making several passes and validating the document only at the end of the cleanup, I just go through each XML tag and clean it. But in order to know how to clean them, I need to know which tags and attributes are supported in the actual schema. So I need to parse the DTD file.
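One possible way to get that list without an XML parser is a small regex pass over the DTD text. The sketch below is an assumption-laden illustration, not a full DTD parser: the helper names (`allowed_attributes`, `declared_elements`, and so on) are invented here, only internal parameter entities are expanded, and conditional sections, external entities, NOTATION types and default values containing `>` are not handled.

```ruby
# Matches <!ENTITY % name "value"> (internal parameter entity declarations).
ENTITY_DECL = /<!ENTITY\s+%\s+([\w.-]+)\s+(["'])(.*?)\2\s*>/m
# Matches <!ATTLIST element ...attribute definitions...>.
ATTLIST_DECL = /<!ATTLIST\s+([\w.:-]+)\s+(.*?)>/m
# One attribute definition: name, type (keyword or enumeration), default.
ATTR_DEF = /([\w.:-]+)\s+(\([^)]*\)|[A-Z]+)\s+(#REQUIRED|#IMPLIED|(?:#FIXED\s+)?(?:"[^"]*"|'[^']*'))/

def strip_comments(dtd)
  dtd.gsub(/<!--.*?-->/m, '')
end

# Returns { entity name => replacement text }.
def parameter_entities(dtd)
  dtd.scan(ENTITY_DECL).each_with_object({}) do |(name, _quote, value), entities|
    entities[name] ||= value # the first declaration of an entity is binding
  end
end

# Replaces %name; references repeatedly, up to a fixed nesting depth.
def expand_entities(text, entities, depth = 10)
  depth.times do
    expanded = text.gsub(/%([\w.-]+);/) { entities.fetch($1, '') }
    return expanded if expanded == text
    text = expanded
  end
  text
end

# Returns { element name => [attribute names] } from the ATTLIST declarations.
def allowed_attributes(dtd)
  dtd = strip_comments(dtd)
  entities = parameter_entities(dtd)
  dtd.scan(ATTLIST_DECL).each_with_object({}) do |(element, body), result|
    expand_entities(body, entities).scan(ATTR_DEF) do |name, _type, _default|
      (result[element] ||= []) << name
    end
  end
end

# Returns the element names from the ELEMENT declarations.
def declared_elements(dtd)
  strip_comments(dtd).scan(/<!ELEMENT\s+([\w.:-]+)/).flatten
end
```

With the file from the question, `allowed_attributes(File.read('enml2.dtd'))` would give a plain hash of strings and arrays. After `require 'yaml'`, calling `to_yaml` on that hash would produce the YAML form, and the same hash could drive the per-tag cleanup.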
From what I understand, XSLT is the right tool for the job, but it's not convenient for me to use it.