How to convert HTML to valid XHTML?
I have an HTML string; in this example it looks like this:

    <img src="somepic.jpg" someAtrib="1" >

I am trying to tweak a regex that matches the "img" node and appends a slash to the end of the node, so that it looks like this:

    <img src="somepic.jpg" someAtrib="1" />

Essentially, the ultimate goal is to close the node: unclosed nodes are valid in HTML, but not in XML. Is there a regex buff out there who can help?
Do not use regular expressions; use a dedicated parser instead. In JavaScript, create a document using DOMParser, then serialize it using XMLSerializer:

    var doc = new DOMParser().parseFromString('<img src="foo">', 'text/html');
    var result = new XMLSerializer().serializeToString(doc);
    // result:
    // <html xmlns="http://www.w3.org/1999/xhtml"><head></head><body> (no line break)
    // <img src="foo" /></body></html>

You can create an XHTML document and import or adopt HTML elements into it. HTML strings can be parsed using the innerHTML property. The key point is to use the Document.importNode() or Document.adoptNode() method to convert HTML nodes to XHTML nodes:

    var di = document.implementation;
    var hd = di.createHTMLDocument();
    var xd = di.createDocument('http://www.w3.org/1999/xhtml', 'html', null);

    hd.body.innerHTML = '<img>';
    var img = hd.body.firstElementChild;

    var xb = xd.createElement('body');
    xd.documentElement.appendChild(xb);

    console.log('html doc:\n' + hd.documentElement.outerHTML + '\n');
    console.log('xhtml doc:\n' + xd.documentElement.outerHTML + '\n');

    img = xd.importNode(img); // or xd.adoptNode(img). Now img is an XHTML element
    xb.appendChild(img);

    console.log('xhtml doc after import/adopt img from html:\n' + xd.documentElement.outerHTML + '\n');

The output should be:

    html doc:
    <html><head></head><body><img></body></html>

    xhtml doc:
    <html xmlns="http://www.w3.org/1999/xhtml"><body></body></html>

    xhtml doc after import/adopt img from html:
    <html xmlns="http://www.w3.org/1999/xhtml"><body><img /></body></html>

Rob W's answer does not work in Chrome (at least version 29 and below), because DOMParser does not support the text/html type there, and XMLSerializer generates HTML (not XHTML) syntax for an HTML document in Chrome.
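
Given those browser differences, it may be worth checking for text/html support before relying on DOMParser. The following is only a minimal sketch: the helper name supportsTextHtmlParsing is made up for illustration, and it assumes that a browser without support either throws, returns null, or returns a document without a usable body.

```javascript
// Hypothetical helper: reports whether DOMParser can parse 'text/html'.
// Assumption: browsers without support either throw, return null, or return
// a document that has no body. Environments without DOMParser report false.
function supportsTextHtmlParsing() {
  if (typeof DOMParser === 'undefined') {
    return false;
  }
  try {
    var doc = new DOMParser().parseFromString('<p>x</p>', 'text/html');
    return !!(doc && doc.body && doc.body.firstChild);
  } catch (e) {
    return false;
  }
}

console.log(supportsTextHtmlParsing());
```

When the check returns false, the createHTMLDocument() plus importNode() approach shown above is one possible fallback.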
In addition to Rob W's answer, you can extract the body contents using a regular expression:

    var doc = new DOMParser().parseFromString('<img src="foo">', 'text/html');
    var result = new XMLSerializer().serializeToString(doc);
    // Use the match result directly rather than the legacy RegExp.$1 property.
    var match = /<body>(.*)<\/body>/im.exec(result);
    result = match ? match[1] : result;
    // result:
    // <img src="foo" />

Note: parseFromString(htmlString, 'text/html') throws an error in IE9, because the text/html MIME type is not supported there; it works in IE10 and IE11.
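
The body extraction can also be wrapped in a small function that works on any serialized string. This is a sketch rather than a definitive implementation: the name extractBodyMarkup is hypothetical, and the pattern is loosened on the assumption that the body tag may carry attributes and that the content may span several lines.

```javascript
// Hypothetical helper: returns the markup between <body ...> and </body>
// in a serialized document, or null when no body element is found.
function extractBodyMarkup(serialized) {
  // [\s\S]* also matches line breaks, which "." does not.
  var match = /<body[^>]*>([\s\S]*)<\/body>/i.exec(serialized);
  return match ? match[1] : null;
}

// Sample input written by hand in the shape XMLSerializer produces.
var serialized = '<html xmlns="http://www.w3.org/1999/xhtml"><head></head>' +
  '<body><img src="foo" />\n<br /></body></html>';

console.log(extractBodyMarkup(serialized));
```

Returning null for a missing body lets the caller decide what to do, instead of silently reusing a stale RegExp.$1 value.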
This should do the job:

    result = text.replace(/(<img\b[^<>]*[^<>\/])>/ig, "$1 />");

Addendum: In the (unlikely) case where your code contains tag attributes with angle brackets in them (this is not valid XML/XHTML, by the way), this one will do a little better:

    result = text.replace(/(<img\b(?:[^<>"'\/]+|'[^']*'|"[^"]*")*)>/ig, "$1 />");

Why do you want to fix, in the browser's DOM, an HTML document that is not valid XHTML?
It has already been parsed and processed, and you already have a DOM. Any parsing error caused by the malformed document has already occurred, and a regular expression applied to the DOM is not going to fix it.
Also, remember that almost all documents are parsed as HTML tag soup. If you cannot correct the document on the server side, simply do not worry about its validity or quality on the client side.
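
If you do end up working on a raw string rather than a DOM, the two regular expressions from the regex answer above can be compared side by side. This is only a sketch: the function names closeImgSimple and closeImgQuoted and the sample strings are made up for illustration, while the patterns themselves are copied unchanged from that answer.

```javascript
// First pattern from the regex answer: no special handling of quoted values.
function closeImgSimple(text) {
  return text.replace(/(<img\b[^<>]*[^<>\/])>/ig, '$1 />');
}

// Second pattern: skips over quoted attribute values, so a ">" inside
// quotes is not mistaken for the end of the tag.
function closeImgQuoted(text) {
  return text.replace(/(<img\b(?:[^<>"'\/]+|'[^']*'|"[^"]*")*)>/ig, '$1 />');
}

// Illustrative inputs (not from the original question).
var samples = [
  '<img src="foo">',
  '<img src="foo" />',
  '<img alt="a > b">',
  '<img>'
];

samples.forEach(function (html) {
  console.log(html);
  console.log('  simple: ' + closeImgSimple(html));
  console.log('  quoted: ' + closeImgQuoted(html));
});
```

Both patterns leave an already closed tag alone. The first one needs at least one character between the tag name and the closing bracket, so it leaves a bare `<img>` unchanged, and it cuts the tag short at a `>` inside a quoted value; the second one handles both cases.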