MSHTML parses the ARTICLE tag incorrectly

I am trying to parse HTML using the MSHTML parser in Delphi 10 Seattle. It works fine, except that the ARTICLE tag confuses it: the parsed ARTICLE element has no innerHTML and no children, although the source markup contains them.

    program Project1;

    {$APPTYPE CONSOLE}

    {$R *.res}

    uses
      System.SysUtils, Variants, ActiveX, MSHTML;

    procedure DoParse;
    var
      idoc: IHTMLDocument2;
      iCollection: IHTMLElementCollection;
      iElement: IHTMLElement;
      V: OleVariant;
      HTML: String;
      i: Integer;
    begin
      Html := '<html>'#10+
        '<head>'#10+
        ' <title>Articles</title>'#10+
        '</head>'#10+
        '<body>'#10+
        ' <article>'#10+
        ' <p>This is my Article</p>'#10+
        ' </article>'#10+
        '</body>'#10+
        '</html>';

      v := VarArrayCreate( [0,1], varVariant);
      v[0]:= Html;

      idoc := CoHTMLDocument.Create as IHTMLDocument2;
      idoc.designMode := 'on';
      idoc.write(PSafeArray(System.TVarData(v).VArray));
      idoc.close;

      iCollection := idoc.all as IHTMLElementCollection;
      for i := 0 to iCollection.length-1 do
      begin
        iElement := iCollection.item( i, 0) as IHTMLElement;
        if assigned(ielement) then
          WriteLN(iElement.tagName + ': ' + iElement.outerHTML);
      end;
    end;

    begin
      try
        DoParse;
      except
        on E: Exception do
          Writeln(E.ClassName, ': ', E.Message);
      end;
      ReadLN;
    end.

Program output

    HTML: <HTML><HEAD><TITLE>Articles</TITLE> <META name=GENERATOR content="MSHTML 11.00.9600.18283"></HEAD> <BODY><ARTICLE> <P>This is my Article</P></ARTICLE>undefined</BODY></HTML>
    HEAD: <HEAD><TITLE>Articles</TITLE> <META name=GENERATOR content="MSHTML 11.00.9600.18283"></HEAD>
    TITLE: <TITLE>Articles</TITLE>
    META: <META name=GENERATOR content="MSHTML 11.00.9600.18283">
    BODY: <BODY><ARTICLE> <P>This is my Article</P></ARTICLE>undefined</BODY>
    ARTICLE: <ARTICLE>
    P: <P>This is my Article</P>
    /ARTICLE: </ARTICLE>

As you can see, the ARTICLE tag is handled incorrectly: the ARTICLE element has no content, and /ARTICLE is reported as a separate element.

Can someone help me understand this problem?

1 answer

See the documentation: custom element | custom object.

Internet Explorer support for custom tags on an HTML page requires that a namespace be defined for the tag. Otherwise, the custom tag is treated as an unknown tag when the document is parsed. Although navigating to a page with an unknown tag in Internet Explorer does not result in an error, unknown tags have the disadvantage that they cannot contain other tags, and behaviors cannot be applied to them.

In your case, ARTICLE is an unknown tag. To make it a tag that can contain other tags, you need to add a namespace to it, for example <MY:ARTICLE>, and declare the namespace with <html XMLNS:MY> (if you do not declare the namespace, the DOM parser will add it automatically).
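As a rough illustration of the namespace approach, the markup from the question could be rewritten along these lines. The MY prefix and the NamespacedHtml function name are placeholders chosen for this sketch; the code only builds the string and does not by itself confirm how MSHTML will parse it.

```delphi
program NamespaceSketch;

{$APPTYPE CONSOLE}

// Hypothetical helper: returns the question's markup with ARTICLE moved
// into a declared namespace (the MY prefix is an arbitrary placeholder).
function NamespacedHtml: string;
begin
  Result :=
    '<html XMLNS:MY>'#10 +
    '<head>'#10 +
    '  <title>Articles</title>'#10 +
    '</head>'#10 +
    '<body>'#10 +
    '  <MY:ARTICLE>'#10 +
    '    <p>This is my Article</p>'#10 +
    '  </MY:ARTICLE>'#10 +
    '</body>'#10 +
    '</html>';
end;

begin
  // Print the markup; in the question's DoParse it would be assigned to Html.
  Writeln(NamespacedHtml);
end.
```

This only helps when you control the markup; it does not apply to pages you merely download and parse.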

See also: Using Custom Tags in Internet Explorer


In your comment, you mentioned that you are trying to parse a live HTML5 page (you did not mention this in the question).
Since I am not an HTML5 expert, I did not associate the ARTICLE tag with the HTML5 standard.

By default, the program runs in IE7 compatibility mode, so MSHTML does not know about this special tag and treats it as an unknown tag.

So, try adding <!DOCTYPE html> as the first line of the HTML, and add <meta http-equiv="X-UA-Compatible" content="IE=edge"> as the first element of the HEAD section (it must come first). Alternatively, try adding your executable to the FEATURE_BROWSER_EMULATION registry subkey: How to use the Delphi TWebbrowser component that works in IE9 mode?
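Both suggestions can be sketched in Delphi roughly as follows. The names Html5Markup and EnableIE11Emulation are invented for this example, and the value 11000 (IE11 mode for pages with a standards DOCTYPE) is an assumption; pick the value that matches the IE version installed. Whether the registry setting is honored by a standalone CoHTMLDocument, rather than by a hosted TWebBrowser, is not confirmed here, so treat it as something to try.

```delphi
program Html5ModeSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, Winapi.Windows, System.Win.Registry;

const
  // Per-user feature control key that IE consults for hosted browser code.
  EmulationKey = 'Software\Microsoft\Internet Explorer\Main\' +
    'FeatureControl\FEATURE_BROWSER_EMULATION';

// Hypothetical helper: the question's markup with a DOCTYPE on the first
// line and the X-UA-Compatible meta as the first element of HEAD.
function Html5Markup: string;
begin
  Result :=
    '<!DOCTYPE html>'#10 +
    '<html>'#10 +
    '<head>'#10 +
    '  <meta http-equiv="X-UA-Compatible" content="IE=edge">'#10 +
    '  <title>Articles</title>'#10 +
    '</head>'#10 +
    '<body>'#10 +
    '  <article>'#10 +
    '    <p>This is my Article</p>'#10 +
    '  </article>'#10 +
    '</body>'#10 +
    '</html>';
end;

// Hypothetical helper: registers the current executable under
// FEATURE_BROWSER_EMULATION for the current user.
// 11000 is an assumed value (IE11 mode for standards-DOCTYPE pages).
procedure EnableIE11Emulation;
var
  Reg: TRegistry;
begin
  Reg := TRegistry.Create;
  try
    Reg.RootKey := HKEY_CURRENT_USER;
    // Second argument True creates the key if it does not exist yet.
    if Reg.OpenKey(EmulationKey, True) then
    begin
      Reg.WriteInteger(ExtractFileName(ParamStr(0)), 11000);
      Reg.CloseKey;
    end;
  finally
    Reg.Free;
  end;
end;

begin
  EnableIE11Emulation;
  // In the question's DoParse, this string would be assigned to Html.
  Writeln(Html5Markup);
end.
```

The registry value has to be written before the first MSHTML document is created in the process, which is why the sketch calls EnableIE11Emulation first.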

PS: idoc.designMode := 'on'; is not required.
