I am trying to parse HTML using the MSHTML parser in Delphi 10 Seattle. It works fine in general, but the ARTICLE tag confuses it: the parsed ARTICLE element has no innerHTML and no children, although both are present in the source.
program Project1;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils, Variants, ActiveX, MSHTML;

procedure DoParse;
var
  idoc: IHTMLDocument2;
  iCollection: IHTMLElementCollection;
  iElement: IHTMLElement;
  V: OleVariant;
  HTML: String;
  i: Integer;
begin
  Html := '<html>'#10+
          '<head>'#10+
          ' <title>Articles</title>'#10+
          '</head>'#10+
          '<body>'#10+
          ' <article>'#10+
          ' <p>This is my Article</p>'#10+
          ' </article>'#10+
          '</body>'#10+
          '</html>';

  v := VarArrayCreate( [0,1], varVariant);
  v[0]:= Html;

  idoc := CoHTMLDocument.Create as IHTMLDocument2;
  idoc.designMode := 'on';
  idoc.write(PSafeArray(System.TVarData(v).VArray));
  idoc.close;

  iCollection := idoc.all as IHTMLElementCollection;
  for i := 0 to iCollection.length-1 do
  begin
    iElement := iCollection.item( i, 0) as IHTMLElement;
    if assigned(ielement) then
      WriteLN(iElement.tagName + ': ' + iElement.outerHTML);
  end;
end;

begin
  try
    DoParse;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
  ReadLN;
end.
Program output:

HTML: <HTML><HEAD><TITLE>Articles</TITLE> <META name=GENERATOR content="MSHTML 11.00.9600.18283"></HEAD> <BODY><ARTICLE> <P>This is my Article</P></ARTICLE>undefined</BODY></HTML>
HEAD: <HEAD><TITLE>Articles</TITLE> <META name=GENERATOR content="MSHTML 11.00.9600.18283"></HEAD>
TITLE: <TITLE>Articles</TITLE>
META: <META name=GENERATOR content="MSHTML 11.00.9600.18283">
BODY: <BODY><ARTICLE> <P>This is my Article</P></ARTICLE>undefined</BODY>
ARTICLE: <ARTICLE>
P: <P>This is my Article</P>
/ARTICLE: </ARTICLE>
As you can see, the ARTICLE tag is parsed incorrectly: the ARTICLE element has no content, and /ARTICLE is reported as a separate element.
Can someone help me understand this problem?