Get rendered text from HTML (Delphi)

I have HTML and I need to extract the actual written text from the page.

So far, I have been using a web browser control to display the page and then reading the text from its document property. This works, but only where the browser (the IE COM object) is supported. I also want this to run under Wine, so I need a solution that does not use IE COM.

There must be a reasonable programmatic way to do this.

+4
3 answers

I'm not sure what the recommended way to parse HTML in Delphi is, but if it were me, I would just grab a copy of html2text (either the older C++ program of that name or the newer Python program) and call it from my application.

You can turn the Python html2text into an executable using py2exe. Both html2text programs are licensed under the GPL, but as long as you merely bundle the executable with your application and make its source available in accordance with the terms of the GPL, you should be fine.

+4

Instead of using TWebBrowser, you can use TIdHTTP directly and call its Get method.
It returns the HTML as a string.
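
A minimal sketch of that approach, assuming the Indy components (the IdHTTP unit) are available in your Delphi installation. FetchHtml and the example URL are hypothetical names used only for illustration. Note that this only downloads the raw markup; the tags still have to be stripped afterwards.

```pascal
program FetchHtmlDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, IdHTTP;

// Hypothetical helper: downloads a page and returns its raw HTML.
function FetchHtml(const Url: string): string;
var
  Http: TIdHTTP;
begin
  Http := TIdHTTP.Create(nil);
  try
    // Plain HTTP request; HTTPS typically needs an SSL IOHandler as well.
    Result := Http.Get(Url);
  finally
    Http.Free;
  end;
end;

begin
  try
    WriteLn(FetchHtml('http://example.com/'));
  except
    on E: Exception do
      WriteLn('Download failed: ', E.Message);
  end;
end.
```

Since this does not touch the IE COM object, it should not depend on a browser being present.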

+1

Here's a very simple function, copied from Scalabium:

function StripHTMLTags(const strHTML: string): string;
var
  P: PChar;
  InTag: Boolean;
begin
  P := PChar(strHTML);
  Result := '';
  InTag := False;
  { while..do rather than repeat..until, so an empty string is handled safely }
  while P^ <> #0 do
  begin
    case P^ of
      '<': InTag := True;
      '>': InTag := False;
      #13, #10: ; {do nothing}
    else
      if not InTag then
      begin
        { skip a space or tab that is followed by more whitespace or a tag }
        if not ((P^ in [#9, #32]) and
                ((P + 1)^ in [#10, #13, #32, #9, '<'])) then
          Result := Result + P^;
      end;
    end;
    Inc(P);
  end;
  {convert system characters}
  Result := StringReplace(Result, '&quot;', '"', [rfReplaceAll]);
  Result := StringReplace(Result, '&apos;', '''', [rfReplaceAll]);
  Result := StringReplace(Result, '&gt;', '>', [rfReplaceAll]);
  Result := StringReplace(Result, '&lt;', '<', [rfReplaceAll]);
  { &amp; is decoded last so that '&amp;lt;' becomes '&lt;' rather than '<' }
  Result := StringReplace(Result, '&amp;', '&', [rfReplaceAll]);
  {here you may add another symbols from RFC if you need}
end;

You can then easily adapt this to do exactly what you want.
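
As one possible adaptation, here is a rough sketch that also decodes decimal character references such as &#39;. It is not part of the Scalabium code: DecodeNumericEntities is a hypothetical helper, and it assumes a Unicode Delphi (2009 or later) where Char can hold code points above 255. Hexadecimal references (&#x27;) are not handled.

```pascal
program NumericEntitiesDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: replaces decimal references such as &#65; with
// the character they name. Malformed references are copied unchanged.
function DecodeNumericEntities(const S: string): string;
var
  I, J, Code: Integer;
begin
  Result := '';
  I := 1;
  while I <= Length(S) do
  begin
    if (S[I] = '&') and (I < Length(S)) and (S[I + 1] = '#') then
    begin
      // Scan the run of digits that follows '&#'
      J := I + 2;
      while (J <= Length(S)) and (S[J] >= '0') and (S[J] <= '9') do
        Inc(J);
      // Accept only 1 to 5 digits terminated by ';'
      if (J > I + 2) and (J - (I + 2) <= 5) and
         (J <= Length(S)) and (S[J] = ';') then
      begin
        Code := StrToInt(Copy(S, I + 2, J - (I + 2)));
        // Assumption: decode only codes from space up to just below
        // the surrogate range; everything else is left as written.
        if (Code >= 32) and (Code <= $D7FF) then
        begin
          Result := Result + Char(Code);
          I := J + 1;
          Continue;
        end;
      end;
    end;
    // Ordinary character, or a reference that was not decoded
    Result := Result + S[I];
    Inc(I);
  end;
end;

begin
  WriteLn(DecodeNumericEntities('It&#39;s 5 &#62; 3')); // prints: It's 5 > 3
end.
```

If you combine this with StripHTMLTags, it is probably safest to call it before the &amp; replacement, so that text like &amp;#39; is not decoded twice.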

+1

Source: https://habr.com/ru/post/1312264/

