HTML source code from TWebBrowser - How to determine stream encoding?

Based on this question: How to get HTML source code from TWebBrowser

If I run this code against an HTML page with a Unicode encoding, the result is gibberish, because TStringStream is not Unicode-aware in D7. The page can be encoded in UTF-8 or in some other (ANSI) code page.

How to determine if TStream / IPersistStreamInit is Unicode / UTF8 / Ansi?

How to always return the correct result as a WideString for this function?

function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString; 

If I replace TStringStream with TMemoryStream and save the TMemoryStream to a file, all is well. The content can be Unicode, UTF-8, or ANSI, but I always want to return the stream as a WideString:

    function GetWebBrowserHTML(const WebBrowser: TWebBrowser): WideString;
    var
      // LStream: TStringStream;
      LStream: TMemoryStream;
      Stream: IStream;
      LPersistStreamInit: IPersistStreamInit;
    begin
      if not Assigned(WebBrowser.Document) then exit;
      // LStream := TStringStream.Create('');
      LStream := TMemoryStream.Create;
      try
        LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
        Stream := TStreamAdapter.Create(LStream, soReference);
        LPersistStreamInit.Save(Stream, true);
        // result := LStream.DataString;
        LStream.SaveToFile('c:\test\test.txt'); // test only - file is ok
        Result := ??? // WideString
      finally
        LStream.Free();
      end;
    end;

EDIT: I found this article: How to load and save documents in TWebBrowser in a Delphi-like way

It does what I need, but it only works correctly with the Unicode Delphi compilers (D2009+). From its Conclusion section:

Obviously there is a lot more we can do. A couple of things immediately spring to mind: retro-fitting some of the Unicode functionality and support for non-ANSI encodings to the pre-Unicode compiler code. The present code, when compiled with anything earlier than Delphi 2009, will not save document content to strings correctly if the document character set is not ANSI.

The magic is obviously in the TEncoding class (TEncoding.GetBufferEncoding), but D7 does not have TEncoding. Any ideas?

1 answer

I used GpTextStream to handle conversion (should work for all versions of Delphi):

    // Units assumed: Windows, SysUtils, Classes, ActiveX, MSHTML, SHDocVw, GpTextStream

    // Maps an HTML charset name (e.g. 'utf-8', 'windows-1252', 'iso-8859-1')
    // to a Windows code page identifier. Returns 0 when the name is not recognized.
    function GetCodePageFromHTMLCharSet(Charset: WideString): Word;
    const
      WIN_CHARSET = 'windows-';
      ISO_CHARSET = 'iso-';
    var
      S: string;
      N: Integer;
    begin
      Result := 0;
      if Charset = 'unicode' then
        Result := CP_UNICODE
      else if Charset = 'utf-8' then
        Result := CP_UTF8
      else if Pos(WIN_CHARSET, Charset) <> 0 then
      begin
        S := Copy(Charset, Length(WIN_CHARSET) + 1, Maxint);
        Result := StrToIntDef(S, 0);
      end
      else if Pos(ISO_CHARSET, Charset) <> 0 then
      begin
        // ISO-8859-N maps to code page 28590 + N
        // (eg iso-8859-1 => 28591, iso-8859-15 (Latin 9) => 28605)
        S := Copy(Charset, Length(ISO_CHARSET) + 1, Maxint);
        S := Copy(S, Pos('-', S) + 1, 2);
        N := StrToIntDef(S, 0);
        // Only these parts have a 2859x/2860x code page; anything else stays 0
        // instead of overflowing Word (the old '2859' + S gave 285913 for part 13).
        if N in [1..9, 13, 15] then
          Result := 28590 + N;
      end;
    end;

    function GetWebBrowserHTML(WebBrowser: TWebBrowser): WideString;
    var
      LStream: TMemoryStream;
      Stream: IStream;
      LPersistStreamInit: IPersistStreamInit;
      TextStream: TGpTextStream;
      Charset: WideString;
      Buf: WideString;
      CodePage: Word;
      N: Integer;
    begin
      Result := '';
      if not Assigned(WebBrowser.Document) then Exit;
      LStream := TMemoryStream.Create;
      try
        // Save the raw document bytes into the memory stream
        LPersistStreamInit := WebBrowser.Document as IPersistStreamInit;
        Stream := TStreamAdapter.Create(LStream, soReference);
        if Failed(LPersistStreamInit.Save(Stream, True)) then Exit;
        // Ask the document which charset it uses and map it to a code page
        Charset := (WebBrowser.Document as IHTMLDocument2).charset;
        CodePage := GetCodePageFromHTMLCharSet(Charset);
        // The decoded text cannot have more WideChars than the stream has bytes
        N := LStream.Size;
        SetLength(Buf, N);
        TextStream := TGpTextStream.Create(LStream, tsaccRead, [], CodePage);
        try
          // Read returns a byte count; convert it to a WideChar count
          N := TextStream.Read(Buf[1], N * SizeOf(WideChar)) div SizeOf(WideChar);
          SetLength(Buf, N);
          Result := Buf;
        finally
          TextStream.Free;
        end;
      finally
        LStream.Free();
      end;
    end;
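
If adding GpTextStream to the project is not an option, the same conversion can probably be done with the Win32 MultiByteToWideChar function, which Delphi 7 exposes through the Windows unit. The following is only a sketch, not part of the original answer: the helper name StreamToWideString and the constant CP_UTF16LE are hypothetical, and the byte order mark check is my assumption about what is useful here (I have not verified what IPersistStreamInit.Save writes for every charset). Code page 1200 (UTF-16 LE) is handled by a plain copy because, as far as I know, MultiByteToWideChar does not accept it.

```delphi
uses
  Windows, SysUtils, Classes;

const
  CP_UTF16LE = 1200; // hypothetical name; 1200 is the Windows identifier for UTF-16 LE

// Hypothetical helper: decodes the raw bytes held by AStream into a WideString.
// A byte order mark, when present, takes precedence over AFallbackCodePage.
function StreamToWideString(AStream: TMemoryStream;
  AFallbackCodePage: Cardinal): WideString;
var
  Bytes: PByteArray;
  Size, Offset, Len: Integer;
  CodePage: Cardinal;
begin
  Result := '';
  Bytes := AStream.Memory;
  Size := Integer(AStream.Size);
  CodePage := AFallbackCodePage;
  Offset := 0;

  // UTF-16 LE BOM is FF FE, UTF-8 BOM is EF BB BF
  if (Size >= 2) and (Bytes[0] = $FF) and (Bytes[1] = $FE) then
  begin
    CodePage := CP_UTF16LE;
    Offset := 2;
  end
  else if (Size >= 3) and (Bytes[0] = $EF) and (Bytes[1] = $BB) and
    (Bytes[2] = $BF) then
  begin
    CodePage := CP_UTF8;
    Offset := 3;
  end;

  Dec(Size, Offset); // bytes left after skipping the BOM
  if Size <= 0 then
    Exit;

  if CodePage = CP_UTF16LE then
  begin
    // The payload is already WideChar data, so copy it as-is
    Len := Size div SizeOf(WideChar);
    SetLength(Result, Len);
    if Len > 0 then
      Move(Bytes[Offset], Result[1], Len * SizeOf(WideChar));
  end
  else
  begin
    // First call returns the required length in WideChars, second call converts
    Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(@Bytes[Offset]), Size, nil, 0);
    SetLength(Result, Len);
    if Len > 0 then
      MultiByteToWideChar(CodePage, 0, PAnsiChar(@Bytes[Offset]), Size,
        PWideChar(Result), Len);
  end;
end;
```

With this in place, the TGpTextStream block in GetWebBrowserHTML could be replaced by `Result := StreamToWideString(LStream, CodePage);`. When GetCodePageFromHTMLCharSet returns 0 for an unknown charset, that value equals CP_ACP, so the bytes are decoded with the system ANSI code page.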