How do I guess a file's encoding when it has no BOM in .NET?

I am using the StreamReader class in .NET as follows:

using (StreamReader reader = new StreamReader(@"c:\somefile.html", true)) { string filetext = reader.ReadToEnd(); }

This works fine when the file has a BOM (byte order mark). I ran into a problem with a file that has no BOM: I mostly got gibberish. When I specified Encoding.Unicode explicitly, it worked fine, for example:

 using (StreamReader reader = new StreamReader(@"c:\somefile.html", Encoding.Unicode, false)) { string filetext = reader.ReadToEnd(); }

I need to get the contents of the file into a string, so how do people usually deal with this? I know there is no solution that will work in 100% of cases, but I would like to improve my odds. Clearly there is software that tries to guess (for example Notepad, web browsers, and so on). Is there anything in the .NET Framework that will make the guess for me? Does anyone have code they would like to share?

More background: this question is pretty much the same as mine, but I am in .NET. That question led me to a blog post listing various encoding-detection libraries, but none of them are in .NET.

+5
c# encoding unicode character-encoding
8 answers

Library: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx

fooobar.com/questions/844651/...

+5

You should read this article by Raymond Chen. He goes into detail about how programs can guess at an encoding (and some of the fun that comes from guessing wrong):

http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx

+3

I have had good luck with Ude, a C# port of the Mozilla Universal Charset Detector.
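
As an illustration only (the exact API may vary between versions of the port), the detector is fed raw bytes and then asked for its best guess; a sketch along those lines:

    using System;
    using System.IO;
    using Ude; // C# port of the Mozilla Universal Charset Detector

    class Program
    {
        static void Main()
        {
            // Feed the raw bytes to the detector, then read back its guess.
            byte[] bytes = File.ReadAllBytes(@"c:\somefile.html");

            CharsetDetector detector = new CharsetDetector();
            detector.Feed(bytes, 0, bytes.Length);
            detector.DataEnd();

            if (detector.Charset != null)
                Console.WriteLine("Detected {0} with confidence {1}",
                    detector.Charset, detector.Confidence);
            else
                Console.WriteLine("Detection failed.");
        }
    }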

+1

UTF-8 is designed in such a way that it is very unlikely that text encoded in an arbitrary 8-bit encoding, such as Latin-1, will also decode as valid Unicode under UTF-8.

So the minimal approach is this (pseudo code, since I don't know .NET):

try:
    u = some_text.decode("utf-8")
except UnicodeDecodeError:
    u = some_text.decode("most-probable-encoding")

As the most probable encoding, people usually use Latin-1 or cp1252 or whatever is common for their users. More sophisticated approaches may try language-specific character statistics, but I don't know of a library that offers this out of the box.
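
One possible C# rendering of this idea (a sketch only; it assumes Windows-1252 as the fallback and reads the whole file into memory):

    using System.IO;
    using System.Text;

    static class EncodingGuesser
    {
        // Try strict UTF-8 first; if the bytes are not valid UTF-8,
        // fall back to a single-byte encoding (Windows-1252 assumed here).
        public static string ReadTextGuessing(string path)
        {
            byte[] bytes = File.ReadAllBytes(path);

            // throwOnInvalidBytes: true makes GetString throw on malformed UTF-8.
            Encoding strictUtf8 = new UTF8Encoding(
                encoderShouldEmitUTF8Identifier: false,
                throwOnInvalidBytes: true);

            try
            {
                return strictUtf8.GetString(bytes);
            }
            catch (DecoderFallbackException)
            {
                // Not valid UTF-8, so assume the "most probable" legacy encoding.
                return Encoding.GetEncoding(1252).GetString(bytes);
            }
        }
    }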

0

I used this to do something similar a while ago:

http://www.conceptdevelopment.net/Localization/NCharDet/

0

Use the Win32 function IsTextUnicode.

In a general sense, this is a difficult question. See: http://blogs.msdn.com/oldnewthing/archive/2007/04/17/2158334.aspx .
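
A minimal P/Invoke sketch of calling IsTextUnicode from C# (assuming the whole file fits in memory; passing NULL for the result pointer asks the function to run all of its built-in tests):

    using System;
    using System.IO;
    using System.Runtime.InteropServices;

    static class TextUnicodeCheck
    {
        // IsTextUnicode lives in Advapi32.dll; it applies a set of statistical
        // tests and returns true if the buffer is likely UTF-16 text.
        [DllImport("Advapi32.dll")]
        private static extern bool IsTextUnicode(byte[] buffer, int count, IntPtr result);

        public static bool LooksLikeUtf16(string path)
        {
            byte[] bytes = File.ReadAllBytes(path);
            return IsTextUnicode(bytes, bytes.Length, IntPtr.Zero);
        }
    }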

0

A hacky technique could be to take an MD5 hash of the text, then decode the text with a candidate encoding, re-encode it, and MD5 the result for each candidate. If the hashes match, you assume it is that encoding.

This is clearly too slow to be practical over many files, but for something like a text editor I could see it working.
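
A rough sketch of that round-trip idea (illustrative only; the class and method names are made up for the example):

    using System;
    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;

    static class RoundTripGuesser
    {
        // Decode the raw bytes with each candidate encoding, re-encode,
        // and compare MD5 hashes; a candidate that round-trips to the
        // identical bytes is assumed to be the file's encoding.
        public static Encoding Guess(byte[] original, params Encoding[] candidates)
        {
            using (MD5 md5 = MD5.Create())
            {
                byte[] originalHash = md5.ComputeHash(original);
                foreach (Encoding candidate in candidates)
                {
                    string decoded = candidate.GetString(original);
                    byte[] reencoded = candidate.GetBytes(decoded);
                    if (md5.ComputeHash(reencoded).SequenceEqual(originalHash))
                        return candidate;
                }
            }
            return null; // nothing round-tripped cleanly
        }
    }

Called with, say, Encoding.UTF8 and Encoding.GetEncoding(1252) as candidates, it returns the first encoding whose round trip preserves the original bytes.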

Other than that, your options are getting your hands dirty porting the Java libraries from this post (which came from a Delphi SO question), or using the IE MLang functionality.

0

See my (recent) answer to this, as far as I can tell equivalent, question: How to determine the encoding/codepage of a text file

It does NOT try to guess among a number of possible "national" encodings the way MLang and NCharDet do; instead it assumes you know which non-Unicode encodings you are likely to encounter. As far as I can tell from your question, it should solve your problem fairly reliably (without relying on the MLang black box).

0
