As already indicated, you cannot "know" or "detect" the encoding of a file. Full accuracy requires that you be told, since there is almost always some sequence of bytes that is ambiguous with respect to multiple character encodings.
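To make that ambiguity concrete, here is a minimal sketch (my illustration, not part of the original answer) showing the same two bytes decoding cleanly, to different text, under both UTF-8 and ISO-8859-1:

import java.nio.charset.StandardCharsets;

public class AmbiguousBytes {
    public static void main(String[] args) {
        // 0xC3 0xA9 is a valid two-byte UTF-8 sequence (U+00E9)
        // and also two valid single-byte ISO-8859-1 characters.
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };
        System.out.println(new String(bytes, StandardCharsets.UTF_8));       // prints "é"
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));  // prints "Ã©"
    }
}

Nothing in the bytes themselves tells you which decoding was intended.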
You will find more discussion of distinguishing UTF-8 from ISO-8859-1 in this SO question. The gist of the best answers is to check the sequence of bytes in the file for compatibility with the expected encoding. For the UTF-8 byte encoding rules, see http://en.wikipedia.org/wiki/UTF-8 .
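As a rough sketch of that byte-compatibility check (my addition, not the answer's own code), Java's CharsetDecoder can be configured to reject malformed input rather than silently replacing it:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the bytes are well-formed UTF-8, per the JDK's own decoder.
    static boolean decodesAsUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}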
In particular, there is a very interesting paper on detecting encodings/character sets: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html . The authors claim extremely high accuracy (these are still guesses!). The price is a very sophisticated detection system that includes knowledge of character frequencies in various languages, which will not fit in the OP's stated budget of 30 lines of code. Apparently the detection algorithm is built into Mozilla, so you may be able to find and extract it.
We settled on a much simpler scheme: a) believe the character set you are told, if you are told one; b) if not, check for a BOM and believe what it says if one is present; otherwise sniff for pure 7-bit ASCII, then UTF-8, then ISO-8859-1, in that order. You can build an ugly routine that does this in a single pass over the file, as sketched below.
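A minimal sketch of that one-pass routine (hypothetical names; my illustration of the a)/b) ordering above, handling only the UTF-8 BOM):

// Hypothetical sketch; UTF8size is the routine defined further down.
static String sniffCharset(byte[] buffer, String declaredCharset) {
    if (declaredCharset != null) return declaredCharset;        // a) believe what you're told
    if (buffer.length >= 3 && (buffer[0] & 0xFF) == 0xEF        // b) UTF-8 BOM: EF BB BF
            && (buffer[1] & 0xFF) == 0xBB
            && (buffer[2] & 0xFF) == 0xBF)
        return "UTF-8";
    boolean pureAscii = true;                                   // sniff for clean 7-bit ASCII
    for (byte b : buffer)
        if ((b & 0x80) != 0) { pureAscii = false; break; }
    if (pureAscii) return "US-ASCII";
    int i = 0, step;                                            // then UTF-8, else ISO-8859-1
    while (i < buffer.length && (step = UTF8size(buffer, i)) != 0)
        i += step;
    return (i == buffer.length) ? "UTF-8" : "ISO-8859-1";
}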
(I expect the situation to get worse over time. Unicode gets a new edition every year, with genuinely subtle differences in the set of valid code points. To do this right, you have to check every code point for validity. If we're lucky, the editions are all backward compatible.)
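For that code-point validity check, a one-liner sketch (my illustration) using the JDK, which validates against whichever Unicode edition that JDK release implements:

// Checks that every code point in a decoded string is assigned
// in the Unicode version this JDK implements.
static boolean allCodePointsAssigned(String s) {
    return s.codePoints().allMatch(Character::isDefined);
}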
[EDIT: The OP seems to be having trouble coding this in Java. Our solution and the sketch on the other page are not coded in Java, so I can't copy and paste an answer directly. I'm going to write a Java version here based on his code; it has not been compiled or tested. YMMV.]
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Java version of the character-sniffing test on the other page.
// This only checks for UTF-8 compatible bit-pattern layout;
// a tighter test (what we actually did) would check for valid UTF-8 code points.
static int UTF8size(byte[] buffer, int buf_index) {
    int first_character = buffer[buf_index] & 0xFF; // mask off sign extension
    // This first-character test might be faster as a switch statement
    if ((first_character & 0x80) == 0)
        return 1; // ASCII subset character, fast path
    else if ((first_character & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (buf_index + 3 >= buffer.length)
            return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
                && ((buffer[buf_index + 2] & 0xC0) == 0x80)
                && ((buffer[buf_index + 3] & 0xC0) == 0x80))
            return 4;
    }
    else if ((first_character & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (buf_index + 2 >= buffer.length)
            return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
                && ((buffer[buf_index + 2] & 0xC0) == 0x80))
            return 3;
    }
    else if ((first_character & 0xE0) == 0xC0) { // start of 2-byte sequence
        if (buf_index + 1 >= buffer.length)
            return 0;
        if ((buffer[buf_index + 1] & 0xC0) == 0x80)
            return 2;
    }
    return 0; // not a legal UTF-8 lead byte, or continuation bytes missing
}

public static boolean isUTF8(File file) {
    if (null == file) {
        throw new IllegalArgumentException("input file can't be null");
    }
    if (file.isDirectory()) {
        throw new IllegalArgumentException("input file refers to a directory");
    }
    int file_size = (int) file.length(); // assumes the file fits in an int-sized buffer

    // read input file
    byte[] buffer = new byte[file_size];
    try (FileInputStream fis = new FileInputStream(file)) {
        int off = 0;
        while (off < file_size) { // read() may return fewer bytes than requested
            int n = fis.read(buffer, off, file_size - off);
            if (n < 0)
                break;
            off += n;
        }
    } catch (IOException e) {
        throw new IllegalArgumentException("Can't read input file, error = " + e.getLocalizedMessage());
    }

    int buf_index = 0;
    int step;
    while (buf_index < file_size) {
        step = UTF8size(buffer, buf_index);
        if (step == 0)
            return false; // definitely not a UTF-8 file
        buf_index += step;
    }
    return true; // appears to be a UTF-8 file
}
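A possible usage sketch (hypothetical file name, assuming the two methods above live in the enclosing class):

public static void main(String[] args) {
    File f = new File("input.txt"); // hypothetical path
    System.out.println(f + (isUTF8(f) ? " looks like UTF-8" : " is not valid UTF-8"));
}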
Ira Baxter