Refactoring automatic file encoding detection

Question

Refactoring automatic file encoding detection

I need to check encoding files. This code works, but it's a little long. How can I reorganize this logic. Maybe you can use some other option for this purpose?

The code:

class CharsetDetector implements Checker { Charset detectCharset(File currentFile, String[] charsets) { Charset charset = null; for (String charsetName : charsets) { charset = detectCharset(currentFile, Charset.forName(charsetName)); if (charset != null) { break; } } return charset; } private Charset detectCharset(File currentFile, Charset charset) { try { BufferedInputStream input = new BufferedInputStream( new FileInputStream(currentFile)); CharsetDecoder decoder = charset.newDecoder(); decoder.reset(); byte[] buffer = new byte[512]; boolean identified = false; while ((input.read(buffer) != -1) && (!identified)) { identified = identify(buffer, decoder); } input.close(); if (identified) { return charset; } else { return null; } } catch (Exception e) { return null; } } private boolean identify(byte[] bytes, CharsetDecoder decoder) { try { decoder.decode(ByteBuffer.wrap(bytes)); } catch (CharacterCodingException e) { return false; } return true; } @Override public boolean check(File fileChack) { if (charsetDetector(fileChack)) { return true; } return false; } private boolean charsetDetector(File currentFile) { String[] charsetsToBeTested = { "UTF-8", "windows-1253", "ISO-8859-7" }; CharsetDetector charsetDetector = new CharsetDetector(); Charset charset = charsetDetector.detectCharset(currentFile, charsetsToBeTested); if (charset != null) { try { InputStreamReader reader = new InputStreamReader( new FileInputStream(currentFile), charset); @SuppressWarnings("unused") int valueReaders = 0; while ((valueReaders = reader.read()) != -1) { return true; } reader.close(); } catch (FileNotFoundException exc) { System.out.println("File not found!"); exc.printStackTrace(); } catch (IOException exc) { exc.printStackTrace(); } } else { System.out.println("Unrecognized charset."); return false; } return true; } }

Question:

How is this program logic refactor?
What other encoding detection methods (like UTF-16 , etc.)?

+8

java encoding refactoring

nazar_art Mar 01 '13 at 9:38

source share

2 answers

As already indicated, you cannot “know” or “detect” the encoding of the file. Full accuracy requires you to be told, since there is almost always a sequence of bytes that is ambiguous with respect to multiple character encodings.

You will find some more discussions about finding UTF-8 and ISO8859-1 in this SO question . . The most important answer is to check the sequence of bytes in the file in order to check its compatibility with the expected encoding. For UTF-8 byte encoding rules, see http://en.wikipedia.org/wiki/UTF-8 .

In particular, there is a very interesting article on detecting encodings / character sets http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html They claim that they have extremely high accuracy (guesses!). Price is a very sophisticated detection system in which there is knowledge of symbol frequencies in different languages that do not correspond to 30 OP lines, designated as the correct code size. Apparently, the detection algorithm is built into Mozilla, so you can find and extract it.

We settled on a much simpler scheme: a) believe that you were told that there is a character set if you are told b) if not, check the specification and believe that it says if it is present, otherwise they smell for clean 7 bit ascii, then utf8 or iso8859 in that order. You can create an ugly procedure that does this in one go through the file.

(I think that over time the situation will worsen. Unicode has a new edition every year, with really insignificant differences in actual code points. To do this correctly, you need to check each code point for validity. "Lucky, they are all backward compatible. )

[EDIT: OP seems to have problems encoding this in Java. Our solution and the thumbnail on the other page are not Java encoded, so I cannot directly copy and paste the response. I am going to create a version of Java here based on its code; It does not compile or test. YMMV]

 int UTF8size(byte[] buffer, int buf_index) // Java-version of character-sniffing test on other page // This only checks for UTF8 compatible bit-pattern layout // A tighter test (what we actually did) would check for valid UTF-8 code points { int first_character=buffer[buf_index]; // This first character test might be faster as a switch statement if ((first_character & 0x80) == 0) return 1; // ASCII subset character, fast path else ((first_character & 0xF8) == 0xF0) { // start of 4-byte sequence if (buf_index+3>=buffer.length) return 0; if (((buffer[buf_index + 1] & 0xC0) == 0x80) && ((buffer[buf_index + 2] & 0xC0) == 0x80) && ((buffer[buf_index + 3] & 0xC0) == 0x80)) return 4; } else if ((first_character & 0xF0) == 0xE0) { // start of 3-byte sequence if (buf_index+2>=buffer.length) return 0; if (((buffer[buf_index + 1] & 0xC0) == 0x80) && ((buffer[buf_index + 2] & 0xC0) == 0x80)) return 3; } else if ((first_character & 0xE0) == 0xC0) { // start of 2-byte sequence if (buf_index+1>=buffer.length) return 0; if ((buffer[buf_index + 1] & 0xC0) == 0x80) return 2; } return 0; } public static boolean isUTF8 ( File file ) { int file_size; if (null == file) { throw new IllegalArgumentException ("input file can't be null"); } if (file.isDirectory ()) { throw new IllegalArgumentException ("input file refers to a directory"); } file_size=file.size(); // read input file byte [] buffer = new byte[file_size]; try { FileInputStream fis = new FileInputStream ( input ) ; fis.read ( buffer ) ; fis.close (); } catch ( IOException e ) { throw new IllegalArgumentException ("Can't read input file, error = " + e.getLocalizedMessage () ); } { int buf_index=0; int step; while (buf_index<file_size) { step=UTF8size(buffer,buf_index); if (step==0) return false; // definitely not UTF-8 file buf_index+=step; } } return true ; // appears to be UTF-8 file }

+3

Ira Baxter Mar 03 '13 at 17:16

source share

radai · Accepted Answer · 2013-03-01T09:45:35+0000

the best way to refactor this code is to bring in a third-party library that will recognize characters, because they probably do it better, and it will make your code smaller. see this question for several alternatives

Refactoring automatic file encoding detection

More articles: