How do I read a text file that has a strange encoding?

I have a text file with a strange encoding, "UCS-2 Little Endian", and I want to read its contents using Java.

[Screenshot: the text file opened in Notepad++]

As you can see from the screenshot above, the contents of the file are displayed fine in Notepad++, but when I read it using this code, only garbage is printed to the console:

    String filePath = "c:\\strange_file_encoding.txt";
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(filePath), "UTF8"));
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line); // Prints garbage characters
    }

The catch is that the user selects the file to read, so it can be in any encoding. Since I cannot determine the file's encoding, I decode it as "UTF8", but, as the example above shows, that does not read it correctly.

Is it possible to read such strange files correctly? Or, at least, can I determine whether my code has read a file correctly?

+4
3 answers

You are passing UTF-8 as the encoding to the InputStreamReader constructor, so it tries to interpret the bytes as UTF-8 instead of UCS-2 Little Endian. Here is the documentation: Charset

According to it, I suppose you need to use UTF-16LE.
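A minimal sketch of that suggestion, assuming the file really is UTF-16LE. The class name Utf16LeReader and the method readLines are hypothetical. One detail to be aware of: Java's UTF-16LE decoder does not strip a leading byte-order mark, so if the editor wrote one, it shows up as U+FEFF at the start of the first line and is removed by hand here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class Utf16LeReader {

    /** Reads all lines of a file, decoding its bytes as UTF-16LE. */
    public static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(path, StandardCharsets.UTF_16LE)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // The UTF-16LE decoder keeps a leading BOM as U+FEFF,
                // so drop it from the first line if it is present.
                if (lines.isEmpty() && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // The path comes from the command line; nothing is hard-coded.
        for (String line : readLines(Paths.get(args[0]))) {
            System.out.println(line);
        }
    }
}
```

Whether the console then shows the characters correctly also depends on the console's own encoding, which is a separate issue from decoding the file.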

Learn more about supported character sets and their Java names: Supported Encodings

+5

You cannot use UTF-8 for every file, especially if you do not know which encoding to expect. Use a library that can detect a file's encoding before you read it, for example juniversalchardet or jChardet.

For more details, see: Java: How to determine the correct charset encoding of a stream
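As a dependency-free illustration (not a replacement for the detection libraries above, which also use statistical guessing), the sketch below checks for a byte-order mark and tests whether bytes decode without errors in a candidate charset. The class EncodingGuess and both method names are hypothetical, and only the UTF-8 and UTF-16 BOMs are handled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {

    /** Returns the charset implied by a leading BOM, or null if there is none. */
    public static Charset charsetFromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // Note: a UTF-32LE BOM also starts with FF FE; UTF-32 is ignored here.
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    /** Returns true if the bytes decode in the charset without any errors. */
    public static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A clean decode only shows that the bytes are valid in that charset, not that the charset is the right one: ISO-8859-1, for instance, accepts every byte sequence. A failed strict decode, on the other hand, is a reliable sign that the chosen encoding is wrong.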

+1

You are passing the wrong encoding to InputStreamReader. Have you tried using UTF-16LE instead of UTF8?

    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(filePath), "UTF-16LE"));

According to Charset:

UTF-16LE: Sixteen-bit UCS Transformation Format, little-endian byte order

0
