Strange Unicode characters when reading in a file in node.js application

I am trying to write a node application that reads in a set of files, breaks them into lines and puts the lines in an array. Pretty simple. It works with a large number of files, except for some SQL files that I work with. For some reason, I seem to get some kind of unicode output when I split the lines. The application looks something like this:

    var fs = require("fs");

    var data = fs.readFileSync("test.sql", "utf8");
    console.log(data);

    var lines = data.split("\n");
    console.log(lines);

The input file looks something like this:

    use whatever
    go

The result is as follows:

      use whatever
    go
    [ '  u\u0000s\u0000e\u0000 \u0000w\u0000h\u0000a\u0000t\u0000e\u0000v\u0000e\u0000r\u0000',
      '\u0000g\u0000o\u0000',
      '\u0000' ]

As you can see, there is some unrecognized character at the beginning of the file. When I print the data directly after reading it, it looks fine except for that character. However, once I split it into lines, all of these escape sequences show up. Basically, the actual characters are all there, but each one is accompanied by a "\u0000".

I have no idea what is going on here, but it looks like it has something to do with the characters in the file itself. If I copy and paste the text of the file into a new file and run the application on the new file, it works fine. I assume that whatever causes this problem is removed during the copy and paste process.

3 answers

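Your file is encoded in UTF-16, not UTF-8. The "\u0000" after every character in your output is the high byte of each two-byte code unit, which points to little-endian byte order, and the unrecognized characters at the start are the byte order mark (BOM).

    var data = fs.readFileSync("test.sql", "utf16le"); // decodes the text, but leaves the BOM in as "\uFEFF"

Node.js only supports the little-endian form natively, as "utf16le" (the documentation is not explicit about how this differs from plain UTF-16; the main difference is that UTF-16LE does not use a BOM to signal byte order). If the file turns out to be big-endian, you have to use iconv or convert the file to UTF-8 some other way.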
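To make the little-endian case concrete, here is a minimal sketch that decodes the bytes as "utf16le", drops the leading BOM, and splits into lines. The helper names decodeUtf16le and splitLines are made up for illustration, and the in-memory sample buffer is only a stand-in for fs.readFileSync("test.sql") called without an encoding argument.

```javascript
// Hypothetical helper: decode UTF-16LE bytes and remove a leading BOM.
function decodeUtf16le(buffer) {
  let text = buffer.toString("utf16le");
  // Node keeps the BOM in the decoded string as U+FEFF; drop it if present.
  if (text.charCodeAt(0) === 0xfeff) {
    text = text.slice(1);
  }
  return text;
}

// Hypothetical helper: split on "\n" or "\r\n" line endings.
function splitLines(text) {
  return text.split(/\r?\n/);
}

// Stand-in for fs.readFileSync("test.sql"):
// BOM bytes (FF FE) followed by "use whatever\ngo\n" in UTF-16LE.
const sample = Buffer.concat([
  Buffer.from([0xff, 0xfe]),
  Buffer.from("use whatever\ngo\n", "utf16le"),
]);

const lines = splitLines(decodeUtf16le(sample));
console.log(lines);
```

The last array element is an empty string because the sample ends with a newline; filter it out if you do not want it.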

Example:

    var Iconv = require("iconv").Iconv;
    var fs = require("fs");

    // Read the raw bytes (no encoding argument), then convert UTF-16 to UTF-8.
    var buffer = fs.readFileSync("test.sql");
    var iconv = new Iconv("UTF-16", "UTF-8");
    var result = iconv.convert(buffer).toString("utf8");
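If you would rather not add a dependency, a rough alternative is to look at the BOM yourself and, for big-endian input, swap the bytes before decoding as "utf16le". This is only a sketch, not part of the answer above: the function names are hypothetical, it assumes the file actually starts with a BOM, and it falls back to UTF-8 when there is none.

```javascript
// Remove a leading U+FEFF left over from the BOM, if any.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Hypothetical helper: pick a decoder based on the first bytes of the buffer.
function decodeByBom(buffer) {
  if (buffer.length >= 2 && buffer[0] === 0xff && buffer[1] === 0xfe) {
    // FF FE: UTF-16 little-endian, which Node decodes natively.
    return stripBom(buffer.toString("utf16le"));
  }
  if (buffer.length >= 2 && buffer[0] === 0xfe && buffer[1] === 0xff) {
    // FE FF: UTF-16 big-endian. swap16() swaps bytes in place, so work on a copy.
    // It throws if the buffer length is not a multiple of 2.
    const copy = Buffer.from(buffer);
    copy.swap16();
    return stripBom(copy.toString("utf16le"));
  }
  // No UTF-16 BOM: assume UTF-8 (this also drops a UTF-8 BOM, EF BB BF).
  return stripBom(buffer.toString("utf8"));
}

// In-memory samples standing in for file contents; each encodes "go".
const littleEndian = Buffer.from([0xff, 0xfe, 0x67, 0x00, 0x6f, 0x00]);
const bigEndian = Buffer.from([0xfe, 0xff, 0x00, 0x67, 0x00, 0x6f]);

console.log(decodeByBom(littleEndian), decodeByBom(bigEndian));
```

For a real file you would pass fs.readFileSync("test.sql") with no encoding argument, so that you get a Buffer rather than an already-decoded string.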

Perhaps this is a BOM (byte order mark)? Be sure to save your files without a BOM, or include code to remove it.

BOMs are usually invisible in text editors.

Notepad++ lets you remove the BOM from a file easily: Encoding > Encode in UTF-8 without BOM.
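As for "code to remove the BOM", something along these lines should do for a file that really is UTF-8. It is a sketch: stripBom is a hypothetical helper name, the sample bytes stand in for a file's contents, and the second variant assumes a Node version where TextDecoder is available as a global.

```javascript
// Hypothetical helper: drop a leading U+FEFF, which is how the BOM
// shows up once the bytes have been decoded into a string.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Stand-in for a file's bytes: UTF-8 BOM (EF BB BF) followed by "go".
const withBom = Buffer.from([0xef, 0xbb, 0xbf, 0x67, 0x6f]);

// Variant 1: decode with Node's built-in "utf8" and strip the BOM by hand.
const manual = stripBom(withBom.toString("utf8"));

// Variant 2: TextDecoder removes the BOM by default (ignoreBOM is false).
const viaDecoder = new TextDecoder("utf-8").decode(withBom);

console.log(manual, viaDecoder);
```

Note that this only helps with a UTF-8 file; it will not fix a UTF-16 file that is being read as UTF-8.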


Use iconv-lite, the lightweight alternative to iconv:

    var fs = require("fs");
    var iconv = require("iconv-lite");

    var result = "";
    var stream = fs.createReadStream(sourcefile) // sourcefile: path to the input file
        .on("error", function (err) {
            // handle read error
        })
        .pipe(iconv.decodeStream("win1251")) // win1251 is an example; pass your file's actual encoding
        .on("error", function (err) {
            // handle decode error
        })
        .on("data", function (data) {
            result += data;
        })
        .on("end", function () {
            // use result
        });