I have a file watcher that grabs content from a growing file encoded in UTF-16LE. The first chunk of data written to the file carries a BOM, which I was using to tell it apart from UTF-8 (the encoding MOST of my files come in as). When I catch the BOM I re-encode to UTF-8 so my parser doesn't choke. The problem is that, because the file keeps growing, not every chunk of data has a BOM in it.
So my question is: without prepending BOM bytes to every chunk of data (I have no control over the source), can I just look for the null bytes that are inherent in UTF-16 (\000) and use them as my identifier instead of the BOM? I've sketched what I mean right after my current code below. Will that bite me down the road?
My architecture is a Ruby web application that logs the received data to a temporary file, which my parser, written in Java, then picks up.
Right now my identification / re-encoding code looks like this:
try {
    FileInputStream fis = new FileInputStream(args[args.length - 1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);

    // 0xFF 0xFE is the UTF-16LE BOM; if it's there, decode and rewrite the file as UTF-8
    if ((contents[0] == (byte) 0xFF) && (contents[1] == (byte) 0xFE)) {
        String asString = new String(contents, "UTF-16");
        byte[] newBytes = asString.getBytes("UTF-8");
        FileOutputStream fos = new FileOutputStream(args[args.length - 1]);
        fos.write(newBytes);
        fos.close();
    }
    fis.close();
} catch (Exception e) {
    e.printStackTrace();
}
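The null-byte check I'm considering, pulled out on its own, would look something like this (just a sketch; the class name, the 10-byte window, and the threshold of 2 nulls are arbitrary choices on my part):

import java.nio.charset.StandardCharsets;

public class Utf16Sniffer {
    // ASCII-range characters encoded as UTF-16LE carry a 0x00 high byte,
    // so a couple of nulls near the start of a chunk is a strong hint.
    static boolean looksLikeUtf16le(byte[] head) {
        int nulls = 0;
        int limit = Math.min(head.length, 10);   // only peek at the first few bytes
        for (int i = 0; i < limit; i++) {
            if (head[i] == (byte) 0x00) {
                nulls++;
            }
        }
        return nulls >= 2;                       // UTF-8 text normally has no null bytes at all
    }

    public static void main(String[] args) {
        byte[] utf16 = "abc".getBytes(StandardCharsets.UTF_16LE); // 61 00 62 00 63 00
        byte[] utf8  = "abc".getBytes(StandardCharsets.UTF_8);    // 61 62 63
        System.out.println(looksLikeUtf16le(utf16)); // true
        System.out.println(looksLikeUtf16le(utf8));  // false
    }
}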
UPDATE
I want to support characters like the euro sign, the em dash, and so on. I modified the code above to the following, and all my tests with those characters now pass:
try {
    FileInputStream fis = new FileInputStream(args[args.length - 1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);

    byte[] real = null;
    int found = 0;

    if ((contents[0] == (byte) 0xFF) && (contents[1] == (byte) 0xFE)) {
        // BOM already present, so use the bytes as they are
        found = 3;
        real = contents;
    } else {
        // No BOM: count null bytes near the start (UTF-16LE gives ASCII
        // characters a 0x00 high byte), then prepend a BOM ourselves
        for (int cnt = 0; cnt < Math.min(10, contents.length); cnt++) {
            if (contents[cnt] == (byte) 0x00) { found++; }
        }
        real = new byte[contents.length + 2];
        real[0] = (byte) 0xFF;
        real[1] = (byte) 0xFE;
        for (int ib = 2; ib < real.length; ib++) {
            real[ib] = contents[ib - 2];
        }
    }

    if (found >= 2) {
        // Looks like UTF-16, so decode it and rewrite the file as UTF-8
        String asString = new String(real, "UTF-16");
        byte[] newBytes = asString.getBytes("UTF-8");
        FileOutputStream fos = new FileOutputStream(args[args.length - 1]);
        fos.write(newBytes);
        fos.close();
    }
    fis.close();
} catch (Exception e) {
    e.printStackTrace();
}
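For reference, a throwaway snippet like this is enough to produce the kind of BOM-less UTF-16LE chunk the code above has to handle (the class name, file name, and sample text are made up):

import java.io.FileOutputStream;

public class MakeTestChunk {
    public static void main(String[] args) throws Exception {
        // A chunk the watcher might see mid-file: little-endian, no BOM,
        // containing a euro sign and an em dash.
        String sample = "price: 100\u20ac \u2014 done";
        FileOutputStream fos = new FileOutputStream("test-chunk.log");
        fos.write(sample.getBytes("UTF-16LE"));
        fos.close();
    }
}

(In UTF-8 the euro sign is E2 82 AC and the em dash is E2 80 94, so the converted output is easy to spot-check in a hex editor.)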
What do you all think?