How to identify different encodings without using a BOM?

I have a file watcher that grabs content from a growing file encoded in UTF-16LE. The first chunk of data written to the file has a BOM on it - I used this to identify the encoding and convert to UTF-8 (which is the encoding most of my files are in). I strip the BOM and re-encode to UTF-8 so my parser doesn't have to worry about it. The problem is that, since the file is growing, not every chunk of data that arrives has a BOM in it.

So my question is - without adding BOM bytes to each data set I receive (because I have no control over the source), can I just look for the null bytes that are inherent in UTF-16 (\000) and use those as my identifier instead of the BOM? Will this bite me somewhere down the road?
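To make the null-byte idea concrete, here is a rough sketch of the heuristic I have in mind (class and method names are placeholders, not production code): for mostly Latin text, UTF-16LE puts the 0x00 in the high (odd-indexed) byte of each code unit, so the check can key on byte position rather than counting nulls anywhere.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Sniffer {
    // Heuristic: in UTF-16LE, ASCII-range characters carry a 0x00 high
    // byte at every odd index, so sample the odd-indexed bytes and count
    // how many are null. This will misfire on text with few ASCII-range
    // characters (e.g. CJK), so treat it as a guess, not a guarantee.
    public static boolean looksLikeUtf16LE(byte[] sample) {
        if (sample.length < 4) {
            return false; // too little data to make a call
        }
        int nulls = 0;
        int checked = 0;
        for (int i = 1; i < Math.min(sample.length, 64); i += 2) {
            checked++;
            if (sample[i] == 0x00) {
                nulls++;
            }
        }
        return nulls * 2 >= checked; // at least half the high bytes are null
    }

    public static void main(String[] args) {
        byte[] utf16 = "hello world".getBytes(StandardCharsets.UTF_16LE);
        byte[] utf8 = "hello world".getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeUtf16LE(utf16)); // true
        System.out.println(looksLikeUtf16LE(utf8));  // false
    }
}
```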

My architecture consists of a Ruby web application that logs the received data to a temporary file, which my parser, written in Java, then picks up.

Note that my identification/re-encoding code currently looks like this:

  // guess the encoding: if it is UTF-16 (BOM present),
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length - 1]);
    byte[] contents = new byte[fis.available()];
    new DataInputStream(fis).readFully(contents); // read() alone may return fewer bytes

    if ((contents[0] == (byte) 0xFF) && (contents[1] == (byte) 0xFE)) {
      String asString = new String(contents, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF-8");
      FileOutputStream fos = new FileOutputStream(args[args.length - 1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
  } catch (Exception e) {
    e.printStackTrace();
  }

UPDATE

I want to support characters like the euro sign, the em dash and so on. I modified the code above to look like this, and all of my tests for those characters now pass:

  // guess the encoding: if it is UTF-16 then
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length - 1]);
    byte[] contents = new byte[fis.available()];
    new DataInputStream(fis).readFully(contents);
    byte[] real = null;

    int found = 0;

    // if we found a BOM then skip the sniffing... we just need to convert
    if ((contents[0] == (byte) 0xFF) && (contents[1] == (byte) 0xFE)) {
      found = 3;
      real = contents;

    // no BOM detected, but it could still be UTF-16
    } else {

      // count null bytes in the first few bytes of the chunk
      for (int cnt = 0; cnt < 10 && cnt < contents.length; cnt++) {
        if (contents[cnt] == (byte) 0x00) { found++; }
      }

      // tack a BOM onto the front and copy the data in after it
      real = new byte[contents.length + 2];
      real[0] = (byte) 0xFF;
      real[1] = (byte) 0xFE;
      for (int ib = 2; ib < real.length; ib++) {
        real[ib] = contents[ib - 2];
      }
    }

    if (found >= 2) {
      String asString = new String(real, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF-8");
      FileOutputStream fos = new FileOutputStream(args[args.length - 1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
  } catch (Exception e) {
    e.printStackTrace();
  }

What do you all think?


EDIT: Since the data arrives over HTTP, look at the "content-type" header of the POST request; it may carry a charset parameter that you could record alongside the data and use instead of guessing.




This question lists several options for charset detection that don't appear to require a BOM.

My project currently uses jCharDet, but I may have to look at some of the other options listed there, since jCharDet is not 100% reliable.
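One way to hedge against a wrong guess from a statistical detector is to validate the candidate encoding with a strict decoder from the standard library, which reports malformed input instead of silently substituting U+FFFD. A passing decode doesn't prove the encoding, but a failure rules it out. A sketch (names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Returns true only if the bytes form a valid sequence in the given
    // charset; a REPORTing decoder throws on malformed input instead of
    // quietly inserting replacement characters.
    public static boolean decodesAs(byte[] data, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "\u20AC" is the euro sign; its UTF-16LE bytes (AC 20 ...) are
        // not a valid UTF-8 sequence, so the UTF-8 check fails.
        byte[] utf16le = "\u20AC euro".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(decodesAs(utf16le, StandardCharsets.UTF_8));    // false
        System.out.println(decodesAs(utf16le, StandardCharsets.UTF_16LE)); // true
    }
}
```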
