Hadoop - textoutputformat.separator: using Ctrl-A (^A)

Question


I am trying to use ^A as the delimiter between the key and the value in my reduce output files. I found that the configuration setting "mapred.textoutputformat.separator" is what I want, and it correctly switches the delimiter to ",":

conf.set("mapred.textoutputformat.separator", ",");

But it cannot handle the ^A character:

 conf.set("mapred.textoutputformat.separator", "\u0001");

causes this error:

ERROR security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 68; columnNumber: 94; Character reference "&#

I found this ticket https://issues.apache.org/jira/browse/HADOOP-7542 and I see that a fix was attempted, but it was reverted because of XML 1.1 problems.

So I am wondering whether anyone has succeeded in setting the delimiter to ^A (which seems pretty common) with a simple workaround, or whether I should just settle for the tab separator.

Thanks!

I am running Hadoop 0.20.2-cdh3u5 on CentOS 6.2

+8

control-characters hadoop separator

alexP_Keaton Nov 20 '12 at 2:35


1 answer

Binary nerd · Accepted Answer · 2012-11-20T03:56:17+0000

Looking around, I found three options for solving this problem. They come from these similar SO questions:

Character reference "&#1" is an invalid XML character
Unicode characters / Ctrl G or Ctrl A as TextOutputFormat (Hadoop) delimiter

Possible solutions, described in detail in the links above:

You can Base64 encode the delimiter character. You then need to create a custom TextOutputFormat that overrides the getRecordWriter method and decodes the Base64 encoded delimiter.
Create a custom TextOutputFormat as above, but instead of decoding anything, change its default delimiter from tab to the character you need.
Provide the delimiter through an XML resource file. You can specify a custom resource file using the addResource() method on the job configuration.
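
As a rough illustration of the first option, here is a sketch of just the encoding round trip, with no Hadoop classes involved. The class SeparatorCodec and its methods are hypothetical names made up for this example, and java.util.Base64 assumes Java 8 or later (an older JVM would need a library such as Commons Codec instead). In a real job the decode step would sit inside the custom TextOutputFormat's getRecordWriter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of Hadoop: shows why a Base64 encoded
// separator survives XML serialization while the raw ^A character does not.
class SeparatorCodec {

    // Encode the separator so the stored value contains only XML-safe characters.
    static String encode(String separator) {
        return Base64.getEncoder()
                .encodeToString(separator.getBytes(StandardCharsets.UTF_8));
    }

    // Decode the value read back from the configuration.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    // Stand-in for what a custom record writer would do per record:
    // key, then the decoded separator, then value.
    static String formatLine(String key, String value, String encodedSeparator) {
        return key + decode(encodedSeparator) + value;
    }

    public static void main(String[] args) {
        String encoded = encode("\u0001");
        System.out.println(encoded);                // prints AQ==

        String line = formatLine("key", "value", encoded);
        System.out.println(line.indexOf('\u0001')); // prints 3
    }
}
```

Under these assumptions the driver would set the property to the encoded value, for example conf.set("mapred.textoutputformat.separator", SeparatorCodec.encode("\u0001")), and only the custom output format would ever see the raw ^A.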
