Hadoop - textoutputformat.separator: using Ctrl-A (^A)

I am trying to use ^A as the delimiter between the key and the value in my reducer output files. I found that the configuration setting "mapred.textoutputformat.separator" is what I want, and it correctly switches the delimiter to ",":

conf.set("mapred.textoutputformat.separator", ","); 

But it cannot handle the ^A character:

conf.set("mapred.textoutputformat.separator", "\u0001");

causes this error:

ERROR security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 68; columnNumber: 94; Character reference "&#

I found this ticket, https://issues.apache.org/jira/browse/HADOOP-7542, and I see that a fix was attempted but then reverted because of XML 1.1 problems.
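
The parse failure can be reproduced without Hadoop. The sketch below assumes the job configuration is written out as an XML 1.0 file in which "\u0001" ends up as the character reference &#1; (that is an assumption, since the error message above is truncated). XML 1.0 does not allow U+0001 even as a character reference, so the JDK's parser rejects it. ControlCharXmlCheck is a hypothetical helper name, not a Hadoop class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

// Hypothetical helper for illustration only; not part of Hadoop.
final class ControlCharXmlCheck {

    // Wraps valueXml in a small XML 1.0 configuration document and reports
    // whether the JDK's default parser accepts it.
    static boolean parses(String valueXml) {
        String xml = "<?xml version=\"1.0\"?><configuration><property>"
                + "<name>mapred.textoutputformat.separator</name>"
                + "<value>" + valueXml + "</value>"
                + "</property></configuration>";
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXParseException e) {
            // Well-formedness errors, such as an illegal character reference, land here.
            return false;
        } catch (Exception e) {
            // Parser setup or I/O problems are unrelated to the separator itself.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("\",\" accepted:    " + parses(","));
        System.out.println("\"&#9;\" accepted: " + parses("&#9;")); // tab is legal in XML 1.0
        System.out.println("\"&#1;\" accepted: " + parses("&#1;")); // ^A is not
    }
}
```

The comma and the tab reference are accepted, while &#1; is rejected. The parser may also print its own fatal-error line to stderr.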

So I wonder whether anyone has succeeded in setting the delimiter to ^A (which seems like a pretty common choice) with a simple workaround, or whether I should just give up and use the tab separator.

Thanks!

I am running Hadoop 0.20.2-cdh3u5 on CentOS 6.2

1 answer

Looking around, I found three possible solutions to this problem:

  • Base64-encode the delimiter character. You then need to create a custom TextOutputFormat that overrides the getRecordWriter method and decodes the Base64-encoded delimiter.
  • Create a custom TextOutputFormat as above, but instead change its default delimiter from tab to the character you want.
  • Provide the delimiter through an XML resource file. You can specify a custom resource file with the addResource() method on the job configuration.
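
The encoding half of the first option can be sketched in plain Java. This is only an illustration of the round trip, not the custom output format itself: SeparatorCodec and the property name in ENCODED_KEY are hypothetical, and java.util.Base64 assumes Java 8 or later (an older JVM would need a library Base64 codec instead). The driver would store encode("\u0001") in the configuration, and the overridden getRecordWriter would call decode on that value before building its record writer.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only; not part of Hadoop.
final class SeparatorCodec {

    // Hypothetical property name; any key that does not clash with Hadoop's own would do.
    static final String ENCODED_KEY = "example.textoutputformat.separator.base64";

    // Turns the raw separator into plain ASCII that is safe to store in an XML config file.
    static String encode(String separator) {
        return Base64.getEncoder()
                .encodeToString(separator.getBytes(StandardCharsets.UTF_8));
    }

    // Recovers the raw separator; a custom getRecordWriter would call this.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("\u0001");
        // In the driver this would be: conf.set(ENCODED_KEY, encoded);
        System.out.println(ENCODED_KEY + " = " + encoded);
        System.out.println("round trip ok: " + decode(encoded).equals("\u0001"));
    }
}
```

Because the encoded value contains only ordinary ASCII characters, it should pass through the XML configuration file without triggering the character reference error from the question.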