Apache Pig: load a file that displays fine with hadoop fs -text

I have files named part-r-000[0-9][0-9] that contain tab-separated fields. I can view them with hadoop fs -text part-r-00000, but I can't get them to load in Pig.

What I tried:

 x = load 'part-r-00000';
 dump x;

 x = load 'part-r-00000' using TextLoader();
 dump x;

but both only give me garbage. How can I view this file using Pig?

What may be relevant: my HDFS is still on CDH-2. Also, if I copy the file to local disk and run file part-r-00000, it only says part-r-00000: data, so I don't know how to decompress it locally either.

2 answers

According to the HDFS documentation, hadoop fs -text <file> can be used for "zip and TextRecordInputStream" data, so your data may be in one of those formats.

If the file was compressed, Hadoop usually adds the extension when writing it out to HDFS, but if that's missing you could try copying it locally and decompressing it (gunzip / bunzip2 / unzip / etc.). Pig should decompress automatically, but it may require the file extension to be present (for example, part-r-00000.zip).
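As a quick check (my own suggestion, not part of the original answer), you could pull the first few bytes of the file out of HDFS and look at them: gzip data starts with the bytes 1f 8b, bzip2 with "BZh", and a Hadoop sequence file with "SEQ".

 # Inspect the file's magic bytes to guess its format.
 # gzip -> 1f 8b, bzip2 -> "BZh", SequenceFile -> "SEQ".
 hadoop fs -cat part-r-00000 | head -c 16 | od -c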

I'm not too sure about TextRecordInputStream; it looks like that would be Pig's default, but I could be wrong. I didn't see any mention of loading that kind of data through Pig in a quick Google search.

Update: since you found out it is a sequence file, here is how you can load it using PiggyBank:

 -- Using the Cloudera directory structure:
 REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar
 -- REGISTER /home/hadoop/lib/pig/piggybank.jar

 DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();

 -- Sample job: grab counts of tweets by day
 A = LOAD 'mydir/part-r-000{00..99}' -- not sure if Pig likes the {00..99} syntax, but worth a shot
     USING SequenceFileLoader AS (key:long, val:long); -- add more fields as needed
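If Pig rejects that {00..99} range syntax, a plain Hadoop glob should match the same files (a sketch; the directory name and schema are just carried over from above):

 -- Alternative: let Hadoop's glob matching pick up all the part files.
 A = LOAD 'mydir/part-r-*'
     USING SequenceFileLoader AS (key:long, val:long);
 DUMP A;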

If you want to manipulate (read/write) sequence files from Pig, you can also try Twitter's Elephant-Bird.

You can find examples of how to read/write them here.
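As a rough sketch of what that looks like (the jar paths and converter classes below are assumptions based on the Elephant-Bird examples, so check the project's documentation for the exact names and constructor arguments), loading the sequence file from the earlier answer might look like this:

 -- Sketch only: jar locations and converter classes are assumptions.
 REGISTER /path_to_elephant_bird/elephant-bird-core.jar
 REGISTER /path_to_elephant_bird/elephant-bird-pig.jar

 A = LOAD 'mydir/part-r-*'
     USING com.twitter.elephantbird.pig.load.SequenceFileLoader(
         '-c com.twitter.elephantbird.pig.util.LongWritableConverter',
         '-c com.twitter.elephantbird.pig.util.LongWritableConverter');
 DUMP A;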

If you use custom Writables in your sequence file, you can implement your own converter by extending AbstractWritableConverter .

Please note that building Elephant-Bird requires Thrift to be installed on your machine. Before building it, make sure it uses the correct version of Thrift; you can also specify the path to the Thrift executable in its pom.xml:

 <plugin>
   <groupId>org.apache.thrift.tools</groupId>
   <artifactId>maven-thrift-plugin</artifactId>
   <version>0.1.10</version>
   <configuration>
     <thriftExecutable>/path_to_thrift/thrift</thriftExecutable>
   </configuration>
 </plugin>
