According to the HDFS documentation, hadoop fs -text <file> can be used on "zip and TextRecordInputStream" data, so your data may be in one of these formats.
If the file was compressed, Hadoop usually adds the extension when writing to HDFS. If the extension is missing, you can test by copying the file locally and trying to unzip / gunzip / bunzip2 it. It seems that Pig should decompress such files automatically, but a file extension (for example, part-r-00000.zip) may be required.
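As a rough alternative to trial-and-error unpacking, you could look at the first few bytes of a local copy of the file. This is only a sketch: the function names sniff_format and sniff_file are made up for illustration, and it checks just the well-known leading signatures (gzip, bzip2, zip, and the "SEQ" header that Hadoop sequence files start with), so anything else is reported as unknown.

```python
# Leading byte signatures for the formats discussed above.
# (Hypothetical helper; not part of Hadoop or Pig.)
MAGIC = [
    (b"SEQ", "sequencefile"),   # Hadoop SequenceFile header
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"PK\x03\x04", "zip"),
]


def sniff_format(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    for prefix, name in MAGIC:
        if header.startswith(prefix):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first four bytes of a local file and guess its format."""
    with open(path, "rb") as f:
        return sniff_format(f.read(4))
```

For example, after copying one part file out of HDFS with hadoop fs -get, calling sniff_file("part-r-00000") on the local copy would tell you whether it looks compressed or like a sequence file. A result of "unknown" does not rule out other codecs; it only means none of these four signatures matched.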
I'm not too sure about TextRecordInputStream. It looks like this would be the default method for Pig, but I could be wrong. I did not see any mention of loading this data through Pig when I did a quick Google search.
Update: Since you found it to be a sequence file, here is how you can load it using PiggyBank:
    -- using Cloudera directory structure:
    REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar
    --REGISTER /home/hadoop/lib/pig/piggybank.jar
    DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();

    -- Sample job: grab counts of tweets by day
    A = LOAD 'mydir/part-r-000{00..99}'
Dolan antenucci