I'm using a custom output format that writes a new sequence file per key for each mapper, so you end up with something like this.
Input
Key1 Value
Key2 Value
Key1 Value
Files
/path/to/output/Key1/part-00000
/path/to/output/Key2/part-00000
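To make the layout concrete, here is a minimal sketch of how records end up routed to per-key files. It uses plain Java collections as a stand-in for the real output format, so `PerKeyLayout`, `pathFor`, and `route` are hypothetical names for illustration, not part of Hadoop or of my actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerKeyLayout {
    // Builds the per-key path for one mapper, e.g. /path/to/output/Key1/part-00000.
    static String pathFor(String outputDir, String key, int mapperId) {
        return String.format("%s/%s/part-%05d", outputDir, key, mapperId);
    }

    // Groups values by target file path. In the real job each distinct path
    // would be a separate sequence file opened by the mapper (assumption).
    static Map<String, List<String>> route(String outputDir, int mapperId, String[][] records) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (String[] record : records) {
            String path = pathFor(outputDir, record[0], mapperId);
            files.computeIfAbsent(path, p -> new ArrayList<>()).add(record[1]);
        }
        return files;
    }

    public static void main(String[] args) {
        String[][] input = {
            {"Key1", "Value"},
            {"Key2", "Value"},
            {"Key1", "Value"},
        };
        Map<String, List<String>> files = route("/path/to/output", 0, input);
        for (Map.Entry<String, List<String>> e : files.entrySet()) {
            System.out.println(e.getKey() + " <- " + e.getValue().size() + " record(s)");
        }
    }
}
```

Each distinct key seen by a mapper produces its own file, so the number of files a single mapper may create grows with the number of unique keys it encounters.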
I'm seeing a huge performance hit. It usually takes about 10 minutes just to map the input, but with this output format the mappers still had not finished after two hours, although they were emitting rows. I expect the number of unique keys to be about half the number of input rows, around 200,000.
Has anyone done something similar, or can anyone suggest something that might help performance? I would like to keep this key-splitting step if at all possible.
Thanks!