I use Pig with my Cassandra data to do all kinds of amazing grouping feats that would be almost impossible to write imperatively. I'm using the DataStax Hadoop/Cassandra integration, and I have to say it's impressive. Hats off to those guys!
I have a rather small sandbox cluster (2 nodes) where I'm putting this system through some tests. I have a CQL table with ~53M rows (about 350 bytes each), and I notice that the mapper takes a very long time to grind through those 53M rows. I started digging through the logs, and I can see that the map phase spills repeatedly (I saw 177 spills from the mapper), which I think is part of the problem.
The combination of CassandraInputFormat and JobConfig is creating a single mapper, so that one mapper has to read 100% of the rows in the table. Talk about anti-parallel :)
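For context, the load side of the script is essentially just this (keyspace, table, and column names are placeholders, and I'm assuming the CqlStorage handler here — the exact storage class may differ depending on your DSE version):

```pig
-- Placeholder keyspace/table; load rows via the Cassandra storage handler
rows = LOAD 'cql://my_keyspace/my_table'
       USING org.apache.cassandra.hadoop.pig.CqlStorage();

-- A typical grouping job over the loaded rows ('key' is a placeholder column)
grouped = GROUP rows BY key;
counts  = FOREACH grouped GENERATE group, COUNT(rows);
DUMP counts;
```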
Now, there are a number of moving parts in this picture, including:
- 2 physical nodes
- the Hadoop node is in the "Analytics" DC (the default), but physically in the same rack
- I see the job using LOCAL_QUORUM
Can someone tell me how to get Pig to create more input splits so that I can run more mappers? I have 23 map slots; it seems a shame to use only one of them all the time.
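For what it's worth, my guess is that the relevant knob is the input split size that the Cassandra InputFormat reads from the job conf (the `cassandra.input.split.size` property set via ConfigHelper, if I have the property name right), which should be settable from a Pig script — but I haven't confirmed this actually fans out the mappers:

```pig
-- Assumption: lowering the split size (rows per split) makes the
-- InputFormat produce more splits, and hence more mappers.
-- The property name here is my guess at the ConfigHelper key.
SET cassandra.input.split.size 65536;

rows = LOAD 'cql://my_keyspace/my_table'
       USING org.apache.cassandra.hadoop.pig.CqlStorage();
```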
Or am I completely insane and just don't understand the problem? I welcome both kinds of answers!
hughj