I use Pig with my Cassandra data to do all kinds of amazing grouping feats that would be almost impossible to write imperatively. I'm using the DataStax Hadoop/Cassandra integration, and I have to say it's impressive. Hats off to those guys!
I have a rather small sandbox cluster (2 nodes) where I'm putting this system through some tests. I have a CQL table with ~53M rows (about 350 bytes each), and I notice that the mapper takes a very long time to grind through those 53M rows. I started digging through the logs, and I can see that the map phase spills repeatedly (I saw 177 spills from the mapper), which I think is part of the problem.
The combination of CassandraInputFormat and JobConfig is creating a single mapper, so that one mapper has to read 100% of the rows in the table. Talk about anti-parallel :)
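For context, the load side of the script is essentially just this (keyspace, table, and column names are placeholders, and I'm assuming the CqlStorage handler here — the exact storage class may differ depending on your DSE version):

```pig
-- Placeholder keyspace/table; load rows via the Cassandra storage handler
rows = LOAD 'cql://my_keyspace/my_table'
       USING org.apache.cassandra.hadoop.pig.CqlStorage();

-- A typical grouping job over the loaded rows ('key' is a placeholder column)
grouped = GROUP rows BY key;
counts  = FOREACH grouped GENERATE group, COUNT(rows);
DUMP counts;
```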
Now, there are a number of moving parts in this picture, including:
- 2 physical nodes
- the Hadoop node is in the "Analytics" DC (the default), but physically in the same rack
- I see the job using LOCAL_QUORUM
Can someone tell me how to get Pig to create more input splits so that I can run more mappers? I have 23 map slots; it seems a shame to use only one of them all the time.
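For what it's worth, my guess is that the relevant knob is the input split size that the Cassandra InputFormat reads from the job conf (the `cassandra.input.split.size` property set via ConfigHelper, if I have the property name right), which should be settable from a Pig script — but I haven't confirmed this actually fans out the mappers:

```pig
-- Assumption: lowering the split size (rows per split) makes the
-- InputFormat produce more splits, and hence more mappers.
-- The property name here is my guess at the ConfigHelper key.
SET cassandra.input.split.size 65536;

rows = LOAD 'cql://my_keyspace/my_table'
       USING org.apache.cassandra.hadoop.pig.CqlStorage();
```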
Or am I completely insane and just don't understand the problem? I welcome both kinds of answers!
hughj