How to configure Titan for Spark and Cassandra clusters

There are already a few questions on the aurelius mailing list, as well as here on Stack Overflow, about specific issues with setting up Titan to make it work with Spark. But what is missing, in my opinion, is a high-level description of a simple setup that uses Titan and Spark.

What I'm looking for is a somewhat minimal setup that uses the recommended settings. For example, for Cassandra the replication factor should be 3, and a dedicated datacenter should be used for analytics (a keyspace sketch for this follows the list below).

From the information I found in the Spark, Titan, and Cassandra documentation, such a minimal setup might look like this:

  • Real-time processing DC: 3 nodes with Titan + Cassandra (RF: 3)
  • Analytics DC: 1 Spark master + 3 Spark slaves with Cassandra (RF: 3)
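
For example, I imagine the Titan keyspace being created up front roughly like this (a cqlsh sketch; the DC names realtime and analytics are placeholders I chose, and titan is Titan's default keyspace name):

    cqlsh> -- replicate the graph keyspace into both DCs with RF 3;
    cqlsh> -- DC names must match the ones in your snitch configuration
    cqlsh> CREATE KEYSPACE titan
       ...   WITH replication = {
       ...     'class': 'NetworkTopologyStrategy',
       ...     'realtime': 3,
       ...     'analytics': 3
       ...   };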

Some questions that I have about this setup and about Titan + Spark in general:

  • Is this setup correct?
  • Should Titan also be installed on the 3 Spark slaves and/or on the Spark master?
  • Is there another setup that you would use instead?
  • Will the Spark slaves only read data from the analytics DC, and ideally even from Cassandra on the same node?

Perhaps someone can even share a configuration file that supports such a setup (or a better one).

1 answer

So I went ahead and created a simple Spark cluster to work with Titan (and Cassandra as the storage backend), and here's what I came up with:

High-Level Overview

I'm only focusing on the analytics side of the cluster here, so I left out the real-time processing nodes.

High-Level Overview of the Analytics DC

A Spark cluster consists of one (or more) masters and a number of slaves (workers). Since the slaves do the actual processing, they need access to the data they work on. Therefore, Cassandra is installed on the workers and stores the graph data from Titan.

Jobs are sent from the Titan nodes to the Spark master, which distributes them to its workers. Titan therefore basically only communicates with the Spark master.

HDFS is needed only because TinkerPop stores intermediate results in it. Note that this has changed in TinkerPop 3.2.0.

Installation

HDFS

I just followed the tutorial I found here. There are only two things to keep in mind for Titan:

  • Choose a version that is compatible with Titan 1.0.0; for Hadoop this is 1.2.1.
  • Hadoop's TaskTrackers and JobTrackers are not needed, since we only want HDFS, not MapReduce (see the sketch below this list).
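
As a sketch (assuming a plain Hadoop 1.2.1 tarball installation; paths may differ on your system), bringing up HDFS alone then looks like this:

    # Format the NameNode once, then start only the HDFS daemons.
    bin/hadoop namenode -format
    bin/start-dfs.sh

    # Deliberately NOT run: bin/start-mapred.sh (JobTracker/TaskTrackers).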

Spark

Again, the version has to be compatible, which is also 1.2.1 for Titan 1.0.0. Installation basically just means extracting the archive of a pre-built version. Afterwards, you can configure Spark to use your HDFS by exporting HADOOP_CONF_DIR, which should point to Hadoop's conf directory.
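
As a sketch (the archive name and paths are assumptions), the Spark side then boils down to:

    # Extract a pre-built Spark 1.2.1 archive on every Spark node.
    tar -xzf spark-1.2.1-bin-hadoop1.tgz

    # In conf/spark-env.sh, point Spark at Hadoop's conf directory:
    export HADOOP_CONF_DIR=/opt/hadoop-1.2.1/conf

    # On the master (workers are listed in conf/slaves):
    sbin/start-master.sh
    sbin/start-slaves.sh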

Titan configuration

You will also need HADOOP_CONF_DIR on the Titan node from which you want to start OLAP jobs. It has to contain a core-site.xml file that specifies the NameNode:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://COORDINATOR:54310</value>
            <description>The name of the default file system. A URI whose
            scheme and authority determine the FileSystem implementation. The
            uri scheme determines the config property (fs.SCHEME.impl) naming
            the FileSystem implementation class. The uri authority is used to
            determine the host, port, etc. for a filesystem.</description>
        </property>
    </configuration>

Add HADOOP_CONF_DIR to your CLASSPATH so that TinkerPop can access HDFS. The TinkerPop documentation contains more information about this and about how to configure HDFS properly.
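
Concretely, that could look like this on the Titan node before starting the Gremlin console (the Hadoop path is an assumption):

    # Make Hadoop's configuration visible to TinkerPop.
    export HADOOP_CONF_DIR=/opt/hadoop-1.2.1/conf
    export CLASSPATH=$CLASSPATH:$HADOOP_CONF_DIR

    # Start the Gremlin console that ships with Titan.
    bin/gremlin.sh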

Finally, the configuration file that worked for me:

    #
    # Hadoop Graph Configuration
    #
    gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
    gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraInputFormat
    gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
    gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
    gremlin.hadoop.deriveMemory=false
    gremlin.hadoop.jarsInDistributedCache=true
    gremlin.hadoop.inputLocation=none
    gremlin.hadoop.outputLocation=output

    #
    # Titan Cassandra InputFormat configuration
    #
    titanmr.ioformat.conf.storage.backend=cassandrathrift
    titanmr.ioformat.conf.storage.hostname=WORKER1,WORKER2,WORKER3
    titanmr.ioformat.conf.storage.port=9160
    titanmr.ioformat.conf.storage.keyspace=titan
    titanmr.ioformat.cf-name=edgestore

    #
    # Apache Cassandra InputFormat configuration
    #
    cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
    cassandra.input.keyspace=titan
    cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
    cassandra.input.columnfamily=edgestore
    cassandra.range.batch.size=2147483647

    #
    # SparkGraphComputer Configuration
    #
    spark.master=spark://COORDINATOR:7077
    spark.serializer=org.apache.spark.serializer.KryoSerializer
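
With that file in place (saved, say, as conf/read-cassandra.properties — the path is my choice), an OLAP traversal can be started from the Gremlin console roughly like this, using the TinkerPop 3.0.x syntax that ships with Titan 1.0.0 (and assuming the console's Spark plugin is active):

    gremlin> // open the Hadoop graph defined by the properties file above
    gremlin> graph = GraphFactory.open('conf/read-cassandra.properties')
    gremlin> // bind the traversal source to SparkGraphComputer so traversals
    gremlin> // are executed as Spark jobs on the cluster
    gremlin> g = graph.traversal(computer(SparkGraphComputer))
    gremlin> // a simple OLAP job: count all vertices of the Titan graph
    gremlin> g.V().count()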

Answers

This leads to the following answers:

Is this setup correct?

It seems so. At least it works with this setup.

Should Titan also be installed on the 3 Spark slaves and/or on the Spark master?

Since it is not required, I wouldn't do it, because I prefer to keep the Titan servers, which users can access, separate from the Spark servers.

Is there another setup that you would use instead?

I would be glad to hear from someone else who has a different setup.

Will the Spark slaves only read data from the analytics DC, and ideally even from Cassandra on the same node?

Since the Cassandra nodes (those of the analytics DC) are explicitly configured, the Spark slaves shouldn't be able to pull data from completely different nodes. But I'm still not sure about the second part, reading from Cassandra on the same node. Maybe someone else can offer more insight here?

