So I just went ahead and set up a simple Spark cluster to work with Titan (and Cassandra as the storage backend), and here's what I came up with:
High Level Overview
I'm only concentrating on the analytical side of the cluster here, so I leave out the real-time processing nodes.

Spark consists of one (or more) master and a number of slaves (workers). Since the slaves do the actual processing, they need access to the data they work on. Therefore, Cassandra is installed on the workers and holds the graph data from Titan.
Jobs are sent from the Titan nodes to the Spark master, which distributes them to its workers. Consequently, Titan basically only communicates with the Spark master.
HDFS is only needed because TinkerPop stores intermediate results in it. Note that this changed in TinkerPop 3.2.0.
Installation
HDFS
I just followed the tutorial I found here. There are only two things to keep in mind for Titan:
- Choose a compatible version, which is 1.2.1 for Titan 1.0.0.
- Hadoop's TaskTrackers and JobTrackers are not needed, since we only want HDFS, not MapReduce (see the sketch below).
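As a minimal sketch of what that means in practice (the Hadoop installation path is an assumption on my part; adjust it to your setup), you only start the HDFS daemons and then check that the NameNode is reachable:

# Start only HDFS (NameNode + DataNodes); no JobTracker/TaskTrackers required
$HADOOP_HOME/bin/start-dfs.sh

# Verify that HDFS is up and reachable
$HADOOP_HOME/bin/hadoop fs -ls /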
Spark
Again, the version has to be compatible, which is also 1.2.1 for Titan 1.0.0. Installation basically just means extracting the archive of a pre-built version. Afterwards, you can configure Spark to use your HDFS by exporting HADOOP_CONF_DIR, which should point to Hadoop's conf directory.
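A rough sketch of those steps on the Spark machines (the Hadoop path is again an assumption; I'm just using the standalone scripts that ship with Spark 1.2.1):

# On every Spark node: point Spark at Hadoop's configuration
export HADOOP_CONF_DIR=/usr/local/hadoop/conf

# On the master node
$SPARK_HOME/sbin/start-master.sh

# On the master node, after listing the worker hostnames in conf/slaves
$SPARK_HOME/sbin/start-slaves.sh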
Titan configuration
You also need HADOOP_CONF_DIR on the Titan node from which you want to start OLAP jobs. It has to contain a core-site.xml file that specifies the NameNode:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://COORDINATOR:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
Add HADOOP_CONF_DIR to your CLASSPATH as well, so that TinkerPop can access HDFS. The TinkerPop documentation contains more information about this and about how to check whether HDFS is configured correctly.
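On the Titan node that boils down to something like the following, e.g. in the shell from which you start the Gremlin console (the Hadoop path is just an example):

export HADOOP_CONF_DIR=/usr/local/hadoop/conf
export CLASSPATH=$CLASSPATH:$HADOOP_CONF_DIR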
Finally, the configuration file that worked for me:
#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

#
# Titan Cassandra InputFormat configuration
#
titanmr.ioformat.conf.storage.backend=cassandrathrift
titanmr.ioformat.conf.storage.hostname=WORKER1,WORKER2,WORKER3
titanmr.ioformat.conf.storage.port=9160
titanmr.ioformat.conf.storage.keyspace=titan
titanmr.ioformat.cf-name=edgestore

#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=titan
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647

#
# SparkGraphComputer Configuration
#
spark.master=spark:
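For completeness, here is a minimal sketch of how such a configuration can then be used from the Gremlin console to run an OLAP traversal with SparkGraphComputer (assuming the Spark plugin is active in the console; the properties file name and path are just placeholders for wherever you saved the configuration above):

bin/gremlin.sh
gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-cassandra.properties')
gremlin> g = graph.traversal(computer(SparkGraphComputer))
gremlin> g.V().count()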
The answers
This leads to the following answers:
Is this setup correct?
It seems so. At least it works with this setup.
Should Titan also be installed on the 3 Spark slaves and/or the Spark master?
Since it isn't required, I wouldn't do it, because I prefer to keep the Spark servers separate from the Titan servers that the users access.
Is there another setup you would use?
I would be glad to hear from somebody else who has a different setup.
Will the Spark slaves only read data from the analytics DC, and ideally even from Cassandra on the same node?
Since the Cassandra nodes (from the analytics DC) are explicitly configured, the Spark slaves can't retrieve data from completely different nodes. But I'm still not sure about the second part. Maybe someone else can give more insight here?