I have time series currently stored as a graph (using a time-tree structure similar to this) in a Neo4j server, version 2.3.6 (so there is only a REST interface, no Bolt). What I'm trying to do is run some analytics on these time series in a distributed way using PySpark.
Now, I know about the existing projects for connecting Spark to Neo4j, in particular those listed here. The problem is that they focus on providing an interface for working with graphs. In my case the graph structure is not relevant, because my Neo4j Cypher queries are designed to produce arrays of values. Everything that happens downstream is about processing those arrays as time series, not as a graph.
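For concreteness, here is roughly the kind of query I have in mind, run through py2neo against the REST endpoint. This is a hypothetical sketch: the `Series`/`Point` labels and property names stand in for my actual time-tree model, the credentials are placeholders, and the `cypher.execute` call assumes py2neo 2.x.

```python
# Hypothetical sketch: Series/Point labels and credentials are placeholders;
# the cypher.execute API assumes py2neo 2.x talking to the REST endpoint.
from py2neo import Graph, authenticate

authenticate("localhost:7474", "neo4j", "password")  # assumed credentials
graph = Graph("http://localhost:7474/db/data/")      # Neo4j 2.3 REST endpoint

result = graph.cypher.execute(
    "MATCH (s:Series {id: {sid}})-[:HAS_POINT]->(p:Point) "
    "RETURN collect(p.value) AS vals",
    {"sid": "series-1"},
)
values = result[0].vals  # a plain Python list of values, not a subgraph
```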
My question is: has anyone successfully queried a REST-only Neo4j instance in parallel from PySpark, and if so, how did you do it? The py2neo library seemed like a good candidate until I realized that its connection object cannot be shared across partitions (or, if it can, I don't know how to do it). Right now I am considering having my Spark tasks issue independent REST queries against the Neo4j server, but I wanted to see how the community has solved this problem.
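To illustrate what I mean by independent REST requests, here is a minimal sketch of that approach, not working code for my actual model. It assumes a Neo4j 2.3 server reachable at the URL below via the standard transactional Cypher endpoint, placeholder credentials, and the same hypothetical `Series`/`Point` model as above. The key point is that the HTTP session is created inside `mapPartitions`, so no unpicklable connection object ever has to be shipped from the driver to the executors.

```python
# Minimal sketch: per-partition REST queries against Neo4j's transactional
# Cypher endpoint. URL, credentials, and the Series/Point model are assumptions.
import requests
from pyspark import SparkContext

NEO4J_URL = "http://localhost:7474/db/data/transaction/commit"  # assumed host/port
NEO4J_AUTH = ("neo4j", "password")                              # assumed credentials

CYPHER = ("MATCH (s:Series {id: {sid}})-[:HAS_POINT]->(p:Point) "
          "RETURN collect(p.value) AS vals")  # hypothetical data model

def fetch_partition(series_ids):
    # One HTTP session per partition, reused for every query in that partition.
    session = requests.Session()
    session.auth = NEO4J_AUTH
    for sid in series_ids:
        payload = {"statements": [{"statement": CYPHER,
                                   "parameters": {"sid": sid}}]}
        response = session.post(NEO4J_URL, json=payload)
        response.raise_for_status()
        rows = response.json()["results"][0]["data"]
        # Each result row is {"row": [[v1, v2, ...]]}; empty if the id is unknown.
        yield (sid, rows[0]["row"][0] if rows else [])

sc = SparkContext(appName="neo4j-timeseries")
ids = sc.parallelize(["series-1", "series-2", "series-3"], 3)
series = ids.mapPartitions(fetch_partition)  # RDD of (id, list of values)
print(series.collect())
```

One obvious concern with this design is that every partition opens its own connection to a single Neo4j instance, so the effective parallelism is bounded by what the server can handle; I would presumably need to keep the number of partitions moderate to avoid flooding it.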
Best, Aurélien
python neo4j pyspark