I am using SparkSQL in a Java application to do some processing of CSV files, with the Databricks spark-csv library for parsing.
The data that I process comes from different sources (a remote URL, a local file, Google Cloud Storage), and I'm used to turning everything into an InputStream so that I can analyze and process the data without knowing where it came from.
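For context, the helper I use to normalize the sources looks roughly like this (the class and method names are mine, just to illustrate the pattern; the Google Cloud Storage branch goes through its client library and is left out):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;

    public final class SourceOpener {
        // Resolve whatever source I have into a plain InputStream,
        // so the rest of the code never knows where the bytes came from.
        public static InputStream open(String source) throws IOException {
            if (source.startsWith("http://") || source.startsWith("https://")) {
                return new URL(source).openStream();   // remote URL
            }
            // (the Google Cloud Storage case uses its client library and is omitted here)
            return new FileInputStream(source);        // local file path
        }
    }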
All the documentation I have seen for Spark reads files from a path, for example:
    SparkConf conf = new SparkConf().setAppName("spark-sandbox").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlc = new SQLContext(sc);

    DataFrame df = sqlc.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load("path/to/file.csv");

    DataFrame dfGrouped = df.groupBy("varA", "varB")
        .avg("varC", "varD");

    dfGrouped.show();
And what I would like to do is read from an InputStream or even a string already in memory. Something like the following:
    InputStream stream = new URL(
        "http://www.sample-videos.com/csv/Sample-Spreadsheet-100-rows.csv"
    ).openStream();

    DataFrame dfRemote = sqlc.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(stream);

    String someString = "imagine,some,csv,data,here";

    DataFrame dfFromString = sqlc.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(someString);
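The only workaround I can think of is spilling the stream to a temporary file just so load() has a path to point at, which defeats the purpose of already having the data in memory:

    // Clunky workaround (uses java.nio.file.Files, Path and StandardCopyOption):
    // dump the stream to a temp file so that load() gets a path.
    Path tmp = Files.createTempFile("spark-csv-", ".csv");
    Files.copy(stream, tmp, StandardCopyOption.REPLACE_EXISTING);

    DataFrame dfFromTemp = sqlc.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(tmp.toString());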
Is there something simple I'm missing here?
I have read several documents about Spark Streaming and custom receivers, but as far as I can tell, that is for opening a connection that provides data continuously. Spark Streaming seems to break the data into chunks and process them, expecting more data to keep flowing in on an endless stream.
My best guess: Spark, as a descendant of Hadoop, expects large amounts of data that probably sit somewhere in a file system. But since Spark does its processing in memory, it seemed reasonable to me that SparkSQL should be able to analyze data that is already in memory.
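For instance, I know I can hand an in-memory collection to Spark as an RDD, which is why I was hoping the CSV reader had a similar entry point:

    // In-memory data can clearly get into Spark without touching a file system
    // (uses java.util.Arrays and java.util.List):
    List<String> lines = Arrays.asList(
        "varA,varB,varC,varD",
        "a,b,1.0,2.0"
    );
    JavaRDD<String> csvLines = sc.parallelize(lines);

What I haven't found is how to feed lines like these through the Databricks CSV parser (with its header and inferSchema options) to end up with a DataFrame.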
Any help would be appreciated.