Spark: read input stream instead of file

I am using SparkSQL in a Java application to do some processing of CSV files, using the Databricks spark-csv library for parsing.

The data that I process comes from different sources (remote URL, local file, Google Cloud Storage), and I'm used to turning everything into an InputStream so that I can analyze and process the data without knowing where it came from.

All the documentation I have seen for Spark reads files from a path, for example:

SparkConf conf = new SparkConf().setAppName("spark-sandbox").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlc = new SQLContext(sc);

DataFrame df = sqlc.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("path/to/file.csv");

DataFrame dfGrouped = df.groupBy("varA", "varB")
    .avg("varC", "varD");

dfGrouped.show();

And what I would like to do is read from an InputStream or even a string already in memory. Something like the following:

InputStream stream = new URL(
    "http://www.sample-videos.com/csv/Sample-Spreadsheet-100-rows.csv"
).openStream();

DataFrame dfRemote = sqlc.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(stream);

String someString = "imagine,some,csv,data,here";

DataFrame dfFromString = sqlc.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .read(someString);

Is there something simple I'm missing here?

I have read several documents about Spark Streaming and custom receivers, but as far as I can tell, that is meant for opening a connection that provides data continuously. Spark Streaming seems to break the data into chunks and process them, expecting more data to keep arriving on an endless stream.

My best guess: Spark, as a descendant of Hadoop, expects large amounts of data that are probably sitting somewhere in a file system. But since Spark does its processing in memory anyway, it seemed reasonable to me that SparkSQL would be able to analyze data that is already in memory.

Any help would be appreciated.

1 answer

You can use at least four different approaches to make your life easier:

  • Read from your input stream, write it to a local file (fast with an SSD), and read that file with Spark (see the first sketch after this list).

  • Use the Hadoop file system connectors for S3 and Google Cloud Storage and turn everything into a file operation. (This will not solve the problem of reading from an arbitrary URL, since there is no HDFS connector for that.)

  • Think of the different input types as different URIs and create a utility function that checks the URI and starts the corresponding read operation (see the second sketch after this list).

  • Same as (3), but use case classes instead of URIs and just overload based on input type.
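For option 1, a minimal sketch, assuming the same Spark 1.x / spark-csv setup as in the question (the class and helper names and the temp-file prefix are just illustrative):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class StreamToSpark {

    // Copy the stream to a temporary CSV file, then let Spark read it from disk as usual.
    static DataFrame readCsvStream(SQLContext sqlc, InputStream stream) throws Exception {
        Path tmp = Files.createTempFile("spark-input-", ".csv");
        Files.copy(stream, tmp, StandardCopyOption.REPLACE_EXISTING);

        return sqlc.read()
            .format("com.databricks.spark.csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .load(tmp.toString());
    }
}

With the URL from the question, that would be called as readCsvStream(sqlc, new URL("http://www.sample-videos.com/csv/Sample-Spreadsheet-100-rows.csv").openStream()).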
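Option 3 could look roughly like the following; the scheme names handled here and the downloadToTempFile helper are assumptions for illustration, not part of the original answer:

import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CsvSource {

    // Dispatch on the URI scheme and reuse the same Spark reader for every source.
    static DataFrame loadCsv(SQLContext sqlc, URI uri) throws Exception {
        String scheme = uri.getScheme() == null ? "file" : uri.getScheme();
        String path;
        switch (scheme) {
            case "file":   // local file: Spark reads it directly
            case "gs":     // Google Cloud Storage, via the Hadoop GCS connector
            case "s3n":    // S3, via the Hadoop S3 connector
                path = uri.toString();
                break;
            case "http":
            case "https":  // no Hadoop connector for plain HTTP: download to a temp file first
                path = downloadToTempFile(uri);
                break;
            default:
                throw new IllegalArgumentException("Unsupported scheme: " + scheme);
        }

        return sqlc.read()
            .format("com.databricks.spark.csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .load(path);
    }

    // Hypothetical helper: same trick as option 1, stream the URL into a local temp file.
    static String downloadToTempFile(URI uri) throws Exception {
        Path tmp = Files.createTempFile("spark-input-", ".csv");
        try (InputStream in = uri.toURL().openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp.toString();
    }
}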

