Spark: read input stream instead of file

I am using SparkSQL in a Java application to do some processing of CSV files, using the Databricks spark-csv library for parsing.

The data that I process comes from different sources (remote URL, local file, Google Cloud Storage), and I'm used to turning everything into an InputStream so that I can analyze and process the data without knowing where it came from.

All the documentation I have seen for Spark reads files from a path, for example:

SparkConf conf = new SparkConf().setAppName("spark-sandbox").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlc = new SQLContext(sc);

DataFrame df = sqlc.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("path/to/file.csv");

DataFrame dfGrouped = df.groupBy("varA", "varB")
    .avg("varC", "varD");

dfGrouped.show();

And what I would like to do is read from an InputStream or even a string already in memory. Something like the following:

InputStream stream = new URL(
    "http://www.sample-videos.com/csv/Sample-Spreadsheet-100-rows.csv"
).openStream();

DataFrame dfRemote = sqlc.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(stream);

String someString = "imagine,some,csv,data,here";

DataFrame dfFromString = sqlc.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .read(someString);

Is there something simple I'm missing here?

I have read several documents about Spark Streaming and custom receivers, but as far as I can tell, that is meant for opening a connection that provides data continuously. Spark Streaming seems to break the data into chunks and process them, expecting more data to keep arriving on an endless stream.

My best guess: Spark, as a descendant of Hadoop, expects large amounts of data that are probably sitting somewhere in a file system. But since Spark does its processing in memory anyway, it seemed reasonable to me that SparkSQL would be able to analyze data that is already in memory.

Any help would be appreciated.

1 answer

You can use at least four different approaches to make your life easier:

  • Read from your input stream, write it to a local file (fast with an SSD), and read that file with Spark (see the first sketch after this list).

  • Use the Hadoop file system connectors for S3 and Google Cloud Storage and turn everything into a file operation. (This will not solve the problem of reading from an arbitrary URL, since there is no HDFS connector for that.)

  • Think of the different input types as different URIs and create a utility function that checks the URI and starts the corresponding read operation (see the second sketch after this list).

  • Same as (3), but use case classes instead of URIs and just overload based on input type.
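For option 1, a minimal sketch, assuming the same Spark 1.x / spark-csv setup as in the question (the class and helper names and the temp-file prefix are just illustrative):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class StreamToSpark {

    // Copy the stream to a temporary CSV file, then let Spark read it from disk as usual.
    static DataFrame readCsvStream(SQLContext sqlc, InputStream stream) throws Exception {
        Path tmp = Files.createTempFile("spark-input-", ".csv");
        Files.copy(stream, tmp, StandardCopyOption.REPLACE_EXISTING);

        return sqlc.read()
            .format("com.databricks.spark.csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .load(tmp.toString());
    }
}

With the URL from the question, that would be called as readCsvStream(sqlc, new URL("http://www.sample-videos.com/csv/Sample-Spreadsheet-100-rows.csv").openStream()).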
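Option 3 could look roughly like the following; the scheme names handled here and the downloadToTempFile helper are assumptions for illustration, not part of the original answer:

import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CsvSource {

    // Dispatch on the URI scheme and reuse the same Spark reader for every source.
    static DataFrame loadCsv(SQLContext sqlc, URI uri) throws Exception {
        String scheme = uri.getScheme() == null ? "file" : uri.getScheme();
        String path;
        switch (scheme) {
            case "file":   // local file: Spark reads it directly
            case "gs":     // Google Cloud Storage, via the Hadoop GCS connector
            case "s3n":    // S3, via the Hadoop S3 connector
                path = uri.toString();
                break;
            case "http":
            case "https":  // no Hadoop connector for plain HTTP: download to a temp file first
                path = downloadToTempFile(uri);
                break;
            default:
                throw new IllegalArgumentException("Unsupported scheme: " + scheme);
        }

        return sqlc.read()
            .format("com.databricks.spark.csv")
            .option("inferSchema", "true")
            .option("header", "true")
            .load(path);
    }

    // Hypothetical helper: same trick as option 1, stream the URL into a local temp file.
    static String downloadToTempFile(URI uri) throws Exception {
        Path tmp = Files.createTempFile("spark-input-", ".csv");
        try (InputStream in = uri.toURL().openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp.toString();
    }
}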

