Spark SQL: how to use JSON data from a REST service as a DataFrame

I need to read some JSON data from a web service that provides a REST interface, so that I can query it from my Spark SQL code for analysis. I can already read JSON stored in blob storage and use it.

I was wondering what the best way is to read the data returned by a REST service and use it like any other DataFrame.

BTW, I am using Spark 1.6 on a Linux cluster on HDInsight, if that helps. I would also appreciate any code snippets for this, as I am still very new to Spark.

+5
2 answers

On Spark 1.6:

If you are in Python, use the requests library to fetch the data, then just create an RDD from it. There should be a similar HTTP library for Scala. Then just do:

 json_str = '{"executorCores": 2, "kind": "pyspark", "driverMemory": 1000}'
 rdd = sc.parallelize([json_str])
 json_df = sqlContext.jsonRDD(rdd)
 json_df

Code for Scala:

 val anotherPeopleRDD = sc.parallelize(
   """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
 val anotherPeople = sqlContext.read.json(anotherPeopleRDD)

This is from: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
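Putting the two steps together in Scala, here is a minimal end-to-end sketch for Spark 1.6 that uses only the standard library for the HTTP call (the Scala counterpart of Python's requests). The URL is a placeholder, and it assumes the service returns line-delimited JSON:

 import scala.io.Source

 // Placeholder endpoint -- substitute your actual REST URL
 val url = "http://example.com/api/data"

 // Fetch the whole response body as a string, analogous to
 // requests.get(url).text in Python
 val body = Source.fromURL(url).mkString

 // read.json expects one self-contained JSON record per RDD element,
 // so split a line-delimited response into individual lines
 val jsonRDD = sc.parallelize(body.split("\n").toSeq)
 val df = sqlContext.read.json(jsonRDD)
 df.printSchema()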

+5

Spark cannot parse arbitrary JSON into a DataFrame, since JSON is a hierarchical structure while a DataFrame is flat. If Spark cannot parse your JSON, it most likely does not meet the requirement that "each line must contain a separate, self-contained valid JSON object", and it will therefore need to be parsed with your own custom code and then fed to a DataFrame as a collection of case class objects or Spark SQL Rows.
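As a quick illustration of that requirement, a small sketch you can paste into a Spark 1.6 shell:

 // One self-contained JSON object per RDD element: parses into a real schema
 val ok = sc.parallelize(Seq(
   """{"name":"a","age":1}""",
   """{"name":"b","age":2}"""))
 sqlContext.read.json(ok).printSchema() // age, name

 // A single object split across elements: no element is valid JSON on its
 // own, so Spark falls back to a lone _corrupt_record column
 val bad = sc.parallelize(Seq("{", "\"name\": \"a\"", "}"))
 sqlContext.read.json(bad).printSchema() // _corrupt_record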

You can fetch the data like this:

 import scalaj.http._

 val response = Http("proto:///path/to/json")
   .header("key", "val")
   .method("get")
   .asString
   .body

and then parse your JSON as shown in this answer. Then build a Seq of objects of your case class (say, seq) and create the DataFrame as

 seq.toDF 
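A minimal sketch of those last two steps, assuming the response has already been parsed into plain values by whatever JSON library you choose (the Person case class and its fields are hypothetical stand-ins for your own schema):

 // Hypothetical case class -- adjust the fields to match your JSON
 case class Person(name: String, city: String)

 // Suppose your custom parsing of `response` produced these objects
 val seq = Seq(Person("Yin", "Columbus"), Person("Michael", "Seattle"))

 // Required in Spark 1.6 for the .toDF conversion on local collections
 import sqlContext.implicits._

 val df = seq.toDF()
 df.show()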
+1
