Creating a Spark DataFrame in Spark Streaming from a JSON Message on Kafka

I am implementing Spark Streaming in Scala, where I pull JSON strings from a Kafka topic and want to load them into a DataFrame. Is there a way to do this so that Spark infers the schema on its own from the RDD[String]?

+7
scala dataframe apache-spark apache-kafka
3 answers

Yes, you can use the following:

    sqlContext.read
      //.schema(schema) // optional; makes it a bit faster. If you've processed the data before, you can get the schema via df.schema
      .json(jsonRDD)    // jsonRDD is an RDD[String]
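Put together, a minimal sketch of that approach might look like the following (assuming a running SparkContext `sc` and SQLContext `sqlContext`; the variable names and sample records are illustrative, not from the question):

```scala
// Sketch only; requires a Spark runtime (Spark 1.4-era API).
import org.apache.spark.rdd.RDD

// Stand-in for the RDD[String] of JSON messages pulled from Kafka.
val jsonRDD: RDD[String] = sc.parallelize(Seq(
  """{"id": 1, "name": "alice"}""",
  """{"id": 2, "name": "bob"}"""
))

// First pass: let Spark infer the schema from the JSON strings.
val df = sqlContext.read.json(jsonRDD)

// Capture the inferred schema so later batches can skip inference.
val schema = df.schema
val dfFast = sqlContext.read.schema(schema).json(jsonRDD)
dfFast.show()
```

Passing the saved schema back in avoids the extra pass over the data that schema inference costs on every batch.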

I am trying to do the same thing at the moment. I'm curious how you got an RDD[String] out of Kafka, though; I'm still under the impression that Spark + Kafka only does streaming, not a one-off "pull out what's there now" batch. :)

+3

In Spark 1.4, you can try the following method to create a DataFrame from an RDD:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val yourDataFrame = hiveContext.createDataFrame(yourRDD)
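For an RDD of structured records (rather than raw JSON strings), createDataFrame works with a case class via reflection. A minimal sketch, assuming a running SparkContext `sc`; the Person class and sample data are made up for illustration:

```scala
// Sketch only; requires a Spark runtime (Spark 1.4-era API).
case class Person(name: String, age: Int)

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Any RDD of case-class instances can be turned into a DataFrame;
// column names and types are derived from the case class fields.
val yourRDD = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
val yourDataFrame = hiveContext.createDataFrame(yourRDD)
yourDataFrame.show()
```

Note that this path needs the element type known at compile time; for JSON strings whose structure is only known at runtime, the `sqlContext.read.json` approach above is the better fit.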
+2

You can use the code below to read the message stream from Kafka, extract the JSON values, and convert them to a DataFrame:

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
    messages.foreachRDD { rdd =>
      // extract just the values (the second element of each (key, value) pair)
      val df = sqlContext.read.json(rdd.map(x => x._2))
      df.show()
    }
+1
