Using a text file as a Spark stream source for testing purposes

I want to write a test for my spark streaming application that consumes a stream source.

http://mkuthan.imtqy.com/blog/2015/03/01/spark-unit-testing/ suggests using ManualClock, but for now reading the file and checking the outputs will be enough for me.

So I want to use:

    JavaStreamingContext streamingContext = ...
    JavaDStream<String> stream = streamingContext.textFileStream(dataDirectory);
    stream.print();
    streamingContext.awaitTermination();
    streamingContext.start();

Unfortunately, it does not print anything.

I tried:

  • dataDirectory = "hdfs://node:port/absolute/path/on/hdfs/"
  • dataDirectory = "file://C:\\absolute\\path\\on\\windows\\"
  • adding a text file to the directory BEFORE starting the program
  • adding a text file to the directory WHILE the program is running

Nothing works.

Any suggestion to read from a text file?

Thanks,

Martin

2 answers

The order of your start and awaitTermination calls is indeed inverted.

Besides that, the easiest way to feed data into a Spark Streaming application for testing is a QueueDStream. It is backed by a mutable queue of RDDs of arbitrary data, which means you can create the data programmatically, or load it from disk into an RDD, and hand it to your Spark Streaming code.
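Stripped of Spark, the pattern is just a mutable queue of batches that the test fills up front and a loop that drains one batch at a time. A minimal plain-Java sketch of that idea (no Spark on the classpath; all names here are made up for illustration, with List<String> standing in for an RDD[String]):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class QueueStreamSketch {
    public static void main(String[] args) {
        // Each List<String> plays the role of one RDD batch in the queue.
        Queue<List<String>> batchQueue = new ArrayDeque<>();
        batchQueue.add(List.of("line 1", "line 2"));
        batchQueue.add(List.of("line 3"));

        // The "streaming" loop: drain one batch per micro-batch interval.
        List<String> printed = new ArrayList<>();
        while (!batchQueue.isEmpty()) {
            List<String> batch = batchQueue.poll();
            printed.addAll(batch); // stands in for stream.print()
        }

        System.out.println(printed);
    }
}
```

The point of the pattern is that the test controls exactly which batches arrive and in what order, instead of depending on file-system timing.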

For example, to avoid timing issues with the file system, you can try something like this:

    import scala.collection.mutable.Queue

    val rdd = sparkContext.textFile(...)
    val rddQueue: Queue[RDD[String]] = Queue()
    rddQueue += rdd

    val dstream = streamingContext.queueStream(rddQueue)
    doMyStuffWithDstream(dstream)

    streamingContext.start()
    streamingContext.awaitTermination()

Silly me, I had the start() and awaitTermination() calls flipped.

In case you want to do the same: read from HDFS, and add the file to the directory WHILE the program is running.

