Recursively monitor an HDFS directory with Spark Streaming

I need to read data from an HDFS directory using Spark Streaming.

JavaDStream<String> lines = ssc.textFileStream("hdfs://ip:8020/directory");

The above works well for picking up new files in the HDFS directory, but it is limited to a single directory level; it does not monitor subdirectories.
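For reference, a minimal self-contained version of the setup around that line looks roughly like this (host, port, and batch interval are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setAppName("HdfsDirectoryStream");
// 10-second batch interval, chosen arbitrarily for this sketch
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

JavaDStream<String> lines = ssc.textFileStream("hdfs://ip:8020/directory");
lines.print();

ssc.start();
ssc.awaitTermination();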

I came across the following posts, which mention adding a depth parameter to this API:

https://mail-archives.apache.org/mod_mbox/spark-reviews/201502.mbox/%3C20150220121124.DBB5FE03F7@git1-us-west.apache.org%3E

https://github.com/apache/spark/pull/2765

The problem is that in Spark version 1.6.1 (I checked) this parameter does not exist, so I can't use it, and I don't want to modify the Spark source code.

JavaDStream<String> lines = ssc.textFileStream("hdfs://ip:8020/*/*/*/");

Some posts on Stack Overflow suggest the wildcard syntax above, but it does not work.

Am I missing something?

1 answer

It seems a patch was created, but it was not approved due to difficulties with S3 and directory depth handling.

https://github.com/apache/spark/pull/6588
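As a workaround until recursive monitoring is supported, you could enumerate the subdirectories yourself with the Hadoop FileSystem API and union one textFileStream per subdirectory. A rough sketch, reusing the ssc from the question (host, port, and paths are placeholders; it assumes at least one subdirectory exists and only sees subdirectories present when the job starts):

import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.streaming.api.java.JavaDStream;

// list the immediate subdirectories of the monitored root
FileSystem fs = FileSystem.get(new URI("hdfs://ip:8020"), new Configuration());
List<JavaDStream<String>> streams = new ArrayList<>();
for (FileStatus status : fs.listStatus(new Path("/directory"))) {
    if (status.isDirectory()) {
        // one monitored stream per subdirectory
        streams.add(ssc.textFileStream(status.getPath().toString()));
    }
}

// merge the per-subdirectory streams into a single DStream
JavaDStream<String> lines = streams.get(0);
for (int i = 1; i < streams.size(); i++) {
    lines = lines.union(streams.get(i));
}

Note that this only covers one level of nesting (for deeper trees you would recurse in the listing loop), and subdirectories created after startup are not picked up.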

