Reading files from Apache Spark textFileStream

I am trying to read and process text files from a directory on a Hadoop file system. But I noticed that every .txt file inside this directory is itself a directory, as in the following example:

/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/_SUCCESS   
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part-00000
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part-00001

I would like to read all the data inside the part files. I tried the following code:

val testData = ssc.textFileStream("/crawlerOutput/*/*")

Unfortunately, it reports that /crawlerOutput/*/* does not exist. Does textFileStream() accept wildcards? What can I do to solve this problem?

1 answer

textFileStream() is just a wrapper around fileStream(), and neither supports nested directories (see https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html): the monitored directory must be flat, so files sitting inside subdirectories are never picked up.

One option is to register a StreamingListener to track batch completion and detect newly created output directories yourself.

Alternatively, if you don't strictly need streaming, you could periodically check the directory for new subdirectories and read them in a batch job with textFile(), which does support glob patterns.
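To illustrate the batch workaround: since textFile() accepts Hadoop glob patterns while textFileStream()'s monitored directory must be flat, the nested part files can be read in one batch call. A minimal sketch, assuming the /crawlerOutput layout from the question; partGlob is a hypothetical helper name, not a Spark API:

```scala
// Sketch: build a Hadoop glob matching the part files nested one level
// below the output directory. partGlob is a hypothetical helper, not
// part of the Spark API.
object PartFiles {
  def partGlob(dir: String): String =
    dir.stripSuffix("/") + "/*/part-*"
}

// With a SparkContext `sc`, all crawler output could then be read in
// a single batch job (shown as a comment, since it needs a cluster):
//   val data = sc.textFile(PartFiles.partGlob("/crawlerOutput"))
//   // "/crawlerOutput/*/part-*" matches part-00000, part-00001, ...
```

The glob skips the _SUCCESS marker files automatically, because they don't match the part-* pattern.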

