I am learning Spark (in Scala) and trying to figure out how to count the occurrences of each word in each line of a file. I am working with a data set where each line contains a tab-delimited document_id and the full text of the document.
doc_1 <full-text>
doc_2 <full-text>
etc..
Here is a toy example of what I have in the file docs.txt:
doc_1 new york city new york state
doc_2 rain rain go away
I think what I need to do is convert each line into tuples of the form
((doc_id, word), 1)
and then call reduceByKey() to sum up the 1s. I wrote the following:
val file = sc.textFile("docs.txt")
val tuples = file.map(_.split("\t"))
.map( x => (x(1).split("\\s+")
.map(y => ((x(0), y), 1 )) ) )
This gives me what I think is the intermediate form I need:
tuples.collect
res0: Array[Array[((String, String), Int)]] = Array(Array(((doc_1,new),1), ((doc_1,york),1), ((doc_1,city),1), ((doc_1,new),1), ((doc_1,york),1), ((doc_1,state),1)), Array(((doc_2,rain),1), ((doc_2,rain),1), ((doc_2,go),1), ((doc_2,away),1)))
But calling reduceByKey on tuples causes an error:
tuples.reduceByKey(_ + _)
<console>:21: error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[Array[((String, String), Int)]]
       tuples.reduceByKey(_ + _)
I don't understand why this error occurs, or how to fix it.
I have looked through the examples at https://spark.apache.org/examples.html, but couldn't find one that matches this case. Any help would be appreciated.
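For what it's worth, my current guess (untested, based on reading the error message) is that the outer map leaves me with an RDD of Arrays rather than a flat RDD of pairs, and that replacing it with flatMap would flatten the per-document arrays so that reduceByKey becomes available:

val counts = sc.textFile("docs.txt")
  .map(_.split("\t"))
  // flatMap (instead of map) should yield RDD[((String, String), Int)]
  // rather than RDD[Array[((String, String), Int)]]
  .flatMap(x => x(1).split("\\s+").map(y => ((x(0), y), 1)))
  .reduceByKey(_ + _)

Is that the right way to think about it, or is there a more idiomatic approach?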