There are three ways (the first two are standard Spark built-in functions; I came up with the third). The solutions here are in PySpark:
textFile, wholeTextFiles, and a "labeled" textFile (key = file path, value = one line from that file; this is a sort of hybrid of the two standard methods).
1.) textFile
input: rdd = sc.textFile('/home/folder_with_text_files/input_file')
output: an RDD with one line of the file per record, i.e. [line1, line2, ...]
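A minimal sketch of this first approach (the directory path is just a placeholder); textFile also accepts a directory, a glob pattern, or a comma-separated list of paths, so you can point it at all the files at once:
# sketch only: '/home/folder_with_text_files' is a placeholder path
rdd = sc.textFile('/home/folder_with_text_files/*')  # one record per line, across all matched files
print(rdd.count())   # total number of lines
print(rdd.take(2))   # e.g. ['line1', 'line2']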
2.) wholeTextFiles
input: rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')
output: an RDD of tuples, where the first element is the "key" (the file path) and the second element is the entire content of that file, i.e.
[(u'file:/home/folder_with_text_files/file1.txt', u'file1_contents'), (u'file:/home/folder_with_text_files/file2.txt', u'file2_contents'), ...]
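If you want per-line records but still want to know which file each line came from, one small sketch built on wholeTextFiles (paths are placeholders) is to split each file's contents back into lines while keeping the file name as the key, which gives output similar to method 3 below:
import os

# sketch only: keep the base file name as the key, one (name, line) pair per record
pairs = sc.wholeTextFiles('/home/folder_with_text_files/*') \
          .flatMap(lambda kv: [(os.path.basename(kv[0]), line)
                               for line in kv[1].splitlines()])
# pairs.take(2) -> [('file1.txt', 'file1_contents_line1'), ('file1.txt', 'file1_contents_line2')]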
3.) "Label" textFile
input:
import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc.stop()                              # stop any existing SparkContext first
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Spark_Full = sc.emptyRDD()
for filename in glob.glob(Data_File + "/*"):  # Data_File = path to the folder with the text files
    # bind filename as a default argument so each lambda keeps its own file name
    Spark_Full += sc.textFile(filename).keyBy(lambda x, fn=filename: fn)
output: an RDD where each record is a tuple whose key is the file name and whose value is one line of that file. (Technically, you can also use something other than the actual file path as the key, for example a hash of it, to save memory.) i.e.
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'), ('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'), ('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'), ('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'), ...]
You can then recombine the lines into a list of strings per file:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']), ('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the "file:" prefix stripped from the path):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
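A small variation on the same idea, just a sketch: mapValues joins the lines without rebuilding the key by hand.
Spark_Full.groupByKey().mapValues(lambda lines: ' '.join(lines)).collect()
Note that groupByKey shuffles every line across the cluster; if all you need is whole-file contents, wholeTextFiles avoids that shuffle entirely.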