How to read the whole file in one line

I want to read a JSON or XML file in PySpark. My file is split across several lines when I read it with

    rdd = sc.textFile("<json or xml file>")

Input:

 { " employees": [ { "firstName":"John", "lastName":"Doe" }, { "firstName":"Anna" ] } 

The input is spread over several lines.

Expected result: {"employees":[{"firstName":"John",......]}

How can I get the full file as one line using PySpark?

Please help me, I'm new to Spark.

+8
4 answers

If your data is not laid out with one record per line, as textFile expects, use wholeTextFiles.

This will give you the whole file as a single string, so you can parse it from whatever format it is in.
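
For the JSON in the question, a minimal PySpark sketch of that approach could look like this (the path is only an example, and it assumes an existing SparkContext sc and a file containing valid JSON):

    import json

    # wholeTextFiles returns (path, whole_file_content) pairs, so the
    # multi-line JSON arrives as one string per file.
    rdd = sc.wholeTextFiles("/path/to/employees.json")

    # Parse each file's content with the standard json module.
    parsed = rdd.mapValues(json.loads)
    print(parsed.first())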

+5

There are three ways (the first two are standard Spark built-in functions; I came up with the third). The solutions here are in PySpark:

textFile, wholeTextFiles, and a "labeled" textFile (key = file path, value = one line of the file; a kind of combination of the two built-in methods).

1.) textFile

input: rdd = sc.textFile('/home/folder_with_text_files/input_file')

output: an RDD in which each record is one line of the file, i.e. [line1, line2, ...]
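
If all you need is the original question's "whole file in one line", one hedged way with plain textFile is to join the collected lines yourself (only reasonable for files small enough to collect to the driver; the path is the example one used here):

    # Read the file line by line, then glue the lines back together
    # into a single string on the driver.
    lines = sc.textFile('/home/folder_with_text_files/input_file')
    one_line = ' '.join(lines.collect())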

2.) wholeTextFiles

input: rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')

output: an RDD of tuples, where the first element is the "key" (the file path) and the second element is the entire content of that file, i.e.

    [(u'file:/home/folder_with_text_files/', u'file1_contents'),
     (u'file:/home/folder_with_text_files/', u'file2_contents'), ...]
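
If you just want the content of a single file as one string, a small sketch on top of this output (assuming the same example directory):

    # Each record is (path, content); values() drops the path and
    # first() returns the content of the first file.
    rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')
    content = rdd.values().first()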

3.) "Label" textFile

input:

    import glob
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    SparkContext.stop(sc)                  # stop the existing shell context first
    sc = SparkContext("local", "example")  # if running locally
    sqlContext = SQLContext(sc)

    Spark_Full = sc.emptyRDD()                   # accumulator RDD for all files
    Data_File = "/home/folder_with_text_files"   # directory with the input files

    for filename in glob.glob(Data_File + "/*"):
        # key every line with the path of the file it came from
        # (name=filename binds the current path at definition time)
        Spark_Full += sc.textFile(filename).keyBy(lambda line, name=filename: name)

output: an RDD of tuples, where the key is the file path and the value is one line of that file. (Technically, you could also use a different key instead of the actual file path, for example a hash of it, to save memory.) i.e.

    [('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
     ('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
     ('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
     ('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
     ...]

You can also recombine the lines into a list of strings per file:

Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()

    [('/home/folder_with_text_files/file1.txt',
      ['file1_contents_line1', 'file1_contents_line2', 'file1_contents_line3']),
     ('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]

Or recombine each whole file back into a single string (in this example the result is the same as what you get from wholeTextFiles, but with the "file:" prefix stripped from the path):

Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()

+5

In Scala you would do:

    val rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
    rdd.collect.foreach(t => println(t._2))
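
A rough PySpark equivalent of that Scala snippet (same example HDFS path) might be:

    rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
    # Print the content (the second tuple element) of each file.
    for path, content in rdd.collect():
        print(content)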
+4

"How to read the entire [HDFS] file in one line [in Spark to use as sql]":

For example:

    // Put file to hdfs from edge-node shell...
    hdfs dfs -put <filename>

    // Within spark-shell...
    // 1. Load file as one string
    val f = sc.wholeTextFiles("hdfs:///user/<username>/<filename>")
    val hql = f.take(1)(0)._2

    // 2. Use string as sql/hql
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val results = hiveContext.sql(hql)
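
A sketch of the same idea in PySpark, assuming a Hive-enabled build; the <username>/<filename> placeholders are kept from the example above:

    from pyspark.sql import HiveContext

    # 1. Load the whole file as one string (the second tuple element).
    f = sc.wholeTextFiles("hdfs:///user/<username>/<filename>")
    hql = f.take(1)[0][1]

    # 2. Run that string as SQL/HQL.
    hiveContext = HiveContext(sc)
    results = hiveContext.sql(hql)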
+3
