PySpark: how to check if a file exists in HDFS

I want to check whether several files exist in HDFS before loading them into the SparkContext. I am using PySpark. I tried os.system("hadoop fs -test -e %s" % path), but since I have many paths to check, the job crashed. I also tried sc.wholeTextFiles(parent_path) and then filtering by key, but that crashed too, because parent_path contains many subpaths and files. Could you help me?

+6
3 answers

To quote Tristan Reed:

... (Spark) can read many formats and supports Hadoop glob expressions, which are extremely useful for reading from multiple paths in HDFS, but it does not have a built-in facility that I am aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.

That comes from his answer to a related question: Pyspark: get a list of files/directories on the HDFS path.

Once you have a list of files in a directory, it is easy to check if any particular file exists.

I hope this can help somehow.
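Once you have that listing, the membership check itself is plain Python. A minimal sketch, where the listing is a hard-coded stand-in for whatever your HDFS directory listing actually returned (the helper name is hypothetical):

```python
def missing_files(required, listing):
    """Return the required paths that do not appear in the directory listing."""
    available = set(listing)
    return [p for p in required if p not in available]

# Stand-in for a real HDFS directory listing.
listing = ["/data/a.csv", "/data/b.csv"]
print(missing_files(["/data/a.csv", "/data/c.csv"], listing))
```

If the returned list is empty, every file you need is present and it is safe to hand the paths to Spark.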

+1

Have you tried using pydoop? Its exists function should work.

0

One possibility is that you could use hadoop fs -lsr your_path to get all the paths, and then check whether the paths you are interested in are in that set.
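A sketch of that approach, assuming the path is the last whitespace-separated field on each line of hadoop fs -lsr output (fed a hard-coded sample here rather than a live call; the helper name is hypothetical):

```python
def paths_from_lsr(lsr_output):
    """Extract the path (the last field) from each non-empty line of `hadoop fs -lsr` output."""
    return {line.split()[-1] for line in lsr_output.splitlines() if line.strip()}

# Sample output in the usual ls-style format.
sample = """\
-rw-r--r--   3 hdfs hdfs 1024 2015-01-01 12:00 /data/a.csv
drwxr-xr-x   - hdfs hdfs    0 2015-01-01 12:00 /data/sub
"""
found = paths_from_lsr(sample)
print("/data/a.csv" in found)
```

In real use you would capture the listing once (e.g. via subprocess) and then test membership for as many paths as you like without spawning further processes.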

As for your crash, it may have been the result of making so many os.system calls, rather than anything specific to the hadoop command. Repeatedly spawning external processes can lead to problems with buffers that are never freed, in particular I/O buffers (stdin/stdout).

One solution would be to make a single call to a bash script that loops through all the paths. You can build the script from a string template in your code, fill in the array of paths, write it out, and then execute it.
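A minimal sketch of that script generation (the hadoop invocation is only emitted as text here, not executed; the helper and file names are hypothetical):

```python
def build_check_script(paths):
    """Emit a bash script that prints MISSING <path> for every path failing `hadoop fs -test -e`."""
    lines = ["#!/bin/bash"]
    for p in paths:
        lines.append('hadoop fs -test -e "%s" || echo "MISSING %s"' % (p, p))
    return "\n".join(lines) + "\n"

script = build_check_script(["/data/a.csv", "/data/b.csv"])
print(script)

# Write it out and run it once, instead of one os.system call per path:
# open("check_paths.sh", "w").write(script)
# subprocess.call(["bash", "check_paths.sh"])
```

This way the per-path work happens inside one child process instead of one process per path.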

It might also be a good idea to switch to the Python subprocess module, which gives you more fine-grained control over child processes. Here is the equivalent of os.system:

 import subprocess

 output = subprocess.check_output(your_script, shell=True)

Note that with subprocess.call or Popen you can redirect stdout to something like a file descriptor, if that helps you debug or makes the process more reliable. You can also change shell=True to False, unless you are actually invoking a shell script or using shell features such as pipes or redirection.
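The same return-code idea also works without shell=True. A minimal sketch using the local test -e command as a stand-in for hadoop fs -test -e, so it runs without a cluster (swap the command tuple on a real one; the helper name is hypothetical):

```python
import subprocess

def path_exists(path, cmd=("test", "-e")):
    """Return True if `cmd path` exits with status 0.

    For HDFS, swap cmd for ("hadoop", "fs", "-test", "-e").
    """
    result = subprocess.run(
        [*cmd, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

print(path_exists("/"))  # the local root directory always exists
```

Passing the command as a list avoids the shell entirely, which sidesteps quoting problems with unusual path names.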

0
