What is the difference between --archives, --files, and --py-files in PySpark job arguments?

--archives, --files, --py-files, sc.addFile, and sc.addPyFile are pretty confusing. Can someone explain them clearly?

1 answer

These options are really scattered everywhere.

In general, add your data files via --files or --archives and your code files via --py-files. The latter are added to the PYTHONPATH (cf. here), so you can import and use them.

As you can imagine, the CLI arguments are actually handled by the addFile and addPyFile functions (cf. here).

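Behind the scenes, pyspark invokes the more general spark-submit script.

You can add Python .zip, .egg, or .py files to the runtime path by passing a comma-separated list to --py-files.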
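To make the "comma-separated list" point concrete, here is a minimal sketch that assembles a spark-submit command line in plain Python. The helper name build_submit_command and the file names (job.py, deps.zip, helpers.py, lookup.csv) are made up for illustration; only the flags themselves come from spark-submit.

```python
import shlex


def build_submit_command(script, py_files=(), files=(), archives=()):
    """Return the argv list for a spark-submit call (hypothetical helper)."""
    argv = ["spark-submit"]
    # Each option takes ONE comma-separated list, not one flag per file.
    if py_files:
        argv += ["--py-files", ",".join(py_files)]
    if files:
        argv += ["--files", ",".join(files)]
    if archives:
        argv += ["--archives", ",".join(archives)]
    # Options go before the application script; anything after the script
    # is passed to the application itself.
    argv.append(script)
    return argv


# Illustrative file names only.
cmd = build_submit_command(
    "job.py",
    py_files=["deps.zip", "helpers.py"],
    files=["lookup.csv"],
)
print(shlex.join(cmd))
```

The list form can be handed to subprocess.run(cmd) as-is, which avoids shell quoting issues.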

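The --files and --archives options support specifying file names with a #, similar to Hadoop. For example, you can specify --files localtest.txt#appSees.txt: this uploads the file you have locally named localtest.txt into HDFS, but it is linked to by the name appSees.txt, and your application should use the name appSees.txt to reference it when running on YARN.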
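As a rough illustration of the # naming rule, the sketch below splits such an argument into the local path and the name the application would use. The function split_alias is hypothetical, and the fallback to the file's base name when no # is present is an assumption about the usual behaviour, not something stated above.

```python
import posixpath


def split_alias(spec):
    """Split 'local_path#alias' into (local_path, name_seen_by_the_app)."""
    local_path, sep, alias = spec.partition("#")
    if not sep or not alias:
        # Assumption: without an alias, the app sees the file's base name.
        alias = posixpath.basename(local_path)
    return local_path, alias


local_path, app_name = split_alias("localtest.txt#appSees.txt")
print(local_path, "->", app_name)
```

Inside the job running on YARN, the file would then be opened by its alias, for example open("appSees.txt"), rather than by the original local name.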

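addFile(path): add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or another Hadoop-supported filesystem), or an HTTP, HTTPS, or FTP URI.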

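addPyFile(path): add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or another Hadoop-supported filesystem), or an HTTP, HTTPS, or FTP URI.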
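The same split between data and code can be done from inside a running program. This is a minimal sketch, assuming sc is an existing SparkContext; the helper names register_dependencies and local_path_on_worker are invented for illustration, while addFile, addPyFile, and SparkFiles.get are the PySpark calls involved.

```python
def register_dependencies(sc, data_files=(), code_files=()):
    """Ship files through an existing SparkContext (hypothetical helper)."""
    for path in data_files:
        sc.addFile(path)      # data: downloaded onto every node
    for path in code_files:
        sc.addPyFile(path)    # code: .py/.zip importable in future tasks


def local_path_on_worker(filename):
    """Resolve where a file shipped with addFile landed on the current node."""
    from pyspark import SparkFiles  # lazy import; needs pyspark installed
    return SparkFiles.get(filename)
```

With a real context this might be called as register_dependencies(sc, data_files=["lookup.csv"], code_files=["deps.zip"]), after which tasks can import modules from deps.zip and read the data via open(local_path_on_worker("lookup.csv")). The file names here are placeholders.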
