What is the difference between --archives, --files, py-files in pyspark job arguments

Question

--archives, --files, --py-files, sc.addFile, and sc.addPyFile are pretty confusing. Can someone explain them clearly?

apache-spark pyspark pyspark-sql

Jasonwayne Jun 28 '16 at 2:56

1 answer

ShuaiYuan · Answer 1 · 2016-06-28T09:55:22+0000

The documentation for these options really is scattered all over the place.

In general, add your data files via --files or --archives, and your code files via --py-files. The latter are added to the Python search path (PYTHONPATH, the Python counterpart of the classpath; cf. here), so you can import and use them.

As you might expect, the CLI arguments are ultimately handled by the addFile and addPyFile functions (cf. here).
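As a rough illustration of the rule of thumb above, the sketch below assembles a spark-submit command line in Python without running it. The helper build_submit_command and every file name (job.py, lookup.csv, data.tar.gz, helpers.zip, utils.py) are hypothetical; only the three flags come from the discussion.

```python
def build_submit_command(script, files=(), archives=(), py_files=()):
    """Assemble a spark-submit argument list.

    Each flag takes a single comma-separated value.
    """
    cmd = ["spark-submit"]
    if files:
        cmd += ["--files", ",".join(files)]        # plain data files
    if archives:
        cmd += ["--archives", ",".join(archives)]  # archives of data files
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]  # importable Python code
    cmd.append(script)                             # the main script goes last
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_submit_command(
    "job.py",
    files=["lookup.csv"],
    archives=["data.tar.gz"],
    py_files=["helpers.zip", "utils.py"],
)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run, assuming spark-submit is on the PATH.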

http://spark.apache.org/docs/latest/programming-guide.html

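Behind the scenes, pyspark invokes the more general spark-submit script.
You can add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.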

http://spark.apache.org/docs/latest/running-on-yarn.html

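The --files and --archives options support specifying file names with the #, similar to Hadoop. For example, you can specify: --files localtest.txt#appSees.txt. This uploads the file you have locally named localtest.txt into HDFS, but it will be linked to by the name appSees.txt, and your application should use the name appSees.txt to reference it when running on YARN.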
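To make the # convention concrete, here is a small sketch that splits such a specification into the local source and the name the application sees. The helper split_alias is hypothetical and is not Spark's own parsing code. It also assumes that, without a #, the file is linked under its own base name.

```python
import posixpath


def split_alias(spec):
    """Split 'source#alias' into (source, link_name).

    Assumption: when no alias is given, the link name is the base name
    of the source path.
    """
    source, _, alias = spec.partition("#")
    if not alias:
        alias = posixpath.basename(source)
    return source, alias


# The application would open "appSees.txt", not "localtest.txt".
source, link_name = split_alias("localtest.txt#appSees.txt")
print(source, "->", link_name)
```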

http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=addpyfile#pyspark.SparkContext.addPyFile

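addFile(path): Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
addPyFile(path): Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.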
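The same thing can be done from inside a running application. The sketch below shows one way the two methods might be used together with SparkFiles.get, which resolves the local path of a file shipped via addFile. The function names, the file name lookup.txt and the paths in the usage comment are hypothetical, and a SparkContext named sc is assumed to exist already.

```python
def register_dependencies(sc, data_path, code_path):
    """Ship a data file and a Python dependency from within the application."""
    sc.addFile(data_path)    # data file: downloaded to every node
    sc.addPyFile(code_path)  # .py or .zip: importable by tasks run afterwards


def read_first_line(_):
    """Meant to run inside a task: locate the shipped file and read one line."""
    # Imported lazily so the import is resolved where the task executes.
    from pyspark import SparkFiles

    # "lookup.txt" is a hypothetical name; it must match the file passed to addFile.
    with open(SparkFiles.get("lookup.txt")) as handle:
        return handle.readline().strip()


# Usage, assuming an existing SparkContext named sc (paths are placeholders):
# register_dependencies(sc, "/path/to/lookup.txt", "/path/to/helpers.zip")
# sc.parallelize([0]).map(read_first_line).collect()
```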
