PySpark Logging?

I want my Spark driver program written in Python to output some basic logging information. There are three ways to do this:

  1. Using the PySpark py4j bridge to access the Java log4j logging tool used by Spark.

    log4jLogger = sc._jvm.org.apache.log4j
    LOGGER = log4jLogger.LogManager.getLogger(__name__)
    LOGGER.info("pyspark script logger initialized")

  2. Just use standard console printing.

  3. The Python standard library logging module. This seems like the ideal and most Pythonic approach; however, at least out of the box, it doesn't work, and the logged messages don't seem to be recoverable. Of course, it can be configured to log via py4j -> log4j and/or to the console (see the sketch after this list).
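As a rough illustration of that third option (a minimal sketch only; the logger name and format string below are illustrative assumptions, not anything mandated by Spark), the standard library can at least be pointed at the driver's console:

    import logging

    # Send standard-library log records from the driver process to stderr.
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s %(name)s %(levelname)s: %(message)s',
    )

    logger = logging.getLogger('my_pyspark_driver')  # illustrative name
    logger.info('driver-side stdlib logging configured')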

The official programming guide ( https://spark.apache.org/docs/1.6.1/programming-guide.html ) does not mention logging at all, which is disappointing. There should be a standard, documented, recommended way to log from the Spark driver program.

I searched for this problem and found the following question: How do I log from my Python Spark script

But the answers in that thread were unsatisfactory.

In particular, I have the following questions:

  • Am I missing the standard way to log from the PySpark driver program?
  • Are there any pros/cons of logging via py4j -> log4j versus logging to the console?
Tags: logging, apache-spark, pyspark
2 answers

A cleaner solution is to use the standard Python logger with a custom distributed handler that collects log messages from all nodes in the Spark cluster.

See "PySpark Login" of this Gist.


In my Python dev environment (a single-machine Spark installation) I use this:

    import logging

    def do_my_logging(log_msg):
        logger = logging.getLogger('__FILE__')
        logger.warning('log_msg = {}'.format(log_msg))

    do_my_logging('Some log message')

which works when the script is run via spark-submit.
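One hedged aside on the example above, not part of the original answer: with no prior logging configuration, the root logger's effective level is WARNING, which is presumably why warning() is used; if info-level messages are also wanted, the threshold can be lowered before calling the function:

    import logging

    # Lower the threshold so that info() calls are emitted as well.
    logging.basicConfig(level=logging.INFO)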

