Years later, after I had gained more knowledge and looked at this question again, I realized what @alfredox really wanted to ask. So I revisited it and split the answer into two parts:
To answer why the native DataFrame function (the native Spark SQL function) is faster:
Essentially, why a built-in Spark function is ALWAYS faster than a Spark UDF, regardless of whether the UDF is implemented in Python or Scala.
First, we need to understand what Tungsten is. It first appeared in Spark 1.4. Tungsten is Spark's execution backend, and it focuses on:
- Off-heap memory management: storing data in a compact binary in-memory representation (the Tungsten row format) with explicit memory management,
- Cache locality: cache-aware computation with cache-friendly data layouts to achieve high cache hit rates,
- Whole-stage code generation (aka CodeGen).
One of Spark's biggest performance killers is garbage collection (GC): a GC pause stops all threads in the JVM until the collection completes. This is exactly why off-heap memory management was introduced.
When executing native Spark SQL functions, the data stays inside the Tungsten backend. With a Spark UDF, however, the data is moved out of Tungsten into the JVM (for a Scala UDF), or into the JVM plus a Python worker process (for a Python UDF), to run the actual logic, and is then serialized back into Tungsten. As a result:
- Inevitably, there is an overhead/penalty to:
  - deserialize the Tungsten input, and
  - serialize the output back into Tungsten.
- Even with Scala, Spark's first-class citizen, the UDF increases the number of objects on the JVM heap, which can in turn increase GC pressure. This is precisely the problem that Tungsten's off-heap memory management is trying to solve.
To answer whether Python is necessarily slower than Scala:
On October 30, 2017, Spark introduced vectorized UDFs for PySpark:
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
The reason Python UDFs are slow is that PySpark UDFs were not implemented in the most optimized way. Quoting the linked post:
Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one row at a time, and thus suffer from high serialization and invocation overhead.
However, the new vectorized UDFs seem to improve performance significantly:
from 3x to 100x.
