How to save partial results of data conversion processes in pyspark?

I am working in Apache Spark to do multiple conversions on a single DataFrame using Python.

I wrote some functions to simplify the various conversions. Imagine that we have a function such as:

def clearAccents(df, columns):
    # lines that remove accents from the dataframe
    # with Spark functions or a udf
    return df
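
For context, here is a minimal pure-Python sketch of the accent-stripping logic such a function might wrap in a udf. The name `strip_accents` and the Unicode-normalization approach are my own assumptions, not the original code:

```python
import unicodedata

def strip_accents(text):
    # Decompose each character (e.g. 'é' -> 'e' + combining accent),
    # then drop the combining marks, keeping only the base letters.
    normalized = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in normalized if not unicodedata.combining(ch))

print(strip_accents("café naïve"))  # -> cafe naive
```

In Spark this could be registered as a udf and applied to each of the given columns, but the exact wiring depends on the original implementation.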

I use these functions by reassigning the DataFrame variable after each conversion, so the DataFrame is overwritten with each function's return value. I know this is not good practice, and now I am seeing the consequences.

I noticed that every time I add a line as shown below, the runtime is longer:

# Step transformation 1:
df = function1(df, column)
# Step transformation 2:
df = function2(df, column)

As I understand it, Spark is lazy and does not execute a transformation until an action is invoked. So when an action triggers function2, Spark re-executes function1 first and then function2, rather than reusing function1's result. How can I save these partial results and avoid the re-computation?

Would df.cache() or df.persist() solve this?

I could not find this addressed elsewhere on Stack Overflow.


Yes, you can use cache() or persist(), but caching alone is not enough: transformations are lazy, so you must also invoke an action to actually materialize the cached result. Something like this:

# Step transformation 1:
df = function1(df, column).cache()

# Now invoke an action to materialize the cache
df.count()

# Step transformation 2:
df = function2(df, column)
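As a plain-Python analogy of why the action matters (my own illustration with hypothetical names, not Spark's API): materializing an intermediate result once lets later passes reuse it instead of re-running the earlier step.

```python
# Count how many rows function1 actually processes.
calls = {"function1": 0}

def function1(rows):
    # Lazy transformation: runs only when consumed.
    for row in rows:
        calls["function1"] += 1
        yield row.upper()

def function2(rows):
    for row in rows:
        yield row + "!"

data = ["a", "b"]
# The analogue of cache() + count(): force the lazy step once
# and keep the materialized result.
cached = list(function1(data))
# Later "actions" reuse the materialized result instead of recomputing:
list(function2(cached))
list(function2(cached))
print(calls["function1"])  # -> 2: function1 ran only once, during materialization
```

This mirrors the answer's pattern: cache() marks the DataFrame for storage, and the count() action is what actually populates the cache.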

You can confirm the cached data is actually being reused by inspecting the job in the SQL tab of the Spark web UI.

You could also look at the ML Pipeline API and implement each conversion as a custom Transformer; transformers compose cleanly into a single pipeline. See the PySpark ML documentation for details.

