I am working in Apache Spark, doing multiple transformations on a single DataFrame using Python.
I wrote some functions to simplify the various transformations. Imagine that we have functions such as:
def clearAccents(df, columns):
    # ... strips accents from the given columns ...
    return df
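
To give an idea of the shape of these helpers, a function like clearAccents might look roughly like the sketch below; the use of pyspark.sql.functions.translate and the character set are only assumptions for illustration, not my exact code:

from pyspark.sql import functions as F

def clearAccents(df, columns):
    # Illustrative only: replace a few common accented characters with
    # their unaccented equivalents in each of the given string columns.
    for c in columns:
        df = df.withColumn(c, F.translate(F.col(c), "áéíóúüñ", "aeiouun"))
    return df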
I use these functions to overwrite the DataFrame variable, saving the new, transformed DataFrame that each function returns over the previous one. I know this is not good practice, and now I am seeing the consequences.
I noticed that every time I add a line like the ones below, the runtime gets longer:
df = function1(df, column)
df = function2(df, column)
As I understand it, Spark uses lazy evaluation, so nothing is actually computed when these lines run. When I apply function1, Spark does not execute it yet; when I apply function2, Spark has to run function1 and then function2, so every added function forces the whole chain of previous transformations to be recomputed. Is that correct?
Would using df.cache() or df.persist() help in this case?
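
Concretely, I am imagining something like the sketch below, where caching the intermediate DataFrame would (I hope) keep function1 from being recomputed by later actions; the count() at the end is just an example action to force materialization:

df = function1(df, column)
df = df.cache()    # or df.persist(StorageLevel.MEMORY_AND_DISK)
df = function2(df, column)
df.count()         # example action; materializes and populates the cache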
I could not find a similar question on stackoverflow.