Let's start with some fundamentals of Spark's internals; this will make the rest easier to follow. RDD: at the core of Spark is a data structure called the RDD (Resilient Distributed Dataset), which is lazily evaluated. By lazy evaluation we mean that an RDD is not computed until an action is invoked on it (for example, calling count on an RDD or show on a Dataset).
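As a minimal sketch of this (the session setup and the data here are made up purely for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lazy-rdd-demo").master("local[*]").getOrCreate()

// Transformations only record lineage; nothing is computed on these two lines.
val rdd = spark.sparkContext.parallelize(1 to 10)
val doubled = rdd.map(_ * 2)

// count is an action: this is the point where the computation actually runs.
println(doubled.count())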
A Dataset or DataFrame (a DataFrame being a Dataset[Row]) also uses RDDs under the hood. This means that each transformation (for example, a filter) is carried out only when an action (such as show) is triggered.
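The same pattern at the DataFrame level (again just a sketch; demo_df and its columns are hypothetical, reusing the spark session from above):

import org.apache.spark.sql.functions.col
import spark.implicits._

// Illustrative data with three columns a, b, c.
val demo_df = Seq((1, 2, 3), (4, 3, 6)).toDF("a", "b", "c")

// filter is a transformation: Spark only extends the logical plan here.
val filtered = demo_df.filter(col("b") === 3)

// show is an action: only now is the plan analyzed, optimized, and executed.
filtered.show()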
Which brings us to your statement: "When I created a_c, no new data was created; a_c just points to the same data as my_df."
Exactly: no data has been realized yet; it is only realized when it has to be brought into memory. That is why your filter works against the original dataframe. The only way to make a_c.filter(col("b") === 3).show() fail is to cache the intermediate dataframe using dataframe.cache. For example:
val a_c = s_df.select(col("a"), col("c")).cache()
a_c.filter(col("b") === 3).show()
Spark will then throw: Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name.
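For contrast, the un-cached version below is essentially your original situation; per the explanation above, nothing is materialized, so the filter is resolved against the original dataframe instead of failing:

// No cache: here a_c is just an unmaterialized plan over s_df.
val a_c = s_df.select(col("a"), col("c"))

// Per this answer, the filter runs against the original dataframe,
// so no AnalysisException is raised here.
a_c.filter(col("b") === 3).show()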