Does Spark SQL include stream optimization for table optimization for joins, and if so, how does it decide which table to stream?
When performing joins, Hive assumes that the last table is the largest. As a connection optimization, he will try to buffer smaller join tables and pass the latter. If the last table in the netlist is not the largest, Hive has a hint /*+ STREAMTABLE(tbl) */ that tells her the table that should be streamed. Starting with version 4.1.1, Spark SQL does not support the STREAMTABLE hint.
This question has been asked for normal RDD processing, outside of Spark SQL, here . The answer does not extend to Spark SQL, where the developer does not control explicit cache operations.
source share