Does Spark SQL include streaming table optimization for joins?

Does Spark SQL include stream optimization for table optimization for joins, and if so, how does it decide which table to stream?

When performing joins, Hive assumes that the last table is the largest. As a connection optimization, he will try to buffer smaller join tables and pass the latter. If the last table in the netlist is not the largest, Hive has a hint /*+ STREAMTABLE(tbl) */ that tells her the table that should be streamed. Starting with version 4.1.1, Spark SQL does not support the STREAMTABLE hint.

This question has been asked for normal RDD processing, outside of Spark SQL, here . The answer does not extend to Spark SQL, where the developer does not control explicit cache operations.

+5
source share
2 answers

I searched for the answer a while ago, and all I could think of was setting the spark.sql.autoBroadcastJoinThreshold parameter, which is 10 MB by default. Then he will try to automatically broadcast all the tables with a size smaller than the limit you set. The connection order does not play any role here.

If you are interested in further improving connection performance, I highly recommend this presentation .

+3
source

This is the upcoming Spark 2.3 here ( RC2 voted for the next version).

Like version 1.4.1, Spark SQL does not support the STREAMTABLE hint.

This is not the last (and most likely will be released soon) Spark 2.3.

There is no support for the STREAMTABLE tooltip, but given the recent change (in SPARK-20857 the Common node tooltip allowed ) to build a tooltip framework that should be fairly simple to write.

You will need to write a Spark optimization and possibly a physical plan (s) that will support STREAMTABLE (which seems like a lot of work), but it is possible. There are tools.

As for connection optimization, in the upcoming Spark 2.3 there are two main logical optimizations:

+1
source

All Articles