Does Spark SQL include streaming table optimization for joins?

Question

Does Spark SQL include streaming table optimization for joins?

Does Spark SQL include stream optimization for table optimization for joins, and if so, how does it decide which table to stream?

When performing joins, Hive assumes that the last table is the largest. As a connection optimization, he will try to buffer smaller join tables and pass the latter. If the last table in the netlist is not the largest, Hive has a hint /*+ STREAMTABLE(tbl) */ that tells her the table that should be streamed. Starting with version 4.1.1, Spark SQL does not support the STREAMTABLE hint.

This question has been asked for normal RDD processing, outside of Spark SQL, here . The answer does not extend to Spark SQL, where the developer does not control explicit cache operations.

+5

apache-spark apache-spark-sql

Sim Aug 20 '15 at 20:15

source share

2 answers

This is the upcoming Spark 2.3 here ( RC2 voted for the next version).

Like version 1.4.1, Spark SQL does not support the STREAMTABLE hint.

This is not the last (and most likely will be released soon) Spark 2.3.

There is no support for the STREAMTABLE tooltip, but given the recent change (in SPARK-20857 the Common node tooltip allowed ) to build a tooltip framework that should be fairly simple to write.

You will need to write a Spark optimization and possibly a physical plan (s) that will support STREAMTABLE (which seems like a lot of work), but it is possible. There are tools.

As for connection optimization, in the upcoming Spark 2.3 there are two main logical optimizations:

Reorderjoin
CostBasedJoinReorder (solely for cost optimization)

+1

Jacek laskowski Jan 23 '18 at 13:57

source share

Niemand · Accepted Answer · 2015-08-21T07:50:03+0000

I searched for the answer a while ago, and all I could think of was setting the spark.sql.autoBroadcastJoinThreshold parameter, which is 10 MB by default. Then he will try to automatically broadcast all the tables with a size smaller than the limit you set. The connection order does not play any role here.

If you are interested in further improving connection performance, I highly recommend this presentation .

Does Spark SQL include streaming table optimization for joins?

More articles: