Spark to replace EXISTS and IN

Question

I am trying to run a query that uses the EXISTS clause:

select <...> from A, B, C where A.FK_1 = B.PK and A.FK_2 = C.PK and exists (select A.ID from <subquery 1>) or exists (select A.ID from <subquery 2>)

Unfortunately, this does not seem to be supported. I also tried replacing the EXISTS clause with the IN clause:

 select <...> from A, B, C where A.FK_1 = B.PK and A.FK_2 = C.PK and A.ID in (select ID from ...) or A.ID in (select ID from ...)

Unfortunately, the IN clause also seems to be unsupported.

Any ideas on how I can write an SQL query that achieves the desired result? I could basically rewrite the WHERE condition as another JOIN, and the second OR clause as a UNION, but that seems super awkward.

EDIT: A list of possible solutions.

Solution 1

 select <...> from A, B, C, (select ID from ...) as exist_clause_1, (select ID from ...) as exist_clause_2 where A.FK_1 = B.PK and A.FK_2 = C.PK and A.ID = exist_clause_1.ID or A.ID = exist_clause_2.ID

Solution 2

 select <...> from A, B, C, ( (select ID from ...) UNION (select ID from ...) ) as exist_clause where A.FK_1 = B.PK and A.FK_2 = C.PK and A.ID = exist_clause.ID

sql apache-spark-sql

Radu Jan 18 '16 at 18:23

1 answer

philipxy · Accepted Answer · 2016-01-19T00:58:17+0000

Spark SQL currently does not have EXISTS and IN. See "Supported Hive Features" in the (latest) Spark SQL, DataFrames and Datasets Guide.

EXISTS and IN can always be rewritten using JOIN or LEFT SEMI JOIN. "Although Apache Spark SQL currently does not support IN or EXISTS subqueries, you can efficiently implement the semantics by rewriting queries to use LEFT SEMI JOIN." OR can always be rewritten using UNION, and AND NOT can be rewritten using EXCEPT.
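
As a rough illustration of the first rewrite, here is a minimal sketch that checks an IN subquery against an equivalent JOIN. It runs on SQLite through Python's standard `sqlite3` module purely as a stand-in, because SQLite accepts both forms; it is not Spark. The tables `A` and `S` and their contents are made up for the example. In Spark SQL the same filter would be written along the lines of `A LEFT SEMI JOIN S ON A.ID = S.ID`, which SQLite does not support, so the sketch uses a plain JOIN against a de-duplicated subquery instead.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in engine (assumption:
# hypothetical tables A and S, not taken from the original question).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER, NAME TEXT);
    CREATE TABLE S (ID INTEGER);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO S VALUES (1), (1), (3);
""")

# The form the question wanted to write: filter A with an IN subquery.
in_rows = sorted(conn.execute(
    "SELECT A.ID, A.NAME FROM A WHERE A.ID IN (SELECT ID FROM S)"
).fetchall())

# JOIN rewrite. DISTINCT keeps the duplicate ID in S from duplicating
# rows of A, which a semi join would never do.
join_rows = sorted(conn.execute("""
    SELECT A.ID, A.NAME
    FROM A JOIN (SELECT DISTINCT ID FROM S) AS m ON A.ID = m.ID
""").fetchall())

print(in_rows)
print(join_rows)
```

Without the DISTINCT, the row of `A` with ID 1 would come back twice from the JOIN, which is the main thing to watch for when replacing IN or EXISTS with a plain JOIN rather than a semi join.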

A table holds the rows that make some predicate (a statement parameterized by column names) true:

The DBA gives the predicate for each base table T with columns T.C, ...: T(T.C, ...)
A JOIN holds the rows that make the AND of its arguments' predicates true; for a UNION, the OR; for an EXCEPT, the AND NOT.
SELECT DISTINCT kept columns FROM T holds the rows where EXISTS dropped columns [predicate of T].
T LEFT SEMI JOIN U holds the rows where EXISTS U-only columns [predicate of T AND predicate of U].
T WHERE condition holds the rows where predicate of T AND condition.

(For querying in general, see this answer.)

So by keeping in mind the predicate expressions that correspond to SQL, you can use straightforward logic rewrite rules to compose and/or reorganize queries. For example, using UNION here need not be "awkward" in terms of either readability or execution.

Your original question indicated that you understood you could use UNION, and you have edited variants into your question that remove EXISTS and IN from your original queries. Here is another variant that also removes OR.

  select <...> from A, B, C, (select ID from ...) as e where A.FK_1 = B.PK and A.FK_2 = C.PK and A.ID = e.id union select <...> from A, B, C, (select ID from ...) as e where A.FK_1 = B.PK and A.FK_2 = C.PK and A.ID = e.ID
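
To make the OR-to-UNION rewrite concrete, here is a small sketch that compares the two forms on made-up data. It uses SQLite through Python's `sqlite3` module as a stand-in engine, not Spark. The tables `A`, `B`, `S1` and `S2` and their rows are hypothetical, table `C` is left out for brevity, and the OR is parenthesized on the assumption that the join conditions are meant to apply to both alternatives.

```python
import sqlite3

# Stand-in engine and hypothetical data (assumptions, not from the question).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER, FK_1 INTEGER);
    CREATE TABLE B (PK INTEGER);
    CREATE TABLE S1 (ID INTEGER);
    CREATE TABLE S2 (ID INTEGER);
    INSERT INTO A VALUES (1, 10), (2, 10), (3, 20), (4, 99);
    INSERT INTO B VALUES (10), (20);
    INSERT INTO S1 VALUES (1), (4);
    INSERT INTO S2 VALUES (1), (3);
""")

# Original shape: join condition AND (IN subquery OR IN subquery).
or_rows = sorted(conn.execute("""
    SELECT A.ID FROM A, B
    WHERE A.FK_1 = B.PK
      AND (A.ID IN (SELECT ID FROM S1) OR A.ID IN (SELECT ID FROM S2))
""").fetchall())

# Rewritten shape: one SELECT per alternative, combined with UNION.
union_rows = sorted(conn.execute("""
    SELECT A.ID FROM A, B, (SELECT DISTINCT ID FROM S1) AS e
    WHERE A.FK_1 = B.PK AND A.ID = e.ID
    UNION
    SELECT A.ID FROM A, B, (SELECT DISTINCT ID FROM S2) AS e
    WHERE A.FK_1 = B.PK AND A.ID = e.ID
""").fetchall())

print(or_rows)
print(union_rows)
```

UNION removes duplicate rows, so the two forms agree on the set of distinct result rows. If the OR form can legitimately return the same row more than once, the UNION form will collapse those repeats.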

Your Solution 1 does not do what you think it does. If just one of the exist_clause tables is empty, then even if the other one contains matching IDs, the FROM cross product of the tables is empty and no rows are returned. ("A non-intuitive consequence of SQL semantics": sidebar in Chapter 6, "The Database Language SQL", p. 264 of Database Systems: The Complete Book, 2nd Edition.) A FROM does not just introduce names for the rows of tables; it CROSS JOINs and/or OUTER JOINs them, after which ON (for INNER JOINs) and WHERE filter some rows out.

Performance usually differs between different expressions that return the same rows. It depends on the DBMS's optimizations. Many details affect the best way to evaluate a query and the best way to write it; the DBMS and/or the programmer may or may not know about them, and may or may not balance them well. But executing two ORed subselects per row in a WHERE (as in your original queries, and also in your later Solution 2) is not necessarily better than running one UNION of two SELECTs (as in my query).
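
The empty-table problem is easy to reproduce. The following sketch, again on SQLite through Python's `sqlite3` module as a stand-in for Spark and with hypothetical tables `A`, `S1` and `S2`, leaves `S2` empty and compares the cross-product form of Solution 1 with the IN form it was meant to replace.

```python
import sqlite3

# Stand-in engine and hypothetical data (assumptions, not from the question).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (ID INTEGER);
    CREATE TABLE S1 (ID INTEGER);
    CREATE TABLE S2 (ID INTEGER);  -- left empty on purpose
    INSERT INTO A VALUES (1), (2);
    INSERT INTO S1 VALUES (1);
""")

# Shape of Solution 1: both derived tables sit in the FROM cross product.
cross_rows = conn.execute("""
    SELECT DISTINCT A.ID
    FROM A, (SELECT ID FROM S1) AS e1, (SELECT ID FROM S2) AS e2
    WHERE A.ID = e1.ID OR A.ID = e2.ID
""").fetchall()

# Intended meaning: keep A rows whose ID appears in either subquery.
in_rows = conn.execute("""
    SELECT A.ID FROM A
    WHERE A.ID IN (SELECT ID FROM S1) OR A.ID IN (SELECT ID FROM S2)
""").fetchall()

print(cross_rows)  # no rows: the cross product with the empty e2 is empty
print(in_rows)     # the row with ID 1 still matches through S1
```

Because `e2` has no rows, the cross product has no rows for WHERE to filter, so the match available through `S1` is lost.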
