I am using Spark 2.0.0 and have two datasets (Dataset[Row]), shown below.
Dataset 'appUsage':
+----------+-------------------+----------+
|DATE      |APP_ID             |TIMES_USED|
+----------+-------------------+----------+
|2016-08-03|06xgKq10eeq0REK4eAc|null      |
|2016-08-03|ssads2wsdsf        |null      |
|2016-08-03|testApp            |null      |
|2016-08-03|3222aClie-971837083|5         |
|2016-08-03|V2aadingTLV02      |null      |
|2016-08-03|OurRating-985443645|5         |
|2016-08-03|Trdssktin-743439164|null      |
|2016-08-03|myaa1-app          |null      |
|2016-08-03|123123123-013663450|null      |
+----------+-------------------+----------+
Dataset 'appDev':
+-------------------+------------------------------------+
|APP_ID             |DEVELOPER_ID                        |
+-------------------+------------------------------------+
|OurRating-985443645|5fff25c7-6a70-4d54-ad04-197be4b9a6a9|
|Xa11d0-560090096095|5fff25c7-6a70-4d54-ad04-197be4b9a6a9|
+-------------------+------------------------------------+
When I make a left join using the following code, everything works as expected.
val result = appUsage.join(appDev, Seq("APP_ID"), "left")
Output:
+-------------------+----------+----------+------------------------------------+
|APP_ID             |DATE      |TIMES_USED|DEVELOPER_ID                        |
+-------------------+----------+----------+------------------------------------+
|06xgKq10eeq0REK4eAc|2016-08-03|null      |null                                |
|ssads2wsdsf        |2016-08-03|null      |null                                |
|testApp            |2016-08-03|null      |null                                |
|3222aClie-971837083|2016-08-03|5         |null                                |
|V2aadingTLV02      |2016-08-03|null      |null                                |
|OurRating-985443645|2016-08-03|5         |5fff25c7-6a70-4d54-ad04-197be4b9a6a9|
|Trdssktin-743439164|2016-08-03|null      |null                                |
|myaa1-app          |2016-08-03|null      |null                                |
|123123123-013663450|2016-08-03|null      |null                                |
+-------------------+----------+----------+------------------------------------+
What I actually want is an inner join, so that only rows present in both datasets end up in the result. However, when I run the following code, the output is empty.
val result = appUsage.join(appDev, Seq("APP_ID"), "inner")
Did I miss something?
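For reference, here is a minimal sketch of how the setup could be reproduced in isolation. It is not my real pipeline: the object name `JoinRepro`, the local master, the app name, and the choice of only three `appUsage` rows are assumptions made for illustration. With clean keys I would expect the inner join to keep only the `OurRating-985443645` row, so comparing this against the real data may show whether the problem is in the data or in the join itself.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical stand-alone reproduction; names and row subset are illustrative.
object JoinRepro {
  // Local session for experimentation only (master and app name are assumptions).
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("join-repro")
    .getOrCreate()

  import spark.implicits._

  // A subset of the 'appUsage' rows shown above; TIMES_USED is nullable.
  val appUsage: DataFrame = Seq(
    ("2016-08-03", "testApp", Option.empty[Int]),
    ("2016-08-03", "3222aClie-971837083", Option(5)),
    ("2016-08-03", "OurRating-985443645", Option(5))
  ).toDF("DATE", "APP_ID", "TIMES_USED")

  // Both 'appDev' rows shown above.
  val appDev: DataFrame = Seq(
    ("OurRating-985443645", "5fff25c7-6a70-4d54-ad04-197be4b9a6a9"),
    ("Xa11d0-560090096095", "5fff25c7-6a70-4d54-ad04-197be4b9a6a9")
  ).toDF("APP_ID", "DEVELOPER_ID")

  // Same join calls as in the question, differing only in the join type.
  val left: DataFrame = appUsage.join(appDev, Seq("APP_ID"), "left")
  val inner: DataFrame = appUsage.join(appDev, Seq("APP_ID"), "inner")

  def main(args: Array[String]): Unit = {
    left.show(false)   // every appUsage row, DEVELOPER_ID null where unmatched
    inner.show(false)  // only rows whose APP_ID exists in both datasets
    spark.stop()
  }
}
```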