Below is the data in TestingTable1, sorted by date in descending order.
BUYER_ID | ITEM_ID | CREATED_TIME ----------+-----------------+---------------------- 1345653 110909316904 2012-07-09 21:29:06 1345653 151851771618 2012-07-09 19:57:33 1345653 221065796761 2012-07-09 19:31:48 1345653 400307563710 2012-07-09 18:57:33
And if this is the data below in TestingTable2, sorted by date in descending order, always
USER_ID | PRODUCT_ID | LAST_TIME ---------+----------------+----------------------- 1345653 110909316904 2012-07-09 22:29:06 1345653 150851771618 2012-07-09 19:57:33
Each line in TestingTable1 must match TestingTable2 , if it does not match or there is no data in TestingTable2, then I need to show them in the output, as in TestingTable1 I have this data, but the corresponding TestingTable2 I have this data (which will be incorrect data) so that I can see what the discrepancy is and what data is missing.
I need to compare TestingTable2 with TestingTable1 on BUYER_ID and USER_ID . I need to see if the correspondence BUYER_ID and USER_ID comparable, then I need to compare ITEM_ID with PRODUCT_ID and CREATED_TIME with LAST_TIME and if there is a mismatch in TestingTable2 after comparing with TestingTable1 either in one of them or in both of them, then I need to show the result.
So, if you look at the example above - I have three scenarios basically
- Firstly-In
TestingTable1 , in the first row ITEM_ID maps to PRODUCT_ID in the first row of TestingTable2 , but CREATED_TIME does not match LAST_TIME for the first row in both tables - Secondly. In
TestingTable1 , the second row of CREATED_TIME is mapped to LAST_TIME in the second row of TestingTable2 , but ITEM_ID does not match PRODUCT_ID for the second row in both tables - Thirdly. In
TestingTable1 last two lines (lines) are generally absent in TestingTable2 . This script is not covered in my request that I wrote. I want this script also in my request .
So these are the three cases that I need to cover when comparing TestingTable2 with TestingTable1 always. And TestingTable1 is the MAIN table where comparisons should always be made, so this means that the data in TestingTable1 always accurate.
So, I need to show the result as shown below, if neither of them matches, or there is no data in TestingTable2 on all TestingTable1 data, and then next to it is the same TestingTable2 data, so that I see what value was in TestingTable1 compared with TestingTable2
BUYER_ID | ITEM_ID | CREATED_TIME | USER_ID | PRODUCT_ID | LAST_TIME -----------+-----------------+---------------------------+----------------+--------------------+----------------------- 1345653 110909316904 2012-07-09 21:29:06 1345653 110909316904 2012-07-09 22:29:06 1345653 151851771618 2012-07-09 19:57:33 1345653 150851771618 2012-07-09 19:57:33 1345653 221065796761 2012-07-09 19:31:48 NULL NULL NULL 1345653 400307563710 2012-07-09 18:57:33 NULL NULL NULL
The query below, I only cover my two scenarios , which I mentioned above, and it works fine, and I get the output as above, leaving the last two lines from Output. But I need to add a third scenario also in this (below) request, so that it produces output as described above.
SELECT * FROM (SELECT T2.buyer_id, T2.item_id, T2.created_time AS created_time, subq.user_id, subq.product_id, subq.LAST_TIME FROM TestingTable2 subq JOIN TestingTable1 T2 ON T2.buyer_id = subq.user_id AND subq.LAST_TIME = ( T2.created_time ) WHERE ( subq.product_id <> T2.item_id ) UNION ALL SELECT T2.buyer_id, T2.item_id AS item_id, T2.created_time, subq.user_id, subq.product_id AS product_id, subq.LAST_TIME FROM TestingTable2 subq JOIN TestingTable1 T2 ON T2.buyer_id = subq.user_id AND subq.product_id = T2.item_id WHERE ( subq.LAST_TIME <> ( T2.created_time ) )) finalResult ORDER BY finalResult.BUYER_ID;
Any suggestion would be appreciated.
PS I asked a few questions related to JOIN in the last few days, but this only applies to my two scenarios, and not the third scenario that I need in this request.
Update: - I cannot use NOT IN or NOT EXISTS syntax for SQL, since I work with Hive and Hive, it does NOT IN support NOT IN or NOT EXISTS , so I need another way to handle this.
I need to use my query to modify it to work in the third scenario, since Hive will support SQL syntax.
Below is my SQL script that runs my two scripts above, but not the third script. Can someone help me modify my request for work and for the third scenario?
http://sqlfiddle.com/#!3/102dd/1/0 .
The data in table 1 should be in table 2, if there is none, then I need to show the data mismatch after comparing with table 1, and it is also possible, the data from Table 1 will not be present in Table 2, and I also want to show it.
Updated output for script http://sqlfiddle.com/#!3/102dd/3/0
BUYER_ID | ITEM_ID | CREATED_TIME | USER_ID | PRODUCT_ID | LAST_TIME -----------+-----------------+---------------------------+----------------+--------------------+----------------------- 1345653 151851771618 July, 09 2012 19:57:33 1345653 150851771618 July, 09 2012 19:57:33 1345653 221065796761 July, 09 2012 19:31:48 1345653 221165796761 July, 09 2012 19:31:48 1345653 110909316904 July, 09 2012 21:29:06 1345653 110909316904 July, 09 2012 22:29:06 1345653 400307563710 July, 09 2012 18:57:33 NULL NULL NULL 1345653 310411560125 July, 09 2012 16:09:49 NULL NULL NULL
UPDATED SQL QUERY THAT GIVES ME AN ERROR
I replaced TestingTable1 with this request -
(SELECT BUYER_ID, ITEM_ID, rank(BUYER_ID), CREATED_TIME FROM ( SELECT BUYER_ID, ITEM_ID, CREATED_TIME FROM testingtable1 where to_date(from_unixtime(cast(UNIX_TIMESTAMP(CREATED_TIME) as int))) = '2012-07-09' DISTRIBUTE BY BUYER_ID SORT BY BUYER_ID, CREATED_TIME desc ) T1 WHERE rank(BUYER_ID) < 5)
And TestingTable2 with this request -
(SELECT USER_ID, PROD_AND_TS.PRODUCT_ID as PRODUCT_ID, PROD_AND_TS.TIMESTAMPS as TIMESTAMPS FROM testingtable2 lateral view explode(PURCHASED_ITEM) exploded_table as PROD_AND_TS where to_date(from_unixtime(cast(PROD_AND_TS.TIMESTAMPS as BIGINT))) = '2012-07-09')