Sqoop requires access to table metadata, such as column type information. The $ CONDITIONS placeholder is set to '1 = 0' by default to ensure that sqoop receives only type information. So, after executing the sqoop command, you will see the first query that is launched with unfulfilled $ CONDITIONS. Later it is replaced with different values ββdefining different ranges based on the number of cartographers (-m) or -split-by column or -boundary query, so that the entire data set can be divided into different data slices or pieces and pieces can be imported in parallel with the size of concurrency. Sqoop will automatically replace this placeholder with the generated conditions that determine which piece of data should be transferred by each individual task.
For example, consider the sample_data table with the column name, ID, and salary. You want to get records with a salary> 1k.
sqoop import \ --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \ --username retail_dba --password cloudera \ --query 'select * from sample_data where $CONDITIONS AND salary > 1000' \ --split-by salary \ --target-dir hdfs://quickstart.cloudera/user/cloudera/sqoop_new
Below is the first query that returns an empty set.
SqlManager: Executing SQL statement: select * from sample_data where (1 = 0) AND salary > 1000
Then the next query should get the min and max range.
INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(salary), MAX(salary) FROM (select * from sample_data where (1 = 1) AND salary > 1000) AS t1;
Nikhil Bhide
source share