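Sqoop lets you import data in parallel, and --split-by and --boundary-query give you more control over how the work is divided. If you are just importing a table, it splits on the PRIMARY KEY, but for a more advanced query you need to specify the column to use for the parallel split.
For example: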
sqoop import \
--connect 'jdbc:mysql://.../...' \
--direct \
--username uname --password pword \
--hive-import \
--hive-table query_import \
--boundary-query 'SELECT 0, MAX(id) FROM a' \
--query 'SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND $CONDITIONS' \
--num-mappers 3 \
--split-by a.id \
--target-dir /data/import \
--verbose
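The boundary query lets you supply an optimized query for fetching the minimum and maximum values. Otherwise Sqoop will try to run MIN(a.id), MAX(a.id) against your --query statement.
The result (if min = 0, max = 30) is 3 queries running in parallel: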
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 0 AND 10;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 11 AND 20;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 21 AND 30;
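To make the splitting arithmetic concrete, here is a minimal Python sketch of how a [min, max] range might be cut into one condition per mapper and substituted for $CONDITIONS. It is an illustration only, not Sqoop's actual code: the function name `split_conditions` and the half-open range style (`>=` / `<`, with `<=` on the last split) are assumptions, and the exact boundaries Sqoop generates may differ from both this sketch and the BETWEEN form shown above.

```python
def split_conditions(column, lo, hi, num_mappers):
    """Return one WHERE fragment per split, covering [lo, hi] with no overlap.

    Hypothetical helper for illustration; not part of Sqoop.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if hi < lo:
        raise ValueError("hi must be >= lo")

    span = hi - lo
    # Evenly spaced split points from lo to hi (integer arithmetic).
    points = [lo + (span * i) // num_mappers for i in range(num_mappers + 1)]

    conditions = []
    for i in range(num_mappers):
        start, end = points[i], points[i + 1]
        last = i == num_mappers - 1
        if start == end and not last:
            continue  # skip empty ranges when the span is smaller than num_mappers
        # Every split is half-open except the last, which includes the max value.
        upper_op = "<=" if last else "<"
        conditions.append(f"{column} >= {start} AND {column} {upper_op} {end}")
    return conditions


# The free-form query from the example, with the $CONDITIONS placeholder.
QUERY = (
    "SELECT a.id, a.name, b.id, b.name FROM a, b "
    "WHERE a.id = b.id AND $CONDITIONS"
)

if __name__ == "__main__":
    # min = 0 and max = 30 are what the boundary query is assumed to return.
    for cond in split_conditions("a.id", 0, 30, 3):
        print(QUERY.replace("$CONDITIONS", cond))
```

Each printed statement corresponds to the work of one mapper. Keeping the ranges non-overlapping matters: if two splits both included the same id, those rows would be imported twice.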