Incremental import in Sqoop2 is straightforward: it synchronizes rows based on changes in a chosen column's value. Note that if the output directory already contains files, you must set Append mode to true. The screenshot below is from the official documentation:
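Conceptually, an incremental read just filters the source table by the check column against the stored last value. A minimal sketch of that logic in plain Python (the table rows and helper name are made up for illustration, not Sqoop2 API):

```python
# Simulate Sqoop2's incremental read: only rows whose check column
# ("id" here) is greater than the stored "Last value" are pulled.
rows = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "carol"},
    {"id": 4, "name": "dave"},   # inserted after the full sync
]

def incremental_read(rows, check_column, last_value):
    """Return rows where check_column > last_value, plus the new last value."""
    new_rows = [r for r in rows if r[check_column] > last_value]
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last

new_rows, new_last = incremental_read(rows, "id", 3)
print(len(new_rows), new_last)  # only row 4 qualifies; the new last value is 4
```

This is the same effect as appending `WHERE id > 3` to the extraction query.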
1. Full synchronization
The table contains 3 rows, as shown below:
1.1 Create the job
sqoop:000> create job -f PostgreSQL_Link -t HDFS_Link
Creating job for links with from name PostgreSQL_Link and to name HDFS_Link
Please fill following values to create new job object
Name: Web_PSQL_To_HDFS
Database source
Schema name: public
Table name: web_customer
SQL statement:
Column names:
There are currently 0 values in the list:
element#
Partition column: id
Partition column nullable:
Boundary query:
Incremental read
Check column:
Last value:
Target configuration
Override null value:
Null value:
File format:
0 : TEXT_FILE
1 : SEQUENCE_FILE
2 : PARQUET_FILE
Choose: 0
Compression codec:
0 : NONE
1 : DEFAULT
2 : DEFLATE
3 : GZIP
4 : BZIP2
5 : LZO
6 : LZ4
7 : SNAPPY
8 : CUSTOM
Choose: 1
Custom codec:
Output directory: /user/DataSource/PostgreSQL/WEB_CUSTOMER
Append mode:
Throttling resources
Extractors:
Loaders:
Classpath configuration
Extra mapper jars:
There are currently 0 values in the list:
element#
New job was successfully created with validation status OK and name Web_PSQL_To_HDFS
1.2 Run the job
sqoop:000> start job -s -n Web_PSQL_To_HDFS
Submission details
Job Name: Web_PSQL_To_HDFS
Server URL: http://localhost:12000/sqoop/
Created by: hadoop
Creation date: 2017-12-25 10:10:39 CST
Lastly updated by: hadoop
External ID: job_1513929521869_0014
http://hdp04:8088/proxy/application_1513929521869_0014/
2017-12-25 10:10:39 CST: BOOTING - Progress is not available
2017-12-25 10:10:51 CST: RUNNING - 0.00 %
2017-12-25 10:11:01 CST: RUNNING - 35.00 %
2017-12-25 10:11:14 CST: SUCCEEDED
Counters:
org.apache.hadoop.mapreduce.FileSystemCounter
FILE_LARGE_READ_OPS: 0
FILE_WRITE_OPS: 0
HDFS_READ_OPS: 10
HDFS_BYTES_READ: 1501
HDFS_LARGE_READ_OPS: 0
FILE_READ_OPS: 0
FILE_BYTES_WRITTEN: 3420790
FILE_BYTES_READ: 0
HDFS_WRITE_OPS: 10
HDFS_BYTES_WRITTEN: 297
org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
BYTES_WRITTEN: 0
org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
BYTES_READ: 0
org.apache.hadoop.mapreduce.JobCounter
TOTAL_LAUNCHED_MAPS: 10
MB_MILLIS_MAPS: 115728384
VCORES_MILLIS_MAPS: 56508
SLOTS_MILLIS_MAPS: 56508
OTHER_LOCAL_MAPS: 10
MILLIS_MAPS: 56508
org.apache.sqoop.submission.counter.SqoopCounters
ROWS_READ: 3 --3 rows read
ROWS_WRITTEN: 3 --3 rows written
org.apache.hadoop.mapreduce.TaskCounter
SPILLED_RECORDS: 0
MERGED_MAP_OUTPUTS: 0
VIRTUAL_MEMORY_BYTES: 41340801024
MAP_INPUT_RECORDS: 0
SPLIT_RAW_BYTES: 1501
MAP_OUTPUT_RECORDS: 3
FAILED_SHUFFLE: 0
PHYSICAL_MEMORY_BYTES: 3848200192
GC_TIME_MILLIS: 2249
CPU_MILLISECONDS: 32280
COMMITTED_HEAP_BYTES: 3965714432
Job executed successfully
The web_customer table contains only 3 rows, and all of them were synchronized successfully.
2. Incremental synchronization
2.1 Insert new data into the table
Row 4 is the newly inserted record, as shown below:
2.2 Create the job
sqoop:000> create job -f PostgreSQL_Link -t HDFS_Link
Creating job for links with from name PostgreSQL_Link and to name HDFS_Link
Please fill following values to create new job object
Name: Web_PSQL_To_HDFS_I
Database source
Schema name: public
Table name: web_customer
SQL statement:
Column names:
There are currently 0 values in the list:
element#
Partition column: id
Partition column nullable:
Boundary query:
Incremental read
Check column: id --the column whose value drives the incremental read
Last value: 3 --the id of the last row written by the full sync
Target configuration
Override null value:
Null value:
File format:
0 : TEXT_FILE
1 : SEQUENCE_FILE
2 : PARQUET_FILE
Choose: 0
Compression codec:
0 : NONE
1 : DEFAULT
2 : DEFLATE
3 : GZIP
4 : BZIP2
5 : LZO
6 : LZ4
7 : SNAPPY
8 : CUSTOM
Choose: 1
Custom codec:
Output directory: /user/DataSource/PostgreSQL/WEB_CUSTOMER
Append mode: true --must be true to write into the same directory as the full sync; otherwise the job fails because the output directory already contains files.
Throttling resources
Extractors:
Loaders:
Classpath configuration
Extra mapper jars:
There are currently 0 values in the list:
element#
New job was successfully created with validation status OK and name Web_PSQL_To_HDFS_I
2.3 Run the job
sqoop:000> start job -s -n Web_PSQL_To_HDFS_I
Submission details
Job Name: Web_PSQL_To_HDFS_I
Server URL: http://localhost:12000/sqoop/
Created by: hadoop
Creation date: 2017-12-25 10:43:47 CST
Lastly updated by: hadoop
External ID: job_1513929521869_0015
http://hdp04:8088/proxy/application_1513929521869_0015/
2017-12-25 10:43:47 CST: BOOTING - Progress is not available
2017-12-25 10:43:59 CST: RUNNING - 0.00 %
2017-12-25 10:44:09 CST: RUNNING - 45.00 %
2017-12-25 10:44:23 CST: SUCCEEDED
Counters:
org.apache.hadoop.mapreduce.FileSystemCounter
FILE_LARGE_READ_OPS: 0
FILE_WRITE_OPS: 0
HDFS_READ_OPS: 10
HDFS_BYTES_READ: 1351
HDFS_LARGE_READ_OPS: 0
FILE_READ_OPS: 0
FILE_BYTES_WRITTEN: 3423100
FILE_BYTES_READ: 0
HDFS_WRITE_OPS: 10
HDFS_BYTES_WRITTEN: 129
org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
BYTES_WRITTEN: 0
org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
BYTES_READ: 0
org.apache.hadoop.mapreduce.JobCounter
TOTAL_LAUNCHED_MAPS: 10
MB_MILLIS_MAPS: 113948672
VCORES_MILLIS_MAPS: 55639
SLOTS_MILLIS_MAPS: 55639
OTHER_LOCAL_MAPS: 10
MILLIS_MAPS: 55639
org.apache.sqoop.submission.counter.SqoopCounters
ROWS_READ: 1 --one new row read
ROWS_WRITTEN: 1 --one new row written
org.apache.hadoop.mapreduce.TaskCounter
SPILLED_RECORDS: 0
MERGED_MAP_OUTPUTS: 0
VIRTUAL_MEMORY_BYTES: 41336942592
MAP_INPUT_RECORDS: 0
SPLIT_RAW_BYTES: 1351
MAP_OUTPUT_RECORDS: 1
FAILED_SHUFFLE: 0
PHYSICAL_MEMORY_BYTES: 4070748160
GC_TIME_MILLIS: 2363
CPU_MILLISECONDS: 31730
COMMITTED_HEAP_BYTES: 4415553536
Job executed successfully
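Only the new row (id = 4) was read and appended. The key invariant across runs is that each run's last value becomes the starting point for the next run. The two runs in this article can be sketched as a hypothetical Python loop (illustration only, not the Sqoop2 API):

```python
# Sketch of the two runs above: a full sync (3 rows), a new insert,
# then an incremental sync that appends only the new row.
source = [{"id": i} for i in (1, 2, 3)]
target = []

def sync(source, target, last_value):
    """Copy rows with id > last_value into target (append mode)."""
    new = [r for r in source if r["id"] > last_value]
    target.extend(new)              # append, never overwrite
    return max((r["id"] for r in target), default=last_value)

last = sync(source, target, 0)      # full sync: ROWS_WRITTEN = 3
source.append({"id": 4})            # step 2.1: insert a new row
last = sync(source, target, last)   # incremental sync: ROWS_WRITTEN = 1
print(len(target), last)            # 4 rows in target, last value is now 4
```

After this round, a further incremental run starting from last value 4 would read 0 rows until new data is inserted.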