Original post: http://blog.csdn.net/nsrainbow/article/details/41649671
Continuing from the previous lesson, this post shows how to use Sqoop to import MySQL data into HBase or Hive.
Data preparation
MySQL
Create the employee table in MySQL and insert some data:
CREATE TABLE `employee` (
`id` int(11) NOT NULL,
`name` varchar(20) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
insert into employee (id,name) values (1,'michael');
insert into employee (id,name) values (2,'ted');
insert into employee (id,name) values (3,'jack');
HBase
hbase(main):006:0> create 'employee','info'
0 row(s) in 0.4440 seconds
=> Hbase::Table - employee
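If you want to double-check the table and its column family before importing, a quick sanity check (not part of the original walkthrough) is the describe command in the hbase shell:
describe 'employee'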
Hive
No data preparation is needed here; later on the --create-hive-table option will create the table automatically.
Importing from MySQL into HBase
# sqoop import --connect jdbc:mysql://host1:3306/sqoop_test --username root --password root --table employee --hbase-table employee --column-family info --hbase-row-key id -m 1
- --table the source table in MySQL
- --hbase-table the target HBase table
- --column-family the column family to import into
- --hbase-row-key the column whose value becomes the HBase row key
- See http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_importing_data_into_hbase for the full list of options; a variant that lets Sqoop create the HBase table itself is sketched below
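If you would rather not create the HBase table by hand beforehand, Sqoop also has an --hbase-create-table flag that creates the missing table for you. A sketch of the same import with that flag added (same example host, credentials and table names as above):
# sqoop import --connect jdbc:mysql://host1:3306/sqoop_test --username root --password root --table employee --hbase-table employee --hbase-create-table --column-family info --hbase-row-key id -m 1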
…………
); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
15/01/23 07:56:49 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
15/01/23 07:57:16 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 1,315.279 seconds (0 bytes/sec)
15/01/23 07:57:16 INFO mapreduce.ImportJobBase: Retrieved 3 records.
At the end the log reports that 3 records were imported successfully.
Now check HBase:
hbase(main):001:0> scan 'employee'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
ROW COLUMN+CELL
1 column=info:name, timestamp=1417426628685, value=michael
2 column=info:name, timestamp=1417426628685, value=ted
3 column=info:name, timestamp=1417426628685, value=jack
3 row(s) in 0.1630 seconds
All 3 rows were imported successfully.
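You can also spot-check a single row by its key in the hbase shell; with this data set, fetching row key 1 should come back with info:name = michael:
get 'employee','1'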
Importing from MySQL into Hive
# sqoop import --connect jdbc:mysql://host1:3306/sqoop_test --username root --password root --table employee --hive-import --hive-table hive_employee --create-hive-table
- --hive-import marks this as a Hive import
- --hive-table the target Hive table
- --create-hive-table create the Hive table automatically as part of the import (see the note after this list)
- See http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_importing_data_into_hive for the full list of options
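Note that --create-hive-table makes the job fail if hive_employee already exists, so it only makes sense on the first run. If you need to re-load the table later, one option (a sketch, not part of the original walkthrough) is to drop --create-hive-table and add --hive-overwrite, which replaces the data already in the table:
# sqoop import --connect jdbc:mysql://host1:3306/sqoop_test --username root --password root --table employee --hive-import --hive-table hive_employee --hive-overwrite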
One more reminder: do not use localhost in the MySQL JDBC connection string. The import job is distributed to different Hadoop nodes, and every one of those nodes must actually be able to reach MySQL over JDBC; otherwise you will lose data.
Check Hive:
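A common way to make MySQL reachable from every node is to grant the account access from remote hosts instead of only from localhost, for example (a sketch assuming MySQL 5.x syntax, and that opening the account to '%' is acceptable in your environment):
GRANT ALL PRIVILEGES ON sqoop_test.* TO 'root'@'%' IDENTIFIED BY 'root';
FLUSH PRIVILEGES;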
hive> select * from hive_employee;
OK
1 michael
2 ted
3 jack
Time taken: 0.179 seconds, Fetched: 3 row(s)
One more caveat: at the moment Sqoop cannot import data into a Hive table that was created on top of HBase.
Importing from MySQL into a partitioned Hive table
Importing into a partitioned table is a bit more involved.
Step 1
Create the people table and insert some data:
CREATE TABLE `people` (
`id` int(11) NOT NULL,
`name` varchar(20) NOT NULL,
`year` varchar(10),
`month` varchar(10),
`day` varchar(10),
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
insert into people values (1,'jack','2015','01','02');
insert into people values (2,'ted','2015','01','02');
insert into people values (3,'billy','2015','01','02');
insert into people values (4,'sara','2015','01','03');
Step 2
sqoop import --connect jdbc:mysql://xmseapp03:3306/sqoop_test --username sqoop --password sqoop --query 'select id,name from people where year="2015" AND $CONDITIONS' --direct -m 2 --split-by id --hive-import --create-hive-table --hive-table hive_people --target-dir /user/hive_people --hive-partition-key year --hive-partition-value '2015'
- --query the SQL to run; the AND $CONDITIONS clause must not be omitted, Sqoop needs it (see the sketch after this list)
- --split-by the column to split the work on, normally the primary key
- --hive-partition-key the partition column of the Hive table, --hive-partition-value the partition value to load into
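At run time Sqoop replaces $CONDITIONS with the id range assigned to each mapper, which is how the two mappers (-m 2, --split-by id) divide the query between them. Each partition value needs its own run, so to load another year into its own partition you would repeat the import with a different query, partition value and target dir, and without --create-hive-table since the table now exists. A purely hypothetical sketch for a year 2016 that is not in the sample data:
sqoop import --connect jdbc:mysql://xmseapp03:3306/sqoop_test --username sqoop --password sqoop --query 'select id,name from people where year="2016" AND $CONDITIONS' --direct -m 2 --split-by id --hive-import --hive-table hive_people --target-dir /user/hive_people_2016 --hive-partition-key year --hive-partition-value '2016'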
After the import succeeds, check the data:
hive> select * from hive_people;
OK
1 jack 2015
2 ted 2015
3 billy 2015
4 sara 2015
Time taken: 1.041 seconds, Fetched: 4 row(s)
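You can also confirm that the rows really went into the year=2015 partition:
hive> show partitions hive_people;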
Now look at how the data is stored in HDFS:
[root@host1 ~]$ hdfs dfs -cat /user/hive/warehouse/hive_people/year=2015/part-m-00000
1jack
2ted
[root@host1 ~]$ hdfs dfs -cat /user/hive/warehouse/hive_people/year=2015/part-m-00001
3billy
4sara
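The id and name look glued together because Hive's default field delimiter is the non-printing \001 (Ctrl-A) character, so the separator simply does not show up in the terminal. To make it visible you can pipe the file through tr:
[root@host1 ~]$ hdfs dfs -cat /user/hive/warehouse/hive_people/year=2015/part-m-00000 | tr '\001' '\t'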