What Sqoop does:
Sqoop is a data-exchange tool; its two most common operations are import and export.
Import and export are defined relative to Hadoop: moving data into Hadoop is an import.
Sqoop currently has two incompatible lines: 1.4.x is Sqoop 1 and 1.99.x is Sqoop 2. Sqoop 1 is what is generally used.
Sqoop installation
1 Download the sqoop-1.4.6-cdh5.7.0.tar.gz package
http://archive.cloudera.com/cdh5/cdh/5/
This guide uses the sqoop-1.4.6-cdh5.7.0.tar.gz version.
2 Upload it to the Linux host and unpack it
[hadoop@hadoop001 software]$ tar -xzvf sqoop-1.4.6-cdh5.7.0.tar.gz -C ~/app
If you unpack as a user other than hadoop, remember to fix the owner and group afterwards.
3 Sqoop directory layout:
[hadoop@hadoop000 app]$ cd sqoop-1.4.6-cdh5.7.0/
[hadoop@hadoop000 sqoop-1.4.6-cdh5.7.0]$ ll
total 1880
drwxr-xr-x. 2 hadoop hadoop    4096 Jun 25 10:10 bin        (scripts)
-rw-rw-r--. 1 hadoop hadoop   60569 Mar 24  2016 build.xml
-rw-rw-r--. 1 hadoop hadoop    1126 Mar 24  2016 cdh.build.properties
-rw-rw-r--. 1 hadoop hadoop   35614 Mar 24  2016 CHANGELOG.txt
drwxr-xr-x. 4 hadoop hadoop    4096 Jun 25 10:10 cloudera
-rw-rw-r--. 1 hadoop hadoop    6901 Mar 24  2016 cloudera-pom.xml
-rw-rw-r--. 1 hadoop hadoop    9880 Mar 24  2016 COMPILING.txt
drwxr-xr-x. 2 hadoop hadoop    4096 Jun 25 10:10 conf       (configuration files)
drwxr-xr-x. 5 hadoop hadoop    4096 Jun 25 10:10 docs       (documentation)
drwxr-xr-x. 2 hadoop hadoop    4096 Jun 25 10:10 ivy
-rw-rw-r--. 1 hadoop hadoop   17386 Mar 24  2016 ivy.xml
drwxr-xr-x. 2 hadoop hadoop    4096 Jun 25 10:10 lib        (dependency jars)
-rw-rw-r--. 1 hadoop hadoop   15319 Mar 24  2016 LICENSE.txt
-rw-rw-r--. 1 hadoop hadoop     505 Mar 24  2016 NOTICE.txt
-rw-rw-r--. 1 hadoop hadoop   18772 Mar 24  2016 pom-old.xml
-rw-rw-r--. 1 hadoop hadoop    1096 Mar 24  2016 README.txt
-rw-rw-r--. 1 hadoop hadoop 1035867 Mar 24  2016 sqoop-1.4.6-cdh5.7.0.jar        (the main jar)
-rw-rw-r--. 1 hadoop hadoop    6554 Mar 24  2016 sqoop-patch-review.py
-rw-rw-r--. 1 hadoop hadoop  656201 Mar 24  2016 sqoop-test-1.4.6-cdh5.7.0.jar   (test jar)
drwxr-xr-x. 7 hadoop hadoop    4096 Mar 24  2016 src        (source code)
drwxr-xr-x. 4 hadoop hadoop    4096 Jun 25 10:10 testdata
If you unpacked as another user, fix ownership recursively:
[hadoop@hadoop000 app]$ sudo chown -R hadoop:hadoop sqoop-1.4.6-cdh5.7.0
4 Configure environment variables
[hadoop@hadoop001 ~]$ vim ~/.bash_profile
Add the following:
export SQOOP_HOME=/home/hadoop/app/sqoop-1.4.6-cdh5.7.0
export PATH=$SQOOP_HOME/bin:$PATH
Then reload the profile so the changes take effect:
[hadoop@hadoop001 ~]$ source ~/.bash_profile
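Why prepending $SQOOP_HOME/bin to PATH makes the sqoop script resolvable can be sketched with a stand-in directory (the /tmp/sqoop-demo path below is purely illustrative, not part of the real install):

```shell
# Minimal sketch of PATH resolution: a directory prepended to PATH
# wins the lookup for any executable it contains.
SQOOP_HOME=/tmp/sqoop-demo                       # stand-in for the real install dir
mkdir -p "$SQOOP_HOME/bin"
printf '#!/bin/sh\necho "demo sqoop"\n' > "$SQOOP_HOME/bin/sqoop"
chmod +x "$SQOOP_HOME/bin/sqoop"
export PATH="$SQOOP_HOME/bin:$PATH"
command -v sqoop                                 # resolves to /tmp/sqoop-demo/bin/sqoop
```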
5 Copy sqoop/conf/sqoop-env-template.sh to sqoop-env.sh and add the following settings
[hadoop@hadoop001 conf]$ cp sqoop-env-template.sh sqoop-env.sh
export HADOOP_COMMON_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0   (Hadoop install directory)
export HADOOP_MAPRED_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0   (MapReduce directory)
export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.7.0              (Hive directory)
export HIVE_CONF_DIR=/home/hadoop/app/hive-1.1.0-cdh5.7.0/conf     (Hive conf directory; without this setting, importing from MySQL into Hive fails with an error)
6 Copy the MySQL JDBC driver mysql-connector-java-5.1.27-bin.jar and java-json.jar into the sqoop/lib directory
(Here I simply reuse the driver already bundled under Hive's lib. Note that an overly new MySQL driver version can also cause all sorts of problems.)
[hadoop@hadoop001 lib]$ cp /home/hadoop/app/hive-1.1.0-cdh5.7.0/lib/mysql-connector-java-5.1.27-bin.jar .
[hadoop@hadoop001 lib]$ cp /home/hadoop/software/java-json.jar .
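A quick sanity check that both jars actually landed in Sqoop's lib directory; the stand-in directory below only exists so the sketch is self-contained:

```shell
# Sketch: count the two expected jars in the lib directory.
# /tmp/sqoop-demo-lib stands in for ~/app/sqoop-1.4.6-cdh5.7.0/lib.
LIB=/tmp/sqoop-demo-lib
mkdir -p "$LIB"
touch "$LIB/mysql-connector-java-5.1.27-bin.jar" "$LIB/java-json.jar"
ls "$LIB" | grep -cE 'mysql-connector|java-json'   # prints 2 when both are present
```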
To avoid errors later when importing MySQL table data into Hive, it is best to also copy these Hive jars (run from Hive's lib directory):
[hadoop@hadoop001 lib]$ cp hive-common-1.1.0-cdh5.7.0.jar ~/app/sqoop-1.4.6-cdh5.7.0/lib/
[hadoop@hadoop001 lib]$ cp hive-shims-*.jar ~/app/sqoop-1.4.6-cdh5.7.0/lib/
[hadoop@hadoop001 lib]$ ll hive*
-rw-r--r-- 1 hadoop hadoop 301470 Jul 23 23:51 hive-common-1.1.0-cdh5.7.0.jar
-rw-r--r-- 1 hadoop hadoop 58263 Jul 23 23:46 hive-shims-0.23-1.1.0-cdh5.7.0.jar
-rw-r--r-- 1 hadoop hadoop 9147 Jul 23 23:46 hive-shims-1.1.0-cdh5.7.0.jar
-rw-r--r-- 1 hadoop hadoop 108292 Jul 23 23:46 hive-shims-common-1.1.0-cdh5.7.0.jar
-rw-r--r-- 1 hadoop hadoop 13093 Jul 23 23:46 hive-shims-scheduler-1.1.0-cdh5.7.0.jar
[hadoop@hadoop001 lib]$
7. Check that the installation succeeded
[hadoop@hadoop001 ~]$ sqoop version    (print Sqoop's version)
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
19/07/28 17:09:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.7.0
Sqoop 1.4.6-cdh5.7.0
git commit id
Compiled by jenkins on Wed Mar 23 11:30:51 PDT 2016
[hadoop@hadoop001 ~]$
Using Sqoop
1. First check whether sqoop starts without errors:
[hadoop@hadoop001 bin]$ sqoop
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
Try 'sqoop help' for usage.    (it suggests running sqoop help)
These warnings appear only because HBase, ZooKeeper, and HCatalog are not configured; they are harmless here.
2. sqoop help lists the available commands
[hadoop@hadoop001 ~]$ sqoop help
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
19/07/28 17:16:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.7.0
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
3. List all databases in MySQL
sqoop list-databases \
--connect jdbc:mysql://localhost:3306 \
--username root \
--password root
When splitting a command across lines like this, make absolutely sure there is no space after the line-continuation backslash, otherwise the command fails.
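The examples here pass the password in plaintext on the command line for simplicity. Sqoop 1.4.x also accepts -P (interactive prompt) and --password-file; a sketch of preparing a password file (the path and password are illustrative):

```shell
# Write the password without a trailing newline and lock the file down;
# Sqoop reads the file's exact bytes as the password.
printf '%s' 'root' > /tmp/sqoop.pwd
chmod 400 /tmp/sqoop.pwd
stat -c '%a' /tmp/sqoop.pwd    # prints 400
# Then, instead of --password root:
#   sqoop list-databases \
#   --connect jdbc:mysql://localhost:3306 \
#   --username root \
#   --password-file file:///tmp/sqoop.pwd
```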
4. List all tables in the ruozedata database in MySQL
sqoop list-tables \
--connect jdbc:mysql://localhost:3306/ruozedata \
--username root \
--password root
5. Importing data from a MySQL table into Hadoop
Import the city_info table from MySQL's ruozedata database into HDFS:
sqoop import \
--connect jdbc:mysql://localhost:3306/ruozedata \
--password root \
--username root \
--table "city_info" \
--mapreduce-job-name MySQL2HDFS_EMP \
--delete-target-dir \
--target-dir person_mysql \
--columns city_name,area \
--fields-terminated-by '\t' \
-m 1
Option notes:
--table: the MySQL table to read data from
--mapreduce-job-name: a name for the MapReduce job
--delete-target-dir: a rerun of the same sqoop command would otherwise fail because the person_mysql directory already exists; this deletes it first
--target-dir: the HDFS output directory
--columns: the columns to import
--fields-terminated-by: the field delimiter for the files written to HDFS
-m: the number of map tasks (Sqoop's default is 4; override it with -m)
[hadoop@hadoop001 shell]$ hadoop fs -cat /user/hadoop/person_mysql/part-m-00000
19/07/28 17:36:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
BEIJING NC
SHANGHAI EC
NANJING EC
GUANGZHOU SC
SANYA SC
WUHAN CC
CHANGSHA CC
XIAN NW
CHENGDU SW
HAERBIN NE
[hadoop@hadoop001 shell]$
6. Importing data from a MySQL table into Hive
sqoop import \
--connect jdbc:mysql://localhost:3306/ruozedata \
--password root \
--username root \
--table "city_info" \
--mapreduce-job-name city_info \
--delete-target-dir \
--hive-database default \
--hive-table city_info \
--hive-import \
--hive-overwrite \
--fields-terminated-by '\t' \
--split-by city_id \
-m 2
Option notes:
--hive-database: the Hive database to import into
--hive-table: the Hive table to import into
--hive-import: the switch that enables importing into Hive; it is off by default and must be given
--hive-overwrite: overwrite the Hive table's contents
--fields-terminated-by: the field delimiter
--split-by: the column used to split the work; required when the MySQL table has no primary key
Note: it is best to create this Hive table manually beforehand, matching the MySQL table's schema, rather than letting sqoop import --create-hive-table create it.
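A sketch of pre-creating that Hive table. The column names and types are assumptions inferred from the columns used above (city_id, city_name, area); adjust them to your actual MySQL schema:

```shell
# Hypothetical DDL for the assumed city_info schema; the delimiter
# matches the --fields-terminated-by '\t' used in the import.
hive -e "
CREATE TABLE IF NOT EXISTS default.city_info (
  city_id   INT,
  city_name STRING,
  area      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
"
```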
7. Exporting data from a Hive table into MySQL
sqoop export \
--connect jdbc:mysql://localhost:3306/ruozedata \
--password root \
--username root \
--mapreduce-job-name area_product_click_count_top3 \
--table area_product_click_count_top3 \
--export-dir /user/hive/warehouse/area_product_click_count_top3 \
--columns "day,product_id,click_sum,product_name,area,r" \
--fields-terminated-by '\t' \
-m 2
Option notes:
--table: the target table area_product_click_count_top3 in MySQL's ruozedata database
--export-dir: the HDFS directory where the Hive table area_product_click_count_top3 is stored
--columns: the columns to export
--fields-terminated-by: the field delimiter of the source files
Notes:
1. The import fails with
19/07/23 23:34:59 ERROR tool.ImportTool: Error during import: No primary key could be found for table city_info. Please specify one with --split-by or perform a sequential import with '-m 1'.
This happens when the MySQL table has no primary key and no --split-by was given.
If -m is greater than 1, do one of the following: change -m to 1, add --split-by to name the column to split on, or add a primary key to the MySQL table.
2. The import fails with
19/07/23 23:23:58 INFO hive.HiveImport: Loading uploaded data into Hive
19/07/23 23:23:58 ERROR hive.HiveConfig: Could not load org.apache.hadoop.hive.conf.HiveConf. Make sure HIVE_CONF_DIR is set correctly.
19/07/23 23:23:58 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
Fix:
1. Set HIVE_CONF_DIR in $SQOOP_HOME/conf/sqoop-env.sh:
export HIVE_CONF_DIR=/home/hadoop/app/hive-1.1.0-cdh5.7.0/conf
2. Copy the Hive jars into Sqoop's lib directory (run the cp commands from Hive's lib directory):
[hadoop@hadoop001 lib]$ cp hive-common-1.1.0-cdh5.7.0.jar ~/app/sqoop-1.4.6-cdh5.7.0/lib/
[hadoop@hadoop001 lib]$ cp hive-shims-*.jar ~/app/sqoop-1.4.6-cdh5.7.0/lib/
The destination directory:
[hadoop@hadoop001 lib]$ pwd
/home/hadoop/app/sqoop-1.4.6-cdh5.7.0/lib
This resolves the problem.