Sqoop Installation

Download Sqoop

Download the package:
wget https://mirrors.tuna.tsinghua.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
Extract it into /usr/sqoop (the -C target must already exist, so create it first):
mkdir -p /usr/sqoop
tar xzvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /usr/sqoop --strip-components=1
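Optionally, put Sqoop on the PATH (a minimal sketch, assuming the /usr/sqoop install directory used throughout this guide); append to /etc/profile or ~/.bashrc and re-source it:

export SQOOP_HOME=/usr/sqoop
export PATH=$PATH:$SQOOP_HOME/bin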

Configure the Sqoop installation
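
The distribution ships a configuration template; create sqoop-env.sh from it first:

cd /usr/sqoop/conf
cp sqoop-env-template.sh sqoop-env.sh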

Then edit sqoop-env.sh:

# Set HADOOP_HOME to point to a specific hadoop install directory
# HADOOP_HOME=${bin}/../../hadoop
# Hadoop installation directory
HADOOP_HOME=/usr/hadoop273

# Hive Configuration Directory can be controlled by:
# export HIVE_CONF_DIR=
# Hive configuration directory (add this when using Hive)
export HIVE_CONF_DIR=/usr/hive/conf

# Folder containing extra libraries required for hive compilation/execution can be controlled by:
# Directory holding extra Hive jars
# export HIVE_AUX_JARS_PATH=

Copy the MySQL JDBC driver into Sqoop's classpath

cp mysql-connector-java-5.1.47.jar /usr/sqoop/lib/
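
If the driver jar is not already on the machine, the same version can be fetched from Maven Central (URL for 5.1.47, matching the cp above):

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.47/mysql-connector-java-5.1.47.jar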

Copy the Hive dependencies into Sqoop's classpath

cp /usr/hive/lib/hive-common-3.1.2.jar  /usr/sqoop/lib/

Verification
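
Before running any imports, a quick smoke test confirms that Sqoop starts and can reach MySQL through the JDBC driver (MySQL_HOSTNAME, mysql_user and mysql_password are placeholders, as elsewhere in this guide):

sqoop version
sqoop list-databases --connect jdbc:mysql://MySQL_HOSTNAME/?useSSL=false \
  --username mysql_user --password mysql_password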

Sqoop incremental extraction of a MySQL table

sqoop import --connect jdbc:mysql://MySQL_HOSTNAME/db_name?useSSL=false \
  --username root \
  --password password_str \
  --query 'select * from table_name WHERE id > 3 AND $CONDITIONS' \
  --split-by gai_model_params.model_id \
  --append \
  --target-dir /user/jerry/sqoop \
  -m 8

Note: with --query, Sqoop requires the literal token $CONDITIONS in the WHERE clause, joined with AND; keep the single quotes so the shell does not expand it, and each map task substitutes its own range predicate for it.

sqoop import --connect jdbc:mysql://MySQL_HOSTNAME:MySQL_PORT/db_name?useSSL=false \
  --username mysql_user \
  --password mysql_password \
  --query 'select * from table_name WHERE id > 3 and no = 3 and $CONDITIONS' \
  --split-by split_id \
  --append \
  --target-dir /user/jerry/sqoop \
  -m 8

  • --split-by: a table column used to partition the rows across the map tasks
  • --target-dir: the HDFS directory that receives the data
  • -m: number of map tasks; raising the parallelism (sized against the split column) improves throughput
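
The two queries above implement incremental extraction by hand with a WHERE filter. Sqoop also has a built-in incremental mode that tracks the high-water mark for you; a minimal sketch, assuming an auto-increment id column:

sqoop import --connect jdbc:mysql://MySQL_HOSTNAME/db_name?useSSL=false \
  --username mysql_user \
  --password mysql_password \
  --table table_name \
  --incremental append \
  --check-column id \
  --last-value 3 \
  --target-dir /user/jerry/sqoop \
  -m 8

On completion Sqoop logs the new --last-value to pass to the next run; saving the command as a sqoop job persists the high-water mark automatically.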

Sqoop import of all tables in a MySQL database to HDFS

  • Command
sqoop import-all-tables --connect jdbc:mysql://MySQL_HOSTNAME/db_name?useSSL=false \
  --direct \
  --username mysql_user \
  --password mysql_password \
  --warehouse-dir /user/jerry/db \
  -m 1

  • --direct: bypass JDBC and use MySQL's own dump utility for a faster transfer
  • --warehouse-dir: parent HDFS directory; each table is imported into its own subdirectory
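
To check the result, list the warehouse directory; one subdirectory per table should appear, containing part-m-* files:

hdfs dfs -ls /user/jerry/db
hdfs dfs -cat /user/jerry/db/table_name/part-m-00000 | head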

Sqoop import of a MySQL table to Hive

sqoop import --connect jdbc:mysql://MySQL_HOSTNAME/db_name?useSSL=false \
  --table table_name \
  --username mysql_user \
  --password mysql_password \
  --hive-import \
  --hive-database hivedb \
  --create-hive-table \
  --hive-table table_name \
  --target-dir /user/jerry/hivedb \
  --split-by mysql_table_column \
  -m 4

  • --create-hive-table: the job fails if the Hive table already exists; drop the flag to load into an existing table
  • --split-by: a table column used to partition the rows across the map tasks
  • -m: number of map tasks, sized against the split column
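
A quick check from the Hive CLI (database and table names as above):

hive -e 'use hivedb; show tables; select * from table_name limit 5;'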

Sqoop import of all tables in a MySQL database to Hive

sqoop import-all-tables --connect jdbc:mysql://MySQL_HOSTNAME/db_name?useSSL=false \
  --username mysql_user \
  --password mysql_password \
  --direct \
  --hive-import \
  --hive-database alldb \
  --warehouse-dir /user/jerry/alldb \
  -m 4

  • --direct: bypass JDBC and use MySQL's own dump utility
  • --hive-database: the Hive database that receives the tables
  • --warehouse-dir: HDFS staging directory for the MySQL data
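
With -m 4, every table needs a primary key (see error (1) below); tables that lack one can be skipped with --exclude-tables, e.g. for the gai_model_label table from the error log:

sqoop import-all-tables --connect jdbc:mysql://MySQL_HOSTNAME/db_name?useSSL=false \
  --username mysql_user \
  --password mysql_password \
  --hive-import \
  --hive-database alldb \
  --warehouse-dir /user/jerry/alldb \
  --exclude-tables gai_model_label \
  -m 4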

Common errors:

  • (1) An RDBMS table has no primary key
ERROR tool.ImportAllTablesTool: 
Error during import: 
No primary key could be found for table gai_model_label. Please specify one with --split-by or perform a sequential import with '-m 1'.
  • Cause
Some of the MySQL tables being imported into HDFS have no primary key while the Sqoop job's parallelism is greater than 1, so Sqoop cannot decide how to split the rows across map tasks.
  • Fix (a sketch of both options follows the list)
1. Lower the parallelism to -m 1 (sequential import)
2. Add a primary key to the offending MySQL table, or point --split-by at an existing column
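
A sketch against the gai_model_label table from the error message (the split column label_id is a hypothetical name; substitute a real, evenly distributed column):

# Option 1: sequential import, no splitting required
sqoop import --connect jdbc:mysql://MySQL_HOSTNAME/db_name?useSSL=false \
  --username mysql_user --password mysql_password \
  --table gai_model_label \
  --target-dir /user/jerry/sqoop/gai_model_label \
  -m 1

# Option 2: keep the parallelism and split on an existing column (label_id assumed)
sqoop import --connect jdbc:mysql://MySQL_HOSTNAME/db_name?useSSL=false \
  --username mysql_user --password mysql_password \
  --table gai_model_label \
  --split-by label_id \
  --target-dir /user/jerry/sqoop/gai_model_label \
  -m 4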
  • (2) The MapReduce shuffle auxiliary service is not configured on the YARN NodeManagers
INFO mapreduce.Job: Task Id : attempt_1461808335315_0001_m_000000_1, Status : FAILED
Container launch failed for container_1461808335315_0001_01_000003 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
        at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
  • Fix
Add the following to yarn-site.xml under $HADOOP_HOME/etc/hadoop and restart the NodeManagers:

<property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
</property>
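
Depending on the Hadoop version, the shuffle handler class may also need to be declared explicitly:

<property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>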
  • (3) Dropping a Hive database
1. Drop an empty database:
drop database db_name;
2. Drop a non-empty database (cascade drops its tables as well):
drop database db_name cascade;