1. Data sources and ingestion
RDBMS (Oracle, MySQL, DB2, ...) -> Sqoop (SQL-to-Hadoop)
Files (Apache/Nginx log data) -> Flume (real-time collection)
2. Task scheduling
There are many data-analysis jobs to run.
Each job has an execution time, an execution frequency, and dependencies on other jobs (a workflow).
The Oozie framework schedules and runs these jobs.
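As a sketch of how Oozie is driven: a workflow app deployed to HDFS is submitted through the Oozie CLI. The server URL and the job.properties file below are hypothetical.

```shell
# Submit and start a workflow job; job.properties points at the
# workflow application directory on HDFS. URL/host are hypothetical.
oozie job -oozie http://hadoop-senior01.zhangbk.com:11000/oozie \
  -config job.properties -run
```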
3. Hadoop ecosystem monitoring
A unified web UI for managing and monitoring the frameworks in the stack.
Hue framework
4. Sqoop overview
Sqoop treats Hadoop as the subject and the RDBMS as the counterpart.
sqoop import
imports RDBMS data into Hadoop
sqoop export
exports Hadoop data to an RDBMS
Sqoop depends on Hadoop;
one side of every transfer is stored on HDFS.
5. Sqoop installation
vi sqoop-env.sh
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/hadoop-2.5.0-cdh5.3.6
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/hadoop-2.5.0-cdh5.3.6
#set the path to where bin/hbase is available
export HBASE_HOME=
#Set the path to where bin/hive is available
export HIVE_HOME=/opt/hive-0.13.1-cdh5.3.6
#Set the path for where zookeper config dir is
export ZOOCFGDIR=/opt/zookeeper-3.4.5-cdh5.3.6
6. Basic commands
sqoop help
sqoop help <command>
7. Data extraction and conversion
RDBMS (MySQL)
jdbc url, username, password, table name
Conversion
import export
HDFS
path
Hive
table name
8. Data extraction examples using MySQL. First copy the JDBC driver jar (mysql-connector-java-5.1.38.jar) into $SQOOP_HOME/lib.
# List the databases
bin/sqoop list-databases \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306 \
--username root \
--password password01
# Import a table into HDFS
bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--table dm_gy_xzqh
# Import a table into HDFS, specifying the target path and the number of map tasks
bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--table dm_gy_xzqh \
--target-dir /user/sqoop/imp_xzqh \
--num-mappers 1
Sqoop is implemented on top of MapReduce; an import runs only map tasks (there is no reduce phase).
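Since the import is a map-only job, --num-mappers controls its parallelism. With more than one mapper, Sqoop needs a column to partition the rows on (the table's primary key by default, or an explicit --split-by). A sketch, assuming the table has a numeric column named id (hypothetical):

```shell
# Parallel import: 4 map tasks, each reading a range of the
# (hypothetical) numeric column 'id'.
bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--table dm_gy_xzqh \
--target-dir /user/sqoop/imp_xzqh_parallel \
--num-mappers 4 \
--split-by id
```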
Supported storage formats for the imported files:
textfile (the default)
orcfile
parquet
# Import a table into HDFS as Parquet (the default format is textfile)
bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--table dm_gy_xzqh \
--target-dir /user/sqoop/imp_xzqh_parquet \
--num-mappers 1 \
--as-parquetfile
Create a matching table in Hive:
create table default.xzqh_sqoop_parquet(
XZQHSZ_DM string,
XZQHZM_DM string,
XZQHMC string,
SJXZQHSZ_DM string,
XZQHJC string,
XYBZ string,
YXBZ string,
XZQHLMZM_DM string,
SSXZQMC string,
SWJG_DM string,
XZQHLX_DM string,
SJGL_TBSJ string
)
row format delimited fields terminated by ','
stored as parquet ;
load data inpath '/user/sqoop/imp_xzqh_parquet' into table xzqh_sqoop_parquet ;
create table default.xzqh_sqoop(
XZQHSZ_DM string,
XZQHZM_DM string,
XZQHMC string,
SJXZQHSZ_DM string,
XZQHJC string,
XYBZ string,
YXBZ string,
XZQHLMZM_DM string,
SSXZQMC string,
SWJG_DM string,
XZQHLX_DM string,
SJGL_TBSJ string
)
row format delimited fields terminated by ','
stored as TEXTFILE;
load data inpath '/user/root/dm_gy_xzqh' into table xzqh_sqoop ;
# RDBMS -> HDFS, selecting the columns to import
bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--table dm_gy_xzqh \
--target-dir /user/sqoop/imp_xzqh_columns \
--num-mappers 1 \
--columns XZQHZM_DM,XZQHMC
In real projects, the data to be processed usually needs some initial cleansing and filtering:
filtering out certain columns
conditions
joins
# RDBMS -> HDFS: --query filters the data; the query must contain 'where $CONDITIONS'
bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--query 'select XZQHMC from dm_gy_xzqh where $CONDITIONS' \
--target-dir /user/sqoop/imp_xzqh_query \
--num-mappers 1
The error raised when $CONDITIONS is missing:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException:
Query [select XZQHMC from dm_gy_xzqh] must contain '$CONDITIONS' in WHERE clause.
# RDBMS -> HDFS, compressing the output files
bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--table dm_gy_xzqh \
--target-dir /user/sqoop/imp_xzqh_snappy \
--delete-target-dir \
--num-mappers 1 \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--fields-terminated-by '\t'
---------------------------------------------------------------------------------------
Incremental data import, done in one of two ways:
1. a filtered --query
2.Incremental import arguments:
--check-column <column> Source column to check for incremental
change
--incremental <import-type> Define an incremental import of type
'append' or 'lastmodified'
--last-value <value> Last imported value in the incremental
check column
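The arguments above combine into an append-mode incremental import. A sketch, assuming a monotonically increasing numeric column named id (hypothetical) and that rows up to id = 100 were imported in a previous run:

```shell
# Append-mode incremental import: only rows with id > 100 are fetched.
# Note: --delete-target-dir cannot be combined with --incremental append.
bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--table dm_gy_xzqh \
--target-dir /user/sqoop/imp_xzqh_incr \
--num-mappers 1 \
--incremental append \
--check-column id \
--last-value 100
```

At the end of the run, Sqoop prints the new --last-value to pass to the next incremental run.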
For imports into HDFS, --direct can improve performance.
# RDBMS -> HDFS; --delete-target-dir removes an existing target directory, --direct speeds up the transfer
bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--table dm_gy_xzqh \
--target-dir /user/sqoop/imp_xzqh_direct \
--delete-target-dir \
--num-mappers 1 \
--direct
Problem encountered:
Error: java.io.IOException: Cannot run program "mysqldump": error=2,
No such file or directory
Fix:
The error occurs because the other cluster nodes lack the mysqldump command.
Install MySQL on every node, or copy the mysqldump binary to each node, and add it to the PATH.
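The second fix can be sketched as a copy from the node that already has the MySQL client installed; the target hostnames below are hypothetical.

```shell
# Push the mysqldump binary to every other node;
# /usr/bin is normally already on the PATH there.
for node in hadoop-senior02.zhangbk.com hadoop-senior03.zhangbk.com; do
  scp /usr/bin/mysqldump ${node}:/usr/bin/mysqldump
done
```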
-------------------------------------------------------------------------
The export command moves data from HDFS to an RDBMS.
hive table
table (can be queried over JDBC through HiveServer2)
hdfs file
export -> RDBMS
#HDFS -> RDBMS
bin/sqoop export \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--table my_user \
--export-dir /user/sqoop/exp \
--num-mappers 1
-------------------------------------------------------------------------------------
Importing and exporting Hive data
Hive data itself is stored on HDFS;
the table schema describes it:
use default;
drop table if exists user_hive;
create table user_hive(
id int,
account string
)
row format delimited fields terminated by '\t';
# RDBMS -> Hive, setting the field delimiter
bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--table my_user \
--fields-terminated-by '\t' \
--delete-target-dir \
--num-mappers 1 \
--hive-import \
--hive-database default \
--hive-table user_hive \
--verbose    # print detailed progress information
# Oracle -> Hive
# Note: Oracle table names must be upper-case, or the job may fail;
# prefix tables owned by another user with that user name.
# --create-hive-table: fail if the target Hive table already exists.
bin/sqoop import \
--connect jdbc:oracle:thin:@10.111.30.11:1521/orcl \
--username hx_zs \
--password hx_zs \
--table ZS_JKS \
--fields-terminated-by '\t' \
--delete-target-dir \
--hive-import \
--create-hive-table \
--hive-database ods \
--hive-table ZS_JKS \
--verbose
The Hive jars must be copied into Sqoop's lib directory, or symlinked there.
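A minimal sketch of the two options, assuming $HIVE_HOME and $SQOOP_HOME are set:

```shell
# Option 1: copy the Hive jars into Sqoop's lib directory.
cp $HIVE_HOME/lib/hive-*.jar $SQOOP_HOME/lib/

# Option 2: symlink them instead, so a Hive upgrade is picked up
# automatically (jar name is illustrative):
# ln -s $HIVE_HOME/lib/hive-exec-0.13.1-cdh5.3.6.jar $SQOOP_HOME/lib/
```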
Export data from a Hive table to MySQL.
# Hive -> RDBMS; in essence this is HDFS -> RDBMS
bin/sqoop export \
--connect jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test \
--username root \
--password password01 \
--table my_user2 \
--num-mappers 1 \
--export-dir /user/hive/warehouse/user_hive \
--input-fields-terminated-by '\t' \
--verbose
# Save the sqoop arguments in a file and execute it with --options-file.
vi sqoop_import_hdfs.txt
import
--connect
jdbc:mysql://hadoop-senior01.zhangbk.com:3306/test
--username
root
--password
password01
--table
my_user
--fields-terminated-by
'\t'
--delete-target-dir
--num-mappers 1
--hive-import
--hive-database
default
--hive-table
user_hive
bin/sqoop --options-file sqoop_import_hdfs.txt