Sqoop 迁移案例

Echo-Niu

于 2023-01-17 08:08:12 发布

阅读量241

点赞数

分类专栏：人工智能文章标签： hive 数据库 hadoop 推荐算法算法

本文链接：https://blog.csdn.net/weixin_43894652/article/details/128711450

版权

人工智能专栏收录该内容

69 篇文章 1 订阅

订阅专栏

2.2.2.3 Sqoop 迁移案例

避坑指南：
- 导入数据到hive中，需要在创建HIVE表加入 row format delimited fields terminated by ','

hadoop数据在hive中查询就全是NULL,原因： sqoop 导出的 hdfs 分片数据，都是使用逗号 , 分割的，由于 hive 默认的分隔符是 /u0001（Ctrl+A）,为了平滑迁移，需要在创建表格时指定数据的分割符号。

8,false,null,null,2019-01-10 17:44:32.0,2019-01-10 17:44:32.0,null,null,null,null,null,null,null,null
9,false,null,null,2019-01-15 23:41:13.0,2019-01-15 23:41:13.0,null,null,null,null,null,null,null,null

hive中查询

NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL

1、user_profile表:
- 使用lastmodified模式
- mysql数据库中更新时候update_time会修改最近时间，按照user_id合并(增量导入进去会有新的重复的数据，需要合并)
- 指定last_time时间
2、user_basic表：
- 使用lastmodified模式
- last_login作为更新时间
- 指定last_time时间，按照user_id合并

Mysql导入对应hive类型:

MySQL(bigint) --> Hive(bigint) 
MySQL(tinyint) --> Hive(tinyint) 
MySQL(int) --> Hive(int) 
MySQL(double) --> Hive(double) 
MySQL(bit) --> Hive(boolean) 
MySQL(varchar) --> Hive(string) 
MySQL(decimal) --> Hive(double) 
MySQL(date/timestamp) --> Hive(string)

create table user_profile(
user_id BIGINT comment "userID",
gender BOOLEAN comment "gender",
birthday STRING comment "birthday",
real_name STRING comment "real_name",
create_time STRING comment "create_time",
update_time STRING comment "update_time",
register_media_time STRING comment "register_media_time",
id_number STRING comment "id_number",
id_card_front STRING comment "id_card_front",
id_card_back STRING comment "id_card_back",
id_card_handheld STRING comment "id_card_handheld",
area STRING comment "area",
company STRING comment "company",
career STRING comment "career")
COMMENT "toutiao user profile"
row format delimited fields terminated by ','
LOCATION '/user/hive/warehouse/toutiao.db/user_profile';


create table user_basic(
user_id BIGINT comment "user_id",
mobile STRING comment "mobile",
password STRING comment "password",
profile_photo STRING comment "profile_photo",
last_login STRING comment "last_login",
is_media BOOLEAN comment "is_media",
article_count BIGINT comment "article_count",
following_count BIGINT comment "following_count",
fans_count BIGINT comment "fans_count",
like_count BIGINT comment "like_count",
read_count BIGINT comment "read_count",
introduction STRING comment "introduction",
certificate STRING comment "certificate",
is_verified BOOLEAN comment "is_verified")
COMMENT "toutiao user basic"
row format delimited fields terminated by ','
LOCATION '/user/hive/warehouse/toutiao.db/user_basic';

导入脚本，创建一个脚本文件执行import_incremental.sh脚本：

-m：指定同时导入的线程数量
连接地址以及账号密码，表、目录
指定要导入的时间

time=`date +"%Y-%m-%d" -d "-1day"`
declare -A check
check=([user_profile]=update_time [user_basic]=last_login [news_channel]=update_time)
declare -A merge
merge=([user_profile]=user_id [user_basic]=user_id [news_channel]=channel_id)

for k in ${!check[@]}
do
    sqoop import \
        --connect jdbc:mysql://192.168.19.137/toutiao \
        --username root \
        --password password \
        --table $k \
        --m 4 \
        --target-dir /user/hive/warehouse/toutiao.db/$k \
        --incremental lastmodified \
        --check-column ${check[$k]} \
        --merge-key ${merge[$k]} \
        --last-value ${time}
done

当我们运行脚本并成功时候：

避坑指南：
如果运行失败：请检查相关问题
- 1、连接JDBC的IP 地址或者主机名是否错误
- 2、确认mysql数据库打开并且能够sqoop测试成功

运行一段时间完成之后，回到HIVE数据中查询，能查询到内容并且内容无误

hive> show tables;
OK
news_article_basic
news_channel
user_basic
user_profile
Time taken: 0.647 seconds, Fetched: 4 row(s)
hive> select * from user_profile limit 1;
OK
1       false   null    null    2018-12-28 14:26:32.0   2018-12-28 14:26:32.0   null    null    null   null     null    null    null    null
Time taken: 3.516 seconds, Fetched: 1 row(s)

3、文章表导入news_article_basic，news_article_content、news_channel
news_article_basic:
- 按照键review_time更新
- 合并按照article_id
- 指定时间

create table news_article_basic(
article_id BIGINT comment "article_id",
user_id BIGINT comment "user_id",
channel_id BIGINT comment "channel_id",
title STRING comment "title",
status BIGINT comment "status",
update_time STRING comment "update_time")
COMMENT "toutiao news_article_basic"
row format delimited fields terminated by ','
LOCATION '/user/hive/warehouse/toutiao.db/news_article_basic';

原mysql数据库中某些字段的值存在一些特定的字符，如","、"\t", "\n"这些字符都会导致导入到hadoop被hive读取失败，解析时会认为另一条数据或者多一个字段

解决方案：
- 在导入时，加入—query参数，从数据库中选中对应字段，过滤相应内容，使用REPLACE、CHAR(或者CHR)进行替换字符
- 并且mysql表中存在tinyibt必须在connet中加入: ?tinyInt1isBit=false

导入脚本：

# 方案：导入方式，过滤相关字符
sqoop import \
    --connect jdbc:mysql://192.168.19.137/toutiao?tinyInt1isBit=false \
    --username root \
    --password password \
    --m 4 \
    --query 'select article_id, user_id, channel_id, REPLACE(REPLACE(REPLACE(title, CHAR(13),""),CHAR(10),""), ",", " ") title, status, update_time from news_article_basic WHERE $CONDITIONS' \
    --split-by user_id \
    --target-dir /user/hive/warehouse/toutiao.db/news_article_basic \
    --incremental lastmodified \
    --check-column update_time \
    --merge-key article_id \
    --last-value ${time}

4、news_channel
- 跟上面用户的表处理方式相同，按照update_time键更新
- 按照channel_id合并
- 更新时间

create table news_channel(
channel_id BIGINT comment "channel_id",
channel_name STRING comment "channel_name",
create_time STRING comment "create_time",
update_time STRING comment "update_time",
sequence BIGINT comment "sequence",
is_visible BOOLEAN comment "is_visible",
is_default BOOLEAN comment "is_default")
COMMENT "toutiao news_channel"
row format delimited fields terminated by ','
LOCATION '/user/hive/warehouse/toutiao.db/news_channel';

5、由于news_article_content文章内容表中含有过多特殊字符，选择直接全量导入

# 全量导入(表只是看结构，不需要在HIVE中创建，因为是直接导入HIVE，会自动创建news_article_content)
create table news_article_content(
article_id BIGINT comment "article_id",
content STRING comment "content")
COMMENT "toutiao news_article_content"
row format delimited fields terminated by ','
LOCATION '/user/hive/warehouse/toutiao.db/news_article_content';

直接导入到HIVE汇总，导入脚本为：

sqoop import \
    --connect jdbc:mysql://192.168.19.137/toutiao \
    --username root \
    --password password \
    --table news_article_content \
    --m 4 \
    --hive-home /root/bigdata/hive \
    --hive-import \
    --hive-drop-import-delims \
    --hive-table toutiao.news_article_content \
    --hive-overwrite