数据库迁移

最新推荐文章于 2021-04-13 23:24:02 发布

weixin_46361646

最新推荐文章于 2021-04-13 23:24:02 发布

阅读量193

点赞数

分类专栏：学习类

本文链接：https://blog.csdn.net/weixin_46361646/article/details/108002198

版权

学习类专栏收录该内容

5 篇文章 0 订阅

订阅专栏

文章目录

- 应用Sqoop完成业务数据到HIVE的导入

应用Sqoop完成业务数据到HIVE的导入

数据库迁移需求

mysql业务数据库中的数据，同步到hadoop的hive数据仓库中。

为了避免直接连接、操作业务数据
同步一份数据在集群中方便进行数据分析操作
从mysql中迁移5张表到hive数据库中

hive 的shell操作

# 启动hive后 
hive> show databases;
OK
default
profile
toutiao
Time taken: 0.017 seconds, Fetched: 3 row(s)

创建hive业务数据库toutiao

hive> create database if not exists toutiao comment "user,news information of 136 mysql" location '/user/hive/warehouse/toutiao.db/';

知识补充：

   # hive中创建数据库
    create database hive_db　　//创建数据库hive_db
　　 create table hive_db.test（字段内容及其格式省略）　　//在数据库hive_db中创建test表

　　create database student_db location '/user/hive/student.db'　　//创建数据库student_db，但是在hdfs中显示student.db，在hive控制端中显示student_db（在有location的情况下）

　　create database if not exists hive_db
　　show databases like 'hive*'　　//结果为hive_db
　　drop database hive_db　　//这种方式只能删除空数据库
　　drop database studentdb casecade　　//强制删除非空数据库

　　describe database hive_db　　//显示数据库的信息
　　create database teacherdb comment "数据库teacherdb的备注"

避坑注意：
导入数据到hive中，需要在创建HIVE表加入 row format delimited fields terminated by ','
fields-terminated-by 要是不指定值的话，默认分隔符为'\001'，并且以后每次导入数据都要设置 --fields-terminated-by '\001'，不然导入的数据为NULL。建议手动设置 --fields-terminated-by的值
原因： sqoop 导出的 hdfs 分片数据，都是使用逗号 , 分割的，由于 hive 默认的分隔符是 /u0001（Ctrl+A）,为了平滑迁移，需要在创建表格时指定数据的分割符号。

导数据前进行测试 sqoop 与mysql的链接测试

xx ~]# sqoop list-databases --connect jdbc:mysql://192.168.19.137:3306/ --username root -P

链接成功以后，会显示连接到的mysql数据库

业务数据导入存在的问题

问题：1、新增的用户、文章
2、修改用户信息，文章信息
-解决办法：
两种导入形式选择增量，定期导入数据

sqoop 全量导入（但不是我们的选择）

#!/bin/bash

# 建立数组
array=(user_profile user_basic news_user_channel news_channel user_follow user_blacklist user_search news_collection news_article_basic news_article_content news_read news_article_statistic user_material)

for table_name in ${array[@]};
do
    sqoop import \
        --connect jdbc:mysql://192.168.19.137/toutiao \
        --username root \
        --password password \
        --table $table_name \  
        --m 5 \
        --hive-home /root/bigdata/hive \    # hive path
        --hive-import \                      # 导入形式
        --create-hive-table  \                # 创建表
        --hive-drop-import-delims \
        --warehouse-dir /user/hive/warehouse/toutiao.db \
        --hive-table toutiao.$table_name
done

sqoop 增量导入（推荐）

两种方式

append：即通过指定一个递增的列，如：
–incremental append --check-column num_iid --last-value 0
last-value 0 指的是开始导入节点
这种方式也有问题，如数据库中没有递增的列，还有就是在分布式的情况下，会出现id重复的情况
incremental：时间戳，比如：（推荐用）

--incremental lastmodified \
--check-column column \
--merge-key key \
--last-value '2012-02-01 11:0:00'

lastmodified 最后修改时间
column 时间列
key 检查列，合并列
last-value 是导入数据的起点

就是只导入check-column的列比’2012-02-01 11:0:00’更大的数据,按照key合并
（此处采用的–increamental append方式，这种方式需注意主键或split-colunm是递增，否则建议在关系表中增加一个createTime字段，采用lastmodified方式。）

导入最终结果两种形式，选择后者

直接sqoop导入到hive(–incremental lastmodified模式不支持导入Hive ) （全量导入）
sqoop导入到hdfs，然后建立hive表关联
–target-dir /user/hive/warehouse/toutiao.db/

–targrt-dir的值设置成hive表数据文件存储的路径。假如你的hive表为外部表，则–targrt-dir要指向外部表的存储路径
–last-value 3，意味mysql中id为3的数据不会被导入

sqoop import --connect jdbc:mysql://192.168.0.8:3306/hxy \
--username root \
--password 123456 \
--table data \
--target-dir '/soft/hive/warehouse/data' \
--incremental append \
--check-column id \
--last-value 3 \
-m 1

sqoop import --connect jdbc:mysql://192.168.0.8:3306/hxy \
--username root \
--password 123456 \
--table data \
--target-dir '/soft/hive/warehouse/data' \
--check-column last_mod \
--incremental lastmodified \
--last-value '2019-08-30 16:49:12' \
--m 1 \
--append

Sqoop迁移实例

1、user_profile表:
使用lastmodified模式
mysql数据库中更新时候update_time会修改最近时间，按照user_id合并(增量导入进去会有新的重复的数据，需要合并)
指定last_time时间

# 在hive中创建表
# user_profile
create table user_profile(
user_id BIGINT comment "userID",
gender BOOLEAN comment "gender",
birthday STRING comment "birthday",
real_name STRING comment "real_name",
create_time STRING comment "create_time",
update_time STRING comment "update_time",
register_media_time STRING comment "register_media_time",
id_number STRING comment "id_number",
id_card_front STRING comment "id_card_front",
id_card_back STRING comment "id_card_back",
id_card_handheld STRING comment "id_card_handheld",
area STRING comment "area",
company STRING comment "company",
career STRING comment "career")
COMMENT "toutiao user profile"
row format delimited fields terminated by ','   # 注意这里
LOCATION '/user/hive/warehouse/toutiao.db/user_profile';

Mysql导入对应hive类型:

MySQL(bigint) --> Hive(bigint)
MySQL(tinyint) --> Hive(boolean)
MySQL(int) --> Hive(int)
MySQL(double) --> Hive(double)
MySQL(bit) --> Hive(boolean)
MySQL(varchar) --> Hive(string)
MySQL(decimal) --> Hive(double)
MySQL(date/timestamp) --> Hive(string)

# 导入hive
    sqoop import \
        --connect jdbc:mysql://192.168.19.137/toutiao \
        --username root \
        --password password \
        --table $k \
        --m 4 \
        --target-dir /user/hive/warehouse/toutiao.db/$k \
        --incremental lastmodified \
        --check-column ${check[$k]} \
        --merge-key ${merge[$k]} \
        --last-value ${time}

在这里插入图片描述
m 4
Sqoop并行化是启多个map task实现的，-m(或–num-mappers)参数指定map task数，默认是四个。
–last-value 是指的开始导入时间一直到现在，对吗？

避坑指南：
如果运行失败：请检查相关问题
1、连接JDBC的IP 地址或者主机名是否错误
2、确认mysql数据库打开并且能够sqoop测试成功
hive 中 show tables；
select * from user_profile limit 1

2、文章表导入news_article_basic，news_article_content、news_channel

news_article_basic:
按照键review_time更新
合并按照article_id
指定时间
（找到合并的列（唯一值），递增列(时间列)）
这张表的特殊性：
原mysql数据库中某些字段的值存在一些特定的字符，如","、"\t", "\n"这些字符都会导致导入到hadoop被hive读取失败，解析时会认为另一条数据或者多一个字段
在这里插入图片描述

sqoop在迁移一些有特殊字符的数据时
- 先做查询替换
解决方案：
1在导入时，加入—query参数，从数据库中选中对应字段，过滤相应内容，使用REPLACE、CHAR(或者CHR)进行替换字符

```bash
--query 'select article_id, user_id, channel_id, REPLACE(REPLACE(REPLACE(title, CHAR(13),""),CHAR(10),""), ",", " ") title, status, update_time from news_article_basic WHERE $CONDITIONS' \

CHAR(13) 表示
在这里插入图片描述
WHERE $CONDITIONS’ 表示sqoop 的什么条件都能去导入‘’

2并且mysql表中存在tinyint,必须在connet中加入: ?tinyInt1isBit=false
防止默认到HIVE中,字段默认会被转化为boolean数据类型

在mysql 中用desc table1 查看表信息
在MySQL(tinyint) --> Hive(boolean) 
如状态值 01 变成布尔

导入语句是

```bash
# 方案：导入方式，过滤相关字符
sqoop import \
    --connect jdbc:mysql://192.168.19.137/toutiao?tinyInt1isBit=false \
    --username root \
    --password password \
    --m 4 \
    --query 'select article_id, user_id, channel_id, REPLACE(REPLACE(REPLACE(title, CHAR(13),""),CHAR(10),""), ",", " ") title, status, update_time from news_article_basic WHERE $CONDITIONS' \
    --split-by user_id \
    --target-dir /user/hive/warehouse/toutiao.db/news_article_basic \
    --incremental lastmodified \
    --check-column update_time \
    --merge-key article_id \
    --last-value ${time}

–connect jdbc:mysql://192.168.19.137/toutiao?tinyInt1isBit=false
这里的原因
在这里插入图片描述

迁移脚本定时运行

crontab-shell 脚本定时运行
每个用户都要自己的crontab文件，位置在/var/spool/cron/ 目录中，通过crontab命令来编辑

crontab -e

在这里插入图片描述

weixin_46361646

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
数据库迁移

文章目录应用Sqoop完成业务数据到HIVE的导入数据库迁移需求hive 的shell操作业务数据导入存在的问题两种方式Sqoop迁移实例迁移脚本定时运行应用Sqoop完成业务数据到HIVE的导入数据库迁移需求mysql业务数据库中的数据，同步到hadoop的hive数据仓库中。为了避免直接连接、操作业务数据同步一份数据在集群中方便进行数据分析操作从mysql中迁移5张表到hive数据库中hive 的shell操作# 启动hive后 hive> show databases;O
复制链接

扫一扫