Heima Toutiao Recommendation (1): Data Preparation

weixin_53293028

Published 2023-05-25 15:03:44

Article link: https://blog.csdn.net/weixin_53293028/article/details/130867252

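Database migration

Normally the data would be exported from the business database into Hive on Hadoop. In this project the data files are already provided, so they only need to be placed in the corresponding HDFS directory: /user/hive/warehouse/

Requirements

Incremental updates
Store the data in Hive on Hadoop
- Avoids connecting to and operating on the business database directly
- Keeps a synchronized copy of the data in the cluster for analysis
User profile information, two tables: user_profile, user_basic
Article basics, content, and channels, three tables: news_article_basic, news_article_content, news_channel

Create the Hive database

-- location: where the data is stored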
create database if not exists toutiao comment "user,news information of 136 mysql" location '/user/hive/warehouse/toutiao.db/';

Introduction to Sqoop migration

Test whether Sqoop can connect to MySQL

sqoop list-databases --connect jdbc:mysql://192.168.19.137:3306/ --username root -P

# On success, the databases in MySQL are listed
information_schema
mysql
performance_schema
sys
toutiao

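Sqoop supports two import modes, full and incremental. This project uses the second, incremental import.

Full import (.sh script)

#!/bin/bash

# Names of all the tables to import
array=(user_profile user_basic news_user_channel news_channel user_follow user_blacklist user_search news_collection news_article_basic news_article_content news_read news_article_statistic user_material)

# Options used below (a comment cannot follow a trailing backslash,
# so the explanations are collected here):
#   --m 5                      number of parallel map tasks
#   --hive-home                Hive installation path
#   --hive-import              import into Hive
#   --create-hive-table        create the Hive table during the import
#   --hive-drop-import-delims  strip \n, \r and \01 from string fields
#   --warehouse-dir            HDFS directory the data is imported into
for table_name in "${array[@]}";
do
    sqoop import \
        --connect jdbc:mysql://192.168.19.137/toutiao \
        --username root \
        --password password \
        --table "$table_name" \
        --m 5 \
        --hive-home /root/bigdata/hive \
        --hive-import \
        --create-hive-table \
        --hive-drop-import-delims \
        --warehouse-dir /user/hive/warehouse/toutiao.db \
        --hive-table "toutiao.$table_name"
done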

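Incremental import (the second of the two modes below is used)

append: import by a specified auto-incrementing column (not recommended: the table may have no incrementing column, existing rows may change without getting a new ID, and in a distributed setup IDs can conflict)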
```
# --incremental append is fixed syntax; --check-column names the column to compare; --last-value 0 starts from 0
--incremental append --check-column num_iid --last-value 0
```
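
To make the drawback of append mode concrete, here is a minimal Python sketch that imitates its selection rule on in-memory rows. It is not Sqoop code: `append_import` and the sample rows are made up for illustration, and only the column name `num_iid` comes from the snippet above.

```python
def append_import(source_rows, check_column, last_value):
    """Imitate append mode: pick rows whose check column exceeds last_value.

    Returns copies of the selected rows and the new last value to use next time.
    """
    new_rows = [dict(r) for r in source_rows if r[check_column] > last_value]
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last


# Hypothetical source table with an incrementing column num_iid
rows = [{"num_iid": 1, "area": "Beijing"}, {"num_iid": 2, "area": "Shanghai"}]
first_batch, last_value = append_import(rows, "num_iid", 0)

# Row 1 is updated in place and a new row 3 is inserted
rows[0]["area"] = "Shenzhen"
rows.append({"num_iid": 3, "area": "Hangzhou"})
second_batch, last_value = append_import(rows, "num_iid", last_value)

print([r["num_iid"] for r in second_batch])  # only the new row is picked up
```

The update to row 1 never reaches the second batch, because its `num_iid` did not grow. That is the reason a timestamp-based mode is preferred for tables whose rows are edited.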
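lastmodified: import by a timestamp column

# --incremental lastmodified   use the last-modified time
# --check-column column        the timestamp column
# --merge-key key              the key column used to merge rows
--incremental lastmodified \
--check-column column \
--merge-key key \
--last-value '2012-02-01 11:00:00'

# Only rows whose check-column value is later than '2012-02-01 11:00:00' are imported, and they are merged by key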
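
The select-then-merge behaviour described above can be sketched in plain Python. This is an illustration under simplifying assumptions, not Sqoop's implementation: `lastmodified_import` and the sample rows are hypothetical, and the exact handling of rows that fall on the cutoff itself is left out.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def lastmodified_import(target_rows, source_rows, check_column, merge_key, last_value):
    """Merge source rows modified after last_value into target_rows by merge_key."""
    cutoff = datetime.strptime(last_value, TIME_FORMAT)
    merged = {row[merge_key]: dict(row) for row in target_rows}
    for row in source_rows:
        if datetime.strptime(row[check_column], TIME_FORMAT) > cutoff:
            # An existing key is overwritten, a new key is added
            merged[row[merge_key]] = dict(row)
    return sorted(merged.values(), key=lambda r: r[merge_key])


# Data already imported earlier (hypothetical)
target = [{"user_id": 1, "area": "Beijing", "update_time": "2011-06-01 08:00:00"}]
# Current state of the source table (hypothetical)
source = [
    {"user_id": 1, "area": "Shenzhen", "update_time": "2012-03-01 09:30:00"},
    {"user_id": 2, "area": "Shanghai", "update_time": "2011-01-01 00:00:00"},
    {"user_id": 3, "area": "Hangzhou", "update_time": "2012-05-20 12:00:00"},
]

result = lastmodified_import(target, source, "update_time", "user_id", "2012-02-01 11:00:00")
print([(r["user_id"], r["area"]) for r in result])
```

User 1 is updated in place, user 3 is added, and user 2 is skipped because it was not modified after the cutoff.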

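There are two ways to land the imported data; this project uses the second
- Import directly into Hive with Sqoop, which creates the table in Hive automatically and loads the data (--incremental lastmodified mode does not support importing into Hive)
- Import into HDFS with Sqoop, then create a Hive table that points at the data
  - --target-dir /user/hive/warehouse/toutiao.db/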

Type mapping between MySQL and Hive

MySQL(bigint) --> Hive(bigint) 
MySQL(tinyint(1)) --> Hive(boolean) 
MySQL(int) --> Hive(int) 
MySQL(double) --> Hive(double) 
MySQL(bit) --> Hive(boolean) 
MySQL(varchar) --> Hive(string) 
MySQL(decimal) --> Hive(double) 
MySQL(date/timestamp) --> Hive(string)
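
When writing the Hive table definitions by hand, the list above can be applied mechanically. The following is a small sketch of that idea; `MYSQL_TO_HIVE`, `hive_type` and `hive_columns` are hypothetical helper names, and the `datetime` entry is an addition that follows the table definitions later in this post, which store DATETIME columns as STRING.

```python
# Mapping taken from the list above; "datetime" is added because the
# table definitions in this post also store DATETIME columns as STRING.
MYSQL_TO_HIVE = {
    "bigint": "BIGINT",
    "tinyint": "BOOLEAN",
    "int": "INT",
    "double": "DOUBLE",
    "bit": "BOOLEAN",
    "varchar": "STRING",
    "decimal": "DOUBLE",
    "date": "STRING",
    "timestamp": "STRING",
    "datetime": "STRING",
}


def hive_type(mysql_type):
    """Translate a MySQL type such as 'VARCHAR(255)' to its Hive counterpart."""
    base = mysql_type.lower().split("(")[0].strip()
    return MYSQL_TO_HIVE[base]


def hive_columns(columns):
    """Render (name, mysql_type) pairs as the column list of a Hive DDL."""
    return ",\n".join(f"{name} {hive_type(mysql_type)}" for name, mysql_type in columns)


print(hive_columns([("user_id", "BIGINT"), ("gender", "bit"), ("create_time", "datetime")]))
```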

Running the Sqoop migration

user_profile

# MySQL table definition, in case the data first has to be loaded into MySQL from outside
create table user_profile(
  user_id BIGINT comment "userID",
  gender bit comment "gender",
  birthday VARCHAR(255) comment "birthday",
  real_name VARCHAR(255) comment "real_name",
  create_time datetime comment "create_time",
  update_time datetime comment "update_time",
  register_media_time datetime comment "register_media_time",
  id_number VARCHAR(255) comment "id_number",
  id_card_front VARCHAR(255) comment "id_card_front",
  id_card_back VARCHAR(255) comment "id_card_back",
  id_card_handheld VARCHAR(255) comment "id_card_handheld",
  area VARCHAR(255) comment "area",
  company VARCHAR(255) comment "company",
  career VARCHAR(255) comment "career") ENGINE=InnoDB DEFAULT CHARSET=utf8;


-- In Hive, create the user_profile table, with fields delimited by ',' and an explicit storage location
create table user_profile(
user_id BIGINT comment "userID",
gender BOOLEAN comment "gender",
birthday STRING comment "birthday",
real_name STRING comment "real_name",
create_time STRING comment "create_time",
update_time STRING comment "update_time",
register_media_time STRING comment "register_media_time",
id_number STRING comment "id_number",
id_card_front STRING comment "id_card_front",
id_card_back STRING comment "id_card_back",
id_card_handheld STRING comment "id_card_handheld",
area STRING comment "area",
company STRING comment "company",
career STRING comment "career")
COMMENT "toutiao user profile"
row format delimited fields terminated by ','
LOCATION '/user/hive/warehouse/toutiao.db/user_profile';



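# On Linux, write a shell script to run the migration
# (a comment cannot follow a trailing backslash, so the options are explained here)
#   --m 4                number of parallel map tasks
#   --target-dir         HDFS path that Sqoop imports the data into
#   --check-column       a column of the MySQL table; rows are compared on this column
#   --merge-key user_id  a row whose user_id already exists is merged into the existing row
#                        (for example, a changed address updates the row instead of adding a new one)
#   --last-value         import all rows modified after this time

sqoop import \
    --connect jdbc:mysql://192.168.19.137/toutiao \
    --username root \
    --password password \
    --table user_profile \
    --m 4 \
    --target-dir /user/hive/warehouse/toutiao.db/user_profile \
    --incremental lastmodified \
    --check-column update_time \
    --merge-key user_id \
    --last-value '2018-01-01 00:00:00'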
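
Since the remaining tables use the same set of options with different table and column names, the command can also be assembled programmatically. The sketch below only builds the argument list and does not run Sqoop; `build_incremental_import` is a hypothetical helper, and the connection string, credentials and warehouse path are simply the values used in this post.

```python
import shlex


def build_incremental_import(table, check_column, merge_key, last_value,
                             connect="jdbc:mysql://192.168.19.137/toutiao",
                             username="root", password="password",
                             warehouse="/user/hive/warehouse/toutiao.db",
                             mappers=4):
    """Return the sqoop import command for one table as a list of arguments."""
    return [
        "sqoop", "import",
        "--connect", connect,
        "--username", username,
        "--password", password,
        "--table", table,
        "--m", str(mappers),
        "--target-dir", f"{warehouse}/{table}",
        "--incremental", "lastmodified",
        "--check-column", check_column,
        "--merge-key", merge_key,
        "--last-value", last_value,
    ]


cmd = build_incremental_import("user_profile", "update_time", "user_id",
                               "2018-01-01 00:00:00")
# Quote each argument so the printed line can be pasted into a shell
printable = " ".join(shlex.quote(arg) for arg in cmd)
print(printable)
```

Keeping the command as a list has the side benefit that each option can carry an ordinary comment in the source file, which is not possible inside a backslash-continued shell command.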

user_basic

# Create the MySQL table
# (VARCHAR needs a length in MySQL, 255 is a placeholder here;
#  last_login is DATETIME so that it can serve as the Sqoop check column)
create table user_basic(
user_id BIGINT comment "user_id",
mobile VARCHAR(255) comment "mobile",
password VARCHAR(255) comment "password",
profile_photo VARCHAR(255) comment "profile_photo",
last_login DATETIME comment "last_login",
is_media bit comment "is_media",
article_count BIGINT comment "article_count",
following_count BIGINT comment "following_count",
fans_count BIGINT comment "fans_count",
like_count BIGINT comment "like_count",
read_count BIGINT comment "read_count",
introduction VARCHAR(255) comment "introduction",
certificate VARCHAR(255) comment "certificate",
is_verified bit comment "is_verified") ENGINE=InnoDB DEFAULT CHARSET=utf8;


-- In Hive, create the user_basic table, with fields delimited by ',' and an explicit storage location
create table user_basic(
user_id BIGINT comment "user_id",
mobile STRING comment "mobile",
password STRING comment "password",
profile_photo STRING comment "profile_photo",
last_login STRING comment "last_login",
is_media BOOLEAN comment "is_media",
article_count BIGINT comment "article_count",
following_count BIGINT comment "following_count",
fans_count BIGINT comment "fans_count",
like_count BIGINT comment "like_count",
read_count BIGINT comment "read_count",
introduction STRING comment "introduction",
certificate STRING comment "certificate",
is_verified BOOLEAN comment "is_verified")
COMMENT "toutiao user basic"
row format delimited fields terminated by ','
LOCATION '/user/hive/warehouse/toutiao.db/user_basic';



# On Linux, write a shell script to run the migration
# --check-column last_login refers to a column of the MySQL table

sqoop import \
    --connect jdbc:mysql://192.168.19.137/toutiao \
    --username root \
    --password password \
    --table user_basic \
    --m 4 \
    --target-dir /user/hive/warehouse/toutiao.db/user_basic \
    --incremental lastmodified \
    --check-column last_login \
    --merge-key user_id \
    --last-value '2018-01-01 00:00:00'

news_article_basic

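The data contains special characters such as "," "\t" and "\n". These cause Hive to misread the files imported into Hadoop: the parser treats them as the start of another row or as an extra field.

Solution:
- During import, add the --query parameter to select the needed columns from the database and filter their content, using REPLACE with CHAR (or CHR) to substitute the characters
- If the MySQL table contains tinyint columns, ?tinyInt1isBit=false must be added to the connection string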
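
The effect of the character filtering can be checked on a single value in Python. This is only an illustration of why the replacement is needed; `clean_title` is a hypothetical function that mirrors the nested REPLACE calls used in the query further down (carriage return and line feed removed, comma turned into a space).

```python
def clean_title(title):
    """Mirror the SQL: drop CHAR(13) and CHAR(10), replace ',' with a space."""
    return title.replace("\r", "").replace("\n", "").replace(",", " ")


raw_title = "Spark,\r\nHive"

# Written as a comma-delimited text row, the raw title breaks the layout
raw_row = ",".join(["1001", raw_title, "2"])
clean_row = ",".join(["1001", clean_title(raw_title), "2"])

print(len(raw_row.split(",")), len(raw_row.splitlines()))      # extra field, extra line
print(len(clean_row.split(",")), len(clean_row.splitlines()))  # three fields, one line
```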

# MySQL
create table news_article_basic(
article_id BIGINT comment "article_id",
user_id BIGINT comment "user_id",
channel_id BIGINT comment "channel_id",
title VARCHAR(255) comment "title",
status BIGINT comment "status",
update_time DATETIME comment "update_time") ENGINE=InnoDB DEFAULT CHARSET=utf8;


-- In Hive, create the news_article_basic table, with fields delimited by ',' and an explicit storage location
create table news_article_basic(
article_id BIGINT comment "article_id",
user_id BIGINT comment "user_id",
channel_id BIGINT comment "channel_id",
title STRING comment "title",
status BIGINT comment "status",
update_time STRING comment "update_time")
COMMENT "toutiao news_article_basic"
row format delimited fields terminated by ','
LOCATION '/user/hive/warehouse/toutiao.db/news_article_basic';



# On Linux, write a shell script to run the migration
# Approach: import with --query and filter the problem characters; CHAR(13) and CHAR(10) are \r and \n
# tinyInt1isBit=false: MySQL tinyint columns are not converted to boolean when imported into Hive (by default they are)
# --table is left out because Sqoop does not accept --table together with --query

sqoop import \
    --connect 'jdbc:mysql://192.168.19.137/toutiao?tinyInt1isBit=false' \
    --username root \
    --password password \
    --m 4 \
    --query 'select article_id, user_id, channel_id, REPLACE(REPLACE(REPLACE(title, CHAR(13),""),CHAR(10),""), ",", " ") title, status, update_time from news_article_basic WHERE $CONDITIONS' \
    --split-by user_id \
    --target-dir /user/hive/warehouse/toutiao.db/news_article_basic \
    --incremental lastmodified \
    --check-column update_time \
    --merge-key article_id \
    --last-value '2018-01-01 00:00:00'

news_channel

# MySQL
create table news_channel(
channel_id BIGINT comment "channel_id",
channel_name VARCHAR(255) comment "channel_name",
create_time DATETIME comment "create_time",
update_time DATETIME comment "update_time",
sequence BIGINT comment "sequence",
is_visible BOOLEAN comment "is_visible",
is_default BOOLEAN comment "is_default") ENGINE=InnoDB DEFAULT CHARSET=utf8;


-- Hive
create table news_channel(
channel_id BIGINT comment "channel_id",
channel_name STRING comment "channel_name",
create_time STRING comment "create_time",
update_time STRING comment "update_time",
sequence BIGINT comment "sequence",
is_visible BOOLEAN comment "is_visible",
is_default BOOLEAN comment "is_default")
COMMENT "toutiao news_channel"
row format delimited fields terminated by ','
LOCATION '/user/hive/warehouse/toutiao.db/news_channel';

sqoop import \
    --connect jdbc:mysql://192.168.19.137/toutiao \
    --username root \
    --password password \
    --table news_channel \
    --m 4 \
    --target-dir /user/hive/warehouse/toutiao.db/news_channel \
    --incremental lastmodified \
    --check-column update_time \
    --merge-key channel_id \
    --last-value '2018-01-01 00:00:00'

news_article_content

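Because the news_article_content table contains too many special characters (the content was collected by a crawler, so it includes HTML markup and so on), it is imported in full instead.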

# MySQL
create table news_article_content(
article_id BIGINT comment "article_id",
content LONGTEXT comment "content") ENGINE=InnoDB DEFAULT CHARSET=utf8;  # a long text type is assumed here, since VARCHAR needs a length



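-- Full import (this definition only shows the structure; normally it does not need to be created in Hive, because importing directly into Hive creates news_article_content automatically)
-- It is normally not run, but since our data was copied straight from Hadoop files, it has to be executed here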

create table news_article_content(
article_id BIGINT comment "article_id",
content STRING comment "content")
COMMENT "toutiao news_article_content"
row format delimited fields terminated by ','
LOCATION '/user/hive/warehouse/toutiao.db/news_article_content';

# --hive-table: name of the target Hive table; --hive-overwrite: overwrite the existing data
sqoop import \
    --connect jdbc:mysql://192.168.19.137/toutiao \
    --username root \
    --password password \
    --table news_article_content \
    --m 4 \
    --hive-home /root/bigdata/hive \
    --hive-import \
    --hive-drop-import-delims \
    --hive-table toutiao.news_article_content \
    --hive-overwrite
Creating a scheduled migration script

Use crontab -e to schedule the migration script
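
A scheduled incremental import only stays incremental if --last-value moves forward between runs. One possible way to handle that is to keep the timestamp of the previous run in a small state file, as in the sketch below. The file name and the functions `read_last_value` and `record_run` are assumptions made for illustration; Sqoop's own saved jobs (sqoop job) can track this value as well.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
DEFAULT_LAST_VALUE = "2018-01-01 00:00:00"  # starting point used in this post


def read_last_value(state_file):
    """Return the stored timestamp, or the default on the very first run."""
    path = Path(state_file)
    if path.exists():
        return path.read_text().strip()
    return DEFAULT_LAST_VALUE


def record_run(state_file, started_at):
    """Store the start time of a successful run for the next invocation."""
    Path(state_file).write_text(started_at.strftime(TIME_FORMAT))


# Demonstration in a temporary directory (the file name is hypothetical)
with tempfile.TemporaryDirectory() as tmp:
    state = os.path.join(tmp, "user_profile.last_value")
    first = read_last_value(state)
    record_run(state, datetime(2023, 5, 25, 3, 0, 0))
    second = read_last_value(state)

print(first, "->", second)
```

Recording the start time of the run rather than its end time means rows modified while the import was running are picked up again next time instead of being lost.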

Software used

hadoop
hive
hbase
zookeeper
spark
sqoop

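Overall logic

Offline part

Article profiles and user profiles are updated so that features can later be extracted from them and used to train models.
User recall results are updated so that recommendation candidate sets can be produced quickly later.
The user and article feature center stores the user and article features directly, and these can be fed straight into the model for prediction. Together, these pieces make fast recommendation possible.

For example, we first extract features from the existing user and article profiles, train a logistic regression model on them, and save it.

When a user arrives, we take the user ID and use the updated recall result set to quickly fetch the few hundred candidate articles recalled for that user. From the stored feature center, the features the model needs can be looked up immediately by user ID and article ID for each candidate. With the previously trained logistic regression model, we then predict the user's click probability for each candidate article, sort by that probability in descending order, and push the top N articles to the user.

This completes the offline recommendation logic.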
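
The recall, feature lookup, scoring and sorting steps can be put together in a few lines of Python. This is a toy sketch with made-up numbers: the recall table, the feature center, the weights and the function names are all hypothetical stand-ins for what the real project stores in its own systems and learns from data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def click_probability(weights, bias, features):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def recommend(user_id, recall, feature_center, weights, bias, top_n):
    """Score every recalled article for the user and return the top_n article IDs."""
    scored = []
    for article_id in recall[user_id]:
        features = feature_center[(user_id, article_id)]
        scored.append((article_id, click_probability(weights, bias, features)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [article_id for article_id, _ in scored[:top_n]]


# Hypothetical stand-ins for the recall results, feature center and trained model
recall = {1: [101, 102, 103]}
feature_center = {
    (1, 101): [0.2, 0.1],
    (1, 102): [0.9, 0.8],
    (1, 103): [0.5, 0.5],
}
weights, bias = [1.0, 2.0], -1.0

top_articles = recommend(1, recall, feature_center, weights, bias, top_n=2)
print(top_articles)
```

Because the features are looked up rather than computed at request time, the online cost is one dictionary access and one dot product per candidate, which is what makes scoring a few hundred candidates fast.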