Big Data Project: E-commerce Data Warehouse (Part 2: Business Data Collection Platform) V6.1.2

Chapter 1 E-commerce Business Overview

1.1 E-commerce Business Process

1.2 E-commerce Basics (SKU and SPU)

SKU = Stock Keeping Unit. Originally the basic unit of inventory, the term has come to be used as a product's unified identifier: every distinct product corresponds to exactly one SKU number.

SPU (Standard Product Unit): the smallest unit for aggregating product information, a reusable, easily searchable set of standardized attributes.

For example, the iPhone X as a model is an SPU; a silver iPhone X with 128 GB of storage that supports China Unicom's network is an SKU.

 

An SPU represents a class of products. The benefit: the SKUs under one SPU can share product images, banners, sales attributes, and so on.
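This one-to-many relationship is visible in the gmall schema defined below: sku_info (section 1.3.3) carries an spu_id foreign key to spu_info (section 1.3.13). A minimal sketch, runnable once the business data of Chapter 2 has been generated, that lists every SKU under a single SPU:

[atguigu@hadoop102 ~]$ mysql -uroot -p000000 gmall -e "
select spu.spu_name, sku.sku_name, sku.price
  from spu_info spu
  join sku_info sku on sku.spu_id = spu.id
 where spu.id = 1;"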

1.3 E-commerce Business Table Structures

 

1.3.1 Order table (order_info)

id: order ID
consignee: consignee
consignee_tel: consignee phone number
final_total_amount: final total amount
order_status: order status
user_id: user ID
delivery_address: delivery address
order_comment: order remarks
out_trade_no: external trade number (for third-party payment)
trade_body: order description (for third-party payment)
create_time: creation time
operate_time: operation time
expire_time: expiration time
tracking_no: shipment tracking number
parent_order_id: parent order ID
img_url: image path
province_id: province ID
benefit_reduce_amount: discount amount
original_total_amount: original total amount
feight_fee: shipping fee

1.3.2 Order detail table (order_detail)

id: detail record ID
order_id: order ID
sku_id: SKU ID
sku_name: SKU name (redundant)
img_url: image path (redundant)
order_price: item price (SKU price at order time)
sku_num: item quantity
create_time: creation time

1.3.3 SKU product table (sku_info)

id: SKU ID
spu_id: SPU ID
price: price
sku_name: SKU name
sku_desc: SKU description
weight: weight
tm_id: brand ID
category3_id: third-level category ID
sku_default_img: default display image (redundant)
create_time: creation time

1.3.4 User table (user_info)

id: user ID
login_name: login name
nick_name: nickname
passwd: password
name: real name
phone_num: phone number
email: email address
head_img: avatar
user_level: user level
birthday: birthday
gender: gender (M = male, F = female)
create_time: creation time
operate_time: operation time

1.3.5 First-level product category table (base_category1)

id: ID
name: name

1.3.6 Second-level product category table (base_category2)

id: ID
name: name
category1_id: first-level category ID

1.3.7 Third-level product category table (base_category3)

id: ID
name: name
category2_id: second-level category ID

1.3.8 Payment flow table (payment_info)

id: ID
out_trade_no: external trade number
order_id: order ID
user_id: user ID
alipay_trade_no: Alipay transaction number
total_amount: payment amount
subject: transaction subject
payment_type: payment type
payment_time: payment time

1.3.9 Province table (base_province)

id: ID
name: province name
region_id: region ID
area_code: area code
iso_code: ISO code

1.3.10 Region table (base_region)

id: region ID
region_name: region name

1.3.11 Brand table (base_trademark)

tm_id: brand ID
tm_name: brand name

1.3.12 Order status log table (order_status_log)

id: ID
order_id: order ID
order_status: order status
operate_time: operation time

1.3.13 SPU product table (spu_info)

id: SPU ID
spu_name: SPU name
description: product description (back-office summary)
category3_id: third-level category ID
tm_id: brand ID

1.3.14 Product review table (comment_info)

id: ID
user_id: user ID
sku_id: SKU ID
spu_id: SPU ID
order_id: order ID
appraise: rating (1 = positive, 2 = neutral, 3 = negative)
comment_txt: review text
create_time: creation time

1.3.15 Order refund table (order_refund_info)

id: ID
user_id: user ID
order_id: order ID
sku_id: SKU ID
refund_type: refund type
refund_num: refund quantity
refund_amount: refund amount
refund_reason_type: reason type
refund_reason_txt: reason text
create_time: creation time

1.3.16 Shopping cart table (cart_info)

id: ID
user_id: user ID
sku_id: SKU ID
cart_price: price when added to the cart
sku_num: quantity
img_url: image path
sku_name: SKU name (redundant)
create_time: creation time
operate_time: modification time
is_ordered: whether the item has been ordered
order_time: order time

1.3.17 Product favorites table (favor_info)

id: ID
user_id: user ID
sku_id: SKU ID
spu_id: SPU ID
is_cancel: whether canceled (0 = active, 1 = canceled)
create_time: creation time
cancel_time: cancellation time

1.3.18 Coupon usage table (coupon_use)

id: ID
coupon_id: coupon ID
user_id: user ID
order_id: order ID
coupon_status: coupon status
get_time: time claimed
using_time: time used
used_time: time paid
expire_time: expiration time

1.3.19 Coupon table (coupon_info)

id: coupon ID
coupon_name: coupon name
coupon_type: coupon type (1 = cash voucher, 2 = discount voucher, 3 = spend-threshold voucher, 4 = item-count-threshold discount voucher)
condition_amount: threshold amount
condition_num: threshold item count
activity_id: activity ID
benefit_amount: reduction amount
benefit_discount: discount rate
create_time: creation time
range_type: scope type (1 = product, 2 = category, 3 = brand)
spu_id: product (SPU) ID
tm_id: brand ID
category3_id: category ID
limit_num: maximum number of claims
operate_time: modification time
expire_time: expiration time

1.3.20 Activity table (activity_info)

id: activity ID
activity_name: activity name
activity_type: activity type
activity_desc: activity description
start_time: start time
end_time: end time
create_time: creation time

1.3.21 Activity-order link table (activity_order)

id: ID
activity_id: activity ID
order_id: order ID
create_time: occurrence date

1.3.22 Activity rule table (activity_rule)

id: ID
activity_id: activity ID
condition_amount: threshold amount
condition_num: threshold item count
benefit_amount: discount amount
benefit_discount: discount rate
benefit_level: discount tier

1.3.23 Code dictionary table (base_dic)

dic_code: code
dic_name: code name
parent_code: parent code
create_time: creation date
operate_time: modification date

1.3.24 Activity SKU table (activity_sku) (not imported for now)

id: ID
activity_id: activity ID
sku_id: SKU ID
create_time: creation time

1.4 Date-Related Tables

1.4.1 Date table (date_info)

date_id: date ID
week_id: week ID
week_day: day of the week
day: day
month: month
quarter: quarter
year: year
is_workday: whether the date is a workday
holiday_id: holiday ID

1.4.2 Holiday table (holiday_info)

holiday_id: holiday ID
holiday_name: holiday name

1.4.3 Holiday year table (holiday_year)

holiday_id: holiday ID
holiday_name: holiday name
start_date_id: holiday start date
end_date_id: holiday end date

Chapter 2 Business Data Collection Module

 

2.1 Installing MySQL

2.1.1 Preparing the installation packages

1) Remove the bundled mysql-libs (if MySQL was installed before, remove all of it first)

[atguigu@hadoop102 software]$ rpm -qa | grep -i -E mysql\|mariadb | xargs -n1 sudo rpm -e --nodeps

2) Upload the installation packages and the JDBC driver to /opt/software, six files in total

01_mysql-community-common-5.7.29-1.el7.x86_64.rpm

02_mysql-community-libs-5.7.29-1.el7.x86_64.rpm

03_mysql-community-libs-compat-5.7.29-1.el7.x86_64.rpm

04_mysql-community-client-5.7.29-1.el7.x86_64.rpm

05_mysql-community-server-5.7.29-1.el7.x86_64.rpm

mysql-connector-java-5.1.48.jar

2.1.2 Installing MySQL

1) Install the MySQL dependencies

[atguigu@hadoop102 software]$ sudo rpm -ivh 01_mysql-community-common-5.7.29-1.el7.x86_64.rpm

[atguigu@hadoop102 software]$ sudo rpm -ivh 02_mysql-community-libs-5.7.29-1.el7.x86_64.rpm

[atguigu@hadoop102 software]$ sudo rpm -ivh 03_mysql-community-libs-compat-5.7.29-1.el7.x86_64.rpm

2) Install mysql-client

[atguigu@hadoop102 software]$ sudo rpm -ivh 04_mysql-community-client-5.7.29-1.el7.x86_64.rpm

3) Install mysql-server

[atguigu@hadoop102 software]$ sudo rpm -ivh 05_mysql-community-server-5.7.29-1.el7.x86_64.rpm

4) Start MySQL

[atguigu@hadoop102 software]$ sudo systemctl start mysqld

5) Look up the temporary MySQL password

[atguigu@hadoop102 software]$ sudo cat /var/log/mysqld.log | grep password
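The log should contain a line similar to the following; the generated password differs on every install:

[Note] A temporary password is generated for root@localhost: <temporary password>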

2.1.3 Configuring MySQL

Goal: allow the root user to log in to MySQL from any host with the password.

1) Log in to MySQL with the password found above (if that fails, wrap the password in single quotes)

[atguigu@hadoop102 software]$ mysql -uroot -p'password'

2) Set a complex password (MySQL's password policy requires it to be sufficiently complex)

mysql> set password=password("Qs23=zs32");

3) Relax the MySQL password policy

mysql> set global validate_password_length=4;

mysql> set global validate_password_policy=0;

4) Set a simple, easy-to-remember password

mysql> set password=password("000000");

5) Switch to the mysql database

mysql> use mysql

6) Query the user table

mysql> select user, host from user;

7) Update the user table, setting the host column for root to %

mysql> update user set host="%" where user="root";

8) Flush privileges

mysql> flush privileges;

9) Exit

mysql> quit;
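To confirm that remote root access now works, try a login from another node; a minimal sketch, assuming the MySQL client is installed on hadoop103:

[atguigu@hadoop103 ~]$ mysql -h hadoop102 -uroot -p000000 -e "select 1;"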

2.2 Installing Sqoop

2.2.1 Download and extract

1) Download link: http://mirrors.hust.edu.cn/apache/sqoop/1.4.6/

2) Upload the package sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz to /opt/software on hadoop102

3) Extract the Sqoop package to the target directory, e.g.:

[atguigu@hadoop102 software]$ tar -zxf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/module/

4) Rename the extracted directory to sqoop:

[atguigu@hadoop102 module]$ mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop

2.2.2 Editing the configuration file

1) Enter the /opt/module/sqoop/conf directory and rename the template configuration file

[atguigu@hadoop102 conf]$ mv sqoop-env-template.sh sqoop-env.sh

2) Edit the configuration file

[atguigu@hadoop102 conf]$ vim sqoop-env.sh

Append the following content

export HADOOP_COMMON_HOME=/opt/module/hadoop-3.1.3

export HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3

export HIVE_HOME=/opt/module/hive

export ZOOKEEPER_HOME=/opt/module/zookeeper-3.5.7

export ZOOCFGDIR=/opt/module/zookeeper-3.5.7/conf

2.2.3 Copying the JDBC driver

1) Upload mysql-connector-java-5.1.48.jar to /opt/software

2) From /opt/software/, copy the JDBC driver into Sqoop's lib directory.

[atguigu@hadoop102 software]$ cp mysql-connector-java-5.1.48.jar /opt/module/sqoop/lib/

2.2.4 Verifying Sqoop

Run any Sqoop command to check that the configuration is correct, for example:

[atguigu@hadoop102 sqoop]$ bin/sqoop help

Some warnings appear (omitted here), followed by the help output:

Available commands:

  codegen            Generate code to interact with database records

  create-hive-table     Import a table definition into Hive

  eval               Evaluate a SQL statement and display the results

  export             Export an HDFS directory to a database table

  help               List available commands

  import             Import a table from a database to HDFS

  import-all-tables     Import tables from a database to HDFS

  import-mainframe    Import datasets from a mainframe server to HDFS

  job                Work with saved jobs

  list-databases        List available databases on a server

  list-tables           List available tables in a database

  merge              Merge results of incremental imports

  metastore           Run a standalone Sqoop metastore

  version            Display version information

2.2.5 Testing the Sqoop connection to the database

[atguigu@hadoop102 sqoop]$ bin/sqoop list-databases --connect jdbc:mysql://hadoop102:3306/ --username root --password 000000

The output should look like this:

information_schema

metastore

mysql

oozie

performance_schema

2.3 Generating Business Data

2.3.1 Connecting to MySQL

Connect to MySQL with SQLyog, a MySQL GUI client.

 

2.3.2 Schema script

1) Create the gmall database in SQLyog

2) Set the database character set

 

3) Run the schema script (gmall2020-03-16.sql)

2.3.3 Generating the business data

1) Create a db_log directory under /opt/module/ on hadoop102

[atguigu@hadoop102 module]$ mkdir db_log/

2) Upload gmall-mock-db-2020-03-16-SNAPSHOT.jar and application.properties to /opt/module/db_log on hadoop102.

3) Adjust the settings in application.properties as needed

logging.level.root=info
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://hadoop102:3306/gmall?characterEncoding=utf-8&useSSL=false&serverTimezone=GMT%2B8
spring.datasource.username=root
spring.datasource.password=000000
logging.pattern.console=%m%n
mybatis-plus.global-config.db-config.field-strategy=not_null

# Business date
mock.date=2020-03-10
# Whether to reset existing data
mock.clear=1
# Number of new users to generate
mock.user.count=50
# Male user ratio (percent)
mock.user.male-rate=20
# Favorite-cancellation ratio
mock.favor.cancel-rate=10
# Number of favorites
mock.favor.count=100
# Number of cart entries
mock.cart.count=10
# Maximum quantity of one SKU per cart
mock.cart.sku-maxcount-per-cart=3
# Ratio of users who place orders
mock.order.user-rate=80
# Ratio of cart items that get purchased
mock.order.sku-rate=70
# Whether orders join promotional activities
mock.order.join-activity=1
# Whether orders use coupons
mock.order.use-coupon=1
# Number of users who claim coupons
mock.coupon.user-count=10
# Payment ratio
mock.payment.rate=70
# Payment type ratio, Alipay : WeChat : UnionPay
mock.payment.payment-type=30:60:10
# Review ratio, positive : neutral : negative : auto
mock.comment.appraise-rate=30:10:10:50
# Refund reason ratio: quality issue : mismatch with description : out of stock : wrong size : ordered by mistake : no longer wanted : other
mock.refund.reason-rate=30:10:20:5:15:5:5

4) Then, in that directory, run the following command to generate data for 2020-03-10:

[atguigu@hadoop102 db_log]$ java -jar gmall-mock-db-2020-03-16-SNAPSHOT.jar

5) In application.properties, change

mock.date=2020-03-11

mock.clear=0

6) Run the command again to generate data for 2020-03-11:

[atguigu@hadoop102 db_log]$ java -jar gmall-mock-db-2020-03-16-SNAPSHOT.jar
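To sanity-check both runs, count the orders per business date in the gmall database; a minimal sketch:

[atguigu@hadoop102 db_log]$ mysql -uroot -p000000 gmall -e "select date_format(create_time,'%Y-%m-%d') dt, count(*) from order_info group by dt;"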

2.4 Synchronization Strategies

There are four data synchronization strategies: full, incremental, new-and-changed, and special.

  • Full table: store the complete data.
  • Incremental table: store only the newly added data.
  • New-and-changed table: store the newly added data plus the changed data.
  • Special table: needs to be stored only once.

2.4.1 Full synchronization strategy

A full sync loads the entire table every day. It suits tables whose data volume is small.

2.4.2 Incremental synchronization strategy

An incremental sync loads only the rows created that day. It suits large tables whose rows are inserted but not updated afterwards.

2.4.3 New-and-changed strategy

Daily new-and-changed means storing the rows whose creation time or operation time falls on the current day.

It suits tables that are large and see both inserts and updates.

Examples: the user table, the order table, the coupon usage table.
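In the import script of section 2.5.2, this strategy shows up as a WHERE clause of the following shape (the date is passed in as $do_date):

where (date_format(create_time,'%Y-%m-%d')='$do_date'
or date_format(operate_time,'%Y-%m-%d')='$do_date')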

2.4.4 Special strategy

Certain special dimension tables need not follow the strategies above.

1) Objective-world dimensions

Dimensions of the objective world that do not change (such as gender, region, ethnicity, political affiliation, shoe size) can be stored once as fixed values.

2) Date dimension

The date dimension can be loaded in one pass with one year or several years of data.

3) Region dimension

The province table and the region table.

2.5 Importing Business Data into HDFS

2.5.1 Choosing a synchronization strategy per table

As the import script below shows: order_info, coupon_use, and user_info follow the new-and-changed strategy; order_detail, payment_info, order_status_log, activity_order, comment_info, and order_refund_info follow the incremental strategy; sku_info, spu_info, base_category1/2/3, base_trademark, cart_info, favor_info, coupon_info, activity_info, activity_rule, and base_dic are loaded in full; base_province and base_region are special tables, imported only on the first run.

2.5.2 Writing the import script

1) Create the script in the /home/atguigu/bin directory

[atguigu@hadoop102 bin]$ vim gmall_mysql_to_hdfs.sh

Add the following content:

#! /bin/bash

sqoop=/opt/module/sqoop/bin/sqoop

do_date=`date -d '-1 day' +%F`

if [[ -n "$2" ]]; then

    do_date=$2

fi
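
# import_data <table name> "<select statement>": run one Sqoop import for the given
# table into /origin_data/gmall/db/<table name>/<date> on HDFS, LZO-compressed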

import_data(){

$sqoop import \

--connect jdbc:mysql://hadoop102:3306/gmall \

--username root \

--password 000000 \

--target-dir /origin_data/gmall/db/$1/$do_date \

--delete-target-dir \

--query "$2 and  \$CONDITIONS" \

--num-mappers 1 \

--fields-terminated-by '\t' \

--compress \

--compression-codec lzop \

--null-string '\\N' \

--null-non-string '\\N'
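
# Index the LZO-compressed output so downstream MapReduce jobs can split it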

hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /origin_data/gmall/db/$1/$do_date

}

import_order_info(){

  import_data order_info "select

                            id,

                            final_total_amount,

                            order_status,

                            user_id,

                            out_trade_no,

                            create_time,

                            operate_time,

                            province_id,

                            benefit_reduce_amount,

                            original_total_amount,

                            feight_fee      

                        from order_info

                        where (date_format(create_time,'%Y-%m-%d')='$do_date'

                        or date_format(operate_time,'%Y-%m-%d')='$do_date')"

}

import_coupon_use(){

  import_data coupon_use "select

                          id,

                          coupon_id,

                          user_id,

                          order_id,

                          coupon_status,

                          get_time,

                          using_time,

                          used_time

                        from coupon_use

                        where (date_format(get_time,'%Y-%m-%d')='$do_date'

                        or date_format(using_time,'%Y-%m-%d')='$do_date'

                        or date_format(used_time,'%Y-%m-%d')='$do_date')"

}

import_order_status_log(){

  import_data order_status_log "select

                                  id,

                                  order_id,

                                  order_status,

                                  operate_time

                                from order_status_log

                                where date_format(operate_time,'%Y-%m-%d')='$do_date'"

}

import_activity_order(){

  import_data activity_order "select

                                id,

                                activity_id,

                                order_id,

                                create_time

                              from activity_order

                              where date_format(create_time,'%Y-%m-%d')='$do_date'"

}

import_user_info(){

  import_data "user_info" "select

                            id,

                            name,

                            birthday,

                            gender,

                            email,

                            user_level,

                            create_time,

                            operate_time

                          from user_info

                          where (DATE_FORMAT(create_time,'%Y-%m-%d')='$do_date'

                          or DATE_FORMAT(operate_time,'%Y-%m-%d')='$do_date')"

}

import_order_detail(){

  import_data order_detail "select

                              od.id,

                              order_id,

                              user_id,

                              sku_id,

                              sku_name,

                              order_price,

                              sku_num,

                              od.create_time  

                            from order_detail od

                            join order_info oi

                            on od.order_id=oi.id

                            where DATE_FORMAT(od.create_time,'%Y-%m-%d')='$do_date'"

}

import_payment_info(){

  import_data "payment_info"  "select

                                id,  

                                out_trade_no,

                                order_id,

                                user_id,

                                alipay_trade_no,

                                total_amount,  

                                subject,

                                payment_type,

                                payment_time

                              from payment_info

                              where DATE_FORMAT(payment_time,'%Y-%m-%d')='$do_date'"

}

import_comment_info(){

  import_data comment_info "select

                              id,

                              user_id,

                              sku_id,

                              spu_id,

                              order_id,

                              appraise,

                              comment_txt,

                              create_time

                            from comment_info

                            where date_format(create_time,'%Y-%m-%d')='$do_date'"

}

import_order_refund_info(){

  import_data order_refund_info "select

                                id,

                                user_id,

                                order_id,

                                sku_id,

                                refund_type,

                                refund_num,

                                refund_amount,

                                refund_reason_type,

                                create_time

                              from order_refund_info

                              where date_format(create_time,'%Y-%m-%d')='$do_date'"

}

import_sku_info(){

  import_data sku_info "select

                          id,

                          spu_id,

                          price,

                          sku_name,

                          sku_desc,

                          weight,

                          tm_id,

                          category3_id,

                          create_time

                        from sku_info where 1=1"

}

import_base_category1(){

  import_data "base_category1" "select

                                  id,

                                  name

                                from base_category1 where 1=1"

}

import_base_category2(){

  import_data "base_category2" "select

                                  id,

                                  name,

                                  category1_id

                                from base_category2 where 1=1"

}

import_base_category3(){

  import_data "base_category3" "select

                                  id,

                                  name,

                                  category2_id

                                from base_category3 where 1=1"

}

import_base_province(){

  import_data base_province "select

                              id,

                              name,

                              region_id,

                              area_code,

                              iso_code

                            from base_province

                            where 1=1"

}

import_base_region(){

  import_data base_region "select

                              id,

                              region_name

                            from base_region

                            where 1=1"

}

import_base_trademark(){

  import_data base_trademark "select

                                tm_id,

                                tm_name

                              from base_trademark

                              where 1=1"

}

import_spu_info(){

  import_data spu_info "select

                            id,

                            spu_name,

                            category3_id,

                            tm_id

                          from spu_info

                          where 1=1"

}

import_favor_info(){

  import_data favor_info "select

                          id,

                          user_id,

                          sku_id,

                          spu_id,

                          is_cancel,

                          create_time,

                          cancel_time

                        from favor_info

                        where 1=1"

}

import_cart_info(){

  import_data cart_info "select

                        id,

                        user_id,

                        sku_id,

                        cart_price,

                        sku_num,

                        sku_name,

                        create_time,

                        operate_time,

                        is_ordered,

                        order_time

                      from cart_info

                      where 1=1"

}

import_coupon_info(){

  import_data coupon_info "select

                          id,

                          coupon_name,

                          coupon_type,

                          condition_amount,

                          condition_num,

                          activity_id,

                          benefit_amount,

                          benefit_discount,

                          create_time,

                          range_type,

                          spu_id,

                          tm_id,

                          category3_id,

                          limit_num,

                          operate_time,

                          expire_time

                        from coupon_info

                        where 1=1"

}

import_activity_info(){

  import_data activity_info "select

                              id,

                              activity_name,

                              activity_type,

                              start_time,

                              end_time,

                              create_time

                            from activity_info

                            where 1=1"

}

import_activity_rule(){

    import_data activity_rule "select

                                    id,

                                    activity_id,

                                    condition_amount,

                                    condition_num,

                                    benefit_amount,

                                    benefit_discount,

                                    benefit_level

                                from activity_rule

                                where 1=1"

}

import_base_dic(){

    import_data base_dic "select

                            dic_code,

                            dic_name,

                            parent_code,

                            create_time,

                            operate_time

                          from base_dic

                          where 1=1"

}

case $1 in

  "order_info")

     import_order_info

;;

  "base_category1")

     import_base_category1

;;

  "base_category2")

     import_base_category2

;;

  "base_category3")

     import_base_category3

;;

  "order_detail")

     import_order_detail

;;

  "sku_info")

     import_sku_info

;;

  "user_info")

     import_user_info

;;

  "payment_info")

     import_payment_info

;;

  "base_province")

     import_base_province

;;

  "base_region")

     import_base_region

;;

  "base_trademark")

     import_base_trademark

;;

  "activity_info")

      import_activity_info

;;

  "activity_order")

      import_activity_order

;;

  "cart_info")

      import_cart_info

;;

  "comment_info")

      import_comment_info

;;

  "coupon_info")

      import_coupon_info

;;

  "coupon_use")

      import_coupon_use

;;

  "favor_info")

      import_favor_info

;;

  "order_refund_info")

      import_order_refund_info

;;

  "order_status_log")

      import_order_status_log

;;

  "spu_info")

      import_spu_info

;;

  "activity_rule")

      import_activity_rule

;;

  "base_dic")

      import_base_dic

;;

"first")

   import_base_category1

   import_base_category2

   import_base_category3

   import_order_info

   import_order_detail

   import_sku_info

   import_user_info

   import_payment_info

   import_base_province

   import_base_region

   import_base_trademark

   import_activity_info

   import_activity_order

   import_cart_info

   import_comment_info

   import_coupon_use

   import_coupon_info

   import_favor_info

   import_order_refund_info

   import_order_status_log

   import_spu_info

   import_activity_rule

   import_base_dic

;;

"all")

   import_base_category1

   import_base_category2

   import_base_category3

   import_order_info

   import_order_detail

   import_sku_info

   import_user_info

   import_payment_info

   import_base_trademark

   import_activity_info

   import_activity_order

   import_cart_info

   import_comment_info

   import_coupon_use

   import_coupon_info

   import_favor_info

   import_order_refund_info

   import_order_status_log

   import_spu_info

   import_activity_rule

   import_base_dic

;;

esac

Note 1:

[ -n value ] tests whether a variable's value is non-empty:

-- non-empty value: returns true

-- empty value: returns false

Note 2:

For usage of the date command, run: [atguigu@hadoop102 ~]$ date --help
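For example, the expression the script uses to default do_date to the previous day prints yesterday's date in YYYY-MM-DD format:

[atguigu@hadoop102 ~]$ date -d '-1 day' +%F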

2) Make the script executable

[atguigu@hadoop102 bin]$ chmod 777 gmall_mysql_to_hdfs.sh

3) Initial import

[atguigu@hadoop102 bin]$ gmall_mysql_to_hdfs.sh first 2020-03-10

4) Daily import

[atguigu@hadoop102 bin]$ gmall_mysql_to_hdfs.sh all 2020-03-11
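After a run, the imported files should appear under the script's target directory on HDFS; a quick check, using the daily-import date above:

[atguigu@hadoop102 bin]$ hadoop fs -ls /origin_data/gmall/db/order_info/2020-03-11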

2.5.3 Project experience

Hive stores NULL on disk as "\N", whereas MySQL stores NULL simply as NULL. To keep the two ends consistent, use the --input-null-string and --input-null-non-string options when exporting data, and the --null-string and --null-non-string options when importing.
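On the export side the options are used symmetrically; a minimal sketch, in which the target table ads_demo and the --export-dir path are hypothetical placeholders rather than part of this chapter:

# ads_demo and the --export-dir path below are placeholders
[atguigu@hadoop102 bin]$ /opt/module/sqoop/bin/sqoop export \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root \
--password 000000 \
--table ads_demo \
--export-dir /warehouse/gmall/ads/ads_demo \
--input-fields-terminated-by '\t' \
--input-null-string '\\N' \
--input-null-non-string '\\N'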

Chapter 3 Data Environment Setup

3.1 Installing and Deploying Hive

1) Upload apache-hive-3.1.2-bin.tar.gz to the /opt/software directory on the Linux host

2) Extract apache-hive-3.1.2-bin.tar.gz into the /opt/module/ directory

[atguigu@hadoop102 software]$ tar -zxvf /opt/software/apache-hive-3.1.2-bin.tar.gz -C /opt/module/

3) Rename the extracted directory apache-hive-3.1.2-bin to hive

[atguigu@hadoop102 software]$ mv /opt/module/apache-hive-3.1.2-bin/ /opt/module/hive

4) Edit /etc/profile.d/my_env.sh to add the environment variables

[atguigu@hadoop102 software]$ sudo vim /etc/profile.d/my_env.sh

5) Add the following content

#HIVE_HOME

export HIVE_HOME=/opt/module/hive

export PATH=$PATH:$HIVE_HOME/bin

Restart the Xshell session for the environment variables to take effect, or run:

[atguigu@hadoop102 software]$ source /etc/profile.d/my_env.sh

6) Resolve the logging jar conflict by entering the /opt/module/hive/lib directory

[atguigu@hadoop102 lib]$ mv log4j-slf4j-impl-2.10.0.jar log4j-slf4j-impl-2.10.0.jar.bak

3.2 Configuring Hive Metadata in MySQL

3.2.1 Copying the driver

Copy the MySQL JDBC driver into Hive's lib directory

[atguigu@hadoop102 lib]$ cp /opt/software/mysql-connector-java-5.1.48.jar /opt/module/hive/lib/

3.2.2 Configuring the metastore to use MySQL

Create a new hive-site.xml file in the $HIVE_HOME/conf directory

[atguigu@hadoop102 conf]$ vim hive-site.xml

Add the following content

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

    <property>

        <name>javax.jdo.option.ConnectionURL</name>

        <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false</value>

    </property>

    <property>

        <name>javax.jdo.option.ConnectionDriverName</name>

        <value>com.mysql.jdbc.Driver</value>

    </property>

    <property>

        <name>javax.jdo.option.ConnectionUserName</name>

        <value>root</value>

    </property>

    <property>

        <name>javax.jdo.option.ConnectionPassword</name>

        <value>000000</value>

    </property>

    <property>

        <name>hive.metastore.warehouse.dir</name>

        <value>/user/hive/warehouse</value>

    </property>

    <property>

        <name>hive.metastore.schema.verification</name>

        <value>false</value>

    </property>

    <property>

        <name>hive.metastore.uris</name>

        <value>thrift://hadoop102:9083</value>

    </property>

    <property>

        <name>hive.server2.thrift.port</name>

        <value>10000</value>

    </property>

    <property>

        <name>hive.server2.thrift.bind.host</name>

        <value>hadoop102</value>

    </property>

    <property>

        <name>hive.metastore.event.db.notification.api.auth</name>

        <value>false</value>

    </property>

    <property>

        <name>hive.cli.print.header</name>

        <value>true</value>

    </property>

    <property>

        <name>hive.cli.print.current.db</name>

        <value>true</value>

    </property>

</configuration>

3.3 Starting Hive

3.3.1 Initializing the metastore database

1) Log in to MySQL

[atguigu@hadoop102 conf]$ mysql -uroot -p000000

2) Create the Hive metastore database

mysql> create database metastore;

mysql> quit;

3) Initialize the Hive metastore schema

[atguigu@hadoop102 conf]$ schematool -initSchema -dbType mysql -verbose

3.3.2 Starting metastore and hiveserver2

1) On Hive 2.x and later, these two services must be started first; otherwise the client fails with the error:

FAILED: HiveException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2) Write a Hive service control script in the /opt/module/hive/bin directory

[atguigu@hadoop102 bin]$ vim hiveservices.sh

The content is as follows:

#!/bin/bash

HIVE_LOG_DIR=$HIVE_HOME/logs

mkdir -p $HIVE_LOG_DIR

# Check whether a process is running normally: arg 1 is the process name, arg 2 is its port

function check_process()

{

    pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')

    ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)

    echo $pid

    [[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1

}

function hive_start()

{

    metapid=$(check_process HiveMetastore 9083)

    cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"

    cmd=$cmd" sleep 4; hdfs dfsadmin -safemode wait >/dev/null 2>&1"

    [ -z "$metapid" ] && eval $cmd || echo "Metastroe服务已启动"

    server2pid=$(check_process HiveServer2 10000)

    cmd="nohup hive --service hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"

    [ -z "$server2pid" ] && eval $cmd || echo "HiveServer2服务已启动"

}

function hive_stop()

{

    metapid=$(check_process HiveMetastore 9083)

    [ "$metapid" ] && kill $metapid || echo "Metastore服务未启动"

    server2pid=$(check_process HiveServer2 10000)

    [ "$server2pid" ] && kill $server2pid || echo "HiveServer2服务未启动"

}

case $1 in

"start")

    hive_start

    ;;

"stop")

    hive_stop

    ;;

"restart")

    hive_stop

    sleep 2

    hive_start

    ;;

"status")

    check_process HiveMetastore 9083 >/dev/null && echo "Metastore is running normally" || echo "Metastore is not running properly"

    check_process HiveServer2 10000 >/dev/null && echo "HiveServer2 is running normally" || echo "HiveServer2 is not running properly"

    ;;

*)

    echo Invalid Args!

    echo 'Usage: '$(basename $0)' start|stop|restart|status'

    ;;

esac

3) Add execute permission

[atguigu@hadoop102 bin]$ chmod +x hiveservices.sh

4) Start the Hive background services

[atguigu@hadoop102 bin]$ hiveservices.sh start

5) Check the status of the Hive background services

[atguigu@hadoop102 bin]$ hiveservices.sh status

Metastore is running normally

HiveServer2 is not running properly

6) Start the Hive client

[atguigu@hadoop102 hive]$ bin/hive
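Once the client starts, a simple statement confirms the metastore connection; a minimal sketch (the prompt shows the current database because hive.cli.print.current.db is set to true above):

hive (default)> show databases;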
