1 Kerberos Overview
1.1 What is Kerberos
Kerberos is a computer network authentication protocol used to securely verify identities for personal communication over an insecure network. The name also refers to the software suite MIT developed to implement the protocol. The software follows a client/server design and supports mutual authentication: the client and the server can each verify the other's identity. Kerberos protects against eavesdropping and replay attacks, preserves data integrity, and is a system that manages keys using symmetric-key cryptography.
Note that Kerberos is not k8s. Kubernetes, abbreviated k8s, is an open-source system for managing containerized applications across multiple hosts in a cloud platform; it has nothing to do with the Kerberos covered here, so be careful not to confuse the two.
1.2 Kerberos Concepts
A few Kerberos terms to know:
1) KDC: the Key Distribution Center, which issues and manages tickets and records authorizations.
2) Realm: the identifier of an administrative domain in Kerberos.
3) Principal: every user or service added to the KDC gets a principal entry, of the form: primary/instance@REALM.
4) Primary: a user name or service name, i.e. the identity that provides a network service (such as hdfs, yarn, or hive).
5) Instance: roughly, the host name.
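The principal format above can be illustrated with plain shell string handling. A minimal sketch (the principal string is hypothetical, chosen only to show the three parts):

```shell
#!/bin/bash
# Hypothetical principal, purely for illustration
principal="hdfs/hadoop102@HADOOP.COM"

realm="${principal#*@}"        # part after '@'   -> HADOOP.COM    (realm)
rest="${principal%@*}"         # part before '@'  -> hdfs/hadoop102
primary="${rest%%/*}"          # before the '/'   -> hdfs          (user or service name)
instance="${rest#*/}"          # after the '/'    -> hadoop102     (host name)

echo "primary=$primary instance=$instance realm=$realm"
```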
For a deeper explanation of how Kerberos works, see:
https://cloud.tencent.com/developer/article/1496451
1.3 How Kerberos Authentication Works
2 Installing Kerberos
2.1 Install Kerberos packages on the server node
[root@hadoop102 ~]# yum install -y krb5-server krb5-workstation krb5-libs
# verify the installation
[root@hadoop102 ~]# rpm -qa | grep krb5
krb5-workstation-1.10.3-65.el6.x86_64
krb5-libs-1.10.3-65.el6.x86_64
krb5-server-1.10.3-65.el6.x86_64
2.2 Install packages on the client nodes
[root@hadoop103 ~]# yum install -y krb5-workstation krb5-libs
[root@hadoop104 ~]# yum install -y krb5-workstation krb5-libs
# verify the installation
[root@hadoop103 ~]# rpm -qa | grep krb5
krb5-workstation-1.10.3-65.el6.x86_64
krb5-libs-1.10.3-65.el6.x86_64
2.3 配置kerberos
需要配置的文件有两个为kdc.conf和krb5.conf , kdc配置只是需要Server服务节点配置,即hadoop102.
1) kdc配置
[root@hadoop102 ~]# vim /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
HADOOP.COM = {
#master_key_type = aes256-cts
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/share/dict/words
admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
max_life = 1d
max_renewable_life = 7d
supported_enctypes = aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
}
Notes:
HADOOP.COM: the realm name. Kerberos supports multiple realms; by convention realm names are all uppercase.
acl_file: the ACL file that defines admin privileges.
admin_keytab: the keytab the KDC uses for verification.
supported_enctypes: the supported encryption types. Note that aes256-cts has been removed: using aes256-cts from Java requires installing extra JCE policy files, so it is not used here.
2) krb5.conf
[root@hadoop102 ~]# vim /etc/krb5.conf
includedir /etc/krb5.conf.d/
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
dns_lookup_realm = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
rdns = false
pkinit_anchors = /etc/pki/tls/certs/ca-bundle.crt
default_realm = HADOOP.COM
#default_ccache_name = KEYRING:persistent:%{uid}
udp_preference_limit = 1
[realms]
HADOOP.COM = {
kdc = hadoop102
admin_server = hadoop102
}
[domain_realm]
# .example.com = EXAMPLE.COM
# example.com = EXAMPLE.COM
Notes:
default_realm: the default realm for Kerberos applications; it must match the realm name configured above.
ticket_lifetime: how long a credential is valid, typically 24 hours.
renew_lifetime: the maximum period over which a credential can be renewed, typically one week. Once a credential expires, subsequent access to Kerberos-secured services fails.
udp_preference_limit = 1: disables UDP, which avoids a known Hadoop bug.
realms: the realms in use; to add more realms, simply add more entries to the [realms] section.
domain_realm: the mapping from cluster domain names to Kerberos realms; with a single realm this can be left empty.
3) Sync krb5.conf to the client nodes
[root@hadoop102 ~]# xsync /etc/krb5.conf
2.4 Create the Kerberos database
Run on the server node:
[root@hadoop102 ~]# kdb5_util create -s
Loading random data
Initializing database '/var/kerberos/krb5kdc/principal' for realm 'HADOOP.COM',
master key name 'K/M@HADOOP.COM'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key: (enter password)
Re-enter KDC database master key to verify: (confirm password)
After creation, the corresponding files appear under /var/kerberos/krb5kdc:
[root@hadoop102 ~]# ls /var/kerberos/krb5kdc/
kadm5.acl kdc.conf principal principal.kadm5 principal.kadm5.lock principal.ok
2.5 Grant the Kerberos administrator full privileges
[root@hadoop102 ~]# vim /var/kerberos/krb5kdc/kadm5.acl
# change the content to:
*/admin@HADOOP.COM *
Notes:
*/admin: any principal whose instance is admin
@HADOOP.COM: the realm
*: all privileges
This grant means: every principal with the admin instance holds full privileges in the HADOOP.COM realm. In other words, whenever a principal is created with the instance admin, it automatically receives full privileges over HADOOP.COM; for example, a principal user1/admin would have full privileges in the realm.
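For reference, kadm5.acl also supports finer-grained grants than the wildcard. The entries below are a hypothetical sketch of the MIT syntax (permission letters: a = add, d = delete, m = modify, i = inquire, l = list) and are not part of this setup:

```
# full privileges for any principal whose instance is admin (as used in this guide)
*/admin@HADOOP.COM      *
# hypothetical: user1 may only add and inquire about principals
user1@HADOOP.COM        ai
# hypothetical: user2 may only list and inquire
user2@HADOOP.COM        li
```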
2.6 Start the Kerberos services
# start krb5kdc
[root@hadoop102 ~]# systemctl start krb5kdc
# start kadmin
[root@hadoop102 ~]# systemctl start kadmin
# enable at boot
[root@hadoop102 ~]# systemctl enable krb5kdc
# check whether it is enabled at boot
[root@hadoop102 ~]# systemctl is-enabled krb5kdc
[root@hadoop102 ~]# systemctl enable kadmin
# check whether it is enabled at boot
[root@hadoop102 ~]# systemctl is-enabled kadmin
Note: if startup fails, check /var/log/krb5kdc.log and /var/log/kadmind.log.
2.7 Create the administrator principal
[root@hadoop102 ~]# kadmin.local -q "addprinc admin/admin"
Authenticating as principal root/admin@HADOOP.COM with password.
WARNING: no policy specified for admin/admin@HADOOP.COM; defaulting to no policy
Enter password for principal "admin/admin@HADOOP.COM": (enter password)
Re-enter password for principal "admin/admin@HADOOP.COM": (confirm password)
Principal "admin/admin@HADOOP.COM" created.
2.8 Verify the administrator with kinit
[root@hadoop102 ~]# kinit admin/admin
Password for admin/admin@HADOOP.COM: (enter password)
[root@hadoop102 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: admin/admin@HADOOP.COM
Valid starting Expires Service principal
08/27/19 14:41:39 08/28/19 14:41:39 krbtgt/HADOOP.COM@HADOOP.COM
renew until 08/27/19 14:41:39
If admin/admin@HADOOP.COM appears as the default principal above, everything works.
Try on another machine:
[root@hadoop103 ~]# kinit admin/admin
Password for admin/admin@HADOOP.COM: (enter password)
[root@hadoop103 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: admin/admin@HADOOP.COM
Valid starting Expires Service principal
08/27/19 14:41:39 08/28/19 14:41:39 krbtgt/HADOOP.COM@HADOOP.COM
renew until 08/27/19 14:41:39
If you see kadmin: GSS-API (or Kerberos) error while initializing kadmin interface, restart the ntp service:
[root@hadoop103 ~]# service ntpd restart
Shutting down ntpd: [ OK ]
Starting ntpd: [ OK ]
3 Kerberos Database Operations
3.1 Log in to the Kerberos database
1) Local login (no authentication required)
[root@hadoop102 ~]# kadmin.local
Authenticating as principal root/admin@HADOOP.COM with password.
kadmin.local:
2) Remote login (requires principal authentication; authenticate as the administrator principal created above first)
[root@hadoop103 ~]# kadmin
Authenticating as principal admin/admin@HADOOP.COM with password.
Password for admin/admin@HADOOP.COM:
kadmin:
Type exit to quit.
3.2 Create a Kerberos principal
[root@hadoop102 ~]# kadmin.local -q "addprinc zhouchen/zhouchen"
Authenticating as principal root/admin@HADOOP.COM with password.
WARNING: no policy specified for zhouchen/zhouchen@HADOOP.COM; defaulting to no policy
Enter password for principal "zhouchen/zhouchen@HADOOP.COM": (enter password)
Re-enter password for principal "zhouchen/zhouchen@HADOOP.COM": (confirm password)
Principal "zhouchen/zhouchen@HADOOP.COM" created.
3.3 Change a principal's password
[root@hadoop102 ~]# kadmin.local -q "cpw zhouchen/zhouchen"
Authenticating as principal root/admin@HADOOP.COM with password.
Enter password for principal "zhouchen/zhouchen@HADOOP.COM": (enter password)
Re-enter password for principal "zhouchen/zhouchen@HADOOP.COM": (confirm password)
Password for "zhouchen/zhouchen@HADOOP.COM" changed.
3.4 List all principals
[root@hadoop102 ~]# kadmin.local -q "list_principals"
Authenticating as principal root/admin@HADOOP.COM with password.
K/M@HADOOP.COM
admin/admin@HADOOP.COM
zhouchen/zhouchen@HADOOP.COM
kadmin/admin@HADOOP.COM
kadmin/changepw@HADOOP.COM
kadmin/hadoop102@HADOOP.COM
kiprop/hadoop102@HADOOP.COM
krbtgt/HADOOP.COM@HADOOP.COM
4 Kerberos Principal Authentication
Kerberos offers two ways to authenticate: entering a password, or using a keytab key file. The two cannot be used for the same principal at the same time.
4.1 Password authentication
1) Authenticate the principal with kinit
[root@hadoop102 ~]# kinit zhouchen/zhouchen
Password for zhouchen/zhouchen@HADOOP.COM: (enter password)
2) View the credential
[root@hadoop102 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: zhouchen/zhouchen@HADOOP.COM
Valid starting Expires Service principal
10/27/2019 18:23:57 10/28/2019 18:23:57 krbtgt/HADOOP.COM@HADOOP.COM
renew until 11/03/2019 18:23:57
4.2 Keytab file authentication
1) Export the keytab of principal zhouchen/zhouchen to /root/zhouchen.keytab
[root@hadoop102 ~]# kadmin.local -q "xst -k /root/zhouchen.keytab zhouchen/zhouchen@HADOOP.COM"
2) Authenticate with the keytab
[root@hadoop102 ~]# kinit -kt /root/zhouchen.keytab zhouchen/zhouchen
3) View the credential
[root@hadoop102 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: zhouchen/zhouchen@HADOOP.COM
Valid starting Expires Service principal
08/27/19 15:41:28 08/28/19 15:41:28 krbtgt/HADOOP.COM@HADOOP.COM
renew until 08/27/19 15:41:28
4.3 Destroy credentials
[root@hadoop102 ~]# kdestroy
[root@hadoop102 ~]# klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_0)
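In cron jobs and scripts it is common to renew the ticket only when the cached one is no longer valid. A minimal sketch of that pattern, assuming the keytab and principal created above (klist -s exits 0 when a valid ticket cache exists; the kinit call is only echoed here so the sketch can run anywhere):

```shell
#!/bin/bash
# Renew from the keytab only when no valid ticket is cached
ensure_ticket() {
  keytab=$1
  principal=$2
  if command -v klist >/dev/null 2>&1 && klist -s 2>/dev/null; then
    echo "ticket still valid"
  else
    # on a secured node this would actually run: kinit -kt "$keytab" "$principal"
    echo "kinit -kt $keytab $principal"
  fi
}

result=$(ensure_ticket /root/zhouchen.keytab zhouchen/zhouchen)
echo "$result"
```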
5 Enabling Kerberos in CDH
5.1 Create an administrator principal for Cloudera Manager
[root@hadoop102 ~]# kadmin.local -q "addprinc cloudera-scm/admin"
Authenticating as principal root/admin@HADOOP.COM with password.
WARNING: no policy specified for cloudera-scm/admin@HADOOP.COM; defaulting to no policy
Enter password for principal "cloudera-scm/admin@HADOOP.COM": (enter password)
Re-enter password for principal "cloudera-scm/admin@HADOOP.COM": (confirm password)
Principal "cloudera-scm/admin@HADOOP.COM" created.
5.2 Enable Kerberos
5.3 Confirm the environment (check all boxes)
5.4 Fill in the configuration
Kerberos encryption types: aes128-cts, des3-hmac-sha1, arcfour-hmac
5.5 Continue
5.6 Enter the principal name and password
5.7 Wait for the import into the KDC
5.8 Prepare to restart the cluster
5.9 Wait for completion
5.10 View the principals
[root@hadoop102 ~]# kadmin.local -q "list_principals"
Authenticating as principal cloudera-scm/admin@HADOOP.COM with password.
HTTP/hadoop102@HADOOP.COM
HTTP/hadoop103@HADOOP.COM
HTTP/hadoop104@HADOOP.COM
K/M@HADOOP.COM
admin/admin@HADOOP.COM
zhouchen@HADOOP.COM
cloudera-scm/admin@HADOOP.COM
hdfs/hadoop102@HADOOP.COM
hdfs/hadoop103@HADOOP.COM
hdfs/hadoop104@HADOOP.COM
hive/hadoop102@HADOOP.COM
hue/hadoop102@HADOOP.COM
kadmin/admin@HADOOP.COM
kadmin/changepw@HADOOP.COM
kadmin/hadoop102@HADOOP.COM
krbtgt/HADOOP.COM@HADOOP.COM
mapred/hadoop102@HADOOP.COM
oozie/hadoop102@HADOOP.COM
sentry/hadoop102@HADOOP.COM
yarn/hadoop102@HADOOP.COM
yarn/hadoop103@HADOOP.COM
yarn/hadoop104@HADOOP.COM
zookeeper/hadoop102@HADOOP.COM
zookeeper/hadoop103@HADOOP.COM
zookeeper/hadoop104@HADOOP.COM
6 Kerberos安全环境实操
在启用Kerberos之后,系统与系统(flume-kafka)之间的通讯,以及用户与系统(user-hdfs)之间的通讯都需要先进行安全认证,认证通过之后方可进行通讯。
故在启用Kerberos后,数仓中使用的脚本等,均需要加入一步安全认证的操作,才能正常工作。
6.1 用户访问服务认证
开启Kerberos安全认证之后,日常的访问服务(例如访问HDFS,消费Kafka topic等)都需要先进行安全认证
1)在Kerberos数据库中创建用户主体/实例
[root@hadoop102 ~]# kadmin.local -q "addprinc hive/hive@HADOOP.COM"
2) Authenticate as the user
[root@hadoop102 ~]# kinit hive/hive@HADOOP.COM
3) Access HDFS
[root@hadoop102 ~]# hadoop fs -ls /
Found 4 items
drwxr-xr-x - hive hive 0 2019-10-02 01:29 /origin_data
drwxrwxrwt - hdfs supergroup 0 2019-10-03 00:20 /tmp
drwxr-xr-x - hdfs supergroup 0 2019-10-02 01:35 /user
drwxr-xr-x - hive hive 0 2019-10-02 01:38 /warehouse
4) Query with Hive
[root@hadoop102 ~]# hive
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/jars/hive-common-2.1.1-cdh6.2.1.jar!/hive-log4j2.properties Async: false
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>
5) Consume a Kafka topic
(1) Modify the Kafka configuration
1. In the Kafka configuration, search for "security.inter.broker.protocol" and set it to SASL_PLAINTEXT.
2. In the Kafka configuration, search for "ssl.client.auth" and set it to none.
(2) Create the jaas.conf file
[root@hadoop102 hive]# vim /var/lib/hive/jaas.conf
# file content:
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true;
};
(3) Create the consumer.properties file
[root@hadoop102 conf]# vim /etc/kafka/conf/consumer.properties
# file content:
security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka
(4) Declare the path to the jaas.conf file
[root@hadoop102 conf]# export KAFKA_OPTS="-Djava.security.auth.login.config=/var/lib/hive/jaas.conf"
(5) Consume the Kafka topic with kafka-console-consumer
[root@hadoop102 ~]# kafka-console-consumer --bootstrap-server hadoop102:9092 --topic topic_start --from-beginning --consumer.config /etc/kafka/conf/consumer.properties
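Steps (2) through (4) above can also be scripted. The sketch below writes the same two client config files to temporary paths (the real paths in this guide are /var/lib/hive/jaas.conf and /etc/kafka/conf/consumer.properties) and exports KAFKA_OPTS accordingly:

```shell
#!/bin/bash
# Write the JAAS config that tells Kafka clients to log in via Kerberos
jaas_file=$(mktemp)
cat > "$jaas_file" <<'EOF'
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=true;
};
EOF

# Write the consumer properties enabling SASL over plaintext
props_file=$(mktemp)
cat > "$props_file" <<'EOF'
security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka
EOF

# Point the JVM started by kafka-console-consumer at the JAAS file
export KAFKA_OPTS="-Djava.security.auth.login.config=$jaas_file"
echo "$KAFKA_OPTS"
```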
6) Browser authentication for the HDFS Web UI
After enabling Kerberos in CDH, you can reach the NameNode Web UI on port 9870, but you cannot browse directories or files, because the local environment has not been authenticated.
The following sets up local browser authentication.
Note: due to browser limitations, use Firefox here; others such as Chrome and IE run into problems.
(1) Download Firefox
(2) Configure the browser
1. Open Firefox and enter about:config in the address bar to reach the settings page.
2. Search for "network.negotiate-auth.trusted-uris" and set the value to your server host names.
3. Search for "network.auth.use-sspi" and double-click to set it to false.
(3) Install MIT Kerberos for Windows (kfw)
1. Install the provided kfw-4.1-amd64.msi.
2. Copy the contents of the cluster's /etc/krb5.conf into C:\ProgramData\MIT\Kerberos5\krb5.ini, removing the path-related settings:
[logging]
[libdefaults]
default_realm = HADOOP.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
udp_preference_limit = 1
[realms]
HADOOP.COM = {
kdc = hadoop102
admin_server = hadoop102
}
[domain_realm]
3. Open the MIT Kerberos client and enter the principal name and password.
(4) Test
6.2 User-behavior data warehouse
1) Log-collection Flume configuration
The log-collection Flume sends data to Kafka, so it acts as a Kafka producer and needs the Kafka client authentication described above. Nothing needs to be done manually here: after Kerberos is enabled, CM configures it automatically.
2) Kafka-consuming Flume configuration
The Kafka-consuming Flume moves data from Kafka to HDFS, acting as a Kafka consumer, so it also needs the Kafka client authentication (configured automatically by CM, no manual step). In addition, it needs HDFS client authentication, which must be configured by hand.
(1) Generate a keytab for the hive user
Authentication can be done by entering a password or by using a keytab file; here a keytab is required.
[root@hadoop102 hive]# kadmin.local -q "xst -k /var/lib/hive/hive.keytab hive/hive@HADOOP.COM"
(2) Add read permission
chmod +r /var/lib/hive/hive.keytab
(3) Distribute the keytab to all nodes
xsync /var/lib/hive/hive.keytab
(4) Modify the Flume agent configuration
## components
a1.sources=r1 r2
a1.channels=c1 c2
a1.sinks=k1 k2
## source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r1.kafka.topics=topic_start
## source2
a1.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r2.batchSize = 5000
a1.sources.r2.batchDurationMillis = 2000
a1.sources.r2.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r2.kafka.topics=topic_event
## channel1
a1.channels.c1.type=memory
a1.channels.c1.capacity=100000
a1.channels.c1.transactionCapacity=10000
## channel2
a1.channels.c2.type=memory
a1.channels.c2.capacity=100000
a1.channels.c2.transactionCapacity=10000
## sink1
a1.sinks.k1.type = hdfs
#a1.sinks.k1.hdfs.proxyUser=hive
a1.sinks.k1.hdfs.kerberosPrincipal=hive/hive@HADOOP.COM
a1.sinks.k1.hdfs.kerberosKeytab=/var/lib/hive/hive.keytab
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_start/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = logstart-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = second
##sink2
a1.sinks.k2.type = hdfs
#a1.sinks.k2.hdfs.proxyUser=hive
a1.sinks.k2.hdfs.kerberosPrincipal=hive/hive@HADOOP.COM
a1.sinks.k2.hdfs.kerberosKeytab=/var/lib/hive/hive.keytab
a1.sinks.k2.hdfs.path = /origin_data/gmall/log/topic_event/%Y-%m-%d
a1.sinks.k2.hdfs.filePrefix = logevent-
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 10
a1.sinks.k2.hdfs.roundUnit = second
## avoid generating many small files
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollInterval = 10
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollCount = 0
## output file type: compressed stream (lzop)
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k2.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = lzop
a1.sinks.k2.hdfs.codeC = lzop
## wiring
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1
a1.sources.r2.channels = c2
a1.sinks.k2.channel= c2
3) ODS layer
Edit the ODS data load script
[root@hadoop102 bin]# vim ods_db.sh
# content:
#!/bin/bash
kinit -kt /var/lib/hive/hive.keytab hive/hive
# define a variable for easy modification
APP=gmall
# use the date passed as the first argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
echo "=== log date: $do_date ==="
sql="
load data inpath '/origin_data/gmall/log/topic_start/$do_date' into table "$APP".ods_start_log partition(dt='$do_date');
"
beeline -u "jdbc:hive2://hadoop102:10000/;principal=hive/hadoop102@HADOOP.COM" -n hive -e "$sql"
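The date-defaulting logic at the top of this script (and the ones that follow) can be exercised on its own; a small sketch:

```shell
#!/bin/bash
# Return the first argument if given, otherwise yesterday in YYYY-MM-DD form
pick_date() {
  if [ -n "$1" ]; then
    echo "$1"
  else
    date -d "-1 day" +%F
  fi
}

pick_date 2019-10-03   # prints 2019-10-03
pick_date              # prints yesterday's date
```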
4) DWD layer
Edit the DWD data load script
[root@hadoop102 bin]# vim dwd_start_log.sh
# content:
#!/bin/bash
kinit -kt /var/lib/hive/hive.keytab hive/hive
# define a variable for easy modification
APP=gmall
# use the date passed as the first argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_start_log
PARTITION (dt='$do_date')
select
get_json_object(line,'$.mid') mid_id,
get_json_object(line,'$.uid') user_id,
get_json_object(line,'$.vc') version_code,
get_json_object(line,'$.vn') version_name,
get_json_object(line,'$.l') lang,
get_json_object(line,'$.sr') source,
get_json_object(line,'$.os') os,
get_json_object(line,'$.ar') area,
get_json_object(line,'$.md') model,
get_json_object(line,'$.ba') brand,
get_json_object(line,'$.sv') sdk_version,
get_json_object(line,'$.g') gmail,
get_json_object(line,'$.hw') height_width,
get_json_object(line,'$.t') app_time,
get_json_object(line,'$.nw') network,
get_json_object(line,'$.ln') lng,
get_json_object(line,'$.la') lat,
get_json_object(line,'$.entry') entry,
get_json_object(line,'$.open_ad_type') open_ad_type,
get_json_object(line,'$.action') action,
get_json_object(line,'$.loading_time') loading_time,
get_json_object(line,'$.detail') detail,
get_json_object(line,'$.extend1') extend1
from "$APP".ods_start_log
where dt='$do_date';
"
beeline -u "jdbc:hive2://hadoop102:10000/;principal=hive/hadoop102@HADOOP.COM" -n hive -e "$sql"
5) DWS layer
Edit the DWS data load script
[root@hadoop102 bin]# vim dws_log.sh
# content:
#!/bin/bash
kinit -kt /var/lib/hive/hive.keytab hive/hive
# define a variable for easy modification
APP=gmall
# use the date passed as the first argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dws_uv_detail_day partition(dt='$do_date')
select
mid_id,
concat_ws('|', collect_set(user_id)) user_id,
concat_ws('|', collect_set(version_code)) version_code,
concat_ws('|', collect_set(version_name)) version_name,
concat_ws('|', collect_set(lang)) lang,
concat_ws('|', collect_set(source)) source,
concat_ws('|', collect_set(os)) os,
concat_ws('|', collect_set(area)) area,
concat_ws('|', collect_set(model)) model,
concat_ws('|', collect_set(brand)) brand,
concat_ws('|', collect_set(sdk_version)) sdk_version,
concat_ws('|', collect_set(gmail)) gmail,
concat_ws('|', collect_set(height_width)) height_width,
concat_ws('|', collect_set(app_time)) app_time,
concat_ws('|', collect_set(network)) network,
concat_ws('|', collect_set(lng)) lng,
concat_ws('|', collect_set(lat)) lat
from "$APP".dwd_start_log
where dt='$do_date'
group by mid_id;
"
beeline -u "jdbc:hive2://hadoop102:10000/;principal=hive/hadoop102@HADOOP.COM" -n hive -e "$sql"
6) ADS layer
Edit the ADS data load script
[root@hadoop102 bin]# vim ads_uv_log.sh
# content:
#!/bin/bash
kinit -kt /var/lib/hive/hive.keytab hive/hive
# define a variable for easy modification
APP=gmall
# use the date passed as the first argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table "$APP".ads_uv_count
select
'$do_date' dt,
daycount.ct
from
(
select
'$do_date' dt,
count(*) ct
from "$APP".dws_uv_detail_day
where dt='$do_date'
)daycount;
"
beeline -u "jdbc:hive2://hadoop102:10000/;principal=hive/hadoop102@HADOOP.COM" -n hive -e "$sql"
6.3 Business data warehouse
1) Sqoop import
Edit the Sqoop import script
[root@hadoop102 bin]# vim sqoop_import.sh
# content:
#!/bin/bash
kinit -kt /var/lib/hive/hive.keytab hive/hive
#export HADOOP_USER_NAME=hive
db_date=$2
echo $db_date
db_name=gmall
import_data() {
sqoop import \
--connect jdbc:mysql://hadoop102:3306/$db_name \
--username root \
--password Zhouchen.123456 \
--target-dir /origin_data/$db_name/db/$1/$db_date \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query "$2"' and $CONDITIONS;'
}
import_sku_info(){
import_data "sku_info" "select
id, spu_id, price, sku_name, sku_desc, weight, tm_id,
category3_id, create_time
from sku_info where 1=1"
}
import_user_info(){
import_data "user_info" "select
id, name, birthday, gender, email, user_level,
create_time
from user_info where 1=1"
}
import_base_category1(){
import_data "base_category1" "select
id, name from base_category1 where 1=1"
}
import_base_category2(){
import_data "base_category2" "select
id, name, category1_id from base_category2 where 1=1"
}
import_base_category3(){
import_data "base_category3" "select id, name, category2_id from base_category3 where 1=1"
}
import_order_detail(){
import_data "order_detail" "select
od.id,
order_id,
user_id,
sku_id,
sku_name,
order_price,
sku_num,
o.create_time
from order_info o , order_detail od
where o.id=od.order_id
and DATE_FORMAT(create_time,'%Y-%m-%d')='$db_date'"
}
import_payment_info(){
import_data "payment_info" "select
id,
out_trade_no,
order_id,
user_id,
alipay_trade_no,
total_amount,
subject,
payment_type,
payment_time
from payment_info
where DATE_FORMAT(payment_time,'%Y-%m-%d')='$db_date'"
}
import_order_info(){
import_data "order_info" "select
id,
total_amount,
order_status,
user_id,
payment_way,
out_trade_no,
create_time,
operate_time
from order_info
where (DATE_FORMAT(create_time,'%Y-%m-%d')='$db_date' or DATE_FORMAT(operate_time,'%Y-%m-%d')='$db_date')"
}
case $1 in
"base_category1")
import_base_category1
;;
"base_category2")
import_base_category2
;;
"base_category3")
import_base_category3
;;
"order_info")
import_order_info
;;
"order_detail")
import_order_detail
;;
"sku_info")
import_sku_info
;;
"user_info")
import_user_info
;;
"payment_info")
import_payment_info
;;
"all")
import_base_category1
import_base_category2
import_base_category3
import_order_info
import_order_detail
import_sku_info
import_user_info
import_payment_info
;;
esac
2) ODS layer
Edit the ODS load script
[root@hadoop102 bin]# vim ods_db.sh
# content:
#!/bin/bash
kinit -kt /var/lib/hive/hive.keytab hive/hive
APP=gmall
# use the date passed as the first argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
load data inpath '/origin_data/$APP/db/order_info/$do_date' OVERWRITE into table "$APP".ods_order_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/order_detail/$do_date' OVERWRITE into table "$APP".ods_order_detail partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/sku_info/$do_date' OVERWRITE into table "$APP".ods_sku_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/user_info/$do_date' OVERWRITE into table "$APP".ods_user_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/payment_info/$do_date' OVERWRITE into table "$APP".ods_payment_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category1/$do_date' OVERWRITE into table "$APP".ods_base_category1 partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category2/$do_date' OVERWRITE into table "$APP".ods_base_category2 partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category3/$do_date' OVERWRITE into table "$APP".ods_base_category3 partition(dt='$do_date');
"
beeline -u "jdbc:hive2://hadoop102:10000/;principal=hive/hadoop102@HADOOP.COM" -n hive -e "$sql"
3) DWD layer
Edit the DWD load script
[root@hadoop102 bin]# vim dwd_db.sh
# content:
#!/bin/bash
kinit -kt /var/lib/hive/hive.keytab hive/hive
# define a variable for easy modification
APP=gmall
# use the date passed as the first argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table "$APP".dwd_order_info partition(dt)
select * from "$APP".ods_order_info
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_order_detail partition(dt)
select * from "$APP".ods_order_detail
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_user_info partition(dt)
select * from "$APP".ods_user_info
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_payment_info partition(dt)
select * from "$APP".ods_payment_info
where dt='$do_date' and id is not null;
insert overwrite table "$APP".dwd_sku_info partition(dt)
select
sku.id,
sku.spu_id,
sku.price,
sku.sku_name,
sku.sku_desc,
sku.weight,
sku.tm_id,
sku.category3_id,
c2.id category2_id ,
c1.id category1_id,
c3.name category3_name,
c2.name category2_name,
c1.name category1_name,
sku.create_time,
sku.dt
from
"$APP".ods_sku_info sku
join "$APP".ods_base_category3 c3 on sku.category3_id=c3.id
join "$APP".ods_base_category2 c2 on c3.category2_id=c2.id
join "$APP".ods_base_category1 c1 on c2.category1_id=c1.id
where sku.dt='$do_date' and c2.dt='$do_date'
and c3.dt='$do_date' and c1.dt='$do_date'
and sku.id is not null;
"
beeline -u "jdbc:hive2://hadoop102:10000/;principal=hive/hadoop102@HADOOP.COM" -n hive -e "$sql"
4) DWS layer
Edit the DWS load script
[root@hadoop102 bin]# vim dws_db_wide.sh
# content:
#!/bin/bash
kinit -kt /var/lib/hive/hive.keytab hive/hive
# define a variable for easy modification
APP=gmall
# use the date passed as the first argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
with
tmp_order as
(
select
user_id,
count(*) order_count,
sum(oi.total_amount) order_amount
from "$APP".dwd_order_info oi
where date_format(oi.create_time,'yyyy-MM-dd')='$do_date'
group by user_id
) ,
tmp_payment as
(
select
user_id,
sum(pi.total_amount) payment_amount,
count(*) payment_count
from "$APP".dwd_payment_info pi
where date_format(pi.payment_time,'yyyy-MM-dd')='$do_date'
group by user_id
)
insert overwrite table "$APP".dws_user_action partition(dt='$do_date')
select
user_actions.user_id,
sum(user_actions.order_count),
sum(user_actions.order_amount),
sum(user_actions.payment_count),
sum(user_actions.payment_amount)
from
(
select
user_id,
order_count,
order_amount,
0 payment_count,
0 payment_amount
from tmp_order
union all
select
user_id,
0 order_count,
0 order_amount,
payment_count,
payment_amount
from tmp_payment
) user_actions
group by user_id;
"
beeline -u "jdbc:hive2://hadoop102:10000/;principal=hive/hadoop102@HADOOP.COM" -n hive -e "$sql"
5) ADS layer
Edit the ADS load script
[root@hadoop102 bin]# vim ads_db_gmv.sh
# content:
#!/bin/bash
kinit -kt /var/lib/hive/hive.keytab hive/hive
# define a variable for easy modification
APP=gmall
# use the date passed as the first argument if given; otherwise default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert into table "$APP".ads_gmv_sum_day
select
'$do_date' dt,
sum(order_count) gmv_count,
sum(order_amount) gmv_amount,
sum(payment_amount) payment_amount
from "$APP".dws_user_action
where dt ='$do_date'
group by dt;
"
beeline -u "jdbc:hive2://hadoop102:10000/;principal=hive/hadoop102@HADOOP.COM" -n hive -e "$sql"
6) Sqoop export
Edit the Sqoop export script
[root@hadoop102 bin]# vim sqoop_export.sh
# content:
#!/bin/bash
kinit -kt /var/lib/hive/hive.keytab hive/hive
db_name=gmall
export_data() {
sqoop export \
--connect "jdbc:mysql://hadoop102:3306/${db_name}?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password Zhouchen.123456 \
--table $1 \
--num-mappers 1 \
--export-dir /warehouse/$db_name/ads/$1 \
--input-fields-terminated-by "\t" \
--update-mode allowinsert \
--update-key $2 \
--input-null-string '\\N' \
--input-null-non-string '\\N'
}
case $1 in
"ads_uv_count")
export_data "ads_uv_count" "dt"
;;
"ads_user_action_convert_day")
export_data "ads_user_action_convert_day" "dt"
;;
"ads_gmv_sum_day")
export_data "ads_gmv_sum_day" "dt"
;;
"all")
export_data "ads_uv_count" "dt"
export_data "ads_user_action_convert_day" "dt"
export_data "ads_gmv_sum_day" "dt"
;;
esac
7 Common Kerberos Problems
7.1 KDC log shows: No such file or directory - while initializing database for realm HADOOP.COM
/var/log/krb5kdc.log contains "No such file or directory - while initializing database for realm HADOOP.COM".
Fix:
1) Check that kdc.conf and krb5.conf are correct and adjust them. Note: section headers such as [kdcdefaults], [logging], and [libdefaults] must not contain spaces.
2) Stop the services
systemctl stop krb5kdc
systemctl stop kadmin
3) Delete the Kerberos database and recreate it
rm -rf /var/kerberos/krb5kdc/*principal*
kdb5_util create -s -r HADOOP.COM
4) Create the administrator
kadmin.local -q "addprinc admin/admin"
5) Start the services
systemctl start krb5kdc
systemctl start kadmin
7.2 kinit with a keytab throws an error
Running kinit -kt /root/zhouchen.keytab zhouchen as the Linux user zhouchen produces:
kinit: ???? while getting initial credentials
or
kinit: Permission denied while getting initial credentials
Fix:
1) As root, change the owner of zhouchen.keytab:
chown zhouchen /root/zhouchen.keytab
2) Change the permissions of zhouchen.keytab to 660:
chmod 660 /root/zhouchen.keytab
7.3 kinit reports a wrong password even though the password is correct
[root@hadoop102 ~]# kinit zhouchen
Password for zhouchen@HADOOP.COM:
kinit: Password incorrect while getting initial credentials
This happens because a keytab has already been generated for zhouchen, so password authentication no longer works. Authenticate with the keytab file instead, or change the password and then authenticate (changing the password invalidates the previously generated keytab).
[root@hadoop102 ~]# kinit -kt /root/zhouchen.keytab zhouchen
[root@hadoop102 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: zhouchen@HADOOP.COM
Valid starting Expires Service principal
08/27/19 16:00:39 08/28/19 16:00:39 krbtgt/HADOOP.COM@HADOOP.COM
renew until 08/27/19 16:00:39
7.4 Database creation fails
Bad krb5 admin server hostname while initializing kadmin interface
Cause: an invalid host name was configured for admin_server in krb5.conf.
Fix: edit krb5.conf, make sure admin_server is set to the correct host name, then rerun the database creation command.
7.5 ZooKeeper cluster fails to start
Error: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: No password provided
Problem: the authentication credentials need to be refreshed.
Fix: with ZooKeeper's security mechanism enabled, regenerate the krb5 credentials.
To refresh the credentials on every host: stop all services and Cloudera Manager, add an arbitrary extra entry to the Kerberos encryption types in the configuration (which triggers the hosts to regenerate their krb5 credentials), then change it back and regenerate once more.
7.6 After starting Hue, the Kerberos Ticket Renewer will not start
[root@hadoop102 ~]# kinit hue
Password for hue@HADOOP.COM:
[root@hadoop102 ~]# kadmin.local
Authenticating as principal hue/admin@HADOOP.COM with password.
kadmin.local: modprinc -maxrenewlife 90day krbtgt/HADOOP.COM@HADOOP.COM
Principal "krbtgt/HADOOP.COM@HADOOP.COM" modified.
kadmin.local: modprinc -maxrenewlife 90day +allow_renewable krbtgt/HADOOP.COM@HADOOP.COM
Principal "krbtgt/HADOOP.COM@HADOOP.COM" modified.