Day15[20200727]

一、Hive Practice

1. Partitioned Tables

Suppose we have a data set like this:

6,APPLE,北京,35

2,APPLE,上海,47

Data masking: sensitive information in the original real data is blurred out, or parts of the data are removed outright.

id -> was a UUID -> replaced with an auto-increment id

phone model -> was the exact brand and model -> only the phone brand is kept

city -> was country/county/city/district numeric codes -> only the city is kept

age -> was the national ID number -> after masking, converted to an age

Further masking: delete some records, drop some fields, and so on...

Data volume: 550,000 × 2 -> more than one million rows

We run the following statement in Hive:

select count(*) from table where city='上海';

Hive starts matching from the first row and keeps going until the last one, doing roughly a million comparisons.

Is there a better solution? Partitioned tables (a very "dumb", brute-force-looking way to solve this problem).

The original structure in the Hive warehouse:

/user/hive/warehouse/
|-db_lianjia.db/
  |-tb_lj/
    |-2nd.csv

Can a directory structure like this still meet the business requirements once the data exceeds a million rows, or can it be optimized?

appleinbeijing.txt 550,000+ rows

appleinshanghai.txt 550,000+ rows

/user/hive/warehouse/db_phone.db/tb_phone_info/appleinbeijing.txt

/user/hive/warehouse/db_phone.db/tb_phone_info/appleinshanghai.txt

When executing where city='上海', clearly not a single row in appleinbeijing.txt matches the condition.

/user/hive/warehouse/
|-db_phone/
  |-tb_phone_info/
    |-dir_city=shanghai/
      appleinshanghai.txt
    |-dir_city=beijing/
      appleinbeijing.txt

When the query has where dir_city='shanghai', Hive only looks for data inside the dir_city=shanghai/ directory.

Apple phone sales records in Beijing

Apple phone sales records in Shanghai

Huawei phone sales records in Guangzhou

|-phone=apple
  |-city=shanghai
    |-appleinshanghai1.txt
    |-appleinshanghai2.txt
    |-appleinshanghai3.txt
    |-...
  |-city=beijing
  |-city=guangzhou
|-phone=huawei
|-phone=xiaomi
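A layout like this corresponds to a table with two partition columns, brand first and city second. A minimal sketch (the table name here is made up for illustration; the hands-on session below uses the citystr/phonestr order instead):

create table tb_phone_sales(
id string,
phone string,
city string,
age int
)
partitioned by (phonestr string, citystr string)
row format delimited fields terminated by ',';

-- files loaded into this partition land under .../phonestr=apple/citystr=shanghai/
load data local inpath '/opt/datas/appleinshanghai.txt'
into table tb_phone_sales partition(phonestr='apple', citystr='shanghai');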

Practice brings true knowledge; let's verify this hands-on:

hive (default)> show databases;
OK
database_name
db_hive
db_lianjia
default
empmanager
Time taken: 6.617 seconds, Fetched: 4 row(s)
hive (default)> create database db_phone;
OK
Time taken: 0.233 seconds
hive (default)> use db_phone;
OK
Time taken: 0.072 seconds
hive (db_phone)> create table db_phone.tb_phone_info(
               > id string,
               > phone string,
               > city string,
               > age int
               > )
               > partitioned by (citystr string,phonestr string)
               > row format delimited fields terminated by ',';
OK
Time taken: 0.215 seconds
hive (db_phone)> load data local inpath '/opt/datas/appleinbeijing.txt' overwrite into table db_phone.tb_phone_info partition(citystr='beijing',phonestr='apple');
Loading data to table db_phone.tb_phone_info partition (citystr=beijing, phonestr=apple)
Partition db_phone.tb_phone_info{citystr=beijing, phonestr=apple} stats: [numFiles=1, numRows=0, totalSize=13767756, rawDataSize=0]
OK
Time taken: 2.096 seconds
hive (db_phone)> load data local inpath '/opt/datas/appleinshanghai.txt' overwrite into table db_phone.tb_phone_info partition(citystr='shanghai',phonestr='apple');
Loading data to table db_phone.tb_phone_info partition (citystr=shanghai, phonestr=apple)
Partition db_phone.tb_phone_info{citystr=shanghai, phonestr=apple} stats: [numFiles=1, numRows=0, totalSize=13832187, rawDataSize=0]
OK
Time taken: 1.258 seconds
hive (db_phone)> select * from db_phone.tb_phone_info limit 5;
OK
tb_phone_info.id	tb_phone_info.phone	tb_phone_info.city	tb_phone_info.age	tb_phone_info.citystr	tb_phone_info.phonestr
6	APPLE	北京	35	beijing	apple
23	APPLE	北京	39	beijing	apple
28	APPLE	北京	26	beijing	apple
30	APPLE	北京	25	beijing	apple
42	APPLE	北京	42	beijing	apple
Time taken: 0.36 seconds, Fetched: 5 row(s)
hive (db_phone)> set hive.mapred.mode;
hive.mapred.mode=nonstrict
hive (db_phone)> set hive.mapred.mode=strict;
hive (db_phone)> select * from db_phone.tb_phone_info limit 5;
FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "tb_phone_info" Table "tb_phone_info"
hive (db_phone)> select count(*) from db_phone.tb_phone_info where citystr='beijing' and phonestr='apple';
Query ID = hadoop_20200727084444_708191e1-9c37-45a4-b9d2-223772ec82fa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1595808781664_0001, Tracking URL = http://hadoop212:8088/proxy/application_1595808781664_0001/
Kill Command = /opt/modules/hadoop-2.6.0-cdh5.7.6/bin/hadoop job  -kill job_1595808781664_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-27 08:44:21,093 Stage-1 map = 0%,  reduce = 0%
2020-07-27 08:44:30,635 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.95 sec
2020-07-27 08:44:39,215 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.34 sec
MapReduce Total cumulative CPU time: 7 seconds 340 msec
Ended Job = job_1595808781664_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.34 sec   HDFS Read: 13774913 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 340 msec
OK
_c0
553154
Time taken: 32.698 seconds, Fetched: 1 row(s)
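To double-check which partitions were created and how they map to HDFS directories, standard commands like these can be used (output omitted):

hive (db_phone)> show partitions db_phone.tb_phone_info;
hive (db_phone)> dfs -ls /user/hive/warehouse/db_phone.db/tb_phone_info/;
hive (db_phone)> dfs -ls /user/hive/warehouse/db_phone.db/tb_phone_info/citystr=shanghai/;

In strict mode, a select * also needs a partition predicate to pass, for example:

hive (db_phone)> select * from db_phone.tb_phone_info where citystr='shanghai' limit 5;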

2. Exercise (Hive and Sqoop)


(1) Exercise requirements

Core table-creation code

create  table tb_yhd_0000(
id string, url string, referer string, keyword string, type string, guid string, pageId string, moduleId string, linkId string, attachedInfo string, sessionId string, trackerU string, trackerType string, ip string, trackerSrc string, cookie string, orderCode string, trackTime string, endUserId string, firstLink string, sessionViewNo string, productId string, curMerchantId string, provinceId string, cityId string, fee string, edmActivity string, edmEmail string, edmJobId string, ieVersion string, platform string, internalKeyword string, resultSum string, currentPage string, linkPosition string, buttonPosition string
)
partitioned by (daystr string,hourstr string)
row format delimited fields terminated by '\t';
load data ...'../..2818' ..(daystr='20150828' ,hourstr='18')
load data ...'../..2819' ..(daystr='20150828' ,hourstr='19')

Core query code

pv (count() skips NULLs, so the CASE expression keeps empty urls out of the count)
select count(case when length(url)=0 then null else url end) as pv ...
uv
select count(distinct guid) as uv from ...

Query results:

daystr    hourstr pv    uv
20150828  18      8876  2135
20150828  19      8421  3126

Finally, use Sqoop to save the results to a MySQL database.

yhd.sql -> the Hive statements to execute

yhd.sqoop -> the Sqoop options to execute

yhd.sh -> calls bin/hive -f /opt/datas/yhd.sql and bin/sqoop --options-file /opt/datas/yhd.sqoop

(2) Completing the exercise

  • Create the database and table in MySQL (a sketch of the likely DDL is shown right below)
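The screenshot of the MySQL-side DDL is not available; this is a sketch of what it presumably contained, matching the columns of the tb_yhd_result table exported below (column types are assumptions):

create database if not exists db_yhd;
use db_yhd;
create table if not exists tb_yhd_result(
  daystr varchar(20),
  hourstr varchar(20),
  pv bigint,
  uv bigint
);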


  • yhd.sql
create database if not exists db_yhd;
use db_yhd;

drop table if exists tb_yhd;
create table tb_yhd(
id string, url string, referer string, keyword string, type string, guid string, pageId string, moduleId string, linkId string, attachedInfo string, sessionId string, trackerU string, trackerType string, ip string, trackerSrc string, cookie string, orderCode string, trackTime string, endUserId string, firstLink string, sessionViewNo string, productId string, curMerchantId string, provinceId string, cityId string, fee string, edmActivity string, edmEmail string, edmJobId string, ieVersion string, platform string, internalKeyword string, resultSum string, currentPage string, linkPosition string, buttonPosition string)
partitioned by (daystr string,hourstr string)
row format delimited fields terminated by '\t';

load data local inpath '/opt/datas/2015082818' into table tb_yhd partition (daystr='20150828',hourstr='18');
load data local inpath '/opt/datas/2015082819' into table tb_yhd partition (daystr='20150828',hourstr='19');

set hive.exec.mode.local.auto=true;

select count(case when length(url)=0 then null else url end) as pv from tb_yhd group by daystr,hourstr;
select count(distinct guid) as uv from tb_yhd group by daystr,hourstr;
drop table if exists tb_yhd_result;

create table tb_yhd_result(
daystr string,
hourstr string,
pv bigint,
uv bigint
)
row format delimited fields terminated by '\t';

insert overwrite table tb_yhd_result 
select daystr,hourstr , count(case when length(url)=0 then null else url end) as pv , count(distinct guid) as uv from tb_yhd group by daystr,hourstr;
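Before running the export, a quick sanity check of the result table in Hive can help (not part of the original script):

select * from tb_yhd_result;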
  • yhd.sqoop
export
--connect
jdbc:mysql://hadoop212:3306/db_yhd
--username
root
--password
Mysql_1234
--table
tb_yhd_result
--export-dir
'/user/hive/warehouse/db_yhd.db/tb_yhd_result'
--input-fields-terminated-by
'\t'
-m
1
  • yhd.sh
#! /bin/bash

# HIVE_HOME & SQOOP_HOME
HIVE_HOME=/opt/modules/hive-1.1.0-cdh5.7.6
SQOOP_HOME=/opt/modules/sqoop-1.4.6-cdh5.7.6

# run!
$HIVE_HOME/bin/hive -f /opt/datas/yhd.sql
$SQOOP_HOME/bin/sqoop --options-file /opt/datas/yhd.sqoop
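With all three files under /opt/datas, something like the following should run the whole pipeline end to end (a usage sketch, not from the original notes):

chmod u+x /opt/datas/yhd.sh
/opt/datas/yhd.sh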


二、Flume Log Collection System

1. Introduction to Flume


By now we have seen that log file data is also very important!

Log files come in many kinds: system runtime logs (hive.log), user behavior logs (e.g. the file 2018091718), and so on; all of these are referred to as log data.

Every day an enterprise's operations produce a large number of log files across its frameworks, clusters, and business points, in every dimension. For the important logs, we need a tool that collects the log files produced in different places during production and operations and stores them, in a unified way, in a designated location.

Flume is exactly such a log collection framework, written in Java.


2. Getting Flume Up and Running

  • Unpack the archive
[hadoop@hadoop212 tools]$ tar -zxf flume-ng-1.6.0-cdh5.7.6.tar.gz -C /opt/modules/
[hadoop@hadoop212 modules]$ mv apache-flume-1.6.0-cdh5.7.6-bin/ flume-1.6.0-cdh5.7.6-bin/

  • Configuration files
[hadoop@hadoop212 conf]$ mv flume-env.sh.template flume-env.sh
[hadoop@hadoop212 conf]$ vim flume-env.sh 


Copy the HDFS configuration files into the conf directory; Flume will read them automatically.

[hadoop@hadoop212 conf]$ cp /opt/modules/hadoop-2.6.0-cdh5.7.6/etc/hadoop/core-site.xml /opt/modules/hadoop-2.6.0-cdh5.7.6/etc/hadoop/hdfs-site.xml ./

To write to HDFS, Flume needs the HDFS-related jar packages. The simple approach is to copy over every jar under the Hadoop installation; here I picked out only the jars that are actually needed, so copying just those is enough.


If you need to look up any of the jars, they are all available in the Maven repository.
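The screenshot listing the selected jars is not available. As a rough sketch, the HDFS client jars can be copied from the Hadoop installation into Flume's lib/ directory; the exact set depends on the Hadoop build, and the jar names below are an assumption:

# copy the HDFS-related client jars into Flume's lib/ (names are assumptions; use the versions shipped with your Hadoop)
cd /opt/modules/hadoop-2.6.0-cdh5.7.6/share/hadoop
cp common/hadoop-common-*.jar \
   common/lib/hadoop-auth-*.jar \
   common/lib/commons-configuration-*.jar \
   common/lib/htrace-core*.jar \
   hdfs/hadoop-hdfs-*.jar \
   /opt/modules/flume-1.6.0-cdh5.7.6-bin/lib/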

  • The first Flume program
[hadoop@hadoop212 conf]$ vim a-hive-men-log.properties

Contents (be careful not to type any Chinese characters into the file):

# Declare the 3 components (give each of the 3 components a name)
# The agent is named a1; it has 3 components: a source named s1, a channel named c1, and a sink named k1
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Now define these 3 components
# exec means: run a command
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /opt/modules/hive-1.1.0-cdh5.7.6/logs/hive.log
# Use memory as the staging area for the data; capacity is how many events it can hold,
# transactionCapacity is how many events can be taken from it per transaction
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Print the output directly to the console
a1.sinks.k1.type = logger

# Wire the 3 components together
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

A .properties file is a file of key = value pairs. Mind the spaces around the equals sign.


Run the agent:

[hadoop@hadoop212 flume-1.6.0-cdh5.7.6-bin]$ bin/flume-ng agent --conf /opt/modules/flume-1.6.0-cdh5.7.6-bin/conf/ --conf-file ./conf/a-hive-men-log.properties --name a1 -Dflume.root.logger=INFO,console
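To check that the agent is working, generate some activity in hive.log from another terminal, for example by running any Hive statement, or simply by appending a line by hand (the path matches the tail -f command in the source definition above):

echo "flume test line" >> /opt/modules/hive-1.1.0-cdh5.7.6/logs/hive.log

Each appended line should then show up on the Flume console as an Event printed by the logger sink.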

3. Second Case

  • b-hive-file-log.properties
a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /opt/modules/hive-1.1.0-cdh5.7.6/logs/hive.log

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/datas/checkpoint

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Run it:

[hadoop@hadoop212 flume-1.6.0-cdh5.7.6-bin]$ bin/flume-ng agent --conf /opt/modules/flume-1.6.0-cdh5.7.6-bin/conf/ --conf-file case/b-hive-file-log.properties --name a1 -Dflume.root.logger=INFO,console

The point of this case is to learn to look up, on the official website, which settings each component needs.

  • What the checkpoint is for
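The screenshot is not available, but in short: the file channel writes events to data files on disk and periodically snapshots its queue state into checkpointDir, so after a restart (or crash) it can recover the channel contents quickly instead of losing them as a memory channel would. A slightly fuller sketch of the channel section (dataDirs is optional and defaults under ~/.flume; the path below is an assumption):

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/datas/checkpoint
# where the event data files themselves are stored
a1.channels.c1.dataDirs = /opt/datas/flume-data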


4. Third Case

  • c-spool-mem-hdfs.properties
a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/datas/flume/spool

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = hdfs
# the directory must already exist; no MR job is launched
a1.sinks.k1.hdfs.path = /flume/spool


a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1 

Run it:

A friendly reminder: write the configuration carefully, because when it fails to run, no exception message is thrown.

[hadoop@hadoop212 flume-1.6.0-cdh5.7.6-bin]$ bin/flume-ng agent --conf /opt/modules/flume-1.6.0-cdh5.7.6-bin/conf/ --conf-file case/c-spool-mem-hdfs.properties --name a1 -Dflume.root.logger=INFO,console


Flume renames data files that have already been collected by appending the suffix .COMPLETED.


Correspondingly, while the Flume process is still running, the file on the cluster carries a .tmp suffix.


After the process stops, the .tmp suffix is removed from the file on the cluster automatically.
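The .tmp suffix marks the file the HDFS sink is still writing to; it is renamed once the file is rolled or the agent stops. The roll behaviour can be tuned with the sink's roll settings; a sketch extending the config above (the values are illustrative, not what was used in class):

# roll to a new file every 60 s, or at 128 MB, or after 1000 events (0 disables a criterion)
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 1000
# write plain text instead of the default SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream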


5. Extended Case


6. Examples from the Official Website


Additional Homework

Each student: write down your learning reflections and an evaluation of the teacher on paper, include your class and name, then take a photo and send it to me privately. Thank you.

Try to finish it today, and no later than the 30th. Write the reflections and the evaluation together on one sheet; there is no word limit, though personally I think 50 characters is enough.

Please take it somewhat seriously; it will ultimately be handed in to the school.
