Day15[20200727]

一、Hive Practice

1. Partitioned Tables

Suppose we have a data set like this:

6,APPLE,北京,35

2,APPLE,上海,47

Data masking: sensitive information in the original real data is blurred out, or parts of the data are removed outright.

id -> was a UUID -> replaced with an auto-increment id

phone model -> was the exact brand and model -> only the phone brand is kept

city -> was country/county/city/district numeric codes -> only the city is kept

age -> was the national ID number -> after masking, converted to an age

Further masking: delete some records, drop some fields, and so on...

Data volume: 550,000 × 2 -> more than one million rows

We run the following statement in Hive:

select count(*) from table where city='上海';

Hive starts matching from the first row and keeps going until the last one, doing roughly a million comparisons.

Is there a better solution? Partitioned tables (a very "dumb", brute-force-looking way to solve this problem).

The original structure in the Hive warehouse:

/user/hive/warehouse/
|-db_lianjia.db/
  |-tb_lj/
    |-2nd.csv

Can a directory structure like this still meet the business requirements once the data exceeds a million rows, or can it be optimized?

appleinbeijing.txt 550,000+ rows

appleinshanghai.txt 550,000+ rows

/user/hive/warehouse/db_phone.db/tb_phone_info/appleinbeijing.txt

/user/hive/warehouse/db_phone.db/tb_phone_info/appleinshanghai.txt

When executing where city='上海', clearly not a single row in appleinbeijing.txt matches the condition.

/user/hive/warehouse/
|-db_phone/
  |-tb_phone_info/
    |-dir_city=shanghai/
      appleinshanghai.txt
    |-dir_city=beijing/
      appleinbeijing.txt

When the query has where dir_city='shanghai', Hive only looks for data inside the dir_city=shanghai/ directory.

Apple phone sales records in Beijing

Apple phone sales records in Shanghai

Huawei phone sales records in Guangzhou

|-phone=apple
  |-city=shanghai
    |-appleinshanghai1.txt
    |-appleinshanghai2.txt
    |-appleinshanghai3.txt
    |-...
  |-city=beijing
  |-city=guangzhou
|-phone=huawei
|-phone=xiaomi
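A layout like this corresponds to a table with two partition columns, brand first and city second. A minimal sketch (the table name here is made up for illustration; the hands-on session below uses the citystr/phonestr order instead):

create table tb_phone_sales(
id string,
phone string,
city string,
age int
)
partitioned by (phonestr string, citystr string)
row format delimited fields terminated by ',';

-- files loaded into this partition land under .../phonestr=apple/citystr=shanghai/
load data local inpath '/opt/datas/appleinshanghai.txt'
into table tb_phone_sales partition(phonestr='apple', citystr='shanghai');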

Practice brings true knowledge; let's verify this hands-on:

hive (default)> show databases;
OK
database_name
db_hive
db_lianjia
default
empmanager
Time taken: 6.617 seconds, Fetched: 4 row(s)
hive (default)> create database db_phone;
OK
Time taken: 0.233 seconds
hive (default)> use db_phone;
OK
Time taken: 0.072 seconds
hive (db_phone)> create table db_phone.tb_phone_info(
               > id string,
               > phone string,
               > city string,
               > age int
               > )
               > partitioned by (citystr string,phonestr string)
               > row format delimited fields terminated by ',';
OK
Time taken: 0.215 seconds
hive (db_phone)> load data local inpath '/opt/datas/appleinbeijing.txt' overwrite into table db_phone.tb_phone_info partition(citystr='beijing',phonestr='apple');
Loading data to table db_phone.tb_phone_info partition (citystr=beijing, phonestr=apple)
Partition db_phone.tb_phone_info{citystr=beijing, phonestr=apple} stats: [numFiles=1, numRows=0, totalSize=13767756, rawDataSize=0]
OK
Time taken: 2.096 seconds
hive (db_phone)> load data local inpath '/opt/datas/appleinshanghai.txt' overwrite into table db_phone.tb_phone_info partition(citystr='shanghai',phonestr='apple');
Loading data to table db_phone.tb_phone_info partition (citystr=shanghai, phonestr=apple)
Partition db_phone.tb_phone_info{citystr=shanghai, phonestr=apple} stats: [numFiles=1, numRows=0, totalSize=13832187, rawDataSize=0]
OK
Time taken: 1.258 seconds
hive (db_phone)> select * from db_phone.tb_phone_info limit 5;
OK
tb_phone_info.id	tb_phone_info.phone	tb_phone_info.city	tb_phone_info.age	tb_phone_info.citystr	tb_phone_info.phonestr
6	APPLE	北京	35	beijing	apple
23	APPLE	北京	39	beijing	apple
28	APPLE	北京	26	beijing	apple
30	APPLE	北京	25	beijing	apple
42	APPLE	北京	42	beijing	apple
Time taken: 0.36 seconds, Fetched: 5 row(s)
hive (db_phone)> set hive.mapred.mode;
hive.mapred.mode=nonstrict
hive (db_phone)> set hive.mapred.mode=strict;
hive (db_phone)> select * from db_phone.tb_phone_info limit 5;
FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "tb_phone_info" Table "tb_phone_info"
hive (db_phone)> select count(*) from db_phone.tb_phone_info where citystr='beijing' and phonestr='apple';
Query ID = hadoop_20200727084444_708191e1-9c37-45a4-b9d2-223772ec82fa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1595808781664_0001, Tracking URL = http://hadoop212:8088/proxy/application_1595808781664_0001/
Kill Command = /opt/modules/hadoop-2.6.0-cdh5.7.6/bin/hadoop job  -kill job_1595808781664_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-27 08:44:21,093 Stage-1 map = 0%,  reduce = 0%
2020-07-27 08:44:30,635 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.95 sec
2020-07-27 08:44:39,215 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.34 sec
MapReduce Total cumulative CPU time: 7 seconds 340 msec
Ended Job = job_1595808781664_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.34 sec   HDFS Read: 13774913 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 340 msec
OK
_c0
553154
Time taken: 32.698 seconds, Fetched: 1 row(s)
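To double-check which partitions were created and how they map to HDFS directories, standard commands like these can be used (output omitted):

hive (db_phone)> show partitions db_phone.tb_phone_info;
hive (db_phone)> dfs -ls /user/hive/warehouse/db_phone.db/tb_phone_info/;
hive (db_phone)> dfs -ls /user/hive/warehouse/db_phone.db/tb_phone_info/citystr=shanghai/;

In strict mode, a select * also needs a partition predicate to pass, for example:

hive (db_phone)> select * from db_phone.tb_phone_info where citystr='shanghai' limit 5;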

2. Exercise (Hive and Sqoop)


(1) Exercise requirements

Core table-creation code

create  table tb_yhd_0000(
id string, url string, referer string, keyword string, type string, guid string, pageId string, moduleId string, linkId string, attachedInfo string, sessionId string, trackerU string, trackerType string, ip string, trackerSrc string, cookie string, orderCode string, trackTime string, endUserId string, firstLink string, sessionViewNo string, productId string, curMerchantId string, provinceId string, cityId string, fee string, edmActivity string, edmEmail string, edmJobId string, ieVersion string, platform string, internalKeyword string, resultSum string, currentPage string, linkPosition string, buttonPosition string
)
partitioned by (daystr string,hourstr string)
row format delimited fields terminated by '\t';
load data ...'../..2818' ..(daystr='20150828' ,hourstr='18')
load data ...'../..2819' ..(daystr='20150828' ,hourstr='19')

Core query code

pv (count() skips NULLs, so the CASE expression keeps empty urls out of the count)
select count(case when length(url)=0 then null else url end) as pv ...
uv
select count(distinct guid) as uv from ...

Query results:

daystr    hourstr pv    uv
20150828  18      8876  2135
20150828  19      8421  3126

Finally, use Sqoop to save the results to a MySQL database.

yhd.sql -> the Hive statements to execute

yhd.sqoop -> the Sqoop options to execute

yhd.sh -> calls bin/hive -f /opt/datas/yhd.sql and bin/sqoop --options-file /opt/datas/yhd.sqoop

(2) Completing the exercise

  • Create the database and table in MySQL (a sketch of the likely DDL is shown right below)
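The screenshot of the MySQL-side DDL is not available; this is a sketch of what it presumably contained, matching the columns of the tb_yhd_result table exported below (column types are assumptions):

create database if not exists db_yhd;
use db_yhd;
create table if not exists tb_yhd_result(
  daystr varchar(20),
  hourstr varchar(20),
  pv bigint,
  uv bigint
);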


  • yhd.sql
create database if not exists db_yhd;
use db_yhd;

drop table if exists tb_yhd;
create table tb_yhd(
id string, url string, referer string, keyword string, type string, guid string, pageId string, moduleId string, linkId string, attachedInfo string, sessionId string, trackerU string, trackerType string, ip string, trackerSrc string, cookie string, orderCode string, trackTime string, endUserId string, firstLink string, sessionViewNo string, productId string, curMerchantId string, provinceId string, cityId string, fee string, edmActivity string, edmEmail string, edmJobId string, ieVersion string, platform string, internalKeyword string, resultSum string, currentPage string, linkPosition string, buttonPosition string)
partitioned by (daystr string,hourstr string)
row format delimited fields terminated by '\t';

load data local inpath '/opt/datas/2015082818' into table tb_yhd partition (daystr='20150828',hourstr='18');
load data local inpath '/opt/datas/2015082819' into table tb_yhd partition (daystr='20150828',hourstr='19');

set hive.exec.mode.local.auto=true;

select count(case when length(url)=0 then null else url end) as pv from tb_yhd group by daystr,hourstr;
select count(distinct guid) as uv from tb_yhd group by daystr,hourstr;
drop table if exists tb_yhd_result;

create table tb_yhd_result(
daystr string,
hourstr string,
pv bigint,
uv bigint
)
row format delimited fields terminated by '\t';

insert overwrite table tb_yhd_result 
select daystr,hourstr , count(case when length(url)=0 then null else url end) as pv , count(distinct guid) as uv from tb_yhd group by daystr,hourstr;
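Before running the export, a quick sanity check of the result table in Hive can help (not part of the original script):

select * from tb_yhd_result;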
  • yhd.sqoop
export
--connect
jdbc:mysql://hadoop212:3306/db_yhd
--username
root
--password
Mysql_1234
--table
tb_yhd_result
--export-dir
'/user/hive/warehouse/db_yhd.db/tb_yhd_result'
--input-fields-terminated-by
'\t'
-m
1
  • yhd.sh
#! /bin/bash

# HIVE_HOME & SQOOP_HOME
HIVE_HOME=/opt/modules/hive-1.1.0-cdh5.7.6
SQOOP_HOME=/opt/modules/sqoop-1.4.6-cdh5.7.6

# run!
$HIVE_HOME/bin/hive -f /opt/datas/yhd.sql
$SQOOP_HOME/bin/sqoop --options-file /opt/datas/yhd.sqoop
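With all three files under /opt/datas, something like the following should run the whole pipeline end to end (a usage sketch, not from the original notes):

chmod u+x /opt/datas/yhd.sh
/opt/datas/yhd.sh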


二、Flume Log Collection System

1. Introduction to Flume


By now we have seen that log file data is also very important!

Log files come in many kinds: system runtime logs (hive.log), user behavior logs (e.g. the file 2018091718), and so on; all of these are referred to as log data.

Every day an enterprise's operations produce a large number of log files across its frameworks, clusters, and business points, in every dimension. For the important logs, we need a tool that collects the log files produced in different places during production and operations and stores them, in a unified way, in a designated location.

Flume is exactly such a log collection framework, written in Java.


2. Getting Flume Up and Running

  • Unpack the archive
[hadoop@hadoop212 tools]$ tar -zxf flume-ng-1.6.0-cdh5.7.6.tar.gz -C /opt/modules/
[hadoop@hadoop212 modules]$ mv apache-flume-1.6.0-cdh5.7.6-bin/ flume-1.6.0-cdh5.7.6-bin/

  • Configuration files
[hadoop@hadoop212 conf]$ mv flume-env.sh.template flume-env.sh
[hadoop@hadoop212 conf]$ vim flume-env.sh 


Copy the HDFS configuration files into the conf directory; Flume will read them automatically.

[hadoop@hadoop212 conf]$ cp /opt/modules/hadoop-2.6.0-cdh5.7.6/etc/hadoop/core-site.xml /opt/modules/hadoop-2.6.0-cdh5.7.6/etc/hadoop/hdfs-site.xml ./

To write to HDFS, Flume needs the HDFS-related jar packages. The simple approach is to copy over every jar under the Hadoop installation; here I picked out only the jars that are actually needed, so copying just those is enough.


If you need to look up any of the jars, they are all available in the Maven repository.
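The screenshot listing the selected jars is not available. As a rough sketch, the HDFS client jars can be copied from the Hadoop installation into Flume's lib/ directory; the exact set depends on the Hadoop build, and the jar names below are an assumption:

# copy the HDFS-related client jars into Flume's lib/ (names are assumptions; use the versions shipped with your Hadoop)
cd /opt/modules/hadoop-2.6.0-cdh5.7.6/share/hadoop
cp common/hadoop-common-*.jar \
   common/lib/hadoop-auth-*.jar \
   common/lib/commons-configuration-*.jar \
   common/lib/htrace-core*.jar \
   hdfs/hadoop-hdfs-*.jar \
   /opt/modules/flume-1.6.0-cdh5.7.6-bin/lib/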

  • The first Flume program
[hadoop@hadoop212 conf]$ vim a-hive-men-log.properties

Contents (be careful not to type any Chinese characters into the file):

# Declare the 3 components (give each of the 3 components a name)
# The agent is named a1; it has 3 components: a source named s1, a channel named c1, and a sink named k1
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Now define these 3 components
# exec means: run a command
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /opt/modules/hive-1.1.0-cdh5.7.6/logs/hive.log
# Use memory as the staging area for the data; capacity is how many events it can hold,
# transactionCapacity is how many events can be taken from it per transaction
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Print the output directly to the console
a1.sinks.k1.type = logger

# Wire the 3 components together
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

A .properties file is a file of key = value pairs. Mind the spaces around the equals sign.


Run the agent:

[hadoop@hadoop212 flume-1.6.0-cdh5.7.6-bin]$ bin/flume-ng agent --conf /opt/modules/flume-1.6.0-cdh5.7.6-bin/conf/ --conf-file ./conf/a-hive-men-log.properties --name a1 -Dflume.root.logger=INFO,console
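To check that the agent is working, generate some activity in hive.log from another terminal, for example by running any Hive statement, or simply by appending a line by hand (the path matches the tail -f command in the source definition above):

echo "flume test line" >> /opt/modules/hive-1.1.0-cdh5.7.6/logs/hive.log

Each appended line should then show up on the Flume console as an Event printed by the logger sink.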

3. Second Case

  • b-hive-file-log.properties
a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /opt/modules/hive-1.1.0-cdh5.7.6/logs/hive.log

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/datas/checkpoint

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Run it:

[hadoop@hadoop212 flume-1.6.0-cdh5.7.6-bin]$ bin/flume-ng agent --conf /opt/modules/flume-1.6.0-cdh5.7.6-bin/conf/ --conf-file case/b-hive-file-log.properties --name a1 -Dflume.root.logger=INFO,console

The point of this case is to learn to look up, on the official website, which settings each component needs.

  • What the checkpoint is for
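The screenshot is not available, but in short: the file channel writes events to data files on disk and periodically snapshots its queue state into checkpointDir, so after a restart (or crash) it can recover the channel contents quickly instead of losing them as a memory channel would. A slightly fuller sketch of the channel section (dataDirs is optional and defaults under ~/.flume; the path below is an assumption):

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/datas/checkpoint
# where the event data files themselves are stored
a1.channels.c1.dataDirs = /opt/datas/flume-data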


4. Third Case

  • c-spool-mem-hdfs.properties
a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/datas/flume/spool

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = hdfs
# the directory must already exist; no MR job is launched
a1.sinks.k1.hdfs.path = /flume/spool


a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1 

Run it:

A friendly reminder: write the configuration carefully, because when it fails to run, no exception message is thrown.

[hadoop@hadoop212 flume-1.6.0-cdh5.7.6-bin]$ bin/flume-ng agent --conf /opt/modules/flume-1.6.0-cdh5.7.6-bin/conf/ --conf-file case/c-spool-mem-hdfs.properties --name a1 -Dflume.root.logger=INFO,console


Flume renames data files that have already been collected by appending the suffix .COMPLETED.


Correspondingly, while the Flume process is still running, the file on the cluster carries a .tmp suffix.


After the process stops, the .tmp suffix is removed from the file on the cluster automatically.
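The .tmp suffix marks the file the HDFS sink is still writing to; it is renamed once the file is rolled or the agent stops. The roll behaviour can be tuned with the sink's roll settings; a sketch extending the config above (the values are illustrative, not what was used in class):

# roll to a new file every 60 s, or at 128 MB, or after 1000 events (0 disables a criterion)
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 1000
# write plain text instead of the default SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream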


5. Extended Case


6. Examples from the Official Website


Additional Homework

Each student: write down your learning reflections and an evaluation of the teacher on paper, include your class and name, then take a photo and send it to me privately. Thank you.

Try to finish it today, and no later than the 30th. Write the reflections and the evaluation together on one sheet; there is no word limit, though personally I think 50 characters is enough.

Please take it somewhat seriously; it will ultimately be handed in to the school.
