Log Reception and Processing
2. How log data is submitted
GET request: https://www.jd.com/?cu=true&utm_source=baidu-pinzhuan&utm_medium=cpc&utm_campaign=t_288551095_baidupinzhuan&utm_term=0f3d30c8dba7459bb52f2eb5eba8ac7d_0_bd79f916377147b6aef8164d97d9abac
3. Where is the GET request issued?
Cross-domain request
The src attribute of an img tag performs the cross-domain access: the image is appended to the very end of the page, 1 pixel in size with a 0-width border, so the user never sees it.
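A minimal sketch of that technique, assuming a hypothetical collection endpoint (the real tracker appends many more fields, such as title, charset, and cookie-based ids):
<script>
// build an invisible 1x1 tracking image whose src carries the log data
var img = new Image(1, 1);
img.src = "http://logserver.example.com/log.gif?url=" + encodeURIComponent(location.href);
img.style.border = "0";
document.body.appendChild(img); // appended at the very end of the page
</script>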
4. System architecture design
Flume collects the data
With this collection scheme, a blank line appears between records:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 22222
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/jtlog/
a1.sinks.k1.hdfs.fileType = DataStream
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Fixing the blank-line problem:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 22222
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop01:8020/flux/reportTime=%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.serializer = text
# do not append a newline after each event (this removes the blank lines)
a1.sinks.k1.serializer.appendNewline = false
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the agent:
bin/flume-ng agent -c conf/ -f conf/jt2_avor_hdfs.conf -n a1 -Dflume.root.logger=INFO,console
Offline Data Analysis
Install DbVisualizer
1. Download DbVisualizer.
2. Install it after downloading.
3. Place Hive's JDBC jars under the jdbc directory of the installed DbVisualizer.
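A sketch of that step, assuming the default Hive 2.3.6 layout and a hypothetical DbVisualizer install path:
# the standalone JDBC driver ships in Hive's jdbc/ directory
cp /opt/servers/hive-2.3.6/jdbc/hive-jdbc-2.3.6-standalone.jar /opt/dbvis/jdbc/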
4. Add the following properties to hive-site.xml (in /opt/servers/hive-2.3.6/conf):
<!-- hiveserver2 settings -->
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>192.168.64.101</value>
</property>
5. Add properties to Hadoop's core-site.xml
Stop all Hadoop services first:
stop-all.sh
cd /opt/servers/hadoop-2.7.7/etc/hadoop/
vim core-site.xml
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
Distribute the file to the hadoop02 and hadoop03 nodes:
scp core-site.xml hadoop02:$PWD
scp core-site.xml hadoop03:$PWD
Start all services again (start-all.sh).
6. Start the hiveserver2 service
cd /opt/servers/hive-2.3.6/
bin/hiveserver2
# runs in the foreground
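To keep hiveserver2 running after the terminal closes, a common alternative (an assumption, not part of the original steps) is to start it in the background:
nohup bin/hiveserver2 > hiveserver2.log 2>&1 &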
7. Test the connection
Open another terminal window:
cd /opt/servers/hive-2.3.6/
bin/beeline
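Inside beeline, connect using the host and port configured in hive-site.xml above; with the proxyuser settings from step 5, logging in as root (the empty password is an assumption) should work:
!connect jdbc:hive2://192.168.64.101:10000
Then run a statement such as show databases; to confirm the connection.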
8. Create the Hive connection in DbVisualizer
Add the Hive driver via Tools – Driver Manager.
Create an external table over the Flume output directory:
create external table flux (url string,urlname string,title string,chset string,src string,col string,lg string, je string,ec string,fv string,cn string,ref string,uagent string,stat_uv string,stat_ss string,cip string) partitioned by (reportTime string) row format delimited fields terminated by '|' location '/flux';
Add partition information:
alter table flux add partition (reportTime='2022-03-02') location '/flux/reportTime=2022-03-02';
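A quick check that the partition was registered (not in the original notes):
show partitions flux;
select * from flux limit 10;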
Data Cleaning (removing noise)
Detail wide table: the raw table is split into finer-grained tables according to business requirements.
Fields needed:
reportTime: report date
url: visited URL
urlname: page name
uvid: visitor id
ssid: session id
sscount: session sequence number
sstime: session timestamp
cip: visitor IP
Create the cleaned-data table:
create table dataclear(reportTime string,url string,urlname string,uvid string,ssid string,sscount string,sstime string,cip string) row format delimited fields terminated by '|';
Note: when loading data from one Hive table into another, the two tables must be created with the same structure, including the field delimiter; otherwise the data may end up garbled.
Clean and load the data:
insert overwrite table dataclear
select
reportTime,url,urlname,stat_uv,split(stat_ss,"_")[0],
split(stat_ss,"_")[1],split(stat_ss,"_")[2],cip
from flux where url!='';
split(column, delimiter)[index]: splits the column on the delimiter and takes the element at that index.
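For example, with a made-up stat_ss value in the ssid_sscount_sstime layout implied by the query above:
select split('a1b2c3_2_1646182400000','_')[1]; -- returns '2' (the sscount)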
select * from dataclear;
Data Processing
PV: page view count
select count(*) as pv
from dataclear
where reportTime='2022-03-02';
UV: unique visitor count
select count(distinct uvid) as uv
from dataclear
where reportTime='2022-03-02';
SV: unique session count
select count(distinct ssid) as sv from dataclear where reportTime='2022-03-02';
A session: the browser stores the session id in a cookie, so different cookies represent different sessions. Here we used two browsers and cleared cookies twice to simulate different sessions.
BR: bounce rate
The bounce rate is the number of sessions that viewed only one page divided by the total number of sessions.
To control the precision of the result, we apply the round function and keep four decimal places (rounded).
Settings:
## switch to non-strict mode (strict mode blocks the cartesian join used below)
set hive.mapred.mode=nonstrict;
select br_taba.a/br_tabb.b as br from
(
select count(*) as a from
(
select ssid from dataclear
where reportTime='2022-03-02'
group by ssid having count(ssid)=1
) as br_tab
) as br_taba,
(
select count(distinct ssid) as b from dataclear
where reportTime='2022-03-02'
) as br_tabb;
For example, 2 bounced sessions out of 5 total sessions gives BR = 2/5 = 0.4. The same query with rounding applied:
select round(br_once/vv,4) as br
from
(select
count(1) br_once
from
(select
ssid
from dataclear
where reportTime='2022-03-02'
group by ssid having count(ssid)=1) as onceTime
)as br_onceTime,
(select
count(distinct ssid) as vv
from dataclear
where reportTime='2022-03-02') as br_total;
NewIP: new IP count
The new IP count is the number of IPs among today's visitors that have never visited before.
For example: our system went live yesterday, and yesterday's visitors were Zhang Fei, Guan Yu, Zhao Yun, and Lü Bu.
Today's visitors are Zhang Fei, Guan Yu, Diao Chan, Sun Shangxiang, and Lü Bu. The new visitors are Diao Chan and Sun Shangxiang, so the new IP count is 2.
Option 1: not in (relatively slow to execute):
--today
select
distinct cip
from dataclear
where reportTime='2022-03-02' and cip not in(
--history ip
select
distinct cip
from dataclear
where reportTime<'2022-03-02');
Option 2: left join:
--leftjoin new history
select today_ip.cip
from
(select
distinct cip
from dataclear
where reportTime='2022-03-02')as today_ip
left join
(
select
distinct cip
from dataclear
where reportTime<'2022-03-02'
)as history_ip
on today_ip.cip=history_ip.cip
where history_ip.cip is null;
NewCust: new visitor count
Same idea as NewIP, only the key becomes uvid:
--new uvid
select
count(today_uvid.uvid) as newuvid
from
(select
distinct uvid
from dataclear
where reportTime='2022-03-02') as today_uvid
left join
(
select
distinct uvid
from dataclear
where reportTime<'2022-03-02'
)as history_uvid
on today_uvid.uvid=history_uvid.uvid
where history_uvid.uvid is null;
AvgTime: average visit duration
The average visit duration is the mean of all sessions' durations.
select
round(avg(sessionTime),4) as avgtime
from
(select
max(sstime)-min(sstime) as sessionTime
from dataclear
where reportTime='2022-03-02'
group by ssid
) as t_sessionTime;
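If sstime is a millisecond timestamp (typical for JavaScript trackers, but an assumption here), dividing by 1000 reports the average in seconds:
select round(avg(sessionTime)/1000,4) as avgtime_s
from (select max(sstime)-min(sstime) as sessionTime
from dataclear
where reportTime='2022-03-02'
group by ssid) as t_sessionTime;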
AvgDeep: average visit depth
Visit depth is the number of distinct pages viewed within one session.
--avgdeep
select
round(avg(deep),4) as avgDeep
from
(
select
count(distinct urlname) as deep -- distinct pages per session, matching the statistics query below
from dataclear
where reportTime='2022-03-02'
group by ssid
)as t_deep;
Analysis results table
Create the business table and insert the computed metrics:
create table statistics
(reportTime string,pv int,uv int,vv int,
br double,newip int, newcust int, avgtime double,avgdeep double)
row format delimited fields terminated by '|';
insert overwrite table statistics select '2022-03-02',tab1.pv,tab2.uv,tab3.vv,tab4.br,tab5.newip,tab6.newcust,tab7.avgtime,tab8.avgdeep from
(select count(*) as pv from dataclear where reportTime = '2022-03-02') as tab1,
(select count(distinct uvid) as uv from dataclear where reportTime = '2022-03-02') as tab2,
(select count(distinct ssid) as vv from dataclear where reportTime = '2022-03-02') as tab3,
(select round(br_taba.a/br_tabb.b,4)as br from (select count(*) as a from (select ssid from dataclear where reportTime='2022-03-02' group by ssid
having count(ssid) = 1) as br_tab) as br_taba,
(select count(distinct ssid) as b from dataclear where reportTime='2022-03-02') as br_tabb) as tab4,
(select count(distinct dataclear.cip) as newip from dataclear where dataclear.reportTime = '2022-03-02' and cip not in (select dc2.cip from dataclear
as dc2 where dc2.reportTime < '2022-03-02')) as tab5,
(select count(distinct dataclear.uvid) as newcust from dataclear where dataclear.reportTime='2022-03-02' and uvid not in (select dc2.uvid from
dataclear as dc2 where dc2.reportTime < '2022-03-02')) as tab6,
(select round(avg(atTab.usetime),4) as avgtime from (select max(sstime) - min(sstime) as usetime from dataclear where reportTime='2022-03-02'
group by ssid) as atTab) as tab7,
(select round(avg(deep),4) as avgdeep from (select count(distinct urlname) as deep from dataclear where reportTime='2022-03-02' group by ssid) as
adTab) as tab8;
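A quick sanity check on the inserted row (not in the original notes):
select * from statistics where reportTime='2022-03-02';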
Exporting the results to MySQL with Sqoop
https://archive.apache.org/dist/sqoop/
Download Sqoop from the link above; the 1.4.x line is relatively stable.
Concept
Sqoop is the bridge between HDFS and relational databases: it can export data from HDFS
to a relational database, and import data from a relational database into HDFS.
Download and install
Download
A tool provided by Apache.
Install
Requires JDK and Hadoop support, and there are version compatibility requirements.
Upload and unpack
Upload the tarball to Linux and unpack it.
Sqoop locates the JDK through JAVA_HOME and Hadoop through HADOOP_HOME, so it works without any further configuration.
Add the JDBC driver
Copy the JDBC driver jar of the database you want to connect to into Sqoop's lib directory.
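A sketch of that step for MySQL (both the connector version and the Sqoop install path are assumptions):
cp mysql-connector-java-5.1.38.jar /opt/servers/sqoop-1.4.7/lib/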
Import and export
For this project only the export is needed.
Import
From a relational database into HDFS:
bin/sqoop import --connect jdbc:mysql://192.168.64.101:3306/jtlog --username root --password root --table jtdata -m 1 --target-dir '/sqoop/jtlog' --fields-terminated-by '|';
Export
From HDFS into a relational database:
bin/sqoop export --connect jdbc:mysql://hadoop01:3306/jtlog --username root --password root --export-dir '/user/hive/warehouse/jtlogdb.db/statistics' --table jtdata -m 1 --fields-terminated-by '|';
--fields-terminated-by '|': specifies '|' as the delimiter between fields, matching the Hive table's delimiter.
Note that sqoop export requires the target table to already exist in MySQL.
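A sketch of a matching MySQL DDL for jtdata (the column list mirrors the statistics table; the types are assumptions):
create table jtdata (
reportTime varchar(20),
pv int,
uv int,
vv int,
br double,
newip int,
newcust int,
avgtime double,
avgdeep double
);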
Echarts
echarts.apache.org/zh/index.html