Flume

谁说大象不能跳舞

Article link: https://blog.csdn.net/jiahonhyu0609/article/details/89114621

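Common analysis metrics:
1. Routine metric monitoring: user count, new users, UGC (for social products), sales, paid volume, and the various figures collected during promotion periods.
2. Channel/traffic analysis: analyze and monitor how well each acquisition channel performs.
3. Core user conversion rates: payment rate and purchase rate.
4. Usage-duration monitoring: user activity, product validation.
5. User churn: monitor the churn rate (at 1, 3, 7, and 30 days).
6. Active-user trends: follow what active users are doing.
7. User profiling: used in algorithmic modeling and in the product.
8. User lifecycle monitoring: modeling has to consider both C-side users and B-side merchants.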

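What data to collect:
1. User behavior while browsing target pages:
opening a page, clicking a button, purchasing an item, adding an item to the cart, and so on.
2. Additional content data:
for example, the order amount generated by placing an order.

This collection strategy covers the common analysis views: basic traffic analysis, source analysis, content analysis, and visitor attributes.

Customizable data collection
Google Analytics, Baidu Tongji, Sogou Analytics
Using a well-defined, extensible API, a small amount of JavaScript is enough to track and analyze custom events and custom metrics.
This JavaScript is the tracking point (event instrumentation) embedded in the page.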

PV count, UV count

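channel:
A passive store. It can be connected to any number of sources and sinks.

Flume reliability:
1: Flume guarantees reliability for each single hop: an event is removed from the channel only after it has been delivered.
2: If the network is interrupted or delivery fails for some other reason, the data is sent again on the next attempt.
3: Flume's reliability also shows in its ability to buffer data: when the destination is unreachable, data is held in the channel and sent once the destination is reachable again.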
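
The per-hop guarantee above can be sketched in a few lines of Python. This is only a toy model, not Flume's implementation: the class name MemoryChannelSketch and the method drain_once are made up for illustration, and a ConnectionError stands in for any delivery failure.

```python
from collections import deque


class MemoryChannelSketch:
    """Toy channel: an event stays queued until the sink confirms delivery."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def __len__(self):
        return len(self._queue)

    def drain_once(self, deliver):
        """Try to deliver the head event; remove it only on success."""
        if not self._queue:
            return False
        event = self._queue[0]      # peek, do not remove yet
        try:
            deliver(event)
        except ConnectionError:
            return False            # "rollback": the event stays in the channel
        self._queue.popleft()       # "commit": delivery succeeded
        return True
```

Because the event is only removed after deliver returns, a failed attempt leaves it in place for the next retry.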

cookie -> uuid(userid) uv
cookie+uuid

cookie
uuid
clickid
pageid
…
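After login, the uuid is backfilled onto the records that carried only a cookie.
That way, UV counting does not count the same visitor once by cookie and again by uuid.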
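
A minimal sketch of the backfill idea, assuming each event is a (cookie, uuid) pair where uuid is None before login. The function name count_uv and the sample values are hypothetical.

```python
def count_uv(events):
    """Count unique visitors from a list of (cookie, uuid_or_None) pairs."""
    # First pass: learn which cookie belongs to which logged-in uuid.
    cookie_to_uuid = {}
    for cookie, uuid in events:
        if uuid is not None:
            cookie_to_uuid[cookie] = uuid

    # Second pass: backfill the uuid onto anonymous events, then deduplicate.
    visitors = set()
    for cookie, uuid in events:
        resolved = uuid or cookie_to_uuid.get(cookie)
        if resolved:
            visitors.add(("uuid", resolved))
        else:
            visitors.add(("cookie", cookie))
    return len(visitors)


sample_events = [("c1", None), ("c1", "u1"), ("c2", None), ("c3", "u1")]
uv = count_uv(sample_events)
```

Here c1 and c3 both resolve to u1, and c2 never logs in, so the anonymous visit on c1 is not counted a second time.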

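How are crawlers identified?
1. For crawlers that fetch our pages, we give them a rule to follow (add an identifying field).
2. In offline/online statistics, look for users with a large number of page views but no clicks (set a threshold), and check the browsing frequency (also against a threshold).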
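
The second rule could look roughly like this. The function name and both threshold values are assumptions chosen for illustration; real thresholds would have to be tuned on actual traffic.

```python
def looks_like_crawler(page_views, clicks, minutes_active,
                       min_views=100, max_views_per_minute=60.0):
    """Flag a visitor using the two threshold rules described above."""
    # Rule A: many page views and not a single click.
    if page_views >= min_views and clicks == 0:
        return True
    # Rule B: browsing frequency above the allowed rate.
    rate = page_views / max(minutes_active, 1e-9)
    return rate > max_views_per_minute
```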

PV count, UV count
30 million PV; UV: 30 million / 16

100 million records, 1650M in total
100 million: 3.3 * 500M = 1650M

[app] host1 -> processing 1
[pc] host2 -> processing 2

IMEI uniquely identifies a device (and so indirectly identifies the user).

Hands-on part:
1. Start a Flume service in a terminal on master
In /usr/local/src/apache-flume-1.6.0-bin/conf/header_test.conf:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.bind = localhost
a1.sources.r1.port = 9000
#a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Then start it:

./bin/flume-ng agent --conf conf --conf-file ./conf/header_test.conf --name a1 -Dflume.root.logger=INFO,console

In another terminal on master, send data to the Flume agent at localhost:9000:

curl -X POST -d '[{"headers":{"timestampe":"1234567","host":"master"},"body":"badou flume"}]' localhost:9000

The data we sent shows up in the first master terminal. The event looks like this:
Event: { headers:{timestampe=1234567, host=master} body: 62 61 64 6F 75 20 66 6C 75 6D 65 badou flume }
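
The logger sink prints the body as hexadecimal bytes followed by the text. As a quick check, this Python sketch rebuilds the JSON payload from the curl command and decodes the hex dump from the log line back into text.

```python
import json

# The same payload the curl command posts to the HTTP source.
payload = json.dumps([
    {"headers": {"timestampe": "1234567", "host": "master"},
     "body": "badou flume"}
])

# The hex dump copied from the logger sink output.
hex_dump = "62 61 64 6F 75 20 66 6C 75 6D 65"
decoded = bytes.fromhex(hex_dump).decode("utf-8")
```

The decoded string equals the body field of the posted JSON, which confirms the event arrived unchanged.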

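File Channel: writes data to disk. It is slower, but the data is persisted.
Memory Channel: keeps data in memory; if the agent fails, the data is lost.
File Channel guarantees that data is not lost by using a WAL.
WAL: write-ahead logging.
The operation is written to the log first, and then the data is written. If the data write fails, the logged operation is replayed.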
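
The write-ahead idea can be illustrated with a small in-memory model. This is not how the File Channel is implemented; the class WalStore is hypothetical, a plain list stands in for the on-disk log, and the fail_before_apply flag only simulates a crash.

```python
class WalStore:
    """Toy key-value store that logs every operation before applying it."""

    def __init__(self):
        self.log = []    # stands in for the durable on-disk log
        self.data = {}   # the actual data, which may be lost or incomplete

    def put(self, key, value, fail_before_apply=False):
        self.log.append((key, value))   # 1. record the operation first
        if fail_before_apply:
            raise RuntimeError("simulated crash before the data write")
        self.data[key] = value          # 2. then write the data

    def recover(self):
        """Rebuild the data by replaying every logged operation in order."""
        self.data = {}
        for key, value in self.log:
            self.data[key] = value
```

Since the log entry is written before the data, replaying the log after a failure restores writes that never reached the data.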

2. master
Send data into Flume through a netcat source, bound to host localhost and port 44444.
Start: ./bin/flume-ng agent --conf conf --conf-file ./conf/example.conf --name a1 -Dflume.root.logger=INFO,console

example.conf:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^[0-9]*$
a1.sources.r1.interceptors.i1.excludeEvents = true

# Describe the sink
#a1.sinks.k1.type = logger
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs:/flume/events
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# DataStream writes the events as plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

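In another terminal on master
Start: telnet localhost 44444

^[0-9]*$: events made up only of digits are filtered out. The logic is the opposite of filter in Spark.
^ anchors the start, $ anchors the end.

Spark: elements for which the filter is true are kept.
Flume: with excludeEvents = true, events that match the filter are dropped.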
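
The contrast between the two filters can be tried out in Python. The function flume_regex_filter below only imitates the interceptor's keep/drop decision and is not Flume code; Python's built-in filter plays the role of Spark's filter, which likewise keeps elements whose predicate is true.

```python
import re

PATTERN = re.compile(r"^[0-9]*$")


def flume_regex_filter(lines, exclude_events=True):
    """Imitate regex_filter: with exclude_events=True, matches are dropped."""
    kept = []
    for line in lines:
        matched = PATTERN.search(line) is not None
        if matched != exclude_events:
            kept.append(line)
    return kept


lines = ["123", "abc", "12a", ""]
flume_kept = flume_regex_filter(lines)
# A Spark-style filter keeps the elements for which the predicate is true.
spark_style_kept = list(filter(lambda s: PATTERN.search(s) is not None, lines))
```

Note that the pattern also matches an empty line, because * allows zero digits, so empty events are dropped by the interceptor as well.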

3. source-channel-sink on master -> source-channel-sink on slave1:
Flume agent a2 [push.conf] -> Flume agent a1 [pull.conf]
Start order: slave1 first, then master
slave1:
bin/flume-ng agent -c conf -f conf/pull.conf -n a1 -Dflume.root.logger=INFO,console
master:
bin/flume-ng agent -c conf -f conf/push.conf -n a2 -Dflume.root.logger=INFO,console

pull.conf:

#Name the components on this agent
a1.sources= r1
a1.sinks= k1
a1.channels= c1

#Describe/configure the source
a1.sources.r1.type= avro
a1.sources.r1.channels= c1
a1.sources.r1.bind= slave1
a1.sources.r1.port= 44444

#Describe the sink
a1.sinks.k1.type= logger
a1.sinks.k1.channel = c1

#Use a channel which buffers events in memory
a1.channels.c1.type= memory
a1.channels.c1.keep-alive= 10
a1.channels.c1.capacity= 100000
a1.channels.c1.transactionCapacity= 100000

push.conf:

#Name the components on this agent
a2.sources= r1
a2.sinks= k1
a2.channels= c1

#Describe/configure the source
a2.sources.r1.type= netcat
a2.sources.r1.bind= localhost
a2.sources.r1.port = 44444
a2.sources.r1.channels= c1

#Use a channel which buffers events in memory
a2.channels.c1.type= memory
a2.channels.c1.keep-alive= 10
a2.channels.c1.capacity= 100000
a2.channels.c1.transactionCapacity= 100000


#Send to slave1
#Describe the sink
a2.sinks.k1.type= avro
a2.sinks.k1.channel= c1
a2.sinks.k1.hostname= slave1
a2.sinks.k1.port= 44444

Official documentation: http://flume.apache.org/FlumeUserGuide.html#hdfs-sink

4. Write to Hive

① Create the Hive table (database: badou)
create table order_flume(
order_id string,
user_id string,
eval_set string,
order_number string,
order_dow string,
order_hour_of_day string,
days_since_prior_order string)
clustered by (order_id) into 5 buckets
stored as orc
tblproperties ('transactional'='true');

② Flume Hive sink configuration:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/wl/flume/zheng/flume_exec_test.txt
#a1.sources.r1.type=netcat
#a1.sources.r1.bind=master
#a1.sources.r1.port=44444

a1.sinks.k1.type=hive
a1.sinks.k1.hive.metastore=thrift://master:9083
a1.sinks.k1.hive.database=badou
a1.sinks.k1.hive.table=order_flume
#a1.sinks.k1.hive.partition = eval_set
#a1.sinks.sink1.hive.txnsPerBatchAsk = 2
#a1.sinks.k1.useLocalTimeStamp = false
#a1.sinks.k1.round = true
#a1.sinks.k1.roundValue = 10
#a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter=","
a1.sinks.k1.serializer.serdeSeparator=','
a1.sinks.k1.serializer.fieldnames = order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order

a1.channels.c1.type=memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000

a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
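
To exercise this configuration, each line appended to the tailed file needs seven comma-separated values in the same order as serializer.fieldnames. The sketch below shows one way to produce such lines; the helper names and the sample values are hypothetical, and it writes to a temporary file rather than to the real path.

```python
import os
import tempfile

# Same order as a1.sinks.k1.serializer.fieldnames in the config above.
FIELDNAMES = [
    "order_id", "user_id", "eval_set", "order_number",
    "order_dow", "order_hour_of_day", "days_since_prior_order",
]


def format_order_row(order):
    """Join the values in fieldnames order, using the "," delimiter."""
    return ",".join(str(order[name]) for name in FIELDNAMES)


def append_row(path, order):
    """Append one line to the file that the exec source is tailing."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(format_order_row(order) + "\n")


# Hypothetical sample values, for illustration only.
sample_order = {
    "order_id": "1", "user_id": "42", "eval_set": "prior",
    "order_number": "3", "order_dow": "2",
    "order_hour_of_day": "08", "days_since_prior_order": "7.0",
}

demo_path = os.path.join(tempfile.mkdtemp(), "flume_exec_test.txt")
append_row(demo_path, sample_order)
```

Pointing append_row at the path used in a1.sources.r1.command would feed the row to the running agent.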

③ Add the dependency JARs to flume_home/lib:

/usr/local/src/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/*
/usr/local/src/apache-hive-1.2.2-bin/lib/*

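④ Set these 5 properties in hive-site.xml:

<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>

<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>

<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>

<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>

<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>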