Chapter 12: Spark Streaming Project in Practice

Latest recommended article published on 2022-08-29 15:02:30

weixin_SAG

Categories: spark streaming, big data

Article link: https://blog.csdn.net/weixin_38492276/article/details/81354738

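This document walks through a hands-on Spark Streaming project that processes internet access logs. It covers the requirements, an introduction to the logs, log generation, real-time collection with Flume, integration with Kafka, and consumption and data cleaning in Spark Streaming, then shows how to store the cleaned data in HBase and run statistical analysis on it. The project spans development of a Python log generator, the Flume-Kafka-Spark Streaming data flow, and interaction with HBase.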

12-1 - Course outline

Project in practice

Requirements

Overview of internet access logs

Feature development and local run

Running in production

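12-2 - Requirements

Number of visits to hands-on courses from the start of today until now

Number of visits to hands-on courses that arrived from search engines, from the start of today until now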
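Both metrics boil down to two per-day counters over the access records. The sketch below is only an illustration under stated assumptions: each cleaned record is taken to be a (day, url, referer) tuple, hands-on course pages are assumed to live under a hypothetical `/class/` prefix, and the search-engine host list is made up for the example. None of these specifics come from the course material.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumptions for illustration only: the URL prefix and the search-engine
# host list are hypothetical, not taken from the original project.
COURSE_PREFIX = "/class/"
SEARCH_HOSTS = {"www.baidu.com", "www.google.com", "cn.bing.com"}


def count_visits(records):
    """Return (visits per day, search-engine visits per day) for course pages.

    Each record is a (day, url, referer) tuple, for example
    ("20180801", "/class/128.html", "https://www.baidu.com/s?wd=spark").
    """
    total = Counter()
    from_search = Counter()
    for day, url, referer in records:
        if not url.startswith(COURSE_PREFIX):
            continue  # not a hands-on course page
        total[day] += 1
        # A missing referer is assumed to be logged as "-".
        host = urlparse(referer).netloc if referer and referer != "-" else ""
        if host in SEARCH_HOSTS:
            from_search[day] += 1
    return total, from_search


sample = [
    ("20180801", "/class/128.html", "https://www.baidu.com/s?wd=spark"),
    ("20180801", "/class/112.html", "-"),
    ("20180801", "/learn/821", "https://www.google.com/search?q=hadoop"),
]
total, from_search = count_visits(sample)
print(total["20180801"], from_search["20180801"])  # prints: 2 1
```

In a streaming job the same two counters would be updated batch by batch and keyed by day, so that "today until now" is simply the current value for today's key.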

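12-3 - Introduction to user behavior logs

Why record user access behavior logs

Page views of the website

Stickiness of the website

Why user behavior log analysis matters

The eyes of the website

The nerves of the website

The brain of the website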

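12-4 - Python log generator: generating the accessed URL and IP

12-5 - Python log generator: generating the referer and status code

12-6 - Python log generator: generating the log access time

12-7 - Testing the Python log generator on the server and writing the logs to a file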
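The four steps above (URL and IP, referer and status code, access time, writing to a file) can be sketched in one small script. This is a minimal sketch, not the course's actual generator: every sample value (URL paths, IP octets, search-engine URLs, keywords, status codes), the tab-separated line layout, and all function names are assumptions chosen for illustration.

```python
import random
import time

# All sample values below are hypothetical placeholders, not the course's data.
URL_PATHS = ["class/112.html", "class/128.html", "learn/821", "course/list"]
IP_SLICES = [10, 29, 30, 46, 55, 63, 72, 87, 98, 132, 156, 124, 167, 143, 187, 168]
REFERERS = [
    "https://www.baidu.com/s?wd={query}",
    "https://www.google.com/search?q={query}",
    "https://cn.bing.com/search?q={query}",
]
KEYWORDS = ["Spark Streaming", "Hadoop", "Storm", "Spark SQL"]
STATUS_CODES = ["200", "404", "500"]


def sample_url():
    return random.choice(URL_PATHS)


def sample_ip():
    # Join four distinct random octets into a dotted address.
    return ".".join(str(n) for n in random.sample(IP_SLICES, 4))


def sample_referer():
    # Roughly 80% of requests carry no referer, written as "-".
    if random.random() > 0.2:
        return "-"
    query = random.choice(KEYWORDS).replace(" ", "+")
    return random.choice(REFERERS).format(query=query)


def sample_status_code():
    return random.choice(STATUS_CODES)


def generate_log_lines(count=10):
    """Build `count` tab-separated access-log lines stamped with the current time."""
    now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    return [
        '{ip}\t{now}\t"GET /{url} HTTP/1.1"\t{status}\t{referer}'.format(
            ip=sample_ip(), now=now, url=sample_url(),
            status=sample_status_code(), referer=sample_referer())
        for _ in range(count)
    ]


def write_logs(path, count=10):
    """Append one batch of generated lines to the log file at `path`."""
    with open(path, "a", encoding="utf-8") as f:
        for line in generate_log_lines(count):
            f.write(line + "\n")


if __name__ == "__main__":
    # Print a small batch for a quick check; on the server, call
    # write_logs(<path of the file that Flume tails>) instead.
    for line in generate_log_lines(5):
        print(line)
```

Appending rather than overwriting matters here: a `tail -F` reader only picks up new lines, so each scheduled run should add a fresh batch to the end of the same file.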

12-8 - Generating a batch of data every minute with a scheduling tool

Linux crontab

https://tool.lu/crontab

crontab expression that runs once every minute: */1 * * * *

crontab -e

*/1 * * * * /home/hadoop/data/project/log_generator.sh

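12-9 - Collecting log data in real time with Flume

Connect the Flume & Kafka & Spark Streaming pipeline end to end

Feed the logs produced by the Python log generator into Flume

streaming_project.conf

Component selection: access.log ==> console output

source: exec

channel: memory

sink: logger

For details, see: http://flume.apache.org/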

exec-memory-logger.sources=exec-source

exec-memory-logger.sinks=logger-sink

exec-memory-logger.channels=memory-channel

exec-memory-logger.sources.exec-source.type=exec

exec-memory-logger.sources.exec-source.command=tail -F /home/hadoop/data/project/logs/access.log

exec-memory-logger.sources.exec-source.shell=/bin/sh -c

exec-memory-logger.channels.memory-channel.type=memory

exec-memory-logger.sinks.logger-sink.type=logger

exec-memory-logger.sources.exec-source.channels=memory-channel

exec-memory-logger.sinks.logger-sink.channel=memory-channel
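Flume wires an agent together purely by name: every source, channel, and sink used in a property key has to appear in the agent's `sources`, `channels`, or `sinks` list, and a source's `channels` and a sink's `channel` have to name a declared channel. A single misspelled name leaves a component unconfigured. The sketch below is a hypothetical helper, not part of Flume, that checks a config like the one above for such mismatches before the agent is started; it assumes simple `key=value` lines and component names without dots.

```python
def parse_properties(text):
    """Parse `key=value` lines into a dict, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_agent(text, agent):
    """Return a list of naming problems found for the given agent."""
    props = parse_properties(text)
    problems = []
    declared = {}
    for kind in ("sources", "channels", "sinks"):
        names = props.get("{}.{}".format(agent, kind), "").split()
        declared[kind] = set(names)
        if not names:
            problems.append("no {} declared for agent {}".format(kind, agent))

    for key, value in props.items():
        parts = key.split(".")
        if parts[0] != agent or len(parts) < 4:
            continue  # another agent, or one of the declaration lines
        kind, name, attr = parts[1], parts[2], parts[3]
        if kind not in declared:
            problems.append("unknown component kind in key: " + key)
            continue
        if name not in declared[kind]:
            problems.append("undeclared {} name in key: {}".format(kind[:-1], key))
        # `channels` on a source and `channel` on a sink must name declared channels.
        if (kind, attr) in (("sources", "channels"), ("sinks", "channel")):
            for channel in value.split():
                if channel not in declared["channels"]:
                    problems.append(
                        "{} refers to undeclared channel {}".format(key, channel))
    return problems


if __name__ == "__main__":
    # A deliberately broken two-line example: `channel` instead of `channels`.
    demo = "a1.sources=r1\na1.sinks=k1\na1.channel=c1\na1.sources.r1.channels=c1\n"
    for problem in check_agent(demo, "a1"):
        print(problem)
```

Running the checker against the text of a config file with the agent name as the second argument returns an empty list when all names line up.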

Start

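12-10 - Sending real-time log data to Kafka and printing it to the console as a test

Logs ==> Flume ==> Kafka

1. Start ZooKeeper

./zkServer.sh start

2. Start the Kafka server

./kafka-server-start.sh -daemon /home/hadoop/app/kafka_2.11-0.9.0.0/config/server.properties

3. Modify the Flume configuration file so that the Flume sink writes data to Kafka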

exec-memory-kafka.sources=exec-source

exec-memory-kafka.sinks=kafka-sink

exec-memory-kafka.channels=memory-channel

exec-memory-kafka.sources.exec-source.type=exec

exec-memory-kafka.sources.exec-source.command=tail -F /home/hadoop/data/project/logs/access.log

exec-memory-kafka.sources.exec-source.shell=/bin/sh -c

exec-memory-kafka.channels.memory-channel.type=memory

exec-memory-kafka.sinks.kafka-sink.type=org.apache.flume.sink.kafka.KafkaSink

exec-memory-kafka.sources.exec-source.channels=memory-channel

exec-memory-kafka.sinks.kafka-sink.channel=memory-channel