Spark Streaming Real-Time Stream Processing Project (Hands-On)

Overall project workflow:

  1. Simulate user access log data;
  2. Collect the data with Flume and deliver it to Kafka for consumption;
  3. Process it in real time with Spark Streaming according to the requirements, and save the results to HBase;

Data Preparation

  • generate_log.py generates simulated logs of users visiting the site
  • Each record consists of ip + timestamp + url + status code + search-engine referer
  • To simulate a real-time pipeline, generate_log.py is run on a schedule


  • Scheduled run:
    • lgl.sh
python /home/jackie/project_0907/generate_log.py
  • Runs once per minute, generating 100 records each time

[jackie@hadoop102:project_0907]$ crontab -e

*/1 * * * * /home/jackie/project_0907/lgl.sh


Data Collection

  • Start ZooKeeper, Kafka, and the Flume agent for collection (a sketch of streaming_project.conf follows the commands below)
# standalone ZooKeeper
/opt/module/zookeeper/bin/zkServer.sh start

# start the Kafka broker
/opt/module/kafka/bin/kafka-server-start.sh \
-daemon /opt/module/kafka/config/server.properties

# start the Flume agent
/opt/module/flume/bin/flume-ng agent \
--conf /opt/module/flume/conf/ \
--conf-file /home/jackie/project_0907/streaming_project.conf \
--name exec-memory-kafka \
-Dflume.root.logger=INFO,console
# console consumer to check that messages reach the topic
[jackie@hadoop102:kafka]$ bin/kafka-console-consumer.sh --zookeeper hadoop102:2181 --topic streamingtopic
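
The contents of streaming_project.conf are not included in this post. Below is a minimal sketch of an exec → memory → Kafka agent matching the --name exec-memory-kafka above; the tailed log path is hypothetical, and the sink properties assume a Flume 1.6-style Kafka sink (newer Flume versions use kafka.topic / kafka.bootstrap.servers instead).

# minimal sketch of streaming_project.conf (assumed, not from the original post)
exec-memory-kafka.sources = exec-source
exec-memory-kafka.channels = memory-channel
exec-memory-kafka.sinks = kafka-sink

# tail the generated access log (this file path is hypothetical)
exec-memory-kafka.sources.exec-source.type = exec
exec-memory-kafka.sources.exec-source.command = tail -F /home/jackie/project_0907/access.log
exec-memory-kafka.sources.exec-source.channels = memory-channel

exec-memory-kafka.channels.memory-channel.type = memory

# Flume 1.6-style Kafka sink
exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
exec-memory-kafka.sinks.kafka-sink.topic = streamingtopic
exec-memory-kafka.sinks.kafka-sink.brokerList = hadoop102:9092
exec-memory-kafka.sinks.kafka-sink.channel = memory-channel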

Data Processing

Test 1: Data reception
  • Verify that data can be received from Kafka
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
    import org.apache.spark.streaming.kafka.KafkaUtils

    //1. Initialize the Spark configuration
    //   setMaster("local[*]") is for local testing; when packaging for spark-submit, use the commented line instead
    val sparkConf = new SparkConf().setAppName("CourseClickCount").setMaster("local[*]")
    //val sparkConf = new SparkConf().setAppName("CourseClickCount")

    //2. Initialize the StreamingContext (the real-time analysis environment) with a 60s batch interval
    val streamingContext = new StreamingContext(sparkConf, Seconds(60))

    //3. Receive data from Kafka (ZooKeeper quorum, consumer group, topic -> number of receiver threads)
    val kafkaDStream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(
      streamingContext,
      "hadoop102:2181",
      "lzou",
      Map("streamingtopic" -> 1)
    )
    kafkaDStream.map(_._2).count().print()

    //4. Start the job and block (in the full program this comes after all the transformations in tests 2-4 below)
    streamingContext.start()
    streamingContext.awaitTermination()

// sample received data (the cron job produces 100 records per minute and the batch interval is 60s, so count() should print roughly 100 per batch)
/*
    10.63.98.87	2020-09-15 18:09:01	"GET /class/143.html HTTP/1.1"	404	-
    143.55.98.124	2020-09-15 18:09:01	"GET /class/112.html HTTP/1.1"	200	http://www.baidu.com/s?wd=Hadoop基础
    132.143.98.156	2020-09-15 18:09:01	"GET /class/128.html HTTP/1.1"	200	-
    10.132.63.124	2020-09-15 18:09:01	"GET /class/128.html HTTP/1.1"	200	-
    187.10.132.72	2020-09-15 18:09:01	"GET /course/list HTTP/1.1"	200	http://www.baidu.com/s?wd=Spark Streaming
     */

Test 2: Data cleansing
  • Filter out the course (class) records, i.e. those whose URL path starts with /class
 val cleanData: DStream[ClickLog] = kafkaDStream.map(_._2).map(line => {
      // sample line: 10.55.187.87	2020-09-09 14:03:01	"GET /class/112.html HTTP/1.1"	404	-
      val fields: Array[String] = line.split("\t")
      val url: String = fields(2).split(" ")(1)
      var courseId = 0

      // extract the course id from URLs such as /class/112.html
      if (url.startsWith("/class")) {
        val courseIdHTML: String = url.split("/")(2)
        courseId = courseIdHTML.substring(0, courseIdHTML.lastIndexOf(".")).toInt
      }

      ClickLog(fields(0), DateUtils.parseToMinute(fields(1)), courseId, fields(3).toInt, fields(4))

    }).filter(_.courseId != 0) // keep only course (/class) records

    cleanData.print()

// sample output
 /** cleanData
      * ClickLog(30.87.55.46,20200915211301,143,404,-)
      * ClickLog(156.187.132.98,20200915211301,141,404,-)
      * ClickLog(30.55.63.167,20200915211301,143,500,http://search.yahoo.com/search?p=大数据面试)
      */
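
The code above references a ClickLog case class and a DateUtils helper that are not shown in this post. A minimal sketch, with field names inferred from the usage and the sample output (the original implementations may differ):

// one cleaned click record: ip, reformatted time, course id, HTTP status, referer
case class ClickLog(ip: String, time: String, courseId: Int, statusCode: Int, referer: String)

object DateUtils {
  // reformats "2020-09-15 18:09:01" as "20200915180901" (yyyyMMddHHmmss), matching the sample output above
  private val inputFormat  = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  private val outputFormat = new java.text.SimpleDateFormat("yyyyMMddHHmmss")

  def parseToMinute(time: String): String = outputFormat.format(inputFormat.parse(time))
}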


Test 3: Storing the counts in HBase
  • rowkey : day + courseId (e.g. 20200915_128)
cleanData.map(x => {
      (x.time.substring(0, 8) + "_" + x.courseId, 1)
    }).reduceByKey(_ + _).foreachRDD(rdd => {

      rdd.foreachPartition(partition => {
        val buffer: ListBuffer[CourseClickCount] = new ListBuffer[CourseClickCount]

        partition.foreach(pair => {
          buffer.append(CourseClickCount(pair._1, pair._2))
        })

        CourseClickCountDAO.save(buffer)
      })
    })
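
CourseClickCount and CourseClickCountDAO are likewise not shown. A minimal sketch using the HBase client API, assuming a pre-created table (create 'course_clickcount', 'info' in hbase shell); the table name, column family, and qualifier are assumptions:

import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.mutable.ListBuffer

// one entry per day_courseId rowkey with its click count for the current batch
case class CourseClickCount(dayCourse: String, clickCount: Long)

object CourseClickCountDAO {
  private val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create()) // reads hbase-site.xml
  private val table = conn.getTable(TableName.valueOf("course_clickcount"))           // assumed table name

  def save(list: ListBuffer[CourseClickCount]): Unit = {
    // incrementColumnValue accumulates across batches, so each 60s batch adds to the running total
    list.foreach { item =>
      table.incrementColumnValue(
        Bytes.toBytes(item.dayCourse),   // rowkey: yyyyMMdd_courseId
        Bytes.toBytes("info"),           // assumed column family
        Bytes.toBytes("click_count"),    // assumed qualifier
        item.clickCount)
    }
  }
}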

Test 4: Count visits to course pages that come from search engines
  • rowkey : day + search engine host + courseId (e.g. 20200915_www.baidu.com_143)
    // e.g. ClickLog(10.46.187.63,20200915212801,143,404,http://www.baidu.com/s?wd=Hadoop基础)
    cleanData.map(x => {
      // "http://www.baidu.com/s?wd=..." -> replaceAll -> "http:/www.baidu.com/s?wd=..." -> split("/")(1) = "www.baidu.com"
      val referer: String = x.referer.replaceAll("//", "/")
      val splits: Array[String] = referer.split("/")
      var host = ""
      if (splits.length > 2) {
        host = splits(1)
      }
      (host, x.courseId, x.time)

    }).filter(_._1 != "").map(x => {

      (x._3.substring(0, 8) + "_" + x._1 + "_" + x._2, 1)

    }).reduceByKey(_ + _).foreachRDD(rdd => {

      rdd.foreachPartition(partition => {

        val list: ListBuffer[CourseSearchClickCount] = new ListBuffer[CourseSearchClickCount]
        partition.foreach(pair => {
          list.append(CourseSearchClickCount(pair._1, pair._2))
        })

        CourseSearchClickCountDAO.save(list)
      })


    })
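
As in test 3, CourseSearchClickCount and CourseSearchClickCountDAO are not shown; presumably they mirror the previous pair. A minimal sketch (table name, column family, and qualifier are again assumptions):

import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.mutable.ListBuffer

// rowkey: yyyyMMdd_searchEngineHost_courseId
case class CourseSearchClickCount(daySearchCourse: String, clickCount: Long)

object CourseSearchClickCountDAO {
  private val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  private val table = conn.getTable(TableName.valueOf("course_search_clickcount"))  // assumed table name

  def save(list: ListBuffer[CourseSearchClickCount]): Unit =
    list.foreach(item =>
      table.incrementColumnValue(Bytes.toBytes(item.daySearchCourse),
        Bytes.toBytes("info"), Bytes.toBytes("click_count"), item.clickCount))
}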

Packaging and Submission

pom.xml build configuration (maven-assembly-plugin); a sketch of the implied dependencies follows the plugin block below
  <build>
        <plugins>


            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.1</version>
                <configuration>
                    <archive>
                        <manifest>
                            <!-- main class -->
                            <mainClass>com.spark.SparkStreaming_Kafka</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
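
The dependencies themselves are not listed in this post; the code above implies something like the following (artifact versions and Scala version are assumptions and should match your cluster):

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <!-- KafkaUtils.createStream comes from the 0-8 receiver-based connector -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.3.1</version>
        </dependency>
    </dependencies>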
Submit the jar to Spark
Start HDFS and ZooKeeper first
# start Spark
[jackie@hadoop102:spark]$ sbin/start-all.sh
# start spark-shell
[jackie@hadoop102 spark]$ bin/spark-shell

[jackie@hadoop102:spark]$ bin/spark-submit --master local[5] \
> --name CourseClickCount \
> --class com.spark.SparkStreaming_Kafka \
> /home/jackie/project_0907/spark-streaming-project-1.0-SNAPSHOT-jar-with-dependencies.jar 


  • The jar uploaded here is the full (fat) jar, dependencies included
  • --name CourseClickCount : the display name of the Spark application; it can be anything
  • --class com.spark.SparkStreaming_Kafka : the main class to load
  • /home/jackie/…jar : path to the jar

Spark Web UI : http://hadoop102:4040/
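
To confirm that the counts are actually landing in HBase, the (assumed) tables can be scanned from hbase shell:

[jackie@hadoop102:hbase]$ bin/hbase shell
hbase(main):001:0> scan 'course_clickcount'
hbase(main):002:0> scan 'course_search_clickcount'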


Notes

Source code

Processes that need to be running (check with jps)

HDFS, ZK, Flume, Kafka, HBase, Spark


HBase startup error
  • When launching hbase shell:


  • Fix:

    • Stop HBase; check with jps whether its processes are still running, and kill them directly if necessary;
    • Go into zookeeper/bin/, run zkCli.sh, and execute rmr /hbase to delete the hbase znode;
    [jackie@hadoop102:zookeeper]$ bin/zkCli.sh 
    
    [zk: localhost:2181(CONNECTED) 0] rmr /hbase
    
    • Then delete the /hbase directory on HDFS;
     [jackie@hadoop102:~]$ hadoop fs -rm -r /hbase
    
    • Restart HBase
Other
  • If you run into other problems, feel free to discuss.