Process Overview
1 Use log4j to generate logs and have Flume collect them
2 Flume forwards the logs collected from log4j to Kafka
3 Integrate Kafka with Spark Streaming
4 Spark Streaming processes the received data
Data Flow Diagram
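log4j (LoggerGenerator) -> Flume (avro source -> memory channel -> Kafka sink) -> Kafka (topic: streamingtopic) -> Spark Streaming (KafkaStreamingApp)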
Detailed Steps
------ log4j -> Flume ------
1 Write the log4j.properties configuration file (log4j.properties) (the hostname and port that the Flume source listens on are configured in log4j.properties)
2 Add the flume-ng-log4jappender dependency (flume_log4j dependency)
3 Write a Java program that automatically generates logs (LoggerGenerator.java)
4 Write the Flume configuration file and start Flume (streaming.conf, Flume startup command)
------ Flume -> Kafka ------
5 Start ZooKeeper (ZooKeeper startup command)
6 Start Kafka (Kafka startup command)
7 List the Kafka topics (command to list Kafka topics)
8 Create a Kafka topic (command to create a Kafka topic)
9 Write the Flume configuration file (streaming2.conf)
10 Start Flume (Flume startup command)
11 Start a Kafka consumer (Kafka consumer startup command)
12 Start the log4j log generator program (LoggerGenerator.java)
------ Kafka -> Spark Streaming ------
13 Write the Spark Streaming program (Receiver-based approach)
14 Run the Spark Streaming program (KafkaStreamingApp.scala) (before running it, remember to start Kafka and Flume first, then the log4j log generator, and finally the Spark Streaming program)
/* The steps above are for local testing; to deploy in a production environment, the following steps are needed */
1 Package the log4j log generator program as a jar and run it on the cluster
2 The Flume and Kafka operations are the same as in the local test
3 The Spark Streaming code also needs to be packaged as a jar and run with spark-submit (see the sketch after this list); the run mode can be chosen according to the actual environment: local/yarn/standalone/mesos
/* In production the overall streaming pipeline is the same; the difference lies in the complexity of the business logic */
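A minimal sketch of what the production submission might look like, assuming a YARN cluster, an application jar built from the pom below (sparktrain-1.0.jar, path here is a placeholder), and the same four program arguments as in the local test; note that the hard-coded setAppName/setMaster("local[2]") in KafkaStreamingApp would need to be removed so that spark-submit controls them, and --packages pulls in the Kafka integration jar, which is not bundled with Spark:
spark-submit \
--class com.imooc.spark.KafkaStreamingApp \
--master yarn \
--name KafkaStreamingApp \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 \
/path/to/sparktrain-1.0.jar \
hadoop000:2181 test streamingtopic 1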
pom.xml file
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.imooc.spark</groupId>
<artifactId>sparktrain</artifactId>
<version>1.0</version>
<inceptionYear>2008</inceptionYear>
<properties>
<scala.version>2.11.8</scala.version>
<kafka.version>0.9.0.0</kafka.version>
<spark.version>2.2.0</spark.version>
<hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
<hbase.version>1.2.0-cdh5.7.0</hbase.version>
</properties>
<!-- Add the Cloudera repository -->
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- Kafka dependency -->
<!--
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>${kafka.version}</version>
</dependency>
-->
<!-- Hadoop dependency -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- HBase dependency -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
<!-- Spark Streaming dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- Spark Streaming + Flume integration dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume-sink_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.5</version>
</dependency>
<!-- Spark SQL dependency -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.module</groupId>
<artifactId>jackson-module-scala_2.11</artifactId>
<version>2.6.5</version>
</dependency>
<dependency>
<groupId>net.jpountz.lz4</groupId>
<artifactId>lz4</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>
<!-- Required when integrating Flume with log4j -->
<dependency>
<groupId>org.apache.flume.flume-ng-clients</groupId>
<artifactId>flume-ng-log4jappender</artifactId>
<version>1.6.0</version>
</dependency>
</dependencies>
<build>
<!--
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
-->
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.5</arg>
</args>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-eclipse-plugin</artifactId>
<configuration>
<downloadSources>true</downloadSources>
<buildcommands>
<buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
</buildcommands>
<additionalProjectnatures>
<projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
</additionalProjectnatures>
<classpathContainers>
<classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
<classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
</classpathContainers>
</configuration>
</plugin>
</plugins>
</build>
<reporting>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
</plugins>
</reporting>
</project>
------ log4j -> Flume ------
1 Write the log4j.properties configuration file (log4j.properties) (the hostname and port that the Flume source listens on are configured in log4j.properties)
log4j.rootLogger=INFO,stdout,flume
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.target = System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = hadoop000
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = true
2 Add the flume-ng-log4jappender dependency (flume_log4j dependency)
<!-- Required when integrating Flume with log4j -->
<dependency>
<groupId>org.apache.flume.flume-ng-clients</groupId>
<artifactId>flume-ng-log4jappender</artifactId>
<version>1.6.0</version>
</dependency>
3 Write a Java program that automatically generates logs (LoggerGenerator.java)
import org.apache.log4j.Logger;
/* Simulates log generation */
public class LoggerGenerator {
private static Logger logger = Logger.getLogger(LoggerGenerator.class.getName());
public static void main(String[] args) throws Exception {
int index = 0;
while (true) {
Thread.sleep(1000);
logger.info("value : " + index++);
}
}
}
4 Write the Flume configuration file and start Flume (streaming.conf, Flume startup command)
// Flume configuration file
agent1.sources=avro-source
agent1.channels=logger-channel
agent1.sinks=log-sink
#define source
agent1.sources.avro-source.type=avro
agent1.sources.avro-source.bind=0.0.0.0
agent1.sources.avro-source.port=41414
#define channel
agent1.channels.logger-channel.type=memory
#define sink
agent1.sinks.log-sink.type=logger
agent1.sources.avro-source.channels=logger-channel
agent1.sinks.log-sink.channel=logger-channel
// Flume startup command
flume-ng agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/streaming.conf \
--name agent1 \
-Dflume.root.logger=INFO,console
// After starting Flume, start the log4j log generator and check whether Flume receives the log messages
------ Flume -> Kafka ------
5 Start ZooKeeper (ZooKeeper startup command)
zkServer.sh start
6 Start Kafka (Kafka startup command)
$KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties
7 List the Kafka topics (command to list Kafka topics)
$KAFKA_HOME/bin/kafka-topics.sh --list --zookeeper hadoop000:2181
8 Create a Kafka topic (command to create a Kafka topic)
$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper hadoop000:2181 --replication-factor 1 --partitions 1 --topic streamingtopic
9 Write the Flume configuration file (streaming2.conf)
agent1.sources=avro-source
agent1.channels=logger-channel
agent1.sinks=kafka-sink
#define source
agent1.sources.avro-source.type=avro
agent1.sources.avro-source.bind=0.0.0.0
agent1.sources.avro-source.port=41414
#define channel
agent1.channels.logger-channel.type=memory
#define sink
agent1.sinks.kafka-sink.type=org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafka-sink.topic = streamingtopic
agent1.sinks.kafka-sink.brokerList = hadoop000:9092
agent1.sinks.kafka-sink.requiredAcks = 1
agent1.sinks.kafka-sink.batchSize = 20
agent1.sources.avro-source.channels=logger-channel
agent1.sinks.kafka-sink.channel=logger-channel
10 Start Flume (Flume startup command)
flume-ng agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/streaming2.conf \
--name agent1 \
-Dflume.root.logger=INFO,console
11 Start a Kafka consumer (Kafka consumer startup command)
$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic streamingtopic
12 Start the log4j log generator program (LoggerGenerator.java)
// Just start it from IDEA. Once it is running, check whether data shows up on the Kafka consumer side. The data arrives in batches; the batch size is set with batchSize in the Flume configuration file: a batchSize of 20 means Flume waits until it has collected 20 records and then sends them to Kafka in a single batch
------ Kafka -> Spark Streaming ------
13 Write the Spark Streaming program (Receiver-based approach)
package com.imooc.spark
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
/* Spark Streaming + Kafka integration, Receiver-based approach */
object KafkaStreamingApp {
def main(args: Array[String]): Unit = {
if(args.length != 4) {
System.err.println("Usage: KafkaStreamingApp <zkQuorum> <group> <topics> <numThreads>")
System.exit(1)
}
val Array(zkQuorum, group, topics, numThreads) = args
val sparkConf = new SparkConf().setAppName("KafkaReceiverWordCount").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(5))
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
// Connecting Spark Streaming to Kafka requires the ssc, the ZooKeeper quorum, the consumer group, and the topic map
val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
// Each stream element is a (key, message) tuple, so take the second element to get the message body
messages.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()
}
}
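The appName in the conf above is KafkaReceiverWordCount, but the sample only counts the records of each batch. A minimal sketch of how the same stream could be turned into an actual word count, assuming it replaces the messages.map(_._2).count().print() line inside main (illustrative variant only, not part of the original program):
// each element is a (key, message) tuple; take the message body and split it into words
val words = messages.map(_._2).flatMap(_.split(" "))
// count each word within the current batch and print the result
words.map((_, 1)).reduceByKey(_ + _).print()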
14 Run the Spark Streaming program (KafkaStreamingApp.scala) (before running it, remember to start Kafka and Flume first, then the log4j log generator, and finally the Spark Streaming program)
// When running the Spark Streaming program, pass in the arguments (zkQuorum groupId topics numThreads), e.g. hadoop000:2181 test streamingtopic 1
// If Spark Streaming prints the computed results, the integration is working