I. Environment
Development environment:
OS: Windows 10
IDE: Scala IDE for Eclipse
Build tool: Maven 3.6.0
JDK 1.8
Scala 2.11.11
Spark (Streaming & SQL) 2.4.3
MySQL: mysql-connector-java-5.1.47
Job runtime environment:
OS: Linux CentOS 7 (two machines, master and slave, 2 cores each)
master: 192.168.190.200
slave1: 192.168.190.201
JDK 1.8
Scala 2.11.11
Spark 2.4.3
Hadoop 2.9.2
MySQL 5.6.44 (on the master node)
II. Case
1. A Spark Streaming job analyzes log data and writes the results to MySQL.
2. The logs live on HDFS under "hdfs://master:9000/user/spark/logs".
3. The streaming job reads newly arrived logs at a 2-second interval, structures them (converting the log DStream into a DataFrame), and analyzes the data with Spark SQL.
4. The analysis result (a DataFrame) is written to the MySQL database through the JDBC driver.
The log data (simulating two days, 0622 and 0623) is as follows:
Log format ("\t"-separated):
log level   method   content
[log_level]\tmethod\tcontent
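Each line thus splits into exactly three tab-separated tokens, which a quick check in the Scala REPL confirms (a sketch using one sample line):
val line = "[info]\tmain\t20190622-list size is 10"
val Array(logLevel, method, content) = line.split("\t")
// logLevel == "[info]", method == "main", content == "20190622-list size is 10"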
Log file 1: 20190622.log
[info] main 20190622-list size is 10
[info] main 20190622-list size is 11
[info] main 20190622-list size is 12
[info] main 20190622-list size is 13
[info] main 20190622-list size is 14
[info] main 20190622-list size is 15
[warn] calculate 20190622-zero denominator warning!
[info] main 20190622-list size is 16
[info] main 20190622-list size is 17
[error] readFile 20190622-nullpointer file
[info] main 20190622-list size is 18
[info] main 20190622-list size is 19
[info] main 20190622-list size is 20
Log file 2: 20190623.log
[info] main 20190623-list size is 10
[info] main 20190623-list size is 11
[info] main 20190623-list size is 12
[info] main 20190623-list size is 13
[info] main 20190623-list size is 14
[info] main 20190623-list size is 15
[warn] calculate 20190623-zero denominator warning!
[info] main 20190623-list size is 16
[info] main 20190623-list size is 17
[error] readFile 20190623-nullpointer file
[info] main 20190623-list size is 18
[info] main 20190623-list size is 19
[info] main 20190623-list size is 20
Expected result:
Extract the log entries whose level is [warn] or [error] and write them to the database, as follows:
[warn] calculate 20190622-zero denominator warning!
[error] readFile 20190622-nullpointer file
[warn] calculate 20190623-zero denominator warning!
[error] readFile 20190623-nullpointer file
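The filtering logic can be verified locally before touching the cluster. A minimal sketch (the object name FilterCheck and the local[2] master are assumptions for a local test, not part of the job below):
import org.apache.spark.sql.SparkSession

object FilterCheck extends App {
  // Local session used only to exercise the predicate from the streaming job
  val spark = SparkSession.builder().appName("FilterCheck").master("local[2]").getOrCreate()
  import spark.implicits._
  val sample = Seq(
    ("[info]", "main", "20190622-list size is 10"),
    ("[warn]", "calculate", "20190622-zero denominator warning!"),
    ("[error]", "readFile", "20190622-nullpointer file"))
    .toDF("log_level", "method", "context")
  // Same predicate as the streaming job's SQL query
  sample.filter($"log_level" === "[warn]" || $"log_level" === "[error]").show(false)
  spark.stop()
}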
III. Code
1. pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com</groupId><!-- organization name -->
<artifactId>logAnalysis</artifactId><!-- project name -->
<version>0.1</version><!-- version -->
<dependencies>
<dependency><!-- Spark core dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
<scope>provided</scope><!-- provided at runtime, not packaged: the Spark cluster already ships it -->
</dependency>
<dependency><!-- Spark Streaming dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.4.3</version>
<scope>provided</scope><!-- provided at runtime, not packaged: the Spark cluster already ships it -->
</dependency>
<dependency><!-- Spark SQL dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<dependency><!-- MySQL JDBC driver -->
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
<dependency><!-- log4j logging dependency -->
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency><!-- SLF4J binding for log4j -->
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.12</version>
</dependency>
</dependencies>
<build>
<plugins>
<!-- mixed Scala/Java compilation -->
<plugin><!-- Scala compiler plugin -->
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<executions>
<execution>
<id>compile</id>
<goals>
<goal>compile</goal>
</goals>
<phase>compile</phase>
</execution>
<execution>
<id>test-compile</id>
<goals>
<goal>testCompile</goal>
</goals>
<phase>test-compile</phase>
</execution>
<execution>
<phase>process-resources</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source><!-- Java source level -->
<target>1.8</target>
</configuration>
</plugin>
<!-- for fatjar -->
<plugin><!-- bundle all dependencies into a single fat jar -->
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.4</version>
<configuration>
<descriptorRefs>
<!-- suffix of the generated jar's file name -->
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>assemble-all</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin><!-- Maven jar plugin -->
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<archive>
<manifest>
<!-- add the classpath to the manifest -->
<addClasspath>true</addClasspath>
<!-- set the program's main class -->
<mainClass>sparkstreaming_action.log.analysis.LogAnalysis</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
<repositories>
<repository>
<id>alimaven</id>
<name>aliyun maven</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
</project>
2. Main program:
package sparkstreaming_action.log.analysis
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import java.util.Properties
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.SaveMode
// Case class for one log record
case class Record(log_level: String, method: String, context: String)
// Log analysis job
object LogAnalysis extends App {
  // Spark configuration
  val sparkConf = new SparkConf()
    .setAppName("LogAnalysis")
    .setMaster("spark://master:7077")
    .set("spark.local.dir", "./tmp")
  // Build the session via SparkSession
  val sparkSession = SparkSession.builder()
    .appName("LogAnalysis")
    .config(sparkConf)
    .getOrCreate()
  // Underlying Spark context
  val sc = sparkSession.sparkContext
  // Streaming context with a 2-second batch interval
  val ssc = new StreamingContext(sc, Seconds(2))
  // MySQL connection properties
  val properties = new Properties()
  properties.setProperty("user", "hadoop")
  properties.setProperty("password", "123456")
  // Stream of log lines from new files under the log directory
  val logFilesDir = "hdfs://master:9000/user/spark/logs"
  val logStream = ssc.textFileStream(logFilesDir)
  // Convert each batch of log lines into a Spark SQL DataFrame
  logStream.foreachRDD((rdd: RDD[String]) => {
    // Import the session's implicits, which convert Scala objects into DataFrames
    import sparkSession.implicits._
    val data = rdd.map((record: String) => {
      val tokens = record.split("\t")
      // Build one Record per line
      Record(tokens(0), tokens(1), tokens(2))
    }).toDF() // note: toDF() is not defined on RDD itself; the implicits above wrap the RDD of case classes so it can be converted into a DataFrame
    // Register a temporary view (it lives as long as the SparkSession)
    data.createOrReplaceTempView("alldata")
    // Filter by log level
    val logImp = sparkSession.sql("select * from alldata where log_level='[error]' or log_level='[warn]'")
    // Print the first 20 records as a table
    logImp.show()
    // Write the result out to MySQL
    // // Schema matching the MySQL table, if it ever needs to be declared explicitly:
    // val schema = StructType(Array(
    //   StructField("log_level", StringType, true),
    //   StructField("method", StringType, true),
    //   StructField("content", StringType, true)))
    // Append mode; the write is triggered on the driver, and the partitions are written by the executors
    logImp.write.mode(SaveMode.Append)
      .jdbc("jdbc:mysql://master:3306/spark", "log_analysis", properties)
  })
  ssc.start()            // start the computation
  ssc.awaitTermination() // wait for termination
}
IV. Packaging and Running
1. Open a command-line window in the project's root directory (Shift + right-click inside the directory and choose the PowerShell window).
Run the following command to build the code:
> mvn clean install
On success, two jar files appear under ".\target\";
the fat jar, logAnalysis-0.1-jar-with-dependencies.jar, is the one to submit to the Spark cluster.
In terminal A (e.g., Windows PowerShell; several sessions can be opened), ssh into the master node and execute:
2. Create the HDFS log directory:
$ hadoop fs -mkdir /user/spark
$ hadoop fs -mkdir /user/spark/logs
In terminal B (again, e.g., Windows PowerShell), ssh into the master node and execute:
3. Copy the jar to the master node and submit the Spark job:
Submit the Spark job (the SPARK_HOME environment variable must be configured first):
$ nohup spark-submit \
--class sparkstreaming_action.log.analysis.LogAnalysis \
--conf spark.default.parallelism=20 \
--driver-class-path mysql-connector-java-5.1.47.jar \
/opt/logAnalysis-0.1-jar-with-dependencies.jar
Note 1: the trailing "\" is the shell line-continuation character, so the command can be entered across several lines as shown, or joined onto a single line.
Note 2: the submitted jar is placed under /opt/.
Note 3: spark.default.parallelism=20 increases the number of partitions (one task thread per partition).
Note 4: --driver-class-path adds the JDBC driver jar; without it the job cannot find the driver at runtime.
Note 5: nohup keeps the console output out of the terminal and saves it to nohup.out in the directory the job was submitted from.
Use $ tail -f nohup.out to follow the output log.
While the streaming job runs, it prints an empty result table every 2 seconds until a new file lands in the monitored directory:
+---------+------+-------+
|log_level|method|context|
+---------+------+-------+
+---------+------+-------+
4. In terminal A, upload the log files to HDFS one after the other:
$ hadoop fs -put logs/20190622.log /user/spark/logs/
$ hadoop fs -put logs/20190623.log /user/spark/logs/
Note: the job does not re-read a file it has already processed; textFileStream decides what to pick up from each file's last modification time.
The job then prints the parsed log records:
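For the batch that picks up 20190622.log, for example, the console table looks roughly like this (show() truncates values longer than 20 characters):
+---------+---------+--------------------+
|log_level|   method|             context|
+---------+---------+--------------------+
|   [warn]|calculate|20190622-zero den...|
|  [error]| readFile|20190622-nullpoin...|
+---------+---------+--------------------+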
Check the received data in MySQL; at this point the whole pipeline is complete.
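The written rows can also be read back through the same JDBC path for a quick check (a sketch reusing the url and properties from the job):
// Read the target table back and print its contents
val logTable = sparkSession.read.jdbc("jdbc:mysql://master:3306/spark", "log_analysis", properties)
logTable.show(false)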
Finally, the Spark UI now shows an additional SQL tab.
There, the DataFrame executes the same three jobs over and over, once per batch:
1) create the temporary view: createOrReplaceTempView
2) query the view with the filter condition and print the result to the console: show
3) connect to MySQL via JDBC and write the query result: jdbc
V. Errors Encountered
Error 1: no JDBC driver found.
Fix: add the JDBC driver jar to the spark-submit parameters,
e.g. --driver-class-path mysql-connector-java-5.1.47.jar
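An alternative (a sketch, not what the job above uses): pin the driver class explicitly in the JDBC connection properties, and additionally ship the jar to the executors with --jars:
// Next to the user/password settings in the job:
properties.setProperty("driver", "com.mysql.jdbc.Driver") // driver class shipped in mysql-connector-java 5.1.x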
Error 2: hadoop@master has no permission to access MySQL.
Fix: grant access inside MySQL by running the authorization commands:
e.g. mysql> grant all on spark.* to hadoop@'master' identified by '123456';
mysql> flush privileges;
Error 3: ArrayIndexOutOfBoundsException while parsing a log line, because no tab character "\t" was found in it.
Cause: the editor option that replaces a typed Tab with 4 spaces was enabled; uncheck it when preparing the log files (even though Alibaba's Java coding guidelines recommend keeping it on).
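A defensive variant of the parsing step also avoids this crash by dropping malformed lines instead of throwing (a sketch that would replace the rdd.map(...) block in the job above):
// Keep only lines that split into exactly three tab-separated fields
val data = rdd.flatMap { record =>
  record.split("\t") match {
    case Array(level, method, content) => Some(Record(level, method, content))
    case _                             => None // drop malformed lines
  }
}.toDF()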
VI. References
1. 《Spark Streaming实时流式大数据处理实战》, section 6.6, Example: Log Analysis
2. 《Spark最佳实践》, chapter 5: Spark SQL
3. Spark configuration reference: configuration.html (Spark official site)