1. Requirements
Extract the picture requests from the log data, count how many times each picture appears, and sort the results in descending order of count.
2. Environment
- ZooKeeper cluster configuration:

Node | Install path | dataDir path |
---|---|---|
hadoop002 | /training/zookeeper-3.4.5 | /training/zookeeper-3.4.5/tmp |
hadoop003 | /training/zookeeper-3.4.5 | /training/zookeeper-3.4.5/tmp |
hadoop004 | /training/zookeeper-3.4.5 | /training/zookeeper-3.4.5/tmp |
- Flink cluster configuration:

Node | Install path | Log path |
---|---|---|
hadoop002 | /training/flink-standalone/ | /training/flink-standalone/ |
hadoop003 | /training/flink-standalone/ | /training/flink-standalone/ |
hadoop004 | /training/flink-standalone/ | /training/flink-standalone/ |
3. Implementation
- Create a Maven project.
- Add the following dependencies and build plugins to pom.xml:
```xml
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.12</artifactId>
        <version>1.10.2</version>
        <!-- Pull in the sources jar so the Flink source code can be browsed -->
        <classifier>sources</classifier>
        <type>java-source</type>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.12</artifactId>
        <version>1.10.2</version>
        <classifier>sources</classifier>
        <type>java-source</type>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-simple</artifactId>
        <version>1.7.25</version>
        <scope>compile</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <!-- Compiles the Scala sources into class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.4.6</version>
            <executions>
                <execution>
                    <!-- Bound to Maven's compile phase -->
                    <goals>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```
- Create the WordCount program:
```scala
package com.suben.flink.wc

import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment, createTypeInformation}

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Read the log file and keep only the last field of each line (the requested resource)
    val resultDataSet: DataSet[String] =
      env.readTextFile("E:\\IdeaProjects\\bigdata-sets002\\flink-basic\\data\\apache.log")
        .map(line => {
          var target = ""
          if (line != null && !"".equals(line)) {
            val datas: Array[String] = line.split(" ")
            if (datas.length >= 7) {
              target = datas(datas.length - 1)
            }
          }
          target
        })

    println(">>>>>start>>>>>>>>")
    pictureTopN(resultDataSet)
    println(">>>>>>Log analytic end>>>>>>>>")
  }

  // Count picture requests and print them in descending order of frequency
  def pictureTopN(dataSet: DataSet[String]): Unit = {
    dataSet.map(line => {
      var target = ""
      if (line.contains(".png") || line.contains(".jpeg") || line.contains(".jpg")) {
        val lastSlice: Int = line.lastIndexOf("/")
        target = line.substring(lastSlice + 1)
      }
      (target, 1)
    }).groupBy(0)
      .sum(1)
      .filter(!_._1.equals(""))
      .setParallelism(1)
      .sortPartition(1, Order.DESCENDING)
      .print()
  }
}
```
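The core of `pictureTopN` is the map step that turns a raw log line into a picture name. That logic can be tried in isolation without a Flink cluster; the sketch below mirrors it in plain Scala, using plain collections in place of the DataSet operators. `PictureExtractDemo` and the sample lines are hypothetical, for illustration only:

```scala
object PictureExtractDemo {
  // Mirrors the map step of pictureTopN: keep the line only if it requests
  // a picture, and return the text after the last '/' (the file name).
  def extractPicture(line: String): String = {
    if (line.contains(".png") || line.contains(".jpeg") || line.contains(".jpg"))
      line.substring(line.lastIndexOf("/") + 1)
    else
      ""
  }

  def main(args: Array[String]): Unit = {
    val hits = Seq(
      "83.149.9.216 - - GET /presentations/kibana-logo.png",
      "83.149.9.216 - - GET /presentations/kibana-logo.png",
      "10.0.0.1 - - GET /index.html"
    )
    // Group, count, and sort descending, like groupBy(0).sum(1).sortPartition(...)
    val counts = hits.map(extractPicture)
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (name, occurrences) => (name, occurrences.size) }
      .toSeq
      .sortBy(-_._2)
    println(counts.mkString(", ")) // (kibana-logo.png,2)
  }
}
```

Non-picture lines map to the empty string and are filtered out, which is exactly what the `.filter(!_._1.equals(""))` call does in the Flink job.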
- Run the program locally; the results are correct:
- Change the input path in the code to the file path on the virtual machines (note: the file must exist on all three virtual machines):

```scala
// Read the log file from the cluster path
val resultDataSet: DataSet[String] =
  env.readTextFile("file:///training/datas/apache.log")
  //env.readTextFile("E:\\IdeaProjects\\bigdata-sets002\\flink-basic\\data\\apache.log")
```
- Package the project and upload the resulting jar to the virtual machine.
- Submit the job with the flink command:

```shell
./flink run -c com.suben.flink.wc.WordCount /root/flink-basic-1.0-SNAPSHOT.jar
```
- The job runs and the results are printed to the console:
- View the job in the web UI:
Total tasks: 9
Finished tasks: 9
Status: Finished
Because the job parallelism is 3 (spread across the three nodes), the operator stages were split into 9 tasks in total.
Job details: