Overview
This grew out of a recent requirement for a message-retrieval system. The requirement involves cleaning and persisting message data. The data is produced by a web crawler (initially implemented with Python Scrapy), which sends MQ messages via python-kafka to this system's Kafka service. On receiving a message, the data is processed by a Storm topology based on KafkaSpout and finally written to ES. The detailed flow is shown in the figure below:
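To make the hand-off between the crawler and this system concrete, the payload published to the Kafka topic might look like the sketch below. It is built with only the Java standard library so it stays self-contained; the field names `title`, `content` and `crawlTime` are hypothetical, since the real message schema is defined by the crawler side.

```java
import java.time.Instant;

public class NewsPayload {
    // Build the kind of JSON document the crawler could publish to the
    // Kafka topic (field names here are illustrative, not the real schema).
    static String toJson(String title, String content) {
        return String.format(
            "{\"title\":\"%s\",\"content\":\"%s\",\"crawlTime\":\"%s\"}",
            title, content, Instant.now());
    }

    public static void main(String[] args) {
        System.out.println(toJson("hello", "world"));
    }
}
```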
Environment
Since resources were limited, only a single local machine was available for testing, i.e. every service that would normally run on a cluster is exercised in LOCAL mode here. For actual cluster deployment, refer to other material; this article only covers the code, and the final application is not affected by the deployment mode.
Server: ubuntu server 17.10
JVM: jdk_1.8.0_91_64bit
Coordination: zookeeper-3.4.9
Stream processing: apache-storm-1.2.2
Message queue: kafka_2.11-2.0.0
Index storage: elasticsearch-5.6.10
Application Development
1. The project is built with Maven, which makes dependency management convenient. The project structure is shown below:
2. Project POM
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.sdnware.news</groupId>
<artifactId>news-kafka-storm</artifactId>
<version>1.0</version>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<java.version>1.8</java.version>
<junit.version>4.12</junit.version>
<kafka.version>2.0.0</kafka.version>
<storm.version>1.2.2</storm.version>
<storm-kafka.version>1.2.2</storm-kafka.version>
<storm-elasticsearch.version>1.2.2</storm-elasticsearch.version>
<lombok.version>1.18.2</lombok.version>
<gson.version>2.8.5</gson.version>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>${junit.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.12</artifactId>
<version>${kafka.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
</exclusion>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>${storm.version}</version>
<!-- When packaging for deployment, set scope to provided -->
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>${lombok.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>${gson.version}</version>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-kafka-client</artifactId>
<version>${storm-kafka.version}</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-elasticsearch</artifactId>
<version>${storm-elasticsearch.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>com.sdnware.news.topo.KafkaTopology</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
</plugins>
</build>
</project>
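With this POM, building and submitting the topology to a cluster would look roughly like the commands below (after switching storm-core to `provided` scope, as the comment in the POM notes). The topology name `news-topo` is only an example:

```shell
# build the fat jar produced by the assembly plugin
mvn clean package
# submit the topology; passing an argument puts KafkaTopology into cluster mode
storm jar target/news-kafka-storm-1.0-jar-with-dependencies.jar \
    com.sdnware.news.topo.KafkaTopology news-topo
```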
3. Development with storm-kafka-client
Note: before Storm 1.x, the official storm-kafka module was the standard way to integrate with Kafka. It still works in 1.x, but it is deprecated; the officially recommended replacement is storm-kafka-client, which is also very convenient to use.
In Storm development the business logic is essentially organized around a topology. In this example we write KafkaTopology directly:
package com.sdnware.news.topo;
import com.google.gson.Gson;
import com.sdnware.news.pojo.UserInfo;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.elasticsearch.bolt.EsIndexBolt;
import org.apache.storm.elasticsearch.common.DefaultEsTupleMapper;
import org.apache.storm.elasticsearch.common.EsConfig;
import org.apache.storm.elasticsearch.common.EsTupleMapper;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import java.util.Properties;
import java.util.UUID;
/**
* Created by sdnware on 18-8-31.
*/
public class KafkaTopology {
public static void main(String[] args) throws Exception{
/** The following is pseudocode based on the legacy storm-kafka API, for comparison:
BrokerHosts zkHosts = new ZkHosts(ZK_HOSTS);
SpoutConfig config = new SpoutConfig(zkHosts, KAFKA_TOPIC, ZK_ROOT + KAFKA_TOPIC,
UUID.randomUUID().toString());
config.scheme = new SchemeAsMultiScheme(new StringScheme());
config.zkServers = Arrays.asList(ZK_SERVERS.split(","));
config.zkPort = ZK_PORT;
config.socketTimeoutMs = socketTimeoutMs; **/
TopologyBuilder topologyBuilder = new TopologyBuilder(); // the topology builder
Properties properties = new Properties();
properties.setProperty("group.id", "test-news-topic"); // basic Kafka consumer configuration
// define a KafkaSpoutConfig
KafkaSpoutConfig<String, String> kafkaSpoutConfig = KafkaSpoutConfig.builder("192.168.100.39:9092",
"news-topic")
.setFirstPollOffsetStrategy(KafkaSpoutConfig.FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST)
.setProp(properties).build();
KafkaSpout<String, String> kafkaSpout = new KafkaSpout<>(kafkaSpoutConfig); // the KafkaSpout
topologyBuilder.setSpout("kafka-spout", kafkaSpout, 1); // register the spout
topologyBuilder.setBolt("kafka-bolt", new NewsBlot(), 1).shuffleGrouping("kafka-spout"); // consume the kafka-spout stream
EsConfig esConfig = new EsConfig(new String[]{"http://192.168.100.39:9200"}); // ES connection configuration
EsTupleMapper esTupleMapper = new DefaultEsTupleMapper(); // default ES tuple mapping
EsIndexBolt indexBolt = new EsIndexBolt(esConfig, esTupleMapper); // bolt that indexes into ES
topologyBuilder.setBolt("es-bolt", indexBolt, 1).shuffleGrouping("kafka-bolt"); // indexBolt processes the kafka-bolt output
// submit to the storm cluster
Config config = new Config();
config.setMessageTimeoutSecs(90);
if (args.length > 0) { // cluster mode
config.setDebug(false);
StormSubmitter.submitTopology(args[0],
config, topologyBuilder.createTopology());
} else { // local mode, typically used for testing
// config.setDebug(true);
config.setNumWorkers(2);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("local-kafka-storm-topology",
config, topologyBuilder.createTopology());
}
}
// custom bolt that transforms Kafka messages
static class NewsBlot extends BaseBasicBolt {
// called for each incoming message; the emitted Values must line up one-to-one with the fields declared via declarer.declare(xxx)
public void execute(Tuple input, BasicOutputCollector collector) {
// System.err.println(input.getValues());
String id = UUID.randomUUID().toString();
UserInfo userInfo = new UserInfo();
userInfo.setId(id);
userInfo.setUsername("tanwei");
userInfo.setPassword("sdnware");
Gson gson = new Gson();
String source = gson.toJson(userInfo);
collector.emit(new Values(source, "idx_sys", "tb_user", id));
}
// declares the emitted field names; these are the fields the EsTupleMapper expects (trace its source code to see why)
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("source", "index", "type", "id"));
}
}
}
Looking at the code above, some readers may wonder: why is there no Storm configuration, and how does the code find Storm at all? I had the same question the first time; after tracing the source I found that all Storm settings default to the defaults.yaml bundled inside the storm-core jar. See the official documentation for how to override them; since this is a local test, the defaults are sufficient.
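When those defaults do need to change, for example when moving to a real cluster, the usual place to override them is conf/storm.yaml on each node. A minimal sketch, with placeholder host names:

```yaml
# conf/storm.yaml (host names are placeholders)
storm.zookeeper.servers:
  - "zk-host-1"
nimbus.seeds: ["nimbus-host"]
storm.local.dir: "/var/storm"
```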
In NewsBlot's execute method, because the input comes from Kafka, the Tuple is by default a list containing the Kafka topic, partition, offset and message. In production we would transform the message into the data format to be stored in ES according to business needs; here I simply simulate it with a UserInfo object. In general the ES source is a JSON document whose keys are the ES field names and whose values are the corresponding field values.
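In a real topology, the mapping from the raw Kafka message to the four declared fields could be factored into a small helper like the sketch below. It uses only the standard library; deriving the document id from the message bytes is my own choice (not from the original code) so that replayed tuples produce the same ES document rather than duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class EsFieldMapper {
    // Map a raw Kafka message (already JSON from the crawler) to the
    // (source, index, type, id) fields that DefaultEsTupleMapper expects.
    // Deriving the id from the message content keeps re-emits idempotent.
    static String[] toEsFields(String message, String index, String type) {
        String id = UUID.nameUUIDFromBytes(
                message.getBytes(StandardCharsets.UTF_8)).toString();
        return new String[]{message, index, type, id};
    }

    public static void main(String[] args) {
        String[] fields = toEsFields("{\"username\":\"tanwei\"}", "idx_sys", "tb_user");
        System.out.println(fields[3]);
    }
}
```

The bolt's execute method would then emit `new Values(fields[0], fields[1], fields[2], fields[3])` in the same order as the declared fields.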
Postscript
This is only a simple demo giving a rough sketch of the basic implementation. In the full message system, issues such as routing data to dedicated partitions and consumer groups still need to be considered. In any case the fundamentals do not change: read the source code, and everything becomes clear.