Storm: real-time statistics on hot cache data -> cache pre-warming -> automatic degradation of hot cache data
Hive: the data-warehouse system in the Hadoop ecosystem; batch statistical analysis of the massive request logs produced under high-concurrency access: daily/weekly/monthly reports, API call statistics, business usage metrics, and so on
In some large companies, engineers dump massive request logs into Hive, run offline analysis on them, and then feed the results back into optimizing their own systems.
Spark: offline batch processing, e.g. reading hundreds of millions of rows from a DB in one pass, cleaning and transforming them, then writing the result into Redis for downstream systems; typical for user-related data at large internet companies
ZooKeeper: coordination for distributed systems, distributed locks, leader election -> high-availability (HA) architectures, lightweight metadata storage
Suppose you have built a distributed system in Java: the system is split into multiple parts, each responsible for some functionality, and the parts need to interact and coordinate with each other.
Service A says: while I am handling this piece of work, service B, please don't touch it.
Service A says: when something happens on my side, I want service B to notice immediately and react accordingly.
HBase: online storage and simple queries over massive data, replacing MySQL sharding with better scalability
At the Java layer you have massive data volumes, you only need simple storage and queries, and you must be able to scale out quickly as the data grows.
MySQL sharding is a poor fit here: expanding a sharded MySQL setup is fairly painful.
Elasticsearch: complex search over massive data and search-engine construction; powers the search features of data-heavy enterprise information systems, e-commerce and news sites, and so on
I. What exactly is Storm?
1. MySQL, Hadoop, and Storm
MySQL: transactional systems, which struggle once data volumes become massive
Hadoop: offline batch processing
Storm: real-time computation
2. JStorm
JStorm comes from Alibaba, whose engineering is among the best in China and world-class.
Storm itself is written in Clojure; Alibaba rewrote it in Java as JStorm, the engine behind its Galaxy stream-computing system.
3. What are Storm's key features?
(1) It supports all kinds of real-time scenarios: processing messages in real time and updating databases, using the most basic real-time computation semantics and APIs (real-time data processing); running continuous queries or computations over a live data stream and continuously pushing the latest results to clients for display, again on the basic semantics and APIs (real-time data analytics); and parallelizing expensive queries via DRPC, i.e. distributed RPC. For example, for 30 days of data in a single table, the query can be parallelized so that each process handles one day and the results are assembled at the end.
Storm is a fit for pretty much any real-time project.
(2) High scalability: to scale out, just add machines and raise the parallelism of the Storm job; Storm automatically deploys more processes and threads onto the other machines. Seamless, fast scaling.
Scaling out is extremely convenient.
(3) Guaranteed no data loss: with Storm's message reliability (ack) mechanism enabled, not a single tuple is lost.
Note that this is an at-least-once guarantee: failed tuples are replayed and may be recomputed; exactly-once semantics require a layer such as Trident on top.
(4) Strong robustness: historically Storm has proved far more robust than big-data systems like Hadoop or Spark, because all of its metadata lives in ZooKeeper rather than in process memory, so individual processes can die without much consequence.
It is notably robust, with high stability and availability.
(5) Ease of use: the core semantics are very simple, so development is fast.
It is simple to use, and the development API is straightforward.
(figure: Storm topology submission flow)
1. The client submits the topology's jar to Nimbus.
2. The jar is uploaded to {storm.local.dir}/nimbus/inbox on the Nimbus server.
3. The submitTopology method processes the topology.
First it validates Storm itself and the topology: it checks that Storm's state is active,
checks whether a topology with the same name is already running,
and checks that no spout or bolt in the topology reuses an id and that the ids are well-formed: they must not start with "__", the prefix Storm reserves for system components.
4. Nimbus creates the topology's local directory /nimbus/stormdist/topology-uuid.
This directory contains three files:
stormjar.jar: the jar containing all of the topology's code
stormcode.ser: the serialized topology object
stormconf.ser: the configuration for running this topology
5. Nimbus assigns tasks: it finds free worker slots and, based on the topology's numWorkers setting, parallelism hints, and task counts, sets the number of tasks for each spout and bolt, assigns the corresponding task ids, and allocates workers.
6. Nimbus creates the taskbeat directory in ZooKeeper; every task must send a heartbeat there at a fixed interval to report its status.
7. The assignment is written to ZooKeeper under assignment/topology-uuid; only at this point is the submission complete.
8. The topology's information is written to the /storms directory in ZooKeeper.
9. Each Supervisor watches the storms directory in ZooKeeper for new tasks that belong to it; when it finds some, it downloads the jar and starts workers according to the assignment.
10. It deletes the local code of topologies that are no longer running.
11. The Supervisor starts workers according to the task information Nimbus assigned.
12. Each worker uses the task ids to tell spouts from bolts.
13. It computes which tasks its own spouts/bolts will send messages to.
14. It opens the network connections (by IP and port) used to send those messages.
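Step 5's task distribution can be sketched in plain Java. This is only an illustration of the round-robin idea, not Storm's actual scheduler; the class and method names below are made up:

```java
import java.util.*;

// Simplified illustration of step 5: Nimbus spreads task ids
// across the available workers round-robin. This is NOT Storm's
// real scheduler, just a sketch of the idea.
public class TaskAssignment {
    // returns worker slot -> list of task ids assigned to it
    public static Map<Integer, List<Integer>> assign(int numTasks, int numWorkers) {
        Map<Integer, List<Integer>> assignment = new HashMap<>();
        for (int w = 0; w < numWorkers; w++) {
            assignment.put(w, new ArrayList<>());
        }
        for (int taskId = 0; taskId < numTasks; taskId++) {
            // task i goes to worker i mod numWorkers
            assignment.get(taskId % numWorkers).add(taskId);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 10 tasks over 3 workers: the workers get 4/3/3 tasks
        System.out.println(assign(10, 3));
    }
}
```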
A plain-language walkthrough follows.
II. Storm's cluster architecture and core concepts
1. Storm's cluster architecture
Nimbus, Supervisor, ZooKeeper, Worker, Executor, Task
2. Storm's core concepts
Topology, Spout, Bolt, Tuple, Stream
Topology: a deliberately abstract concept.
Spout: the code component for a data source. You implement the spout interface in a Java class, and inside that spout you fetch data from wherever it lives, for example by consuming messages from Kafka.
Bolt: a code component for business logic. A spout passes its data to bolts, and bolts can be chained into a processing pipeline; each bolt is a Java class implementing the bolt interface.
A set of spouts plus bolts forms a topology: a real-time computation job. One topology covers all the code, from acquiring/producing the data to processing it.
Tuple: a single piece of data. Every datum is wrapped in a tuple as it is passed between spouts and bolts.
Stream: also an abstract concept: the endless sequence of tuples flowing past forms a data stream.
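To make the spout -> bolt -> tuple relationship concrete, here is a tiny plain-Java simulation with no Storm dependency. Everything in it is made up for illustration; the real Storm APIs (IRichSpout, IRichBolt, OutputCollector, ...) are much richer:

```java
import java.util.*;
import java.util.function.Consumer;

// Plain-Java sketch of the spout -> bolt -> bolt idea: a "tuple" is
// just a list of values handed down a chain of processing steps.
public class MiniTopology {
    static List<Object> tuple(Object... values) { return Arrays.asList(values); }

    // "runs" the chain for one sentence and returns what the last bolt saw
    public static List<String> run(String sentence) {
        List<String> received = new ArrayList<>();

        // downstream "bolt": records each (word) tuple it receives
        Consumer<List<Object>> countBolt = t -> received.add((String) t.get(0));

        // upstream "bolt": splits a (sentence) tuple into (word) tuples
        Consumer<List<Object>> splitBolt = t -> {
            for (String word : ((String) t.get(0)).split(" ")) {
                countBolt.accept(tuple(word)); // emit downstream
            }
        };

        // "spout": emits one sentence tuple into the stream
        splitBolt.accept(tuple(sentence));
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run("the cow jumped")); // [the, cow, jumped]
    }
}
```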
Parallelism and stream grouping
Parallelism: Worker -> Executor -> Task (yes, Task is the unit).
Stream grouping: how data flows between tasks.
Shuffle Grouping: tuples are distributed randomly, which balances the load.
Fields Grouping: tuples are grouped by one or more fields; tuples whose values for those fields are identical are always sent to the same fixed task of the downstream bolt.
Each piece of data you emit is a tuple, and each tuple carries several fields.
For example, a tuple with 3 fields: name, age, salary.
{"name": "tom", "age": 25, "salary": 10000} -> a tuple with 3 fields: name, age, salary
All Grouping
Global Grouping
None Grouping
Direct Grouping
Local or Shuffle Grouping
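The core idea of Fields Grouping can be sketched with a hash: take the grouping field's value modulo the number of downstream tasks, so equal values always pick the same task. This mirrors the concept only; it is not Storm's exact internal partitioner, and the names are made up:

```java
public class FieldsGroupingSketch {
    // pick the downstream task for a tuple based on one field's value;
    // equal field values always map to the same task id
    public static int chooseTask(Object fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        // the word "tom" always lands on the same one of 4 tasks
        System.out.println(chooseTask("tom", 4) == chooseTask("tom", 4)); // true
    }
}
```

This is also why the word-count example later in this document uses fieldsGrouping on "word": it is what makes per-word counts land on a single counting task.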
Deploying a Storm cluster
(1) Install Java and Python
[root@node1 zk]# python -V
Python 2.7.5
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
[root@node1 zk]#
(2) Download the Storm tarball, extract it, rename the directory, and configure environment variables
mv apache-storm-1.1.0/ storm
Edit the environment variables:
vi ~/.bashrc
After editing, run source ~/.bashrc
Add Storm's environment variables:
# .bashrc
export JAVA_HOME=/usr/java/latest
export ZOOKEEPER_HOME=/usr/local/zk
export SCALA_HOME=/usr/local/scala
export STORM_HOME=/usr/local/storm
export PATH=$PATH:$JAVA_HOME/bin:$ZOOKEEPER_HOME/bin:$SCALA_HOME/bin:$STORM_HOME/bin
# User specific aliases and functions
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
(3) Edit the Storm configuration file
vi storm/conf/storm.yaml
Configure the ZooKeeper server nodes:
# storm.zookeeper.servers:
# - "server1"
# - "server2"
#
# nimbus.seeds: ["host1", "host2", "host3"]
Change this to:
storm.zookeeper.servers:
- "10.1.218.22"
- "10.1.218.26"
- "10.1.218.24"
#
nimbus.seeds: ["10.1.218.22"]
Add the setting storm.local.dir: "/var/storm"
Also add:
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
mkdir /var/storm
nimbus.seeds lists the candidate Nimbus hosts, e.g. nimbus.seeds: ["111.222.333.44"].
supervisor.slots.ports specifies how many workers each machine may start: one port number per worker.
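For reference, the settings above combine into a storm.yaml like the following (values are this cluster's; adjust the hosts and directory to your own environment):

```yaml
storm.zookeeper.servers:
    - "10.1.218.22"
    - "10.1.218.26"
    - "10.1.218.24"

nimbus.seeds: ["10.1.218.22"]

storm.local.dir: "/var/storm"

supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
```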
Copy ~/.bashrc from node1 to node2 and node3:
[root@node1 storm]# scp ~/.bashrc root@node2:~/
.bashrc 100% 403 85.1KB/s 00:00
[root@node1 storm]# scp ~/.bashrc root@node3:~/
Also copy the storm directory to node2 and node3:
scp -r storm/ root@node2:/usr/local
scp -r storm/ root@node3:/usr/local
Make sure ZooKeeper is running on all three nodes before the next steps; otherwise Nimbus and Supervisor will start and then die on their own.
(4) Start the Storm cluster
On node1: storm nimbus >/dev/null 2>&1 &
On all three nodes: storm supervisor >/dev/null 2>&1 &
Then on node1:
[root@node1 local]# jps
16545 Supervisor
11730 nimbus
18763 Jps
On node2:
[root@node2 local]# jps
25013 Supervisor
25814 Jps
On node3:
[root@node3 ~]# jps
24597 Supervisor
25499 Jps
(5) Start the web UI
On node1: storm ui >/dev/null 2>&1 &
If you later deploy Java jobs into Storm and want to see their log output, also run storm logviewer >/dev/null 2>&1 & on every node that runs a Supervisor.
After starting it, jps shows:
<pid> logviewer
(6) Open the UI on port 8080:
http://10.1.218.22:8080
Submitting a job to the Storm cluster
Build the project in the IDE with mvn package, then upload the jar to /usr/local on node1.
(1) Submit the job to the cluster:
storm jar strom-helloworld-0.0.1-SNAPSHOT.jar com.example.strom.WordCountTopology wordCountTopology
SLF4J: Found binding in [jar:file:/usr/local/storm/lib/log4j-slf4j-impl-2.8.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/strom-helloworld-0.0.1-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Running: /usr/java/latest/bin/java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/usr/local/storm -Dstorm.log.dir=/usr/local/storm/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /usr/local/storm/lib/storm-core-1.1.0.jar:/usr/local/storm/lib/kryo-3.0.3.jar:/usr/local/storm/lib/reflectasm-1.10.1.jar:/usr/local/storm/lib/asm-5.0.3.jar:/usr/local/storm/lib/minlog-1.3.0.jar:/usr/local/storm/lib/objenesis-2.1.jar:/usr/local/storm/lib/clojure-1.7.0.jar:/usr/local/storm/lib/ring-cors-0.1.5.jar:/usr/local/storm/lib/disruptor-3.3.2.jar:/usr/local/storm/lib/log4j-api-2.8.jar:/usr/local/storm/lib/log4j-core-2.8.jar:/usr/local/storm/lib/log4j-slf4j-impl-2.8.jar:/usr/local/storm/lib/slf4j-api-1.7.21.jar:/usr/local/storm/lib/log4j-over-slf4j-1.6.6.jar:/usr/local/storm/lib/servlet-api-2.5.jar:/usr/local/storm/lib/storm-rename-hack-1.1.0.jar:strom-helloworld-0.0.1-SNAPSHOT.jar:/usr/local/storm/conf:/usr/local/storm/bin -Dstorm.jar=strom-helloworld-0.0.1-SNAPSHOT.jar -Dstorm.dependency.jars= -Dstorm.dependency.artifacts={} com.example.strom.WordCountTopology wordCountTopology
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/storm/lib/log4j-slf4j-impl-2.8.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/strom-helloworld-0.0.1-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
1323 [main] INFO o.a.s.StormSubmitter - Generated ZooKeeper secret payload for MD5-digest: -5492289408728269242:-6150274628524274243
1558 [main] INFO o.a.s.u.NimbusClient - Found leader nimbus : node1:6627
1581 [main] INFO o.a.s.s.a.AuthUtils - Got AutoCreds []
1592 [main] INFO o.a.s.u.NimbusClient - Found leader nimbus : node1:6627
1647 [main] INFO o.a.s.StormSubmitter - Uploading dependencies - jars...
1647 [main] INFO o.a.s.StormSubmitter - Uploading dependencies - artifacts...
1648 [main] INFO o.a.s.StormSubmitter - Dependency Blob keys - jars : [] / artifacts : []
1679 [main] INFO o.a.s.StormSubmitter - Uploading topology jar strom-helloworld-0.0.1-SNAPSHOT.jar to assigned location: /var/storm/nimbus/inbox/stormjar-6c7ca9d9-3858-4f42-9899-318966784db9.jar
Start uploading file 'strom-helloworld-0.0.1-SNAPSHOT.jar' to '/var/storm/nimbus/inbox/stormjar-6c7ca9d9-3858-4f42-9899-318966784db9.jar' (8929539 bytes)
[==================================================] 8929539 / 8929539
File 'strom-helloworld-0.0.1-SNAPSHOT.jar' uploaded to '/var/storm/nimbus/inbox/stormjar-6c7ca9d9-3858-4f42-9899-318966784db9.jar' (8929539 bytes)
1945 [main] INFO o.a.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /var/storm/nimbus/inbox/stormjar-6c7ca9d9-3858-4f42-9899-318966784db9.jar
1945 [main] INFO o.a.s.StormSubmitter - Submitting topology wordCountTopology in distributed mode with conf {"storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-5492289408728269242:-6150274628524274243","topology.workers":3,"topology.debug":false}
2454 [main] INFO o.a.s.StormSubmitter - Finished submitting topology: wordCountTopology
(2) Watch the job run in the Storm UI.
(3) Kill a Storm job:
storm kill wordCountTopology
The UI will then show the topology as killed.
The Java project
1.pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.7.1</version>
<relativePath/> <!-- lookup parent from repository -->
</parent>
<groupId>com.example</groupId>
<artifactId>strom-helloworld</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>strom-helloworld</name>
<description>Demo project for Spring Boot</description>
<properties>
<java.version>1.8</java.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter</artifactId>
<exclusions>
<exclusion>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-to-slf4j</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>1.1.0</version>
<scope>provided</scope> <!-- must be provided; otherwise deploying with storm jar fails -->
</dependency>
<dependency>
<groupId>commons-collections</groupId>
<artifactId>commons-collections</artifactId>
<version>3.2.1</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/java</sourceDirectory>
<testSourceDirectory>src/test/java</testSourceDirectory>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
<configuration>
<createDependencyReducedPom>false</createDependencyReducedPom>
<artifactSet>
<excludes>
<exclude>commons-logging:commons-logging</exclude>
<exclude>javax.servlet:servlet-api</exclude>
<exclude>javax.mail:javax.mail-api</exclude>
</excludes>
</artifactSet>
</configuration>
</plugin>
</plugins>
</build>
</project>
2.WordCountBolt
public class WordCountBolt extends BaseRichBolt {
private Map<String, Integer> map = new HashMap<>();
private OutputCollector collector;
@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.collector = collector;
}
@Override
public void execute(Tuple input) {
//1 get the incoming data
String word = input.getString(0);
Integer num = input.getInteger(1);
//2 business logic
if (map.containsKey(word)) {
//if this word has been counted before, fetch its current count
Integer count = map.get(word);
map.put(word, count + num);
} else {
map.put(word, num);
}
// 3 print to the console
System.err.println(Thread.currentThread().getId() + " word : " + word + " num: " + map.get(word));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
}
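The accumulation in execute() above boils down to a map merge. Isolated from Storm (class and method names here are made up for illustration), the same logic can be written with Map.merge:

```java
import java.util.*;

// The same word-count accumulation as WordCountBolt.execute,
// isolated from Storm: add 1 to the running count for each word.
public class WordCounting {
    public static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> map = new HashMap<>();
        for (String w : words) {
            map.merge(w, 1, Integer::sum); // replaces the containsKey if/else
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = count(Arrays.asList("a", "b", "a"));
        System.out.println(m.get("a") + " " + m.get("b")); // 2 1
    }
}
```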
3.WordCountSplitBolt
public class WordCountSplitBolt extends BaseRichBolt {
private OutputCollector collector;
@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.collector = collector;
}
@Override
public void execute(Tuple input) {
//1. get the data
String line = input.getString(0);
//2 split the line
String[] splits = line.split(" ");
//3 emit the words
for (String word : splits) {
collector.emit(new Values(word,1));
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word", "num"));
}
}
4.WordCountSpout
public class WordCountSpout extends BaseRichSpout {
private SpoutOutputCollector collector;
@Override
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
this.collector = collector;
}
@Override
public void nextTuple() {
//emit data
collector.emit(new Values("shnad zhang1 zhsndga1 dasd a a b b c dd d dd"));
//sleep 0.5 s
try {
Thread.sleep(500);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("love"));
}
}
5.WordCountTopology
/**
* The word-count topology
*/
public class WordCountTopology {
private static Logger logger = LoggerFactory.getLogger(WordCountTopology.class);
/**
* spout
* Extends a base class and implements the interface; its job is to fetch data from the data source.
* Here we simulate that by continuously emitting sentences.
*/
public static class RandomSentenceSpout extends BaseRichSpout{
SpoutOutputCollector collector;
Random random;
/**
* Initializes the spout, e.g. creating thread pools or database connection pools
* @param conf
* @param context
* @param collector used to emit data downstream
*/
@Override
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
this.collector=collector;
//build the random generator
this.random=new Random();
}
/**
* This spout class ultimately runs inside a task (a task inside an executor thread inside a worker process).
* The task calls nextTuple in a loop, so data keeps being emitted, forming a data stream.
*/
@Override
public void nextTuple() {
Utils.sleep(100);
String[] sentences = new String[]{"the cow jumped over the moon", "an apple a day keeps the doctor away",
"four score and seven years ago", "snow white and the seven dwarfs", "i am at two with nature"};
final String sentence = sentences[random.nextInt(sentences.length)];
logger.info("emitting sentence: " + sentence);
//Values builds the tuple; a tuple is the smallest unit of data, and an unbounded sequence of tuples forms a stream
//emit it
collector.emit(new Values(sentence));
}
/**
* Declares the fields of the tuples this spout emits
* @param declarer
*/
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("sentence"));
}
}
/**
* A bolt, extending the BaseRichBolt base class directly and implementing all of its methods.
* Like spouts, every bolt is shipped into tasks inside the executor threads of worker processes.
*/
public static class SplitSentence extends BaseRichBolt{
OutputCollector collector;
@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.collector=collector;
}
/**
* Called once for each tuple this bolt receives
* @param tuple
*/
@Override
public void execute(Tuple tuple) {
String sentence = tuple.getStringByField("sentence");
String [] words = sentence.split(" ");
for (String word : words) {
logger.info("emitting word: " + word);
collector.emit(new Values(word));
}
}
/**
* Declares the field names of the tuples this bolt emits
* @param declarer
*/
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
}
public static class WordCount extends BaseBasicBolt {
Map<String, Integer> counts = new HashMap<String, Integer>();
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
String word = tuple.getString(0);
Integer count = counts.get(word);
if (count == null)
count = 0;
count++;
counts.put(word, count);
logger.info("word count: " + word + " occurred " + count + " times");
collector.emit(new Values(word, count));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word", "count")); // must match the two values emitted above
}
}
public static void main(String[] args) throws Exception {
//wire the spout and bolts together into a topology
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(/*spout id*/"RandomSentence", /*spout instance*/new RandomSentenceSpout(), /*number of executors for this spout*/1);
//an explicit task count can also be set; if it is not set, it defaults to the number of executors
builder.setBolt("SplitSentence", new SplitSentence(), /*number of executors*/2)
// /*set the task count*/.setNumTasks(10)
/*randomly distribute the tuples coming from RandomSentence*/.shuffleGrouping("RandomSentence");
//with the fieldsGrouping strategy, identical words emitted by SplitSentence are guaranteed to reach the same downstream task,
//which is the only way the per-word counts come out correct
builder.setBolt("WordCount", new WordCount(), 4)
// .setNumTasks(20)
.fieldsGrouping("SplitSentence", new Fields("word"));
Config conf = new Config();
conf.setDebug(false);
//run from the command line: submit to the Storm cluster
if (args != null && args.length > 0) {
conf.setNumWorkers(3);
StormSubmitter.submitTopologyWithProgressBar(/*名称*/args[0],/*配置*/ conf, builder.createTopology());
}
else {
//run locally
conf.setMaxTaskParallelism(20);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("WordCountTopology", conf, builder.createTopology());
//run for 10s
Thread.sleep(10000);
//stop
cluster.shutdown();
}
}
}