1. Introduction to Real-Time System Architecture
Real-time data processing
Example scenarios: water-quality monitoring of the Yangtze River basin, Tmall Double 11 transaction totals, Amap (Gaode Maps)
Real-time plus offline processing: ad push
Requirements: high availability, high concurrency, high throughput
Message middleware / message queue
Big data: Kafka (temporary storage for incoming data)
Real-time computing systems: Spark Streaming / Storm
Databases: HBase, Redis (NoSQL)
Relational databases: MySQL, Oracle
2. Kafka Message Middleware
kafka.apache.org
3. Kafka Concepts
Broker
A Kafka cluster consists of one or more servers; each such server is called a broker.
Topic
Every message published to a Kafka cluster belongs to a category called a Topic.
(Physically, messages of different Topics are stored separately. Logically, the messages of one Topic may live on one
or more brokers, but users only need to specify the Topic to produce or consume data, without caring where the data is stored.)
Partition
A Partition is a physical concept; each Topic contains one or more Partitions.
Producer: the message producer
Publishes messages to a Kafka broker (push).
Consumer
The message consumer, a client that reads messages from a Kafka broker (pull).
Consumer Group
Each Consumer belongs to exactly one Consumer Group (a group name can be assigned to each Consumer;
if no group name is specified, the Consumer belongs to the default group).
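One concrete consequence of the Consumer Group rule: within a single group, each partition is consumed by at most one consumer, so the consumers of a group split the partitions among themselves, while a second group receives the full stream again. Below is a purely illustrative Scala sketch of such a split (round-robin here; this is not Kafka's actual assignment code, and the consumer names are made up):

object GroupAssignmentSketch {
  def main(args: Array[String]): Unit = {
    val partitions = Seq(0, 1, 2)                    // e.g. a topic with 3 partitions
    val consumers  = Seq("consumer-a", "consumer-b") // two consumers in the same group
    // each partition is assigned to exactly one consumer of the group
    val assignment = partitions.groupBy(p => consumers(p % consumers.size))
    assignment.foreach { case (c, ps) => println(s"$c -> partitions ${ps.mkString(", ")}") }
  }
}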
4. Kafka Cluster Installation
1. Install a ZooKeeper cluster.
2. Edit config/server.properties:
   broker.id=1                                  (must be unique per broker)
   host.name=node-1.xiaoniu.com                 (the NIC/host Kafka binds to)
   log.dirs=/bigdata/kafka_2.11-0.8.2.2/data    (Kafka data directory)
   zookeeper.connect=node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181
   Then copy the configured Kafka installation to the other machines and change broker.id on each of them.
3. Start a broker on every node: bin/kafka-server-start.sh -daemon config/server.properties
4. Create a topic (see also the extra example after this list):
   /bigdata/kafka_2.11-0.8.2.2/bin/kafka-topics.sh --create --zookeeper node-1:2181,node-2:2181,node-3:2181 --replication-factor 3 --partitions 3 --topic test
5. List all topics:
   /bigdata/kafka_2.11-0.8.2.2/bin/kafka-topics.sh --list --zookeeper localhost:2181
6. Write data into a topic:
   /bigdata/kafka_2.11-0.8.2.2/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic xiaoniu
7. Consume data:
   /bigdata/kafka_2.11-0.8.2.2/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic xiaoniu --from-beginning
8. Show the details of a given topic:
   /bigdata/kafka_2.11-0.8.2.2/bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
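The Spark Streaming examples in sections 11 and 12 consume a topic named wordcount, which can be created the same way as in step 4. The partition and replication counts below are assumptions; the replication factor just must not exceed the number of brokers (3 in this cluster):
   /bigdata/kafka_2.11-0.8.2.2/bin/kafka-topics.sh --create --zookeeper node-1:2181,node-2:2181,node-3:2181 --replication-factor 3 --partitions 3 --topic wordcount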
5. Kafka Producers and Consumers
6. Kafka Producer and Consumer Java API
<dependencies>
  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>0.8.2.2</version>
  </dependency>
</dependencies>
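The Spark Streaming examples in sections 10 to 12 additionally need Spark on the classpath. A sketch of the extra Maven dependencies, with assumed version numbers (the spark-streaming-kafka 1.6.x line works with Kafka 0.8.2.2; adjust to your Spark installation):
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>1.6.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.11</artifactId>
    <version>1.6.3</version>
  </dependency>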
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class ConsumerDemo {
    private static final String topic = "test888";
    private static final Integer threads = 2;

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181");
        props.put("group.id", "vvvvv");
        // smallest: consume from the very beginning; largest: only consume data produced after this consumer starts
        props.put("auto.offset.reset", "smallest");

        ConsumerConfig config = new ConsumerConfig(props);
        ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config);
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put(topic, threads);
        Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
        List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);

        // one thread per stream, each printing the messages it receives
        for (final KafkaStream<byte[], byte[]> kafkaStream : streams) {
            new Thread(new Runnable() {
                public void run() {
                    for (MessageAndMetadata<byte[], byte[]> mm : kafkaStream) {
                        String msg = new String(mm.message());
                        System.out.println(msg);
                    }
                }
            }).start();
        }
    }
}
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "node-1.xiaoniu.com:9092,node-2.xiaoniu.com:9092,node-3.xiaoniu.com:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        ProducerConfig config = new ProducerConfig(props);
        Producer<String, String> producer = new Producer<String, String>(config);
        // send 100 messages to the "test888" topic
        for (int i = 1001; i <= 1100; i++) {
            producer.send(new KeyedMessage<String, String>("test888", "xiaoniu" + i));
        }
    }
}
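The send above passes no key. KeyedMessage also has a (topic, key, message) constructor; with Kafka's default partitioner, all messages carrying the same key are hashed to the same partition, which is the usual way to influence partition placement from the producer. A minimal variant of the loop body above (the user-... key is purely illustrative):
            // hypothetical key: all messages with the same key land in the same partition
            producer.send(new KeyedMessage<String, String>("test888", "user-" + (i % 10), "xiaoniu" + i));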
7. Kafka Partitions in Detail
8. Spark Streaming in Detail
spark.apache.org
9. DStream in Detail
10. A First Streaming Program
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object TcpWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TcpWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Create the StreamingContext first; only then can DStreams be created.
    // The batch interval is 5 seconds, i.e. a small RDD is produced every 5 seconds.
    val ssc = new StreamingContext(sc, Seconds(5))
    // Create a DStream from the StreamingContext
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.10.12", 8888)
    // The WordCount logic (Transformations)
    val result: DStream[(String, Int)] = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // print is an Action: it prints each batch's result to the console
    result.print()
    // Start the real-time job
    ssc.start()
    // Wait for the job to terminate gracefully
    ssc.awaitTermination()
  }
}
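To feed this program, start a TCP server on the host and port the code connects to (192.168.10.12:8888 above) and type words into it; netcat is the usual tool (flag spelling varies slightly between netcat variants):
   nc -lk 8888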
11. Integrating Spark Streaming with Kafka
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // The older, less efficient receiver-based API, which needs a ZooKeeper connection
    val zkQuorum = "node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181"
    val groupId = "g1"
    // Topic to consume and the number of consumer threads for it
    val topic = Map[String, Int]("wordcount" -> 1)
    // Pull data from Kafka: create a DStream via KafkaUtils
    val data: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, groupId, topic)
    // The Kafka value is the actual payload
    val lines: DStream[String] = data.map(_._2)
    val result: DStream[(String, Int)] = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // Trigger the Action
    result.print()
    // Start the job
    ssc.start()
    // Wait for graceful termination
    ssc.awaitTermination()
  }
}
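As the comment notes, createStream is the older receiver-based API that goes through ZooKeeper. A minimal sketch of the receiver-less direct API added in spark-streaming-kafka 1.3, which reads from the brokers directly (the parameter values are assumptions matching this cluster):
   import kafka.serializer.StringDecoder
   import org.apache.spark.streaming.dstream.InputDStream
   import org.apache.spark.streaming.kafka.KafkaUtils

   // Sketch only: the direct API connects to the brokers instead of ZooKeeper
   val kafkaParams = Map[String, String](
     "metadata.broker.list" -> "node-1.xiaoniu.com:9092,node-2.xiaoniu.com:9092,node-3.xiaoniu.com:9092")
   val directData: InputDStream[(String, String)] =
     KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
       ssc, kafkaParams, Set("wordcount"))
Either way, test data can be typed into the topic with the console producer from section 4 (--topic wordcount).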
12. WordCount with Accumulated History
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils

object StateFulKafkaWordCount {

  /**
   * The three parts of each tuple in the iterator:
   *   1st: the grouping KEY
   *   2nd: the Values for this key in the current batch; since there may be several, they arrive in a Seq
   *   3rd: the initial value or the accumulated intermediate result from previous batches
   */
  val updateFunc = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
    // it.map(tp => (tp._1, tp._2.sum + tp._3.getOrElse(0)))
    it.map { case (x, y, z) => (x, y.sum + z.getOrElse(0)) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StateFulKafkaWordCount").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // A checkpoint directory is mandatory when accumulating results across batches
    ssc.checkpoint("./ck")
    // The older, less efficient receiver-based API, which needs a ZooKeeper connection
    val zkQuorum = "node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181"
    val groupId = "g1"
    // Topic to consume and the number of consumer threads for it
    val topic = Map[String, Int]("wordcount" -> 1)
    // Pull data from Kafka: create a DStream via KafkaUtils
    val data: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, groupId, topic)
    // The Kafka value is the actual payload
    val lines: DStream[String] = data.map(_._2)
    val result: DStream[(String, Int)] = lines.flatMap(_.split(" ")).map((_, 1))
      .updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    // Trigger the Action
    result.print()
    // Start the job
    ssc.start()
    // Wait for graceful termination
    ssc.awaitTermination()
  }
}
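A tiny worked example of what updateFunc computes, with assumed inputs: if the word spark occurs twice in the current batch (Seq(1, 1)) and its accumulated count so far is 3, the new state becomes 5:
   // assumed sample input, reusing the updateFunc defined above
   val it = Iterator(("spark", Seq(1, 1), Some(3)))
   println(StateFulKafkaWordCount.updateFunc(it).toList) // prints List((spark,5))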