2. Two ways of consuming data from Kafka
(1) Push (data is pushed to us)
Kafka/Flume -> Executor memory -> disk -> processing
If, before the buffered data has been processed:
1) the whole job fails
2) the whole cluster goes down
3) the data center loses power
then data may be consumed more than once, or may be lost (a sketch of this receiver-based API follows below).
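For reference, a minimal sketch of the receiver-based (push) API from the same spark-streaming-kafka-0-8 artifact. The ZooKeeper address, consumer group id, and per-topic receiver thread count are assumed placeholder values, not taken from the original example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaReceiverSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))
    // Receiver-based stream: offsets are tracked in ZooKeeper and records are
    // buffered in Executor memory first, which is why a crash can lose data
    // or deliver it twice. "hadoop1:2181", "test-group" and the single
    // receiver thread per topic are assumptions for this sketch.
    val lines = KafkaUtils.createStream(ssc, "hadoop1:2181", "test-group", Map("htt" -> 1))
      .map(_._2)
    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}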
Since Spark 1.3 there is a second option:
(2) Pull
Spark Streaming itself keeps track of how far it has consumed (the offsets are stored on HDFS), e.g. 0-50, so:
1) data will not be lost
2) data will not be consumed twice (a recovery sketch follows below)
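Because the consumed offsets live in the checkpoint directory, a restarted driver can rebuild the StreamingContext and resume from exactly the recorded offsets. A minimal recovery sketch, assuming a hypothetical createContext() that sets the checkpoint and wires up the Kafka DStream exactly as in the full example further below:

import org.apache.spark.streaming.StreamingContext

// Hypothetical helper for this sketch: builds a fresh StreamingContext,
// calls ssc.checkpoint(checkpointDir) and creates the direct Kafka DStream,
// exactly like the full example below.
def createContext(): StreamingContext = ???

val checkpointDir = "hdfs://hadoop1:9000/kafkastreaming"
// On a clean start this invokes createContext(); after a driver crash it
// restores the context (including the Kafka offsets) from the checkpoint,
// so batches are neither lost nor replayed.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()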
Dependency to add:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8_2.11 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
object KafkaWordCountTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(s"${this.getClass.getSimpleName}").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))
    // The checkpoint must be set: the StreamingContext records the Kafka
    // offsets it has consumed under this HDFS directory.
    ssc.checkpoint("hdfs://hadoop1:9000/kafkastreaming")
    /**
      * createDirectStream parameters:
      *   ssc: StreamingContext
      *   kafkaParams: Map[String, String]
      *   topics: Set[String]
      */
    val kafkaParams = Map("metadata.broker.list" -> "hadoop1:9092")
    val topics = Set("htt")
    /**
      * Type parameters: K, V, KD <: Decoder[K], VD <: Decoder[V]
      * Each record is a (k, v) pair:
      *   k: the message key
      *   v: the message payload (the actual data)
      */
    val kafkaDStream = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
      .map(_._2)
    kafkaDStream.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}
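To make the offset bookkeeping of the direct (pull) stream visible, here is a minimal sketch, assuming the same ssc, kafkaParams and topics as in the example above. It keeps the original (key, value) stream instead of mapping it first, because only the underlying Kafka RDD implements HasOffsetRanges:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Build the direct stream without any map(), so each batch RDD is still the
// KafkaRDD that carries its offset ranges.
val directStream = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

directStream.foreachRDD { rdd =>
  // Every RDD of a direct stream knows exactly which offsets it covers,
  // per topic and partition.
  rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { o =>
    println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
  }
}

To feed the example with test data, Kafka's console producer can be pointed at the "htt" topic (kafka-console-producer.sh --broker-list hadoop1:9092 --topic htt), sending comma-separated words so that the split(",") above has something to count.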