1. Comparing the Spark direct approach and the Receiver approach
Consumer: the traditional (old) consumer connects to ZooKeeper; the new (more efficient) consumer does not connect to ZooKeeper, but it must maintain its own offsets. Consumer group: a consumer group can contain multiple consumers, and within one group a message is not consumed more than once.
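As a rough sketch (not from the original notes) of what the new, broker-based consumer with manual offset handling looks like, reusing the broker host naming and topic that appear later in these notes:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object NewConsumerDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // The new consumer talks to the brokers directly; no ZooKeeper address is needed
    props.put("bootstrap.servers", "node-1.xiaoniu.com:9092")
    props.put("group.id", "g001")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    // Disable auto-commit so the application decides when offsets are recorded
    props.put("enable.auto.commit", "false")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("wordcount"))
    while (true) {
      val records = consumer.poll(1000)
      for (r <- records.asScala) {
        println(s"${r.partition()}:${r.offset()} ${r.value()}")
      }
      // Commit the offsets of this poll only after the records have been processed
      consumer.commitSync()
    }
  }
}

Committing only after processing gives at-least-once semantics; with auto-commit enabled, records that were fetched but not yet processed could be lost on a crash.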
DStream (discretized stream) is the most basic abstraction in Spark Streaming. A DStream does not itself hold data; it can be thought of as a distributed data set. Essentially a DStream is a continuous series of RDDs: it is a wrapper around RDDs and is used to process real-time data streams.
At every batch interval a DStream generates a small batch, which is submitted to the Spark engine for computation; each batch produces its own result.
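A minimal sketch of this behaviour (the socket source on localhost:9999 is an assumption, used only for illustration): every 5-second interval becomes one small RDD, and each batch produces its own word count:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchIntervalDemo")
    // Every 5 seconds the data received in that interval becomes one small RDD (one batch)
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    // Each batch is computed independently and produces its own result
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}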
Spark Streaming can be integrated with Kafka in two ways (with the 0.8 API; the 0.10 API only supports the direct approach). The first is the Receiver approach: a receiver collects the data for a batch, which is then computed. It uses the high-level consumer API (connects to ZooKeeper, updates offsets automatically, uses a WAL), but it is relatively inefficient. The second is the direct approach: each Spark Streaming task connects directly to a partition of the corresponding Kafka topic and pulls the data as an iterator, reading and computing record by record; at the end of each batch interval a small result is produced for that batch. This approach does not connect to ZooKeeper but goes straight to the brokers, and the offsets must be maintained manually (they can be stored in MySQL, Redis, or ZooKeeper), as sketched below.
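The ZooKeeper-based variant is shown in full in the next section. As a sketch of the Redis alternative mentioned above (the Jedis client and the key scheme offset:<group>:<topic>:<partition> are assumptions for illustration), saving and reloading offsets could look like this:

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.kafka.OffsetRange
import redis.clients.jedis.Jedis

object RedisOffsetStore {
  // Hypothetical key scheme: offset:<group>:<topic>:<partition> -> untilOffset of the last finished batch
  def saveOffsets(jedis: Jedis, group: String, ranges: Array[OffsetRange]): Unit = {
    ranges.foreach { o =>
      jedis.set(s"offset:$group:${o.topic}:${o.partition}", o.untilOffset.toString)
    }
  }

  // Read the starting offsets for the next run (empty map if nothing has been saved yet)
  def readOffsets(jedis: Jedis, group: String, topic: String,
                  partitions: Int): Map[TopicAndPartition, Long] = {
    (0 until partitions).flatMap { p =>
      Option(jedis.get(s"offset:$group:$topic:$p"))
        .map(v => TopicAndPartition(topic, p) -> v.toLong)
    }.toMap
  }
}

saveOffsets would be called after each batch's output completes, and readOffsets before creating the direct stream, at the same points where the ZooKeeper example below reads and writes its znodes.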
2. Integrating Spark Streaming with Kafka 0.8 using the direct approach
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import kafka.utils.{ZKGroupTopicDirs, ZkUtils}
import org.I0Itec.zkclient.ZkClient
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaDirectWordCount {
  def main(args: Array[String]): Unit = {
    val group = "g001"
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaDirectWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    val topic = "wordcount"
    // Kafka broker addresses (the Spark Streaming tasks connect directly to the Kafka partitions
    // and consume with the lower-level API, which is more efficient)
    val brokerList = "hdp20-01:9092,hdp20-02:9092,hdp20-03:9092"
    // ZooKeeper address, used later when updating the consumed offsets
    val zkQuorum = "hdp20-01:2181,hdp20-02:2181,hdp20-03:2181"
    // Set of topic names used when creating the stream
    val topics: Set[String] = Set(topic)
    // A ZKGroupTopicDirs object specifies the ZooKeeper directory where offsets are stored
    val topicDirs = new ZKGroupTopicDirs(group, topic)
    // ZooKeeper path, e.g. "/g001/offsets/wordcount/"
    val zkTopicPath = s"${topicDirs.consumerOffsetDir}"
    // Kafka parameters
    val kafkaParams = Map(
      "metadata.broker.list" -> brokerList,
      "group.id" -> group,
      "auto.offset.reset" -> kafka.api.OffsetRequest.SmallestTimeString
    )
    // Create a ZkClient for updating the offsets
    val zkClient = new ZkClient(zkQuorum)
    // Check whether the path has child nodes (they exist if we saved per-partition offsets before), e.g.
    //   /g001/offsets/wordcount/0/10001
    //   /g001/offsets/wordcount/1/30001
    //   /g001/offsets/wordcount/2/10001
    val children = zkClient.countChildren(zkTopicPath)

    var kafkaStream: InputDStream[(String, String)] = null
    // If ZooKeeper already holds offsets, use them as the starting position of the kafkaStream
    var fromOffsets: Map[TopicAndPartition, Long] = Map()
    if (children > 0) {
      for (i <- 0 until children) {
        // e.g. /g001/offsets/wordcount/0/10001
        val partitionOffset = zkClient.readData[String](s"$zkTopicPath/${i}")
        val tp = TopicAndPartition(topic, i)
        // Add each partition's offset to fromOffsets, e.g. wordcount/0 -> 10001
        fromOffsets += (tp -> partitionOffset.toLong)
      }
      // Transform each Kafka message into a (topic_name, message) tuple,
      // e.g. key: wordcount, value: "hello tom hello jerry"
      val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message())
      // Create the direct DStream with KafkaUtils (fromOffsets makes consumption continue from the saved offsets)
      // Type parameters: key type, value type, key decoder, value decoder, handler result type
      kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
        ssc, kafkaParams, fromOffsets, messageHandler)
    } else {
      // No saved offsets: start from the largest or smallest offset according to kafkaParams
      kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    }

    // Offset ranges of the current batch
    var offsetRanges = Array[OffsetRange]()
    // transform exposes the RDD of the current batch (a KafkaRDD) so its offset ranges can be captured
    val messages: DStream[(String, Int)] = kafkaStream.transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Iterate over the RDDs of the DStream
    messages.foreachRDD { rdd =>
      // Operate on the RDD to trigger the action
      rdd.foreachPartition(partition =>
        partition.foreach(x => println(x))
      )
      // Only after the batch output has completed are the offsets written back to ZooKeeper
      for (o <- offsetRanges) {
        val zkPath = s"${topicDirs.consumerOffsetDir}/${o.partition}"
        // Save this partition's end offset to ZooKeeper so the next run resumes after it
        ZkUtils.updatePersistentPath(zkClient, zkPath, o.untilOffset.toString)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
5. Installing Kafka 0.10
Kafka cluster deployment (key settings in server.properties):
broker.id=1
delete.topic.enable=true
log.dirs=/bigdata/kafka_2.11-0.10.2.1/data
zookeeper.connect=node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181
Start Kafka: /bigdata/kafka_2.11-0.10.2.1/bin/kafka-server-start.sh -daemon /bigdata/kafka_2.11-0.10.2.1/config/server.properties
Stop Kafka: /bigdata/kafka_2.11-0.10.2.1/bin/kafka-server-stop.sh
Create a topic: /bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --create --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --replication-factor 3 --partitions 3 --topic my-topic
List all topics: /bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --list --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181
Describe a topic: /bigdata/kafka_2.11-0.10.2.1/bin/kafka-topics.sh --describe --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --topic my-topic
Start a console producer: /bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-producer.sh --broker-list node-1.xiaoniu.com:9092,node-2.xiaoniu.com:9092,node-3.xiaoniu.com:9092 --topic xiaoniu
Start a console consumer (connecting via ZooKeeper): /bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --zookeeper node-1.xiaoniu.com:2181,node-2.xiaoniu.com:2181,node-3.xiaoniu.com:2181 --topic my-topic --from-beginning
# Console consumer connecting to the broker addresses instead: /bigdata/kafka_2.11-0.10.2.1/bin/kafka-console-consumer.sh --bootstrap-server node-1.xiaoniu.com:9092,node-2.xiaoniu.com:9092,node-3.xiaoniu.com:9092 --topic xiaoniu --from-beginning
Kafka Connect: https://kafka.apache.org/documentation/#connect http://docs.confluent.io/2.0.0/connect/connect-jdbc/docs/index.html
Kafka Stream: https://kafka.apache.org/documentation/streams https://spark.apache.org/docs/1.6.1/streaming-kafka-integration.html
kafka monitor: https://kafka.apache.org/documentation/#monitoring https://github.com/quantifind/KafkaOffsetMonitor https://github.com/yahoo/kafka-manager
Kafka ecosystem: https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
6. Integrating Spark Streaming with Kafka 0.10 using the direct approach
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectStream {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf; remove .setMaster("local[2]") when submitting to a cluster
    val conf = new SparkConf().setAppName("DirectStream").setMaster("local[2]")
    // Create a StreamingContext, which internally contains a SparkContext
    val streamingContext = new StreamingContext(conf, Seconds(5))

    // Kafka parameters
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "node-1.xiaoniu.com:9092,node-2.xiaoniu.com:9092,node-3.xiaoniu.com:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "test123",
      "auto.offset.reset" -> "earliest", // or "latest"
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("xiaoniu")
    // The consumed offsets are recorded in Kafka itself
    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      // Location strategy: distribute the partitions evenly across the available executors
      LocationStrategies.PreferConsistent,
      // Consumer strategy: subscribe to a fixed set of topics
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    )

    // Iterate over the RDDs (KafkaRDDs) of the DStream, one per batch interval
    stream.foreachRDD { rdd =>
      // Get the offset ranges of this RDD
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // Process the data
      rdd.foreach { line =>
        println(line.key() + " " + line.value())
      }
      // Asynchronously commit the offsets back to Kafka
      // (some time later, after the outputs have completed)
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
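As a hedged sketch (not part of the original listing), the word count from the 0.8 example can be reproduced on this 0.10 stream by transforming the ConsumerRecord values per batch and still committing the offsets only after the output:

// Sketch only: assumes the stream created above; counts words per batch,
// then commits that batch's offsets back to Kafka after the output.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val counts = rdd.map(_.value())
    .flatMap(_.split(" "))
    .map((_, 1))
    .reduceByKey(_ + _)
  counts.collect().foreach(println)
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}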
7. Spark on YARN (cluster mode)
1. Official documentation: http://spark.apache.org/docs/latest/running-on-yarn.html
2. Installation and configuration:
1) Install Hadoop: both the HDFS and YARN modules are needed; HDFS is mandatory because Spark stores its jars on HDFS at runtime.
2) Install Spark: unpack the Spark distribution on one server and edit spark-env.sh; the Spark program acts as a YARN client for submitting jobs:
export JAVA_HOME=/usr/local/jdk1.7.0_80
export HADOOP_CONF_DIR=/usr/local/hadoop-2.6.4/etc/hadoop
3) Start HDFS and YARN.
3. Run modes (cluster mode and client mode)
1) cluster mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 2 \
  --queue default \
  lib/spark-examples*.jar \
  10
Submitting a user jar in cluster mode:
./bin/spark-submit --class cn.edu360.spark.day1.WordCount \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 2 \
  --queue default \
  /home/bigdata/hello-spark-1.0.jar \
  hdfs://node-1.edu360.cn:9000/wc hdfs://node-1.edu360.cn:9000/out-yarn-1
2) client mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 2 \
  --queue default \
  lib/spark-examples*.jar \
  10
spark-shell must use client mode: ./bin/spark-shell --master yarn --deploy-mode client
3) Differences between the two modes:
cluster mode: the Driver runs inside YARN, so the application's results cannot be shown on the client. It is therefore best suited to applications that write their final results to external storage (such as HDFS, Redis, or MySQL) rather than to stdout; the client terminal only shows the basic status of the YARN job.
client mode: the Driver runs on the client and the application's results are displayed there, so it suits applications that print their results (such as spark-shell).
4) How it works
cluster mode:
The Spark Driver first starts as an ApplicationMaster inside the YARN cluster. For every job the client submits to the ResourceManager, a dedicated ApplicationMaster is allocated on one of the cluster's NodeManager nodes, and that ApplicationMaster manages the application for its whole lifecycle. The steps are:
1. The client submits a request to the ResourceManager and uploads the jar to HDFS. This involves four steps:
a) connect to the RM;
b) obtain metric, queue and resource information from the RM's ASM (ApplicationsManager);
c) upload the app jar and the spark-assembly jar;
d) set up the runtime environment and the container context (launch-container.sh and related scripts).
2. The ResourceManager asks a NodeManager for resources and creates the Spark ApplicationMaster (every SparkContext has one ApplicationMaster).
3. The NodeManager starts the ApplicationMaster, which registers with the ResourceManager's ASM.
4. The ApplicationMaster fetches the jar from HDFS and starts the SparkContext, the DAGScheduler and the YARN Cluster Scheduler.
5. The ApplicationMaster requests container resources from the ResourceManager's ASM.
6. The ResourceManager tells the NodeManagers to allocate containers, and reports about the containers come back from the ASM (each container corresponds to one executor).
7. The Spark ApplicationMaster then interacts directly with the containers (executors) to complete the distributed job.
client mode:
In client mode the Driver runs on the client and obtains resources from the RM through an ApplicationMaster. The local Driver interacts with all the executor containers and aggregates the final results. Closing the terminal is equivalent to killing the Spark application. In general, this mode is chosen when the results only need to be returned to the terminal.
After the client-side Driver submits the application to YARN, YARN starts the ApplicationMaster and then the executors. Both the ApplicationMaster and the executors run inside containers (the default container memory is 1 GB); the ApplicationMaster is allocated the driver-memory amount and each executor the executor-memory amount. Because the Driver stays on the client, the program's results can be displayed there, and the Driver runs as a process named SparkSubmit.
8. Problems encountered
ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
  at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)
  at $line3.$read$$iw$$iw.<init>(<console>:15)
  at $line3.$read$$iw.<init>(<console>:42)
  at $line3.$read.<init>(<console>:44)
  at $line3.$read$.<init>(<console>:48)
  at $line3.$read$.<clinit>(<console>)
  at $line3.$eval$.$print$lzycompute(<console>:7)
  at $line3.$eval$.$print(<console>:6)
  at $line3.$eval.$print(<console>)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
  at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
  at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
  at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
  at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
  at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
  at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
  at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
  at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
  at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
  at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
  at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
  at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
  at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
  at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
  at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:98)
  at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
  at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
  at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
  at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
  at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
  at org.apache.spark.repl.Main$.doMain(Main.scala:70)
  at org.apache.spark.repl.Main$.main(Main.scala:53)
  at org.apache.spark.repl.Main.main(Main.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/08/29 18:11:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
17/08/29 18:11:51 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
  at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)
  ... 47 elided
Fix: add the following to yarn-site.xml. These settings disable YARN's physical and virtual memory checks, which otherwise kill containers (including the ApplicationMaster) that exceed their memory limits:
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>