A consumption demo combining Spark Streaming 2.2.0 with kafka_2.10-0.10.2.1


Today we covered a simple integration of Kafka and Spark Streaming. I looked for an example online and tried to implement it myself.

1. Environment: Spark 2.2.0, Scala 2.11.8, kafka_2.10-0.10.2.1, JDK 1.8

2. Here is my pom.xml file, as follows:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>make</groupId>
  <artifactId>Spark_code_hive</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <scala.version>2.11.8</scala.version>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.9.1</hadoop.version>
    <kafka.version>0.10.2.1</kafka.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <!-- Spark core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <!-- Spark SQL -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <!--spark streaming-->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>

    <!--spark kafka-->
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
      <version>${kafka.version}</version>
      <scope>compile</scope>
    </dependency>


    <!-- HDFS Client -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>${spark.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.spark-project.hive</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>1.2.1.spark2</version>
    </dependency>
      <dependency>
          <groupId>com.databricks</groupId>
          <artifactId>spark-csv_2.11</artifactId>
          <version>1.5.0</version>
      </dependency>

      <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.27</version>
      </dependency>

    <!--fastjson -->
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>fastjson</artifactId>
      <version>1.2.47</version>
    </dependency>

  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.8</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>

3. Create a configuration file, my.properties, as follows. It holds the Kafka topic-related settings (place it under src/main/resources so it ends up on the classpath).

# kafka configs
kafka.bootstrap.servers=make.spark.com:9092,make.spark.com:9093,make.spark.com:9094
kafka.topic.source=spark-kafka-demo
kafka.topic.sink=spark-sink-test
kafka.group.id=spark_demo_gid1

4. Create the code

4.1 First, create a utility class that reads the my.properties configuration file, as follows:

package Utils

import java.util.Properties

/**
  * Utility class for reading Properties
  * Created by make on 2017-08-08 18:39
  */
object PropertiesUtil {

  /**
    * Load the configuration file and return a Properties object
    * @author make
    * @return java.util.Properties
    */
  def getProperties() :Properties = {
    val properties = new Properties()
    // Read my.properties from the resources folder on the classpath and load it into a Properties object
    val reader = getClass.getResourceAsStream("/my.properties")
    properties.load(reader)
    properties
  }

  /**
    * Get the String value for a key in the configuration file
    * @author make
    * @return String
    */
  def getPropString(key : String) : String = {
    getProperties().getProperty(key)
  }

  /**
    * Get the Int value for a key in the configuration file (other types may be needed later)
    * @author yore
    * @return Int
    */
  def getPropInt(key : String) : Int = {
    getProperties().getProperty(key).toInt
  }

  /**
    * Get the Boolean value for a key in the configuration file
    * @author make
    * @return Boolean
    */
  def getPropBoolean(key : String) : Boolean = {
    getProperties().getProperty(key).toBoolean
  }

}
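
For a quick check, here is a minimal usage sketch. The PropertiesUtilDemo object below is only for illustration and is not part of the project; the key names come from my.properties above.

import Utils.PropertiesUtil

// Quick sanity check that the utility resolves keys from my.properties
object PropertiesUtilDemo extends App {
  val brokers     = PropertiesUtil.getPropString("kafka.bootstrap.servers")
  val sourceTopic = PropertiesUtil.getPropString("kafka.topic.source")
  println(s"brokers=$brokers, sourceTopic=$sourceTopic")
}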

4.2 Create a KafkaSink class that instantiates the producer and sends data to Kafka, as follows:

package spark_stream

import java.util.concurrent.Future

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord, RecordMetadata}

/**
  * A hand-rolled KafkaSink class that instantiates a producer and sends data to the corresponding Kafka topic.
  * This is the key idea that allows us to work around running into NotSerializableExceptions.
  * Created by make on 2018-08-08 18:50
  */
class KafkaSink[K,V](createProducer: () => KafkaProducer[K, V]) extends Serializable {
  // Create the producer lazily, so it is only instantiated when first used (on the executor)
  lazy val producer = createProducer()

  /** Send a message */
  // This simply delegates to producer.send
  def send(topic : String, key : K, value : V) : Future[RecordMetadata] =
    producer.send(new ProducerRecord[K,V](topic,key,value))
  def send(topic : String, value : V) : Future[RecordMetadata] =
    producer.send(new ProducerRecord[K,V](topic,value))
}
// Companion object with simple factory methods for KafkaSink
object KafkaSink {
  import scala.collection.JavaConversions._
  def apply[K, V](config: Map[String, Object]): KafkaSink[K, V] = {
    val createProducerFunc = () => {
      val producer = new KafkaProducer[K, V](config)
      sys.addShutdownHook {
        // Ensure that, on executor JVM shutdown, the Kafka producer sends
        // any buffered messages to Kafka before shutting down.
        producer.close()
      }
      producer
    }
    // Return a KafkaSink that will create the producer lazily
    new KafkaSink(createProducerFunc)
  }
  def apply[K, V](config: java.util.Properties): KafkaSink[K, V] = apply(config.toMap)
}
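
As a standalone sanity check (outside of Spark), the class can also be used directly. This is only a sketch: the KafkaSinkSmokeTest object is not part of the original example, it is assumed to sit in the same spark_stream package, and the broker address from my.properties is assumed to be reachable.

package spark_stream

import org.apache.kafka.common.serialization.StringSerializer

object KafkaSinkSmokeTest extends App {
  val props = new java.util.Properties()
  props.setProperty("bootstrap.servers", "make.spark.com:9092")
  props.setProperty("key.serializer", classOf[StringSerializer].getName)
  props.setProperty("value.serializer", classOf[StringSerializer].getName)

  // The producer is created lazily, on the first send
  val sink = KafkaSink[String, String](props)
  sink.send("spark-sink-test", "hello from KafkaSink")
  // The shutdown hook registered inside KafkaSink closes (and flushes) the producer on exit
}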


 

4.3 Create the main driver class

package spark_stream

import java.util.Properties

import Utils.PropertiesUtil
import com.alibaba.fastjson.{JSON, JSONObject}
import org.apache.commons.lang3.StringUtils
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies._
import org.apache.spark.streaming.kafka010.LocationStrategies._


object SparkKafkaDemo extends App {
  // default a Logger Object
  val LOG = org.slf4j.LoggerFactory.getLogger(SparkKafkaDemo.getClass)

  /*if (args.length < 2) {
      System.err.println(s"""
                            |Usage: DirectKafkaWordCount <brokers> <topics>
                            |  <brokers> is a list of one or more Kafka brokers
                            |  <topics> is a list of one or more kafka topics to consume from
                            |
      """.stripMargin)
      System.exit(1)
  }*/
  // Set the log level
  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("org.apache.spark.sql").setLevel(Level.WARN)

  val Array(brokers, topics, outTopic) = /*args*/ Array(
    PropertiesUtil.getPropString("kafka.bootstrap.servers"),
    PropertiesUtil.getPropString("kafka.topic.source"),
    PropertiesUtil.getPropString("kafka.topic.sink")
  )

  // Create context
  /* Option 1: a plain StreamingContext */
  val sparkConf = new SparkConf().setMaster("local[2]")
    .setAppName("spark-kafka-demo1")
  val ssc = new StreamingContext(sparkConf, Milliseconds(1000))

  /* Option 2: via SparkSession */
  /*val spark = SparkSession.builder()
      .appName("spark-kafka-demo1")
      .master("local[2]")
      .getOrCreate()
  // Import implicits that allow Scala objects to be converted to a DataFrame
  import spark.implicits._
  val ssc = new StreamingContext(spark.sparkContext,Seconds(1))*/

  // Set the checkpoint directory
  ssc.checkpoint("spark_demo_cp1")

  // Create direct Kafka Stream with Brokers and Topics
  // Note: the topics are best passed as an Array; a Set form did not match for me
  //var topicSet = topics.split(",")/*.toSet*/
  val topicsArr: Array[String] = topics.split(",")

  // set Kafka Properties
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> brokers,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> PropertiesUtil.getPropString("kafka.group.id"),
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  /**
    * createStream is the method from the 0.8 Spark-Kafka integration package; it leaves offset
    * management to ZooKeeper.
    *
    * The 0.10 integration package uses createDirectStream, which manages offsets itself. With this
    * version you can no longer see in zkCli how far each partition has been consumed, whereas with
    * the old version you could. It is much faster than delegating to ZooKeeper, but the offsets can
    * no longer be monitored there. The method takes only 3 parameters and is the most convenient to
    * use, but by default every start reads from the latest offset; setting auto.offset.reset="earliest"
    * makes it read from the earliest offset instead. (A sketch of committing offsets manually follows
    * the full listing below.)
    *
    * Official docs @see <a href="http://spark.apache.org/docs/2.1.2/streaming-kafka-0-10-integration.html">Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)</a>
    *
    */

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](topicsArr, kafkaParams)
  )

  /** Kafka sink */
  //set producer config
  val kafkaProducer: Broadcast[KafkaSink[String, String]] = {
    val kafkaProducerConfig = {
      val p = new Properties()
      p.setProperty("bootstrap.servers", brokers)
      p.setProperty("key.serializer", classOf[StringSerializer].getName)
      p.setProperty("value.serializer", classOf[StringSerializer].getName)
      p
    }
    LOG.info("kafka producer init done!")
    // Broadcast the KafkaSink, passing in kafkaProducerConfig; the producer is instantiated inside KafkaSink
    ssc.sparkContext.broadcast(KafkaSink[String, String](kafkaProducerConfig))
  }

  var jsonObject: JSONObject = null
  // Filter and process the records arriving on the stream
  stream.filter(record => {
    // Filter out records that do not meet our requirements
    jsonObject = null
    try {
      jsonObject = JSON.parseObject(record.value)
    } catch {
      case e: Exception => {
        LOG.error("Exception while parsing the record as JSON!\t{}", e.getMessage)
      }
    }
    // Keep the record only if the value is a non-empty string and it parsed to a non-null JSON object
    StringUtils.isNotEmpty(record.value) && null != jsonObject
  }).map(record => {
    // Your own business logic would go here; since this is just a test, simply return a tuple
    jsonObject = JSON.parseObject(record.value)
    // Return a tuple: (current timestamp, the JSON date field, the JSON relater name)
    (System.currentTimeMillis(),
      jsonObject.getString("date_dt"),
      jsonObject.getString("relater_name")
    )
  }).foreachRDD(rdd => {
    if (!rdd.isEmpty()) {
      rdd.foreach(kafkaTuple => {
        // Send the data to Kafka on outTopic, i.e. the value-only send method of our KafkaSink
        // Take the broadcast value and call send on every record, then print it
        kafkaProducer.value.send(
          outTopic,
          kafkaTuple._1 + "\t" + kafkaTuple._2 + "\t" + kafkaTuple._3
        )
        // Also print the message to the console so we can inspect it
        println(kafkaTuple._1 + "\t" + kafkaTuple._2 + "\t" + kafkaTuple._3)
      })
    }
  })

  // Start the StreamingContext
  ssc.start()
  // Keep waiting for data until the context is stopped
  ssc.awaitTermination()

}
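
Since enable.auto.commit is set to false above, nothing in this demo ever commits the consumed offsets back to Kafka; on restart the job simply resumes according to auto.offset.reset. Below is a minimal sketch of committing offsets manually with the 0-10 integration API. It is not part of the original example; it would replace the foreachRDD block above and reuses the stream variable from the listing.

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Sketch: capture the offset ranges of each batch from the direct stream's RDD,
// process the batch, then commit the offsets asynchronously.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd here (e.g. the filter/map/send logic shown above) ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}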

5. On the Kafka cluster, create the corresponding producer and consumer

      Create the two corresponding topics:

bin/kafka-topics.sh --create --zookeeper make.spark.com:2181/kafka_10 --topic spark-kafka-demo --partitions 3 --replication-factor 2
bin/kafka-topics.sh --create --zookeeper make.spark.com:2181/kafka_10 --partitions 3 --replication-factor 1 --topic spark-sink-test
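
Optionally, verify that the topics were created correctly (using the same ZooKeeper chroot as above):

bin/kafka-topics.sh --describe --zookeeper make.spark.com:2181/kafka_10 --topic spark-kafka-demo
bin/kafka-topics.sh --describe --zookeeper make.spark.com:2181/kafka_10 --topic spark-sink-test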

   Create a console producer that will send data to our program:

bin/kafka-console-producer.sh --broker-list make.spark.com:9092,make.spark.com:9093,make.spark.com:9094 --topic spark-kafka-demo

  Create a console consumer that reads the data our program sends out:

bin/kafka-console-consumer.sh --bootstrap-server make.spark.com:9092,make.spark.com:9093,make.spark.com:9094 --from-beginning --topic spark-sink-test

6. Start the producer, start our program, and type the following test data into the producer window:

{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}
{"date_dt": "201808081823","relater_name": "make"}

Switch back to IDEA and you can see the printed output, which means our data has been sent out as well.

Switch to the consumer window and you can see the data has arrived there too.

At this point we have implemented a kafka -> Spark Streaming -> kafka pipeline that both receives and sends data. I learned quite a bit along the way. That's all.

Reference article: click here!! Thanks to the blogger.

 

 
