Spark and Kafka are two of the more commonly used big data frameworks, and Spark ships with support for reading from and writing to Kafka. By default, Kafka topics only carry byte arrays; if we want to read and write String messages, we can use Kafka's built-in StringEncoder and StringDecoder classes. But what if we want to write whole objects to a topic?
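(For reference, the plain-String case looks roughly like the following with the old Kafka 0.8 producer API that is used throughout this article; treat it as a minimal sketch, and note that the broker address and the topic name "iteblog" are placeholders.)

import java.util.Properties

import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import kafka.serializer.StringEncoder

// Configure the producer to serialize message values as Strings
val props = new Properties()
props.put("metadata.broker.list", "www.iteblog.com:9092")
props.put("serializer.class", classOf[StringEncoder].getName)

// Send a single String message to the topic
val producer = new Producer[String, String](new ProducerConfig(props))
producer.send(new KeyedMessage[String, String]("iteblog", "hello kafka"))
producer.close()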
Don't worry: Kafka's kafka.serializer package defines two traits, Decoder and Encoder, which are the decoding and encoding abstractions for topic messages. The built-in StringDecoder and StringEncoder classes simply extend these traits and convert a String to and from a byte array using a given character encoding. Here is what the two traits look like:
/**
 * A decoder is a method of turning byte arrays into objects.
 * An implementation is required to provide a constructor that
 * takes a VerifiableProperties instance.
 */
trait Decoder[T] {
  def fromBytes(bytes: Array[Byte]): T
}

/**
 * An encoder is a method of turning objects into byte arrays.
 * An implementation is required to provide a constructor that
 * takes a VerifiableProperties instance.
 */
trait Encoder[T] {
  def toBytes(t: T): Array[Byte]
}
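For comparison, an Encoder[String] equivalent to the built-in StringEncoder takes only a few lines: it reads the target character set from the properties and calls String.getBytes. The sketch below is modeled on the built-in class; the exact built-in implementation may differ between Kafka versions.

import kafka.serializer.Encoder
import kafka.utils.VerifiableProperties

class SimpleStringEncoder(props: VerifiableProperties = null) extends Encoder[String] {
  // The character encoding is configurable via "serializer.encoding", UTF-8 by default
  private val encoding =
    if (props == null) "UTF8" else props.getString("serializer.encoding", "UTF8")

  override def toBytes(s: String): Array[Byte] =
    if (s == null) null else s.getBytes(encoding)
}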
In other words, a custom encoder and decoder only needs to implement toBytes and fromBytes respectively. So how do we turn an object into a byte array, and a byte array back into an object? Remember Java's object-serialization classes? We can use ByteArrayOutputStream together with ObjectOutputStream to turn an object into a byte array, and ByteArrayInputStream together with ObjectInputStream to turn the byte array back into an object. That is all we need; here is the implementation:
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

import kafka.serializer.{Decoder, Encoder}
import kafka.utils.VerifiableProperties

class IteblogDecoder[T](props: VerifiableProperties = null) extends Decoder[T] {

  override def fromBytes(bytes: Array[Byte]): T = {
    var t: T = null.asInstanceOf[T]
    var bi: ByteArrayInputStream = null
    var oi: ObjectInputStream = null
    try {
      // Turn the byte array back into an object via Java serialization
      bi = new ByteArrayInputStream(bytes)
      oi = new ObjectInputStream(bi)
      t = oi.readObject().asInstanceOf[T]
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (oi != null) oi.close()
      if (bi != null) bi.close()
    }
    t
  }
}

class IteblogEncoder[T](props: VerifiableProperties = null) extends Encoder[T] {

  override def toBytes(t: T): Array[Byte] = {
    if (t == null)
      null
    else {
      var bo: ByteArrayOutputStream = null
      var oo: ObjectOutputStream = null
      var bytes: Array[Byte] = null
      try {
        // Turn the object into a byte array via Java serialization
        bo = new ByteArrayOutputStream()
        oo = new ObjectOutputStream(bo)
        oo.writeObject(t)
        bytes = bo.toByteArray
      } catch {
        case ex: Exception => ex.printStackTrace()
      } finally {
        if (oo != null) oo.close()
        if (bo != null) bo.close()
      }
      bytes
    }
  }
}
With that, we have defined our own encoder and decoder. How do we use them? Suppose we have a Person class like this:
case class Person(var name: String, var age: Int)
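Before wiring the codec into Kafka, it can be sanity-checked with a quick round trip that needs no broker at all (a minimal sketch using the classes defined above):

// Encode a Person to bytes and decode it back
val encoder = new IteblogEncoder[Person]()
val decoder = new IteblogDecoder[Person]()
val bytes = encoder.toBytes(Person("wyp", 23))
println(decoder.fromBytes(bytes)) // prints Person(wyp,23)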
On the sending side we can use them like this:
def getProducerConfig(brokerAddr: String): Properties = {
  val props = new Properties()
  props.put("metadata.broker.list", brokerAddr)
  // Use our custom encoder for message values and the built-in StringEncoder for keys
  props.put("serializer.class", classOf[IteblogEncoder[Person]].getName)
  props.put("key.serializer.class", classOf[StringEncoder].getName)
  props
}

def sendMessages(topic: String, messages: List[Person], brokerAddr: String) {
  val producer = new Producer[String, Person](
    new ProducerConfig(getProducerConfig(brokerAddr)))
  producer.send(messages.map {
    new KeyedMessage[String, Person](topic, "Iteblog", _)
  }: _*)
  producer.close()
}

def main(args: Array[String]) {
  val sparkConf = new SparkConf().setAppName(this.getClass.getSimpleName)
  val ssc = new StreamingContext(sparkConf, Milliseconds(500))
  val topic = args(0)
  val brokerAddr = "http://www.iteblog.com:9092"

  val data = List(Person("wyp", 23), Person("spark", 34), Person("kafka", 23),
    Person("iteblog", 23))
  sendMessages(topic, data, brokerAddr)
}
On the receiving side we can use them like this:
val sparkConf = new SparkConf().setAppName(this.getClass.getSimpleName)
val ssc = new StreamingContext(sparkConf, Milliseconds(500))
val (topic, groupId) = (args(0), args(1))

val kafkaParams = Map("zookeeper.connect" -> "http://www.iteblog.com:2181",
  "group.id" -> groupId,
  "auto.offset.reset" -> "smallest")

// Keys are decoded with the built-in StringDecoder, values with our IteblogDecoder[Person]
val stream = KafkaUtils.createStream[String, Person, StringDecoder,
  IteblogDecoder[Person]](ssc, kafkaParams, Map(topic -> 1),
  StorageLevel.MEMORY_ONLY)
stream.foreachRDD(rdd => {
  if (rdd.count() != 0) {
    rdd.foreach(item => if (item != null) println(item))
  } else {
    println("Empty rdd!!")
  }
})
ssc.start()
ssc.awaitTermination()
ssc.stop()
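Each element of the resulting stream is a (key, value) pair, which is why the output below shows tuples. If only the Person payload is needed, it can be mapped out before processing; a minimal sketch (this would go before ssc.start()):

// Keep only the message values and print a short description of each Person
val persons = stream.map(_._2)
persons.foreachRDD(rdd => rdd.foreach(p => println(s"${p.name} is ${p.age} years old")))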
With this in place we can send any serializable object. Here is the output:
Empty rdd!!
(Iteblog,Person(wyp,23))
(Iteblog,Person(spark,34))
(Iteblog,Person(kafka,23))
(Iteblog,Person(iteblog,23))
Empty rdd!!
Empty rdd!!
Empty rdd!!
Empty rdd!!
In this example we simply print the received messages to the console; when a batch contains no messages, we print Empty rdd!! instead.