Runtime environment
Cluster: CDH6
CDH version: 6.2.0-1.cdh6.2.0.p0.967373
Using Protocol Buffer 3
- Create a Maven project and import the Protocol Buffer 3 dependency. Note: the dependency version must match the Protocol Buffer 3 package downloaded below; 3.15.4 is used here.
<dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>3.15.4</version>
</dependency>
- Write the .proto file. An example follows:
syntax = "proto3";
option java_package = "com.donews";
option java_outer_classname = "UserInfoBuf";
message UserInfo {
enum Gender {
GENDER_UNKNOWN = 0;
GENDER_MALE = 1;
GENDER_FEMALE = 2;
}
message TaobaoExtra {
string userid = 1;
string open_sid = 2;
string top_access_token = 3;
string avatar_url = 4;
string havana_sso_token = 5;
string nick = 6;
string open_id = 7;
string top_auth_code = 8;
string top_expire_time = 9;
}
message WechatExtra {
string access_token = 1;
int64 expiresIn = 2;
string refresh_token = 3;
string open_id = 4;
string scope = 5;
string nick_name = 6;
int32 sex = 7;
string province = 8;
string city = 9;
string country = 10;
string headimgurl = 11;
repeated string privilege = 12;
string unionid = 13;
}
uint64 id = 1;
string user_name = 2;
string wechat = 3;
string head_img = 4;
Gender gender = 5;
string birthday = 6;
string token = 7;
string third_party_id = 8;
bool is_new = 9;
WechatExtra wechat_extra = 10;
TaobaoExtra taobao_extra = 11;
string mobile = 12;
string invite_code = 13;
bool is_deleted = 14;
bool is_invited = 15;
string suuid = 16;
string created_at = 17;
string channel = 18;
string version_code = 19;
string package_name = 20;
}
syntax = "proto3";
定义protobuf协议版本,使用proto3的语法option java_package = "com.donews";
文件选项,生成的Java代码所在的包option java_outer_classname = "UserInfoBuf";
生成的Java类名
- Since the .java file is generated under Windows here, download the Windows package of Protocol Buffers, keeping the version consistent with the Maven dependency, i.e. 3.15.4, as shown in the figure:
- After the download completes, extract the archive into the Spark project directory.
- Run protoc from the command line to generate the .java file. Using a .proto file located in the resources directory as an example, run the following commands:
D:\IdeaProjects\spark-pb3-demo>cd protoc-3.15.4-win64/bin
D:\IdeaProjects\spark-pb3-demo\protoc-3.15.4-win64\bin>protoc.exe --proto_path ..\..\src\main\resources --java_out ..\..\src\main\scala ..\..\src\main\resources\UserInfo.proto
After the commands finish, you can see the generated file shown in the figure.
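Before wiring the generated class into Spark, a quick round trip through the wire format is a cheap way to confirm the code generation worked. This is a minimal sketch (the object name is illustrative); it assumes the generated UserInfoBuf class sits in the com.donews package as configured above:

package com.donews

// Minimal sanity check for the generated class: build a message,
// serialize it to the protobuf wire format, and parse it back.
object PB3RoundTripCheck {
  def main(args: Array[String]): Unit = {
    val original = UserInfoBuf.UserInfo.newBuilder()
      .setId(1L)
      .setUserName("test")
      .setGender(UserInfoBuf.UserInfo.Gender.GENDER_MALE)
      .build()
    val bytes: Array[Byte] = original.toByteArray // wire format
    val parsed = UserInfoBuf.UserInfo.parseFrom(bytes)
    // Protobuf messages use value equality, so a successful round
    // trip means the two messages compare equal.
    assert(parsed == original)
    println(parsed.getUserName) // prints: test
  }
}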
Spark Consumer Example 1
The code is pasted directly here; the pom file and program are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.donews</groupId>
    <artifactId>spark-pb3-demo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <spark.version>2.4.0</spark.version>
        <encoding>UTF-8</encoding>
    </properties>

    <dependencies>
        <!-- Spark dependencies -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>com.google.protobuf</groupId>
            <artifactId>protobuf-java</artifactId>
            <version>3.15.4</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <!-- Plugin for compiling Scala -->
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.2</version>
                </plugin>
                <!-- Plugin for compiling Java -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
            </plugin>
            <!-- Plugin for building the jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
package com.donews

import java.nio.ByteBuffer

import org.apache.kafka.common.serialization.{ByteBufferDeserializer, StringDeserializer}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @title PB3OuterDemo
 * @projectName spark-pb3-demo
 * @author Niclas
 * @date 2021/2/26 15:02
 * @description
 */
object PB3OuterDemo {
  /*
  spark-submit --class com.donews.PB3OuterDemo --master yarn-client --driver-memory 1g --driver-cores 1 --executor-memory 1g --executor-cores 1 --num-executors 1 --queue root.default --conf spark.dynamicAllocation.enabled=false spark-pb3-demo-1.0-SNAPSHOT.jar
  */
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .set("spark.streaming.stopGracefullyOnShutdown", "true")
      .set("spark.streaming.backpressure.enabled", "true")
      .setAppName(this.getClass.getSimpleName)
    //  .setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(3))
    val kafkaSourceParams = Map[String, Object](
      "bootstrap.servers" -> "bjxg-bd-slave01:29092,bjxg-bd-slave02:29092,bjxg-bd-slave03:29092,bjxg-bd-slave04:29092,bjxg-bd-slave05:29092",
      "group.id" -> "pb_test",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[ByteBufferDeserializer],
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val sourceTopic = Array("common-user-info")
    val stream = KafkaUtils.createDirectStream[String, ByteBuffer](
      ssc,
      PreferConsistent,
      Subscribe[String, ByteBuffer](sourceTopic, kafkaSourceParams)
    )
    stream.foreachRDD(rdd => {
      // Capture the offset ranges of this batch so they can be committed after processing
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.map(cr => {
        // Parse the raw protobuf bytes with the generated class
        val userInfoBytes: ByteBuffer = cr.value()
        val userInfo = UserInfoBuf.UserInfo.parseFrom(userInfoBytes)
        userInfo
      }).foreach(println)
      // Commit the offsets back to Kafka asynchronously
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
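Note that the job above assumes each record value in common-user-info is the raw protobuf wire format of a UserInfo message. For reference, a hypothetical producer that writes such records might look like the following sketch (the object name and single-broker address are illustrative, not part of the original project):

package com.donews

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

// Hypothetical producer sketch: serializes a UserInfo message with
// toByteArray and writes the raw bytes as the record value.
object UserInfoProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "bjxg-bd-slave01:29092") // adjust to your brokers
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[ByteArraySerializer].getName)
    val producer = new KafkaProducer[String, Array[Byte]](props)
    val userInfo = UserInfoBuf.UserInfo.newBuilder()
      .setId(1L)
      .setUserName("test")
      .build()
    producer.send(new ProducerRecord[String, Array[Byte]]("common-user-info", userInfo.toByteArray))
    producer.close()
  }
}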
When the jar was packaged and submitted to YARN, the following error occurred, as shown in the figure:
java.lang.NoSuchMethodError: com.google.protobuf.CodedInputStream.readStringRequireUtf8()Ljava/lang/String;
The fix is as follows:
Judging from the error, the problem is a conflict between the imported protobuf-java 3 dependency and the protobuf-java 2 dependency that ships with Spark. After modifying the shade plugin configuration in the pom file as below, relocating the protobuf classes to a shaded package, the program runs normally, as shown:
<configuration>
    <relocations>
        <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>shaded.com.google.protobuf</shadedPattern>
        </relocation>
    </relocations>
    <filters>
        <filter>
            <artifact>*:*</artifact>
            <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
            </excludes>
        </filter>
    </filters>
</configuration>
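To confirm the relocation actually took effect, you can list the entries of the shaded jar and check for the relocated package prefix. A quick sanity check using the JDK's jar tool and Windows findstr (the jar path follows the project's Maven coordinates):

D:\IdeaProjects\spark-pb3-demo>jar tf target\spark-pb3-demo-1.0-SNAPSHOT.jar | findstr "shaded/com/google/protobuf"

If the relocation worked, the protobuf classes appear under shaded/com/google/protobuf/ instead of com/google/protobuf/, so at runtime they no longer clash with the protobuf-java 2 classes on Spark's classpath.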
Spark Consumer Example 2
For the case of consuming a single topic, we can also parse the data by implementing Kafka's deserialization interface, org.apache.kafka.common.serialization.Deserializer. An example follows:
package com.donews;

import com.google.protobuf.InvalidProtocolBufferException;
import org.apache.kafka.common.serialization.Deserializer;

import java.util.Map;

/**
 * @author Niclas
 * @title UserInfoDeserializer
 * @projectName spark-pb3-demo
 * @date 2021/2/26 17:39
 * @description
 */
public class UserInfoDeserializer implements Deserializer<UserInfoBuf.UserInfo> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
    }

    @Override
    public UserInfoBuf.UserInfo deserialize(String topic, byte[] data) {
        if (data == null) {
            return null;
        }
        try {
            return UserInfoBuf.UserInfo.parseFrom(data);
        } catch (InvalidProtocolBufferException e) {
            e.printStackTrace();
        }
        return null;
    }

    @Override
    public void close() {
    }
}
The Spark consumer program is as follows:
val kafkaSourceParams = Map[String, Object](
  "bootstrap.servers" -> "bjxg-bd-slave01:29092,bjxg-bd-slave02:29092,bjxg-bd-slave03:29092,bjxg-bd-slave04:29092,bjxg-bd-slave05:29092",
  "group.id" -> "pb_test",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[UserInfoDeserializer],
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val sourceTopic = Array("common-user-info")
val stream = KafkaUtils.createDirectStream[String, UserInfoBuf.UserInfo](
  ssc,
  PreferConsistent,
  Subscribe[String, UserInfoBuf.UserInfo](sourceTopic, kafkaSourceParams)
)
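With this setup the record value arriving in the stream is already a parsed UserInfo, so the processing loop can use protobuf's standard generated getters directly. A minimal sketch, assuming the same imports and StreamingContext as Example 1 (the chosen fields are illustrative):

stream.foreachRDD(rdd => {
  // Offsets are captured and committed exactly as in Example 1
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.map(cr => {
    // cr.value() is already a UserInfoBuf.UserInfo; no parseFrom needed
    val userInfo = cr.value()
    (userInfo.getId, userInfo.getUserName, userInfo.getMobile)
  }).foreach(println)
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
})
ssc.start()
ssc.awaitTermination()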
The complete example code can be found at:
https://github.com/PengShuaixin/spark-pb3-demo