Problem Background
A recent Spark project had the following requirement: read a set of files from an HDFS directory, where each file stores ProtoMessages defined by a proto file, and convert them to Parquet so they can be queried with SQL.
Environment Setup
1. Prepare the proto file: person_entity.proto
syntax = "proto3";
message Person { // Define the Person message
  enum Gender { // Gender enum: either Male or Female
    Male = 0;
    Female = 1;
  }
  string name = 1; // Person's name, of type string
  uint32 age = 2; // Person's age, of type uint32
  Gender gender = 3; // Person's gender, of type Gender
  // Parquet does not support self-nested types, so this field is commented out
  // repeated Person children = 4; // Person's children, a list of Person
  map<string, string> education_address_map = 5; // Map from schooling stage to school address
}
For how to generate the Java files, see: 初识Protobuf
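Assuming the proto above is compiled with ScalaPB (as configured in build.sbt below), the generated Person case class can be used roughly as follows. This is a minimal sketch; the import path is hypothetical and depends on your actual generated package.

```scala
// Hypothetical import path: adjust to the package ScalaPB actually generates
// for person_entity.proto in your project.
import person_entity.Person

object PersonDemo {
  def main(args: Array[String]): Unit = {
    // Construct a message; educationAddressMap corresponds to the
    // education_address_map field in the proto definition.
    val alice = Person(
      name = "Alice",
      age = 30,
      gender = Person.Gender.Female,
      educationAddressMap = Map("college" -> "Beijing")
    )

    // Round-trip through the binary wire format.
    val bytes: Array[Byte] = alice.toByteArray
    val parsed = Person.parseFrom(bytes)
    assert(parsed == alice)
  }
}
```

ScalaPB maps proto messages to immutable case classes, so equality after a serialize/parse round trip holds structurally.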
2. Prepare the plugins: project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
addSbtPlugin("com.thesamet" % "sbt-protoc" % "0.99.27")
libraryDependencies += "com.thesamet.scalapb" %% "compilerplugin" % "0.10.0"
3. Prepare the project build file: build.sbt
name := "protobuf_test"
version := "0.1"
scalaVersion := "2.12.12"
libraryDependencies ++= Seq(
"com.google.protobuf" % "protobuf-java" % "3.5.0",
"com.google.guava" % "guava" % "16.0.1",
"org.apache.spark" %% "spark-core" % "3.2.1" % "provided",
"org.apache.spark" %% "spark-sql" % "3.2.1" % "provided",
"com.thesamet.scalapb" %% "sparksql-scalapb" % "0.11.0-RC1",
)
PB.targets in Compile := Seq(
scalapb.gen() -> (sourceManaged in Compile).value
)
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.google.protobuf.**" -> "shadeproto.@1").inAll,
ShadeRule.rename("scala.collection.compat.**" -> "shadecompat.@1").inAll
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
Then run sbt compile in the terminal; under target/scala-2.12/src_manag
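With the generated case classes in place, the HDFS-to-Parquet conversion itself can be sketched as below. This is a sketch under assumptions: the input paths are placeholders, and it assumes each serialized Person is stored as the value of a Hadoop SequenceFile record; adjust the reader to your actual file layout.

```scala
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.sql.SparkSession
import scalapb.spark.Implicits._  // supplies an Encoder for ScalaPB messages

// Hypothetical import path for the generated class; adjust to your project.
import person_entity.Person

object ProtoToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ProtoToParquet").getOrCreate()

    // Assumption: one serialized Person per SequenceFile value.
    val persons = spark.sparkContext
      .sequenceFile[NullWritable, BytesWritable]("hdfs:///input/persons")
      .map { case (_, bytes) => Person.parseFrom(bytes.copyBytes()) }

    // The implicit Encoder[Person] from scalapb.spark.Implicits lets Spark
    // turn the RDD into a Dataset and write it out as Parquet directly.
    val ds = spark.createDataset(persons)
    ds.write.mode("overwrite").parquet("hdfs:///output/persons_parquet")

    spark.stop()
  }
}
```

Once written, the Parquet output can be loaded with spark.read.parquet and queried with Spark SQL, which was the original goal.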