Background
- We are an internet finance company with many internal data consumers: the risk-control department's data analysts, strategy analysts, and anti-fraud analysts, plus report dashboards and customized user-behavior analysis for the mall and operations teams. At today's data volumes, Python and MySQL can no longer support fast, efficient analysis for these users.
- For self-service analysis we run a partially customized fork of the open-source Superset, connected to Druid, ES, and Impala. The analysts have fully migrated to this platform, and most usage sits on top of the DWS layer of our data warehouse.
- The current pain point, and the most urgent work, is that real-time data is not yet fully integrated. We tried mapping HBase tables into Impala for analysis, but that is only efficient for queries on a fixed key and cannot support the more varied, customized analysis scenarios we need.
Scope
- This post only covers a small set of classes and methods from a business point of view, focused on the concrete applications we are targeting right now. Different scenarios and trial results will be followed up in later posts; for now it is enough to record the basic operations plus some notes on efficient access. In short, the differences are all on the usage side: the real question is how to read and write efficiently in each scenario. The underlying implementation is only glanced at where needed; a deeper dive into the source code and framework design will come later.
Batch
- Our warehouse architecture and modeling remain built on Hive and Spark, so Kudu is not meant to replace the warehouse; it only stores the subset of data the business sides need, in order to serve the requirements described above.
- Batch processing will come up occasionally, but not much for now; it may matter later if we build user profiles on Kudu storage. For the moment, real-time streaming writes dominate.
Streaming
- Since the pain point is real-time data analysis, this trial will not lean heavily on the Kudu-Spark integration. The focus is efficient processing inside the streaming path, so most connections will go through the synchronous and asynchronous clients rather than the Spark connector.
API Structure
- The project has not started yet, so this is based on a first pass through the APIs. Each feature is covered method by method, showing the Spark batch, synchronous, and asynchronous usage side by side. The wrappers below can be instantiated directly; just supply the partitioning and whatever parameters the underlying API needs.
- In practice, per-partition processing (including table handles) has to be efficient: create objects like KuduTable and KuduSession as rarely as possible, ideally once per partition for all of that partition's logic, and flush only at the end, as in the sketch below.
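- A minimal sketch of that per-partition pattern. `upsertPerPartition` is a hypothetical helper, and the (id, name) columns follow the example schema used later in this post:

import org.apache.kudu.client.{KuduClient, SessionConfiguration}
import org.apache.spark.sql.DataFrame

// One client, one table handle, and one session serve the whole partition;
// operations are buffered and flushed once at the end.
def upsertPerPartition(df: DataFrame, kuduMaster: String, tableName: String): Unit = {
  df.rdd.foreachPartition { rows =>
    val client  = new KuduClient.KuduClientBuilder(kuduMaster).build()
    val table   = client.openTable(tableName)
    val session = client.newSession()
    session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH)
    rows.foreach { row =>
      val upsert = table.newUpsert()
      upsert.getRow.addInt("id", row.getAs[Int]("id"))
      upsert.getRow.addString("name", row.getAs[String]("name"))
      session.apply(upsert) // buffered; for very large partitions, flush in chunks
    }
    session.flush() // single flush for the whole partition
    session.close()
    client.close()
  }
}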
API
Dependencies
<properties>
    <scala.version>2.11.8</scala.version>
    <scala.version.simple>2.11</scala.version.simple>
    <spark.version>2.4.0</spark.version>
    <kudu.version>1.10.0</kudu.version>
    <encoding>UTF-8</encoding>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>
<repositories>
<repository>
<id>cloudera</id>
<name>cloudera maven</name>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>true</enabled>
</snapshots>
</repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>org.apache.kudu</groupId>
        <artifactId>kudu-spark2_${scala.version.simple}</artifactId>
        <version>${kudu.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kudu</groupId>
        <artifactId>kudu-client</artifactId>
        <version>${kudu.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kudu</groupId>
        <artifactId>kudu-test-utils</artifactId>
        <version>${kudu.version}</version>
    </dependency>
</dependencies>
Client
import org.apache.kudu.spark.kudu.KuduContext

val kuduContext = new KuduContext("xxxxx:7051", spark.sparkContext)
// both flavors of client can be obtained from the KuduContext
val asyncClient = kuduContext.asyncClient
val syncClient = kuduContext.syncClient
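- Outside of a Spark job (e.g. in the streaming path), the same two clients can be built directly. A minimal sketch, with "xxxxx:7051" standing in for the real master address:

import org.apache.kudu.client.{AsyncKuduClient, KuduClient}

val syncClient: KuduClient =
  new KuduClient.KuduClientBuilder("xxxxx:7051").build()
val asyncClient: AsyncKuduClient =
  new AsyncKuduClient.AsyncKuduClientBuilder("xxxxx:7051").build()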
Create Table
import org.apache.kudu.Schema
import org.apache.kudu.client.{AsyncKuduClient, CreateTableOptions, KuduClient}
import org.apache.kudu.spark.kudu.KuduContext

object KuduCreateTable {

  // Spark: create via the KuduContext
  def kuduCreateTable(
      kudu: KuduContext,
      tableName: String,
      createTableOptions: CreateTableOptions,
      schema: Schema
  ): Unit = {
    if (kudu.tableExists(tableName)) {
      println(s"Kudu table already exists! tableName: $tableName, time: ${System.currentTimeMillis()}")
    } else {
      kudu.createTable(tableName, schema, createTableOptions)
    }
  }

  // synchronous client
  def syncKuduCreateTable(syncClient: KuduClient, tableName: String, schema: Schema, createTableOptions: CreateTableOptions): Unit = {
    if (syncClient.tableExists(tableName)) {
      println(s"Table already exists! tableName: $tableName, time: ${System.currentTimeMillis()}")
    } else {
      syncClient.createTable(tableName, schema, createTableOptions)
    }
  }

  // asynchronous client: tableExists/createTable return a Deferred, so join() to wait
  def asyncKuduCreateTable(asyncClient: AsyncKuduClient, tableName: String, schema: Schema, createTableOptions: CreateTableOptions): Unit = {
    if (asyncClient.tableExists(tableName).join()) {
      println(s"Table already exists! tableName: $tableName, time: ${System.currentTimeMillis()}")
    } else {
      asyncClient.createTable(tableName, schema, createTableOptions).join()
    }
  }
}
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer

import org.apache.kudu.{ColumnSchema, Schema, Type}
import org.apache.kudu.client.CreateTableOptions

val createTableOptions = new CreateTableOptions
createTableOptions.setNumReplicas(1)
// hash-partition on "id" into 3 buckets
createTableOptions.addHashPartitions(ListBuffer("id"), 3)

val idSchemaBuilder = new ColumnSchema.ColumnSchemaBuilder("id", Type.INT32)
val nameSchemaBuilder = new ColumnSchema.ColumnSchemaBuilder("name", Type.STRING)
val schema = new Schema(ListBuffer(
  idSchemaBuilder.key(true).build(), // "id" is the primary key
  nameSchemaBuilder.key(false).build()
))
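- With the options and schema above, creating the example table is a single call to the wrapper, for instance through the synchronous client:

KuduCreateTable.syncKuduCreateTable(syncClient, "realtime_test_table_kuduMaster", schema, createTableOptions)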
Delete Table
import org.apache.kudu.client.{AsyncKuduClient, KuduClient}
import org.apache.kudu.spark.kudu.KuduContext

object KuduDeleteTable {

  // Spark: drop via the KuduContext
  def kuduDeleteTable(kudu: KuduContext, tableName: String): Unit = {
    if (kudu.tableExists(tableName)) {
      kudu.deleteTable(tableName)
    } else {
      println(s"Table does not exist, nothing to delete! tableName: $tableName, time: ${System.currentTimeMillis()}")
    }
  }

  // synchronous client
  def syncKuduDeleteTable(syncClient: KuduClient, tableName: String): Unit = {
    if (syncClient.tableExists(tableName)) {
      syncClient.deleteTable(tableName)
    } else {
      println(s"Table does not exist, nothing to delete! tableName: $tableName, time: ${System.currentTimeMillis()}")
    }
  }

  // asynchronous client: join() the Deferreds to wait for completion
  def asyncKuduDeleteTable(asyncClient: AsyncKuduClient, tableName: String): Unit = {
    if (asyncClient.tableExists(tableName).join()) {
      asyncClient.deleteTable(tableName).join()
    } else {
      println(s"Table does not exist, nothing to delete! tableName: $tableName, time: ${System.currentTimeMillis()}")
    }
  }
}
UpsertRows
import org.apache.kudu.client.{KuduSession, KuduTable}
import org.apache.kudu.spark.kudu.{KuduContext, KuduWriteOptions}
import org.apache.spark.sql.DataFrame

object KuduUpsertRows {

  // Spark: upsert a whole DataFrame
  def kuduUpsertMethod(kudu: KuduContext, dataFrame: DataFrame, tableName: String, kuduWriteOptions: KuduWriteOptions): Unit = {
    kudu.upsertRows(dataFrame, tableName, kuduWriteOptions)
  }

  // synchronous client: upsert a single row of the example (id, name) schema
  def syncKuduUpsertMethod(kuduSession: KuduSession, kuduTable: KuduTable, id: Int, name: String): Unit = {
    val upsert = kuduTable.newUpsert()
    val partialRow = upsert.getRow()
    partialRow.addInt("id", id)
    partialRow.addString("name", name)
    kuduSession.apply(upsert)
  }
}
val kuduTable = syncClient.openTable("realtime_test_table_kuduMaster")
val kuduSession = syncClient.newSession()
kuduSession.setTimeoutMillis(60000) // 60 s session timeout
val upsert = kuduTable.newUpsert()
val partialRow = upsert.getRow()
partialRow.addInt("id", 10)
partialRow.addString("name", "lujiwang")
kuduSession.apply(upsert)
kuduSession.flush()
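- One note on the flush at the end: a fresh session runs in AUTO_FLUSH_SYNC mode, where every apply() is a synchronous round trip. For the batch-then-flush pattern recommended in the API-structure section, switch the session to MANUAL_FLUSH first (the buffer size of 10000 is just an example value):

import org.apache.kudu.client.SessionConfiguration

kuduSession.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH)
kuduSession.setMutationBufferSpace(10000) // enlarge the operation buffer for bigger batches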
InsertRows
- The client usage is exactly the same as for Upsert (newInsert() instead of newUpsert()), so it is not repeated here.
import org.apache.kudu.spark.kudu.{KuduContext, KuduWriteOptions}
import org.apache.spark.sql.DataFrame

object KuduInsertRows {
  // Spark: insert a whole DataFrame (unlike upsert, this fails on duplicate keys)
  def kuduInsertMethod(kudu: KuduContext, dataFrame: DataFrame, tableName: String, kuduWriteOptions: KuduWriteOptions): Unit = {
    kudu.insertRows(dataFrame, tableName, kuduWriteOptions)
  }
}
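- A minimal usage sketch, assuming the kuduContext and spark session from the client section: build a toy DataFrame matching the (id, name) schema and insert it with default write options.

import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
KuduInsertRows.kuduInsertMethod(kuduContext, df, "realtime_test_table_kuduMaster", KuduWriteOptions())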
DeleteRows
import com.stumbleupon.async.Deferred
import org.apache.kudu.client.{AsyncKuduSession, KuduSession, KuduTable}
import org.apache.kudu.spark.kudu.{KuduContext, KuduWriteOptions}
import org.apache.spark.sql.DataFrame

object KuduDeleteRows {

  // Spark: the DataFrame only needs to contain the primary-key columns
  def kuduDeleteRows(kudu: KuduContext, dataFrame: DataFrame, tableName: String, kuduWriteOptions: KuduWriteOptions): Unit = {
    kudu.deleteRows(dataFrame, tableName, kuduWriteOptions)
  }

  // synchronous client: one Delete operation per key, since each operation
  // deletes exactly one row
  def syncKuduDeleteRows(
      kuduSession: KuduSession,
      kuduTable: KuduTable,
      rowKeyName: String,
      rowKey: Array[Object],
      rowKeyType: String
  ): Unit = {
    rowKey.foreach { data =>
      val delete = kuduTable.newDelete()
      val row = delete.getRow
      rowKeyType match {
        case "Int" => row.addInt(rowKeyName, data.toString.toInt)
        case _     => row.addString(rowKeyName, data.toString)
      }
      kuduSession.apply(delete)
    }
  }

  // asynchronous client: join() the table handle once, then apply one Delete per key
  def asyncKuduDeleteRows(
      asyncKuduSession: AsyncKuduSession,
      asyncTable: Deferred[KuduTable],
      rowKeyName: String,
      rowKey: Array[Object],
      rowKeyType: String
  ): Unit = {
    val table = asyncTable.join()
    rowKey.foreach { data =>
      val delete = table.newDelete()
      val row = delete.getRow
      rowKeyType match {
        case "Int" => row.addInt(rowKeyName, data.toString.toInt)
        case _     => row.addString(rowKeyName, data.toString)
      }
      asyncKuduSession.apply(delete)
    }
  }
}
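- For instance, deleting two rows by primary key through the synchronous client, reusing the kuduSession and kuduTable from the upsert example (keys 10 and 11 are arbitrary):

KuduDeleteRows.syncKuduDeleteRows(
  kuduSession,
  kuduTable,
  rowKeyName = "id",
  rowKey = Array[Object](Int.box(10), Int.box(11)),
  rowKeyType = "Int"
)
kuduSession.flush()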
SearchRows
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.{DataFrame, SparkSession}

object KuduSearchRows {
  // Spark: load a Kudu table as a DataFrame through the "kudu" data source
  def kuduSearchMethod(kudu: KuduContext, spark: SparkSession, tableName: String): DataFrame = {
    val kuduParams = Map(
      "kudu.master" -> kudu.kuduMaster,
      "kudu.table" -> tableName
    )
    spark.read
      .options(kuduParams)
      .format("kudu")
      .load()
  }
}
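- A short usage sketch, assuming the kuduContext and spark session from above: load the table into a DataFrame, register it as a view, and run the kind of ad-hoc SQL the analysts need (the query itself is just an example).

val kuduDF = KuduSearchRows.kuduSearchMethod(kuduContext, spark, "realtime_test_table_kuduMaster")
kuduDF.createOrReplaceTempView("realtime_test")
spark.sql("SELECT id, name FROM realtime_test WHERE id > 5").show()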
Follow-up
- The usage is really quite simple: integrate with batch processing, and in streaming make good, efficient use of the clients.
- Later posts will describe concrete usage within the project. If this area ends up needing heavy optimization, I will dig into the source-level APIs and framework internals and walk through them; I will not rehash that here, since there are plenty of architecture write-ups online. For now, the broad approaches all differ somewhat, which makes the differences between them, and how each scenario is actually implemented, all the more important.