Connection pooling in a Spark Structured Streaming job (Kafka to InfluxDB) - is this the correct approach?

I have a Spark job in Structured Streaming that consumes data from Kafka and saves it to InfluxDB. I have implemented the connection pooling mechanism as follows:

    import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

    import org.influxdb.{InfluxDB, InfluxDBFactory}

    object InfluxConnectionPool {

      val queue = new LinkedBlockingQueue[InfluxDB]()

      def initialize(database: String): Unit = {
        while (!isConnectionPoolFull) {
          queue.put(createNewConnection(database))
        }
      }

      private def isConnectionPoolFull: Boolean = {
        val MAX_POOL_SIZE = 1000
        queue.size >= MAX_POOL_SIZE
      }

      def getConnectionFromPool: InfluxDB = {
        if (queue.size > 0) {
          val connection = queue.take()
          connection
        } else {
          System.err.println("InfluxDB connection limit reached.")
          null
        }
      }

      private def createNewConnection(database: String) = {
        val influxDBUrl = "..."
        val influxDB = InfluxDBFactory.connect(...)
        influxDB.enableBatch(10, 100, TimeUnit.MILLISECONDS)
        influxDB.setDatabase(database)
        influxDB.setRetentionPolicy(database + "_rp")
        influxDB
      }

      def returnConnectionToPool(connection: InfluxDB): Unit = {
        queue.put(connection)
      }
    }

In my Spark job, I do the following:

    def run(): Unit = {
      val spark = SparkSession
        .builder
        .appName("ETL JOB")
        .master("local[4]")
        .getOrCreate()

      ...

      // This is where I create the connection pool
      InfluxConnectionPool.initialize("dbname")

      val sdvWriter = new ForeachWriter[record] {
        var influxDB: InfluxDB = _

        def open(partitionId: Long, version: Long): Boolean = {
          influxDB = InfluxConnectionPool.getConnectionFromPool
          true
        }

        def process(record: record) = {
          // this is where I use the connection object and save the data
          MyService.saveData(influxDB, record.topic, record.value)
          InfluxConnectionPool.returnConnectionToPool(influxDB)
        }

        def close(errorOrNull: Throwable): Unit = {
        }
      }

      import spark.implicits._
      import org.apache.spark.sql.functions._

      // Read data from Kafka
      val kafkaStreamingDF = spark
        .readStream
        ....

      val sdvQuery = kafkaStreamingDF
        .writeStream
        .foreach(sdvWriter)
        .start()
    }

But when I run the job, I get the following exception:

    18/05/07 00:00:43 ERROR StreamExecution: Query [id = 6af3c096-7158-40d9-9523-13a6bffccbb8, runId = 3b620d11-9b93-462b-9929-ccd2b1ae9027] terminated with error
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 8, 192.168.222.5, executor 1): java.lang.NullPointerException
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:332)
        at com.abc.telemetry.app.influxdb.InfluxConnectionPool$.returnConnectionToPool(InfluxConnectionPool.scala:47)
        at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:55)
        at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:46)
        at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:53)
        at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)

The NPE occurs when the connection is returned to the connection pool in queue.put(connection). What am I missing here? Any help is appreciated.

P.S.: In the regular DStreams approach, I did this with the foreachPartition method (sketched below). I am not sure how to do connection reuse/pooling with Structured Streaming.
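For reference, the DStreams pattern mentioned above looks roughly like the following. This is only a minimal sketch: `stream` is assumed to be a `DStream[record]` built elsewhere (for example via KafkaUtils), and `InfluxConnectionPool` / `MyService` are the objects from the code above.

    // Sketch of the DStreams/foreachPartition pattern (assumption: `stream`
    // is a DStream[record]; MyService and InfluxConnectionPool are the
    // objects shown above). One connection is borrowed per partition,
    // used for every record in that partition, and then returned.
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val influxDB = InfluxConnectionPool.getConnectionFromPool
        try {
          records.foreach(r => MyService.saveData(influxDB, r.topic, r.value))
        } finally {
          InfluxConnectionPool.returnConnectionToPool(influxDB)
        }
      }
    }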

Solution

I am using ForeachWriter for Redis similarly, where the pool is referenced only in process(). Your code would look something like this:

    def open(partitionId: Long, version: Long): Boolean = {
      true
    }

    def process(record: record) = {
      influxDB = InfluxConnectionPool.getConnectionFromPool
      // this is where I use the connection object and save the data
      MyService.saveData(influxDB, record.topic, record.value)
      InfluxConnectionPool.returnConnectionToPool(influxDB)
    }
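A note on the NullPointerException itself: LinkedBlockingQueue.put rejects null elements, and getConnectionFromPool returns null on the executors because InfluxConnectionPool.initialize("dbname") only runs on the driver; each executor JVM gets its own, empty copy of the singleton object. So wherever the connection is borrowed (open or process), the pool also has to be filled on the executor side. Below is a minimal sketch of one way to do that; the initOnce helper and initialized flag are illustrative additions, not part of the original code or the answer above.

    // Hypothetical additions to InfluxConnectionPool: lazily fill the pool
    // once per executor JVM so getConnectionFromPool never returns null there.
    object InfluxConnectionPool {
      // ... existing queue, initialize, getConnectionFromPool, etc. from the question ...

      @volatile private var initialized = false

      def initOnce(database: String): Unit = synchronized {
        if (!initialized) {
          initialize(database) // fills this JVM's queue
          initialized = true
        }
      }
    }

    // In the ForeachWriter, call it from open() so it runs on the executor:
    def open(partitionId: Long, version: Long): Boolean = {
      InfluxConnectionPool.initOnce("dbname")
      true
    }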
