1 Prepare the Kafka data source
First, push the JSON record below to Kafka. It is just one simulated record; once Structured Streaming reads it, it treats it as one row of an unbounded table. The table records user access logs and has three fields: uid (the user id), timestamp (the access timestamp), and agent (the client's User-Agent).
{
"uid": "ef16382c8acce8ec",
"timestamp": 1594983278059,
"agent": "Mozilla/5.0 (Linux; Android 10; Redmi K30 5G Build/QKQ1.191222.002; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/80.0.3987.99 Mobile Safari/537.36"
}
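If you would rather script this than paste records by hand, here is a minimal Python sketch of building and serializing the record. The actual send is an assumption on my part and not shown: it would go through a producer client (for example kafka-python's KafkaProducer) or the kafka-console-producer CLI.

```python
import json

# The simulated access-log record from above (agent string shortened here)
record = {
    "uid": "ef16382c8acce8ec",
    "timestamp": 1594983278059,
    "agent": "Mozilla/5.0 (Linux; Android 10; Redmi K30 5G ...)",
}

# Kafka carries raw bytes, so the record is serialized to a JSON string
# and encoded as UTF-8 before being handed to a producer (not shown).
payload = json.dumps(record).encode("utf-8")
print(payload.decode("utf-8"))
```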
2 The code
After simulating a few more JSON records like the one above and pushing them to Kafka, we can write the Structured Streaming code. The code below is written in Groovy; if you don't know Groovy, just read it as Java without semicolons. To run it, create a file ending in .groovy in the top1024b.etl package in IDEA, paste the code below into it, and run it like you would Java. The environment and dependencies (including the Groovy dependency) were covered in my previous post.
Click here for the previous post
Main code:
package top1024b.etl

import groovy.transform.CompileStatic
import org.apache.spark.SparkConf
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

@CompileStatic
class Test02 {
    static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession
                .builder()
                .config(new SparkConf().setMaster("local[*]").set("spark.sql.shuffle.partitions", "1"))
                .appName("JavaStructuredNetworkWordCount")
                .getOrCreate()

        // Kafka source: yields one row per Kafka record
        Dataset<Row> df = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "192.168.0.1:9092")
                .option("subscribe", "user_log")
                .option("startingOffsets", "earliest")
                .load()

        DataSetSql ds = new DataSetSql(spark, df)

        // \$ escapes the dollar sign inside a Groovy triple-quoted GString
        String sql = """
            SELECT
                get_json_object ( VALUE, '\$.uid' ) as uid,
                get_json_object ( VALUE, '\$.timestamp' ) as timestamp,
                get_json_object ( VALUE, '\$.agent' ) as agent
            FROM
                t
        """.toString().trim()

        df = ds
                .exe("select CAST(value AS STRING) from t")
                .exe(sql)
                .get()

        StreamingQuery query = df.writeStream()
                .outputMode("update")
                .format("console")
                .start()

        query.awaitTermination()
    }
}
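A quick aside on get_json_object: it simply pulls one field out of a JSON string for each row. The per-row logic, sketched in plain Python purely for illustration:

```python
import json

# One row's value column after CAST(value AS STRING)
value = '{"uid": "ef16382c8acce8ec", "timestamp": 1594983278059, "agent": "Mozilla/5.0"}'

# Rough equivalent of get_json_object(value, '$.uid') and friends on one row
row = json.loads(value)
uid, timestamp, agent = row["uid"], row["timestamp"], row["agent"]
print(uid, timestamp, agent)
```

Note that get_json_object returns strings, so in the Spark job all three extracted columns come out as STRING (the timestamp included), whereas json.loads above parses the timestamp as an integer.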
Utility class:
package top1024b.etl

import groovy.transform.CompileStatic
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.SparkSession

@CompileStatic
class DataSetSql {
    private Dataset ds
    private SparkSession spark

    // Parameters must be typed: under @CompileStatic, untyped (Object)
    // parameters cannot be assigned to the typed fields.
    DataSetSql(SparkSession spark, Dataset ds) {
        this.ds = ds
        this.spark = spark
    }

    // Register the current Dataset as temp view "t", run the SQL over it,
    // and keep the result so calls can be chained.
    DataSetSql exe(String sql) {
        ds.createOrReplaceTempView("t")
        ds = spark.sql(sql)
        this
    }

    Dataset get() {
        ds
    }
}
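The whole point of DataSetSql is fluent chaining: each exe() re-registers the current result as temp view t and returns this. The same pattern, sketched in Python against a stand-in engine (FluentSql and FakeEngine are illustrative names, not part of any real library):

```python
class FluentSql:
    """Minimal sketch of the DataSetSql chaining pattern."""

    def __init__(self, engine, ds):
        self.engine = engine
        self.ds = ds

    def exe(self, sql):
        # Register the current dataset as view "t", run the SQL over it,
        # and keep the result so the next exe() sees it as "t" again.
        self.engine.register("t", self.ds)
        self.ds = self.engine.sql(sql)
        return self  # enables .exe(...).exe(...).get()

    def get(self):
        return self.ds


class FakeEngine:
    """Stand-in for SparkSession so the sketch runs without Spark."""

    def __init__(self):
        self.views = {}

    def register(self, name, ds):
        self.views[name] = ds

    def sql(self, sql):
        # "Executes" by appending the statement to a trace of what ran on "t"
        return self.views["t"] + [sql]


result = FluentSql(FakeEngine(), []).exe("step 1").exe("step 2").get()
print(result)  # -> ['step 1', 'step 2']
```

Each stage sees the previous stage's output as table t, which is exactly how the main code can run the CAST first and the get_json_object extraction second.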
After copying the code above, note the following:
- .option("kafka.bootstrap.servers", "192.168.0.1:9092"): adjust this to your Kafka brokers' actual IPs and ports; separate multiple brokers with commas.
- .option("subscribe", "user_log"): change this to the topic you pushed to; mine is user_log. Separate multiple topics with commas.
- .option("startingOffsets", "earliest"): start reading from Kafka's smallest offset. To start from the newest offset instead, replace earliest with latest.
- @CompileStatic: with this annotation, the Groovy class is compiled statically and runs about as fast as the equivalent Java.
- .exe("select CAST(value AS STRING) from t"): by default, Structured Streaming's Kafka source gives you the columns shown below. key and value are byte arrays, so CAST(value AS STRING) is needed to turn value into a string.
+----+--------------------+---------+---------+-------+--------------------+-------------+
| key| value| topic|partition| offset| timestamp|timestampType|
+----+--------------------+---------+---------+-------+--------------------+-------------+
|null|[7B 22 75 69 64 2...| user_log| 0|5109826|2020-07-20 18:20:...| 0|
+----+--------------------+---------+---------+-------+--------------------+-------------+
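That [7B 22 75 69 64 2...] in the value column really is the UTF-8 bytes of our JSON record; decoding the first few bytes confirms it:

```python
# The first bytes shown in the value column above, written as hex
raw = bytes.fromhex("7B 22 75 69 64 22")
print(raw.decode("utf-8"))  # the start of the record: {"uid"
```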
3 Run the program
Run Test02 and you will see output like the following. Each time new data arrives, the number after Batch: increments by 1 and the console prints the result table. I think I covered this in the previous post, but here it is one more time.
-------------------------------------------
Batch: 0
-------------------------------------------
Code generated in 8.759419 ms
+--------------------+-------------+--------------------+
| uid| timestamp| agent|
+--------------------+-------------+--------------------+
| 869068032689124|1595206351765|Mozilla/5.0 (Linu...|
| 869068032689124|1595206351855|Mozilla/5.0 (Linu...|
| 869068032689124|1595206352110|Mozilla/5.0 (Linu...|
| 869068032689124|1595206352592|Mozilla/5.0 (Linu...|
| 869068032689124|1595206352763|Mozilla/5.0 (Linu...|
| 869068032689124|1595206352841|Mozilla/5.0 (Linu...|
| 869068032689124|1595206354639|Mozilla/5.0 (Linu...|
| 869068032689124|1595206355869|Mozilla/5.0 (Linu...|
| 869068032689124|1595206361842|Mozilla/5.0 (Linu...|
| 869068032689124|1595206361943|Mozilla/5.0 (Linu...|
| 869068032689124|1595206362016|Mozilla/5.0 (Linu...|
| 869068032689124|1595206363860|Mozilla/5.0 (Linu...|
| 869068032689124|1595206364792|Mozilla/5.0 (Linu...|
| 869068032689124|1595206364879|Mozilla/5.0 (Linu...|
|C995DCAB-060A-433...|1595206421047|%E7%82%B9%E8%B4%A...|
|C995DCAB-060A-433...|1595206427094|%E7%82%B9%E8%B4%A...|
|C995DCAB-060A-433...|1595206429983|%E7%82%B9%E8%B4%A...|
|C995DCAB-060A-433...|1595206430600|%E7%82%B9%E8%B4%A...|
| 868144035543674|1595206639415|Mozilla/5.0 (Linu...|
| 868144035543674|1595206649778|Mozilla/5.0 (Linu...|
+--------------------+-------------+--------------------+
only showing top 20 rows
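The timestamp column is still epoch milliseconds. If you ever need to eyeball one, here is the conversion for the first value above, sketched in Python (in Spark SQL, from_unixtime, which takes seconds, could do the same job):

```python
from datetime import datetime, timezone

ts_ms = 1595206351765  # first timestamp in Batch 0 above
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
print(dt.isoformat())
```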
4 Closing remarks
Official documentation for the Structured Streaming + Kafka integration:
Click here to view it
And don't hold it against me that I wrote this in Groovy; Groovy is the best language in the world.