Background:
Stream-based structured processing is becoming an increasingly important ETL technique. Using SQL to process streaming data lowers the programming barrier, and it allows processing formats to be reconfigured dynamically in an engineering-friendly way.
A simple example of computing PV (page views) with Structured Streaming
Data source: Kafka topic input_std1_npanther
Input format: JSON
{
"event_siteid":"kf_3004",
"event_distinctid":"58935524",
"event_name":"page_load",
"event_time":1510209429289,
"event_properties":[
{
"scs":"1166400",
"uname":"guest378_SDK",
"cookie":"1256873337131",
"nts":"afdeb156-1717-4d10-bf1c-60e377130972",
"pgid":"SDKCoreManager",
"fl":"3.0",
"ulevel":0,
"ip":"192.168.90.184",
"useragent":"User-Agent: Dalvik/2.1.0 (Linux; U; Android 6.0; KNT-UL10 Build/HUAWEIKNT-UL10)",
"source":"input",
"login":"guest1256873337131",
"sys":"Android_6.0_KNT-UL10",
"ttl":"登录账号",
"sid":"1.5102094306410037E12",
"dv":"Phone",
"tml":"Android App",
"system":"Android",
"browser":"Unknown",
"siteid":"kf_3004",
"imei":"869394020987738",
"time":1510209429289,
"lev":"1",
"lang":"中文"
}
]
}
Output: DataFrame (Dataset)
The code:
import spark.implicits._  // needed for .as[String] and the 'value column syntax below

val lines = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaservers)
  .option("subscribe", topicInput)
  .load()
  .selectExpr("CAST(value AS STRING)")  // the Kafka value is binary; cast it to a JSON string
  .as[String]
import spark.sql
import org.apache.spark.sql.types.StructType

// Note: this schema is built but never used below; get_json_object parses
// JSON paths directly without a schema.
val schema = StructType(Nil)
  .add("event_siteid", "string")
  .add("event_distinctid", "string")
  .add("event_name", "string")
  .add("event_time", "string")
  .add("event_properties", "string")
import org.apache.spark.sql.functions._

// Extract one field per JSON path (the original had a stray double comma here)
val query = lines.select(
  get_json_object('value, "$.event_siteid").alias("siteid"),
  get_json_object('value, "$.event_name").alias("name"),
  get_json_object('value, "$.event_time").alias("timestamp"),
  get_json_object('value, "$.event_properties[0].ip").alias("ip"))
query.createTempView("jsontable")
val queryTwo = sql("select siteid, name, timestamp, ip from jsontable")
  .writeStream
  .outputMode("append")
  .format("console")
  .start()
queryTwo.awaitTermination()
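The schema defined above can actually be put to work with from_json instead of repeated get_json_object calls, so each record is parsed only once. A minimal sketch, not the author's original code; it assumes the same spark session and lines Dataset, and the nested schema for event_properties is inferred from the sample record above:

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, StructType}

// Assumption: event_properties is an array of objects, as in the sample record;
// only the fields needed downstream are declared here.
val propSchema = new StructType().add("ip", "string")
val eventSchema = new StructType()
  .add("event_siteid", "string")
  .add("event_name", "string")
  .add("event_time", "string")
  .add("event_properties", ArrayType(propSchema))

// Parse the JSON once into a struct column, then select nested fields by name.
val parsed = lines
  .select(from_json('value, eventSchema).alias("e"))
  .select(
    $"e.event_siteid".alias("siteid"),
    $"e.event_name".alias("name"),
    $"e.event_time".alias("timestamp"),
    $"e.event_properties"(0)("ip").alias("ip"))
```

With from_json, malformed records yield nulls rather than being silently dropped, and adding a field means extending the schema instead of adding another path expression.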
Note:
get_json_object is a method of org.apache.spark.sql.functions, an object that mainly provides operations on Columns.
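To illustrate that note, get_json_object behaves the same on a batch Dataset, which makes it easy to try the path expressions outside the stream. A sketch, assuming spark.implicits._ is in scope and using a trimmed-down version of the sample record:

```scala
import org.apache.spark.sql.functions.get_json_object

// A one-row batch Dataset[String]; toDS() names the column "value",
// matching the streaming code above.
val sample = Seq(
  """{"event_siteid":"kf_3004","event_name":"page_load","event_properties":[{"ip":"192.168.90.184"}]}"""
).toDS()

sample.select(
    get_json_object('value, "$.event_siteid").alias("siteid"),
    get_json_object('value, "$.event_properties[0].ip").alias("ip"))
  .show(false)
// expected: siteid = kf_3004, ip = 192.168.90.184
```

The "$.event_properties[0].ip" path shows that get_json_object supports array indexing, which is how the streaming query reaches into the first element of event_properties.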