1 不同结构数据操作
1.1 嵌套结构
Dots(.)可用于访问嵌套列的结构和映射。
// input
{
"a": {
"b": 1
}
}
Python: events.select("a.b")
Scala: events.select("a.b")
SQL: select a.b from events
// output
{
"b": 1
}
嵌套列:SQL函数中的struct函数或只是括号可用于创建一个新的结构体。
// input
{
"a": 1,
"b": 2,
"c": 3
}
Python: events.select(struct(col("a").alias("y")).alias("x"))
Scala: events.select(struct('a as 'y) as 'x)
SQL: select named_struct("y", a) as x from events
// output
{
"x": {
"y": 1
}
}
嵌套所有列:star(*)也可用于将所有列包含在嵌套结构中。
// input
{
"a": 1,
"b": 2
}
Python: events.select(struct("*").alias("x"))
Scala: events.select(struct("*") as 'x)
SQL: select struct(*) as x from events
// output
{
"x": {
"a": 1,
"b": 2
}
}
1.2 扁平结构
一个star(*)可以用来选择结构中的所有子字段。
// input
{
"a": {
"b": 1,
"c": 2
}
}
Python: events.select("a.*")
Scala: events.select("a.*")
SQL: select a.* from events
// output
{
"b": 1,
"c": 2
}
1.3 选择单个阵列或地图元素
getItem()
或方括号(即[ ])可用于从数组或映射中选择单个元素。
// input
{
"a": [1, 2]
}
Python: events.select(col("a").getItem(0).alias("x"))
Scala: events.select('a.getItem(0) as 'x)
SQL: select a[0] as x from events
// output
{ "x": 1 }
// input
{
"a": {
"b": 1
}
}
Python: events.select(col("a").getItem("b").alias("x"))
Scala: events.select('a.getItem("b") as 'x)
SQL: select a['b'] as x from events
// output
{ "x": 1 }
为每个数组或地图元素创建一行:explode()
可用于为数组或每个键值对中的每个元素创建一个新行。这与HiveQL中的LATERAL VIEW EXPLODE类似。
// input
{
"a": [1, 2]
}
Python: events.select(explode("a").alias("x"))
Scala: events.select(explode('a) as 'x)
SQL: select explode(a) as x from events
// output
[{ "x": 1 }, { "x": 2 }]
// input
{
"a": {
"b": 1,
"c": 2
}
}
Python: events.select(explode("a").alias("x", "y"))
Scala: events.select(explode('a) as Seq("x", "y"))
SQL: select explode(a) as (x, y) from events
// output
[{ "x": "b", "y": 1 }, { "x": "c", "y": 2 }]
将多行收集到数组中:collect_list()
和collect_set()
可用于聚集物品进入阵列。
// input
[{ "x": 1 }, { "x": 2 }]
Python: events.select(collect_list("x").alias("x"))
Scala: events.select(collect_list('x) as 'x)
SQL: select collect_list(x) as x from events
// output
{ "x": [1, 2] }
// input
[{ "x": 1, "y": "a" }, { "x": 2, "y": "b" }]
Python: events.groupBy("y").agg(collect_list("x").alias("x"))
Scala: events.groupBy("y").agg(collect_list('x) as 'x)
SQL: select y, collect_list(x) as x from events group by y
// output
[{ "y": "a", "x": [1]}, { "y": "b", "x": [2]}]
从数组中的每个项目中选择一个字段:当您在数组中使用点符号时,我们返回一个新数组,该数组已从每个数组元素中选择该字段。
// input
{
"a": [
{"b": 1},
{"b": 2}
]
}
Python: events.select("a.b")
Scala: events.select("a.b")
SQL: select a.b from events
// output
{
"b": [1, 2]
}
2 Json 格式数据操作:保留列的复杂结构
本文主要讲spark2.0版本以后存在的Sparksql的一些实用的函数,帮助解决复杂嵌套的json数据格式,比如,map和嵌套结构。Spark2.1在spark 的Structured Streaming也可以使用这些功能函数。
下面几个是本文重点要讲的方法。
- A) get_json_object()
- B) from_json()
- C) to_json()
- D) explode()
- E) selectExpr()
首先,创建一个没有任何嵌套的JSon Schema
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val jsonSchema = new StructType().add("battery_level",LongType)\
.add("c02_level",LongType)\
.add("cca3",StringType)\
.add("cn", StringType)\
.add("device_id",LongType)\
.add("device_type", StringType)\
.add("signal",LongType)\
.add("ip",StringType)\
.add("temp", LongType)\
.add("timestamp", TimestampType)
// define a case class
case class DeviceData (id: Int, device: String)
// create some sample data
val eventsDS = Seq (
(0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cca3": "USA", "cn": "United States", "temp": 25, "signal": 23, "battery_level": 8, "c02_level": 917, "timestamp" :1475600496 }"""),
(1, """{"device_id": 1, "device_type": "sensor-igauge", "ip": "213.161.254.1", "cca3": "NOR", "cn": "Norway", "temp": 30, "signal": 18, "battery_level": 6, "c02_level": 1413, "timestamp" :1475600498 }"""),
(2, """{"device_id": 2, "device_type": "sensor-ipad", "ip": "88.36.5.1", "cca3": "ITA", "cn": "Italy", "temp": 18, "signal": 25, "battery_level": 5, "c02_level": 1372, "timestamp" :1475600500 }"""),
(3, """{"device_id": 3, "device_type": "sensor-inest", "ip": "66.39.173.154", "cca3": "USA", "cn": "United States", "temp": 47, "signal": 12, "battery_level": 1, "c02_level": 1447, "timestamp" :1475600502 }"""),
(4, """{"device_id": 4, "device_type": "sensor-ipad", "ip": "203.82.41.9", "cca3": "PHL", "cn": "Philippines", "temp": 29, "signal": 11, "battery_level": 0, "c02_level": 983, "timestamp" :1475600504 }"""),
(5, """{"device_id": 5, "device_type": "sensor-istick", "ip": "204.116.105.67", "cca3": "USA", "cn": "United States", "temp": 50, "signal": 16, "battery_level": 8, "c02_level": 1574, "timestamp" :1475600506 }"""),
(6, """{"device_id": 6, "device_type": "sensor-ipad", "ip": "220.173.179.1", "cca3": "CHN", "cn": "China", "temp": 21, "signal": 18, "battery_level": 9, "c02_level": 1249, "timestamp" :1475600508 }"""),
(7, """{"device_id": 7, "device_type": "sensor-ipad", "ip": "118.23.68.227", "cca3": "JPN", "cn": "Japan", "temp": 27, "signal": 15, "battery_level": 0, "c02_level": 1531, "timestamp" :1475600512 }"""),
(8, """{"device_id": 8, "device_type": "sensor-inest", "ip": "208.109.163.218", "cca3": "USA", "cn": "United States", "temp": 40, "signal": 16, "battery_level": 9, "c02_level": 1208, "timestamp" :1475600514 }"""),
(9, """{"device_id": 9, "device_type": "sensor-ipad", "ip": "88.213.191.34", "cca3": "ITA", "cn": "Italy", "temp": 19, "signal": 11, "battery_level": 0, "c02_level": 1171, "timestamp" :1475600516 }"""),
(10, """{"device_id": 10, "device_type": "sensor-igauge", "ip": "68.28.91.22", "cca3": "USA", "cn": "United States", "temp": 32, "signal": 26, "battery_level": 7, "c02_level": 886, "timestamp" :1475600518 }"""),
(11, """{"device_id": 11, "device_type": "sensor-ipad", "ip": "59.144.114.250", "cca3": "IND", "cn": "India", "temp": 46, "signal": 25, "battery_level": 4, "c02_level": 863, "timestamp" :1475600520 }"""),
(12, """{"device_id": 12, "device_type": "sensor-igauge", "ip": "193.156.90.200", "cca3": "NOR", "cn": "Norway", "temp": 18, "signal": 26, "battery_level": 8, "c02_level": 1220, "timestamp" :1475600522 }"""),
(13, """{"device_id": 13, "device_type": "sensor-ipad", "ip": "67.185.72.1", "cca3": "USA", "cn": "United States", "temp": 34, "signal": 20, "battery_level": 8, "c02_level": 1504, "timestamp" :1475600524 }"""),
(14, """{"device_id": 14, "device_type": "sensor-inest", "ip": "68.85.85.106", "cca3": "USA", "cn": "United States", "temp": 39, "signal": 17, "battery_level": 8, "c02_level": 831, "timestamp" :1475600526 }"""),
(15, """{"device_id": 15, "device_type": "sensor-ipad", "ip": "161.188.212.254", "cca3": "USA", "cn": "United States", "temp": 27, "signal": 26, "battery_level": 5, "c02_level": 1378, "timestamp" :1475600528 }"""),
(16, """{"device_id": 16, "device_type": "sensor-igauge", "ip": "221.3.128.242", "cca3": "CHN", "cn": "China", "temp": 10, "signal": 24, "battery_level": 6, "c02_level": 1423, "timestamp" :1475600530 }"""),
(17, """{"device_id": 17, "device_type": "sensor-ipad", "ip": "64.124.180.215", "cca3": "USA", "cn": "United States", "temp": 38, "signal": 17, "battery_level": 9, "c02_level": 1304, "timestamp" :1475600532 }"""),
(18, """{"device_id": 18, "device_type": "sensor-igauge", "ip": "66.153.162.66", "cca3": "USA", "cn": "United States", "temp": 26, "signal": 10, "battery_level": 0, "c02_level": 902, "timestamp" :1475600534 }"""),
(19, """{"device_id": 19, "device_type": "sensor-ipad", "ip": "193.200.142.254", "cca3": "AUT", "cn": "Austria", "temp": 32, "signal": 27, "battery_level": 5, "c02_level": 1282, "timestamp" :1475600536 }"""))\
.toDF("id", "device").as[DeviceData]
2.1 get_json_object()
select get_json_object('{"k": "foo", "v": 1.0}','$.k') as k
上面这句话的意思:从json对象
'{"k": "foo", "v": 1.0}'
中,提取键值key=k的列,重命名为k
需要给定get_json_object 一个json字段名(或者字符串),然后通过类似jsonPath的方式去拿具体的值。这个方法其实有点麻烦,如果要提取里面的是个字段,我就要写是个类似的东西,很复杂。
该方法从spark1.6开始就有了,从一个json 字符串中根据指定的json 路径抽取一个json 对象。从上面的dataset中取出部分数据,然后抽取部分字段组装成新的json 对象。比如,我们仅仅抽取:id,devicetype,ip,CCA3 code
.
val eventsFromJSONDF = Seq (
(0, """{"device_id": 0, "device_type": "sensor-ipad", "ip": "68.161.225.1", "cca3": "USA", "cn": "United States", "temp": 25, "signal": 23, "battery_level": 8, "c02_level": 917, "timestamp" :1475600496 }"""),
(1, """{"device_id": 1, "device_type": "sensor-igauge", "ip": "213.161.254.1", "cca3": "NOR", "cn": "Norway", "temp": 30, "signal": 18, "battery_level": 6, "c02_level": 1413, "timestamp" :1475600498 }"""),
(2, """{"device_id": 2, "device_type": "sensor-ipad", "ip": "88.36.5.1", "cca3": "ITA", "cn": "Italy", "temp": 18, "signal": 25, "battery_level": 5, "c02_level": 1372, "timestamp" :1475600500 }"""),
(3, """{"device_id": 3, "device_type": "sensor-inest", "ip": "66.39.173.154", "cca3": "USA", "cn": "United States", "temp": 47, "signal": 12, "battery_level": 1, "c02_level": 1447, "timestamp" :1475600502 }"""),
(4, """{"device_id": 4, "device_type": "sensor-ipad", "ip": "203.82.41.9", "cca3": "PHL", "cn": "Philippines", "temp": 29, "signal": 11, "battery_level": 0, "c02_level": 983, "timestamp" :1475600504 }"""),
(5, """{"device_id": 5, "device_type": "sensor-istick", "ip": "204.116.105.67", "cca3": "USA", "cn": "United States", "temp": 50, "signal": 16, "battery_level": 8, "c02_level": 1574, "timestamp" :1475600506 }"""),
(6, """{"device_id": 6, "device_type": "sensor-ipad", "ip": "220.173.179.1", "cca3": "CHN", "cn": "China", "temp": 21, "signal": 18, "battery_level": 9, "c02_level": 1249, "timestamp" :1475600508 }"""),
(7, """{"device_id": 7, "device_type": "sensor-ipad", "ip": "118.23.68.227", "cca3": "JPN", "cn": "Japan", "temp": 27, "signal": 15, "battery_level": 0, "c02_level": 1531, "timestamp" :1475600512 }"""),
(8 ,""" {"device_id": 8, "device_type": "sensor-inest", "ip": "208.109.163.218", "cca3": "USA", "cn": "United States", "temp": 40, "signal": 16, "battery_level": 9, "c02_level": 1208, "timestamp" :1475600514 }"""),
(9,"""{"device_id": 9, "device_type": "sensor-ipad", "ip": "88.213.191.34", "cca3": "ITA", "cn": "Italy", "temp": 19, "signal": 11, "battery_level": 0, "c02_level": 1171, "timestamp" :1475600516 }"""))\
.toDF("id", "json")
测试及输出
同理,从 " j s o n " 列 ( "json"列( "json"列("json"是一个json对象),提取出名为device_type的列,重命名为device_type,这里注意提取的小细节
"$.device_type"
,这里有一个 点
val jsDF = eventsFromJSONDF.select($"id", get_json_object($"json","$.device_type").alias("device_type"),get_json_object($"json","$.ip").alias("ip"),get_json_object($"json", "$.cca3").alias("cca3"))
jsDF.printSchema
jsDF.show
2.2 from_json()
Spark SQL提供
to_json()
了将结构体编码为字符串并将from_json()
结构检索为复杂类型的函数。使用JSON字符串作为列在读取或写入流媒体源(如Kafka)时非常有用。每个Kafka键值记录将被添加一些元数据,如摄入时间戳到Kafka,Kafka中的偏移量等。如果包含数据的“值”字段是JSON,您可以使用它from_json()来提取数据,丰富它,清理它,然后将其下游推送到卡夫卡再次或将其写入文件。
from_json()
可用于将具有JSON数据的字符串列转换为结构体。那么你可以按照上述的方式平坦化结构体,使其具有单独的列。此方法目前在SQL中不可用。
select a.k from (
select from_json('{"k": "foo", "v": 1.0}','k STRING, v STRING',map("","")) as a
)
这个方法可以给json定义一个Schema,这样在使用时,就可以直接使用a.k这种方式了,会简化很多。
// input
{
"a": "{\"b\":1}"
}
Python:
schema = StructType().add("b", IntegerType())
events.select(from_json("a", schema).alias("c"))
Scala:
val schema = new StructType().add("b", IntegerType)
events.select(from_json('a, schema) as 'c)
// output
{
"c": {
"b": 1
}
}
有时可能希望将JSON字符串的一部分仍保留为JSON,以避免模式中的复杂性过高。
// input
{
"a": "{\"b\":{\"x\":1,\"y\":{\"z\":2}}}"
}
Python:
schema = StructType().add("b", StructType().add("x", IntegerType())
.add("y", StringType()))
events.select(from_json("a", schema).alias("c"))
Scala:
val schema = new StructType().add("b", new StructType().add("x", IntegerType)
.add("y", StringType))
events.select(from_json('a, schema) as 'c)
// output
{
"c": {
"b": {
"x": 1,
"y": "{\"z\":2}"
}
}
}
与get_json_object
不同的是该方法,使用schema
去抽取单独列。
- 在dataset的api select中使用
from_json()
方法,我可以从一个 json字符串中按照指定的schema格式抽取出来作为DataFrame的列 - 也可以将所有在json中的属性和值当做一个devices的实体。我们不仅可以使用
device.arrtibute
去获取特定值,也可以使用*
通配符。
下面的例子,主要实现如下功能:
- A) 使用上述schema从json字符串中抽取属性和值,并将它们视为devices的独立列
- B) select所有列
- C) 使用.,获取部分列
val devicesDF = eventsDS.select(from_json($"device", jsonSchema) as"devices").select($"devices.*").filter($"devices.temp" > 10 and $"devices.signal" > 15)
json_tuple()
json_tuple()
可以使用JSON数据来提取字符串列中可用的字段。
// input
{
"a": "{\"b\":1}"
}
Python: events.select(json_tuple("a", "b").alias("c"))
Scala: events.select(json_tuple('a, "b") as 'c)
SQL: select json_tuple(a, "b") as c from events
// output
{ "c": 1 }
regexp_extract()
regexp_extract()
可用于使用正则表达式解析字符串。
// input
[{ "a": "x: 1" }, { "a": "y: 2" }]
Python: events.select(regexp_extract("a", "([a-z]):", 1).alias("c"))
Scala: events.select(regexp_extract('a, "([a-z]):", 1) as 'c)
SQL: select regexp_extract(a, "([a-z]):", 1) as c from events
// output
[{ "c": "x" }, { "c": "y" }]
2.3 to_json()
to_json()
可以将结构体转换成JSON字符串。当您将数据写入Kafka时,当您想要将多个列重新编码为一个列时,此方法特别有用。此方法目前在SQL中不可用。
select to_json(struct(*)) AS value
// input
{
"a": {
"b": 1
}
}
Python: events.select(to_json("a").alias("c"))
Scala: events.select(to_json('a) as 'c)
// output
{
"c": "{\"b\":1}"
}
下面使用to_json()
将获取的数据转化为json格式。将结果重新写入kafka或者保存partquet
文件。
val stringJsonDF = eventsDS.select(to_json(struct($"*"))).toDF("devices")
stringJsonDF.show
保存数据到kafka
stringJsonDF.write.format("kafka").option("kafka.bootstrap.servers","localhost:9092").option("topic", "iot-devices").save()
注意依赖
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.11
version = 2.1.0
Spark高级操作之json复杂和嵌套数据结构的操作
SparkSQL 操作 Json 格式数据
Spark SQL操作JSON字段小Tips