# 拆分的数据

• split
• explode
• postexplode
• substring

## 2. 数据的拆分

#### 2.1 通过explode系列函数进行拆分

• 把一个数组值的列拆分成多行**: explode

>>> import pyspark.sql.functions as F
>>> list_data = [(1, "abc", ["p", "q", "r"]), (2, "def", ["x", "y", "z"])]
>>> schema = ["id", "col1", "col2"]
>>>
>>> df = spark.createDataFrame(list_data, schema)
>>> df.show()
+---+----+---------+
| id|col1|     col2|
+---+----+---------+
|  1| abc|[p, q, r]|
|  2| def|[x, y, z]|
+---+----+---------+

>>> df.withColumn("col2", F.explode("col2")).show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1| abc|   p|
|  1| abc|   q|
|  1| abc|   r|
|  2| def|   x|
|  2| def|   y|
|  2| def|   z|
+---+----+----+

• 把一个map值的列拆分成多个列和多行: explode

>>> from pyspark.sql import Row
>>>
>>> rows = [
...     Row(a=1, mapfield={"a": "b"}),
...     Row(a=2, mapfield={"a1": "b1"}),
...     Row(a=3, mapfield={"a2": "b2"}),
...     Row(a=4, mapfield={"a3": "b3", "a4": "b4", "a5": {"key5": "value5"}})
... ]
>>>
>>> eDF = spark.createDataFrame(rows)
>>> eDF.show(truncate=False)
+---+-----------------------------------------+
|a  |mapfield                                 |
+---+-----------------------------------------+
|1  |[a -> b]                                 |
|2  |[a1 -> b1]                               |
|3  |[a2 -> b2]                               |
|4  |[a3 -> b3, a4 -> b4, a5 -> {key5=value5}]|
+---+-----------------------------------------+

>>> eDF.select("a", F.explode(eDF.mapfield).alias("key", "value")).show()
+---+---+-------------+
|  a|key|        value|
+---+---+-------------+
|  1|  a|            b|
|  2| a1|           b1|
|  3| a2|           b2|
|  4| a3|           b3|
|  4| a4|           b4|
|  4| a5|{key5=value5}|
+---+---+-------------+


• 对list和map值进行拆分，并且添加一个index号（从0开始）posexplode
pyspark.sql.functions.posexplode(col)


>>> eDF = spark.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b", "c": "d"})])
>>> eDF.show()
+---+---------+----------------+
|  a|  intlist|        mapfield|
+---+---------+----------------+
|  1|[1, 2, 3]|[a -> b, c -> d]|
+---+---------+----------------+

# 拆分map值的列，并添加一列pos
>> eDF.select("a","intlist",F.posexplode(eDF.mapfield)).show()
+---+---------+---+---+-----+
|  a|  intlist|pos|key|value|
+---+---------+---+---+-----+
|  1|[1, 2, 3]|  0|  a|    b|
|  1|[1, 2, 3]|  1|  c|    d|
+---+---------+---+---+-----+

# 拆分数组值的列，并添加一列索引值pos
>>> eDF.select("a","mapfield",F.posexplode(eDF.intlist)).show()
+---+----------------+---+---+
|  a|        mapfield|pos|col|
+---+----------------+---+---+
|  1|[a -> b, c -> d]|  0|  1|
|  1|[a -> b, c -> d]|  1|  2|
|  1|[a -> b, c -> d]|  2|  3|
+---+----------------+---+---+

• **拆分时对空值进行填充：**posexplode_outer(spark-2.3开始)

df = spark.createDataFrame(
...     [(1, ["foo", "bar"], {"x": 1.0}), (2, [], {}), (3, None, None)],
...     ("id", "an_array", "a_map")
... )
>>> df.select("id", "an_array", posexplode_outer("a_map")).show()
+---+----------+----+----+-----+
| id|  an_array| pos| key|value|
+---+----------+----+----+-----+
|  1|[foo, bar]|   0|   x|  1.0|
|  2|        []|null|null| null|
|  3|      null|null|null| null|
+---+----------+----+----+-----+
>>> df.select("id", "a_map", posexplode_outer("an_array")).show()
+---+----------+----+----+
| id|     a_map| pos| col|
+---+----------+----+----+
|  1|[x -> 1.0]|   0| foo|
|  1|[x -> 1.0]|   1| bar|
|  2|        []|null|null|
|  3|      null|null|null|
+---+----------+----+----+


#### 2.2 split拆分数据

split函数可以把一个字符串值拆分成一个数组值，而且split函数还支持按正则表达式来进行拆分。

pyspark.sql.functions.split(str, pattern)

>>> df = spark.createDataFrame([('ab12cd',"aaaa")], ['c1','c2'])
>>> df.show()
+------+----+
|    c1|  c2|
+------+----+
|ab12cd|aaaa|
+------+----+

# 以下可以看到，拆分后的值都变成了数组值
>>> df.select('c2', F.split(df.c1, '[0-9]+').alias('r')).show()
+----+--------+
|  c2|       r|
+----+--------+
|aaaa|[ab, cd]|
+----+--------+

>>> df.select('c1', F.split(df.c2, '[0-9]+').alias('r')).show()
+------+------+
|    c1|     r|
+------+------+
|ab12cd|[aaaa]|
+------+------+


## 3. 总结

©️2019 CSDN 皮肤主题: 深蓝海洋 设计师: CSDN官方博客