1. 创建map
# create_map builds a single MapType column from alternating key/value columns.
from pyspark.sql.functions import create_map

df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
df.show()

# Varargs form; create_map equally accepts a single list of columns.
df.select(create_map(df.name, df.age).alias("map")).show()
# +-------------------+
# | map|
# +-------------------+
# |Map(John Doe -> 21)|
# +-------------------+
2. 创建列表
# array() packs the given columns into one ArrayType column per row.
from pyspark.sql.functions import array

# Column objects work the same as the column-name strings 'age', 'age'.
arr = array(df.age, df.age).alias("arr")
df.select(arr).show()
# +--------+
# | arr|
# +--------+
# |[21, 21]|
# +--------+
3. 元素存在判断
相当于 pandas 的 Series.isin;pandas 没有 notin,取反可用 ~Series.isin(...)。
# array_contains tests whether each row's array column holds the given value.
from pyspark.sql.functions import array_contains

df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])

# No alias, so the result column keeps the generated name shown below.
has_a = array_contains(df.data, "a")
df.select(has_a).show()
# +-----------------------+
# |array_contains(data, a)|
# +-----------------------+
# | true|
# | false|
# +-----------------------+
4. 数据拉直
这是我造的名词,大概意思是,如果col的值是列表之类的复合数据,则将每个数据单独赋予一行。
Returns a new row for each element in the given array or map
from pyspark.sql import Row
from pyspark.sql.functions import explode

# One row whose columns hold an array and a map.
eDF = spark.createDataFrame([Row(a=1, intlist=[1, 2, 3], mapfield={"a": "b"})])
eDF.show()
# +---+---------+-----------+
# | a| intlist| mapfield|
# +---+---------+-----------+
# | 1|[1, 2, 3]|Map(a -> b)|
# +---+---------+-----------+

# Exploding an array emits one row per element.
eDF.select(explode(eDF.intlist).alias("anInt")).show()
# |anInt|
# +-----+
# | 1|
# | 2|
# | 3|
# +-----+

# Exploding a map emits one (key, value) row per entry.
eDF.select(explode(eDF.mapfield).alias("key", "value")).show()
# +---+-----+
# |key|value|
# +---+-----+
# | a| b|
# +---+-----+
5. posexplode
# posexplode is explode plus a 0-based position for each emitted element.
from pyspark.sql import Row
from pyspark.sql.functions import posexplode

eDF = spark.createDataFrame([Row(a=1, intlist=[1, 2, 3], mapfield={"a": "b"})])
eDF.show()
# +---+---------+-----------+
# | a| intlist| mapfield|
# +---+---------+-----------+
# | 1|[1, 2, 3]|Map(a -> b)|
# +---+---------+-----------+

# Without an alias the output columns default to pos and col.
eDF.select(posexplode(eDF.intlist)).show()
# +---+---+
# |pos|col|
# +---+---+
# | 0| 1|
# | 1| 2|
# | 2| 3|
# +---+---+
6. json操作
6.1. get_json_object
6.2. json_tuple
6.3. from_json
6.4. to_json
7. 列表排序
# sort_array sorts each row's array by the elements' natural ordering.
# Ascending is the default; pass asc=False for descending.
from pyspark.sql.functions import sort_array

df = spark.createDataFrame([([2, 1, 3],), ([1],), ([],)], ['data'])

df.select(sort_array(df.data).alias('r')).show()
# +---------+
# | r|
# +---------+
# |[1, 2, 3]|
# | [1]|
# | []|
# +---------+

df.select(sort_array(df.data, asc=False).alias('r')).show()
# +---------+
# | r|
# +---------+
# |[3, 2, 1]|
# | [1]|
# | []|
# +---------+