I recently started working with PySpark, where the DataFrame API is both important and very convenient, so I am recording my study notes here.
For detailed usage, see the pyspark.sql module in the official documentation, which describes the DataFrame API thoroughly.
1. Creating a DataFrame (or reading one in)
The most common way is to create a DataFrame from the result of a SQL query.
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import *

# in the pyspark shell a SparkSession named `spark` already exists;
# in a standalone script, build one first:
spark = SparkSession.builder.getOrCreate()

df = spark.sql("select * from table_name")
You can also use toDF():
from pyspark.sql import Row

row = Row("spe_id", "InOther")
x = ['x1', 'x2']
y = ['y1', 'y2']
# sc is the SparkContext (spark.sparkContext when not in the pyspark shell)
new_df = sc.parallelize([row(x[i], y[i]) for i in range(2)]).toDF()
Of course, you can also create a DataFrame as shown below. We build the following dataset here to illustrate the DataFrame operations in the rest of these notes.
test = []
test.append((1, 'age', '30', 50, 40))
test.append((1, 'city', 'beijing', 50, 40))
test.append((1, 'gender', 'fale', 50, 40))
test.append((1, 'height', '172cm', 50, 40))
test.append((1, 'weight', '70kg', 50, 40))
test.append((2, 'age', '26', 100, 80))
test.append((2, 'city', 'beijing', 100, 80))
test.append((2, 'gender', 'fale', 100, 80))
test.append((2, 'height', '170cm', 100, 80))
test.append((2, 'weight', '65kg', 100, 80))
test.append((3, 'age', '35', 99, 99))
test.append((3, 'city', 'nanjing', 99, 99))
test.append((3, 'gender', 'female', 99, 99))
test.append((3, 'height', '161cm', 99, 99))
test.append((3, 'weight', '50kg', 99, 99))
df = spark.createDataFrame(test, ['user_id', 'attr_name', 'attr_value', 'income', 'expenses'])
createDataFrame also takes a samplingRatio parameter: when a column's type cannot be determined from the first rows, Spark samples that fraction of the data to infer it. We usually set it to 1 so that the whole dataset is scanned for type inference; then, as long as the column has at least one non-null value, its inferred type will not end up as null (NullType).
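As a minimal sketch of why samplingRatio matters (the column names below are made up, and a SparkSession named spark is assumed): the first rows of the city column are None, so inferring the type from the head of the data alone could yield NullType, while samplingRatio=1.0 makes Spark look at every row.

rdd = spark.sparkContext.parallelize([
    (1, None),
    (2, None),
    (3, 'beijing'),   # the only non-null value decides the column type
])
df_inferred = spark.createDataFrame(rdd, ['user_id', 'city'], samplingRatio=1.0)
df_inferred.printSchema()   # city should come out as string, not null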
2. Querying
2.1 Row-level query operations
Printing data
df.show() prints the first 20 rows by default; you can also specify exactly how many rows to print.
If some values are very long, PySpark truncates them and the output gets cut off; use df.show(truncate=False) in that case.
>>> df.show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
| 1| height| 172cm| 50| 40|
| 1| weight| 70kg| 50| 40|
| 2| age| 26| 100| 80|
| 2| city| beijing| 100| 80|
| 2| gender| fale| 100| 80|
| 2| height| 170cm| 100| 80|
| 2| weight| 65kg| 100| 80|
| 3| age| 35| 99| 99|
| 3| city| nanjing| 99| 99|
| 3| gender| female| 99| 99|
| 3| height| 161cm| 99| 99|
| 3| weight| 50kg| 99| 99|
+-------+---------+----------+------+--------+

>>> df.show(3)
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
+-------+---------+----------+------+--------+
only showing top 3 rows
Printing the schema
>>> df.printSchema()
root
 |-- user_id: long (nullable = true)
 |-- attr_name: string (nullable = true)
 |-- attr_value: string (nullable = true)
 |-- income: long (nullable = true)
 |-- expenses: long (nullable = true)
Counting the rows
>>> df.count()
15
Fetching the first few rows to the driver
>>> list = df.head(3)
>>> df.head(3)
[Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40), Row(user_id=1, attr_name=u'city', attr_value=u'beijing', income=50, expenses=40), Row(user_id=1, attr_name=u'gender', attr_value=u'fale', income=50, expenses=40)]
>>> df.take(5)
[Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40), Row(user_id=1, attr_name=u'city', attr_value=u'beijing', income=50, expenses=40), Row(user_id=1, attr_name=u'gender', attr_value=u'fale', income=50, expenses=40), Row(user_id=1, attr_name=u'height', attr_value=u'172cm', income=50, expenses=40), Row(user_id=1, attr_name=u'weight', attr_value=u'70kg', income=50, expenses=40)]
Selecting rows where a column is null
>>> from pyspark.sql.functions import isnull
>>> df.filter(isnull("income")).show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
+-------+---------+----------+------+--------+
collect() returns a Python list whose elements are Row objects:
>>> df.collect()
[Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40), Row(user_id=1, attr_name=u'city', attr_value=u'beijing', income=50, expenses=40), Row(user_id=1, attr_name=u'gender', attr_value=u'fale', income=50, expenses=40), Row(user_id=1, attr_name=u'height', attr_value=u'172cm', income=50, expenses=40), Row(user_id=1, attr_name=u'weight', attr_value=u'70kg', income=50, expenses=40), Row(user_id=2, attr_name=u'age', attr_value=u'26', income=100, expenses=80), Row(user_id=2, attr_name=u'city', attr_value=u'beijing', income=100, expenses=80), Row(user_id=2, attr_name=u'gender', attr_value=u'fale', income=100, expenses=80), Row(user_id=2, attr_name=u'height', attr_value=u'170cm', income=100, expenses=80), Row(user_id=2, attr_name=u'weight', attr_value=u'65kg', income=100, expenses=80), Row(user_id=3, attr_name=u'age', attr_value=u'35', income=99, expenses=99), Row(user_id=3, attr_name=u'city', attr_value=u'nanjing', income=99, expenses=99), Row(user_id=3, attr_name=u'gender', attr_value=u'female', income=99, expenses=99), Row(user_id=3, attr_name=u'height', attr_value=u'161cm', income=99, expenses=99), Row(user_id=3, attr_name=u'weight', attr_value=u'50kg', income=99, expenses=99)]
Note: this method pulls all of the data to the driver and returns it as a list. Each element of the list is a Row, and you can index into a Row to get individual field values.
>>> list = df.collect()
>>> list[0]
Row(user_id=1, attr_name=u'age', attr_value=u'30', income=50, expenses=40)
>>> list[0][1]
u'age'
Summary statistics
>>> df.describe().show()
+-------+------------------+---------+------------------+-----------------+------------------+
|summary| user_id|attr_name| attr_value| income| expenses|
+-------+------------------+---------+------------------+-----------------+------------------+
| count| 15| 15| 15| 15| 15|
| mean| 2.0| null|30.333333333333332| 83.0| 73.0|
| stddev|0.8451542547285166| null| 4.509249752822894|24.15722311383137|25.453037988757707|
| min| 1| age| 161cm| 50| 40|
| max| 3| weight| nanjing| 100| 99|
+-------+------------------+---------+------------------+-----------------+------------------+
Deduplication (set-style operations)
>>> df.select('user_id').distinct().show()
+-------+
|user_id|
+-------+
| 1|
| 3|
| 2|
+-------+
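A closely related call (not from the original notes) is dropDuplicates, which deduplicates on a chosen subset of columns; a quick sketch on the df above:

# keep one row per (user_id, attr_name) pair
df.dropDuplicates(['user_id', 'attr_name']).show()
# with no argument it behaves like distinct() over all columns
df.dropDuplicates().count()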
2.2 Column operations
Selecting one or more columns: select
df["age"]
df.age
df.select("name")
df.select(df['name'], df['age'] + 1)
df.select(df.a, df.b, df.c)  # select columns a, b and c
df.select(df["a"], df["b"], df["c"])  # select columns a, b and c
Conditional selection with where
>>> df.where("income = 50").show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
| 1| height| 172cm| 50| 40|
| 1| weight| 70kg| 50| 40|
+-------+---------+----------+------+--------+
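For reference, filter is an alias of where, and Column expressions can be combined with & and | (a quick sketch on the df above):

df.filter(df.income > 50).show()
# multiple conditions need parentheses around each comparison
df.where((df.income == 100) & (df.attr_name == 'age')).show()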
2.3 Sorting
orderBy: sort by the given column(s), ascending by default.
>>> df.orderBy(df.income.desc()).show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
| 2| gender| fale| 100| 80|
| 2| weight| 65kg| 100| 80|
| 2| height| 170cm| 100| 80|
| 2| age| 26| 100| 80|
| 2| city| beijing| 100| 80|
| 3| gender| female| 99| 99|
| 3| age| 35| 99| 99|
| 3| height| 161cm| 99| 99|
| 3| weight| 50kg| 99| 99|
| 3| city| nanjing| 99| 99|
| 1| age| 30| 50| 40|
| 1| height| 172cm| 50| 40|
| 1| city| beijing| 50| 40|
| 1| weight| 70kg| 50| 40|
| 1| gender| fale| 50| 40|
+-------+---------+----------+------+--------+
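orderBy also accepts several columns, and sort is an equivalent spelling; a quick sketch:

# income descending, then user_id ascending as a tie-breaker
df.orderBy(df.income.desc(), df.user_id.asc()).show(5)
# same idea with sort and the ascending flag
df.sort('income', ascending=False).show(5)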
2.4 Sampling
sample draws a random sample from a DataFrame; withReplacement = True or False controls whether sampling is done with replacement, and 42 below is the random seed.
t1 = train.sample(False, 0.2, 42)
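Here train is a DataFrame from a different example dataset; the same call on the df built above would look like this (a sketch: roughly half the rows, without replacement, seed 42):

sampled = df.sample(False, 0.5, 42)
sampled.show()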
3. Adding and modifying columns
Adding a column
withColumn returns a new DataFrame by adding a column, or replacing an existing column with the same name.
However, the following raises an error:
>>> df.withColumn('label', 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/software/servers/10k/mart_vdp/spark/python/pyspark/sql/dataframe.py", line 1848, in withColumn
    assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column
The error AssertionError: col should be Column means the second argument must be a Column expression rather than a plain Python value. There are two ways to build one.
One way is through pyspark.sql.functions:
>>> from pyspark.sql.functions import *
>>> df.withColumn('label', lit(0))
DataFrame[user_id: bigint, attr_name: string, attr_value: string, income: bigint, expenses: bigint, label: int]
The other way is to derive the new column from an existing one:
>>> df.withColumn('income1', df.income+10).show(5)
+-------+---------+----------+------+--------+-------+
|user_id|attr_name|attr_value|income|expenses|income1|
+-------+---------+----------+------+--------+-------+
| 1| age| 30| 50| 40| 60|
| 1| city| beijing| 50| 40| 60|
| 1| gender| fale| 50| 40| 60|
| 1| height| 172cm| 50| 40| 60|
| 1| weight| 70kg| 50| 40| 60|
+-------+---------+----------+------+--------+-------+
only showing top 5 rows
Renaming a column
>>> df.withColumnRenamed("income", "income2").show(3)
+-------+---------+----------+-------+--------+
|user_id|attr_name|attr_value|income2|expenses|
+-------+---------+----------+-------+--------+
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
+-------+---------+----------+-------+--------+
only showing top 3 rows
4. Combining DataFrames: join / union
4.1 Row-wise concatenation with union
union stacks two DataFrames by rows (duplicates are kept, as in SQL UNION ALL):
>>> df.union(df).show()
+-------+---------+----------+------+--------+
|user_id|attr_name|attr_value|income|expenses|
+-------+---------+----------+------+--------+
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
| 1| height| 172cm| 50| 40|
| 1| weight| 70kg| 50| 40|
| 2| age| 26| 100| 80|
| 2| city| beijing| 100| 80|
| 2| gender| fale| 100| 80|
| 2| height| 170cm| 100| 80|
| 2| weight| 65kg| 100| 80|
| 3| age| 35| 99| 99|
| 3| city| nanjing| 99| 99|
| 3| gender| female| 99| 99|
| 3| height| 161cm| 99| 99|
| 3| weight| 50kg| 99| 99|
| 1| age| 30| 50| 40|
| 1| city| beijing| 50| 40|
| 1| gender| fale| 50| 40|
| 1| height| 172cm| 50| 40|
| 1| weight| 70kg| 50| 40|
+-------+---------+----------+------+--------+
only showing top 20 rows
4.2 Join on a condition
Join on a single column
The join method for merging two tables:
df_join = df_left.join(df_right, df_left.key == df_right.key, "inner")
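As a small sketch on the df built above (user_info and its level column are made-up names), joining on an explicit condition versus joining on a column name:

user_info = spark.createDataFrame([(1, 'vip'), (2, 'normal')], ['user_id', 'level'])
# explicit condition: both user_id columns are kept in the result
df.join(user_info, df.user_id == user_info.user_id, 'inner').show()
# joining on the column name (or a list of names) keeps a single user_id column
df.join(user_info, 'user_id', 'left').show()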
4.3 Union, intersection, and difference
Let's look at an example; first build two DataFrames:
sentenceDataFrame = spark.createDataFrame((
    (1, "asf"),
    (2, "2143"),
    (3, "rfds")
)).toDF("label", "sentence")
sentenceDataFrame.show()

sentenceDataFrame1 = spark.createDataFrame((
    (1, "asf"),
    (2, "2143"),
    (4, "f8934y")
)).toDF("label", "sentence")
# difference
newDF = sentenceDataFrame1.select("sentence").subtract(sentenceDataFrame.select("sentence"))
newDF.show()
+--------+
|sentence|
+--------+
| f8934y|
+--------+

# intersection
newDF = sentenceDataFrame1.select("sentence").intersect(sentenceDataFrame.select("sentence"))
newDF.show()
+--------+
|sentence|
+--------+
| asf|
| 2143|
+--------+

# union
newDF = sentenceDataFrame1.select("sentence").union(sentenceDataFrame.select("sentence"))
newDF.show()
+--------+
|sentence|
+--------+
| asf|
| 2143|
| f8934y|
| asf|
| 2143|
| rfds|
+--------+
4.4 Splitting one row into multiple rows: explode
Sometimes you need to split the content of a column and generate one row per piece; the explode method does this. The idea is to split the c3 column on the spaces it contains and store the pieces in a new column c3_, roughly as sketched below.
Note: because Spark is lazy, transformations such as explode are not actually executed until an action (for example show() or count()) is triggered.
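A minimal sketch of that idea (the original code was not preserved here; tmp, c1, c3 and c3_ are illustrative names):

from pyspark.sql.functions import split, explode

tmp = spark.createDataFrame([(1, 'a b c'), (2, 'd e')], ['c1', 'c3'])
# split c3 on spaces into an array, then explode it into one row per element
tmp.withColumn('c3_', explode(split(tmp.c3, ' '))).show()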
5. Frequency statistics and filtering
These helpers live in the stat module (df.stat); reference [2] describes them in detail.
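For instance, a couple of the stat helpers applied to the df built above (a sketch, not from the original notes):

# contingency table of user_id vs. attr_name
df.stat.crosstab('user_id', 'attr_name').show()
# items that appear in at least ~40% of the rows of the income column
df.stat.freqItems(['income'], 0.4).show()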
Grouped aggregation with group by (the train DataFrame below comes from a different example dataset with Age and Purchase columns):
train.groupby('Age').agg({'Purchase': 'mean'}).show()
Output:
+-----+-----------------+
| Age| avg(Purchase)|
+-----+-----------------+
|51-55|9534.808030960236|
|46-50|9208.625697468327|
| 0-17|8933.464640444974|
|36-45|9331.350694917874|
|26-35|9252.690632869888|
| 55+|9336.280459449405|
|18-25|9169.663606261289|
+-----+-----------------+
Applying multiple aggregate functions
df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show()
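Applied to the df built above (a quick sketch):

from pyspark.sql import functions as F

df.groupBy('user_id').agg(F.min('income'), F.avg('income'), F.max('income')).show()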
There are also apply functions, udf functions, and so on; a small udf sketch follows.
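As one small example of a udf (not from the original notes; strip_cm is a made-up helper), pulling the numeric part out of attr_value for the height rows:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# turn '172cm' into the integer 172; anything unexpected becomes null
strip_cm = udf(lambda s: int(s[:-2]) if s is not None and s.endswith('cm') else None,
               IntegerType())
df.filter(df.attr_name == 'height').withColumn('height_cm', strip_cm(df.attr_value)).show()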
Of course, PySpark DataFrames are extremely powerful, and I won't enumerate every feature here; see reference [2] for more.
References: