(Translation) The pyspark.sql.DataFrame module

cjhnbls

Tags: spark pyspark python

class pyspark.sql.DataFrame(jdf, sql_ctx)

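A distributed collection of data grouped into named columns.
(New in version 1.3)
A DataFrame is equivalent to a relational table in Spark SQL and can be created using various functions in SQLContext, for example: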

people = sqlContext.read.parquet("...")

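Once created, a DataFrame can be manipulated using the domain-specific-language (DSL) functions defined in DataFrame and Column.
To select a column from the DataFrame, use attribute access: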

ageCol = people.age

# To create DataFrame using SQLContext
people = sqlContext.read.parquet("...")
department = sqlContext.read.parquet("...")

people.filter(people.age > 30).join(department, people.deptId == department.id) \
  .groupBy(department.name, "gender").agg({"salary": "avg", "age": "max"})

agg(*exprs)

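Aggregates over the entire DataFrame without grouping; shorthand for df.groupBy().agg().
(New in version 1.3)

The code below returns the maximum or minimum of age in df. Personally I find the second form more flexible, for example when the result column needs to be renamed.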
>>> df.agg({"age": "max"}).collect()
[Row(max(age)=5)]
>>> from pyspark.sql import functions as F
>>> df.agg(F.min(df.age)).collect()
[Row(min(age)=2)]

alias(alias_name)

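Returns a new DataFrame with an alias set. The result holds the same data as the original, so it feels like a copy of it (at least that is my impression so far).
(New in version 1.3)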

>>> df1 = df.alias('df1')
>>> df2 = df.alias('df2')
>>> df == df1 or df1 == df2 or df == df2
False

approxQuantile(col, probabilities, relativeError)

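(New in version 2.0. I found the original text a bit confusing; after looking around, the only usage I found was computing the median.)

Original text: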

Calculates the approximate quantiles of a numerical column of a DataFrame.

The result of this algorithm has the following deterministic bound: If the DataFrame has N elements and if we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the exact rank of x is close to (p * N). More precisely,

floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in [Space-efficient Online Computation of Quantile Summaries](http://dx.doi.org/10.1145/375663.375670) by Greenwald and Khanna.

Parameters:
col – the name of the numerical column
probabilities – a list of quantile probabilities. Each number must belong to [0, 1]. For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
relativeError – The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
Returns:
the approximate quantiles at the given probabilities

Compute the median of the score column in df4 (relativeError=0 requests the exact quantile, which can be expensive):
>>> df4.approxQuantile('score',[0.5],0)
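
To make the rank bound above less abstract, here is a small pure-Python sketch of what it allows. It is not Spark's Greenwald-Khanna implementation; the helper names and the definition of rank(x) as "number of elements <= x" are my own assumptions for illustration.

```python
import math

def rank_bounds(p, err, n):
    """Lowest and highest rank allowed by floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)."""
    return math.floor((p - err) * n), math.ceil((p + err) * n)

def is_acceptable(values, x, p, err):
    """Check whether x is a valid approximate answer for quantile p.

    Assumption of this sketch: rank(x) is the number of elements <= x.
    """
    rank = sum(1 for v in values if v <= x)
    low, high = rank_bounds(p, err, len(values))
    return low <= rank <= high

scores = list(range(1, 101))                  # 100 hypothetical scores: 1..100
print(rank_bounds(0.5, 0.25, len(scores)))    # allowed rank window for the median
print(is_acceptable(scores, 60, 0.5, 0.25))   # 60 has rank 60, inside the window
print(is_acceptable(scores, 80, 0.5, 0.25))   # 80 has rank 80, outside the window
print(rank_bounds(0.5, 0.0, len(scores)))     # err = 0 leaves a single allowed rank
```

A larger relativeError widens the window of acceptable ranks, which is what lets the algorithm keep a smaller summary; with relativeError=0 only the exact rank is accepted.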

cache()

Persists the DataFrame with the default storage level (MEMORY_AND_DISK; the default was changed in 2.0 to match Scala).
(New in version 1.3)

checkpoint(eager=True)

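Returns a checkpointed version of this DataFrame. Checkpointing can be used to truncate the logical plan of the DataFrame, which is especially useful in iterative algorithms where the plan may grow exponentially. The checkpoint is saved to files inside the directory set with SparkContext.setCheckpointDir().
(New in version 2.1; currently experimental)

Parameters:

eager -------- whether to checkpoint the DataFrame immediately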

coalesce(numPartitions)

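Returns a new DataFrame that has exactly numPartitions partitions.
Like coalesce on an RDD, this operation results in a narrow dependency: if you go from 1000 partitions to 100, there is no shuffle; instead, each of the 100 new partitions claims 10 of the current partitions.
(New in version 1.4)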

>>> df.coalesce(1).rdd.getNumPartitions()
1
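
The "each new partition claims 10 current partitions" idea can be sketched in plain Python. This only models the grouping; the contiguous assignment and the function name are assumptions of the sketch, and Spark's real coalescer may group partitions differently (for example, by locality).

```python
def coalesce_plan(current, target):
    """Assign every current partition index to exactly one new partition.

    No row moves between groups of partitions, so no shuffle is needed.
    Asking for more partitions than exist leaves the count unchanged.
    """
    if target >= current:
        return [[i] for i in range(current)]
    groups = [[] for _ in range(target)]
    for i in range(current):
        groups[i * target // current].append(i)   # contiguous blocks of partitions
    return groups

plan = coalesce_plan(1000, 100)
print(len(plan))        # number of new partitions
print(plan[0])          # current partitions claimed by the first new partition
```

Because whole partitions are merged rather than rows redistributed, the resulting partitions can be uneven in size; repartition() is the method that shuffles.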

collect()

Returns all the records as a list of Row. Each Row in the list can be converted to a dict with asDict().

>>> df.collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]

columns

Returns all column names as a list.
(New in version 1.3)

>>> df.columns
['age', 'name']

corr(col1, col2, method=None)

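Calculates the correlation of two columns of a DataFrame as a double value. Currently only the Pearson correlation coefficient is supported.
DataFrame.corr() and DataFrameStatFunctions.corr() are aliases of each other.
(New in version 1.4)

Parameters:

col1 -------- the name of the first column
col2 -------- the name of the second column
method -------- the correlation method; currently only "pearson" is supported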

count()

Returns the number of rows in this DataFrame.
(New in version 1.3)

>>> df.count()
2

cov(col1, col2)

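Calculates the sample covariance of the two given columns as a double value.
DataFrame.cov() and DataFrameStatFunctions.cov() are aliases of each other.
(New in version 1.4)

Parameters:

col1 -------- the name of the first column
col2 -------- the name of the second column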

createGlobalTempView(name)

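Creates a global temporary view with this DataFrame.
The lifetime of this temporary view is tied to the current Spark application. Throws TempTableAlreadyExistsException if the view name already exists in the catalog.
(New in version 2.1)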

>>> df.createGlobalTempView("people")
>>> df2 = spark.sql("select * from global_temp.people")
>>> sorted(df.collect()) == sorted(df2.collect())
True
>>> df.createGlobalTempView("people")
Traceback (most recent call last):
...
AnalysisException: u"Temporary table 'people' already exists;"
>>> spark.catalog.dropGlobalTempView("people")

createOrReplaceTempView(name)

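Creates a temporary view with this DataFrame, replacing the existing view if one with the same name already exists.
The lifetime of this temporary view is tied to the SparkSession that was used to create this DataFrame.
(New in version 2.0)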

>>> df.createOrReplaceTempView("people")
>>> df2 = df.filter(df.age > 3)
>>> df2.createOrReplaceTempView("people")
>>> df3 = spark.sql("select * from people")
>>> sorted(df3.collect()) == sorted(df2.collect())
True
>>> spark.catalog.dropTempView("people")

createTempView(name)

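Creates a local temporary view with this DataFrame.
The lifetime of this temporary view is tied to the SparkSession that was used to create this DataFrame. Throws TempTableAlreadyExistsException if the view name already exists in the catalog.
(New in version 2.0)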

>>> df.createTempView("people")
>>> df2 = spark.sql("select * from people")
>>> sorted(df.collect()) == sorted(df2.collect())
True
>>> df.createTempView("people")
Traceback (most recent call last):
...
AnalysisException: u"Temporary table 'people' already exists;"
>>> spark.catalog.dropTempView("people")

crossJoin(other)

Returns the Cartesian product with another DataFrame.
(New in version 2.1)

Parameters:

other -------- the right side of the Cartesian product

crosstab(col1, col2)

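Computes a pair-wise frequency table (also known as a contingency table) of the two given columns: the values of the two columns are paired up and each pair is counted.
The number of distinct values in each column should be less than 1e4, and at most 1e6 non-zero pair frequencies are returned.
DataFrame.crosstab() and DataFrameStatFunctions.crosstab() are aliases of each other.
(New in version 1.4)

Parameters:

col1 -------- the name of the first column
col2 -------- the name of the second column

The code makes this easier to see. The result is a two-dimensional table: the distinct values of col1 run down the y axis (one per row) and the distinct values of col2 run along the x axis (one per column).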
>>> df = spark.createDataFrame([{'name':'cjh','id':158},{'name':'cjhjh','id':159},{'name':'cjh','id':158}])
>>> df.crosstab('name','id').show()
+-------+---+---+
|name_id|158|159|
+-------+---+---+
|  cjhjh|  0|  1|
|    cjh|  2|  0|
+-------+---+---+
>>> df.crosstab('id','name').show()
+-------+---+-----+
|id_name|cjh|cjhjh|
+-------+---+-----+
|    158|  2|    0|
|    159|  0|    1|
+-------+---+-----+
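
For readers who want to see the counting itself, the same table can be built in plain Python with a Counter. This is only a model of the semantics on the three rows used above, not how Spark computes it; the function name and the sorted row/column order are choices of this sketch.

```python
from collections import Counter

def crosstab(rows, col1, col2):
    """Count every (col1 value, col2 value) pair and lay the counts out as a table."""
    pair_counts = Counter((row[col1], row[col2]) for row in rows)
    row_keys = sorted({pair[0] for pair in pair_counts})
    col_keys = sorted({pair[1] for pair in pair_counts})
    header = [col1 + "_" + col2] + [str(key) for key in col_keys]
    # A Counter returns 0 for pairs that never occurred.
    table = [[rk] + [pair_counts[(rk, ck)] for ck in col_keys] for rk in row_keys]
    return header, table

rows = [{'name': 'cjh', 'id': 158}, {'name': 'cjhjh', 'id': 159}, {'name': 'cjh', 'id': 158}]
header, table = crosstab(rows, 'name', 'id')
print(header)
for line in table:
    print(line)
```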

cube(*cols)

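Creates a multi-dimensional cube for the current DataFrame using the specified columns, so that aggregations can be run on them.
(New in version 1.4)

cube computes the aggregate for every combination of the given columns, including the subtotal rows in which a column is aggregated away and shown as null.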
>>> df.cube("name", df.age).count().orderBy("name", "age").show()
+-----+----+-----+
| name| age|count|
+-----+----+-----+
| null|null|    2|
| null|   2|    1|
| null|   5|    1|
|Alice|null|    1|
|Alice|   2|    1|
|  Bob|null|    1|
|  Bob|   5|    1|
+-----+----+-----+
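
The seven rows above come from grouping by every subset of the cube columns. A minimal pure-Python model of that, assuming the two-row Alice/Bob data used throughout these examples (the function name is mine, and None stands in for the null of a column that was aggregated away):

```python
from collections import Counter
from itertools import combinations

def cube_counts(rows, cols):
    """Count rows for every subset of cols; columns left out of a subset become None."""
    counts = Counter()
    for size in range(len(cols) + 1):
        for kept in combinations(cols, size):
            for row in rows:
                key = tuple(row[c] if c in kept else None for c in cols)
                counts[key] += 1
    return counts

people = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 5}]
result = cube_counts(people, ["name", "age"])
for key, count in result.items():
    print(key, count)
```

With k columns there are 2**k grouping subsets, which is why a cube over many columns gets expensive quickly. The sketch would mix up real null values in the data with "aggregated away" markers, so treat it only as an illustration.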

describe(*cols)

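Computes statistics for the given columns: count, mean, standard deviation, minimum, and maximum. If no column names are given, statistics are computed for all numeric or string columns.
(New in version 1.3.1)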

>>> df.describe(['age']).show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|               3.5|
| stddev|2.1213203435596424|
|    min|                 2|
|    max|                 5|
+-------+------------------+
>>> df.describe().show()
+-------+------------------+-----+
|summary|               age| name|
+-------+------------------+-----+
|  count|                 2|    2|
|   mean|               3.5| null|
| stddev|2.1213203435596424| null|
|    min|                 2|Alice|
|    max|                 5|  Bob|
+-------+------------------+-----+

distinct()

Returns a new DataFrame containing the distinct rows of this DataFrame.
(New in version 1.3)

>>> df.count()
5
>>> df.distinct().count()
3

drop(*cols)

Returns a new DataFrame that drops the specified columns.
This is a no-op if the DataFrame does not contain the given column names.
(New in version 1.4)

>>> df.drop('age').collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.drop(df.age).collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
[Row(age=5, height=85, name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df2.name).collect()
[Row(age=5, name=u'Bob', height=85)]
>>> df.join(df2, 'name', 'inner').drop('age', 'height').collect()
[Row(name=u'Bob')]

dropDuplicates(subset=None)

Returns a new DataFrame with duplicate rows removed, optionally considering only the given subset of columns.
drop_duplicates() is an alias for dropDuplicates().
(New in version 1.4)

>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
...     Row(name='Alice', age=5, height=80), \
...     Row(name='Alice', age=5, height=80), \
...     Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|    80|Alice|
+---+------+-----+
>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
+---+------+-----+

drop_duplicates(subset=None)

Same as dropDuplicates() above.
(New in version 1.4)

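dropna(how='any', thresh=None, subset=None)

Returns a new DataFrame omitting rows with null values.
DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other.
(New in version 1.3.1)

Parameters:

how -------- 'any': drop a row if it contains any null; 'all': drop a row only if all of its values are null
thresh -------- int, default None. If specified, drop rows that have fewer than thresh non-null values; for example, with thresh=3 a row is kept only if it has at least 3 non-null values. This overrides the how parameter.
subset -------- optional list of column names to consider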

>>> df.dropna().show()
+---+-----+
| id| name|
+---+-----+
|158|  cjh|
|159|cjhjh|
|158|  cjh|
+---+-----+
>>> df.dropna(how='all').show()
+----+-----+
|  id| name|
+----+-----+
| 158|  cjh|
| 159|cjhjh|
| 158|  cjh|
|null|  cj1|
+----+-----+
>>> df.dropna(how='all',thresh=2).show()
+---+-----+
| id| name|
+---+-----+
|158|  cjh|
|159|cjhjh|
|158|  cjh|
+---+-----+
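
Since thresh is easy to misread, here is a small pure-Python model of how the three parameters interact. The function is my own sketch of the documented rules, and the five input rows are an assumed reconstruction of the data behind the outputs above (the original df is not shown).

```python
def dropna(rows, how="any", thresh=None, subset=None):
    """Keep or drop each row following the documented how / thresh / subset rules."""
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        non_null = sum(1 for c in cols if row[c] is not None)
        if thresh is not None:
            keep = non_null >= thresh        # thresh overrides how
        elif how == "any":
            keep = non_null == len(cols)     # drop if any value is null
        elif how == "all":
            keep = non_null > 0              # drop only if every value is null
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept

# Assumed input: three complete rows, one row with a null id, one all-null row.
rows = [
    {"id": 158, "name": "cjh"},
    {"id": 159, "name": "cjhjh"},
    {"id": 158, "name": "cjh"},
    {"id": None, "name": "cj1"},
    {"id": None, "name": None},
]
print(len(dropna(rows)))
print(len(dropna(rows, how="all")))
print(len(dropna(rows, how="all", thresh=2)))
```

The last call shows the override: how='all' on its own keeps the 'cj1' row, but thresh=2 requires two non-null values, so that row is dropped.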

dtypes

Returns all column names and their data types as a list.
(New in version 1.3)

>>> df.dtypes
[('age', 'int'), ('name', 'string')]

explain(extended=False)

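Prints the logical and physical plans to the console for debugging purposes.
(New in version 1.3)

Parameters:

extended -------- boolean, default False. If False, only the physical plan is printed.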

>>> df.explain()
== Physical Plan ==
Scan ExistingRDD[age#0,name#1]
>>> df.explain(True)
== Parsed Logical Plan ==
...
== Analyzed Logical Plan ==
...
== Optimized Logical Plan ==
...
== Physical Plan ==
...

fillna(value, subset=None)

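Replaces null values in the given columns with a specific value. A column is ignored if the value's type does not match the column's data type.
DataFrame.fillna() and DataFrameNaFunctions.fill() (that is, na.fill()) are aliases of each other.
(New in version 1.3.1)

Parameters:

value -------- int, long, float, string, or dict. If value is a dict, its keys are column names and its values are the replacements for nulls in those columns; subset is then ignored.
subset -------- optional list of names of the columns to consider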

>>> df4.na.fill(50).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
| 10|    80|Alice|
|  5|    50|  Bob|
| 50|    50|  Tom|
| 50|    50| null|
+---+------+-----+
>>> df4.na.fill({'age': 50, 'name': 'unknown'}).show()
+---+------+-------+
|age|height|   name|
+---+------+-------+
| 10|    80|  Alice|
|  5|  null|    Bob|
| 50|  null|    Tom|
| 50|  null|unknown|
+---+------+-------+
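
The two outputs above follow from two rules: a scalar only fills columns of a matching type, and a dict fills exactly the columns it names. A simplified pure-Python model of that, where a hand-written schema dict stands in for Spark's column types (the function, the schema, and the exact type check are assumptions of this sketch; Spark's real casting rules between numeric types are richer):

```python
def fillna(rows, schema, value, subset=None):
    """Fill None values; schema maps column name -> Python type (stand-in for Spark types)."""
    if isinstance(value, dict):
        replacements = value                      # subset is ignored for a dict
    else:
        cols = subset if subset is not None else list(schema)
        replacements = {c: value for c in cols}
    result = []
    for row in rows:
        new_row = dict(row)
        for col, repl in replacements.items():
            # Skip columns whose type does not match the replacement value.
            if new_row[col] is None and isinstance(repl, schema[col]):
                new_row[col] = repl
        result.append(new_row)
    return result

schema = {"age": int, "height": int, "name": str}
df4 = [
    {"age": 10, "height": 80, "name": "Alice"},
    {"age": 5, "height": None, "name": "Bob"},
    {"age": None, "height": None, "name": "Tom"},
    {"age": None, "height": None, "name": None},
]
filled = fillna(df4, schema, 50)
filled_by_dict = fillna(df4, schema, {"age": 50, "name": "unknown"})
print(filled[3])
print(filled_by_dict[3])
```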

filter(condition)

Filters rows using the given condition.
where() is an alias for filter().
(New in version 1.3)

Parameters:

condition -------- a Column of types.BooleanType, or a string containing a SQL expression

>>> df.filter(df.age > 3).collect()
[Row(age=5, name=u'Bob')]
>>> df.where(df.age == 2).collect()
[Row(age=2, name=u'Alice')]
>>> df.filter("age > 3").collect()
[Row(age=5, name=u'Bob')]
>>> df.where("age = 2").collect()
[Row(age=2, name=u'Alice')]

first()

Returns the first row of the DataFrame as a Row, which can be converted to a dict with asDict().
(New in version 1.3)

>>> df.first()
Row(age=2, name=u'Alice')

foreach(f)

Applies the function f to every Row of this DataFrame.
This is shorthand for df.rdd.foreach().
(New in version 1.3)

>>> def f(person):
...     print(person.name)
>>> df.foreach(f)

foreachPartition(f)

Applies the function f to each partition of this DataFrame.
This is shorthand for df.rdd.foreachPartition().
(New in version 1.3)

>>> def f(people):
...     for person in people:
...         print(person.name)
>>> df.foreachPartition(f)

freqItems(cols, support=None)

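Finds the frequent items in the given columns for a given minimum support. [See here for the detailed algorithm](http://dx.doi.org/10.1145/762471.762473)
DataFrame.freqItems() and DataFrameStatFunctions.freqItems() are aliases of each other.
(New in version 1.4)

Parameters:

cols -------- list or tuple of names of the columns to compute frequent items for
support -------- the minimum support, default 0.01 (1%); it must be greater than 0.0001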

groupBy(*cols)

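Groups the DataFrame using the specified columns so that aggregations can be run on them. See the GroupedData class for the available aggregate functions.
groupby() is an alias for groupBy().
(New in version 1.3)

Parameters:

cols -------- list of columns to group by; each element is a column name (string) or a column expression (Column)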

>>> df.groupBy().avg().collect()
[Row(avg(age)=3.5)]
>>> sorted(df.groupBy('name').agg({'age': 'mean'}).collect())
[Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
>>> sorted(df.groupBy(df.name).avg().collect())
[Row(name=u'Alice', avg(age)=2.0), Row(name=u'Bob', avg(age)=5.0)]
>>> sorted(df.groupBy(['name', df.age]).count().collect())
[Row(name=u'Alice', age=2, count=1), Row(name=u'Bob', age=5, count=1)]

groupby(*cols)

Same as groupBy() above.
(New in version 1.4)

head(n=None)

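Returns the first n rows of the DataFrame.
This method should only be used when the result is expected to be small, because all the data is loaded into the driver's memory.
(New in version 1.3)

Parameters:

n -------- int, default 1; the number of rows to return

Returns:

If n is given explicitly, a list of Row (even when n is 1); if n is omitted, a single Row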

``` python
>>> df.head()
Row(age=2, name=u'Alice')
>>> df.head(1)
[Row(age=2, name=u'Alice')]
```





### intersect(other)
Returns a new DataFrame containing only the rows that are in both this DataFrame and another DataFrame, like INTERSECT in SQL.
(New in version 1.3)





### isLocal()
Returns True if collect() and take() can be run locally.
(New in version 1.3)





### isStreaming
Roughly speaking, this tells whether the DataFrame reads from one or more streaming sources. If it does, methods that return a single answer, such as count() or collect(), cannot be run and raise an exception. This API is still experimental.
(New in version 2.0)
My own understanding is not fully solid, so here is the original text:
Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, (e.g., count() or collect()) will throw an AnalysisException when there is a streaming source present.





### join(other, on=None, how=None)
Joins with another DataFrame using the given join expression.
(New in version 1.3)




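#### Parameters:
- other -------- the DataFrame on the right side of the join
- on -------- the column(s) to join on: a column name (string), a list of column names, a join expression (Column), or a list of Columns. If on is a column name or a list of column names, the column(s) must exist in both DataFrames.
- how -------- string, default 'inner'. One of 'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'.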






``` python
>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)]
>>> df.join(df2, 'name', 'outer').select('name', 'height').collect()
[Row(name=u'Tom', height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)]
>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
>>> df.join(df2, 'name').select(df.name, df2.height).collect()
[Row(name=u'Bob', height=85)]
>>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
[Row(name=u'Bob', age=5)]
```

limit(num)

Limits the result to the specified number of rows.
(New in version 1.3)

>>> df.limit(1).collect()
[Row(age=2, name=u'Alice')]
>>> df.limit(0).collect()
[]

na

Returns a DataFrameNaFunctions object for handling missing values.
(New in version 1.3.1)

orderBy(*cols, **kwargs)

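Returns a new DataFrame sorted by the specified columns.
(New in version 1.3)

Parameters:

cols -------- list of column names (or Column objects) to sort by
ascending -------- boolean, default True (ascending order). It can also be a list of booleans, one per column.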

>>> df.sort(df.age.desc()).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.sort("age", ascending=False).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.orderBy(df.age.desc()).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> from pyspark.sql.functions import *
>>> df.sort(asc("age")).collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.orderBy(desc("age"), "name").collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.orderBy(["age", "name"], ascending=[0, 1]).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]

persist(storageLevel=StorageLevel(True, True, False, False, 1))

Persists the DataFrame with the given storage level.
(New in version 1.3)
Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed. This can only be used to assign a new storage level if the DataFrame does not have a storage level set yet. If no storage level is specified, it defaults to MEMORY_AND_DISK.

Note: The default storage level has changed to MEMORY_AND_DISK to match Scala in 2.0.

printSchema()

Prints out the schema of the DataFrame in tree format.
(New in version 1.3)

>>> df.printSchema()
root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

### randomSplit(weights, seed=None)
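Randomly splits this DataFrame according to the provided weights.
(New in version 1.4)

#### Parameters:
- weights -------- list of doubles used as weights for the splits; the weights are normalized if they do not sum to 1.0
- seed -------- the seed for sampling; my guess is that it works much like a random number seed (~ ~)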
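
A rough pure-Python picture of what "normalized weights" and "seed" mean here: the weights are turned into adjacent ranges of [0, 1), and every row draws one random number that decides which split it lands in. This is my own simplified model, not Spark's sampler, and the function names are made up for the sketch.

```python
import random

def normalized_bounds(weights):
    """Turn weights into cumulative boundaries, e.g. [2, 6] -> [0.0, 0.25, 1.0]."""
    total = sum(weights)
    bounds = [0.0]
    for weight in weights:
        bounds.append(bounds[-1] + weight / total)
    return bounds

def random_split(rows, weights, seed=None):
    """Send each row to the split whose range contains the row's random draw."""
    rng = random.Random(seed)
    bounds = normalized_bounds(weights)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i in range(len(weights)):
            if bounds[i] <= x < bounds[i + 1]:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)   # guard against rounding in the last boundary
    return splits

print(normalized_bounds([2, 6]))
train, test = random_split(list(range(100)), [2, 6], seed=7)
print(len(train) + len(test))
```

Two things follow from the model: the split sizes only approximate the weights, and the same seed gives the same split on the same data.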

rdd

Returns the content of the DataFrame as an RDD of Row.
(New in version 1.3)

registerTempTable(name)

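Registers this DataFrame as a temporary table using the given name.
The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame.
(New in version 1.3; deprecated in 2.0, use createOrReplaceTempView instead)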

>>> df.registerTempTable("people")
>>> df2 = spark.sql("select * from people")
>>> sorted(df.collect()) == sorted(df2.collect())
True
>>> spark.catalog.dropTempView("people")

repartition(numPartitions, *cols)

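Returns a new DataFrame repartitioned by the given number of partitions and columns. If the number of partitions is not specified, the default number of partitions is used.
(New in version 1.3)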

>>> df.repartition(10).rdd.getNumPartitions()
10
>>> data = df.union(df).repartition("age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
|  5|  Bob|
|  5|  Bob|
|  2|Alice|
|  2|Alice|
+---+-----+
>>> data = data.repartition(7, "age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
|  2|Alice|
|  5|  Bob|
|  2|Alice|
|  5|  Bob|
+---+-----+
>>> data.rdd.getNumPartitions()
7
>>> data = data.repartition("name", "age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
|  5|  Bob|
|  5|  Bob|
|  2|Alice|
|  2|Alice|
+---+-----+

replace(to_replace, value, subset=None)

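Returns a new DataFrame in which values in the given columns are replaced with other values.
DataFrame.replace() and DataFrameNaFunctions.replace() are aliases of each other.
(New in version 1.4)

Parameters:

to_replace -------- the value to be replaced; int, long, float, string, or list. If it is a dict, the keys are the values to be replaced and the dict values are their replacements, and the value argument is ignored.
value -------- the replacement value, matching to_replace
subset -------- optional list of names of the columns to consider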


>>> df4.na.replace(10, 20).show()
+----+------+-----+
| age|height| name|
+----+------+-----+
|  20|    80|Alice|
|   5|  null|  Bob|
|null|  null|  Tom|
|null|  null| null|
+----+------+-----+
>>> df4.na.replace(['Alice', 'Bob'], ['A', 'B'], 'name').show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|   A|
|   5|  null|   B|
|null|  null| Tom|
|null|  null|null|
+----+------+----+

rollup(*cols)

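Creates a multi-dimensional rollup for the current DataFrame using the specified columns, so that aggregations can be run on them. Compared with cube(), rollup() only builds the hierarchical combinations: columns are aggregated away from right to left, so the first column is always the last one to be rolled up.
(New in version 1.4)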

>>> df.rollup("name", df.age).count().orderBy("name", "age").show()
+-----+----+-----+
| name| age|count|
+-----+----+-----+
| null|null|    2|
|Alice|null|    1|
|Alice|   2|    1|
|  Bob|null|    1|
|  Bob|   5|    1|
+-----+----+-----+
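
The five rows above are the prefixes of the column list: (name, age), (name), and the grand total. A minimal pure-Python model, assuming the same two-row Alice/Bob data (the function name is mine, and None stands in for a rolled-up column):

```python
from collections import Counter

def rollup_counts(rows, cols):
    """Count rows for each prefix of cols; columns after the prefix become None."""
    counts = Counter()
    for keep in range(len(cols), -1, -1):
        for row in rows:
            key = tuple(row[c] if i < keep else None for i, c in enumerate(cols))
            counts[key] += 1
    return counts

people = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 5}]
result = rollup_counts(people, ["name", "age"])
for key, count in result.items():
    print(key, count)
```

Unlike cube(), there is no (None, 2) or (None, 5) entry: age is never counted on its own, only within a name. With k columns a rollup has k + 1 grouping levels, against 2**k for a cube.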

sample(withReplacement, fraction, seed=None)

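Returns a sampled subset of this DataFrame.
The size of the result is not guaranteed to match the given fraction exactly.
(New in version 1.4)

Parameters:

withReplacement -------- whether to sample with replacement
fraction -------- the fraction of rows to generate
seed -------- the seed for sampling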

>>> df.sample(False, 0.5, 42).count()
2

sampleBy(col, fractions, seed=None)

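Returns a stratified sample without replacement, based on the fraction given for each stratum.
(New in version 1.5)

Parameters:

col -------- the column that defines the strata
fractions -------- the sampling fraction for each stratum (a dict); a stratum that is not listed is treated as fraction zero
seed -------- the random seed

Returns:

a new DataFrame that represents the stratified sample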

>>> from pyspark.sql.functions import col
>>> dataset = sqlContext.range(0, 100).select((col("id") % 3).alias("key"))
>>> sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
>>> sampled.groupBy("key").count().orderBy("key").show()
+---+-----+
|key|count|
+---+-----+
|  0|    5|
|  1|    9|
+---+-----+
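
Why does key 2 not appear in the result, and why are the counts 5 and 9 rather than exact fractions? A pure-Python model of stratified sampling without replacement makes both points visible. This is a sketch of the idea only (one independent random draw per row); it does not reproduce Spark's random numbers, so the counts it gives will differ from the table above.

```python
import random

def sample_by(rows, key, fractions, seed=None):
    """Keep each row with the probability listed for its stratum (0 if unlisted)."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        fraction = fractions.get(row[key], 0.0)
        if rng.random() < fraction:      # never true when fraction is 0.0
            sampled.append(row)
    return sampled

dataset = [{"key": i % 3} for i in range(100)]
sampled = sample_by(dataset, "key", {0: 0.1, 1: 0.2}, seed=0)
for stratum in (0, 1, 2):
    print(stratum, sum(1 for row in sampled if row["key"] == stratum))
```

Each row is decided independently, so a stratum's sample size only approximates fraction times stratum size, and a stratum missing from fractions is never sampled.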

schema

Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
(New in version 1.3)

>>> df.schema
StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))

select(*cols)

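Projects a set of columns or expressions and returns a new DataFrame.
(New in version 1.3)

Parameters:

cols -------- list of column names (strings) or expressions (Column). If one of the column names is '*', it is expanded to all columns of the current DataFrame.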

>>> df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]

selectExpr(*expr)

Projects a set of SQL expressions and returns a new DataFrame. This is a variant of select() that accepts SQL expressions.
(New in version 1.3)

>>> df.selectExpr("age * 2", "abs(age)").collect()
[Row((age * 2)=4, abs(age)=2), Row((age * 2)=10, abs(age)=5)]

show(n=20, truncate=True)

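Prints the first n rows of the DataFrame to the console.
(New in version 1.3)

Parameters:

n -------- the number of rows to show, default 20
truncate -------- default True, which truncates strings longer than 20 characters. If set to a number greater than one, long strings are truncated to that length and cells are right-aligned.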

>>> df
DataFrame[age: int, name: string]
>>> df.show()
+---+-----+
|age| name|
+---+-----+
|  2|Alice|
|  5|  Bob|
+---+-----+
>>> df.show(truncate=3)
+---+----+
|age|name|
+---+----+
|  2| Ali|
|  5| Bob|
+---+----+

sort(*cols, **kwargs)

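Returns a new DataFrame sorted by the specified columns.
(New in version 1.3)

Parameters:

cols -------- list of column names (or Column objects) to sort by
ascending -------- boolean, default True (ascending order). It can also be a list of booleans (or 0/1 values), one per column, giving the sort direction for each column.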

>>> df.sort(df.age.desc()).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.sort("age", ascending=False).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.orderBy(df.age.desc()).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> from pyspark.sql.functions import *
>>> df.sort(asc("age")).collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.orderBy(desc("age"), "name").collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.orderBy(["age", "name"], ascending=[0, 1]).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]

sortWithinPartitions(*cols, **kwargs)

Returns a new DataFrame with each partition sorted by the specified columns.
(New in version 1.6)

Parameters:

Same as sort() above

stat

Returns a DataFrameStatFunctions object, which provides the statistics functions of the DataFrame.
(New in version 1.4)

storageLevel

Gets the DataFrame's current storage level.
(New in version 2.1)

>>> df.storageLevel
StorageLevel(False, False, False, False, 1)
>>> df.cache().storageLevel
StorageLevel(True, True, False, True, 1)
>>> df2.persist(StorageLevel.DISK_ONLY_2).storageLevel
StorageLevel(True, False, False, False, 2)

subtract(other)

Returns a new DataFrame containing the rows that are in this DataFrame but not in another DataFrame. This is equivalent to EXCEPT in SQL.
(New in version 1.3)

take(num)

Returns the first num rows of the DataFrame as a list of Row.
(New in version 1.3)

toDF(*cols)

Returns a new DataFrame with the columns renamed to the given names.

Parameters:

cols -------- list of new column names

>>> df.toDF('f1', 'f2').collect()
[Row(f1=2, f2=u'Alice'), Row(f1=5, f2=u'Bob')]

toJSON(use_unicode=True)

Converts the DataFrame into an RDD of strings, where each row becomes one JSON document.
(New in version 1.3)

>>> df.toJSON().first()
u'{"age":2,"name":"Alice"}'

toLocalIterator()

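Returns an iterator over all of the rows in this DataFrame. The iterator consumes as much memory as the largest partition in the DataFrame.
(New in version 2.0)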

>>> list(df.toLocalIterator())
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]

toPandas()

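Returns the contents of this DataFrame as a pandas DataFrame.
This is only available if pandas is installed. Because all the data is loaded into the memory of a single machine, it is not suitable for large datasets.
(New in version 1.3)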

``` python
>>> df.toPandas()
   age   name
0    2  Alice
1    5    Bob
```





### union(other)
Returns a new DataFrame containing all rows of this DataFrame and another DataFrame. This is equivalent to UNION ALL in SQL.
(New in version 2.0)





### unionAll(other)
Deprecated; use union() instead.





### unpersist(blocking=False)
Marks the DataFrame as non-persistent and removes all of its blocks from memory and disk.
(New in version 1.3)





### where(condition)
An alias for filter().





### withColumn(colName, col)
Returns a new DataFrame with a column named colName added.
(New in version 1.3)





#### Parameters:
- colName -------- the name of the new column
- col -------- a Column expression for the new column





``` python
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
```

withColumnRenamed(existing, new)

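Returns a new DataFrame with an existing column renamed. This is a no-op if the DataFrame does not contain the given column name.
(New in version 1.3)

Parameters:

existing -------- the name of the existing column
new -------- the new name of the column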

withWatermark(eventTime, delayThreshold)

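Defines an event-time watermark for this DataFrame. A watermark tracks a point in time before which no more late data is assumed to arrive.
(New in version 2.1)

The details of how the watermark works are hard for me to translate clearly, so here is the original text: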

The current watermark is computed by looking at the MAX(eventTime) seen across all of the partitions in the query minus a user specified delayThreshold. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at least delayThreshold behind the actual event time. In some cases we may still process records that arrive more than delayThreshold late.

Parameters:

eventTime -------- the name of the column that contains the event time of the row.
delayThreshold -------- the minimum delay to wait for data to arrive late, relative to the latest record that has been processed, in the form of an interval (e.g. "1 minute" or "5 hours").

>>> sdf.select('name', sdf.time.cast('timestamp')).withWatermark('time', '10 minutes')
DataFrame[name: string, time: timestamp]
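
The formula in the original text, MAX(eventTime) minus delayThreshold, is easy to try out by hand. Below is a tiny pure-Python model with made-up timestamps; the function names are mine, and the real engine only guarantees that the watermark is at least delayThreshold behind the latest event time, so treat this as the idealized case.

```python
from datetime import datetime, timedelta

def current_watermark(event_times, delay_threshold):
    """Idealized watermark: latest event time seen so far minus the allowed delay."""
    return max(event_times) - delay_threshold

def is_too_late(event_time, watermark):
    """In this model a record older than the watermark counts as too late."""
    return event_time < watermark

# Hypothetical event times seen so far.
seen = [
    datetime(2017, 1, 1, 12, 0),
    datetime(2017, 1, 1, 12, 7),
    datetime(2017, 1, 1, 12, 15),
]
watermark = current_watermark(seen, timedelta(minutes=10))
print(watermark)
print(is_too_late(datetime(2017, 1, 1, 12, 1), watermark))
print(is_too_late(datetime(2017, 1, 1, 12, 6), watermark))
```

As the original text notes, Spark may still process some records that arrive later than this threshold, so the watermark is a lower bound on how long late data is waited for, not a hard cutoff.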

write

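Interface for saving the content of a non-streaming DataFrame out to external storage.
(New in version 1.4)

Returns:

DataFrameWriter, through which the DataFrame can be written out in various formats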

writeStream

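Interface for saving the content of a streaming DataFrame out to external storage. This API is still experimental.
(New in version 2.0)

Returns:

DataStreamWriter, through which the streaming DataFrame can be written out in various formats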
