PySpark: DataFrame in Spark SQL (Part 3)

Article link: https://blog.csdn.net/xiaodunlp/article/details/98222733

1.filter(condition)

"""Filters rows using the given condition.
:func:`where` is an alias for :func:`filter`.
:param condition: a :class:`Column` of :class:`types.BooleanType`
    or a string of SQL expression."""

Filters rows by the given condition. The where method is simply an alias for filter.

The condition can be a boolean Column expression, or a string containing a SQL-style filter condition, like the predicate of a WHERE clause.

df.show()
df.filter("age>=12").show()
df.filter(df["grade2"] > 40).show()  # can also be written as df.grade2 > 40

2.first()

Returns the first row of the DataFrame as a Row object.

print(df.first())

To get at the values inside, index the returned Row with ["column name"], or convert it to a dictionary with asDict() and look fields up by key.

3.foreach(f)

"""Applies the ``f`` function to all :class:`Row` of this :class:`DataFrame`.
This is a shorthand for ``df.rdd.foreach()``"""

Applies f to every Row. Under the hood it calls the foreach method of the underlying RDD, df.rdd.foreach.

df.foreach(lambda x: print("id is: %s, age is: %s" % (x.id, x.age)))

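There is also a foreachPartition method, which calls f once per partition, passing it an iterator over that partition's rows. It can be more efficient than foreach, which calls f once for every Row, especially when f has a setup cost that would otherwise be paid per row.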

def print_func(partition_datas):
    for row in partition_datas:
        print("id is %s ,age is %s" % (row.id, row.age))


df.foreachPartition(print_func)