PySpark: DataFrame in SparkSQL (Part 4)

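This article walks through a range of PySpark DataFrame operations: value replacement with replace(), sampling with sample() and sampleBy(), retrieving schema information, column selection with select() and selectExpr(), display with show(), the stat statistics interface, the storageLevel cache level, and merging with union() and unionByName(). It also covers filtering with where(), modifying columns with withColumn(), releasing cached data with unpersist(), and persisting data through the write interface.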

1.replace(to_replace, value=_NoValue, subset=None)

"""Returns a new :class:`DataFrame` replacing a value with another value.
:func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
aliases of each other.
Values to_replace and value must have the same type and can only be numerics, booleans,
or strings. Value can have None. When replacing, the new value will be cast
to the type of the existing column.
For numeric replacements all values to be replaced should have unique
floating point representation. In case of conflicts (for example with `{42: -1, 42.0: 1}`)
an arbitrary replacement will be used.

:param to_replace: bool, int, long, float, string, list or dict.
    Value to be replaced.
    If the value is a dict, then `value` is ignored or can be omitted, and `to_replace`
    must be a mapping between a value and a replacement.
:param value: bool, int, long, float, string, list or None.
    The replacement value must be a bool, int, long, float, string or None. If `value` is a
    list, `value` should be of the same length and type as `to_replace`.
    If `value` is a scalar and `to_replace` is a sequence, then `value` is
    used as a replacement for each item in `to_replace`.
:param subset: optional list of column names to consider.
    Columns specified in subset that do not have matching data type are ignored.
    For example, if `value` is a string, and subset contains a non-string column,
    then the non-string column is simply ignored."""

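The first argument of this method gives the old value (or values) to be replaced, and the second gives the new value. The subset keyword argument restricts the replacement to the listed columns; by default the replacement applies to the whole DataFrame.

Note that to_replace and value must have the same type, with the exception that value may be None. to_replace can only be a bool, int, long, float, string, list or dict; value can only be a bool, int, long, float, string, list or None.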

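# Show the original DataFrame for comparison.
df.show()

# Replace the placeholder strings (\N, a single space, a tab, and '') with null in every string column.
df.replace([r'\N', ' ', '\t', "''"], None).show()
# Same replacement, restricted to the name and age columns.
df.replace([r'\N', ' ', '\t', "''"], None, ["name", "age"]).show()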

Values in a specific column can also be mapped to replacements one-to-one, for example replacing Female with F and Male with M in the Genre column.

2.sample(withReplacement=None, fraction=None, seed=None)

"""Returns a sampled subset of this :class:`DataFrame