问题描述
dataframe展示
希望对总面积,客厅面积等求均值
中间曲折
在网上找了很多资料,感觉rdd进行map,reduce可以完成,转化成rdd
,
每一个属性进行map,reduce,这样汇总求均值并排序,但是发现这样做太麻烦了,但是网上我没找到相关教程,于是我在库中寻找dataframe相关方法。结果让我找到了。
相关方法
1、summary()函数
suammary 函数可用于dataframe求均值,方差,最大值最小值等等
可填入参数
‘max’ | 所选dataframe的最大值 |
---|---|
’mean‘ | 平均值 |
‘min’ | 最小值 |
stddev" | 样本标准差 |
‘count’ | 总的个数 |
’25%‘ | 百分比的任意近似百分位数。’25%‘是个例子 |
实例
train_data.select('总面积','客厅面积','卧室面积','总价/万元','单价').summary().show()
select挑选哪些列,然后这些列进行summary()函数
运行结果:
summary()方法源码展示
def summary(self, *statistics):
"""Computes specified statistics for numeric and string columns. Available statistics are:
- count
- mean
- stddev
- min
- max
- arbitrary approximate percentiles specified as a percentage (eg, 75%)
If no statistics are given, this function computes count, mean, stddev, min,
approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
.. note:: This function is meant for exploratory data analysis, as we make no
guarantee about the backward compatibility of the schema of the resulting DataFrame.
>>> df.summary().show()
+-------+------------------+-----+
|summary| age| name|
+-------+------------------+-----+
| count| 2| 2|
| mean| 3.5| null|
| stddev|2.1213203435596424| null|
| min| 2|Alice|
| 25%| 2| null|
| 50%| 2| null|
| 75%| 5| null|
| max| 5| Bob|
+-------+------------------+-----+
>>> df.summary("count", "min", "25%", "75%", "max").show()
+-------+---+-----+
|summary|age| name|
+-------+---+-----+
| count| 2| 2|
| min| 2|Alice|
| 25%| 2| null|
| 75%| 5| null|
| max| 5| Bob|
+-------+---+-----+
To do a summary for specific columns first select them:
>>> df.select("age", "name").summary("count").show()
+-------+---+----+
|summary|age|name|
+-------+---+----+
| count| 2| 2|
+-------+---+----+
See also describe for basic statistics.
"""
if len(statistics) == 1 and isinstance(statistics[0], list):
statistics = statistics[0]
jdf = self._jdf.summary(self._jseq(statistics))
return DataFrame(jdf, self.sql_ctx)
@ignore_unicode_prefix
@since(1.3)
sort()方法
源码展示
def sort(self, *cols, **kwargs):
"""Returns a new :class:`DataFrame` sorted by the specified column(s).
:param cols: list of :class:`Column` or column names to sort by.
:param ascending: boolean or list of boolean (default True).
Sort ascending vs. descending. Specify list for multiple sort orders.
If a list is specified, length of the list must equal length of the `cols`.
>>> df.sort(df.age.desc()).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.sort("age", ascending=False).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.orderBy(df.age.desc()).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> from pyspark.sql.functions import *
>>> df.sort(asc("age")).collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.orderBy(desc("age"), "name").collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
>>> df.orderBy(["age", "name"], ascending=[0, 1]).collect()
[Row(age=5, name=u'Bob'), Row(age=2, name=u'Alice')]
"""
jdf = self._jdf.sort(self._sort_cols(cols, kwargs))
return DataFrame(jdf, self.sql_ctx)
orderBy = sort
实例
#排序
sort=train_data.sort(tag, ascending=False).collect()
#输出top20
for i in sort:
print(i)
if i==sort[19]:
break
tag是列索引,这里可以替换成’总面积’,输出top20,仿照源码编写,最后实现关于总面积的top20输出