[ Pandas version: 1.0.1 ]
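8. Aggregation and Grouping
When analyzing large datasets, a basic task is efficient summarization: computing aggregations such as sum(), mean(), median(), min(), and max(), each of which reduces the data to a single number that describes one aspect of a potentially large dataset.
(1) Simple Aggregation in Pandas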
import numpy as np
import pandas as pd
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
# 0 0.374540
# 1 0.950714
# 2 0.731994
# 3 0.598658
# 4 0.156019
# dtype: float64
ser.sum() # 2.811925491708157
ser.mean() # 0.5623850983416314
df = pd.DataFrame({
'A': rng.rand(5), 'B': rng.rand(5)})
# A B
# 0 0.155995 0.020584
# 1 0.058084 0.969910
# 2 0.866176 0.832443
# 3 0.601115 0.212339
# 4 0.708073 0.181825
df.mean()
# A 0.477888
# B 0.443420
# dtype: float64
df.mean(axis='columns')
# 0 0.088290
# 1 0.513997
# 2 0.849309
# 3 0.406727
# 4 0.444949
# dtype: float64
# Planets dataset
planets = pd.read_csv('./seaborn-data-master/planets.csv')
planets.shape # (1035, 6)
planets.head()
planets.dropna().describe()
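To make the dropna().describe() pattern concrete without the CSV file, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and are not taken from the planets dataset. describe() already skips missing values column by column; calling dropna() first removes every row that has any missing value, so all columns are summarized over the same set of rows.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), with one missing entry per column
toy = pd.DataFrame({
    "mass": [1.0, np.nan, 3.0, 5.0],
    "distance": [10.0, 20.0, 30.0, np.nan],
})

# Drop rows containing any NaN: only rows 0 and 2 remain
clean = toy.dropna()

# Summary statistics computed over the remaining complete rows
stats = clean.describe()
print(stats)
```

Without dropna(), the "count" row of describe() would differ between columns, because each column would be summarized over its own non-missing values.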
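Pandas aggregation methods
Aggregation | Description |
---|---|
count() | Total number of items |
first(), last() | First and last item |
mean(), median() | Mean and median |
min(), max() | Minimum and maximum |
std(), var() | Standard deviation and variance |
mad() | Mean absolute deviation (available in pandas 1.0.1; removed in pandas 2.0) |
prod() | Product of all items |
sum() | Sum of all items |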
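As a quick illustration, the sketch below calls a subset of these methods on a small Series whose values are chosen so the results are easy to check by hand. The variable names are arbitrary. mad() is left out because it is not available in recent pandas releases.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

summary = {
    "count": values.count(),    # number of non-missing items
    "mean": values.mean(),
    "median": values.median(),
    "min": values.min(),
    "max": values.max(),
    "std": values.std(),        # sample standard deviation (ddof=1 by default)
    "var": values.var(),        # sample variance (ddof=1 by default)
    "prod": values.prod(),
    "sum": values.sum(),
}

for name, result in summary.items():
    print(f"{name}: {result}")
```

Note that std() and var() in pandas default to the sample versions (ddof=1), which differs from NumPy's default of ddof=0.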
(2) GroupBy: Split, Apply, Combine
# pandas.DataFrame.groupby — pandas 1.0.3 documentation
DataFrame.groupby(self, by=None, axis=0, level=None, as_index: bool = True, sort: bool = True, group_keys: bool = True, squeeze: bool = False, observed: bool = False) → 'groupby_generic.DataFrameGroupBy'
Group DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
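The split, apply, combine steps described above can be sketched on a small DataFrame. The column names "key" and "data" are assumptions chosen for illustration. The dictionary comprehension is only a hand-written equivalent that makes the three steps visible; it is not how pandas implements groupby internally.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data": range(6),
})

# Split rows by "key", apply sum() to each group, combine into one Series
totals = df.groupby("key")["data"].sum()
print(totals)

# Hand-written equivalent: iterate over (group name, sub-DataFrame) pairs
manual = {name: group["data"].sum() for name, group in df.groupby("key")}
print(manual)
```

Iterating over a GroupBy object yields one (key, sub-DataFrame) pair per group, which is a convenient way to inspect the split step on its own.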
Parameters:
by: mapping, functi