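Once data cleaning is done, most of the remaining work is reorganizing the data to fit each analysis: counting how often a category occurs, for example, or computing statistics separately for each level. That is where splitting, applying, and combining come in.
Today we look at how pandas supports this workflow through groupby.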
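- Split: the criterion used to divide the data into groups
- Apply: the calculation run on each group
- Combine: merging the per-group results into a single result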
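The three steps above can be sketched by hand and compared with a single groupby call. This is a minimal illustration on made-up data; the column names `grade` and `boy` and the variable names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({
    "grade": ["A", "B", "A", "B", "C"],
    "boy": [10, 12, 14, 16, 18],
})

# Split: one sub-DataFrame per distinct grade
pieces = {key: part for key, part in df.groupby("grade")}

# Apply: run the same calculation on every piece
totals = {key: int(part["boy"].sum()) for key, part in pieces.items()}

# Combine: gather the per-group results into one object
manual = pd.Series(totals, name="boy").rename_axis("grade")

# groupby followed by an aggregation performs all three steps at once
direct = df.groupby("grade")["boy"].sum()

print(manual)
print(direct)
```

Both printed Series should hold the same per-grade totals; the manual version only makes the intermediate steps visible.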
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<no_default>, observed=False, dropna=True)
by:mapping, function, label, or list of labels
Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If a list or ndarray of length equal to the selected axis is passed (see the groupby user guide), the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
axis:{0 or ‘index’, 1 or ‘columns’}, default 0
Split along rows (0) or columns (1).
level:int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
as_index:bool, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sort:bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
group_keys:bool, default True
When calling apply, add group keys to index to identify pieces.
squeeze:bool, default False
Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
Deprecated since version 1.1.0.
observed:bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
dropna:bool, default True
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
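To make two of these parameters concrete, here is a small sketch of how `as_index` and `dropna` change the result. The data and variable names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; one row has a missing group key
df = pd.DataFrame({
    "grade": ["A", "B", np.nan, "A"],
    "boy": [10, 12, 14, 16],
})

# Default (as_index=True): the group labels become the index of the result
indexed = df.groupby("grade")["boy"].sum()

# as_index=False: the group labels stay in an ordinary column ("SQL-style")
flat = df.groupby("grade", as_index=False)["boy"].sum()

# dropna=False: rows whose key is NaN form a group of their own
with_na = df.groupby("grade", dropna=False)["boy"].sum()

print(indexed)
print(flat)
print(with_na)
```

With the defaults, the row whose grade is missing is silently left out of the totals, which is worth keeping in mind when the sums do not add up to the column total.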
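In short, the main parameters can be summarized as follows:
- by: accepts a mapping, function, label, or list of labels; determines the grouping criterion. Defaults to None.
- axis: accepts an int or axis name; the axis to split along. Defaults to 0, which splits along rows.
- level: accepts an int or a level name; for a MultiIndex, the index level to group by.
- as_index: whether the group labels become the index of the aggregated output. Defaults to True.
- sort: whether to sort the group keys. Defaults to True.
- group_keys: whether to add the group keys to the index when calling apply. Defaults to True.
- squeeze: whether to reduce the dimensionality of the result where possible. Deprecated since version 1.1.0.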
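As a rough illustration of `sort` and `level`, the sketch below groups the same made-up data once by a column with `sort=False` and once by a level of a MultiIndex. The data and variable names are assumptions for the example.

```python
import pandas as pd

# Hypothetical sample data for illustration only
df = pd.DataFrame({
    "lessons": ["Ma", "Ch", "Ma", "Ch"],
    "grade": ["A", "A", "B", "B"],
    "boy": [1, 2, 3, 4],
})

# sort=False: groups appear in order of first occurrence (Ma, then Ch)
unsorted = df.groupby("lessons", sort=False)["boy"].sum()

# level: group by one level of a MultiIndex instead of a column
indexed = df.set_index(["lessons", "grade"])
by_level = indexed.groupby(level="lessons")["boy"].sum()

print(unsorted)
print(by_level)
```

The totals are identical in both results; only the order of the groups and the way the key is supplied differ.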
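Aggregation methods available on the grouped object include:
- count: number of non-NA values in each group
- sum: sum of the non-NA values
- mean: mean of the non-NA values
- median: median of the non-NA values
- std, var: unbiased standard deviation and variance (denominator n-1)
- min, max: minimum and maximum of the non-NA values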
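The following sketch applies these methods to a small fixed dataset so the results can be checked by hand. The column name `score` and the values are made up for the example; one value is missing to show that NA is skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; group A contains one missing value
df = pd.DataFrame({
    "grade": ["A", "A", "A", "B", "B"],
    "score": [1.0, 2.0, np.nan, 4.0, 8.0],
})
g = df.groupby("grade")["score"]

count = g.count()    # A: 2 (the NaN is not counted), B: 2
total = g.sum()      # A: 3.0, B: 12.0
mean = g.mean()      # A: 1.5, B: 6.0
median = g.median()  # A: 1.5, B: 6.0
var = g.var()        # denominator n-1, so A: 0.5, B: 8.0
std = g.std()        # square root of the variance above
low, high = g.min(), g.max()

print(pd.DataFrame({"count": count, "sum": total, "mean": mean,
                    "median": median, "var": var, "std": std,
                    "min": low, "max": high}))
```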
import pandas as pd
import numpy as np

# Build a DataFrame of courses, grades, and the number of boys and girls
data = {"lessons": ['Ch', 'En', 'Ch', 'En', 'Ch', 'Ma', 'En', 'Ma'],
        "grade": ['A', 'A', 'B', 'B', 'C', 'A', 'C', 'B'],
        "boy": np.random.randint(10, 20, 8),  # 8 random integers in [10, 20)
        "girl": np.random.randint(10, 20, 8)}
d_1 = pd.DataFrame(data)
print(d_1)

# groupby alone computes nothing; printing it shows only the DataFrameGroupBy object's repr
print(d_1.groupby(by="lessons"))

# Total boys and girls in each grade (numeric columns selected explicitly)
group1 = d_1.groupby(by="grade")
print(group1[["boy", "girl"]].sum())

group2 = d_1.groupby(by="lessons")
print(group2.count())  # non-NA values per column in each lessons group
print(group2.size())   # number of rows in each lessons group

# Total boys and girls for each (lessons, grade) combination
group3 = d_1.groupby(by=["lessons", "grade"])
print(group3.sum())

# Same grouping, restricted to the number of boys
print(d_1.groupby(by=["lessons", "grade"])["boy"].sum())
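Because the example above uses random data without missing values, count and size return the same numbers there. The sketch below uses small fixed data with one missing value to show where they differ, and how to read one entry from a two-key result. The data and variable names are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; one "boy" value is missing
df = pd.DataFrame({
    "lessons": ["Ch", "En", "Ch", "En"],
    "grade": ["A", "A", "B", "A"],
    "boy": [10.0, np.nan, 12.0, 14.0],
})

by_lesson = df.groupby("lessons")
sizes = by_lesson.size()           # rows per group, missing values included
counts = by_lesson["boy"].count()  # non-NA values per group

# Two keys produce a MultiIndex; select one combination with a tuple
per_pair = df.groupby(["lessons", "grade"])["boy"].sum()
en_a = per_pair.loc[("En", "A")]

print(sizes)
print(counts)
print(per_pair)
print(en_a)
```

Here the En group has two rows but only one non-missing `boy` value, so size and count disagree for that group.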