AI Learning: Splitting and Combining Data with python.pandas (Day 6)

定个小目标：1亿行代码

Article link: https://blog.csdn.net/weixin_52241563/article/details/125478558

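Once data cleaning is done, most of the remaining work is reorganizing the data to fit each analysis: counting how often a category appears, for example, or computing statistics separately for each level. This involves splitting the data, applying a computation to each part, and recombining the results.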

Today we look at the pandas functionality that supports this.

Split: the criterion used to form the groups
Apply: the computation run on each group
Combine: merging the per-group results into a single output
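
The three steps above can be sketched by hand and compared with a single `groupby` call. This is a minimal illustration, not how pandas implements grouping internally; the DataFrame `df` and its `key` and `value` columns are made up for the example.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: one sub-DataFrame per distinct key.
pieces = {key: df[df["key"] == key] for key in df["key"].unique()}

# Apply: compute one result per piece.
sums = {key: part["value"].sum() for key, part in pieces.items()}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(sums, name="value")

# groupby performs the same three steps in one expression.
direct = df.groupby("key")["value"].sum()

print(manual)
print(direct)
```

Both results hold the same per-key totals; `groupby` simply avoids the manual bookkeeping.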

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<no_default>, observed=False, dropna=True)

by: mapping, function, label, or list of labels

Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If a list or ndarray of length equal to the selected axis is passed (see the groupby user guide), the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
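
As a rough sketch of the different kinds of `by` described above, the snippet below groups the same data by a column label, by a function applied to each index value, and by a dict that maps index values to group names. The data, index labels, and group names are all hypothetical.

```python
import pandas as pd

# Hypothetical data with a string index so that function and dict keys have something to act on.
df = pd.DataFrame({"city": ["x", "y", "x"], "n": [1, 2, 3]},
                  index=["a1", "a2", "b1"])

# Label: group by the values of the "city" column.
by_label = df.groupby("city")["n"].sum()

# Function: called on each index value; here the first character of the label.
by_func = df.groupby(lambda idx: idx[0])["n"].sum()

# Dict: the dict VALUES name the groups, the keys are index labels.
by_dict = df.groupby({"a1": "first", "a2": "rest", "b1": "rest"})["n"].sum()

print(by_label, by_func, by_dict, sep="\n\n")
```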

axis: {0 or ‘index’, 1 or ‘columns’}, default 0

Split along rows (0) or columns (1).

level: int, level name, or sequence of such, default None

If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
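
A small sketch of `level` on a hierarchical index, assuming a Series whose two index levels are named `lessons` and `grade` (names borrowed from the example later in this article; the values are invented):

```python
import pandas as pd

# Hypothetical Series with a two-level MultiIndex.
index = pd.MultiIndex.from_tuples(
    [("Ch", "A"), ("Ch", "B"), ("En", "A")],
    names=["lessons", "grade"],
)
s = pd.Series([1, 2, 3], index=index)

# Group by a level name ...
by_name = s.groupby(level="lessons").sum()

# ... or by a level position (1 is the second level, "grade").
by_pos = s.groupby(level=1).sum()

print(by_name)
print(by_pos)
```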

as_index: bool, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
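
To make the difference concrete, here is a minimal comparison of the two `as_index` settings on invented data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"grade": ["A", "B", "A"], "boy": [10, 11, 12]})

# Default: the group labels become the index of the result.
indexed = df.groupby("grade").sum()

# as_index=False: the group labels stay an ordinary column, as in a SQL GROUP BY result.
flat = df.groupby("grade", as_index=False).sum()

print(indexed)
print(flat)
```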

sort: bool, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
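
A short sketch of the effect of `sort` on the order of group keys, using made-up data:

```python
import pandas as pd

# Hypothetical data whose keys first appear in the order C, A, B.
df = pd.DataFrame({"grade": ["C", "A", "C", "B"], "boy": [1, 2, 3, 4]})

# Default: group keys are sorted.
sorted_keys = df.groupby("grade")["boy"].sum()

# sort=False: group keys keep the order in which they first appear.
first_seen = df.groupby("grade", sort=False)["boy"].sum()

print(sorted_keys)
print(first_seen)
```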

group_keys: bool, default True

When calling apply, add group keys to index to identify pieces.

squeeze: bool, default False

Reduce the dimensionality of the return type if possible, otherwise return a consistent type.

Deprecated since version 1.1.0.

observed: bool, default False

This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
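
The following sketch shows `observed` on a categorical grouper. The category `C` is declared but never appears in the invented data; `observed` is passed explicitly in both calls so the example does not rely on the default.

```python
import pandas as pd

# Hypothetical data: "C" is a declared category with no rows.
df = pd.DataFrame({
    "grade": pd.Categorical(["A", "B", "A"], categories=["A", "B", "C"]),
    "boy": [1, 2, 3],
})

# observed=False: every declared category gets a group, even an empty one.
all_categories = df.groupby("grade", observed=False)["boy"].sum()

# observed=True: only categories that actually occur in the data.
seen_only = df.groupby("grade", observed=True)["boy"].sum()

print(all_categories)
print(seen_only)
```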

dropna: bool, default True

If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.
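
A minimal sketch of `dropna`, assuming a key column with missing values (the data is invented):

```python
import pandas as pd

# Hypothetical data: two rows have a missing group key.
df = pd.DataFrame({"grade": ["A", None, "A", None], "boy": [1, 2, 3, 4]})

# Default dropna=True: rows whose key is NA are left out.
dropped = df.groupby("grade")["boy"].sum()

# dropna=False: NA is treated as a group key of its own.
kept = df.groupby("grade", dropna=False)["boy"].sum()

print(dropped)
print(kept)
```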

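In short, the main parameters can be summarized as follows:

by: accepts a label, a list of labels, a mapping, or a function; determines how rows are assigned to groups. The default is None.
axis: accepts 0 or 1 (or 'index' / 'columns'); the axis to split along. The default 0 splits along rows.
level: accepts an int or a level name; when the axis is a MultiIndex, the level to group by.
as_index: whether the group labels become the index of the aggregated output. The default is True.
sort: whether to sort the group keys. The default is True.
group_keys: whether to add the group keys to the index when calling apply. The default is True.
squeeze: whether to reduce the dimensionality of the result where possible. Deprecated since version 1.1.0.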

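Common aggregation methods on the grouped object include:

count: number of non-NA values in each group
sum: sum of the non-NA values
mean: mean of the non-NA values
median: median of the non-NA values
std, var: unbiased (denominator n-1) standard deviation and variance
min, max: minimum and maximum of the non-NA values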
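
As a quick check of how these methods treat missing values, the sketch below runs several of them on a small invented column that contains one NA. The data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data: group "A" has one missing value.
df = pd.DataFrame({"grade": ["A", "A", "B", "B"],
                   "boy": [10.0, None, 12.0, 16.0]})
g = df.groupby("grade")["boy"]

counts = g.count()   # non-NA values per group
totals = g.sum()     # NA is skipped
means = g.mean()     # NA is skipped
stds = g.std()       # denominator n-1; a group with a single value yields NaN

# agg runs several aggregations at once and returns one column per function.
summary = g.agg(["count", "sum", "mean", "min", "max"])
print(summary)
```

For group `B` the standard deviation is sqrt(((12-14)^2 + (16-14)^2) / (2-1)) = sqrt(8), which confirms the n-1 denominator.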

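import numpy as np
import pandas as pd

# Build a DataFrame holding course, grade, and the number of boys and girls
data = {"lessons": ['Ch', 'En', 'Ch', 'En', 'Ch', 'Ma', 'En', 'Ma'],
        "grade": ['A', 'A', 'B', 'B', 'C', 'A', 'C', 'B'],
        "boy": np.random.randint(10, 20, 8),   # 8 random integers in [10, 20)
        "girl": np.random.randint(10, 20, 8)}
d_1 = pd.DataFrame(data)
print(d_1)
# Printing a groupby result shows the DataFrameGroupBy object (with its memory address), not the groups
print(d_1.groupby(by="lessons"))
group1 = d_1.groupby(by="grade")
# Select the numeric columns explicitly so the text column "lessons" is left out of the sum
print(group1[["boy", "girl"]].sum())    # total boys and girls per grade
group2 = d_1.groupby(by="lessons")
print(group2.count())                   # non-NA values per column in each course
print(group2.size())                    # number of rows in each group
# Group by course and grade together; column labels keep both keys out of the summed columns
group3 = d_1.groupby(by=["lessons", "grade"])
print(group3.sum())                     # boys and girls per course and grade
print(group3["boy"].sum())              # boys only, per course and grade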