Basic DataFrame methods

Latest recommended article published on 2024-06-12 21:10:34

黄佳俊


Article link: https://blog.csdn.net/weixin_48419914/article/details/120360172


Common DataFrame methods:

Basic mathematical operations

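df.count() # number of non-NA values
df.min() # minimum
df.max() # maximum
df.idxmin() # index label of the minimum, similar to which.min in R
df.idxmax() # index label of the maximum, similar to which.max in R
df.quantile(0.1) # 10% quantile
df.sum() # sum
df.mean() # mean
df.median() # median
df.mode() # mode
df.var() # variance
df.std() # standard deviation
(df - df.mean()).abs().mean() # mean absolute deviation; replaces df.mad(), which was removed in pandas 2.0
df.skew() # skewness
df.kurt() # kurtosis
df.describe() # several descriptive statistics at once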
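
To make the list above concrete, here is a minimal sketch that runs a few of these methods on a small frame. The `scores` data and the variable names are made up for illustration; they are not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
scores = pd.DataFrame({"math": [90.0, 80.0, np.nan, 70.0],
                       "physics": [60.0, 75.0, 85.0, 100.0]})

non_null = scores.count()      # non-NA values per column
min_pos = scores.idxmin()      # index label of each column's minimum
col_mean = scores.mean()       # NaN is skipped by default (skipna=True)
col_median = scores.median()

# Mean absolute deviation around the mean, computed by hand
# because DataFrame.mad() is not available in pandas 2.0 and later.
mean_abs_dev = (scores - scores.mean()).abs().mean()

print(non_null)
print(min_pos)
print(mean_abs_dev)
```

Each call returns a Series indexed by column name, so the results can be compared or combined column by column.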

More complex functionality: grouped statistics

df.groupby('Person').sum()
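
As a rough illustration of what the groupby call does, the sketch below groups a small frame by 'Person' and aggregates each group. The `sales` frame and its 'Amount' column are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical sample data for illustration.
sales = pd.DataFrame({"Person": ["John", "Myla", "John", "Myla", "Lewis"],
                      "Amount": [10, 20, 30, 40, 50]})

# Split the rows by the value in 'Person', then sum each group's columns.
totals = sales.groupby("Person").sum()
print(totals)

# Several statistics per group in one call.
summary = sales.groupby("Person")["Amount"].agg(["count", "mean", "max"])
print(summary)
```

The group keys become the index of the result, sorted by default.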

pandas.DataFrame.count

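Purpose

Counts the non-NA cells for each column or each row.

Parameters

1. axis: {0 or 'index', 1 or 'columns'}, default 0

If 0 or 'index', a count is generated for each column. If 1 or 'columns', a count is generated for each row.

2. level: int or str, optional

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame. A str specifies the level name. This parameter was deprecated in pandas 1.3 and removed in pandas 2.0.

3. numeric_only: bool, default False

Include only float, int or boolean data.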
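
The effect of axis and numeric_only can be sketched as follows. The `people` frame is invented for this illustration and is not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration.
people = pd.DataFrame({"Name": ["Ann", None, "Cid"],
                       "Age": [30.0, np.nan, 25.0],
                       "Member": [True, False, True]})

per_column = people.count()                 # axis=0: one count per column
per_row = people.count(axis=1)              # axis=1: one count per row
numeric = people.count(numeric_only=True)   # leaves out the text column 'Name'

print(per_column)
print(per_row)
print(numeric)
```

Both None and NaN are treated as missing, while False is an ordinary value and is counted.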

Examples

1. Build a DataFrame

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"Person":
...                    ["John", "Myla", "Lewis", "John", "Myla"],
...                    "Age": [24., np.nan, 21., 33, 26],
...                    "Single": [False, True, True, True, False]})
>>> df
   Person   Age  Single
0    John  24.0   False
1    Myla   NaN    True
2   Lewis  21.0    True
3    John  33.0    True
4    Myla  26.0   False

2. Count the non-NA values in each column

>>> df.count()
Person    5
Age       4
Single    5
dtype: int64

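3. Count the non-NA values in each row

>>> df.count(axis='columns')
0    3
1    2
2    3
3    3
4    3
dtype: int64
Note: axis='columns' is equivalent to axis=1 and produces one count per row; axis=0 (or 'index') produces one count per column.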

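4. Count along one level of a MultiIndex (requires pandas earlier than 2.0, because the level argument has since been removed)

>>> df.set_index(["Person", "Single"]).count(level="Person")
        Age
Person
John      2
Lewis     1
Myla      1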
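
On pandas 2.0 and later, where count no longer accepts level, grouping on the index level should give the same per-person counts. This is a sketch of that alternative, not part of the original example; the names `indexed` and `by_person` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Person": ["John", "Myla", "Lewis", "John", "Myla"],
                   "Age": [24., np.nan, 21., 33, 26],
                   "Single": [False, True, True, True, False]})

# Move 'Person' and 'Single' into a two-level index, then count the
# non-NA 'Age' values within each 'Person' group.
indexed = df.set_index(["Person", "Single"])
by_person = indexed.groupby(level="Person").count()
print(by_person)
```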

Supplementary notes on set_index

The set_index method sets a single or composite index on a DataFrame from existing columns.

DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

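Parameters:

keys: label or array-like or list of labels/arrays. The column(s) to use as the index; either a single column name or several column names.
drop: bool, default True. Delete the columns that are used as the new index.
append: bool, default False. Whether to append the new columns to the existing index instead of replacing it.
inplace: bool, default False. Whether to modify the DataFrame in place rather than returning a new one.
verify_integrity: bool, default False. Check the new index for duplicates. Otherwise, the check is deferred until necessary. Setting this to False improves the performance of the method.

Note: with drop=False, the columns used for the index are also kept as ordinary columns. With inplace=True, df itself is modified and the method returns None.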
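
The parameters above can be tried out with a short sketch. It reuses the month/year/sale data from the official example; the variable names (`dropped`, `kept`, `appended`, `duplicate_error`) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"month": [1, 4, 7, 10],
                   "year": [2012, 2014, 2013, 2014],
                   "sale": [55, 40, 84, 31]})

dropped = df.set_index("month")                # 'month' becomes the index only
kept = df.set_index("month", drop=False)       # 'month' is the index and still a column
appended = df.set_index("year", append=True)   # original index kept as the first level

# verify_integrity=True raises ValueError here because 2014 appears twice in 'year'.
try:
    df.set_index("year", verify_integrity=True)
    duplicate_error = False
except ValueError:
    duplicate_error = True

print(dropped.columns.tolist())
print(kept.columns.tolist())
print(appended.index.nlevels)
print(duplicate_error)
```

Since inplace is left at its default of False, every call returns a new DataFrame and df itself is unchanged.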

Example from the official documentation:

import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})


# Use a single column as the index
df.set_index('month')
'''
       year  sale
month
1      2012    55
4      2014    40
7      2013    84
10     2014    31
'''
# Build a composite index from two columns
df.set_index(['year', 'month'])
'''
            sale
year  month
2012  1     55
2014  4     40
2013  7     84
2014  10    31
'''
# Build a composite index from a custom Index and a column
df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
'''
         month  sale
   year
1  2012  1      55
2  2014  4      40
3  2013  7      84
4  2014  10     31
'''
# Build a composite index from two custom Series
s = pd.Series([1, 2, 3, 4])
df.set_index([s, s**2])
'''
      month  year  sale
1 1       1  2012    55
2 4       4  2014    40
3 9       7  2013    84
4 16     10  2014    31
'''