Feature Analysis of Data with the Pandas Library

cd_sywe

Category: Python Data Analysis. Tags: Feature analysis of data with the Pandas library

Article link: https://blog.csdn.net/cd_sywe/article/details/103002063

Sorting Data with Pandas

1. The sort_index() method

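The .sort_index() method sorts by index labels along the specified axis. The defaults are axis 0 and ascending order.
.sort_index(axis=0, ascending=True): ascending=True sorts in increasing order.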

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(20).reshape(4,5), index = ['c','a','d','b'])
a
    0    1   2   3   4
c   0    1   2   3   4
a   5    6   7   8   9
d  10   11  12  13  14
b  15   16  17  18  19

# By default, sort the axis-0 (row) index in ascending order
a.sort_index()
    0   1   2   3   4
a   5   6   7   8   9
b  15  16  17  18  19
c   0   1   2   3   4
d  10  11  12  13  14
# Request descending order
a.sort_index(ascending=False)
    0   1   2   3   4
d  10  11  12  13  14
c   0   1   2   3   4
b  15  16  17  18  19
a   5   6   7   8   9
# Sort the axis-1 (column) labels in descending order
a.sort_index(axis=1, ascending=False)
    4   3   2   1   0
c   4   3   2   1   0
a   9   8   7   6   5
d  14  13  12  11  10
b  19  18  17  16  15
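
One point the session above does not show directly is that sort_index() returns a new object and leaves the original frame unchanged. A minimal sketch, reusing the same 4x5 frame (the variable names row_order, original_order and col_order are only for illustration):

```python
import numpy as np
import pandas as pd

# Same frame as in the session above: row labels deliberately out of order.
a = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])

# sort_index() returns a new, sorted DataFrame ...
sorted_rows = a.sort_index()
row_order = list(sorted_rows.index)        # ['a', 'b', 'c', 'd']

# ... and the original keeps its row order.
original_order = list(a.index)             # ['c', 'a', 'd', 'b']

# Sorting the column labels (axis=1) in descending order.
sorted_cols = a.sort_index(axis=1, ascending=False)
col_order = [int(c) for c in sorted_cols.columns]   # [4, 3, 2, 1, 0]
```

To keep the sorted result, assign it to a name, as done with sorted_rows here.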

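2. The sort_values() method

The .sort_values() method sorts by value along the specified axis. The defaults are axis 0 and ascending order.

Series.sort_values(axis=0, ascending=True)

DataFrame.sort_values(by, axis=0, ascending=True)

by: a label or list of labels to sort by (column labels when axis=0, row labels when axis=1)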

b = pd.DataFrame(np.arange(20).reshape(4,5), index = ['a','b','c','d'])
b
    0   1   2   3   4
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
# On the default axis 0, sort the rows in descending order by the values in the column labeled 2
b.sort_values(2, ascending=False)
    0   1   2   3   4
d  15  16  17  18  19
c  10  11  12  13  14
b   5   6   7   8   9
a   0   1   2   3   4
# On axis 1, sort the columns in descending order by the values in the row labeled 'a'
b.sort_values('a',axis=1, ascending=False)
    4   3   2   1   0
a   4   3   2   1   0
b   9   8   7   6   5
c  14  13  12  11  10
d  19  18  17  16  15
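
Since by also accepts a list of labels, a frame can be sorted on several keys at once. The sketch below uses made-up sample data (the 'group' and 'score' columns are hypothetical, not from the article) and passes one ascending flag per key:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {'group': ['x', 'y', 'x', 'y'], 'score': [3, 1, 2, 4]},
    index=['a', 'b', 'c', 'd'],
)

# Sort by 'group' ascending, then by 'score' descending within each group.
result = df.sort_values(by=['group', 'score'], ascending=[True, False])
order = list(result.index)   # ['a', 'c', 'd', 'b']
```

Rows 'a' and 'c' (group 'x') come first with the higher score leading, followed by 'd' and 'b' (group 'y').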

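3. Handling missing values when sorting

By default, NaN values are placed at the end of the sorted result, whether the sort is ascending or descending.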

b = pd.DataFrame(np.arange(20).reshape(4,5), index = ['a','b','c','d'])
a = pd.DataFrame(np.arange(12).reshape(3,4), index = ['a','b','c'])
c = b-a
c
     0    1    2    3   4
a  0.0  0.0  0.0  0.0 NaN
b  1.0  1.0  1.0  1.0 NaN
c  2.0  2.0  2.0  2.0 NaN
d  NaN  NaN  NaN  NaN NaN

c.sort_values(2, ascending = False)
     0    1    2    3   4
c  2.0  2.0  2.0  2.0 NaN
b  1.0  1.0  1.0  1.0 NaN
a  0.0  0.0  0.0  0.0 NaN
d  NaN  NaN  NaN  NaN NaN
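
The placement of NaN is controlled by the na_position parameter of sort_values(), whose default is 'last'. A small sketch on a Series with one missing value (the data is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 0.0, 1.0], index=['a', 'b', 'c', 'd'])

# Default na_position='last': NaN ends up last for both sort directions.
asc_order = list(s.sort_values().index)                    # ['c', 'd', 'a', 'b']
desc_order = list(s.sort_values(ascending=False).index)    # ['a', 'd', 'c', 'b']

# na_position='first' moves NaN to the front instead.
first_order = list(s.sort_values(na_position='first').index)   # ['b', 'c', 'd', 'a']
```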

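Basic Statistical Analysis of Data with Pandas

Applicable to both Series and DataFrame:

Method	Description
.sum()	Sum of the data, computed along axis 0 by default (the same applies to the methods below)
.count()	Number of non-NaN values
.mean() .median()	Arithmetic mean and median of the data
.var() .std()	Variance and standard deviation of the data
.min() .max()	Minimum and maximum of the data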
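
As a quick illustration of the methods in the table, the sketch below applies a few of them to the 4x5 frame used throughout this article. Passing axis=1 computes per row instead of per column. The small Series with a NaN is made up to show what count() skips.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['a', 'b', 'c', 'd'])

col_sums = b.sum().tolist()          # one sum per column: [30, 34, 38, 42, 46]
row_sums = b.sum(axis=1).tolist()    # one sum per row:    [10, 35, 60, 85]
col_means = b.mean().tolist()        # [7.5, 8.5, 9.5, 10.5, 11.5]

# count() ignores NaN values.
non_missing = pd.Series([1.0, np.nan, 3.0]).count()   # 2
```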

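Methods applicable to Series:

Method	Description
.argmin() .argmax()	Position of the minimum and maximum values (returns the automatic integer position)
.idxmin() .idxmax()	Index of the minimum and maximum values (returns the custom index label)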
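
The difference between the two pairs is easiest to see on a Series with a custom index. This sketch assumes pandas 1.0 or later, where Series.argmin() and Series.argmax() return integer positions:

```python
import pandas as pd

a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

pos_min = a.argmin()     # integer position of the minimum: 3
label_min = a.idxmin()   # index label of the minimum: 'd'
pos_max = a.argmax()     # integer position of the maximum: 0
label_max = a.idxmax()   # index label of the maximum: 'a'
```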

Applicable to both Series and DataFrame:

Method	Description
.describe()	Summary statistics along axis 0 (one summary per column)

1. The .describe() method on a Series:

a = pd.Series([9,8,7,6], index=['a','b','c','d'])
a
a    9
b    8
c    7
d    6
dtype: int64

# describe() reports all the summary statistics in one call
a.describe()
count    4.000000
mean     7.500000
std      1.290994
min      6.000000
25%      6.750000
50%      7.500000
75%      8.250000
max      9.000000
dtype: float64

# describe() returns a Series here, so Series methods and indexing apply to the result.
type(a.describe())
<class 'pandas.core.series.Series'>
a.describe()['count']
4.0
a.describe()['min']
6.0

2. The .describe() method on a DataFrame:

b = pd.DataFrame(np.arange(20).reshape(4,5), index = ['c','a','d','b'])
b.describe()
               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000

# describe() returns a DataFrame here, so DataFrame methods and indexing apply to the result.
type(b.describe())
<class 'pandas.core.frame.DataFrame'>
b.describe()[2]
count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64

# Note how a single row of a DataFrame is selected by label: use .loc (the older .ix indexer was removed in pandas 1.0)
b.describe().loc['max']
0    15.0
1    16.0
2    17.0
3    18.0
4    19.0
Name: max, dtype: float64
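
Label-based selection also works for a single cell of the summary. A minimal sketch, assuming a pandas version where .loc accepts a (row label, column label) pair, checks one describe() value against the corresponding direct method:

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])
summary = b.describe()

# One whole row of the summary, selected by its label.
max_row = summary.loc['max'].tolist()     # [15.0, 16.0, 17.0, 18.0, 19.0]

# One cell: the mean of the column labeled 2.
mean_col2 = summary.loc['mean', 2]        # 9.5

# The 'std' row matches DataFrame.std(), which uses the sample standard deviation.
std_matches = abs(summary.loc['std', 0] - b[0].std()) < 1e-12
```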

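Cumulative Statistical Analysis of Data with Pandas

Applicable to both Series and DataFrame:

Method	Description
.cumsum()	Running sum of the first 1, 2, …, n elements
.cumprod()	Running product of the first 1, 2, …, n elements
.cummax()	Running maximum of the first 1, 2, …, n elements
.cummin()	Running minimum of the first 1, 2, …, n elements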

a = pd.DataFrame(np.arange(20).reshape(4,5), index = ['a','b','c','d'])
a
    0   1   2   3   4
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
a.cumsum()
    0   1   2   3   4
a   0   1   2   3   4
b   5   7   9  11  13
c  15  18  21  24  27
d  30  34  38  42  46
a.cumprod()
   0     1     2     3     4
a  0     1     2     3     4
b  0     6    14    24    36
c  0    66   168   312   504
d  0  1056  2856  5616  9576
a.cummax()
    0   1   2   3   4
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
a.cummin()
   0  1  2  3  4
a  0  1  2  3  4
b  0  1  2  3  4
c  0  1  2  3  4
d  0  1  2  3  4
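
The same methods can be followed more easily on a short Series, where each output element depends only on the elements up to that position. The values below are made up for illustration:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])

running_sum = s.cumsum().tolist()     # [3, 4, 8, 9, 14]
running_prod = s.cumprod().tolist()   # [3, 3, 12, 12, 60]
running_max = s.cummax().tolist()     # [3, 3, 4, 4, 5]
running_min = s.cummin().tolist()     # [3, 1, 1, 1, 1]
```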

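Rolling (window) calculations, applicable to both Series and DataFrame:

Method	Description
.rolling(w).sum()	Sum of each window of w adjacent elements
.rolling(w).mean()	Arithmetic mean of each window of w adjacent elements
.rolling(w).var()	Variance of each window of w adjacent elements
.rolling(w).std()	Standard deviation of each window of w adjacent elements
.rolling(w).min() .rolling(w).max()	Minimum and maximum of each window of w adjacent elements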

b = pd.DataFrame(np.arange(20).reshape(4,5), index = ['a','b','c','d'])
b
    0   1   2   3   4
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19

# With w=2, each row is added to the row above it. The first row has no row above it, so its window is incomplete and the result is NaN
b.rolling(2).sum()
      0     1     2     3     4
a   NaN   NaN   NaN   NaN   NaN
b   5.0   7.0   9.0  11.0  13.0
c  15.0  17.0  19.0  21.0  23.0
d  25.0  27.0  29.0  31.0  33.0

b.rolling(3).sum()
      0     1     2     3     4
a   NaN   NaN   NaN   NaN   NaN
b   NaN   NaN   NaN   NaN   NaN
c  15.0  18.0  21.0  24.0  27.0
d  30.0  33.0  36.0  39.0  42.0
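
The leading NaN rows appear because a window yields a value only once it holds w observations. The min_periods parameter of rolling() relaxes that requirement. A small sketch on a Series holding the values of column 0 from the frame above:

```python
import pandas as pd

s = pd.Series([0, 5, 10, 15], index=['a', 'b', 'c', 'd'])

# Default: the first window is incomplete, so its result is NaN.
full = s.rolling(2).sum()
full_is_nan = full.isna().tolist()                       # [True, False, False, False]

# min_periods=1 lets incomplete windows produce a value.
partial = s.rolling(2, min_periods=1).sum().tolist()     # [0.0, 5.0, 15.0, 25.0]

# Rolling mean over the same windows (skipping the leading NaN).
means = s.rolling(2).mean().tolist()[1:]                 # [2.5, 7.5, 12.5]
```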
