Pandas functions and methods by category: parameter reference and worked examples

Latest recommended article published on 2024-06-11 22:16:02

沐锹

Category column: Data Analysis. Tags: pandas, classification, python

Article link: https://blog.csdn.net/weixin_52850085/article/details/126213240

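Pandas descriptive statistics (function usage)

Function index (each can be applied along rows or along columns)

count()            Number of non-null values
sum()              Sum of all values
mean()             Arithmetic mean of the values
median()           Median of the values
mode()             Mode (most frequent value)
std()              Standard deviation of the values
min()              Minimum value
max()              Maximum value
abs()              Absolute value
prod()             Product of the values
cumsum()           Cumulative sum
cumprod()          Cumulative product
describe()         Summary statistics for the columns of a DataFrame

The include parameter of describe()
include tells describe() which columns to summarize. It accepts a list of dtypes; by default only numeric columns are summarized.
object – summarize string columns
number – summarize numeric columns
all – summarize all columns together (pass the string 'all', not a list)

Example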

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np

# Create a dictionary of Series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

# Create a DataFrame
df = pd.DataFrame(d)
print("Original data:")
print(df)
print("……………………………………………………………………")
print("sum (per column):")
print(df.sum())
print("……………………………………………………………………")
print("sum of each row (axis=1):")
# numeric_only=True skips the string column; without it, recent pandas
# versions may raise a TypeError when adding numbers and strings
print(df.sum(axis=1, numeric_only=True))
print("……………………………………………………………………")
print("cumsum (cumulative sum):")
print(df.cumsum())
print("……………………………………………………………………")
print("describe (summary):")
print(df.describe())
print("……………………………………………………………………")
print("include=['object'] summarizes string columns:")
print(df.describe(include=['object']))
print("……………………………………………………………………")
print("include=['number'] summarizes numeric columns:")
print(df.describe(include=['number']))
print("……………………………………………………………………")
print("include='all' summarizes all columns together:")
print(df.describe(include='all'))


'''
Original data:
    Age    Name  Rating
0    25     Tom    4.23
1    26   James    3.24
2    25   Ricky    3.98
3    23     Vin    2.56
4    30   Steve    3.20
5    29   Minsu    4.60
6    23    Jack    3.80
7    34     Lee    3.78
8    40   David    2.98
9    30  Gasper    4.80
10   51  Betina    4.10
11   46  Andres    3.65
……………………………………………………………………
sum (per column):
Age                                                     382
Name      TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...
Rating                                                44.92
dtype: object
……………………………………………………………………
sum of each row (axis=1):
0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64
……………………………………………………………………
cumsum (cumulative sum):
    Age                                               Name Rating
0    25                                                Tom   4.23
1    51                                           TomJames   7.47
2    76                                      TomJamesRicky  11.45
3    99                                   TomJamesRickyVin  14.01
4   129                              TomJamesRickyVinSteve  17.21
5   158                         TomJamesRickyVinSteveMinsu  21.81
6   181                     TomJamesRickyVinSteveMinsuJack  25.61
7   215                  TomJamesRickyVinSteveMinsuJackLee  29.39
8   255             TomJamesRickyVinSteveMinsuJackLeeDavid  32.37
9   285       TomJamesRickyVinSteveMinsuJackLeeDavidGasper  37.17
10  336  TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...  41.27
11  382  TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...  44.92
……………………………………………………………………
describe (summary):
             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000
……………………………………………………………………
include=['object'] summarizes string columns:
         Name
count      12
unique     12
top     Ricky
freq        1
……………………………………………………………………
include=['number'] summarizes numeric columns:
             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000
……………………………………………………………………
include='all' summarizes all columns together:
              Age   Name     Rating
count   12.000000     12  12.000000
unique        NaN     12        NaN
top           NaN  Ricky        NaN
freq          NaN      1        NaN
mean    31.833333    NaN   3.743333
std      9.232682    NaN   0.661628
min     23.000000    NaN   2.560000
25%     25.000000    NaN   3.230000
50%     29.500000    NaN   3.790000
75%     35.500000    NaN   4.132500
max     51.000000    NaN   4.800000
'''

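Applying functions in Pandas

To apply your own functions, or functions from other libraries, to Pandas objects, there are three methods. Which one to use depends on whether the function expects to operate on the whole DataFrame (pipe()), on each row or column (apply()), or on each element (applymap(), renamed DataFrame.map() in pandas 2.1).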
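The three levels of function application can be sketched as follows. This is a minimal illustration, not code from the original article: the DataFrame, the helper add_offset and the variable names are made up for the example, and the element-wise step uses Series.map on every column so that it does not depend on whether the installed pandas calls the method applymap() or DataFrame.map().

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def add_offset(frame, offset):
    """Hypothetical table-wise helper: receives the whole DataFrame."""
    return frame + offset


# pipe(): hand the whole DataFrame to a function
shifted = df.pipe(add_offset, 100)

# apply(): call a function once per column (axis=0) or once per row (axis=1)
col_range = df.apply(lambda col: col.max() - col.min())
row_total = df.apply(lambda row: row.sum(), axis=1)

# Element-wise: map a scalar function over every value of every column
doubled = df.apply(lambda col: col.map(lambda x: x * 2))

print(shifted)
print(col_range)
print(row_total)
print(doubled)
```

pipe() is mostly useful for chaining several table-level steps, while apply() covers the common per-column and per-row cases.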

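Iteration in Pandas

The behavior of basic iteration over a Pandas object depends on its type. A Series is treated as array-like, so iterating over it yields its values. Other data structures, such as DataFrame and Panel, follow the dict-like convention of iterating over the object's keys. In short, basic iteration (for i in object) yields the following. [Source: 极客教程]

Series – values

DataFrame – column labels

Panel – item labels (Panel was removed in pandas 0.25)

Do not try to modify an object while iterating over it. Iteration is meant for reading: the iterator generally returns a copy of the original data rather than a view, so changes made to it are not reflected in the original object.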
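The warning above can be checked with a small experiment. This is a sketch under stated assumptions: the two-column DataFrame is made up for the illustration, and it deliberately mixes a string column with an integer column, the case in which each row produced by iterrows() is a separate copy.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "age": [10, 20]})

# Writing to the row yielded by iterrows() changes only a temporary copy
for _, row in df.iterrows():
    row["age"] = row["age"] + 1

after_loop = df["age"].tolist()
print(after_loop)      # the DataFrame still holds the original ages

# To change the data, assign through the DataFrame itself
df["age"] = df["age"] + 1
after_assign = df["age"].tolist()
print(after_assign)
```

Vectorized assignment like the last step is also much faster than any row-by-row loop.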

Iterating over a DataFrame yields column names

# -*- coding: UTF-8 -*-

import pandas as pd
s3=pd.Series(['狗蛋','大黄','老公鸡','大熊'],index=['row1','row2','row3','row4'])
s4=pd.Series([10,20,30,40],index=['row1','row2','row3','row4'])
x2={'name':s3,'age':s4}
dx2=pd.DataFrame(x2)
print("Original data:")
print(dx2)
print("……………………………………")
print("Column names:")
for col in dx2:
    print(col)
'''
Original data:
     name  age
row1   狗蛋   10
row2   大黄   20
row3  老公鸡   30
row4   大熊   40
……………………………………
Column names:
name
age
'''

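Iterating over the rows (or columns) of a DataFrame

items() – iterate over (key, value) pairs, one per column; this method was called iteritems() before pandas 2.0, which removed the old name
iterrows() – iterate over the rows as (index, Series) pairs
itertuples() – iterate over the rows as namedtuples

items() (formerly iteritems())

Iterates over the columns, yielding each column label as the key and the column's values as a Series.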

# -*- coding: UTF-8 -*-
import pandas as pd
s3=pd.Series(['狗蛋','大黄','老公鸡','大熊'],index=['row1','row2','row3','row4'])
s4=pd.Series([10,20,30,40],index=['row1','row2','row3','row4'])
x2={'name':s3,'age':s4}
dx2=pd.DataFrame(x2)
print("Original data:")
print(dx2)
print("……………………………………")
print("items iteration:")
# items() replaces iteritems(), which was removed in pandas 2.0
for key,value in dx2.items():
    print(key,value)
'''
Original data:
     name  age
row1   狗蛋   10
row2   大黄   20
row3  老公鸡   30
row4   大熊   40
……………………………………
items iteration:
name row1     狗蛋
row2     大黄
row3    老公鸡
row4     大熊
Name: name, dtype: object
age row1    10
row2    20
row3    30
row4    40
Name: age, dtype: int64
'''

iterrows()

iterrows() returns an iterator that yields each index value together with a Series containing that row's data.

# -*- coding: UTF-8 -*-
import pandas as pd
s3=pd.Series(['狗蛋','大黄','老公鸡','大熊'],index=['row1','row2','row3','row4'])
s4=pd.Series([10,20,30,40],index=['row1','row2','row3','row4'])
x2={'name':s3,'age':s4}
dx2=pd.DataFrame(x2)
print("Original data:")
print(dx2)
print("……………………………………")
print("iterrows iteration:")
for row_index,row in dx2.iterrows():
    print(row_index,row)
'''
Original data:
     name  age
row1   狗蛋   10
row2   大黄   20
row3  老公鸡   30
row4   大熊   40
……………………………………
iterrows iteration:
row1 name    狗蛋
age     10
Name: row1, dtype: object
row2 name    大黄
age     20
Name: row2, dtype: object
row3 name    老公鸡
age      30
Name: row3, dtype: object
row4 name    大熊
age     40
Name: row4, dtype: object
'''

itertuples()

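itertuples() returns an iterator that yields a named tuple for each row of the DataFrame. The first element of the tuple is the row's index value; the remaining elements are the row's values.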

# -*- coding: UTF-8 -*-
import pandas as pd
s3=pd.Series(['狗蛋','大黄','老公鸡','大熊'],index=['row1','row2','row3','row4'])
s4=pd.Series([10,20,30,40],index=['row1','row2','row3','row4'])
x2={'name':s3,'age':s4}
dx2=pd.DataFrame(x2)
print("Original data:")
print(dx2)
print("……………………………………")
print("itertuples iteration:")
for row in dx2.itertuples():
    print(row)
'''
Original data:
     name  age
row1   狗蛋   10
row2   大黄   20
row3  老公鸡   30
row4   大熊   40
……………………………………
itertuples iteration:
Pandas(Index='row1', name='狗蛋', age=10)
Pandas(Index='row2', name='大黄', age=20)
Pandas(Index='row3', name='老公鸡', age=30)
Pandas(Index='row4', name='大熊', age=40)
'''
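Because itertuples() yields named tuples, the fields can be read by name rather than by position. The following is a small sketch; the two-row DataFrame and the variable names are invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Lee"], "age": [25, 34]},
                  index=["row1", "row2"])

labels = []
for row in df.itertuples():
    # Fields are attributes: Index first, then one attribute per column
    labels.append(f"{row.Index}: {row.name} is {row.age}")
print(labels)

# index=False drops the index field; name=None yields plain tuples
plain = list(df.itertuples(index=False, name=None))
print(plain)
```

Attribute access only works for column names that are valid Python identifiers; other columns are renamed positionally by pandas.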

Sorting in Pandas

Original data

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
data=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['two','one'])
print(data)
'''
        two       one
1  1.132523 -1.097201
4  1.020550 -0.641788
6 -0.769821 -1.463691
2  2.224695 -0.570571
3 -0.485055  1.116554
5 -0.152900 -0.311590
9  1.214404 -0.554556
8  0.519785  0.980446
0 -1.503853 -0.092038
7 -1.755938  0.085069
'''

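Sorting by label

Use the sort_index() method, passing the axis argument and the sort order, to sort a DataFrame by its labels. By default, the row labels are sorted in ascending order.

sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)

axis: 0 sorts by row labels; 1 sorts by column labels. Default 0.
level: default None; if not None, sort on the values of the specified index level(s).
ascending: default True (ascending order); False sorts in descending order.
inplace: default False; if True, the object is sorted in place instead of a sorted copy being returned.
kind: sorting algorithm, one of {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'. It rarely needs changing.
na_position: where missing values are placed, {'first', 'last'}; default 'last'.
ignore_index: default False, which keeps the existing index labels; if True, the result is relabeled 0, 1, …, n-1.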

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
data=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['two','one'])
print("Original data:")
print(data)
print("……………………………………")
data2=data.sort_index()
print("Sorted by label:")
print(data2)

'''
Original data:
        two       one
1  0.327346 -1.371823
4  0.766698  0.774780
6  0.554645  0.724602
2  1.695929 -0.188749
3 -1.301285 -0.361423
5 -1.109693 -0.021815
9  0.509077  0.184829
8  0.571038  0.149157
0 -0.320515  0.620416
7  0.517924  0.202861
……………………………………
Sorted by label:
        two       one
0 -0.320515  0.620416
1  0.327346 -1.371823
2  1.695929 -0.188749
3 -1.301285 -0.361423
4  0.766698  0.774780
5 -1.109693 -0.021815
6  0.554645  0.724602
7  0.517924  0.202861
8  0.571038  0.149157
9  0.509077  0.184829
'''

Sort order

Pass a Boolean to the ascending parameter to control the sort order.

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
data=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['two','one'])
print("Original data:")
print(data)
print("……………………………………")
data2=data.sort_index(ascending=False)
print("Sorted by label, descending:")
print(data2)
'''
Original data:
        two       one
1  1.240027  1.000773
4  0.011837  1.452669
6 -1.513732  0.197287
2  0.127446  1.168875
3  0.642374 -0.491269
5  0.244226 -1.115241
9  1.611361  0.282201
8 -0.271991 -1.638814
0  0.070785 -1.034654
7  0.712523  0.067568
……………………………………
Sorted by label, descending:
        two       one
9  1.611361  0.282201
8 -0.271991 -1.638814
7  0.712523  0.067568
6 -1.513732  0.197287
5  0.244226 -1.115241
4  0.011837  1.452669
3  0.642374 -0.491269
2  0.127446  1.168875
1  1.240027  1.000773
0  0.070785 -1.034654
'''

Sorting by column label

Passing axis=1 sorts by the column labels. The default, axis=0, sorts by the row labels.

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
data=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['two','one'])
print("Original data:")
print(data)
print("……………………………………")
data2=data.sort_index(axis=1)
print("Sorted by column label:")
print(data2)
'''
Original data:
        two       one
1 -0.098167 -1.337630
4 -1.256009  0.980084
6  1.523998 -0.256460
2  1.431966  0.413947
3  1.358575 -0.161808
5 -0.520922 -0.289887
9 -2.076201  0.497662
8  0.825859 -1.329819
0  0.786688 -1.401064
7 -0.123275 -1.397867
……………………………………
Sorted by column label:
        one       two
1 -1.337630 -0.098167
4  0.980084 -1.256009
6 -0.256460  1.523998
2  0.413947  1.431966
3 -0.161808  1.358575
5 -0.289887 -0.520922
9  0.497662 -2.076201
8 -1.329819  0.825859
0 -1.401064  0.786688
7 -1.397867 -0.123275
'''

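Sorting by value

Like index sorting, sort_values() sorts by value. It takes a by argument naming the DataFrame column (or columns) whose values determine the order.

sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)

by: the column (or list of columns) to sort by; with axis=1, the row label(s) to sort by
axis: axis=0 or 'index' sorts the rows by the values in the given column(s)
      axis=1 or 'columns' sorts the columns by the values in the given row(s); default axis=0
ascending: True for ascending order, False for descending; a list gives one setting per sort key
inplace: whether to replace the original data with the sorted data; default False, i.e. a sorted copy is returned
kind='quicksort': sorting algorithm; it rarely needs changing
    quicksort: quicksort
    mergesort: merge sort (stable)
    heapsort: heapsort
na_position: where missing values are placed; default 'last' (at the end), 'first' puts them at the beginning
ignore_index: default False, which keeps the existing index labels; if True, the result is relabeled 0, 1, …, n-1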

# -*- coding: UTF-8 -*-
import pandas as pd
data =pd.DataFrame([[2,3,12],[6,2,8],[9,5,7]], 
                 index=["0", "2", "1"], 
                 columns=["col_a", "col_c", "col_b"])
print("Original data:")
print(data)
# sort_values() returns a sorted copy, so print its return value;
# data itself is left unchanged
print("Sorted by the values of one column:")
print(data.sort_values(by='col_c'))
print("Sorted by several columns:")
print(data.sort_values(by=['col_b','col_a']))
print("col_b descending, then col_a ascending:")
print(data.sort_values(by=['col_b','col_a'],axis=0,ascending=[False,True]))
print("Columns ordered by the values in row '2', ascending:")
print(data.sort_values(by='2',axis=1))
'''
Original data:
   col_a  col_c  col_b
0      2      3     12
2      6      2      8
1      9      5      7
Sorted by the values of one column:
   col_a  col_c  col_b
2      6      2      8
0      2      3     12
1      9      5      7
Sorted by several columns:
   col_a  col_c  col_b
1      9      5      7
2      6      2      8
0      2      3     12
col_b descending, then col_a ascending:
   col_a  col_c  col_b
0      2      3     12
2      6      2      8
1      9      5      7
Columns ordered by the values in row '2', ascending:
   col_c  col_a  col_b
0      3      2     12
2      2      6      8
1      5      9      7
'''

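Sorting algorithm

kind='quicksort': the sorting algorithm; it rarely needs changing. For a DataFrame, kind only takes effect when sorting on a single column or label.

quicksort: quicksort
mergesort: merge sort, the stable choice (rows with equal keys keep their original relative order)
heapsort: heapsort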

# -*- coding: UTF-8 -*-
import pandas as pd
data =pd.DataFrame([[2,3,12],[6,2,8],[9,5,7]], 
                 index=["0", "2", "1"], 
                 columns=["col_a", "col_c", "col_b"])
print("Original data:")
print(data)
print("Mergesort algorithm:")
# kind must be spelled in lowercase; print the sorted copy, not data
data2=data.sort_values(by='col_c',kind='mergesort')
print(data2)
'''
Original data:
   col_a  col_c  col_b
0      2      3     12
2      6      2      8
1      9      5      7
Mergesort algorithm:
   col_a  col_c  col_b
2      6      2      8
0      2      3     12
1      9      5      7
'''
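The practical difference between the algorithms is stability, which only shows up when the sort key contains ties. The sketch below uses a made-up DataFrame with repeated keys to show that mergesort keeps tied rows in their original relative order; the column names group and label are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"group": [2, 1, 2, 1, 1],
                   "label": ["a", "b", "c", "d", "e"]})

# A stable sort keeps rows with equal keys in their original order:
# within group 1 the labels stay b, d, e; within group 2 they stay a, c
stable = df.sort_values(by="group", kind="mergesort")
print(stable)
```

With the default quicksort the order of tied rows is not guaranteed, so a stable kind is the safer choice when sorting repeatedly by different keys.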

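The take() function

With numpy.random.permutation() it is easy to reorder (shuffle) the rows of a Series or DataFrame. The example below first creates a DataFrame whose elements are integers in ascending order.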

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
data=pd.DataFrame(np.arange(25).reshape(5,5))
print("Original data:")
print(data)
# permutation(5) returns the integers 0-4 in random order; take() then
# returns the rows at those positions, in that order
print("Rows reordered with a random permutation:")
data2=np.random.permutation(5)
print(data2)
print(data.take(data2))
# take() also accepts a subset of positions and returns only those rows
print("Only some of the rows, in the given order:")
data3=[3,4,2]
print(data.take(data3))
'''
Original data:
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24
Rows reordered with a random permutation:
[2 1 0 3 4]
    0   1   2   3   4
2  10  11  12  13  14
1   5   6   7   8   9
0   0   1   2   3   4
3  15  16  17  18  19
4  20  21  22  23  24
Only some of the rows, in the given order:
    0   1   2   3   4
3  15  16  17  18  19
4  20  21  22  23  24
2  10  11  12  13  14
'''

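Pandas strings and text data (functions and methods)

Pandas provides a set of string functions that make it easy to operate on string data. Most importantly, these functions skip NaN values. Almost all of them mirror Python's built-in string methods, and they are reached through the .str accessor of a Series or Index (for example, s.str.lower()).

lower()                 Convert the strings in the Series/Index to lowercase
upper()                 Convert the strings in the Series/Index to uppercase
len()                   Compute the length of each string
strip()                 Remove whitespace (including newlines) from both ends of each string
split(' ')              Split each string on the given pattern
cat(sep=' ')            Join the elements with the given separator
get_dummies()           Return a DataFrame of one-hot encoded values
contains(pattern)       Return True for each element that contains the pattern
replace(a,b)            Replace a with b
repeat(value)           Repeat each element the given number of times
count(pattern)          Return the number of occurrences of the pattern in each element
startswith(pattern)     Return True if the element starts with the pattern
endswith(pattern)       Return True if the element ends with the pattern
find(pattern)           Return the position of the first occurrence of the pattern
findall(pattern)        Return a list of all occurrences of the pattern
swapcase()              Swap uppercase and lowercase letters
islower()               Check whether all characters in each string are lowercase; returns a Boolean
isupper()               Check whether all characters in each string are uppercase; returns a Boolean
isnumeric()             Check whether all characters in each string are numeric; returns a Boolean

Examples

lower(), upper(), len()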

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveMinsu'])
print("Original data:")
print (s)
print("lower():")
print(s.str.lower())
print("upper():")
print(s.str.upper())
print("len():")
print(s.str.len())
'''
Original data:
0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveMinsu
dtype: object
lower():
0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      steveminsu
dtype: object
upper():
0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVEMINSU
dtype: object
len():
0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64
'''

strip(): remove leading and trailing whitespace

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np

s = pd.Series(['      Tom', 'William Rick      ', np.nan, '1234','SteveMinsu'])
print("Original data:")
print (s)
print("strip():")
print(s.str.strip())
'''
Original data:
0                   Tom
1    William Rick      
2                   NaN
3                  1234
4            SteveMinsu
dtype: object
strip():
0             Tom
1    William Rick
2             NaN
3            1234
4      SteveMinsu
dtype: object
'''

split(): split strings on the given pattern (whitespace by default)

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np

s = pd.Series(['yebiyebi', 'William Rick', np.nan, '131234','CaiCap'])
print("Original data:")
print (s)
print("split():")
print(s.str.split())
'''
Original data:
0        yebiyebi
1    William Rick
2             NaN
3          131234
4          CaiCap
dtype: object
split():
0         [yebiyebi]
1    [William, Rick]
2                NaN
3           [131234]
4           [CaiCap]
dtype: object
'''

cat(sep=pattern): join the elements with the given separator

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np

s = pd.Series(['yebiyebi', 'William Rick', np.nan, '131234','CaiCap'])
print("Original data:")
print (s)
print("cat(sep=pattern):")
# Raw string, so the backslash in the separator is taken literally
print (s.str.cat(sep=r' <<\--$$$--/>> '))
'''
Original data:
0        yebiyebi
1    William Rick
2             NaN
3          131234
4          CaiCap
dtype: object
cat(sep=pattern):
yebiyebi <<\--$$$--/>> William Rick <<\--$$$--/>> 131234 <<\--$$$--/>> CaiCap
'''

repeat(value): repeat each element

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np

s = pd.Series(['yebiyebi', 'William Rick', np.nan, '131234','CaiCap'])
print("Original data:")
print (s)
print("repeat:")
print (s.str.repeat(2))
'''
Original data:
0        yebiyebi
1    William Rick
2             NaN
3          131234
4          CaiCap
dtype: object
repeat:
0            yebiyebiyebiyebi
1    William RickWilliam Rick
2                         NaN
3                131234131234
4                CaiCapCaiCap
dtype: object
'''

findall(pattern)

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np

s = pd.Series(['yebiyebi', 'William Rick', np.nan, '131234','CaiCap'])
print("Original data:")
print (s)
print("findall:")
print (s.str.findall('i'))
'''
Original data:
0        yebiyebi
1    William Rick
2             NaN
3          131234
4          CaiCap
dtype: object
findall:
0       [i, i]
1    [i, i, i]
2          NaN
3           []
4          [i]
dtype: object
'''
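Several methods from the index (contains, startswith, replace, count, find, swapcase, isnumeric, get_dummies) have no example above. The following sketch exercises them on a small made-up Series; the sample strings and variable names are assumptions for illustration. na=False is passed to the Boolean tests so that missing values are reported as False instead of being propagated.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Tom", "william rick", np.nan, "1234", "Tom_Lee"])

has_tom = s.str.contains("Tom", na=False)            # substring test
starts_t = s.str.startswith("T", na=False)
replaced = s.str.replace("Tom", "Tim", regex=False)  # literal replacement
i_count = s.str.count("i")                           # occurrences per element
m_pos = s.str.find("m")                              # -1 when not found
swapped = s.str.swapcase()
numeric = s.str.isnumeric()
# One indicator column per distinct value
dummies = pd.Series(["a", "b", "a"]).str.get_dummies()

for name, result in [("contains", has_tom), ("startswith", starts_t),
                     ("replace", replaced), ("count", i_count),
                     ("find", m_pos), ("swapcase", swapped),
                     ("isnumeric", numeric)]:
    print(name, result.tolist())
print(dummies)
```

Passing regex=False to replace() avoids surprises when the search text contains characters that are special in regular expressions.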

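Statistical functions

count           Number of non-null values
sum             Sum of the values
mean            Arithmetic mean
mad             Mean absolute deviation (removed in pandas 2.0)
median          Median
min             Minimum
max             Maximum
mode            Mode
abs             Absolute value
prod            Product of the values
std             Standard deviation
var             Variance
sem             Standard error of the mean
skew            Skewness
kurt            Kurtosis
quantile        Quantile
cumsum          Cumulative sum
cumprod         Cumulative product
cummax          Cumulative maximum
cummin          Cumulative minimum
cov()           Covariance
corr()          Correlation coefficient
rank()          Rank of the data
pct_change()    Percentage change

pct_change(): percentage change

This function compares each element with the previous element and computes the fractional change between them.

By default it works down each column; pass axis=1 to work across each row.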

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
s=pd.Series([1,2,3,4,5,4])
print(s)
print(s.pct_change())
'''
0    1
1    2
2    3
3    4
4    5
5    4
dtype: int64
0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64
'''

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
s=pd.DataFrame(np.random.randn(3,4))
print("Original data:")
print(s)
print("After pct_change():")
print(s.pct_change())
print("Applied across rows (axis=1):")
print(s.pct_change(axis=1))
'''
Original data:
          0         1         2         3
0 -0.223930 -0.205909 -1.235244  1.051225
1 -0.408899 -1.794809  1.202108 -0.375630
2  0.575691  1.214627  0.362256  0.349045
After pct_change():
          0         1         2         3
0       NaN       NaN       NaN       NaN
1  0.826012  7.716504 -1.973175 -1.357326
2 -2.407906 -1.676744 -0.698649 -1.929227
Applied across rows (axis=1):
    0         1         2         3
0 NaN -0.080475  4.998971 -1.851027
1 NaN  3.389370 -1.669769 -1.312476
2 NaN  1.109859 -0.701755 -0.036469
'''
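The values above are fractions rather than percentages: 1.0 means a 100% increase. As a minimal check, reusing the Series from the first pct_change() example, the same numbers can be reproduced by dividing each element by its predecessor; the variable names manual and builtin are chosen for this sketch.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 4])

# (current - previous) / previous, written with shift()
manual = s / s.shift(1) - 1
builtin = s.pct_change()

print(manual.round(6).tolist())
print(builtin.round(6).tolist())
```

The first element has no predecessor, so both versions start with NaN.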

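Covariance (cov)

Covariance applies to Series data. A Series has a cov() method that computes the covariance between two Series. NA values are excluded automatically.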

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
s1=pd.Series([1,2,3,4,5])
s2=pd.Series([6,7,8,9,10])
print(s1.cov(s2))
'''
2.5
'''

When applied to a DataFrame, cov() computes the pairwise covariance between all columns.

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
data=pd.DataFrame(np.random.randn(10,5))
print("Original data:")
print(data)
print("Covariance:")
print(data.cov())
'''
Original data:
          0         1         2         3         4
0  0.248519 -0.318176 -0.438136  0.824835 -2.118278
1 -0.830909 -0.214343 -0.668916  0.077291 -1.142607
2  0.361157  0.100038  0.774078 -2.000748 -0.784875
3  0.237667 -0.659196  1.028732 -0.481548  0.107821
4 -1.156160  0.415080  0.092217 -0.164445 -0.888647
5 -1.163534 -0.671134 -0.533112  0.462044 -0.247828
6  0.611452 -0.155465 -0.028276  1.549388  0.450246
7  0.069735 -1.270305 -0.434300 -1.034437 -0.787866
8  0.232204 -1.331670  0.338220  0.096134 -0.866169
9  1.973164  0.966405  1.205722 -1.157814  0.473820
Covariance:
          0         1         2         3         4
0  0.880180  0.208508  0.424085 -0.258226  0.308181
1  0.208508  0.508265  0.196497 -0.143112  0.143010
2  0.424085  0.196497  0.463182 -0.381367  0.292146
3 -0.258226 -0.143112 -0.381367  1.071948 -0.051840
4  0.308181  0.143010  0.292146 -0.051840  0.633394
'''
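The function index also lists corr(), which has no example of its own. The sketch below reuses s1 and s2 from the Series covariance example, adds a made-up reversed Series s3, recomputes the covariance by hand from the sample formula, and then shows the correlation coefficient for a perfectly increasing and a perfectly decreasing relationship.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([6, 7, 8, 9, 10])
s3 = pd.Series([5, 4, 3, 2, 1])   # hypothetical: s1 reversed

# Sample covariance: sum((x - mean_x) * (y - mean_y)) / (n - 1)
manual_cov = ((s1 - s1.mean()) * (s2 - s2.mean())).sum() / (len(s1) - 1)
print(manual_cov, s1.cov(s2))

# Pearson correlation: covariance scaled by both standard deviations
corr_up = s1.corr(s2)
corr_down = s1.corr(s3)
print(corr_up, corr_down)
```

Unlike covariance, the correlation coefficient does not depend on the scale of the data and always lies between -1 and 1.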

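Ranking data

Ranking produces a rank for each element of an array. Tied values are assigned the average of their ranks by default.

ascending defaults to True (rank 1 goes to the smallest value); set ascending=False to rank in descending order.

Several tie-breaking methods are supported:

average: the default; tied values are assigned the average of their ranks

min: tied values are assigned the lowest rank in the group

max: tied values are assigned the highest rank in the group

first: tied values are ranked in the order in which they appear in the array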

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
s=pd.Series(np.random.randn(5),index=list('abcde'))
print("Original data:")
print(s)
print(s.rank())
'''
Original data:
a   -0.441999
b   -0.894847
c    0.244536
d    1.348017
e   -0.716308
dtype: float64
a    3.0
b    1.0
c    4.0
d    5.0
e    2.0
dtype: float64
'''

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
s=pd.Series(np.random.randn(5),index=list('abcde'))
print("Original data:")
s['d']=s['a']
print(s)
print(s.rank(method='first'))
'''
Original data:
a   -0.912937
b    0.470113
c   -0.293490
d   -0.912937
e   -1.225552
dtype: float64
a    2.0
b    5.0
c    4.0
d    3.0
e    1.0
dtype: float64
'''
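Because the examples above use random data, the effect of each tie-breaking method is easier to see on fixed values. The sketch below ranks a small made-up Series containing one tie with every method side by side; the data and the column names of the result are chosen for this illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 10, 30], index=list("abcd"))

ranks = pd.DataFrame({
    "average": s.rank(),                    # ties share the mean rank: 1.5
    "min": s.rank(method="min"),            # ties both get rank 1
    "max": s.rank(method="max"),            # ties both get rank 2
    "first": s.rank(method="first"),        # ties ranked by position: 1, 2
    "descending": s.rank(ascending=False),  # largest value gets rank 1
})
print(ranks)
```

Only the rows for a and c differ between the methods, since they hold the tied value 10.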

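Window functions

To make numeric data easier to work with, Pandas provides several kinds of window functions: rolling windows (rolling), expanding windows (expanding) and exponentially weighted windows (ewm).

A simple use case: given ten days of sales figures, a window function can compute the total for every run of three consecutive days.

rolling(): rolling window

Syntax

rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)

Parameters

window          Required. The size of the window, i.e. the number of observations in each window
min_periods     Minimum number of observations a window needs to produce a value; defaults to the value of window, and windows with fewer observations yield NaN
center          Boolean, default False; if True, the window label is set at the center of the window
win_type        String, default None; the type of window (weighting) function to apply
on              Optional; for a DataFrame, the name of the column on which to calculate the window
axis            Default 0, i.e. the window moves down each column (deprecated in recent pandas releases)
closed          Which ends of the window interval are closed; defaults to 'right', and can also be 'left', 'both' or 'neither'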

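Aggregation methods

The object returned by rolling() works with the usual aggregations such as mean, count and sum. Early pandas releases provided these as separate top-level functions (rolling_count(), rolling_sum(), rolling_mean() and so on); those functions have since been removed, and the same operations are now methods of the rolling object:

rolling(...).count()                 Number of non-NA observations in each window
rolling(...).sum()                   Sum of the elements in each window
rolling(...).mean()                  Mean of the elements in each window
rolling(...).median()                Median of the elements in each window
rolling(...).var()                   Variance of the elements in each window
rolling(...).std()                   Standard deviation of the elements in each window
rolling(...).min()                   Minimum of the elements in each window
rolling(...).max()                   Maximum of the elements in each window
rolling(...).corr()                  Correlation coefficient within each window
rolling(...).corr(pairwise=True)     Correlation coefficients of all column pairs within each window
rolling(...).cov()                   Covariance within each window
rolling(...).quantile()              Quantile of the elements in each window

Example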

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np

# Generate a time series
data=pd.DataFrame(np.random.randn(10,5),
                index=pd.date_range('1/7/2022',periods=10),
                columns=['A','B','C','D','E'])
print("Original data:")
print(data)
print("\nPer column, mean of every three consecutive values:")
# Per column, mean of every three consecutive values
print(data.rolling(window=3).mean())
'''
Original data:
                   A         B         C         D         E
2022-01-07  0.542684 -1.635823 -1.802155 -1.980247 -1.767874
2022-01-08  0.473660  0.238196  0.635087  0.095407  2.917256
2022-01-09 -0.686964  1.065335  0.111492  0.188647 -1.547853
2022-01-10  0.624000 -0.989019  0.952622  0.535373  1.777557
2022-01-11 -0.081070 -2.004308  0.130385  0.629720  1.847073
2022-01-12  0.576814 -1.625438 -0.241784  0.554110  1.218151
2022-01-13  0.821037  0.547141  0.687803  1.036830 -0.384649
2022-01-14  0.227098 -0.139544  1.420318 -0.662357 -0.568126
2022-01-15 -0.550632 -0.314651 -0.085709  1.391168 -1.380198
2022-01-16  0.831026 -0.015572  1.173119 -1.844141  0.293887

Per column, mean of every three consecutive values:
                   A         B         C         D         E
2022-01-07       NaN       NaN       NaN       NaN       NaN
2022-01-08       NaN       NaN       NaN       NaN       NaN
2022-01-09  0.109794 -0.110764 -0.351858 -0.565398 -0.132824
2022-01-10  0.136899  0.104837  0.566401  0.273142  1.048987
2022-01-11 -0.048011 -0.642664  0.398166  0.451247  0.692259
2022-01-12  0.373248 -1.539588  0.280408  0.573068  1.614260
2022-01-13  0.438927 -1.027535  0.192135  0.740220  0.893525
2022-01-14  0.541650 -0.405947  0.622112  0.309528  0.088459
2022-01-15  0.165834  0.030982  0.674137  0.588547 -0.777658
2022-01-16  0.169164 -0.156589  0.835910 -0.371776 -0.551479
'''

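window=3 means that, in each column, the mean is computed over every three consecutive values. Where fewer than three values are available the result is NaN, so the first two rows are NaN and results begin at the third row, the first one that satisfies window=3.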
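The sales scenario mentioned at the start of this section can be sketched with rolling(window=3).sum(). The figures below are hypothetical, as are the variable names; the sketch also shows min_periods=1, which produces partial totals instead of NaN for the first two days, and agg(), which computes several aggregations in one pass.

```python
import pandas as pd

# Hypothetical sales figures for ten consecutive days
sales = pd.Series(
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    index=pd.date_range("2022-01-07", periods=10),
)

# Total of each day and the two days before it
three_day_total = sales.rolling(window=3).sum()

# Accept windows that hold fewer than three values
partial_total = sales.rolling(window=3, min_periods=1).sum()

# Several aggregations over the same windows
summary = sales.rolling(window=3).agg(["sum", "mean", "max"])

print(three_day_total)
print(partial_total)
print(summary)
```

Each result is labeled with the last day of its window, which is why the first complete total appears on the third day.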

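expanding(): expanding window

expanding() creates an expanding window and can be applied to a series of data. Specify min_periods=n and then call the appropriate statistical function on the result. For example, min_periods=3 means that a value is produced only once at least three observations are available.

Its parameters are used in the same way as those of rolling(). The difference is that the window length is not fixed: the window keeps growing to include every value seen so far. expanding() is similar to the running total of cumsum(), with the advantage that it supports many more aggregations.

Syntax

expanding(min_periods=1, axis=0)

Older pandas releases also accepted a center argument here; it has since been removed.

Example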

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np

# Generate a time series
data=pd.DataFrame(np.random.randn(10,5),
                index=pd.date_range('1/7/2022',periods=10),
                columns=['A','B','C','D','E'])
print("Original data:")
print(data)
print("\nPer column, expanding mean (min_periods=3):")
print(data.expanding(min_periods=3).mean())
'''
Original data:
                   A         B         C         D         E
2022-01-07 -0.771347 -0.780668  0.451256 -0.149097  0.283219
2022-01-08 -1.464821 -0.911337  0.613680  0.390272  1.106521
2022-01-09 -1.130770  0.301968 -1.765989  0.201855  0.111467
2022-01-10  0.233874 -0.703496 -0.061122  1.040556 -0.635603
2022-01-11 -0.865815 -1.080583 -0.481439  0.951778  1.321507
2022-01-12 -1.218809 -0.203118  0.489582  0.982277 -0.639730
2022-01-13  0.931522  1.195148  0.249427  0.436640 -1.634224
2022-01-14  0.830553  0.596932  0.073513  0.655686 -0.482013
2022-01-15 -0.358706 -0.640338  2.273640 -0.112657 -0.925666
2022-01-16 -0.515777  2.087584 -0.267237  1.549786 -0.835644

Per column, expanding mean (min_periods=3):
                   A         B         C         D         E
2022-01-07       NaN       NaN       NaN       NaN       NaN
2022-01-08       NaN       NaN       NaN       NaN       NaN
2022-01-09 -1.122313 -0.463346 -0.233684  0.147677  0.500403
2022-01-10 -0.783266 -0.523383 -0.190543  0.370897  0.216401
2022-01-11 -0.799776 -0.634823 -0.248723  0.487073  0.437422
2022-01-12 -0.869615 -0.562872 -0.125672  0.569607  0.257897
2022-01-13 -0.612310 -0.311727 -0.072086  0.550612 -0.012406
2022-01-14 -0.431952 -0.198144 -0.053886  0.563746 -0.071107
2022-01-15 -0.423813 -0.247277  0.204728  0.488590 -0.166058
2022-01-16 -0.433010 -0.013791  0.157531  0.594710 -0.233017
'''
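The relationship between expanding() and cumsum() described above is easy to verify on fixed data. This is a minimal sketch with a made-up five-element Series: the expanding sum matches cumsum() exactly, while the expanding mean and maximum show aggregations that cumsum() alone cannot provide.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

expanding_sum = s.expanding().sum()                  # same as s.cumsum()
expanding_mean = s.expanding(min_periods=3).mean()   # NaN until 3 values seen
expanding_max = s.expanding().max()                  # running maximum

print(expanding_sum.tolist())
print(s.cumsum().tolist())
print(expanding_mean.tolist())
print(expanding_max.tolist())
```

With min_periods=3 the first two results are NaN, as in the DataFrame example above; from the third row on, each value is the mean of everything up to and including that row.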

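ewm(): exponentially weighted window

ewm() applies exponentially decaying weights to the elements of the sequence, so that recent values count more than older ones, and an aggregation such as mean() then computes the weighted result. The decay is set by specifying exactly one of com, span, halflife or alpha.

Syntax

ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, adjust=True, ignore_na=False, axis=0)

Parameters

com             Optional; decay in terms of center of mass, alpha = 1 / (1 + com)
span            Optional; decay in terms of span, alpha = 2 / (span + 1)
halflife        Optional; decay in terms of half-life, alpha = 1 - exp(-ln(2) / halflife)
alpha           Optional; the smoothing factor given directly, 0 < alpha <= 1
min_periods     Default 0; minimum number of observations in the window required to produce a value
adjust          Default True; divide by the decaying sum of weights so that the early values are not biased
ignore_na       Default False; if True, missing values are ignored when the weights are calculated
axis            The axis to work along (rows or columns)

Example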

# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np

# Generate a time series
data=pd.DataFrame(np.random.randn(10,5),
                index=pd.date_range('1/7/2022',periods=10),
                columns=['A','B','C','D','E'])
print("Original data:")
print(data)
print("\newm exponentially weighted mean:")
print(data.ewm(com=0.5).mean())
'''
Original data:
                   A         B         C         D         E
2022-01-07  0.753775 -2.286938  0.084993 -0.221862 -0.119122
2022-01-08  0.623465 -1.524564 -1.886333 -0.829565 -0.385675
2022-01-09  0.706225  0.549525  0.700519 -0.369354 -0.610060
2022-01-10 -1.080328  0.148164 -0.997218  0.163784  1.693016
2022-01-11  0.278026 -0.182397 -0.380273  0.758482 -0.696516
2022-01-12 -1.075102  0.249555 -0.026082 -0.610358 -1.881195
2022-01-13  0.408630  0.857240  1.423119  0.393536  1.242594
2022-01-14  0.250480 -2.281525 -0.236896 -0.274785 -0.625786
2022-01-15 -0.452394 -0.161964 -0.990537  0.240394 -2.166892
2022-01-16 -1.020385  0.016318 -0.909341  2.362652  0.401030

ewm exponentially weighted mean:
                   A         B         C         D         E
2022-01-07  0.753775 -2.286938  0.084993 -0.221862 -0.119122
2022-01-08  0.656043 -1.715157 -1.393502 -0.677639 -0.319037
2022-01-09  0.690784 -0.147301  0.056205 -0.464211 -0.520515
2022-01-10 -0.504716  0.052138 -0.654856 -0.040315  0.973619
2022-01-11  0.019268 -0.104865 -0.471044  0.494417 -0.144405
2022-01-12 -0.711314  0.131739 -0.173995 -0.243111 -1.303856
2022-01-13  0.035657  0.615628  0.891235  0.181514  0.394554
2022-01-14  0.178894 -1.316102  0.139033 -0.122732 -0.285776
2022-01-15 -0.241986 -0.546638 -0.614052  0.119365 -1.539917
2022-01-16 -0.760927 -0.171328 -0.810915  1.614915 -0.245930
'''
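The decay parameters are different ways of specifying the same smoothing factor alpha. The sketch below uses a made-up four-element Series to show that com=0.5, alpha=2/3 and span=2 give identical results, and it recomputes the third value by hand from the weights used when adjust=True; all variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

by_com = s.ewm(com=0.5).mean()
by_alpha = s.ewm(alpha=2 / 3).mean()   # alpha = 1 / (1 + com)
by_span = s.ewm(span=2).mean()         # alpha = 2 / (span + 1)

# With adjust=True, the k-th most recent value gets weight (1 - alpha)**k
alpha = 2 / 3
weights = (1 - alpha) ** np.arange(3)   # 1, 1/3, 1/9
recent_first = s.iloc[2::-1].to_numpy() # 3.0, 2.0, 1.0
manual_third = (weights * recent_first).sum() / weights.sum()

print(by_com.tolist())
print(manual_third)
```

The same arithmetic explains the DataFrame example above: with com=0.5 the second row is (x1 + x0 / 3) / (1 + 1 / 3).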