# IPython console session
import pandas as pd
import numpy as np
b=pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b
Out[4]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19
b.sort_index()
Out[5]:
    0   1   2   3   4
a   5   6   7   8   9
b  15  16  17  18  19
c   0   1   2   3   4
d  10  11  12  13  14
b.sort_index(ascending=False)
Out[6]:
    0   1   2   3   4
d  10  11  12  13  14
c   0   1   2   3   4
b  15  16  17  18  19
a   5   6   7   8   9
c=b.sort_index(axis=1,ascending=False)
c
Out[8]:
    4   3   2   1   0
c   4   3   2   1   0
a   9   8   7   6   5
d  14  13  12  11  10
b  19  18  17  16  15
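c=b.sort_values(2,ascending=False)  # sort the rows in descending order of the column labeled 2
c
Out[10]:
    0   1   2   3   4
b  15  16  17  18  19
d  10  11  12  13  14
a   5   6   7   8   9
c   0   1   2   3   4
c=c.sort_values('a',axis=1,ascending=False)  # sort the columns (axis=1) in descending order of the values in row 'a'
c
Out[12]:
    4   3   2   1   0
b  19  18  17  16  15
d  14  13  12  11  10
a   9   8   7   6   5
c   4   3   2   1   0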
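Each of the sorting calls above returns a new object and leaves `b` itself untouched. The following is a minimal script version of the same session. The names `by_label`, `by_col2`, and `by_row_a` are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Same 4x5 frame as in the session above.
b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])

by_label = b.sort_index()                                # rows ordered by index label
by_col2 = b.sort_values(2, ascending=False)              # rows ordered by column 2, largest first
by_row_a = b.sort_values('a', axis=1, ascending=False)   # columns ordered by row 'a', largest first

print(by_label.index.tolist())    # ['a', 'b', 'c', 'd']
print(by_col2.index.tolist())     # ['b', 'd', 'a', 'c']
print(by_row_a.columns.tolist())  # [4, 3, 2, 1, 0]
print(b.index.tolist())           # ['c', 'a', 'd', 'b'], the original is unchanged
```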
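Basic statistical analysis of data
| Method | Description |
| --- | --- |
| .sum() | Sum of the values, computed along axis 0 by default (the same default applies to the methods below) |
| .count() | Number of non-NaN values |
| .mean() | Arithmetic mean |
| .median() | Median |
| .var() .std() | Variance, standard deviation |
| .min() .max() | Minimum, maximum |
| .argmin() .argmax() | Integer position (automatic index) of the minimum, maximum; Series only |
| .idxmin() .idxmax() | Index label (custom index) of the minimum, maximum |
| .describe() | Summary along axis 0: count, mean, std, min, quartiles, and max |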
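As a quick illustration of the methods in the table, the sketch below applies them to a small Series. The Series `s` is example data chosen here. The last two lines show the difference between the position-based `argmax()`/`argmin()` and the label-based `idxmax()`/`idxmin()`.

```python
import pandas as pd

s = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

print(s.sum())            # 30
print(s.count())          # 4
print(s.mean())           # 7.5
print(s.median())         # 7.5
print(s.min(), s.max())   # 6 9

# var() and std() use the sample formulas (divide by n - 1) by default.
print(s.var())            # 5/3, about 1.6667

# argmax()/argmin() return an integer position; idxmax()/idxmin() return the index label.
print(s.argmax(), s.idxmax())   # 0 a
print(s.argmin(), s.idxmin())   # 3 d
```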
Example 1
import pandas as pd
a=pd.Series([9,8,7,6],index=['a','b','c','d'])
a
Out[3]:
a 9
b 8
c 7
d 6
dtype: int64
a.describe()
Out[4]:
count    4.000000
mean     7.500000
std      1.290994
min      6.000000
25%      6.750000
50%      7.500000
75%      8.250000
max      9.000000
dtype: float64
type(a.describe())  # the result is itself a Series
Out[5]: pandas.core.series.Series
a.describe()['count']  # index it like any other Series
Out[6]: 4.0
a.describe()['max']  # index it like any other Series
Out[7]: 9.0
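The quartiles shown by `describe()` are only the default. As a small sketch, the `percentiles` argument can request other cut points. The variable name `summary` and the chosen percentiles are illustrative.

```python
import pandas as pd

a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

# Ask for the 10th, 50th, and 90th percentiles instead of the default quartiles.
summary = a.describe(percentiles=[0.1, 0.5, 0.9])

print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '10%', '50%', '90%', 'max']
print(summary['10%'], summary['90%'])   # about 6.3 and 8.7 (linear interpolation)
```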
Example 2
# IPython console session
import pandas as pd
import numpy as np
b=pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b
Out[4]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19
b.describe()
Out[5]:
               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000
type(b.describe())  # the result is a DataFrame
Out[6]: pandas.core.frame.DataFrame
b.describe().loc['max']  # access a row
Out[7]:
0    15.0
1    16.0
2    17.0
3    18.0
4    19.0
Name: max, dtype: float64
b.describe()[2]  # access a column
Out[8]:
count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64
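The statistics above are computed along axis 0, that is, one result per column. A minimal sketch of switching to per-row results follows. The names `col_sums`, `row_sums`, and `row_summary` are illustrative.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])

col_sums = b.sum()          # axis=0 (default): one value per column
row_sums = b.sum(axis=1)    # axis=1: one value per row

print(col_sums.tolist())    # [30, 34, 38, 42, 46]
print(row_sums.tolist())    # [10, 35, 60, 85]

# describe() works column by column, so transpose first to summarize each row.
row_summary = b.T.describe()
print(row_summary.loc['mean'].tolist())   # [2.0, 7.0, 12.0, 17.0]
```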
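Cumulative statistics functions
These functions work on both Series and DataFrame.
| Method | Description |
| --- | --- |
| .cumsum() | Running sum of the first 1, 2, ..., n values (the methods below follow the same pattern) |
| .cumprod() | Running product |
| .cummax() | Running maximum |
| .cummin() | Running minimum |

| Method | Description |
| --- | --- |
| .rolling(w).sum() | Sum over each window of w adjacent elements (the methods below follow the same pattern) |
| .rolling(w).mean() | Arithmetic mean |
| .rolling(w).var() | Variance |
| .rolling(w).std() | Standard deviation |
| .rolling(w).min() .rolling(w).max() | Minimum, maximum |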
Example 1
# IPython console session
import pandas as pd
import numpy as np
b=pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b
Out[4]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19
b.cumsum()
Out[5]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   7   9  11  13
d  15  18  21  24  27
b  30  34  38  42  46
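The other cumulative methods follow the same pattern as `cumsum()`. A short sketch on a Series makes the running behaviour easy to check by hand. The values in `s` are arbitrary example data.

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])

print(s.cumsum().tolist())    # [3, 4, 8, 9, 14]
print(s.cumprod().tolist())   # [3, 3, 12, 12, 60]
print(s.cummax().tolist())    # [3, 3, 4, 4, 5]
print(s.cummin().tolist())    # [3, 1, 1, 1, 1]
```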
Example 2
# IPython console session
import pandas as pd
import numpy as np
b=pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b
Out[4]:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19
b.rolling(2).sum()  # each row = previous row + current row; the first row has no previous row, so it is NaN
Out[5]:
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   5.0   7.0   9.0  11.0  13.0
d  15.0  17.0  19.0  21.0  23.0
b  25.0  27.0  29.0  31.0  33.0
b.rolling(3).sum()
Out[6]:
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   NaN   NaN   NaN   NaN   NaN
d  15.0  18.0  21.0  24.0  27.0
b  30.0  33.0  36.0  39.0  42.0
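To make the window arithmetic concrete, the sketch below rebuilds `rolling(2).sum()` from a shifted copy of the frame, then uses the `min_periods` argument to avoid the leading NaN row. The names `rolled`, `manual`, and `partial` are illustrative.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])

rolled = b.rolling(2).sum()
manual = b + b.shift(1)   # current row + previous row; the first row becomes NaN

# Raises an AssertionError if the two frames differ.
pd.testing.assert_frame_equal(rolled, manual)

# min_periods=1 lets an incomplete window produce a value instead of NaN.
partial = b.rolling(2, min_periods=1).sum()
print(partial.iloc[0].tolist())   # [0.0, 1.0, 2.0, 3.0, 4.0]
```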
Correlation analysis
Covariance:
$$cov(X,Y)=\frac{\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}$$
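Covariance > 0: X and Y are positively correlated
Covariance < 0: X and Y are negatively correlated
Covariance = 0: X and Y are linearly uncorrelated (this alone does not imply that they are independent)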
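The covariance formula can be checked directly against `Series.cov()`. The sketch below uses small made-up Series (`x`, `y`, `u`, `v` are example data, not from the original notes). The second pair shows a zero covariance between two variables that are nevertheless fully dependent.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Sample covariance straight from the definition (divide by n - 1).
n = len(x)
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

print(cov_manual)   # 1.5
print(x.cov(y))     # 1.5, up to floating-point rounding

# Zero covariance only rules out a linear relationship: here v = u squared.
u = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])
v = u ** 2
print(u.cov(v))     # about 0, although v is fully determined by u
```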
Pearson correlation coefficient:
$$r=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2} \sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}$$
$$r\in[-1,1]$$
As a rule of thumb, the strength of the correlation is read from the absolute value of r:
$|r|\in[0.8,1.0]$: very strong correlation
$|r|\in[0.6,0.8)$: strong correlation
$|r|\in[0.4,0.6)$: moderate correlation
$|r|\in[0.2,0.4)$: weak correlation
$|r|\in[0.0,0.2)$: very weak correlation
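The Pearson formula and the strength bands can be sketched in a few lines. The data in `x` and `y` are made up, and `strength` is a hypothetical helper written for this example, not a pandas function. It assumes a boundary value belongs to the stronger band.

```python
import math
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson's r straight from the definition.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / (math.sqrt((dx ** 2).sum()) * math.sqrt((dy ** 2).sum()))


def strength(r):
    """Hypothetical helper: map |r| to the rule-of-thumb bands listed above."""
    magnitude = abs(r)
    if magnitude >= 0.8:
        return "very strong"
    if magnitude >= 0.6:
        return "strong"
    if magnitude >= 0.4:
        return "moderate"
    if magnitude >= 0.2:
        return "weak"
    return "very weak"


print(round(r_manual, 4), strength(r_manual))   # 0.7746 strong
print(x.corr(y))                                # same value, computed by pandas
```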
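| Method | Description |
| --- | --- |
| .cov() | Compute the covariance matrix |
| .corr() | Compute the correlation matrix; the Pearson, Spearman, and Kendall coefficients are supported |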
# IPython console session
import pandas as pd
hprice=pd.Series([3.04,22.93,12.75,22.6,12.33],index=['2008','2009','2010','2011','2012'])
m2=pd.Series([8.18,18.38,9.13,7.82,6.69],index=['2008','2009','2010','2011','2012'])
hprice.corr(m2)
Out[4]: 0.5239439145220387
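The same two series can also be placed in one DataFrame to obtain the full covariance and correlation matrices. This is a sketch: the frame `df` and the names `pearson` and `spearman` are introduced here for illustration. The Spearman coefficient is computed as Pearson's r on the ranks, which is an equivalent formulation rather than a separate pandas call.

```python
import pandas as pd

years = ['2008', '2009', '2010', '2011', '2012']
df = pd.DataFrame({
    'hprice': [3.04, 22.93, 12.75, 22.6, 12.33],
    'm2': [8.18, 18.38, 9.13, 7.82, 6.69],
}, index=years)

print(df.cov())    # 2x2 covariance matrix
print(df.corr())   # 2x2 Pearson correlation matrix

pearson = df.corr().loc['hprice', 'm2']

# Spearman's coefficient is Pearson's r computed on the ranks of the data.
spearman = df['hprice'].rank().corr(df['m2'].rank())

print(pearson)     # about 0.5239, the same as hprice.corr(m2)
print(spearman)    # about 0.5
```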