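NumPy Supplement and Introduction to pandas
NumPy supplement
Indexing and slicing matrices
Basic indexing and slicing work the same way as in plain Python; this section covers indexing and slicing of two-dimensional arrays.
Indexing rows and columns
Create a 4x4 matrix, then select the rows and columns you need: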
import numpy as np
A = np.arange(16).reshape(4, 4)
print(A)
print(A[1:3])       # rows 2 and 3 (indices 1 and 2)
print(A[:, 2:4])    # columns 3 and 4 (indices 2 and 3)
print(A[1:3, 2:4])  # columns 3 and 4 within rows 2 and 3
The output is:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
[[ 4 5 6 7]
[ 8 9 10 11]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
[[ 6 7]
[10 11]]
Indexing by condition
For example:
import numpy as np
A = np.arange(16).reshape(4, 4)
print(A)
print(A[A > 10])      # 1-D array of the elements of A greater than 10
print(A[A % 2 == 0])  # 1-D array of the even elements of A
The output is:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
[11 12 13 14 15]
[ 0 2 4 6 8 10 12 14]
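Conditions can also be combined. The sketch below reuses the same 4x4 array; the variable names are only illustrative. It shows the element-wise operators & and |, and a mask used for assignment on a copy.

```python
import numpy as np

A = np.arange(16).reshape(4, 4)

# Combine conditions element-wise with & (and) and | (or).
# Each condition needs its own parentheses because & and | bind
# more tightly than comparison operators.
even_and_large = A[(A > 10) & (A % 2 == 0)]
small_or_large = A[(A < 2) | (A > 13)]
print(even_and_large)
print(small_or_large)

# A boolean mask also works on the left-hand side of an assignment.
# Work on a copy so that A itself is unchanged.
B = A.copy()
B[B % 2 == 1] = -1
print(B[0])
```

Python's `and` / `or` keywords do not work here, since they try to reduce the whole array to a single truth value.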
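Introduction to pandas
Key data structures in pandas
Series: a one-dimensional data structure
import pandas as pd
import numpy as np
s = pd.Series([1, 3, 5, np.nan, 8, 4])  # NaN means "Not a Number"; the np.NaN alias was removed in NumPy 2.0
print(s)
The output is: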
0 1.0
1 3.0
2 5.0
3 NaN
4 8.0
5 4.0
dtype: float64
The first column is the index (created automatically); the second column holds the values.
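The index does not have to be the automatic 0..n-1 range. As a small sketch (the letter labels are an assumption made up for illustration), the same values can be given explicit labels and then accessed either by label or by position:

```python
import numpy as np
import pandas as pd

# Same values as above, but with hypothetical letter labels as the index.
s = pd.Series([1, 3, 5, np.nan, 8, 4], index=list("abcdef"))

by_label = s["c"]         # label-based access
by_position = s.iloc[2]   # position-based access

# Missing values are counted with isna() and skipped by mean() by default.
n_missing = int(s.isna().sum())
mean_value = s.mean()     # (1 + 3 + 5 + 8 + 4) / 5

print(by_label, by_position, n_missing, mean_value)
```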
DataFrame: a two-dimensional data structure
- Create a date range
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
print(dates)
The output is:
DatetimeIndex(['2016-03-01', '2016-03-02', '2016-03-03', '2016-03-04',
'2016-03-05', '2016-03-06'],
dtype='datetime64[ns]', freq='D')
- Create a two-dimensional table with DataFrame:
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data)
The output is:
A B C D
2016-03-01 -1.025216 -1.750513 0.717700 1.249785
2016-03-02 -1.159727 1.247696 0.681604 -0.795137
2016-03-03 -0.904264 1.103957 0.639108 -0.573438
2016-03-04 0.287851 1.855954 -1.431715 0.890373
2016-03-05 0.034006 -1.472277 0.609982 -0.546555
2016-03-06 0.301257 -1.247960 2.360873 -0.425789
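Dictionaries (another way to create a DataFrame)
Create a DataFrame from a dictionary:
import pandas as pd
import numpy as np
d = {'A': 1, 'B': pd.Timestamp('20130301'), 'C': range(4), 'D': np.arange(4)}
# Keys A, B, C and D map to the scalar 1, a timestamp, the values of range(4), and the values of np.arange(4); scalars are repeated on every row.
df = pd.DataFrame(d)
print(df)
The output is: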
A B C D
0 1 2013-03-01 0 0
1 1 2013-03-01 1 1
2 1 2013-03-01 2 2
3 1 2013-03-01 3 3
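Each column of a DataFrame built this way gets its own dtype. The sketch below rebuilds the same dictionary and inspects the result; the exact integer width and datetime unit can differ between platforms and pandas versions, so the checks use the general helpers in pd.api.types rather than exact dtype names.

```python
import numpy as np
import pandas as pd

d = {'A': 1, 'B': pd.Timestamp('20130301'), 'C': range(4), 'D': np.arange(4)}
df = pd.DataFrame(d)

# One dtype per column.
print(df.dtypes)

# The scalar 1 and the single timestamp were repeated for all four rows.
shape = df.shape
a_values = df['A'].tolist()
b_is_datetime = pd.api.types.is_datetime64_any_dtype(df['B'])
d_is_integer = pd.api.types.is_integer_dtype(df['D'])
```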
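Viewing data
head shows the first rows and tail shows the last rows
head() and tail() return 5 rows by default; pass an integer to change the count.
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data.head())
print(data.tail())
print(data.head(2))
print(data.tail(3))
The output is: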
A B C D
2016-03-01 0.687377 0.291196 0.788835 0.144668
2016-03-02 0.133120 -1.066542 -0.655071 -0.278354
2016-03-03 0.571230 1.059514 -1.572226 0.998279
2016-03-04 -0.707941 0.490541 -0.508189 1.332583
2016-03-05 0.782082 -0.309733 0.385113 -2.001689
A B C D
2016-03-02 0.133120 -1.066542 -0.655071 -0.278354
2016-03-03 0.571230 1.059514 -1.572226 0.998279
2016-03-04 -0.707941 0.490541 -0.508189 1.332583
2016-03-05 0.782082 -0.309733 0.385113 -2.001689
2016-03-06 0.155096 -0.276119 1.035860 -0.385331
A B C D
2016-03-01 0.687377 0.291196 0.788835 0.144668
2016-03-02 0.133120 -1.066542 -0.655071 -0.278354
A B C D
2016-03-04 -0.707941 0.490541 -0.508189 1.332583
2016-03-05 0.782082 -0.309733 0.385113 -2.001689
2016-03-06 0.155096 -0.276119 1.035860 -0.385331
index, columns and values
index returns the row labels, columns returns the column labels, and values returns the data as a two-dimensional NumPy array.
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data)
print(data.index)
print(data.columns)
print(data.values)
The output is:
A B C D
2016-03-01 -0.627376 -0.631146 -0.768253 -0.365725
2016-03-02 0.274266 -2.671629 -0.971605 -1.304933
2016-03-03 -0.039024 -1.267995 1.343958 -1.553249
2016-03-04 -0.564481 -0.157503 -0.716591 -0.290638
2016-03-05 0.768433 -0.397582 0.678337 -0.707139
2016-03-06 -0.162657 -1.191969 0.439424 -3.304446
DatetimeIndex(['2016-03-01', '2016-03-02', '2016-03-03', '2016-03-04',
'2016-03-05', '2016-03-06'],
dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
[[-0.62737568 -0.63114571 -0.768253 -0.36572538]
[ 0.27426623 -2.67162941 -0.97160469 -1.3049329 ]
[-0.0390242 -1.26799508 1.34395777 -1.55324906]
[-0.56448052 -0.15750279 -0.71659108 -0.29063806]
[ 0.76843331 -0.39758175 0.67833652 -0.70713877]
[-0.16265721 -1.19196896 0.43942369 -3.30444622]]
describe: summary statistics for the data
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data)
print(data.describe())
The output is:
A B C D
2016-03-01 1.223564 1.000462 1.752393 -0.894149
2016-03-02 0.541414 -1.811422 0.424858 -1.114216
2016-03-03 0.745621 -1.373831 0.679137 -0.055400
2016-03-04 -0.549163 -2.835349 1.528670 0.322692
2016-03-05 -0.649714 -0.589527 1.912596 -0.315114
2016-03-06 0.072042 -1.328182 -1.470634 -0.353867
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.230627 -1.156308 0.804504 -0.401676 # mean
std 0.742422 1.288263 1.263873 0.530026 # standard deviation (not the variance)
min -0.649714 -2.835349 -1.470634 -1.114216 # minimum
25% -0.393862 -1.702025 0.488428 -0.759078 # first quartile
50% 0.306728 -1.351007 1.103904 -0.334490 # median (second quartile)
75% 0.694569 -0.774191 1.696463 -0.120329 # third quartile
max 1.223564 1.000462 1.912596 0.322692 # maximum
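The std row is easy to misread. As a minimal check on a small hand-made column (the values 1 to 4 are an assumption chosen so the numbers are easy to verify), the sketch below compares it with NumPy's sample and population standard deviations.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical column whose statistics are easy to verify by hand.
data = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})
summary = data.describe()

std_pandas = summary.loc["std", "A"]
std_sample = np.std(data["A"].to_numpy(), ddof=1)      # divides by n - 1
std_population = np.std(data["A"].to_numpy(), ddof=0)  # divides by n
median = summary.loc["50%", "A"]

print(std_pandas, std_sample, std_population, median)
```

The value reported by describe() matches the sample standard deviation (ddof=1), not NumPy's default population version.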
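Sorting data
Sorting by index labels, rows or columns (sort_index)
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data)
print(data.sort_index(axis=1, ascending=False).sort_index(axis=0, ascending=False))
# axis=1 sorts by column labels and axis=0 sorts by row labels; ascending=True sorts in ascending order and ascending=False in descending order.
The output is: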
A B C D
2016-03-01 -0.420118 0.834011 1.527457 -0.038281
2016-03-02 -0.339938 -1.064339 -0.223201 0.428003
2016-03-03 0.036346 0.996947 -1.061392 0.218536
2016-03-04 0.415237 1.031722 -0.405188 0.498559
2016-03-05 0.259473 0.317152 1.367756 0.194774
2016-03-06 -0.576714 -0.484806 0.619151 0.444646
D C B A
2016-03-06 0.444646 0.619151 -0.484806 -0.576714
2016-03-05 0.194774 1.367756 0.317152 0.259473
2016-03-04 0.498559 -0.405188 1.031722 0.415237
2016-03-03 0.218536 -1.061392 0.996947 0.036346
2016-03-02 0.428003 -0.223201 -1.064339 -0.339938
2016-03-01 -0.038281 1.527457 0.834011 -0.420118
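Because the axis argument is easy to mix up, here is a small deterministic sketch (the 2x2 frame and its labels are made up for illustration) showing which labels each axis value reorders:

```python
import pandas as pd

# Hypothetical frame with deliberately unsorted row and column labels.
df = pd.DataFrame([[1, 2], [3, 4]], index=["y", "x"], columns=["b", "a"])

by_rows = df.sort_index(axis=0)  # reorders the row labels: x, y
by_cols = df.sort_index(axis=1)  # reorders the column labels: a, b

print(by_rows)
print(by_cols)
```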
Sorting by values (sort_values)
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data)
print(data.sort_values(by='C', ascending=False))
# Sort by the values in column C; ascending=False is descending and ascending=True is ascending.
The output is:
A B C D
2016-03-01 -0.652637 1.907680 -0.217010 0.987138
2016-03-02 -0.015474 0.316740 -1.460764 0.416239
2016-03-03 0.978143 -1.095760 -0.694717 -1.278203
2016-03-04 -1.289825 -0.152098 0.216542 -0.220509
2016-03-05 2.125722 1.223592 -0.020528 -1.236328
2016-03-06 0.152470 1.457490 0.745391 -0.324019
A B C D
2016-03-06 0.152470 1.457490 0.745391 -0.324019
2016-03-04 -1.289825 -0.152098 0.216542 -0.220509
2016-03-05 2.125722 1.223592 -0.020528 -1.236328
2016-03-01 -0.652637 1.907680 -0.217010 0.987138
2016-03-03 0.978143 -1.095760 -0.694717 -1.278203
2016-03-02 -0.015474 0.316740 -1.460764 0.416239
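sort_values also accepts a list of columns, with one ascending flag per column. A short sketch on made-up data (the group and score columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [1, 3, 2, 4],
})

# Sort by group ascending, then by score descending within each group.
ranked = df.sort_values(by=["group", "score"], ascending=[True, False])
print(ranked)
```

The original row labels travel with the rows, so the resulting index shows where each row came from.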
Indexing data
Selecting columns
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data)
print(data.A)
# The following statement is equivalent:
# print(data['A'])
The output is:
A B C D
2016-03-01 1.084063 -1.159920 0.421726 -1.615109
2016-03-02 0.657053 -0.376875 1.119834 0.598359
2016-03-03 -0.373089 0.849793 -0.295477 -1.274574
2016-03-04 -0.385112 -0.594260 -0.625461 -0.643493
2016-03-05 0.088876 -0.297084 -0.382355 -1.342975
2016-03-06 -1.145772 -1.501243 0.530443 -0.286387
2016-03-01 1.084063
2016-03-02 0.657053
2016-03-03 -0.373089
2016-03-04 -0.385112
2016-03-05 0.088876
2016-03-06 -1.145772
Freq: D, Name: A, dtype: float64
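Selecting rows
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data)
print(data[2:4])  # rows at positions 2 and 3
# Rows can also be selected with a label slice, which includes the end label:
# print(data['20160303':'20160304'])
The output is: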
A B C D
2016-03-01 -0.566422 -0.843294 0.078921 -1.453978
2016-03-02 1.085810 -0.555267 0.403593 1.235611
2016-03-03 -0.812393 1.954781 -0.908624 1.615296
2016-03-04 0.194656 -0.100710 -0.607571 -0.698412
2016-03-05 0.825079 1.559980 -0.400987 1.071331
2016-03-06 0.393200 0.653905 1.040587 1.136798
A B C D
2016-03-03 -0.812393 1.954781 -0.908624 1.615296
2016-03-04 0.194656 -0.100710 -0.607571 -0.698412
Selecting by label (loc)
Select the data inside a range of rows and a list of columns:
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data)
print(data.loc['20160302':'20160304', ['B', 'C']])
The output is:
A B C D
2016-03-01 0.006156 1.764682 1.472136 1.390308
2016-03-02 -0.735844 -0.395434 0.439784 -0.967490
2016-03-03 -1.048223 -0.866617 0.394411 2.388338
2016-03-04 -0.746154 -0.334658 0.027457 -0.140987
2016-03-05 0.676363 -0.757669 -0.338700 -0.251449
2016-03-06 0.052976 0.322726 0.771191 -0.506898
B C
2016-03-02 -0.395434 0.439784
2016-03-03 -0.866617 0.394411
2016-03-04 -0.334658 0.027457
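Selecting by position (iloc)
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data)
print(data.iloc[1])         # the second row, returned as a Series
print(data.iloc[1:3])       # rows at positions 1 and 2
print(data.iloc[:, 1:3])    # columns at positions 1 and 2
print(data.iloc[1:3, 2:4])  # rows 1 and 2, columns 2 and 3
The output is: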
A B C D
2016-03-01 -0.536416 -1.176988 -0.114814 -0.012914
2016-03-02 0.116893 -1.280848 0.980185 -0.491851
2016-03-03 0.079618 0.644327 -0.265312 -1.358025
2016-03-04 -1.382691 1.050169 0.019854 1.363736
2016-03-05 0.906943 -0.751846 -0.065756 -0.548120
2016-03-06 -0.684887 -0.391439 0.396847 -0.235574
A 0.116893
B -1.280848
C 0.980185
D -0.491851
Name: 2016-03-02 00:00:00, dtype: float64
A B C D
2016-03-02 0.116893 -1.280848 0.980185 -0.491851
2016-03-03 0.079618 0.644327 -0.265312 -1.358025
B C
2016-03-01 -1.176988 -0.114814
2016-03-02 -1.280848 0.980185
2016-03-03 0.644327 -0.265312
2016-03-04 1.050169 0.019854
2016-03-05 -0.751846 -0.065756
2016-03-06 -0.391439 0.396847
C D
2016-03-02 0.980185 -0.491851
2016-03-03 -0.265312 -1.358025
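One difference between loc and iloc is worth spelling out: label slices include the end label, while position slices exclude the end position. The sketch below uses the same dates but fills the frame with np.arange so the values are predictable (that substitution is an assumption for illustration only).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20160301', periods=6)
# Deterministic values 0..23 instead of random numbers.
data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label slice: both end points are included, so this returns 3 rows.
by_label = data.loc['20160302':'20160304', ['B', 'C']]

# Position slice: the stop position is excluded, so this returns 2 rows.
by_position = data.iloc[1:3, 1:3]

print(by_label)
print(by_position)
```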
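Filtering data
Filtering by value
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data[data.A > 0])  # keep only the rows where column A is positive
print(data[data > 0])    # keep the shape; elements that fail the test become NaN
The output is: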
A B C D
2016-03-02 0.345941 1.037386 0.018049 0.526858
2016-03-03 1.272562 -1.128237 -0.541329 0.677266
2016-03-05 0.949473 0.546859 -0.375677 -0.186794
A B C D
2016-03-01 NaN 0.904253 0.156484 0.173791
2016-03-02 0.345941 1.037386 0.018049 0.526858
2016-03-03 1.272562 NaN NaN 0.677266
2016-03-04 NaN NaN 1.114209 1.981995
2016-03-05 0.949473 0.546859 NaN NaN
2016-03-06 NaN NaN NaN 0.208919
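The two filtering styles above can be combined and extended. A small sketch on made-up integers (the frame is hypothetical), showing a row filter on two columns at once and one way to replace the NaN values that an element-wise mask leaves behind:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3, -4], "B": [-1, -2, 3, 4]})

# Row filter on two columns; each condition needs its own parentheses.
both = df[(df["A"] > 0) & (df["B"] < 0)]

# Element-wise mask: same shape, failing elements become NaN.
masked = df[df > 0]
n_masked = int(masked.isna().sum().sum())

# Replace the NaN values with 0 if a fully numeric table is needed.
filled = masked.fillna(0)

print(both)
print(filled)
```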
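Adding a column and filtering with isin
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
tag = ['a'] * 2 + ['b'] * 2 + ['c'] * 2
data['TAG'] = tag  # add a new column named TAG
print(data)
print(data[data.TAG.isin(['a', 'c'])])  # keep the rows whose TAG is 'a' or 'c'
The output is: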
A B C D TAG
2016-03-01 -0.906126 0.015918 0.085215 -2.472274 a
2016-03-02 1.952455 -1.429034 0.693882 -1.219012 a
2016-03-03 -0.200522 -0.721406 1.363443 1.298041 b
2016-03-04 1.028430 0.493051 1.336037 -0.222807 b
2016-03-05 1.963239 -0.633410 0.861413 1.760964 c
2016-03-06 -0.237367 0.684996 1.539663 0.805262 c
A B C D TAG
2016-03-01 -0.906126 0.015918 0.085215 -2.472274 a
2016-03-02 1.952455 -1.429034 0.693882 -1.219012 a
2016-03-05 1.963239 -0.633410 0.861413 1.760964 c
2016-03-06 -0.237367 0.684996 1.539663 0.805262 c
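Modifying elements
Modifying a single element (iat)
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data)
data.iat[0, 0] = 100  # set the element at row position 0, column position 0
print(data)
The output is: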
A B C D
2016-03-01 0.478121 0.968600 -1.174265 -1.175590
2016-03-02 -0.846485 -2.026002 -0.909115 -0.527671
2016-03-03 0.352549 0.415892 -1.353949 -0.561842
2016-03-04 1.428907 -0.747153 0.371942 -0.245003
2016-03-05 0.262357 -0.242971 -0.164828 1.126916
2016-03-06 1.806400 -0.243867 0.022820 0.400097
A B C D
2016-03-01 100.000000 0.968600 -1.174265 -1.175590
2016-03-02 -0.846485 -2.026002 -0.909115 -0.527671
2016-03-03 0.352549 0.415892 -1.353949 -0.561842
2016-03-04 1.428907 -0.747153 0.371942 -0.245003
2016-03-05 0.262357 -0.242971 -0.164828 1.126916
2016-03-06 1.806400 -0.243867 0.022820 0.400097
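iat addresses a cell by integer position; its label-based counterpart is at. A minimal sketch on a small frame of zeros (the 3x2 shape is an assumption to keep the output short):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20160301', periods=3)
data = pd.DataFrame(np.zeros((3, 2)), index=dates, columns=list('AB'))

data.iat[0, 0] = 100          # by position: first row, first column
data.at[dates[1], 'B'] = 200  # by label: row 2016-03-02, column B

print(data)
```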
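Modifying a row or column
import pandas as pd
import numpy as np
dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# np.random.randn() builds a 6x4 array of random values; dates is the row index and A, B, C, D are the column labels.
print(data)
data.A = range(6)  # replace column A with 0..5
data.B = 200       # set every value in column B to 200
print(data)
The output is: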
A B C D
2016-03-01 1.714903 -1.317622 1.820566 0.500455
2016-03-02 -0.247550 -0.282191 -0.119152 2.436925
2016-03-03 -0.283482 -0.383136 1.784690 0.617285
2016-03-04 -0.454976 0.328370 1.161106 0.443733
2016-03-05 -1.884368 -0.734345 -0.631858 -0.238644
2016-03-06 0.070338 -0.950485 0.028665 0.254520
A B C D
2016-03-01 0 200 1.820566 0.500455
2016-03-02 1 200 -0.119152 2.436925
2016-03-03 2 200 1.784690 0.617285
2016-03-04 3 200 1.161106 0.443733
2016-03-05 4 200 -0.631858 -0.238644
2016-03-06 5 200 0.028665 0.254520
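More
For more pandas usage, see the "10 Minutes to pandas" guide in the documentation.
pandas in practice: a data analysis example (MovieLens movie data analysis)
Basic operations
Reindexing (reindex)
- fill_value fills new entries with the given value
- method='ffill' fills new entries from the previous row
- method='bfill' fills new entries from the next row
Note that method fills along the row index here and is reported not to apply to columns (this may depend on the pandas version); the index being filled must be monotonically increasing or decreasing.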
import pandas as pd
import numpy as np
data=pd.DataFrame(np.random.randn(3,6),index=list('ACE'),columns=['one','two','three','four','five','six'])
print(data)
print(data.reindex(list('ABCDE')))
print(data.reindex(list('ABCDE'),fill_value=0))
print(data.reindex(list('ABCDE'),method='ffill'))
print(data.reindex(list('ABCDE'),method='bfill'))
The output is:
one two three four five six
A 0.543125 -1.640770 -0.663088 0.400010 -0.769536 -0.037884
C -1.221736 -0.921086 1.052908 -0.060646 0.609456 0.370573
E 0.808179 0.146970 -1.095026 -1.648017 0.455950 -0.224299
one two three four five six
A 0.543125 -1.640770 -0.663088 0.400010 -0.769536 -0.037884
B NaN NaN NaN NaN NaN NaN
C -1.221736 -0.921086 1.052908 -0.060646 0.609456 0.370573
D NaN NaN NaN NaN NaN NaN
E 0.808179 0.146970 -1.095026 -1.648017 0.455950 -0.224299
one two three four five six
A 0.543125 -1.640770 -0.663088 0.400010 -0.769536 -0.037884
B 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C -1.221736 -0.921086 1.052908 -0.060646 0.609456 0.370573
D 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
E 0.808179 0.146970 -1.095026 -1.648017 0.455950 -0.224299
one two three four five six
A 0.543125 -1.640770 -0.663088 0.400010 -0.769536 -0.037884
B 0.543125 -1.640770 -0.663088 0.400010 -0.769536 -0.037884
C -1.221736 -0.921086 1.052908 -0.060646 0.609456 0.370573
D -1.221736 -0.921086 1.052908 -0.060646 0.609456 0.370573
E 0.808179 0.146970 -1.095026 -1.648017 0.455950 -0.224299
one two three four five six
A 0.543125 -1.640770 -0.663088 0.400010 -0.769536 -0.037884
B -1.221736 -0.921086 1.052908 -0.060646 0.609456 0.370573
C -1.221736 -0.921086 1.052908 -0.060646 0.609456 0.370573
D 0.808179 0.146970 -1.095026 -1.648017 0.455950 -0.224299
E 0.808179 0.146970 -1.095026 -1.648017 0.455950 -0.224299
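reindex can also target the columns through the columns keyword. The sketch below, on a small made-up frame, adds a column with fill_value and repeats the forward fill along rows from the example above:

```python
import pandas as pd

data = pd.DataFrame([[1, 2], [3, 4]], index=list('AC'), columns=['one', 'two'])

# Add a hypothetical third column; missing entries take fill_value.
wider = data.reindex(columns=['one', 'two', 'three'], fill_value=0)

# Forward fill along the rows: B copies A, D copies C.
filled = data.reindex(list('ABCD'), method='ffill')

print(wider)
print(filled)
```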
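Dropping data (drop)
Dropping rows and dropping columns differ slightly: rows are the default, while columns need axis=1.
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(3, 6), index=list('ACE'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
print(data)
print(data.drop('A'))              # drop the row labelled A
print(data.drop(['one'], axis=1))  # drop the column labelled one
The output is: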
one two three four five six
A 0.205775 -1.035567 1.424270 0.865579 0.555840 1.183951
C -2.150617 -1.007972 1.837807 -1.324306 -0.890468 0.727608
E 0.731590 0.658749 -0.000566 1.304314 -0.434203 -1.060070
one two three four five six
C -2.150617 -1.007972 1.837807 -1.324306 -0.890468 0.727608
E 0.731590 0.658749 -0.000566 1.304314 -0.434203 -1.060070
two three four five six
A -1.035567 1.424270 0.865579 0.555840 1.183951
C -1.007972 1.837807 -1.324306 -0.890468 0.727608
E 0.658749 -0.000566 1.304314 -0.434203 -1.060070
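As the output above suggests, drop returns a new object and leaves the original untouched. A short sketch with predictable values (the small frame is an assumption), using the index= and columns= keywords as a more explicit alternative to axis:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape(2, 3), index=list('AC'),
                    columns=['one', 'two', 'three'])

no_a = data.drop(index='A')          # same as data.drop('A')
no_one = data.drop(columns=['one'])  # same as data.drop(['one'], axis=1)

# The original frame still has every row and column.
print(data.shape, no_a.shape, no_one.shape)
```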
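Mapping functions
apply (used with a lambda function)
Apply an operation along the rows or columns of a DataFrame
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(3, 6), index=list('ACE'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
print(data)
print(data.apply(lambda s: max(s), axis=1))  # axis=1 passes each row to the function, giving the maximum of every row
The output is: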
one two three four five six
A 0.589514 -0.458795 -1.039271 -1.804765 -0.073647 0.592423
C -3.049245 -0.488300 -0.252437 -0.675911 -0.276973 -0.822922
E 0.949015 -0.476813 0.485910 -0.462323 -0.208239 -0.528279
A 0.592423
C -0.252437
E 0.949015
dtype: float64
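To make the axis argument of apply concrete, here is a small sketch on made-up integers comparing both directions:

```python
import pandas as pd

data = pd.DataFrame([[1, 5], [4, 2]], index=list('AC'), columns=['one', 'two'])

# axis=1: the function receives one row at a time.
row_max = data.apply(lambda s: s.max(), axis=1)

# axis=0 (the default): the function receives one column at a time.
col_max = data.apply(lambda s: s.max(), axis=0)

print(row_max)
print(col_max)
```

For a simple reduction like this, data.max(axis=1) gives the same result without a lambda; apply is more useful when the function is custom.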
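applymap for formatted output
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(3, 6), index=list('ABC'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
print(data)
# applymap calls the function on every element; since pandas 2.1 it is deprecated in favor of DataFrame.map.
print(data.applymap("{0:.02f}".format))
The output is: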
one two three four five six
A -2.681697 0.252926 -1.720396 -2.093779 0.816986 0.143577
B -0.339905 0.669807 -0.535729 2.057737 -2.631671 0.743504
C -1.338692 0.085534 -0.741325 -0.814479 -0.551937 -0.892762
one two three four five six
A -2.68 0.25 -1.72 -2.09 0.82 0.14
B -0.34 0.67 -0.54 2.06 -2.63 0.74
C -1.34 0.09 -0.74 -0.81 -0.55 -0.89
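Since applymap is deprecated from pandas 2.1 onward, code that must run on both old and new versions can pick whichever element-wise method is available. This is only one possible approach, sketched on made-up values:

```python
import pandas as pd

data = pd.DataFrame([[1.234, -5.678]], columns=['one', 'two'])
fmt = "{0:.2f}".format

# DataFrame.map exists from pandas 2.1; fall back to applymap on older versions.
elementwise = data.map if hasattr(data, "map") else data.applymap
formatted = elementwise(fmt)

print(formatted)
```

The result holds strings, not floats, so it is suited to display rather than further arithmetic.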
Sorting and ranking
Sorting
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(3, 6), index=list('ABC'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
print(data)
print(data.sort_values(by='one'))  # sort the rows by column one, ascending by default
The output is:
one two three four five six
A -0.131662 -0.770629 0.815278 1.251899 0.023454 -1.022011
B -1.126394 2.241842 0.255889 0.674569 -0.936142 0.911843
C -0.265151 -0.690041 0.077091 0.123383 0.197835 0.160106
one two three four five six
B -1.126394 2.241842 0.255889 0.674569 -0.936142 0.911843
C -0.265151 -0.690041 0.077091 0.123383 0.197835 0.160106
A -0.131662 -0.770629 0.815278 1.251899 0.023454 -1.022011
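Ranking
The average method
Tied values each receive the average of the ranks they would otherwise occupy. Below, the two 5s would take ranks 4 and 5, so each gets 4.5.
import pandas as pd
import numpy as np
s = pd.Series([3, 5, 2, 7, 4, 5])  # integers rather than strings, so the ranking is numeric
print(s.rank(method='average'))
The output is: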
0 2.0
1 4.5
2 1.0
3 6.0
4 3.0
5 4.5
dtype: float64
The first method
Tied values are ranked in the order in which they appear in the data.
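A short sketch of method='first' on the same numbers as the average example, with method='min' added for comparison (the comparison is an addition, not part of the original notes):

```python
import pandas as pd

s = pd.Series([3, 5, 2, 7, 4, 5])

# 'first': the 5 at position 1 appears before the 5 at position 5,
# so it gets the lower rank.
first = s.rank(method='first')

# 'min': both tied values share the lowest rank of the group.
minimum = s.rank(method='min')

print(first)
print(minimum)
```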
Uniqueness of values and membership testing
First create a Series as follows:
import pandas as pd
import numpy as np
s = pd.Series(list('ajagsfdahsfdgasgas'))
print(s)
The output is:
0 a
1 j
2 a
3 g
4 s
5 f
6 d
7 a
8 h
9 s
10 f
11 d
12 g
13 a
14 s
15 g
16 a
17 s
dtype: object
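- Count how often each value occurs: the value_counts() method
- Get the array of distinct values: the unique() method
- Check whether each element is in a given list: isin(list)
import pandas as pd
import numpy as np
s = pd.Series(list('ajagsfdahsfdgasgas'))
print(s)
print(s.isin(['a', 'j', 'g', 's', 'd']))
The output is: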
0 a
1 j
2 a
3 g
4 s
5 f
6 d
7 a
8 h
9 s
10 f
11 d
12 g
13 a
14 s
15 g
16 a
17 s
dtype: object
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 False
9 True
10 False
11 True
12 True
13 True
14 True
15 True
16 True
17 True
dtype: bool
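The example above only exercises isin. For completeness, a sketch of value_counts() and unique() on the same Series:

```python
import pandas as pd

s = pd.Series(list('ajagsfdahsfdgasgas'))

# value_counts() returns a Series of counts, most frequent value first.
counts = s.value_counts()

# unique() returns the distinct values in order of first appearance.
uniques = s.unique()

print(counts)
print(uniques)
```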