Introduction: this program follows the "10 Minutes to pandas" tutorial. It was tested on Windows 10 with Anaconda and Jupyter. Mind version differences: if a newer release has deprecated any of these functions, defer to the official manual. The article was converted from a Jupyter notebook to Markdown, so there may be some formatting glitches, but they do not affect reproducing the results.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
print(pd.__version__)
print(np.__version__)
print(matplotlib.__version__)
1.0.1
1.18.1
3.1.3
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
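A Series can also carry an explicit label index instead of the default 0..n-1 positions. A minimal sketch (the labels 'a', 'b', 'c' are made up for illustration):

```python
import pandas as pd

# Labels are hypothetical; any hashable values work as an index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])  # label-based access -> 20
```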
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
sales_vol = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["Apple", "Strawberry", "Pineapple", "Watermelon"])
sales_vol
                Apple  Strawberry  Pineapple  Watermelon
2013-01-01   1.516204    0.495784  -0.153495    0.477701
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631
2013-01-05  -1.343225    1.091307   1.156641   -0.228815
2013-01-06  -0.513285    0.401775  -1.162886    1.530610
df2 = pd.DataFrame({'A': 1.,
                    'Date': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2
     A       Date    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
df2.dtypes
A float64
Date datetime64[ns]
C float32
D int32
E category
F object
dtype: object
View the head and tail of the data:
sales_vol.head()
                Apple  Strawberry  Pineapple  Watermelon
2013-01-01   1.516204    0.495784  -0.153495    0.477701
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631
2013-01-05  -1.343225    1.091307   1.156641   -0.228815
sales_vol.tail()
                Apple  Strawberry  Pineapple  Watermelon
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631
2013-01-05  -1.343225    1.091307   1.156641   -0.228815
2013-01-06  -0.513285    0.401775  -1.162886    1.530610
Display the index, the columns, and the underlying NumPy data:
sales_vol.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
sales_vol.columns
Index(['Apple', 'Strawberry', 'Pineapple', 'Watermelon'], dtype='object')
sales_vol.values
array([[ 1.51620355, 0.49578439, -0.15349507, 0.4777014 ],
[-0.18143898, -0.57451669, -1.40313854, -0.01822073],
[-0.37968056, -1.42305498, -1.0345705 , -0.23502055],
[-0.63135233, -0.49263546, -1.186038 , 0.57463084],
[-1.34322502, 1.09130739, 1.15664088, -0.22881497],
[-0.51328525, 0.40177536, -1.16288585, 1.53061048]])
sales_vol.describe()
           Apple  Strawberry  Pineapple  Watermelon
count   6.000000    6.000000   6.000000    6.000000
mean   -0.255463   -0.083557  -0.630581    0.350148
std     0.954013    0.911497   0.977109    0.674310
min    -1.343225   -1.423055  -1.403139   -0.235021
25%    -0.601836   -0.554046  -1.180250   -0.176166
50%    -0.446483   -0.045430  -1.098728    0.229740
75%    -0.230999    0.472282  -0.373764    0.550398
max     1.516204    1.091307   1.156641    1.530610
sales_vol.T
            2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
Apple         1.516204   -0.181439   -0.379681   -0.631352   -1.343225   -0.513285
Strawberry    0.495784   -0.574517   -1.423055   -0.492635    1.091307    0.401775
Pineapple    -0.153495   -1.403139   -1.034571   -1.186038    1.156641   -1.162886
Watermelon    0.477701   -0.018221   -0.235021    0.574631   -0.228815    1.530610
sales_vol.sort_index(axis=1, ascending=True)
                Apple  Pineapple  Strawberry  Watermelon
2013-01-01   1.516204  -0.153495    0.495784    0.477701
2013-01-02  -0.181439  -1.403139   -0.574517   -0.018221
2013-01-03  -0.379681  -1.034571   -1.423055   -0.235021
2013-01-04  -0.631352  -1.186038   -0.492635    0.574631
2013-01-05  -1.343225   1.156641    1.091307   -0.228815
2013-01-06  -0.513285  -1.162886    0.401775    1.530610
sales_vol.sort_values(by='Watermelon', ascending=True)
                Apple  Strawberry  Pineapple  Watermelon
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
2013-01-05  -1.343225    1.091307   1.156641   -0.228815
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-01   1.516204    0.495784  -0.153495    0.477701
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631
2013-01-06  -0.513285    0.401775  -1.162886    1.530610
sales_vol['Apple']
2013-01-01 1.516204
2013-01-02 -0.181439
2013-01-03 -0.379681
2013-01-04 -0.631352
2013-01-05 -1.343225
2013-01-06 -0.513285
Freq: D, Name: Apple, dtype: float64
sales_vol[0:3]
                Apple  Strawberry  Pineapple  Watermelon
2013-01-01   1.516204    0.495784  -0.153495    0.477701
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
sales_vol['2013-01-02':'2013-01-04']
                Apple  Strawberry  Pineapple  Watermelon
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631
sales_vol.loc[dates[0]]
Apple 1.516204
Strawberry 0.495784
Pineapple -0.153495
Watermelon 0.477701
Name: 2013-01-01 00:00:00, dtype: float64
sales_vol.loc[:, ['Apple', 'Strawberry']]
                Apple  Strawberry
2013-01-01   1.516204    0.495784
2013-01-02  -0.181439   -0.574517
2013-01-03  -0.379681   -1.423055
2013-01-04  -0.631352   -0.492635
2013-01-05  -1.343225    1.091307
2013-01-06  -0.513285    0.401775
sales_vol.loc['20130102':'20130104', ['Apple', 'Strawberry']]
                Apple  Strawberry
2013-01-02  -0.181439   -0.574517
2013-01-03  -0.379681   -1.423055
2013-01-04  -0.631352   -0.492635
sales_vol.loc['20130102', ['Apple', 'Strawberry']]
Apple -0.181439
Strawberry -0.574517
Name: 2013-01-02 00:00:00, dtype: float64
sales_vol.loc[dates[0], 'Apple']
1.5162035528655158
sales_vol.at[dates[0], 'Apple']
1.5162035528655158
sales_vol.iloc[3]
Apple -0.631352
Strawberry -0.492635
Pineapple -1.186038
Watermelon 0.574631
Name: 2013-01-04 00:00:00, dtype: float64
sales_vol.iloc[3:5, 0:2]
sales_vol.iloc[[1, 2, 4], [0, 2]]
                Apple  Pineapple
2013-01-02  -0.181439  -1.403139
2013-01-03  -0.379681  -1.034571
2013-01-05  -1.343225   1.156641
sales_vol.iloc[1:3, :]
                Apple  Strawberry  Pineapple  Watermelon
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021
sales_vol.iloc[:, 1:3]
            Strawberry  Pineapple
2013-01-01    0.495784  -0.153495
2013-01-02   -0.574517  -1.403139
2013-01-03   -1.423055  -1.034571
2013-01-04   -0.492635  -1.186038
2013-01-05    1.091307   1.156641
2013-01-06    0.401775  -1.162886
sales_vol.iloc[1, 1]
sales_vol.iat[1, 1]
-0.574516686350344
Boolean indexing (query)
sales_vol[sales_vol.Apple > 0]
                Apple  Strawberry  Pineapple  Watermelon
2013-01-01   1.516204    0.495784  -0.153495    0.477701
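A boolean mask can also be applied element-wise to the whole frame, in which case entries failing the condition become NaN. A small sketch with a made-up two-column frame (not the tutorial's sales_vol):

```python
import numpy as np
import pandas as pd

# Illustrative data only
df = pd.DataFrame({'x': [1.0, -2.0, 3.0], 'y': [-4.0, 5.0, -6.0]})

# Element-wise mask: positive entries kept, the rest become NaN
masked = df[df > 0]
print(masked)
```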
df2 = sales_vol.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
                Apple  Strawberry  Pineapple  Watermelon      E
2013-01-01   1.516204    0.495784  -0.153495    0.477701    one
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221    one
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021    two
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631  three
2013-01-05  -1.343225    1.091307   1.156641   -0.228815   four
2013-01-06  -0.513285    0.401775  -1.162886    1.530610  three
df2[df2['E'].isin(['one', 'four'])]
                Apple  Strawberry  Pineapple  Watermelon     E
2013-01-01   1.516204    0.495784  -0.153495    0.477701   one
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221   one
2013-01-05  -1.343225    1.091307   1.156641   -0.228815  four
df2['E'].isin(['one', 'four'])
2013-01-01 True
2013-01-02 True
2013-01-03 False
2013-01-04 False
2013-01-05 True
2013-01-06 False
Freq: D, Name: E, dtype: bool
Insertion (create)
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
sales_vol['Cherry'] = s1
sales_vol
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01   1.516204    0.495784  -0.153495    0.477701     NaN
2013-01-02  -0.181439   -0.574517  -1.403139   -0.018221     1.0
2013-01-03  -0.379681   -1.423055  -1.034571   -0.235021     2.0
2013-01-04  -0.631352   -0.492635  -1.186038    0.574631     3.0
2013-01-05  -1.343225    1.091307   1.156641   -0.228815     4.0
2013-01-06  -0.513285    0.401775  -1.162886    1.530610     5.0
Modification (update)
sales_vol.at[dates[0], 'Apple'] = 0
sales_vol.iat[0, 1] = 0
sales_vol.loc[:, 'Pineapple'] = np.array([5] * len(sales_vol))
sales_vol
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01   0.000000    0.000000          5    0.477701     NaN
2013-01-02  -0.181439   -0.574517          5   -0.018221     1.0
2013-01-03  -0.379681   -1.423055          5   -0.235021     2.0
2013-01-04  -0.631352   -0.492635          5    0.574631     3.0
2013-01-05  -1.343225    1.091307          5   -0.228815     4.0
2013-01-06  -0.513285    0.401775          5    1.530610     5.0
df2 = sales_vol.copy()
df2[df2 > 0] = -df2
df2
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01   0.000000    0.000000         -5   -0.477701     NaN
2013-01-02  -0.181439   -0.574517         -5   -0.018221    -1.0
2013-01-03  -0.379681   -1.423055         -5   -0.235021    -2.0
2013-01-04  -0.631352   -0.492635         -5   -0.574631    -3.0
2013-01-05  -1.343225   -1.091307         -5   -0.228815    -4.0
2013-01-06  -0.513285   -0.401775         -5   -1.530610    -5.0
4. Handling missing data
df1 = sales_vol.reindex(index=dates[0:4], columns=list(sales_vol.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1
                Apple  Strawberry  Pineapple  Watermelon  Cherry    E
2013-01-01   0.000000    0.000000          5    0.477701     NaN  1.0
2013-01-02  -0.181439   -0.574517          5   -0.018221     1.0  1.0
2013-01-03  -0.379681   -1.423055          5   -0.235021     2.0  NaN
2013-01-04  -0.631352   -0.492635          5    0.574631     3.0  NaN
df_drop = df1.dropna(how='any')
df_drop
                Apple  Strawberry  Pineapple  Watermelon  Cherry    E
2013-01-02  -0.181439   -0.574517          5   -0.018221     1.0  1.0
df_fill = df1.fillna(value=5)
df_fill
                Apple  Strawberry  Pineapple  Watermelon  Cherry    E
2013-01-01   0.000000    0.000000          5    0.477701     5.0  1.0
2013-01-02  -0.181439   -0.574517          5   -0.018221     1.0  1.0
2013-01-03  -0.379681   -1.423055          5   -0.235021     2.0  5.0
2013-01-04  -0.631352   -0.492635          5    0.574631     3.0  5.0
pd.isnull(df1)
            Apple  Strawberry  Pineapple  Watermelon  Cherry      E
2013-01-01  False       False      False       False    True  False
2013-01-02  False       False      False       False   False  False
2013-01-03  False       False      False       False   False   True
2013-01-04  False       False      False       False   False   True
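Besides dropping rows or filling with a constant, gaps can be filled forward from the last valid observation. A minimal sketch with a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
filled = s.ffill()  # each NaN takes the last valid value before it
print(filled.tolist())  # [1.0, 1.0, 1.0, 4.0]
```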
5. Operations
sales_vol.mean()
Apple -0.508164
Strawberry -0.166187
Pineapple 5.000000
Watermelon 0.350148
Cherry 3.000000
dtype: float64
sales_vol.mean(1)
2013-01-01 1.369425
2013-01-02 1.045165
2013-01-03 0.992449
2013-01-04 1.490129
2013-01-05 1.703853
2013-01-06 2.283820
Freq: D, dtype: float64
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
sales_vol.sub(s, axis='index')
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01        NaN         NaN        NaN         NaN     NaN
2013-01-02        NaN         NaN        NaN         NaN     NaN
2013-01-03  -1.379681   -2.423055        4.0   -1.235021     1.0
2013-01-04  -3.631352   -3.492635        2.0   -2.425369     0.0
2013-01-05  -6.343225   -3.908693        0.0   -5.228815    -1.0
2013-01-06        NaN         NaN        NaN         NaN     NaN
sales_vol
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01   0.000000    0.000000          5    0.477701     NaN
2013-01-02  -0.181439   -0.574517          5   -0.018221     1.0
2013-01-03  -0.379681   -1.423055          5   -0.235021     2.0
2013-01-04  -0.631352   -0.492635          5    0.574631     3.0
2013-01-05  -1.343225    1.091307          5   -0.228815     4.0
2013-01-06  -0.513285    0.401775          5    1.530610     5.0
sales_vol.apply(np.cumsum)
                Apple  Strawberry  Pineapple  Watermelon  Cherry
2013-01-01   0.000000    0.000000          5    0.477701     NaN
2013-01-02  -0.181439   -0.574517         10    0.459481     1.0
2013-01-03  -0.561120   -1.997572         15    0.224460     3.0
2013-01-04  -1.192472   -2.490207         20    0.799091     6.0
2013-01-05  -2.535697   -1.398900         25    0.570276    10.0
2013-01-06  -3.048982   -0.997124         30    2.100886    15.0
sales_vol.apply(lambda x: x.max() - x.min())
Apple 1.343225
Strawberry 2.514362
Pineapple 0.000000
Watermelon 1.765631
Cherry 4.000000
dtype: float64
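By default `apply` works column by column; passing `axis=1` applies the function once per row instead. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=1: the lambda receives one row (a Series) at a time
row_sums = df.apply(lambda row: row['a'] + row['b'], axis=1)
print(row_sums.tolist())  # [11, 22]
```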
s = pd.Series(np.random.randint(0, 7, size=10))
s
0 4
1 5
2 3
3 2
4 2
5 6
6 1
7 1
8 4
9 6
dtype: int32
s.value_counts()
6 2
4 2
2 2
1 2
5 1
3 1
dtype: int64
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
6. Merge
df = pd.DataFrame(np.random.randn(10, 4))
df
          0         1         2         3
0 -0.064357 -0.729935 -0.257929  0.658859
1 -0.087968  1.512643  1.432160  0.060939
2  0.721509 -0.417664 -0.785848 -0.440885
3  1.355722 -1.766013 -0.595825 -0.507181
4  0.127415 -0.330855  0.155734 -0.395429
5  0.110239 -2.378898 -0.139588 -2.090046
6 -0.875529  0.980170  2.073291  0.470351
7 -1.334029  0.873497  1.164064  0.624527
8  0.123787 -0.077460 -0.276321  2.126241
9 -0.914960 -1.583755 -0.528952  0.816373
pieces = [df[:3], df[3:7], df[7:]]
pieces
[ 0 1 2 3
0 -0.064357 -0.729935 -0.257929 0.658859
1 -0.087968 1.512643 1.432160 0.060939
2 0.721509 -0.417664 -0.785848 -0.440885,
0 1 2 3
3 1.355722 -1.766013 -0.595825 -0.507181
4 0.127415 -0.330855 0.155734 -0.395429
5 0.110239 -2.378898 -0.139588 -2.090046
6 -0.875529 0.980170 2.073291 0.470351,
0 1 2 3
7 -1.334029 0.873497 1.164064 0.624527
8 0.123787 -0.077460 -0.276321 2.126241
9 -0.914960 -1.583755 -0.528952 0.816373]
pd.concat(pieces)
          0         1         2         3
0 -0.064357 -0.729935 -0.257929  0.658859
1 -0.087968  1.512643  1.432160  0.060939
2  0.721509 -0.417664 -0.785848 -0.440885
3  1.355722 -1.766013 -0.595825 -0.507181
4  0.127415 -0.330855  0.155734 -0.395429
5  0.110239 -2.378898 -0.139588 -2.090046
6 -0.875529  0.980170  2.073291  0.470351
7 -1.334029  0.873497  1.164064  0.624527
8  0.123787 -0.077460 -0.276321  2.126241
9 -0.914960 -1.583755 -0.528952  0.816373
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
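Because `key` is duplicated on both sides above, the merge produces the 2×2 cross product. With unique keys it behaves like a plain join instead; a sketch:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

# Unique keys: one output row per key, no duplication
merged = pd.merge(left, right, on='key')
print(merged)
```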
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]
df.append(s, ignore_index=True)
          A         B         C         D
0 -1.022777  1.015712 -0.275568 -0.505822
1 -0.449308 -2.028096  0.976798  1.341397
2  0.150794 -0.277167 -0.324461 -0.700945
3 -0.043391 -1.090533  0.854648  0.778152
4 -1.391091  0.847236 -0.126084 -1.004139
5  0.065630  0.229489 -2.033006  0.825722
6  1.315419  0.377726  0.092337  0.290093
7  0.615620 -0.402542 -0.047299 -1.149747
8 -0.043391 -1.090533  0.854648  0.778152
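Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; the same row-append can be written with `pd.concat`. A sketch reproducing the step above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]

# Turn the row Series into a 1-row frame, then concatenate
out = pd.concat([df, s.to_frame().T], ignore_index=True)
print(out.shape)  # (9, 4)
```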
7. Grouping
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
     A      B         C         D
0  foo    one  0.942255  0.461081
1  bar    one -0.988118  1.099893
2  foo    two  0.773705  0.513753
3  bar  three  0.844189  0.341648
4  foo    two  1.325598 -0.200073
5  bar    two  0.313239 -0.007345
6  foo    one  1.416963  2.522686
7  foo  three -0.487402  1.928968
df.groupby('A').sum()
            C         D
A
bar  0.169310  1.434197
foo  3.971119  5.226417
df.groupby(['A', 'B']).sum()
                  C         D
A   B
bar one   -0.988118  1.099893
    three  0.844189  0.341648
    two    0.313239 -0.007345
foo one    2.359218  2.983768
    three -0.487402  1.928968
    two    2.099303  0.313681
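`groupby` is not limited to a single reduction; `agg` can compute several statistics at once. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'C': [1.0, 2.0, 3.0]})

# One row per group, one column per aggregation
stats = df.groupby('A')['C'].agg(['sum', 'mean'])
print(stats)
```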
8. Reshaping
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
tuples
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
                     A         B
first second
bar   one     1.182011 -0.351810
      two    -1.464586 -0.377918
baz   one    -0.696995  1.227232
      two     0.416556 -1.173721
stacked = df2.stack()
stacked
first second
bar one A 1.182011
B -0.351810
two A -1.464586
B -0.377918
baz one A -0.696995
B 1.227232
two A 0.416556
B -1.173721
dtype: float64
stacked.unstack()
                     A         B
first second
bar   one     1.182011 -0.351810
      two    -1.464586 -0.377918
baz   one    -0.696995  1.227232
      two     0.416556 -1.173721
stacked.unstack(1)
second        one       two
first
bar   A  1.182011 -1.464586
      B -0.351810 -0.377918
baz   A -0.696995  0.416556
      B  1.227232 -1.173721
stacked.unstack(0)
first          bar       baz
second
one    A  1.182011 -0.696995
       B -0.351810  1.227232
two    A -1.464586  0.416556
       B -0.377918 -1.173721
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df
        A  B    C         D         E
0     one  A  foo  0.733791  2.254117
1     one  B  foo  0.889867  0.172840
2     two  C  foo -0.673481 -1.287785
3   three  A  bar  1.621192  0.286708
4     one  B  bar  1.827056  1.477342
5     one  C  bar -1.586255  1.145898
6     two  A  foo  0.932376 -1.259903
7   three  B  foo -2.246142  0.178643
8     one  C  foo -1.483878  0.055398
9     one  A  bar -0.134426 -0.629503
10    two  B  bar -1.784999  0.498491
11  three  C  bar  0.956670 -0.446973
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C             bar       foo
A     B
one   A -0.134426  0.733791
      B  1.827056  0.889867
      C -1.586255 -1.483878
three A  1.621192       NaN
      B       NaN -2.246142
      C  0.956670       NaN
two   A       NaN  0.932376
      B -1.784999       NaN
      C       NaN -0.673481
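The NaN cells above mark A/B/C combinations that never occur in the data; `pivot_table`'s `fill_value` parameter substitutes a constant for them. A sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'C': ['u', 'v', 'u'],
                   'D': [1.0, 2.0, 3.0]})

# ('y', 'v') never occurs, so fill_value=0 fills that cell
pt = pd.pivot_table(df, values='D', index='A', columns='C', fill_value=0)
print(pt)
```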
9. Time series
pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting data sampled every second into data sampled every 5 minutes). This kind of operation is very common in finance.
rng = pd.date_range('20120101', periods=100, freq='S')
rng
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5Min').sum()
2012-01-01 25643
Freq: 5T, dtype: int32
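The sum above depends on random input; a deterministic sketch (a made-up series of ones at 1-second ticks) makes the bucket arithmetic visible:

```python
import numpy as np
import pandas as pd

# 10 minutes of 1-second ticks, each carrying the value 1
rng = pd.date_range('2012-01-01', periods=600, freq='S')
ts = pd.Series(np.ones(len(rng)), index=rng)

# Two 5-minute buckets, each summing 300 one-second values
res = ts.resample('5Min').sum()
print(res)
```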
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06 0.617556
2012-03-07 -0.140357
2012-03-08 0.720688
2012-03-09 -0.002317
2012-03-10 -1.039849
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-06 00:00:00+00:00 0.617556
2012-03-07 00:00:00+00:00 -0.140357
2012-03-08 00:00:00+00:00 0.720688
2012-03-09 00:00:00+00:00 -0.002317
2012-03-10 00:00:00+00:00 -1.039849
Freq: D, dtype: float64
ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00 0.617556
2012-03-06 19:00:00-05:00 -0.140357
2012-03-07 19:00:00-05:00 0.720688
2012-03-08 19:00:00-05:00 -0.002317
2012-03-09 19:00:00-05:00 -1.039849
Freq: D, dtype: float64
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31 0.228717
2012-02-29 -1.027588
2012-03-31 0.908395
2012-04-30 0.625547
2012-05-31 -0.109560
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01 0.228717
2012-02 -1.027588
2012-03 0.908395
2012-04 0.625547
2012-05 -0.109560
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01 0.228717
2012-02-01 -1.027588
2012-03-01 0.908395
2012-04-01 0.625547
2012-05-01 -0.109560
Freq: MS, dtype: float64
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()
10. Categorical
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
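Assigning to `.cat.categories`, as done above, was later deprecated and then removed in pandas 2.0; `rename_categories` is the supported equivalent. A minimal sketch with made-up labels:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'b'], dtype='category')

# Rename positionally: 'a' -> 'good', 'b' -> 'bad'
s = s.cat.rename_categories(['good', 'bad'])
print(list(s))  # ['good', 'bad', 'bad']
```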
df.sort_values(by="grade")
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
df.groupby("grade").size()
11. Plotting
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1788d4ef1c8>
[figure: line plot of the cumulative-sum series produced by ts.plot()]
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure();
df.plot();
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x1788cbdbe88>
<Figure size 432x288 with 0 Axes>
[figure: line plot of the four cumulative-sum columns A-D with a legend]
12. Importing and saving data
df.to_csv('foo.csv')
pd.read_csv('foo.csv')
    Unnamed: 0          A         B          C          D
0   2000-01-01  -1.125225  1.577122  -0.631104   0.226040
1   2000-01-02  -1.636618  0.825889  -1.702668   0.742720
2   2000-01-03  -1.874304 -0.924554  -0.396482   0.757374
3   2000-01-04  -2.804663 -1.015225   0.563958   1.102884
4   2000-01-05  -2.183829 -1.126147   0.987302  -0.351845
..         ...        ...       ...        ...        ...
995 2002-09-22 -10.052230 -5.444433 -49.408961  19.674901
996 2002-09-23  -9.058395 -5.363573 -47.815307  19.377475
997 2002-09-24  -8.024636 -6.063298 -47.262819  21.506791
998 2002-09-25  -8.128653 -5.967861 -46.710008  22.240115
999 2002-09-26  -9.313855 -5.906042 -47.302999  22.040534
1000 rows × 5 columns
df.to_hdf('foo.h5', 'df')
pd.read_hdf('foo.h5', 'df')
                    A         B          C          D
2000-01-01  -1.125225  1.577122  -0.631104   0.226040
2000-01-02  -1.636618  0.825889  -1.702668   0.742720
2000-01-03  -1.874304 -0.924554  -0.396482   0.757374
2000-01-04  -2.804663 -1.015225   0.563958   1.102884
2000-01-05  -2.183829 -1.126147   0.987302  -0.351845
...               ...       ...        ...        ...
2002-09-22 -10.052230 -5.444433 -49.408961  19.674901
2002-09-23  -9.058395 -5.363573 -47.815307  19.377475
2002-09-24  -8.024636 -6.063298 -47.262819  21.506791
2002-09-25  -8.128653 -5.967861 -46.710008  22.240115
2002-09-26  -9.313855 -5.906042 -47.302999  22.040534
1000 rows × 4 columns
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
    Unnamed: 0          A         B          C          D
0   2000-01-01  -1.125225  1.577122  -0.631104   0.226040
1   2000-01-02  -1.636618  0.825889  -1.702668   0.742720
2   2000-01-03  -1.874304 -0.924554  -0.396482   0.757374
3   2000-01-04  -2.804663 -1.015225   0.563958   1.102884
4   2000-01-05  -2.183829 -1.126147   0.987302  -0.351845
..         ...        ...       ...        ...        ...
995 2002-09-22 -10.052230 -5.444433 -49.408961  19.674901
996 2002-09-23  -9.058395 -5.363573 -47.815307  19.377475
997 2002-09-24  -8.024636 -6.063298 -47.262819  21.506791
998 2002-09-25  -8.128653 -5.967861 -46.710008  22.240115
999 2002-09-26  -9.313855 -5.906042 -47.302999  22.040534
1000 rows × 5 columns
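The `Unnamed: 0` column appears because the index was written out as an ordinary column; passing `index_col=0` when reading restores it as the index. A sketch with a tiny made-up frame and a throwaway file name:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
df.to_csv('tmp_roundtrip_demo.csv')  # the index becomes the first CSV column

# index_col=0 turns that first column back into the index
back = pd.read_csv('tmp_roundtrip_demo.csv', index_col=0)
print(back.index.tolist())  # ['x', 'y']
```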
The purpose of this blog is to summarize and share. Most of the code comes from the internet, and the author does not claim ownership of any of it. Readers may copy and use it freely without the author's authorization; crediting the source when reposting would be appreciated. If anything here is unpolished or wrong, criticism and corrections are welcome (chenaijun1298 at yeah dot net).