Getting Started with pandas
First, all of the examples below require these imports:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
One-dimensional arrays: Series:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
Output:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Creating a two-dimensional array (DataFrame):
dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
Output:
                   A         B         C         D
2016-03-01 -0.859805 -0.069692 -0.905092 -0.553213
2016-03-02 -0.353785  0.031793 -0.785213 -0.212337
2016-03-03  1.719976  0.925145  0.241639 -0.490166
2016-03-04 -1.207854 -0.001647 -0.468976 -0.781144
2016-03-05  0.452034  1.371208  1.152729  1.470498
2016-03-06  1.378227  0.246941 -1.186630 -0.411647
df.values
array([[ -8.59804740e-01,  -6.96922374e-02,  -9.05091676e-01,  -5.53212518e-01],
       [ -3.53785450e-01,   3.17933613e-02,  -7.85212585e-01,  -2.12337229e-01],
       [  1.71997643e+00,   9.25144720e-01,   2.41639347e-01,  -4.90166361e-01],
       [ -1.20785422e+00,  -1.64720630e-03,  -4.68976209e-01,  -7.81144372e-01],
       [  4.52033577e-01,   1.37120779e+00,   1.15272905e+00,   1.47049771e+00],
       [  1.37822701e+00,   2.46941166e-01,  -1.18662963e+00,  -4.11647030e-01]])
A DataFrame can also be built from a dict, as follows:
df1 = pd.DataFrame({
    'A': 1,
    'B': pd.Timestamp('20160301'),
    'C': range(4),
    'D': np.arange(5, 9),
    'E': 'text',
    'F': ['AA', 'BB', 'CC', 'DD']})
df1
   A          B  C  D     E   F
0  1 2016-03-01  0  5  text  AA
1  1 2016-03-01  1  6  text  BB
2  1 2016-03-01  2  7  text  CC
3  1 2016-03-01  3  8  text  DD
Each column can be accessed as an attribute of the DataFrame, e.g.
df.A, df.B
type(df.A)
pandas.core.series.Series
As you can see, each column (and likewise each row) is a Series.
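This can be checked directly; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2), columns=['A', 'B'])
col = df['A']     # a column is a Series
row = df.iloc[0]  # a selected row is also a Series
```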
df.dtypes
df.shape
df.head()
df.tail()
df.index
df.columns
df.values
df.describe()
df.T
df.sort_index()
df.sort_values(by='A')
df['A']
df.A
df[2:4]
df.loc[...]
df.iloc[...]
df.loc[:, ['B', 'C']]; a single value can also be accessed:
df.at[dates[0], 'A']
df.iat[0, 0]
df[df.A > 0]
df[df > 0]
Filtering rows with isin
df['tag'] = ['a'] * 2 + ['b'] * 2 + ['c'] * 2
                   A         B         C         D tag
2016-03-01  0.033228  2.307123 -0.585367 -1.671832   a
2016-03-02 -1.967299  0.727670 -0.190863 -0.163514   a
2016-03-03  0.065359  0.696804 -0.550040  0.717347   b
2016-03-04  0.234850  0.289520 -1.087173  1.534277   b
2016-03-05 -1.459620  1.040987  0.220130 -0.068131   c
2016-03-06  0.865402  2.650889 -0.015460 -0.111889   c
df[df.tag.isin(['a', 'c'])]
                   A         B         C         D tag
2016-03-01  0.033228  2.307123 -0.585367 -1.671832   a
2016-03-02 -1.967299  0.727670 -0.190863 -0.163514   a
2016-03-05 -1.459620  1.040987  0.220130 -0.068131   c
2016-03-06  0.865402  2.650889 -0.015460 -0.111889   c
### Modifying elements
s = pd.Series(np.arange(6), index=pd.date_range('20160301', periods=6))
df['E'] = s
df.at[pd.Timestamp('20160301'), 'A'] = 0.4
df.B = 200
df.iloc[:, 2:5] = 1000
To shut down the Jupyter notebook server, press Ctrl+C twice in its console.
Reindexing
dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(data=np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
                   A         B         C         D
2016-03-01 -0.985666  0.240058  0.716721  0.352009
2016-03-02 -1.563644  0.091766  1.081764  0.951541
2016-03-03  0.279760 -0.316136  1.198073 -0.562947
2016-03-04  1.174777 -0.225305 -0.280256 -0.074768
2016-03-05  2.173366  0.907038 -1.104678 -0.921779
2016-03-06  0.200422  0.442619  1.970330 -0.609867
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1
                   A         B         C         D   E
2016-03-01 -0.985666  0.240058  0.716721  0.352009 NaN
2016-03-02 -1.563644  0.091766  1.081764  0.951541 NaN
2016-03-03  0.279760 -0.316136  1.198073 -0.562947 NaN
2016-03-04  1.174777 -0.225305 -0.280256 -0.074768 NaN
### Handling missing data
df1.loc[dates[1:3], 'E'] = 1
df1
                   A         B         C         D   E
2016-03-01 -0.985666  0.240058  0.716721  0.352009 NaN
2016-03-02 -1.563644  0.091766  1.081764  0.951541   1
2016-03-03  0.279760 -0.316136  1.198073 -0.562947   1
2016-03-04  1.174777 -0.225305 -0.280256 -0.074768 NaN
Column E now has two missing values.
df1.dropna()
df1.fillna(value=5)
pd.isnull(df1)
pd.isnull(df1).any()
pd.isnull(df1).any().any()
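A minimal sketch of this isnull chain: the first any() reduces per column, the second reduces to a single flag.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0], 'E': [np.nan, 1.0]})
per_column = pd.isnull(df1).any()     # Series: one bool per column
has_any_nan = bool(per_column.any())  # collapse to a single bool
```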
Statistics
Missing values are excluded from statistical calculations.
df1.mean()
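A toy Series confirms that NaN entries are skipped entirely rather than counted as zero:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
m = s.mean()  # (1 + 3) / 2; the NaN is excluded from the calculation
```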
df.sum(axis='columns')
df.cumsum()
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
df.sub(s, axis='index')
df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())
s = pd.Series(np.random.randint(0, 7, size=10))  # the upper bound is exclusive, unlike Python's built-in random.randint
s.value_counts()
s.mode()
Merging data
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
          A         B         C         D
0  1.098103 -0.843356 -0.379135  0.419353
1 -0.177702 -0.225926 -0.363542 -0.153022
2  1.938231  0.154881  0.291382  0.152774
3 -0.460645 -0.268697 -1.509469  0.698776
4 -0.397048 -0.958223  0.212833 -0.435485
5  0.525406 -0.177595  0.453216 -0.093792
6  0.531912 -0.832667  0.200721  0.943878
7 -0.740845  0.098634  0.274020  1.671997
8  2.182379  1.729010  1.306269  0.580677
9 -0.031538  0.159714  0.736667 -0.122326
df1 = pd.concat([df.iloc[:3], df.iloc[3:7], df.iloc[7:]])
(df == df1).all().all()
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
s = pd.Series(np.random.randint(1, 5, size=4), index=list('ABCD'))
df.append(s, ignore_index=True)  # removed in pandas 2.0; use pd.concat([df, s.to_frame().T], ignore_index=True)
If s had five entries, the result would gain a new column that is NaN in every row except the newly appended last one.
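A sketch of that behavior, using pd.concat since df.append was removed in pandas 2.x:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 4)), columns=list('ABCD'))
s = pd.Series([1, 2, 3, 4, 5], index=list('ABCDE'))  # one extra label, 'E'
# pd.concat is the modern equivalent of df.append(s, ignore_index=True);
# 'E' appears as a new column, and the pre-existing rows hold NaN there
out = pd.concat([df, s.to_frame().T], ignore_index=True)
```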
Group-by statistics
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
     A      B         C         D
0  foo    one -0.580320 -1.460149
1  bar    one  1.471201 -1.079598
2  foo    two  0.094836  1.513204
3  bar  three -1.498810  0.754968
4  foo    two  0.180709  0.415266
5  bar    two  0.358515 -0.341988
6  foo    one -0.121082 -0.408148
7  foo  three  0.404648 -0.320882
df.groupby('A').sum()
            C         D
A
bar  0.330906 -0.666618
foo -0.021208 -0.260709
df.groupby(['A', 'B']).sum()
                  C         D
A   B
bar one    1.471201 -1.079598
    three -1.498810  0.754968
    two    0.358515 -0.341988
foo one   -0.701402 -1.868297
    three  0.404648 -0.320882
    two    0.275545  1.928470
This produces a result with a two-level (hierarchical) index.
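A small sketch of how such a two-level index is addressed: the group keys become index levels, and a single group is selected with a tuple key.

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'B': ['p', 'q', 'p'],
                   'C': [1, 2, 3]})
g = df.groupby(['A', 'B']).sum()
levels = g.index.nlevels      # 2: one level per grouping key
val = g.loc[('x', 'q'), 'C']  # select one group with a tuple key
```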
Reshaping data
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
                     A         B
first second
bar   one     0.072026  0.422077
      two    -1.099181 -0.354796
baz   one     1.285500 -1.185525
      two     0.645316 -0.660115
foo   one     0.696443 -1.664527
      two     0.718399 -0.154125
qux   one    -0.740052  0.713089
      two    -0.672748 -1.346843
stacked = df.stack()
first second
bar one A 0.072026
B 0.422077
two A -1.099181
B -0.354796
baz one A 1.285500
B -1.185525
two A 0.645316
B -0.660115
foo one A 0.696443
B -1.664527
two A 0.718399
B -0.154125
qux one A -0.740052
B 0.713089
two A -0.672748
B -1.346843
dtype: float64
stacked.unstack()
stacked.unstack().unstack()
               A                   B
second       one       two       one       two
first
bar     0.072026 -1.099181  0.422077 -0.354796
baz     1.285500  0.645316 -1.185525 -0.660115
foo     0.696443  0.718399 -1.664527 -0.154125
qux    -0.740052 -0.672748  0.713089 -1.346843
stacked.unstack(1)
second        one       two
first
bar   A  0.072026 -1.099181
      B  0.422077 -0.354796
baz   A  1.285500  0.645316
      B -1.185525 -0.660115
foo   A  0.696443  0.718399
      B -1.664527 -0.154125
qux   A -0.740052 -0.672748
      B  0.713089 -1.346843
Pivot tables
A pivot table summarizes a cross-section of the data.
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df
        A  B    C         D         E
0     one  A  foo  1.477533  1.557713
1     one  B  foo  0.019528  2.483014
2     two  C  foo -0.912452  0.409732
3   three  A  bar  0.502807 -0.462401
4     one  B  bar  1.709597 -1.739413
5     one  C  bar -0.658155  1.302735
6     two  A  foo  0.007806  0.782926
7   three  B  foo -0.067922 -0.193820
8     one  C  foo  0.806713  0.383870
9     one  A  bar  0.794017  0.749756
10    two  B  bar -0.532554 -0.811900
11  three  C  bar  0.464731  1.168423
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C             bar       foo
A     B
one   A  0.794017  1.477533
      B  1.709597  0.019528
      C -0.658155  0.806713
three A  0.502807       NaN
      B       NaN -0.067922
      C  0.464731       NaN
two   A       NaN  0.007806
      B -0.532554       NaN
      C       NaN -0.912452
**When several rows map to the same cell of the pivot table, their values are averaged.**
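This averaging is easy to verify, since the default aggfunc of pivot_table is the mean:

```python
import pandas as pd

# two rows fall into the same (A='x', B='p') cell
df = pd.DataFrame({'A': ['x', 'x'], 'B': ['p', 'p'], 'v': [1.0, 3.0]})
pt = pd.pivot_table(df, values='v', index='A', columns='B')  # default aggfunc is mean
cell = pt.loc['x', 'p']  # (1.0 + 3.0) / 2
```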
Time series
rng = pd.date_range('20160301', periods=600, freq='s')
rng
DatetimeIndex(['2016-03-01 00:00:00', '2016-03-01 00:00:01',
'2016-03-01 00:00:02', '2016-03-01 00:00:03',
'2016-03-01 00:00:04', '2016-03-01 00:00:05',
'2016-03-01 00:00:06', '2016-03-01 00:00:07',
'2016-03-01 00:00:08', '2016-03-01 00:00:09',
...
'2016-03-01 00:09:50', '2016-03-01 00:09:51',
'2016-03-01 00:09:52', '2016-03-01 00:09:53',
'2016-03-01 00:09:54', '2016-03-01 00:09:55',
'2016-03-01 00:09:56', '2016-03-01 00:09:57',
'2016-03-01 00:09:58', '2016-03-01 00:09:59'],
dtype='datetime64[ns]', length=600, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
2016-03-01 00:00:00 34
2016-03-01 00:00:01 4
2016-03-01 00:00:02 382
2016-03-01 00:00:03 164
2016-03-01 00:00:04 178
2016-03-01 00:00:05 421
2016-03-01 00:00:06 34
2016-03-01 00:00:07 71
2016-03-01 00:00:08 316
2016-03-01 00:00:09 201
2016-03-01 00:00:10 214
2016-03-01 00:00:11 443
2016-03-01 00:00:12 185
2016-03-01 00:00:13 79
2016-03-01 00:00:14 38
2016-03-01 00:00:15 465
2016-03-01 00:00:16 309
2016-03-01 00:00:17 93
2016-03-01 00:00:18 20
2016-03-01 00:00:19 338
2016-03-01 00:00:20 149
2016-03-01 00:00:21 34
2016-03-01 00:00:22 257
2016-03-01 00:00:23 462
2016-03-01 00:00:24 41
2016-03-01 00:00:25 471
2016-03-01 00:00:26 313
2016-03-01 00:00:27 224
2016-03-01 00:00:28 78
2016-03-01 00:00:29 498
...
2016-03-01 00:09:30 61
2016-03-01 00:09:31 315
2016-03-01 00:09:32 388
2016-03-01 00:09:33 391
2016-03-01 00:09:34 263
2016-03-01 00:09:35 11
2016-03-01 00:09:36 61
2016-03-01 00:09:37 400
2016-03-01 00:09:38 109
2016-03-01 00:09:39 135
2016-03-01 00:09:40 267
2016-03-01 00:09:41 248
2016-03-01 00:09:42 469
2016-03-01 00:09:43 155
2016-03-01 00:09:44 284
2016-03-01 00:09:45 168
2016-03-01 00:09:46 228
2016-03-01 00:09:47 244
2016-03-01 00:09:48 442
2016-03-01 00:09:49 450
2016-03-01 00:09:50 226
2016-03-01 00:09:51 370
2016-03-01 00:09:52 192
2016-03-01 00:09:53 325
2016-03-01 00:09:54 82
2016-03-01 00:09:55 154
2016-03-01 00:09:56 285
2016-03-01 00:09:57 22
2016-03-01 00:09:58 48
2016-03-01 00:09:59 171
Freq: S, dtype: int32
ts.resample('2Min').sum()  # the old how='sum' argument was removed; use the method form
2016-03-01 00:00:00 28595
2016-03-01 00:02:00 29339
2016-03-01 00:04:00 28991
2016-03-01 00:06:00 30789
2016-03-01 00:08:00 30131
Freq: 2T, dtype: int32
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
'1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
'1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
'1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
'1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
'1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
'1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
'2000Q3', '2000Q4'],
dtype='int64', freq='Q-NOV')
prng.to_timestamp()
pd.Timestamp('20160301') - pd.Timestamp('2016')
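The subtraction yields a Timedelta; a quick check of the arithmetic ('2016' parses as 2016-01-01, and 2016 is a leap year):

```python
import pandas as pd

delta = pd.Timestamp('20160301') - pd.Timestamp('2016')
# 31 days in January + 29 days in February (leap year) = 60 days
```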
Categorical data
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df
   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e
df["grade"] = df["raw_grade"].astype("category")
df
   id raw_grade grade
0   1         a     a
1   2         b     b
2   3         b     b
3   4         a     a
4   5         a     a
5   6         e     e
df["grade"].cat.categories
Index([u'a', u'b', u'e'], dtype='object')
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])  # direct assignment to .cat.categories was removed in pandas 2.0
df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
df.sort_values(by='grade', ascending=True)
   id raw_grade      grade
0   1         a  very good
3   4         a  very good
4   5         a  very good
1   2         b       good
2   3         b       good
5   6         e   very bad
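Note that the sort above follows category order (very good < good < very bad), not alphabetical order; a minimal sketch:

```python
import pandas as pd

s = pd.Series(['b', 'a', 'e']).astype('category')
# categories are ['a', 'b', 'e']; rename them positionally
s = s.cat.rename_categories(['very good', 'good', 'very bad'])
ordered = s.sort_values().tolist()  # sorted by category order, not alphabetically
```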
Plotting
ts = pd.Series(np.random.randn(1000), index=pd.date_range('20000101', periods=1000))
ts = ts.cumsum()
ts
2000-01-01 0.416424
2000-01-02 0.603304
2000-01-03 -0.237965
2000-01-04 0.317450
2000-01-05 0.665045
2000-01-06 2.468087
2000-01-07 2.758852
2000-01-08 2.271343
2000-01-09 3.129609
2000-01-10 5.171241
2000-01-11 5.049896
2000-01-12 5.185316
2000-01-13 4.169058
2000-01-14 2.862306
2000-01-15 4.018617
2000-01-16 4.456694
2000-01-17 5.824236
2000-01-18 6.094983
2000-01-19 5.880954
2000-01-20 5.875111
2000-01-21 6.008481
2000-01-22 6.835501
2000-01-23 7.480405
2000-01-24 6.849335
2000-01-25 7.608887
2000-01-26 9.029474
2000-01-27 8.859222
2000-01-28 7.162806
2000-01-29 7.398013
2000-01-30 7.391844
...
2002-08-28 21.728409
2002-08-29 21.757852
2002-08-30 21.047643
2002-08-31 20.114996
2002-09-01 18.769902
2002-09-02 17.417680
2002-09-03 17.917688
2002-09-04 18.064786
2002-09-05 19.312356
2002-09-06 18.633479
2002-09-07 17.711879
2002-09-08 19.162369
2002-09-09 19.697896
2002-09-10 18.895018
2002-09-11 18.590989
2002-09-12 17.278925
2002-09-13 17.730168
2002-09-14 19.058526
2002-09-15 18.898382
2002-09-16 17.048621
2002-09-17 16.443233
2002-09-18 16.842284
2002-09-19 14.627031
2002-09-20 15.500982
2002-09-21 14.640444
2002-09-22 13.183795
2002-09-23 13.383657
2002-09-24 13.006229
2002-09-25 12.311008
2002-09-26 11.674804
Freq: D, dtype: float64
ts.plot()
Reading and writing data
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df
           A         B         C         D
0  -1.052421 -0.164992  3.098604 -0.966960
1   1.194177  0.086880  0.496095  0.265308
2   0.297724  1.284297 -0.130855 -0.229570
3  -0.787063  0.553680  0.546853 -0.322599
4   0.033174 -1.222281  0.320090 -1.749333
..       ...       ...       ...       ...
95 -0.095498 -1.849518  2.303532  0.688425
96 -0.017921 -0.558700 -1.061605  0.781250
97 -1.069070  1.106837 -1.936800 -0.782616
98  0.436267  0.463537  0.614982 -0.123774
99 -1.440635 -1.506836 -0.386824  1.118260

[100 rows x 4 columns]
df.to_csv('data.csv')
%ls
%more data.csv
pd.read_csv('data.csv', index_col=0)
Movie data analysis
Preparation
Download the MovieLens 1M Dataset from grouplens.org/datasets/movielens.
Data description
See the dataset's README.txt.
Analyzing the movie rating data with Pandas:
Read the data; merge the tables; compute each movie's average rating; find the most active movies (the more ratings a movie receives, the more active it is); the movies women rate highest; the movies men rate highest; the movies with the largest gender rating gap (liked by women but not by men); the most controversial movies (largest rating variance).
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=user_names, engine='python')
rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rating_names, engine='python')
movie_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=movie_names, engine='python')
pandas has two parser engines, one written in C and one in Python. engine='python' is used here because the C engine does not support multi-character separators such as '::'.
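A self-contained sketch of the same multi-character-separator parse, using an in-memory string instead of the MovieLens files:

```python
import pandas as pd
from io import StringIO

raw = StringIO("1::F::1\n2::M::56\n")
# a multi-character sep like '::' requires the python engine
df = pd.read_csv(raw, sep='::', header=None,
                 names=['user_id', 'gender', 'age'], engine='python')
```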
print(len(users))
users.head(5)
6040
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455
Merging the data
data = pd.merge(pd.merge(users, ratings), movies)
len(data)
data.head(5)
   user_id gender  age  occupation    zip  movie_id  rating  timestamp                                   title genres
0        1      F    1          10  48067      1193       5  978300760  One Flew Over the Cuckoo's Nest (1975)  Drama
1        2      M   56          16  70072      1193       5  978298413  One Flew Over the Cuckoo's Nest (1975)  Drama
2       12      M   25          12  32793      1193       4  978220179  One Flew Over the Cuckoo's Nest (1975)  Drama
3       15      M   25           7  22903      1193       4  978199279  One Flew Over the Cuckoo's Nest (1975)  Drama
4       17      M   50           1  95350      1193       5  978158471  One Flew Over the Cuckoo's Nest (1975)  Drama
data[data.user_id == 1]
mean_ratings_gender = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
mean_ratings_gender.head(5)
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
mean_ratings_gender['diff'] = mean_ratings_gender.F - mean_ratings_gender.M
mean_ratings_gender.head(5)
gender                                F         M      diff
title
$1,000,000 Duck (1971)         3.375000  2.761905  0.613095
'Night Mother (1986)           3.388889  3.352941  0.035948
'Til There Was You (1997)      2.675676  2.733333 -0.057658
'burbs, The (1989)             2.793478  2.962085 -0.168607
...And Justice for All (1979)  3.828571  3.689024  0.139547
mean_ratings_gender.sort_values(by='diff', ascending=True).head(10)
gender                                      F         M      diff
title
Tigrero: A Film That Was Never Made (1994)  1  4.333333 -3.333333
Neon Bible, The (1995)                      1  4.000000 -3.000000
Enfer, L' (1994)                            1  3.750000 -2.750000
Stalingrad (1993)                           1  3.593750 -2.593750
Killer: A Journal of Murder (1995)          1  3.428571 -2.428571
Dangerous Ground (1997)                     1  3.333333 -2.333333
In God's Hands (1998)                       1  3.333333 -2.333333
Rosie (1998)                                1  3.333333 -2.333333
Flying Saucer, The (1950)                   1  3.300000 -2.300000
Jamaica Inn (1939)                          1  3.142857 -2.142857
ratings_by_movie_title = data.groupby('title').size()
ratings_by_movie_title.head(5)
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
dtype: int64
mean_ratings = data.pivot_table(values='rating', index='title', aggfunc='mean')
top_20_mean_ratings = mean_ratings.sort_values(ascending=False).head(20)
top_20_mean_ratings
title
Gate of Heavenly Peace, The (1995) 5.000000
Lured (1947) 5.000000
Ulysses (Ulisse) (1954) 5.000000
Smashing Time (1967) 5.000000
Follow the Bitch (1998) 5.000000
Song of Freedom (1936) 5.000000
Bittersweet Motel (2000) 5.000000
Baby, The (1973) 5.000000
One Little Indian (1973) 5.000000
Schlafes Bruder (Brother of Sleep) (1995) 5.000000
I Am Cuba (Soy Cuba/Ya Kuba) (1964) 4.800000
Lamerica (1994) 4.750000
Apple, The (Sib) (1998) 4.666667
Sanjuro (1962) 4.608696
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 4.560510
Shawshank Redemption, The (1994) 4.554558
Godfather, The (1972) 4.524966
Close Shave, A (1995) 4.520548
Usual Suspects, The (1995) 4.517106
Schindler's List (1993) 4.510417
Name: rating, dtype: float64
# top_10_ratings is not defined above; presumably the 10 most-rated movies:
top_10_ratings = ratings_by_movie_title.sort_values(ascending=False).head(10)
mean_ratings[top_10_ratings.index]
title
American Beauty (1999) 4.317386
Star Wars: Episode IV - A New Hope (1977) 4.453694
Star Wars: Episode V - The Empire Strikes Back (1980) 4.292977
Star Wars: Episode VI - Return of the Jedi (1983) 4.022893
Jurassic Park (1993) 3.763847
Saving Private Ryan (1998) 4.337354
Terminator 2: Judgment Day (1991) 4.058513
Matrix, The (1999) 4.315830
Back to the Future (1985) 3.990321
Silence of the Lambs, The (1991) 4.351823
Name: rating, dtype: float64
ratings_by_movie_title[top_20_mean_ratings.index]
title
Gate of Heavenly Peace, The (1995) 3
Lured (1947) 1
Ulysses (Ulisse) (1954) 1
Smashing Time (1967) 2
Follow the Bitch (1998) 1
Song of Freedom (1936) 1
Bittersweet Motel (2000) 1
Baby, The (1973) 1
One Little Indian (1973) 1
Schlafes Bruder (Brother of Sleep) (1995) 1
I Am Cuba (Soy Cuba/Ya Kuba) (1964) 5
Lamerica (1994) 8
Apple, The (Sib) (1998) 9
Sanjuro (1962) 69
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 628
Shawshank Redemption, The (1994) 2227
Godfather, The (1972) 2223
Close Shave, A (1995) 657
Usual Suspects, The (1995) 1783
Schindler's List (1993) 2304
dtype: int64
# top_ratings is not defined above; presumably the movies with many ratings, e.g.:
top_ratings = ratings_by_movie_title[ratings_by_movie_title >= 1000]
top_10_movies = mean_ratings[top_ratings.index].sort_values(ascending=False).head(10)
top_10_movies
title
Shawshank Redemption, The (1994) 4.554558
Godfather, The (1972) 4.524966
Usual Suspects, The (1995) 4.517106
Schindler's List (1993) 4.510417
Raiders of the Lost Ark (1981) 4.477725
Rear Window (1954) 4.476190
Star Wars: Episode IV - A New Hope (1977) 4.453694
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) 4.449890
Casablanca (1942) 4.412822
Sixth Sense, The (1999) 4.406263
Name: rating, dtype: float64
Core pandas data structures
import pandas as pd
import numpy as np
Series
A Series is a one-dimensional labeled array that can hold data of any type (integers, floats, strings, Python objects). The basic constructor is:
s = pd.Series(data, index=index)
where index is a list that labels the data. data can be of several types:
a Python dict, an ndarray, or a scalar value such as 5
Creating from an ndarray
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
a 0.747292
b -1.120276
c -0.132692
d -0.267813
e -0.590904
dtype: float64
s.index
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
s = pd.Series(np.random.randn(5))
s
0 0.324214
1 -0.183776
2 -0.518808
3 0.866421
4 -0.601668
dtype: float64
s.index
Int64Index([0, 1, 2, 3, 4], dtype='int64')
Creating from a dict
d = {'a': 0., 'b': 1., 'd': 3}
s = pd.Series(d, index=list('abcd'))
s
a 0
b 1
c NaN
d 3
dtype: float64
Creating from a scalar
pd.Series(3, index=list('abcde'))
a 3
b 3
c 3
d 3
e 3
dtype: int64
Series characteristics
A Series behaves like an ndarray
If you are familiar with numpy, the operations below should look familiar; we also covered this style of indexing in the numpy introduction.
s = pd.Series(np.random.randn(5))
s
0 0.882069
1 -0.134360
2 -0.925088
3 0.191072
4 2.546704
dtype: float64
s[0]
0.88206876023157332
s[:3]
0 0.882069
1 -0.134360
2 -0.925088
dtype: float64
s[[1, 3, 4]]
1 -0.134360
3 0.191072
4 2.546704
dtype: float64
np.exp(s)
0 2.415892
1 0.874275
2 0.396497
3 1.210546
4 12.764963
dtype: float64
np.sin(s)
0 0.772055
1 -0.133957
2 -0.798673
3 0.189911
4 0.560416
dtype: float64
A Series behaves like a dict
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
a -2.149840
b -0.924115
c 0.481231
d 1.033813
e -0.462794
dtype: float64
s['a']
-2.1498403551053218
s['e'] = 5
s
a -2.149840
b -0.924115
c 0.481231
d 1.033813
e 5.000000
dtype: float64
s['g'] = 100
s
a -2.149840
b -0.924115
c 0.481231
d 1.033813
e 5.000000
g 100.000000
dtype: float64
'e' in s
True
'f' in s
False
print(s.get('f'))
None
print(s.get('f', np.nan))
nan
Label alignment
s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'e'])
s2 = pd.Series(np.random.randn(3), index=['a', 'd', 'e'])
print('{0}\n\n{1}'.format(s1, s2))
a -0.917905
c -0.744616
e 0.114522
dtype: float64
a 0.721087
d -0.471575
e 0.796093
dtype: float64
s1 + s2
a   -0.196818
c         NaN
d         NaN
e    0.910615
dtype: float64
The name attribute
s = pd.Series(np.random.randn(5), name='Some Thing')
s
0    0.623787
1    0.517239
2    1.551314
3    1.414463
4   -1.224611
Name: Some Thing, dtype: float64
s.name
'Some Thing'
DataFrame
A DataFrame is a two-dimensional array with both row and column labels. You can picture a DataFrame as an Excel spreadsheet or a SQL table, or as a dict of Series objects. It is the most commonly used data structure in Pandas.
The basic constructor is:
df = pd.DataFrame(data, index=index, columns=columns)
where index holds the row labels and columns the column labels. data can be:
a dict of one-dimensional numpy arrays, lists, or Series; a two-dimensional numpy array; a Series; another DataFrame
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)
pd.DataFrame(d, index=['d', 'b', 'a'])
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
If the values are lists, they must all have the same length; otherwise a ValueError is raised.
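A quick sketch of the failure mode with mismatched lengths:

```python
import pandas as pd

try:
    pd.DataFrame({'one': [1, 2, 3], 'two': [21, 22]})
    failed = False
except ValueError:
    failed = True  # the mismatched list lengths are rejected
```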
d = {'one': [1, 2, 3, 4],
     'two': [21, 22, 23, 24]}
pd.DataFrame(d)
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Creating from structured data (records)
data = [(1, 2.2, 'Hello'), (2, 3., 'World')]
pd.DataFrame(data)
pd.DataFrame(data, index=['first', 'second'], columns=['A', 'B', 'C'])
        A    B      C
first   1  2.2  Hello
second  2  3.0  World
Creating from a list of dicts
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data)
pd.DataFrame(data, index=['first', 'second'])
pd.DataFrame(data, columns=['a', 'b'])
Creating from a dict of tuples
This shows how such nested structures are built. In practice, you would first clean the data into a format that Pandas imports easily and that reads well, and then convert it into a more complex structure with reindex/groupby and similar tools.
d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
     ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
     ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
     ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
     ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}
pd.DataFrame(d)
       a              b
       a    b    c    a    b
A B    4    1    5    8   10
  C    3    2    6    7  NaN
  D  NaN  NaN  NaN  NaN    9
Creating from a Series
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
pd.DataFrame(s)
          0
a -0.789343
b  0.127384
c  1.084005
d -0.755011
e -0.963299
pd.DataFrame(s, index=['a', 'c', 'd'])
          0
a -0.789343
c  1.084005
d -0.755011
If you specify more than one column here, an error is raised; this differs from creating from a dict.
pd.DataFrame(s, index=['a', 'c', 'd'], columns=['A'])
          A
a -0.789343
c  1.084005
d -0.755011
Column selection / addition / deletion
df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'])
df
        one       two     three      four
0  2.045300 -0.981722 -0.656081 -0.639517
1 -0.550780  0.248781 -0.146424  0.217392
2  1.702775  0.103998 -0.662138 -0.534071
3 -2.035681  0.015025  1.368209  0.178378
4 -1.092208  0.091108 -0.892496 -0.611198
5  0.093502  0.267428  1.189654 -0.258723
df['one']
0 2.045300
1 -0.550780
2 1.702775
3 -2.035681
4 -1.092208
5 0.093502
Name: one, dtype: float64
df['three'] = df['one'] + df['two']
df
        one       two     three      four
0  2.045300 -0.981722  1.063578 -0.639517
1 -0.550780  0.248781 -0.301999  0.217392
2  1.702775  0.103998  1.806773 -0.534071
3 -2.035681  0.015025 -2.020656  0.178378
4 -1.092208  0.091108 -1.001100 -0.611198
5  0.093502  0.267428  0.360931 -0.258723
df['flag'] = df['one'] > 0
df
        one       two     three      four   flag
0  2.045300 -0.981722  1.063578 -0.639517   True
1 -0.550780  0.248781 -0.301999  0.217392  False
2  1.702775  0.103998  1.806773 -0.534071   True
3 -2.035681  0.015025 -2.020656  0.178378  False
4 -1.092208  0.091108 -1.001100 -0.611198  False
5  0.093502  0.267428  0.360931 -0.258723   True
del df['three']
df
        one       two      four   flag
0  2.045300 -0.981722 -0.639517   True
1 -0.550780  0.248781  0.217392  False
2  1.702775  0.103998 -0.534071   True
3 -2.035681  0.015025  0.178378  False
4 -1.092208  0.091108 -0.611198  False
5  0.093502  0.267428 -0.258723   True
four = df.pop('four')
four
four
0 -0.639517
1 0.217392
2 -0.534071
3 0.178378
4 -0.611198
5 -0.258723
Name: four, dtype: float64
df
        one       two   flag
0  2.045300 -0.981722   True
1 -0.550780  0.248781  False
2  1.702775  0.103998   True
3 -2.035681  0.015025  False
4 -1.092208  0.091108  False
5  0.093502  0.267428   True
df['five'] = 5
df
        one       two   flag  five
0  2.045300 -0.981722   True     5
1 -0.550780  0.248781  False     5
2  1.702775  0.103998   True     5
3 -2.035681  0.015025  False     5
4 -1.092208  0.091108  False     5
5  0.093502  0.267428   True     5
df['one_trunc'] = df['one'][:2]
df
        one       two   flag  five  one_trunc
0  2.045300 -0.981722   True     5    2.04530
1 -0.550780  0.248781  False     5   -0.55078
2  1.702775  0.103998   True     5        NaN
3 -2.035681  0.015025  False     5        NaN
4 -1.092208  0.091108  False     5        NaN
5  0.093502  0.267428   True     5        NaN
df.insert(1, 'bar', df['one'])
df
        one       bar       two   flag  five  one_trunc
0  2.045300  2.045300 -0.981722   True     5    2.04530
1 -0.550780 -0.550780  0.248781  False     5   -0.55078
2  1.702775  1.702775  0.103998   True     5        NaN
3 -2.035681 -2.035681  0.015025  False     5        NaN
4 -1.092208 -1.092208  0.091108  False     5        NaN
5  0.093502  0.093502  0.267428   True     5        NaN
使用 assign() 方法来插入新列
更方便地使用 methd chains 的方法来实现
df = pd. DataFrame( np. random. randint( 1 , 5 , ( 6 , 4 ) ) , columns= list ( 'ABCD' ) )
df
A B C D 0 4 3 3 4 1 1 4 4 2 2 1 4 4 3 3 2 4 4 3 4 1 2 4 2 5 3 4 1 4
df. assign( Ratio = df[ 'A' ] / df[ 'B' ] )
A B C D Ratio 0 4 3 3 4 1.333333 1 1 4 4 2 0.250000 2 1 4 4 3 0.250000 3 2 4 4 3 0.500000 4 1 2 4 2 0.500000 5 3 4 1 4 0.750000
insert是直接改变原DataFrame,assign是生成一个新的,assign的另一个特性是可以传入函数进行计算。
df. assign( AB_Ratio = lambda x: x. A / x. B, CD_Ratio = lambda x: x. C - x. D)
A B C D AB_Ratio CD_Ratio 0 4 3 3 4 1.333333 -1 1 1 4 4 2 0.250000 2 2 1 4 4 3 0.250000 1 3 2 4 4 3 0.500000 1 4 1 2 4 2 0.500000 2 5 3 4 1 4 0.750000 -3
df. assign( AB_Ratio = lambda x: x. A / x. B) . assign( ABD_Ratio = lambda x: x. AB_Ratio * x. D)
A B C D AB_Ratio ABD_Ratio 0 4 3 3 4 1.333333 5.333333 1 1 4 4 2 0.250000 0.500000 2 1 4 4 3 0.250000 0.750000 3 2 4 4 3 0.500000 1.500000 4 1 2 4 2 0.500000 1.000000 5 3 4 1 4 0.750000 3.000000
Indexing and selection
The operations, their syntax, and what they return:
Select a column               -> df[col]          -> Series
Select a row by label         -> df.loc[label]    -> Series
Select a row by position      -> df.iloc[pos]     -> Series
Select multiple rows          -> df[5:10]         -> DataFrame
Select rows by boolean vector -> df[bool_vector]  -> DataFrame
df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index=list('abcdef'), columns=list('ABCD'))
df
   A  B  C  D
a  2  2  6  6
b  8  3  5  7
c  4  6  8  3
d  7  8  3  9
e  8  4  4  2
f  4  2  4  3
df['A']
a 2
b 8
c 4
d 7
e 8
f 4
Name: A, dtype: int32
df.loc['a']
A 2
B 2
C 6
D 6
Name: a, dtype: int32
df.iloc[0]
A 2
B 2
C 6
D 6
Name: a, dtype: int32
df[1:4]
Using df.iloc[1:4] is more efficient than slicing the frame directly like this.
df[[False, True, True, False, True, False]]
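Both spellings of the slice select the same rows; a small check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=list('abcdef'), columns=list('ABCD'))
# df[1:4] slices positionally; df.iloc[1:4] states that intent explicitly
same = df[1:4].equals(df.iloc[1:4])
```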
Data alignment
When DataFrames are combined arithmetically, the data is aligned automatically on both rows and columns; the result carries the union of the two DataFrames' labels.
df1 = pd.DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), columns=['A', 'B', 'C', 'D'])
df1
          A         B         C         D
a  0.576428 -0.037913 -0.329787 -1.752916
b  0.406743 -1.044561 -0.724447  0.374599
c  0.073578  0.423914 -1.499770 -0.488374
d -0.377609  1.137422 -1.951169 -0.814306
e -2.171648 -2.364502 -0.833594  0.168636
f -1.134800 -0.927469  0.886889  0.542603
g  0.625104  0.115953 -1.282609  1.031292
h  0.403509  0.263207  0.403614 -0.177888
i  0.148494 -2.034253  0.134859 -0.960650
j  0.094200 -1.803288  0.057472 -0.338958
df2 = pd.DataFrame(np.random.randn(7, 3), index=list('cdefghi'), columns=['A', 'B', 'C'])
df2
          A         B         C
c  0.884518  0.337344 -1.072027
d  0.264036 -0.152542 -0.225544
e  1.048813 -1.496442  1.022348
f  0.895314 -0.890236  1.230465
g -0.588162 -0.492354 -0.739563
h -2.580322  1.104810 -0.167137
i -0.842738  0.171735  0.847714
df1 + df2
          A         B         C   D
a       NaN       NaN       NaN NaN
b       NaN       NaN       NaN NaN
c  0.958096  0.761259 -2.571797 NaN
d -0.113573  0.984880 -2.176713 NaN
e -1.122834 -3.860944  0.188754 NaN
f -0.239486 -1.817705  2.117354 NaN
g  0.036942 -0.376401 -2.022171 NaN
h -2.176813  1.368016  0.236476 NaN
i -0.694245 -1.862517  0.982573 NaN
j       NaN       NaN       NaN NaN
df1 - df1.iloc[0]

          A         B         C         D
a  0.000000  0.000000  0.000000  0.000000
b -0.169685 -1.006648 -0.394660  2.127515
c -0.502850  0.461827 -1.169983  1.264541
d -0.954037  1.175335 -1.621382  0.938610
e -2.748076 -2.326589 -0.503807  1.921551
f -1.711228 -0.889556  1.216676  2.295518
g  0.048676  0.153866 -0.952822  2.784208
h -0.172919  0.301119  0.733400  1.575028
i -0.427934 -1.996340  0.464646  0.792265
j -0.482228 -1.765375  0.387259  1.413957
Here the two operands have different shapes (a DataFrame and a Series), so broadcasting applies: a new DataFrame is produced in which row 0 has been subtracted from every row of the original.
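The NaN values that alignment produces can be avoided with the add() method and its fill_value parameter. A small sketch with made-up values:

```python
import pandas as pd
import numpy as np

# Two frames with partially overlapping labels.
df1 = pd.DataFrame(np.ones((3, 2)), index=list('abc'), columns=['A', 'B'])
df2 = pd.DataFrame(np.ones((2, 2)), index=list('bc'), columns=['A', 'C'])

# Plain + leaves NaN wherever a label exists in only one operand.
plain = df1 + df2

# add() with fill_value substitutes that value for the missing side,
# so only positions absent from *both* operands remain NaN.
filled = df1.add(df2, fill_value=0)
print(plain)
print(filled)
```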
Using numpy functions
Pandas is compatible with numpy at the core data structure level: numpy functions can be applied to pandas objects directly.
df = pd.DataFrame(np.random.randn(10, 4), columns=['one', 'two', 'three', 'four'])
df

        one       two     three      four
0 -1.121818  1.233686  0.681618 -0.502204
1  1.469664 -0.060555 -0.044857  0.725021
2  1.219670  0.108709  1.806063  0.332685
3 -0.190615  1.244102 -0.863850  1.795335
4 -0.133109 -0.101591  0.818724  1.246230
5  0.729804  0.716593  2.472841 -0.078224
6  0.010136  1.725441 -1.071194  1.602945
7  1.002507 -1.122593 -0.147411 -1.678843
8 -0.550077  0.230777 -0.658470 -1.680395
9  1.006271  0.455683 -2.279833 -0.823792
np.exp(df)

        one       two      three      four
0  0.325687  3.433864   1.977073  0.605196
1  4.347774  0.941242   0.956134  2.064774
2  3.386069  1.114838   6.086440  1.394708
3  0.826450  3.469817   0.421536  6.021490
4  0.875369  0.903399   2.267604  3.477210
5  2.074675  2.047446  11.856082  0.924757
6  1.010187  5.614995   0.342599  4.967641
7  2.725105  0.325435   0.862939  0.186590
8  0.576905  1.259578   0.517643  0.186300
9  2.735382  1.577250   0.102301  0.438765
np.asarray(df) == df.values
array([[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True]], dtype=bool)
type(np.asarray(df))
numpy.ndarray
np.asarray(df) == df

    one   two  three  four
0  True  True   True  True
1  True  True   True  True
2  True  True   True  True
3  True  True   True  True
4  True  True   True  True
5  True  True   True  True
6  True  True   True  True
7  True  True   True  True
8  True  True   True  True
9  True  True   True  True
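A useful consequence of this compatibility is that numpy ufuncs applied to a DataFrame keep the labels intact. A minimal sketch with made-up labels and values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(4).reshape(2, 2),
                  index=['r1', 'r2'], columns=['x', 'y'])

# A numpy ufunc returns a DataFrame with the same index and columns:
# the labels survive the round trip through numpy.
result = np.exp(df)
print(result)
```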
Panel
Panel is a three-dimensional labeled array. The name pandas itself derives from it: pan(el)-da(ta)-s. Panel was always the least used of the core structures, and it has since been deprecated (and removed in pandas 0.25) in favor of DataFrames with a MultiIndex.
- items: axis 0; each item label corresponds to one DataFrame
- major_axis: axis 1; the row labels of those DataFrames
- minor_axis: axis 2; the column labels of those DataFrames
data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
pn = pd.Panel(data)
pn
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
pn['Item1']

          0         1         2
0  0.638298 -1.600822  3.112210
1  0.394099  0.184129  0.438450
2  0.427692 -0.294556  0.039430
3  1.555046  0.933749  0.218616
pn.items
Index([u'Item1', u'Item2'], dtype='object')
pn.major_axis
Int64Index([0, 1, 2, 3], dtype='int64')
pn.minor_axis
Int64Index([0, 1, 2], dtype='int64')
pn.major_xs(pn.major_axis[0])
      Item1     Item2
0  0.638298 -1.427579
1 -1.600822 -0.778090
2  3.112210       NaN
pn.minor_xs(pn.minor_axis[1])

      Item1     Item2
0 -1.600822 -0.778090
1  0.184129  0.698347
2 -0.294556 -0.167423
3  0.933749  0.205092
pn.to_frame()

                Item1     Item2
major minor
0     0      0.638298 -1.427579
      1     -1.600822 -0.778090
1     0      0.394099 -0.999929
      1      0.184129  0.698347
2     0      0.427692  0.559905
      1     -0.294556 -0.167423
3     0      1.555046 -1.992102
      1      0.933749  0.205092
to_frame() automatically drops the rows containing NaN (here minor label 2, where Item2 has no data).
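Since Panel was removed in pandas 0.25, the same three-dimensional layout is now represented as a DataFrame with a two-level MultiIndex (major, minor), which is exactly the shape Panel.to_frame() produced. A sketch with made-up values:

```python
import pandas as pd
import numpy as np

# Two "items" stored as columns over a (major, minor) MultiIndex.
idx = pd.MultiIndex.from_product([[0, 1], [0, 1]],
                                 names=['major', 'minor'])
frame = pd.DataFrame({'Item1': np.arange(4.0),
                      'Item2': np.arange(4.0) * 10}, index=idx)
print(frame)

# major_xs / minor_xs become cross-sections with xs():
print(frame.xs(0, level='major'))   # everything at major label 0
print(frame.xs(1, level='minor'))   # everything at minor label 1
```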
Basic operations
import pandas as pd
import numpy as np
Reindexing
Series
s = pd.Series([1, 3, 5, 6, 8], index=list('acefh'))
s
a 1
c 3
e 5
f 6
h 8
dtype: int64
s.reindex(list('abcdefgh'))
a    1.0
b    NaN
c    3.0
d    NaN
e    5.0
f    6.0
g    NaN
h    8.0
dtype: float64
s.reindex(list('abcdefgh'), fill_value=0)
a 1
b 0
c 3
d 0
e 5
f 6
g 0
h 8
dtype: int64
s.reindex(list('abcdefgh'), method='ffill')
a 1
b 1
c 3
d 3
e 5
f 6
g 6
h 8
dtype: int64
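Besides 'ffill', reindex also accepts method='bfill', which propagates the next valid value backwards (a monotonic index is required). A small sketch with made-up values:

```python
import pandas as pd

# A sparse, monotonically increasing index.
s = pd.Series([1, 3, 5], index=[0, 2, 4])

# 'bfill' fills each new label from the next existing label.
print(s.reindex(range(5), method='bfill'))
```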
DataFrame
df = pd.DataFrame(np.random.randn(4, 6), index=list('ADFH'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
df

        one       two     three      four      five       six
A -0.049437 -0.526499  1.780662  1.154747  2.434957 -1.579278
D -0.075226  0.552163 -0.462732 -0.936051 -0.590041  0.484505
F  1.486168  0.725907  0.598127 -0.704809 -2.815687 -0.062462
H -0.900819 -0.177751 -0.232796  0.234088 -1.758574  1.255955
df2 = df.reindex(index=list('ABCDEFGH'))
df2
The index argument can also be passed positionally, but columns must be given as a keyword argument.

        one       two     three      four      five       six
A -0.049437 -0.526499  1.780662  1.154747  2.434957 -1.579278
B       NaN       NaN       NaN       NaN       NaN       NaN
C       NaN       NaN       NaN       NaN       NaN       NaN
D -0.075226  0.552163 -0.462732 -0.936051 -0.590041  0.484505
E       NaN       NaN       NaN       NaN       NaN       NaN
F  1.486168  0.725907  0.598127 -0.704809 -2.815687 -0.062462
G       NaN       NaN       NaN       NaN       NaN       NaN
H -0.900819 -0.177751 -0.232796  0.234088 -1.758574  1.255955
reindex returns a copy of the data.
df.reindex(columns=['one', 'three', 'five', 'seven'])

        one     three      five  seven
A -0.049437  1.780662  2.434957    NaN
D -0.075226 -0.462732 -0.590041    NaN
F  1.486168  0.598127 -2.815687    NaN
H -0.900819 -0.232796 -1.758574    NaN
df.reindex(columns=['one', 'three', 'five', 'seven'], fill_value=0)

        one     three      five  seven
A -0.049437  1.780662  2.434957      0
D -0.075226 -0.462732 -0.590041      0
F  1.486168  0.598127 -2.815687      0
H -0.900819 -0.232796 -1.758574      0
df.reindex(columns=['one', 'three', 'five', 'seven'], method='ffill')

        one     three      five  seven
A -0.049437  1.780662  2.434957    NaN
D -0.075226 -0.462732 -0.590041    NaN
F  1.486168  0.598127 -2.815687    NaN
H -0.900819 -0.232796 -1.758574    NaN
df.reindex(index=list('ABCDEFGH'), method='ffill')

        one       two     three      four      five       six
A -0.049437 -0.526499  1.780662  1.154747  2.434957 -1.579278
B -0.049437 -0.526499  1.780662  1.154747  2.434957 -1.579278
C -0.049437 -0.526499  1.780662  1.154747  2.434957 -1.579278
D -0.075226  0.552163 -0.462732 -0.936051 -0.590041  0.484505
E -0.075226  0.552163 -0.462732 -0.936051 -0.590041  0.484505
F  1.486168  0.725907  0.598127 -0.704809 -2.815687 -0.062462
G  1.486168  0.725907  0.598127 -0.704809 -2.815687 -0.062462
H -0.900819 -0.177751 -0.232796  0.234088 -1.758574  1.255955
Dropping data
df = pd.DataFrame(np.random.randn(4, 6), index=list('ABCD'), columns=['one', 'two', 'three', 'four', 'five', 'six'])
df

        one       two     three      four      five       six
A -0.665415 -0.061367  0.075058  0.626415 -1.748458 -0.608540
B -1.455186  1.846691  0.234276  0.660298 -2.169835 -1.476485
C  0.322281  0.505378  0.198458 -0.831919 -0.630789  0.762524
D  0.703684 -0.827597  0.178063  0.108453 -0.418992  0.242912
df.drop('A')

        one       two     three      four      five       six
B -1.455186  1.846691  0.234276  0.660298 -2.169835 -1.476485
C  0.322281  0.505378  0.198458 -0.831919 -0.630789  0.762524
D  0.703684 -0.827597  0.178063  0.108453 -0.418992  0.242912
df2 = df.drop(['two', 'four'], axis=1)
df2

        one     three      five       six
A -0.665415  0.075058 -1.748458 -0.608540
B -1.455186  0.234276 -2.169835 -1.476485
C  0.322281  0.198458 -0.630789  0.762524
D  0.703684  0.178063 -0.418992  0.242912
drop also returns a copy of the data; the original DataFrame is unchanged.
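This copy semantics is easy to verify. A minimal sketch with made-up values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(6).reshape(3, 2),
                  index=list('ABC'), columns=['one', 'two'])

# drop() returns a new object; the original keeps all its rows.
df2 = df.drop('A')
print(df.index.tolist())    # unchanged
print(df2.index.tolist())   # row 'A' gone
```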
Broadcasting
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df

       A   B   C
one    0   1   2
two    3   4   5
three  6   7   8
four   9  10  11
df.loc['one']
A 0
B 1
C 2
Name: one, dtype: int64
df - df.loc['one']

       A  B  C
one    0  0  0
two    3  3  3
three  6  6  6
four   9  9  9
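Subtracting a Series from a DataFrame broadcasts along rows, matching on column labels. To broadcast down the columns instead, use the arithmetic methods with axis=0. A short sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=['one', 'two', 'three', 'four'],
                  columns=list('ABC'))

# sub() with axis=0 matches the Series against the row index,
# so column A is subtracted from every column.
out = df.sub(df['A'], axis=0)
print(out)
```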
Function application
- apply: applies a function along an axis (row-wise or column-wise)
- applymap: applies a function element-wise
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df

       A   B   C
one    0   1   2
two    3   4   5
three  6   7   8
four   9  10  11
df.apply(lambda x: x.max() - x.min())
A 9
B 9
C 9
dtype: int64
df.apply(lambda x: x.max() - x.min(), axis=1)
one 2
two 2
three 2
four 2
dtype: int64
def min_max(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

df.apply(min_max, axis=1)

       min  max
one      0    2
two      3    5
three    6    8
four     9   11
df = pd.DataFrame(np.random.randn(4, 3), index=['one', 'two', 'three', 'four'], columns=list('ABC'))
df

              A         B         C
one   -1.126089 -0.286584  1.538841
two    1.804348 -0.709293 -0.400643
three -1.008037 -0.791648  0.388505
four  -0.071827  0.659098 -0.505030
formater = '{0:.02f}'.format
df.applymap(formater)
The 0 in '{0:.02f}' refers to the first positional argument of format, rendered as a float with two decimal places; formater, the bound format method of that string, is itself a function and can be passed to applymap like any other.

           A      B      C
one    -1.13  -0.29   1.54
two     1.80  -0.71  -0.40
three  -1.01  -0.79   0.39
four   -0.07   0.66  -0.51
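The element-wise counterpart on a single Series is map(), which accepts the same kind of function as applymap(). A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1.2345, 2.3456, 3.4567])

# Series.map() applies the function to each element, just as
# applymap() does for every element of a DataFrame.
formater = '{0:.02f}'.format
print(s.map(formater))
```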
Sorting and ranking
df = pd.DataFrame(np.random.randint(1, 10, (4, 3)), index=list('ABCD'), columns=['one', 'two', 'three'])
df

   one  two  three
A    7    8      5
B    2    6      4
C    4    5      4
D    6    2      1
df.sort_values(by='one')

   one  two  three
B    2    6      4
C    4    5      4
D    6    2      1
A    7    8      5
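sort_values also accepts a list of columns, with ties in the first column broken by the second, and ascending can be set per column. A sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'one': [7, 2, 4, 2],
                   'two': [8, 6, 5, 2]}, index=list('ABCD'))

# Sort by 'one' ascending; break ties on 'one' by 'two' descending.
out = df.sort_values(by=['one', 'two'], ascending=[True, False])
print(out)
```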
s = pd.Series([3, 6, 2, 6, 4])
s.rank()
0 2.0
1 4.5
2 1.0
3 4.5
4 3.0
dtype: float64
s.rank(method='first', ascending=False)
method='first' breaks ties by order of appearance: the earlier occurrence gets the better rank. The default is 'average'; the other options are 'min', 'max', and 'dense'.
0    4.0
1    1.0
2    5.0
3    2.0
4    3.0
dtype: float64
By default, DataFrame.rank() ranks within each column; note that this differs from sort_values, which reorders whole rows.
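The column-wise default and the row-wise alternative can be seen side by side. A sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'one': [7, 2, 4],
                   'two': [8, 6, 5],
                   'three': [5, 4, 4]}, index=list('ABC'))

# Default: each column is ranked independently; ties get the
# average rank. axis=1 ranks within each row instead.
print(df.rank())
print(df.rank(axis=1))
```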
Uniqueness and membership
These methods apply to Series:
s = pd.Series(list('abbcdabacad'))
s
0 a
1 b
2 b
3 c
4 d
5 a
6 b
7 a
8 c
9 a
10 d
dtype: object
s.unique()
array(['a', 'b', 'c', 'd'], dtype=object)
s.value_counts()
a 4
b 3
d 2
c 2
dtype: int64
s.isin(['a', 'b', 'c'])
0 True
1 True
2 True
3 True
4 False
5 True
6 True
7 True
8 True
9 True
10 False
dtype: bool
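Since isin() returns a boolean mask, it slots directly into [] for filtering, and ~ negates the mask. A short sketch using the same series:

```python
import pandas as pd

s = pd.Series(list('abbcdabacad'))

# Keep only the members of the given set...
print(s[s.isin(['c', 'd'])])

# ...or keep only the non-members with ~.
print(s[~s.isin(['a', 'b', 'c'])])
```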