Python Bits (4): A Quick Start with pandas

    This post is a quick tour of the pandas data-analysis library, drawn mainly from the official pandas documentation. Working through the operations below should let a data scientist handle everyday analysis tasks in pandas with little friction; readers are encouraged to make up some data of their own and follow along. pandas really is powerful: it saves a great deal of hand-written NumPy code and leaves more time and energy for the algorithms and the business problem itself. The material is elementary, but a careful read-through is still good fun!



In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [5]:
s=pd.Series([1,3,5,np.nan,6,8])   # create a Series
In [6]:
s
Out[6]:
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64
In [7]:
dates=pd.date_range('20160501',periods=6)
In [8]:
dates
Out[8]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2016-05-01, ..., 2016-05-06]
Length: 6, Freq: D, Timezone: None
In [10]:
df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
In [11]:
df
Out[11]:
  A B C D
2016-05-01 0.695733 0.070252 0.131565 -0.897799
2016-05-02 -0.279032 1.430353 0.246015 1.247282
2016-05-03 -0.727831 -0.422250 0.435165 0.089780
2016-05-04 -0.899446 -0.372851 0.811766 -1.071206
2016-05-05 -1.250048 -0.019416 1.248958 -0.900567
2016-05-06 0.605871 -1.975671 0.202865 -1.491969
In [12]:
df2=pd.DataFrame({'A':1,
                  'B':pd.Timestamp('20160501'),
                  'C':pd.Series(1,index=list(range(4)),dtype='float32'),
                  'D':np.array([3]*4,dtype='int32'),
                  'E':pd.Categorical(["test","train","test","train"]),
                  'F':'foo'})
In [13]:
df2
Out[13]:
  A B C D E F
0 1 2016-05-01 1 3 test foo
1 1 2016-05-01 1 3 train foo
2 1 2016-05-01 1 3 test foo
3 1 2016-05-01 1 3 train foo
In [14]:
df2.dtypes
Out[14]:
A             int64
B    datetime64[ns]
C           float32
D             int32
E            object
F            object
dtype: object
In [16]:
df.head()
Out[16]:
  A B C D
2016-05-01 0.695733 0.070252 0.131565 -0.897799
2016-05-02 -0.279032 1.430353 0.246015 1.247282
2016-05-03 -0.727831 -0.422250 0.435165 0.089780
2016-05-04 -0.899446 -0.372851 0.811766 -1.071206
2016-05-05 -1.250048 -0.019416 1.248958 -0.900567
In [17]:
df.tail(3)
Out[17]:
  A B C D
2016-05-04 -0.899446 -0.372851 0.811766 -1.071206
2016-05-05 -1.250048 -0.019416 1.248958 -0.900567
2016-05-06 0.605871 -1.975671 0.202865 -1.491969
In [18]:
df.index
Out[18]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2016-05-01, ..., 2016-05-06]
Length: 6, Freq: D, Timezone: None
In [21]:
df.columns
Out[21]:
Index([u'A', u'B', u'C', u'D'], dtype='object')
In [22]:
df.values
Out[22]:
array([[ 0.69573306,  0.07025152,  0.13156519, -0.89779855],
       [-0.27903191,  1.43035294,  0.24601539,  1.24728205],
       [-0.72783059, -0.4222495 ,  0.43516501,  0.08977958],
       [-0.89944607, -0.37285103,  0.81176596, -1.07120601],
       [-1.25004794, -0.01941648,  1.2489576 , -0.90056661],
       [ 0.60587066, -1.97567133,  0.20286487, -1.49196862]])
In [23]:
df.describe()
Out[23]:
  A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.309125 -0.214931 0.512722 -0.504080
std 0.807093 1.094956 0.435869 1.002967
min -1.250048 -1.975671 0.131565 -1.491969
25% -0.856542 -0.409900 0.213652 -1.028546
50% -0.503431 -0.196134 0.340590 -0.899183
75% 0.384645 0.047835 0.717616 -0.157115
max 0.695733 1.430353 1.248958 1.247282
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates=pd.date_range('20160501',periods=6)
df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
In [3]:
df
Out[3]:
  A B C D
2016-05-01 0.732630 2.323166 -0.196964 -1.026370
2016-05-02 1.253825 -2.343766 1.069097 0.228692
2016-05-03 -0.212081 0.088675 -0.942775 0.033656
2016-05-04 -0.044256 -1.017814 -1.263595 -0.696767
2016-05-05 -0.948611 1.062373 2.667495 0.350915
2016-05-06 -0.621065 0.081559 0.721918 0.084405
In [4]:
df.T
Out[4]:
  2016-05-01 00:00:00 2016-05-02 00:00:00 2016-05-03 00:00:00 2016-05-04 00:00:00 2016-05-05 00:00:00 2016-05-06 00:00:00
A 0.732630 1.253825 -0.212081 -0.044256 -0.948611 -0.621065
B 2.323166 -2.343766 0.088675 -1.017814 1.062373 0.081559
C -0.196964 1.069097 -0.942775 -1.263595 2.667495 0.721918
D -1.026370 0.228692 0.033656 -0.696767 0.350915 0.084405
In [5]:
df
Out[5]:
  A B C D
2016-05-01 0.732630 2.323166 -0.196964 -1.026370
2016-05-02 1.253825 -2.343766 1.069097 0.228692
2016-05-03 -0.212081 0.088675 -0.942775 0.033656
2016-05-04 -0.044256 -1.017814 -1.263595 -0.696767
2016-05-05 -0.948611 1.062373 2.667495 0.350915
2016-05-06 -0.621065 0.081559 0.721918 0.084405
In [6]:
df.sort_index(axis=1,ascending=False)   # sort the column labels in descending order
Out[6]:
  D C B A
2016-05-01 -1.026370 -0.196964 2.323166 0.732630
2016-05-02 0.228692 1.069097 -2.343766 1.253825
2016-05-03 0.033656 -0.942775 0.088675 -0.212081
2016-05-04 -0.696767 -1.263595 -1.017814 -0.044256
2016-05-05 0.350915 2.667495 1.062373 -0.948611
2016-05-06 0.084405 0.721918 0.081559 -0.621065
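sort_index orders rows or columns by their labels; ordering rows by the values in a column is done with sort_values instead. A minimal sketch, not run in the session above:

df.sort_values(by='B')                    # rows ordered by column B, ascending
df.sort_values(by='B', ascending=False)   # same thing, descending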
In [12]:
df['A']
Out[12]:
2016-05-01    0.732630
2016-05-02    1.253825
2016-05-03   -0.212081
2016-05-04   -0.044256
2016-05-05   -0.948611
2016-05-06   -0.621065
Freq: D, Name: A, dtype: float64
In [13]:
df[0:3]  # the first three rows (positions 0, 1, 2)
Out[13]:
  A B C D
2016-05-01 0.732630 2.323166 -0.196964 -1.026370
2016-05-02 1.253825 -2.343766 1.069097 0.228692
2016-05-03 -0.212081 0.088675 -0.942775 0.033656
In [15]:
df['20160502':'20160504']
Out[15]:
  A B C D
2016-05-02 1.253825 -2.343766 1.069097 0.228692
2016-05-03 -0.212081 0.088675 -0.942775 0.033656
2016-05-04 -0.044256 -1.017814 -1.263595 -0.696767
In [16]:
df.loc[dates[1]]    # select a single row by its date label
Out[16]:
A    1.253825
B   -2.343766
C    1.069097
D    0.228692
Name: 2016-05-02 00:00:00, dtype: float64
In [17]:
df.loc[:,['A','B']]
Out[17]:
  A B
2016-05-01 0.732630 2.323166
2016-05-02 1.253825 -2.343766
2016-05-03 -0.212081 0.088675
2016-05-04 -0.044256 -1.017814
2016-05-05 -0.948611 1.062373
2016-05-06 -0.621065 0.081559
In [18]:
df.loc['20160502':'20160505',['A','B']]     # label-based slicing of rows and columns with loc
Out[18]:
  A B
2016-05-02 1.253825 -2.343766
2016-05-03 -0.212081 0.088675
2016-05-04 -0.044256 -1.017814
2016-05-05 -0.948611 1.062373
In [19]:
df.loc[dates[1],'A']
Out[19]:
1.2538250114640803
In [20]:
df.iloc[5]      # select a row by integer position
Out[20]:
A   -0.621065
B    0.081559
C    0.721918
D    0.084405
Name: 2016-05-06 00:00:00, dtype: float64
In [24]:
df.iloc[4:6,0:2]   # iloc slices rows and columns by integer position
Out[24]:
  A B
2016-05-05 -0.948611 1.062373
2016-05-06 -0.621065 0.081559
In [25]:
df.iloc[1:3,:]
Out[25]:
  A B C D
2016-05-02 1.253825 -2.343766 1.069097 0.228692
2016-05-03 -0.212081 0.088675 -0.942775 0.033656
In [26]:
df[df.A>0]   
Out[26]:
  A B C D
2016-05-01 0.732630 2.323166 -0.196964 -1.026370
2016-05-02 1.253825 -2.343766 1.069097 0.228692
In [27]:
df[df>0]    # keep only values greater than zero; everything else becomes NaN
Out[27]:
  A B C D
2016-05-01 0.732630 2.323166 NaN NaN
2016-05-02 1.253825 NaN 1.069097 0.228692
2016-05-03 NaN 0.088675 NaN 0.033656
2016-05-04 NaN NaN NaN NaN
2016-05-05 NaN 1.062373 2.667495 0.350915
2016-05-06 NaN 0.081559 0.721918 0.084405
In [28]:
df2=df.copy()
In [29]:
df2['E']=['one','one','two','three','four','three']
In [30]:
df2
Out[30]:
  A B C D E
2016-05-01 0.732630 2.323166 -0.196964 -1.026370 one
2016-05-02 1.253825 -2.343766 1.069097 0.228692 one
2016-05-03 -0.212081 0.088675 -0.942775 0.033656 two
2016-05-04 -0.044256 -1.017814 -1.263595 -0.696767 three
2016-05-05 -0.948611 1.062373 2.667495 0.350915 four
2016-05-06 -0.621065 0.081559 0.721918 0.084405 three
In [31]:
df2[df2['E'].isin(['two','four'])]
Out[31]:
  A B C D E
2016-05-03 -0.212081 0.088675 -0.942775 0.033656 two
2016-05-05 -0.948611 1.062373 2.667495 0.350915 four
In [32]:
s1=pd.Series([1,2,3,4,5,6],index=pd.date_range('20160501',periods=6))
In [33]:
s1
Out[33]:
2016-05-01    1
2016-05-02    2
2016-05-03    3
2016-05-04    4
2016-05-05    5
2016-05-06    6
Freq: D, dtype: int64
In [34]:
df['F']=s1
In [35]:
df
Out[35]:
  A B C D F
2016-05-01 0.732630 2.323166 -0.196964 -1.026370 1
2016-05-02 1.253825 -2.343766 1.069097 0.228692 2
2016-05-03 -0.212081 0.088675 -0.942775 0.033656 3
2016-05-04 -0.044256 -1.017814 -1.263595 -0.696767 4
2016-05-05 -0.948611 1.062373 2.667495 0.350915 5
2016-05-06 -0.621065 0.081559 0.721918 0.084405 6
In [36]:
df.at[dates[0],'A']=0
In [37]:
df
Out[37]:
  A B C D F
2016-05-01 0.000000 2.323166 -0.196964 -1.026370 1
2016-05-02 1.253825 -2.343766 1.069097 0.228692 2
2016-05-03 -0.212081 0.088675 -0.942775 0.033656 3
2016-05-04 -0.044256 -1.017814 -1.263595 -0.696767 4
2016-05-05 -0.948611 1.062373 2.667495 0.350915 5
2016-05-06 -0.621065 0.081559 0.721918 0.084405 6
In [38]:
df.loc[:,'D']=np.array([5]*len(df))     # replace column D with 5s
In [39]:
df
Out[39]:
  A B C D F
2016-05-01 0.000000 2.323166 -0.196964 5 1
2016-05-02 1.253825 -2.343766 1.069097 5 2
2016-05-03 -0.212081 0.088675 -0.942775 5 3
2016-05-04 -0.044256 -1.017814 -1.263595 5 4
2016-05-05 -0.948611 1.062373 2.667495 5 5
2016-05-06 -0.621065 0.081559 0.721918 5 6
In [40]:
df.iat[0,1]=0     # iat sets a single cell by integer position
In [41]:
df
Out[41]:
  A B C D F
2016-05-01 0.000000 0.000000 -0.196964 5 1
2016-05-02 1.253825 -2.343766 1.069097 5 2
2016-05-03 -0.212081 0.088675 -0.942775 5 3
2016-05-04 -0.044256 -1.017814 -1.263595 5 4
2016-05-05 -0.948611 1.062373 2.667495 5 5
2016-05-06 -0.621065 0.081559 0.721918 5 6
In [43]:
del df['F']
In [44]:
df
Out[44]:
  A B C D
2016-05-01 0.000000 0.000000 -0.196964 5
2016-05-02 1.253825 -2.343766 1.069097 5
2016-05-03 -0.212081 0.088675 -0.942775 5
2016-05-04 -0.044256 -1.017814 -1.263595 5
2016-05-05 -0.948611 1.062373 2.667495 5
2016-05-06 -0.621065 0.081559 0.721918 5
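del removes the column in place. The non-destructive alternative is drop, which returns a new DataFrame and works on rows as well; a small sketch, assuming the label being dropped still exists at that point:

df.drop('F', axis=1)   # drop column F, returning a copy of df without it
df.drop(dates[0])      # drop a row by its index label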
In [46]:
s1=pd.Series([1,2,3,4,5,6],index=pd.date_range('20160502',periods=6))
In [48]:
s1            # note: May 1 is missing from this index
Out[48]:
2016-05-02    1
2016-05-03    2
2016-05-04    3
2016-05-05    4
2016-05-06    5
2016-05-07    6
Freq: D, dtype: int64
In [49]:
df['F']=s1
In [50]:
df
Out[50]:
  A B C D F
2016-05-01 0.000000 0.000000 -0.196964 5 NaN
2016-05-02 1.253825 -2.343766 1.069097 5 1
2016-05-03 -0.212081 0.088675 -0.942775 5 2
2016-05-04 -0.044256 -1.017814 -1.263595 5 3
2016-05-05 -0.948611 1.062373 2.667495 5 4
2016-05-06 -0.621065 0.081559 0.721918 5 5
In [51]:
df2=df.copy()
df2[df2>0]=-df2           # flip every positive value to negative
df2                   
Out[51]:
  A B C D F
2016-05-01 0.000000 0.000000 -0.196964 -5 NaN
2016-05-02 -1.253825 -2.343766 -1.069097 -5 -1
2016-05-03 -0.212081 -0.088675 -0.942775 -5 -2
2016-05-04 -0.044256 -1.017814 -1.263595 -5 -3
2016-05-05 -0.948611 -1.062373 -2.667495 -5 -4
2016-05-06 -0.621065 -0.081559 -0.721918 -5 -5
In [52]:
df1=df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
df1.loc[dates[0]:dates[1],'E']=1    # only the first two values of E are assigned; the rest stay NaN
df1
Out[52]:
  A B C D F E
2016-05-01 0.000000 0.000000 -0.196964 5 NaN 1
2016-05-02 1.253825 -2.343766 1.069097 5 1 1
2016-05-03 -0.212081 0.088675 -0.942775 5 2 NaN
2016-05-04 -0.044256 -1.017814 -1.263595 5 3 NaN
In [53]:
df1.dropna(how='any')       # drop every row that contains a NaN
Out[53]:
  A B C D F E
2016-05-02 1.253825 -2.343766 1.069097 5 1 1
In [54]:
df1.fillna(value=5)    # fill NaN values with 5
Out[54]:
  A B C D F E
2016-05-01 0.000000 0.000000 -0.196964 5 5 1
2016-05-02 1.253825 -2.343766 1.069097 5 1 1
2016-05-03 -0.212081 0.088675 -0.942775 5 2 5
2016-05-04 -0.044256 -1.017814 -1.263595 5 3 5
In [55]:
pd.isnull(df1)
Out[55]:
  A B C D F E
2016-05-01 False False False False True False
2016-05-02 False False False False False False
2016-05-03 False False False False False True
2016-05-04 False False False False False True
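The boolean frame from isnull is usually just an intermediate step; chaining .sum() turns it into a per-column count of missing values, since True sums as 1. For df1 above that would report one NaN in F and two in E:

pd.isnull(df1).sum()         # number of NaNs in each column
pd.isnull(df1).sum().sum()   # total number of NaNs in the whole frame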
In [56]:
df
Out[56]:
  A B C D F
2016-05-01 0.000000 0.000000 -0.196964 5 NaN
2016-05-02 1.253825 -2.343766 1.069097 5 1
2016-05-03 -0.212081 0.088675 -0.942775 5 2
2016-05-04 -0.044256 -1.017814 -1.263595 5 3
2016-05-05 -0.948611 1.062373 2.667495 5 4
2016-05-06 -0.621065 0.081559 0.721918 5 5
In [57]:
df.mean()      # mean of each column
Out[57]:
A   -0.095365
B   -0.354829
C    0.342529
D    5.000000
F    3.000000
dtype: float64
In [58]:
df.mean(1)      # mean of each row
Out[58]:
2016-05-01    1.200759
2016-05-02    1.195831
2016-05-03    1.186764
2016-05-04    1.134867
2016-05-05    2.356251
2016-05-06    2.036483
Freq: D, dtype: float64
In [59]:
s=pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)  # shift values down (forward in time) by two positions
s
Out[59]:
2016-05-01   NaN
2016-05-02   NaN
2016-05-03     1
2016-05-04     3
2016-05-05     5
2016-05-06   NaN
Freq: D, dtype: float64
In [60]:
df.sub(s,axis='index')    # element-wise subtraction, aligned on the index
Out[60]:
  A B C D F
2016-05-01 NaN NaN NaN NaN NaN
2016-05-02 NaN NaN NaN NaN NaN
2016-05-03 -1.212081 -0.911325 -1.942775 4 1
2016-05-04 -3.044256 -4.017814 -4.263595 2 0
2016-05-05 -5.948611 -3.937627 -2.332505 0 -1
2016-05-06 NaN NaN NaN NaN NaN
In [61]:
df.apply(np.cumsum)  # cumulative sum down each column
Out[61]:
  A B C D F
2016-05-01 0.000000 0.000000 -0.196964 5 NaN
2016-05-02 1.253825 -2.343766 0.872133 10 1
2016-05-03 1.041744 -2.255091 -0.070642 15 3
2016-05-04 0.997488 -3.272905 -1.334236 20 6
2016-05-05 0.048877 -2.210533 1.333259 25 10
2016-05-06 -0.572188 -2.128973 2.055176 30 15
In [62]:
df
Out[62]:
  A B C D F
2016-05-01 0.000000 0.000000 -0.196964 5 NaN
2016-05-02 1.253825 -2.343766 1.069097 5 1
2016-05-03 -0.212081 0.088675 -0.942775 5 2
2016-05-04 -0.044256 -1.017814 -1.263595 5 3
2016-05-05 -0.948611 1.062373 2.667495 5 4
2016-05-06 -0.621065 0.081559 0.721918 5 5
In [63]:
df.apply(lambda x: x.max()-x.min())        # apply a function column by column
Out[63]:
A    2.202436
B    3.406138
C    3.931090
D    0.000000
F    4.000000
dtype: float64
In [64]:
s=pd.Series(['A','B','C','Aaba','Baca',np.nan,'CABA','dog','cat'])
In [65]:
s.str.lower()
Out[65]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
In [66]:
df=pd.DataFrame(np.random.randn(10,4))
In [67]:
df
Out[67]:
  0 1 2 3
0 -0.459784 0.117408 -1.491735 -0.598449
1 -0.868217 -0.809730 1.470170 1.204157
2 -1.080249 -1.709686 -0.506906 0.562758
3 -0.694457 0.871048 -0.499943 -0.664619
4 -1.090272 0.770311 0.013386 -0.379529
5 0.730921 0.897383 0.170490 1.194727
6 -1.742158 -1.410038 1.676567 -0.791688
7 -0.355975 -1.123030 -0.005195 -0.422889
8 -1.003355 1.150477 1.541654 0.525633
9 -0.385119 2.059225 0.888651 -0.425348
In [68]:
pieces=[df[:3],df[3:7],df[7:]]    # slice by rows; bare integer slices select rows
pd.concat(pieces)
Out[68]:
  0 1 2 3
0 -0.459784 0.117408 -1.491735 -0.598449
1 -0.868217 -0.809730 1.470170 1.204157
2 -1.080249 -1.709686 -0.506906 0.562758
3 -0.694457 0.871048 -0.499943 -0.664619
4 -1.090272 0.770311 0.013386 -0.379529
5 0.730921 0.897383 0.170490 1.194727
6 -1.742158 -1.410038 1.676567 -0.791688
7 -0.355975 -1.123030 -0.005195 -0.422889
8 -1.003355 1.150477 1.541654 0.525633
9 -0.385119 2.059225 0.888651 -0.425348
In [69]:
left=pd.DataFrame({'key':['foo','foo'],'Ival':[1,2]})
right=pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})
In [70]:
left
Out[70]:
  Ival key
0 1 foo
1 2 foo
In [71]:
right
Out[71]:
  key rval
0 foo 4
1 foo 5
In [72]:
pd.merge(left,right,on='key')
Out[72]:
  Ival key rval
0 1 foo 4
1 1 foo 5
2 2 foo 4
3 2 foo 5
In [73]:
df=pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'])
df
Out[73]:
  A B C D
0 -1.706760 -0.359324 0.312720 0.682060
1 -1.052203 -0.299854 -1.212988 2.027105
2 -0.171547 -0.165153 0.877990 -1.011546
3 0.886335 1.414201 1.246123 1.352631
4 2.646394 -0.690779 0.847747 -0.989298
5 2.048570 -0.208953 1.204827 -1.017852
6 0.344992 0.569395 2.375587 0.438514
7 -2.042951 1.691899 0.487587 -1.226935
In [74]:
s=df.iloc[3]       # iloc picks the fourth row by integer position
df.append(s,ignore_index=True)
Out[74]:
  A B C D
0 -1.706760 -0.359324 0.312720 0.682060
1 -1.052203 -0.299854 -1.212988 2.027105
2 -0.171547 -0.165153 0.877990 -1.011546
3 0.886335 1.414201 1.246123 1.352631
4 2.646394 -0.690779 0.847747 -0.989298
5 2.048570 -0.208953 1.204827 -1.017852
6 0.344992 0.569395 2.375587 0.438514
7 -2.042951 1.691899 0.487587 -1.226935
8 0.886335 1.414201 1.246123 1.352631
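DataFrame.append was deprecated in later pandas releases and removed in pandas 2.0; pd.concat is the forward-compatible way to attach a row. A sketch of the equivalent call:

pd.concat([df, s.to_frame().T], ignore_index=True)   # same result as the append above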
In [77]:
df=pd.DataFrame({'A':['foo','bar','foo','bar',
                       'foo','bar','foo','foo'],
                  'B':['one','one','two','three',
                       'two','two','one','three'],
                  'C':np.random.randn(8),
                  'D':np.random.randn(8)})
df
Out[77]:
  A B C D
0 foo one 0.585883 -0.132037
1 bar one -1.191658 0.214056
2 foo two 0.069688 0.404364
3 bar three -0.141388 0.140177
4 foo two 1.624389 -2.049529
5 bar two 0.981508 -0.267810
6 foo one -0.144395 0.413784
7 foo three -0.304201 -0.465118
In [78]:
df.groupby('A').sum()
Out[78]:
            C         D
A                      
bar -0.351538  0.086424
foo  1.831365 -1.828536
In [79]:
df.groupby(['A','B']).sum()    # group by both A and B; they become a hierarchical index
Out[79]:
                    C         D
A   B                          
bar one     -1.191658  0.214056
    three   -0.141388  0.140177
    two      0.981508 -0.267810
foo one      0.441488  0.281747
    three   -0.304201 -0.465118
    two      1.694077 -1.645165
In [80]:
tuples=list(zip(*[['bar','bar','baz','baz',
                   'foo','foo','qux','qux'],
                  ['one','two','one','two',
                   'one','two','one','two']]))
index=pd.MultiIndex.from_tuples(tuples,names=['first','second'])    # a hierarchical (multi-level) index
index
Out[80]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])
In [81]:
df=pd.DataFrame(np.random.randn(8,2),index=index,columns=['A','B'])  # use the MultiIndex as the row index
df
Out[81]:
                     A         B
first second                    
bar   one    -2.304056 -0.278421
      two     0.105729  0.559201
baz   one     0.669632  1.074084
      two     1.503771  1.101276
foo   one     1.414275 -0.308551
      two     0.609448 -2.069294
qux   one    -1.493660 -0.658796
      two     0.956393 -0.393197
In [82]:
df2=df[:4]          # the first four rows
df2
Out[82]:
                     A         B
first second                    
bar   one    -2.304056 -0.278421
      two     0.105729  0.559201
baz   one     0.669632  1.074084
      two     1.503771  1.101276
In [83]:
stacked=df2.stack()
stacked
Out[83]:
first  second   
bar    one     A   -2.304056
               B   -0.278421
       two     A    0.105729
               B    0.559201
baz    one     A    0.669632
               B    1.074084
       two     A    1.503771
               B    1.101276
dtype: float64
In [84]:
stacked.unstack()
Out[84]:
                     A         B
first second                    
bar   one    -2.304056 -0.278421
      two     0.105729  0.559201
baz   one     0.669632  1.074084
      two     1.503771  1.101276
In [85]:
stacked.unstack(1)
Out[85]:
second        one       two
first                      
bar   A -2.304056  0.105729
      B -0.278421  0.559201
baz   A  0.669632  1.503771
      B  1.074084  1.101276
In [86]:
stacked.unstack(0)
Out[86]:
first         bar       baz
second                     
one    A -2.304056  0.669632
       B -0.278421  1.074084
two    A  0.105729  1.503771
       B  0.559201  1.101276
In [87]:
df=pd.DataFrame({'A':['one','one','two','three']*3,
                 'B':['A','B','C']*4,
                 'C':['foo','foo','foo','bar','bar','bar']*2,
                 'D':np.random.randn(12),
                 'E':np.random.randn(12)})
df 
Out[87]:
  A B C D E
0 one A foo -0.927428 0.703806
1 one B foo 1.118431 -0.203965
2 two C foo -0.341868 1.573454
3 three A bar 0.006382 0.688805
4 one B bar -2.000140 -1.140942
5 one C bar 0.106369 -0.552740
6 two A foo 0.676691 -0.842269
7 three B foo 0.766754 0.568882
8 one C foo -0.923519 -0.079561
9 one A bar -1.489467 0.071690
10 two B bar 0.803268 0.554549
11 three C bar -0.340346 -1.006487
In [89]:
pd.pivot_table(df,values='D',index=['A','B'],columns=['C'])  # pivot table
Out[89]:
C             bar       foo
A     B                    
one   A -1.489467 -0.927428
      B -2.000140  1.118431
      C  0.106369 -0.923519
three A  0.006382       NaN
      B       NaN  0.766754
      C -0.340346       NaN
two   A       NaN  0.676691
      B  0.803268       NaN
      C       NaN -0.341868
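pivot_table aggregates with the mean by default; when several rows land in the same cell, the aggregation can be switched via aggfunc. A minimal sketch:

pd.pivot_table(df, values='D', index=['A','B'], columns=['C'], aggfunc=np.sum)   # sum instead of mean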
In [93]:
rng=pd.date_range('1/1/2016',periods=100,freq='S')  # 100 periods at one-second frequency
rng
Out[93]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2016-01-01 00:00:00, ..., 2016-01-01 00:01:39]
Length: 100, Freq: S, Timezone: None
In [94]:
ts=pd.Series(np.random.randint(0,500,len(rng)),index=rng)       # random integer series on that index
In [95]:
ts
Out[95]:
2016-01-01 00:00:00    383
2016-01-01 00:00:01    495
2016-01-01 00:00:02     67
2016-01-01 00:00:03    187
2016-01-01 00:00:04    416
2016-01-01 00:00:05    415
2016-01-01 00:00:06    495
2016-01-01 00:00:07    490
2016-01-01 00:00:08    110
2016-01-01 00:00:09    215
2016-01-01 00:00:10      1
2016-01-01 00:00:11    187
2016-01-01 00:00:12    201
2016-01-01 00:00:13    115
2016-01-01 00:00:14    136
...
2016-01-01 00:01:25     52
2016-01-01 00:01:26     78
2016-01-01 00:01:27    280
2016-01-01 00:01:28    147
2016-01-01 00:01:29    301
2016-01-01 00:01:30    135
2016-01-01 00:01:31    191
2016-01-01 00:01:32     96
2016-01-01 00:01:33    312
2016-01-01 00:01:34    123
2016-01-01 00:01:35    466
2016-01-01 00:01:36    253
2016-01-01 00:01:37     88
2016-01-01 00:01:38     28
2016-01-01 00:01:39    185
Freq: S, Length: 100
In [102]:
ts.resample('20S')       # resample into 20-second bins; this older pandas aggregates with the mean by default (chain .sum() to sum instead)
Out[102]:
2016-01-01 00:00:00    269.75
2016-01-01 00:00:20    181.10
2016-01-01 00:00:40    238.15
2016-01-01 00:01:00    250.20
2016-01-01 00:01:20    192.25
Freq: 20S, dtype: float64
In [103]:
ts.resample('5Min')    # resample into 5-minute bins
Out[103]:
2016-01-01    226.29
Freq: 5T, dtype: float64
In [104]:
ts.resample('5Min').sum()
Out[104]:
226.28999999999999
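The two cells above rely on the old behaviour in which resample() aggregated immediately, taking the mean by default. From pandas 0.18 onward resample() returns a Resampler object and the aggregation must be spelled out, roughly:

ts.resample('20S').mean()   # 20-second bins, averaged
ts.resample('20S').sum()    # 20-second bins, summed
ts.resample('5Min').sum()   # one 5-minute bin covering the whole series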
In [106]:
rng=pd.date_range('6/1/2016 00:00',periods=5,freq='D')
ts=pd.Series(np.random.randn(len(rng)),rng)
ts
Out[106]:
2016-06-01    0.077363
2016-06-02   -0.470935
2016-06-03   -1.704306
2016-06-04   -2.667275
2016-06-05    0.239514
Freq: D, dtype: float64
In [107]:
ts_utc=ts.tz_localize('UTC')
ts_utc
Out[107]:
2016-06-01 00:00:00+00:00    0.077363
2016-06-02 00:00:00+00:00   -0.470935
2016-06-03 00:00:00+00:00   -1.704306
2016-06-04 00:00:00+00:00   -2.667275
2016-06-05 00:00:00+00:00    0.239514
Freq: D, dtype: float64
In [108]:
ts_utc.tz_convert('US/Eastern')
Out[108]:
2016-05-31 20:00:00-04:00    0.077363
2016-06-01 20:00:00-04:00   -0.470935
2016-06-02 20:00:00-04:00   -1.704306
2016-06-03 20:00:00-04:00   -2.667275
2016-06-04 20:00:00-04:00    0.239514
Freq: D, dtype: float64
In [110]:
rng=pd.date_range('1/1/2016',periods=5,freq='M')
rng
Out[110]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2016-01-31, ..., 2016-05-31]
Length: 5, Freq: M, Timezone: None
In [112]:
ts=pd.Series(np.random.randn(len(rng)),index=rng)
ts
Out[112]:
2016-01-31    1.383489
2016-02-29   -1.361366
2016-03-31   -1.234081
2016-04-30    1.500934
2016-05-31   -0.859164
Freq: M, dtype: float64
In [114]:
ps=ts.to_period()
ps
Out[114]:
2016-01    1.383489
2016-02   -1.361366
2016-03   -1.234081
2016-04    1.500934
2016-05   -0.859164
Freq: M, dtype: float64
In [115]:
ps.to_timestamp()
Out[115]:
2016-01-01    1.383489
2016-02-01   -1.361366
2016-03-01   -1.234081
2016-04-01    1.500934
2016-05-01   -0.859164
Freq: MS, dtype: float64
In [119]:
df=pd.DataFrame({"id":[1,2,3,4,5,6],"raw_grade":['a','b','b','a','a','e']})     # DataFrame在pandas里真的好常用啊!!
df
Out[119]:
  id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 e
In [136]:
ts=pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2016',periods=1000))
ts=ts.cumsum()
ts.plot()
plt.show()               # draws a single line

In [133]:
df=pd.DataFrame(np.random.randn(1000,4),index=ts.index,
                columns=['A','B','C','D'])
df=df.cumsum()
plt.figure();df.plot();plt.legend(loc='best')
plt.show()            # draws four lines, one per column

In [134]:
df.to_csv('foo.csv')
In [135]:
pd.read_csv('foo.csv')
Out[135]:
  Unnamed: 0 A B C D
0 2016-01-01 -0.060778 0.445654 -0.352386 -0.609622
1 2016-01-02 -0.225878 -0.223601 0.166339 -0.168708
2 2016-01-03 -0.584332 -0.398799 0.452762 0.145729
3 2016-01-04 -0.893308 -1.069849 -0.880419 0.625882
4 2016-01-05 0.177206 -1.696962 -1.651330 -0.397824
5 2016-01-06 0.210031 -1.284241 -2.067825 0.869942
6 2016-01-07 0.731707 -0.752098 -0.811682 0.172029
7 2016-01-08 0.025544 0.247914 0.441612 -0.017262
8 2016-01-09 1.190667 -0.119616 0.996169 -0.310887
9 2016-01-10 0.765977 -0.395972 0.257438 -2.957702
10 2016-01-11 1.946373 -1.427946 0.277539 -4.032154
11 2016-01-12 2.859597 -3.338857 -0.297688 -3.732874
12 2016-01-13 2.745899 -3.995169 -0.847225 -4.406678
13 2016-01-14 3.035527 -4.108064 -0.229614 -6.916703
14 2016-01-15 3.612329 -5.105051 0.408813 -7.405939
15 2016-01-16 4.266282 -3.667975 -0.535546 -7.343270
16 2016-01-17 4.946040 -4.205214 0.142485 -7.716980
17 2016-01-18 5.480254 -3.838008 0.630041 -7.853502
18 2016-01-19 5.732786 -5.089840 1.458250 -5.912834
19 2016-01-20 7.154406 -7.120037 2.173687 -6.321057
20 2016-01-21 6.093153 -8.702487 4.186862 -5.993653
21 2016-01-22 5.721862 -7.544379 3.898040 -7.736576
22 2016-01-23 5.834695 -6.941320 4.738928 -7.903169
23 2016-01-24 4.509326 -7.338737 6.270476 -7.876277
24 2016-01-25 5.352635 -8.037991 6.476991 -6.437261
25 2016-01-26 5.660139 -8.993605 6.768491 -6.830139
26 2016-01-27 6.523809 -8.701512 5.878055 -7.364075
27 2016-01-28 5.093959 -9.886362 6.435082 -6.988506
28 2016-01-29 6.851939 -10.621511 5.217927 -5.940278
29 2016-01-30 8.887495 -12.405823 4.735677 -5.162080
... ... ... ... ... ...
970 2018-08-28 -12.585631 21.117138 46.406459 29.434988
971 2018-08-29 -11.693035 21.089037 44.626627 29.102557
972 2018-08-30 -12.754592 20.168260 46.033262 29.625529
973 2018-08-31 -13.004323 18.298922 46.389874 29.823431
974 2018-09-01 -13.400461 21.166139 45.972786 29.883859
975 2018-09-02 -12.912455 20.822041 44.497434 29.433800
976 2018-09-03 -13.664545 19.558600 45.080095 29.243190
977 2018-09-04 -12.878333 20.092918 45.126348 28.184262
978 2018-09-05 -12.616144 19.951420 43.783920 28.859123
979 2018-09-06 -13.594572 21.404156 43.447826 29.281237
980 2018-09-07 -13.769105 21.547793 44.023320 29.684169
981 2018-09-08 -14.214899 20.473841 45.688251 30.801261
982 2018-09-09 -14.424237 21.834532 44.206935 29.692357
983 2018-09-10 -14.544344 23.023552 43.536564 29.879090
984 2018-09-11 -14.622561 22.498657 43.229307 31.090382
985 2018-09-12 -14.793304 21.686580 42.387027 30.960854
986 2018-09-13 -15.097856 20.427290 42.517566 29.111917
987 2018-09-14 -16.029041 22.117407 41.985718 28.646375
988 2018-09-15 -16.366028 22.329802 43.023612 28.016292
989 2018-09-16 -15.263399 20.977358 43.459743 28.958903
990 2018-09-17 -15.815869 19.748197 43.901657 27.582531
991 2018-09-18 -16.331221 21.476128 43.958389 26.769234
992 2018-09-19 -15.624969 19.748947 45.209086 27.094194
993 2018-09-20 -13.684556 19.023781 46.029573 24.591141
994 2018-09-21 -14.540087 19.139586 44.868671 26.183916
995 2018-09-22 -14.228209 19.500605 43.479828 25.862778
996 2018-09-23 -15.186773 17.678064 43.289105 26.264116
997 2018-09-24 -15.940376 18.740444 43.572656 24.962051
998 2018-09-25 -16.894603 17.720135 44.261559 25.487902
999 2018-09-26 -17.316531 14.945545 44.298645 24.699319

1000 rows × 5 columns
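Read back this way, the dates come in as a plain 'Unnamed: 0' text column. Passing index_col and parse_dates restores the original DatetimeIndex; a small sketch (df_back is just an illustrative name):

df_back = pd.read_csv('foo.csv', index_col=0, parse_dates=True)
df_back.index   # DatetimeIndex again, matching the frame that was saved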




Since pandas 0.15, DataFrame columns can hold categorical data.

In [122]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
In [123]: df["grade"] = df["raw_grade"].astype("category")       # convert to the categorical dtype

In [124]: df["grade"]
Out[124]: 
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

Three categories: a, b, e.

In [125]: df["grade"].cat.categories = ["very good", "good", "very bad"]
In [126]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
                        # map the grades onto the full five-category scale
In [127]: df["grade"]
Out[127]: 
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
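A side note on In [125]: assigning to .cat.categories directly was deprecated in later pandas releases; rename_categories is the forward-compatible spelling. A sketch of the same rename, applied right after In [123] while the categories are still [a, b, e]:

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])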
In [128]: df.sort_values(by="grade")     # sort by grade, lowest to highest by default
Out[128]: 
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
In [129]: df.groupby("grade").size()      # count rows per grade, including empty categories
Out[129]: 
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64




Reference: http://pandas.pydata.org/pandas-docs/stable/10min.html

