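1. reindex
The pandas reindex method conforms a Series or DataFrame to a new set of labels: labels that were not in the original are added (filled with NaN), and labels that are left out are dropped.
Methods: Series.reindex(), DataFrame.reindex()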
In[2]: import numpy as np
In[3]: import pandas as pd
In[5]: dates=pd.date_range('20160101',periods=6)
In[6]: df=pd.DataFrame(np.random.random((6,4)),index=dates,columns=list('ABCD'))
In[7]: df
Out[7]:
A B C D
2016-01-01 0.196507 0.019824 0.330309 0.289678
2016-01-02 0.658054 0.236342 0.429518 0.322824
2016-01-03 0.369265 0.855443 0.161498 0.763341
2016-01-04 0.383210 0.953314 0.364178 0.719806
2016-01-05 0.135650 0.713118 0.609763 0.752052
2016-01-06 0.470165 0.002966 0.393910 0.137355
In[10]: df1=df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
In[11]: df1
Out[11]:
A B C D E
2016-01-01 0.196507 0.019824 0.330309 0.289678 NaN
2016-01-02 0.658054 0.236342 0.429518 0.322824 NaN
2016-01-03 0.369265 0.855443 0.161498 0.763341 NaN
2016-01-04 0.383210 0.953314 0.364178 0.719806 NaN
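The same add-and-drop behaviour can be seen on a small Series. The sketch below uses made-up data (the names s, r1 and r2 are not from the session above) and also shows the fill_value parameter, which replaces the NaN that reindex would otherwise insert for new labels.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not taken from the session above.
s = pd.Series([1.0, 2.0, 3.0], index=list('abc'))

# 'd' is new, so it is filled with NaN; 'a' is not requested, so it is dropped.
r1 = s.reindex(list('bcd'))

# fill_value is used instead of NaN for labels that did not exist before.
r2 = s.reindex(list('bcd'), fill_value=0)

print(r1)
print(r2)
```

Note that reindex returns a new object; s itself is left unchanged.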
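2. Handling NaN values
.dropna(): drops the rows that contain NaN
.fillna(value): replaces NaN with value
.isnull(): returns a boolean DataFrame of the same shape as the original, True wherever the element is NaN
.any(): reduces a boolean DataFrame per column (or per row with axis=1), True where at least one element is True; applied to the result of isnull(), it shows which columns or rows contain NaN
NaN values are skipped by computations such as mean and sum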
In[12]: df.columns
Out[12]: Index(['A', 'B', 'C', 'D'], dtype='object')
In[13]: df1.loc[dates[1:3],'E']=2
In[14]: df1
Out[14]:
A B C D E
2016-01-01 0.196507 0.019824 0.330309 0.289678 NaN
2016-01-02 0.658054 0.236342 0.429518 0.322824 2.0
2016-01-03 0.369265 0.855443 0.161498 0.763341 2.0
2016-01-04 0.383210 0.953314 0.364178 0.719806 NaN
In[15]: df1.dropna()
Out[15]:
A B C D E
2016-01-02 0.658054 0.236342 0.429518 0.322824 2.0
2016-01-03 0.369265 0.855443 0.161498 0.763341 2.0
In[16]: df1.fillna(value=5)
Out[16]:
A B C D E
2016-01-01 0.196507 0.019824 0.330309 0.289678 5.0
2016-01-02 0.658054 0.236342 0.429518 0.322824 2.0
2016-01-03 0.369265 0.855443 0.161498 0.763341 2.0
2016-01-04 0.383210 0.953314 0.364178 0.719806 5.0
In[17]: pd.isnull(df1)
Out[17]:
A B C D E
2016-01-01 False False False False True
2016-01-02 False False False False False
2016-01-03 False False False False False
2016-01-04 False False False False True
In[18]: pd.isnull(df1).any()
Out[18]:
A False
B False
C False
D False
E True
dtype: bool
In[19]: pd.isnull(df1).any().any()
Out[19]: True
In[20]: df1.mean() # NaN values are excluded from the mean
Out[20]:
A 0.401759
B 0.516231
C 0.321376
D 0.523912
E 2.000000
dtype: float64
In[21]: df1.cumsum()
Out[21]:
A B C D E
2016-01-01 0.196507 0.019824 0.330309 0.289678 NaN
2016-01-02 0.854561 0.256167 0.759827 0.612502 2.0
2016-01-03 1.223826 1.111609 0.921325 1.375843 4.0
2016-01-04 1.607035 2.064923 1.285503 2.095649 NaN
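As a compact recap of the NaN helpers used above, here is a minimal sketch on made-up data. It additionally uses axis=1 to check rows instead of columns, and skipna=False to show what happens when NaN is not skipped; the variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column E has NaN in the first and last row.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'E': [np.nan, 2.0, np.nan]})

# One boolean per row: does this row contain any NaN?
rows_with_nan = df.isnull().any(axis=1)

# By default NaN is skipped, so only the single valid value counts.
mean_skip = df['E'].mean()

# With skipna=False the NaN propagates into the result.
mean_keep = df['E'].mean(skipna=False)

# fillna returns a new DataFrame; df itself still contains NaN.
filled = df.fillna(value=5)

print(rows_with_nan)
print(mean_skip, mean_keep)
print(filled)
```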
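3. Some useful functions
.apply(func): applies func to each column of the DataFrame (or to each row with axis=1)
.value_counts(): lists each distinct value together with the number of times it occurs
.mode(): the most frequent value(s)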
In[29]: df.apply(np.cumsum)
Out[29]:
A B C D
2016-01-01 0.196507 0.019824 0.330309 0.289678
2016-01-02 0.854561 0.256167 0.759827 0.612502
2016-01-03 1.223826 1.111609 0.921325 1.375843
2016-01-04 1.607035 2.064923 1.285503 2.095649
2016-01-05 1.742685 2.778041 1.895266 2.847701
2016-01-06 2.212849 2.781007 2.289176 2.985057
In[30]: df.apply(lambda x:x.max()-x.min())
Out[30]:
A 0.522404
B 0.950347
C 0.448265
D 0.625986
dtype: float64
In[31]: def _sum(x):
   ...:     print(type(x))
   ...:     return x.sum()
   ...: df.apply(_sum)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
Out[31]:
A 2.212849
B 2.781007
C 2.289176
D 2.985057
dtype: float64
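The examples above all apply the function column by column, which is the default. A small sketch with made-up data shows the difference when axis=1 is passed, so that the function receives one row at a time instead.

```python
import pandas as pd

# Hypothetical data, chosen so the results are easy to check by hand.
df = pd.DataFrame({'A': [1, 5], 'B': [4, 2]})

# Default (axis=0): the lambda receives each column as a Series.
col_range = df.apply(lambda x: x.max() - x.min())

# axis=1: the lambda receives each row as a Series.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

print(col_range)
print(row_range)
```

The column-wise result is indexed by column name (A, B), while the row-wise result is indexed by the row labels (0, 1).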
In[32]: s=pd.Series(np.random.randint(10,20,size=20))
In[33]: s
Out[33]:
0 18
1 16
2 11
3 11
4 17
5 14
6 18
7 17
8 13
9 13
10 11
11 11
12 16
13 17
14 17
15 13
16 12
17 15
18 14
19 10
dtype: int32
In[34]: s.value_counts()
Out[34]:
17 4
11 4
13 3
18 2
16 2
14 2
15 1
12 1
10 1
dtype: int64
In[35]: s.mode()
Out[35]:
0 11
1 17
dtype: int32
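Because the Series above is random, a fixed example makes the results easier to verify. The sketch below uses a made-up Series and also shows normalize=True, which returns relative frequencies instead of counts.

```python
import pandas as pd

# Hypothetical data: 11 and 17 each occur twice, 13 and 10 once.
s = pd.Series([11, 17, 11, 13, 17, 10])

# Absolute counts per distinct value.
counts = s.value_counts()

# Relative frequencies (counts divided by the number of elements).
shares = s.value_counts(normalize=True)

# mode() returns a Series because several values can tie for most frequent.
modes = s.mode()

print(counts)
print(shares)
print(modes)
```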
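pd.concat(): concatenates several pandas objects into one
.all(): like .any(), reduces along columns or rows, but is True only if every element is True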
In[36]: df=pd.DataFrame(np.random.random((10,4)),columns=list('ABCD'))
In[37]: df
Out[37]:
A B C D
0 0.309187 0.109764 0.237555 0.878088
1 0.008201 0.082768 0.939499 0.755231
2 0.203507 0.882972 0.166033 0.899489
3 0.528112 0.976442 0.405005 0.476885
4 0.557219 0.404936 0.975680 0.312243
5 0.990264 0.643145 0.396265 0.936465
6 0.027994 0.552443 0.277969 0.985753
7 0.325813 0.469911 0.432550 0.276821
8 0.805665 0.230130 0.561014 0.673377
9 0.010867 0.485834 0.512464 0.527696
In[38]: df.iloc[1:3]
Out[38]:
A B C D
1 0.008201 0.082768 0.939499 0.755231
2 0.203507 0.882972 0.166033 0.899489
In[39]: df.iloc[3:7]
Out[39]:
A B C D
3 0.528112 0.976442 0.405005 0.476885
4 0.557219 0.404936 0.975680 0.312243
5 0.990264 0.643145 0.396265 0.936465
6 0.027994 0.552443 0.277969 0.985753
In[40]: df.iloc[7:]
Out[40]:
A B C D
7 0.325813 0.469911 0.432550 0.276821
8 0.805665 0.230130 0.561014 0.673377
9 0.010867 0.485834 0.512464 0.527696
In[41]: df1=pd.concat([df.iloc[1:3],df.iloc[3:7],df.iloc[7:]])
In[42]: df1
Out[42]:
A B C D
1 0.008201 0.082768 0.939499 0.755231
2 0.203507 0.882972 0.166033 0.899489
3 0.528112 0.976442 0.405005 0.476885
4 0.557219 0.404936 0.975680 0.312243
5 0.990264 0.643145 0.396265 0.936465
6 0.027994 0.552443 0.277969 0.985753
7 0.325813 0.469911 0.432550 0.276821
8 0.805665 0.230130 0.561014 0.673377
9 0.010867 0.485834 0.512464 0.527696
In[46]: df1=pd.concat([df.iloc[:3],df.iloc[3:7],df.iloc[7:]])
In[47]: df1
Out[47]:
A B C D
0 0.309187 0.109764 0.237555 0.878088
1 0.008201 0.082768 0.939499 0.755231
2 0.203507 0.882972 0.166033 0.899489
3 0.528112 0.976442 0.405005 0.476885
4 0.557219 0.404936 0.975680 0.312243
5 0.990264 0.643145 0.396265 0.936465
6 0.027994 0.552443 0.277969 0.985753
7 0.325813 0.469911 0.432550 0.276821
8 0.805665 0.230130 0.561014 0.673377
9 0.010867 0.485834 0.512464 0.527696
In[48]: df==df1
Out[48]:
A B C D
0 True True True True
1 True True True True
2 True True True True
3 True True True True
4 True True True True
5 True True True True
6 True True True True
7 True True True True
8 True True True True
9 True True True True
In[49]: (df==df1).all()
Out[49]:
A True
B True
C True
D True
dtype: bool
In[50]: (df==df1).all().all()
Out[50]: True
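Two details of the steps above can be sketched on a smaller, made-up DataFrame: concat keeps the original row labels unless ignore_index=True is passed, and DataFrame.equals offers a one-call alternative to (df==df1).all().all(). The variable names are for illustration only.

```python
import pandas as pd

# Hypothetical data with the default 0..3 index.
df = pd.DataFrame({'A': [1, 2, 3, 4]})

# Concatenating the pieces in their original order reproduces df.
same = pd.concat([df.iloc[:2], df.iloc[2:]])
is_same = same.equals(df)

# In a different order the original row labels are kept as they were.
reordered = pd.concat([df.iloc[2:], df.iloc[:2]])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([df.iloc[2:], df.iloc[:2]], ignore_index=True)

print(is_same)
print(reordered)
print(renumbered)
```

Unlike ==, equals also treats NaN values in the same position as equal, which matters when the data has missing values.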
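pd.merge(): joins pandas objects; with on='key' and the default how='inner' it behaves like a database inner join and keeps only the keys present in both tables
.append(): appends a row, such as a Series, to a DataFrame (DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is used instead)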
In[4]: left=pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})
In[5]: right=pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})
In[6]: left
Out[6]:
key lval
0 foo 1
1 foo 2
In[7]: right
Out[7]:
key rval
0 foo 4
1 foo 5
In[9]: pd.merge(left,right,on='key')
Out[9]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
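In the example above every key is 'foo' on both sides, so the inner join simply produces all 2x2 combinations. To see which keys an inner join actually keeps, here is a sketch with made-up tables whose keys only partly overlap, compared against how='outer'.

```python
import pandas as pd

# Hypothetical tables: only 'foo' appears in both.
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# Default how='inner': keeps only keys found in both tables.
inner = pd.merge(left, right, on='key')

# how='outer': keeps every key and fills the missing side with NaN.
outer = pd.merge(left, right, on='key', how='outer')

print(inner)
print(outer)
```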
In[10]: df=pd.DataFrame(np.random.random((10,4)),columns=list('ABCD'))
In[11]: df
Out[11]:
A B C D
0 0.917928 0.396669 0.610322 0.666497
1 0.508628 0.718526 0.111458 0.577437
2 0.638503 0.974669 0.321745 0.873634
3 0.792875 0.402313 0.730289 0.441393
4 0.554515 0.133467 0.598767 0.487647
5 0.366506 0.740182 0.365446 0.167777
6 0.483763 0.080658 0.259861 0.766983
7 0.754270 0.632751 0.117197 0.149026
8 0.041530 0.155985 0.052181 0.960280
9 0.469961 0.825286 0.781265 0.888180
In[12]: s=pd.Series(np.random.randint(1,5,size=4),index=list('ABCD'))
In[13]: df.append(s,ignore_index=True)
Out[13]:
A B C D
0 0.917928 0.396669 0.610322 0.666497
1 0.508628 0.718526 0.111458 0.577437
2 0.638503 0.974669 0.321745 0.873634
3 0.792875 0.402313 0.730289 0.441393
4 0.554515 0.133467 0.598767 0.487647
5 0.366506 0.740182 0.365446 0.167777
6 0.483763 0.080658 0.259861 0.766983
7 0.754270 0.632751 0.117197 0.149026
8 0.041530 0.155985 0.052181 0.960280
9 0.469961 0.825286 0.781265 0.888180
10 2.000000 2.000000 3.000000 2.000000
In[16]: s=pd.Series(np.random.randint(1,5,size=5),index=list('ABCDE'))
In[17]: df.append(s) # raises TypeError: an unnamed Series can only be appended with ignore_index=True
In[18]: df.append(s,ignore_index=True)
Out[18]:
A B C D E
0 0.917928 0.396669 0.610322 0.666497 NaN
1 0.508628 0.718526 0.111458 0.577437 NaN
2 0.638503 0.974669 0.321745 0.873634 NaN
3 0.792875 0.402313 0.730289 0.441393 NaN
4 0.554515 0.133467 0.598767 0.487647 NaN
5 0.366506 0.740182 0.365446 0.167777 NaN
6 0.483763 0.080658 0.259861 0.766983 NaN
7 0.754270 0.632751 0.117197 0.149026 NaN
8 0.041530 0.155985 0.052181 0.960280 NaN
9 0.469961 0.825286 0.781265 0.888180 NaN
10 1.000000 4.000000 4.000000 4.000000 1.0
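On pandas 2.0 and later, where DataFrame.append is no longer available, the same row-append can be written with pd.concat. The sketch below is one possible equivalent, not the only one; it uses small made-up data and turns the Series into a one-row DataFrame with to_frame().T before concatenating.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a 2x4 float frame and a Series with an extra label 'E'.
df = pd.DataFrame(np.arange(8.0).reshape(2, 4), columns=list('ABCD'))
s = pd.Series([1, 4, 4, 4, 1], index=list('ABCDE'))

# to_frame() gives a single column; .T turns it into a single row
# whose columns are the Series labels A..E.
row = s.to_frame().T

# ignore_index=True renumbers the rows, like append(..., ignore_index=True).
out = pd.concat([df, row], ignore_index=True)

print(out)
```

As in the append output above, the rows that came from df get NaN in the new column E.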