pandas
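Common data structures
Series: a one-dimensional labeled array; a single row or column of a table is a Series
DataFrame: a two-dimensional labeled table
In addition
pandas.date_range(…) generates a time index with a fixed frequency. When calling it, at least two of start, end, and periods must be given (freq defaults to daily), otherwise an error is raised.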
In[45]: import pandas as pd
In[46]: import numpy as np
In[47]: s=pd.Series([1,4,6,8,9])
In[48]: s
Out[48]:
0 1
1 4
2 6
3 8
4 9
dtype: int64
In[49]: dates=pd.date_range('20000101',periods=6)
In[50]: dates
Out[50]:
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
'2000-01-05', '2000-01-06'],
dtype='datetime64[ns]', freq='D')
In[53]: data=pd.DataFrame(np.random.random((6,4)),index=dates,columns=list('ABCD'))
In[54]: data
Out[54]:
A B C D
2000-01-01 0.898498 0.550647 0.053829 0.529634
2000-01-02 0.888596 0.249757 0.791731 0.795984
2000-01-03 0.495294 0.991960 0.711085 0.768610
2000-01-04 0.596663 0.518386 0.583147 0.833397
2000-01-05 0.994959 0.073198 0.152926 0.978413
2000-01-06 0.437817 0.122992 0.990374 0.318606
In[55]: data.shape
Out[55]: (6, 4)
In[56]: data.values
Out[56]:
array([[0.89849834, 0.55064726, 0.05382852, 0.52963363],
[0.88859551, 0.24975664, 0.79173082, 0.79598432],
[0.49529418, 0.99196023, 0.71108504, 0.76860962],
[0.59666325, 0.51838575, 0.58314702, 0.83339727],
[0.99495878, 0.07319768, 0.15292637, 0.97841344],
[0.43781651, 0.1229924 , 0.99037399, 0.3186058 ]])
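The date_range rule stated at the top can be checked with a small sketch. The dates below are illustrative only, and the error message text may differ between pandas versions.

```python
import pandas as pd

# start + periods: six consecutive days (freq defaults to 'D')
by_periods = pd.date_range(start='2000-01-01', periods=6)

# start + end: both endpoints are included
by_end = pd.date_range(start='2000-01-01', end='2000-01-06')

# end + periods: counts backwards from the end date
by_end_periods = pd.date_range(end='2000-01-06', periods=6)

# All three calls describe the same index
print(by_periods.equals(by_end) and by_end.equals(by_end_periods))  # True

# A single one of the three parameters is not enough information
try:
    pd.date_range(start='2000-01-01')
except ValueError as exc:
    print('ValueError:', exc)
```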
A DataFrame can also be defined from a dictionary
df.dtypes: shows the dtype of each column
df.<column name>: accesses a column
A DataFrame is made up of Series
In[57]: d={'A':1,'B':pd.Timestamp('20000201'),'C':range(4),'D':np.arange(4)}
In[58]: df=pd.DataFrame(d)
In[60]: df
Out[60]:
A B C D
0 1 2000-02-01 0 0
1 1 2000-02-01 1 1
2 1 2000-02-01 2 2
3 1 2000-02-01 3 3
In[61]: df.dtypes
Out[61]:
A int64
B datetime64[ns]
C int64
D int32
dtype: object
In[62]: df.A
Out[62]:
0 1
1 1
2 1
3 1
Name: A, dtype: int64
In[63]: df.B
Out[63]:
0 2000-02-01
1 2000-02-01
2 2000-02-01
3 2000-02-01
Name: B, dtype: datetime64[ns]
In[64]: type(df.A)
Out[64]: pandas.core.series.Series
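A minimal sketch of the dictionary form, as a standalone script. The explicit dtype='int64' on column D is an addition for illustration: the default integer dtype of np.arange depends on the platform and NumPy version, which is presumably why D shows int32 in the session above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': 1,                            # scalar, repeated for every row
    'B': pd.Timestamp('2000-02-01'),   # scalar timestamp, also repeated
    'C': list(range(4)),               # list of length 4 sets the row count
    'D': np.arange(4, dtype='int64'),  # NumPy array with an explicit dtype
})

print(df.shape)            # (4, 4)
print(df.dtypes)           # one dtype per column
print(type(df['A']))       # each column is a pandas Series
print(df['A'].tolist())    # [1, 1, 1, 1]
```

df['A'] and df.A return the same Series; the bracket form also works for column names that are not valid Python identifiers.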
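data.head(): shows the first n rows of the DataFrame; the argument is the number of rows, 5 by default
data.tail(): the same, but shows the last n rows
data.index: the row index (an attribute, not a method)
data.columns: the column names (an attribute)
data.values: all values in the table as a NumPy array (an attribute)
data.describe(): summary statistics for each numeric column (count, mean, std, min, quartiles, max)
data.T: the transpose, as in NumPy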
In[65]: data.head()
Out[65]:
A B C D
2000-01-01 0.898498 0.550647 0.053829 0.529634
2000-01-02 0.888596 0.249757 0.791731 0.795984
2000-01-03 0.495294 0.991960 0.711085 0.768610
2000-01-04 0.596663 0.518386 0.583147 0.833397
2000-01-05 0.994959 0.073198 0.152926 0.978413
In[66]: data.head(2)
Out[66]:
A B C D
2000-01-01 0.898498 0.550647 0.053829 0.529634
2000-01-02 0.888596 0.249757 0.791731 0.795984
In[67]: data.tail(3)
Out[67]:
A B C D
2000-01-04 0.596663 0.518386 0.583147 0.833397
2000-01-05 0.994959 0.073198 0.152926 0.978413
2000-01-06 0.437817 0.122992 0.990374 0.318606
In[68]: data.index
Out[68]:
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
'2000-01-05', '2000-01-06'],
dtype='datetime64[ns]', freq='D')
In[69]: data.columns
Out[69]: Index(['A', 'B', 'C', 'D'], dtype='object')
In[70]: data.values
Out[70]:
array([[0.89849834, 0.55064726, 0.05382852, 0.52963363],
[0.88859551, 0.24975664, 0.79173082, 0.79598432],
[0.49529418, 0.99196023, 0.71108504, 0.76860962],
[0.59666325, 0.51838575, 0.58314702, 0.83339727],
[0.99495878, 0.07319768, 0.15292637, 0.97841344],
[0.43781651, 0.1229924 , 0.99037399, 0.3186058 ]])
In[71]: data.describe()
Out[71]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.718638 0.417823 0.547182 0.704107
std 0.237154 0.343897 0.369653 0.238166
min 0.437817 0.073198 0.053829 0.318606
25% 0.520636 0.154683 0.260482 0.589378
50% 0.742629 0.384071 0.647116 0.782297
75% 0.896023 0.542582 0.771569 0.824044
max 0.994959 0.991960 0.990374 0.978413
In[72]: data.T
Out[72]:
2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05 2000-01-06
A 0.898498 0.888596 0.495294 0.596663 0.994959 0.437817
B 0.550647 0.249757 0.991960 0.518386 0.073198 0.122992
C 0.053829 0.791731 0.711085 0.583147 0.152926 0.990374
D 0.529634 0.795984 0.768610 0.833397 0.978413 0.318606
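The same inspection calls as a reproducible sketch. It uses the values 0 to 23 instead of random numbers (an assumption made here so the results are predictable), and it also shows that index is an attribute, so calling it fails.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2000-01-01', periods=6)
# Deterministic values 0..23 in place of np.random.random
data = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

print(data.head(2).shape)                  # (2, 4)
print(data.tail(3).shape)                  # (3, 4)
print(list(data.columns))                  # ['A', 'B', 'C', 'D']
print(data.values.shape)                   # (6, 4)
print(data.T.shape)                        # (4, 6)
print(data.describe().loc['mean', 'A'])    # 10.0, the mean of 0, 4, ..., 20

# index, columns and values are attributes; calling one raises TypeError
try:
    data.index()
except TypeError as exc:
    print('TypeError:', exc)
```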
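Ways to sort:
data.sort_index(): sorts by labels; axis=0 (the default) sorts by row index, axis=1 sorts by column name; ascending=False sorts in descending order
data.sort_values(): sorts by the values of a column, given with by=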
In[73]: data.sort_index(axis=1)
Out[73]:
A B C D
2000-01-01 0.898498 0.550647 0.053829 0.529634
2000-01-02 0.888596 0.249757 0.791731 0.795984
2000-01-03 0.495294 0.991960 0.711085 0.768610
2000-01-04 0.596663 0.518386 0.583147 0.833397
2000-01-05 0.994959 0.073198 0.152926 0.978413
2000-01-06 0.437817 0.122992 0.990374 0.318606
In[74]: data.sort_index(axis=1,ascending=False)
Out[74]:
D C B A
2000-01-01 0.529634 0.053829 0.550647 0.898498
2000-01-02 0.795984 0.791731 0.249757 0.888596
2000-01-03 0.768610 0.711085 0.991960 0.495294
2000-01-04 0.833397 0.583147 0.518386 0.596663
2000-01-05 0.978413 0.152926 0.073198 0.994959
2000-01-06 0.318606 0.990374 0.122992 0.437817
In[75]: data.sort_index(axis=0,ascending=False)
Out[75]:
A B C D
2000-01-06 0.437817 0.122992 0.990374 0.318606
2000-01-05 0.994959 0.073198 0.152926 0.978413
2000-01-04 0.596663 0.518386 0.583147 0.833397
2000-01-03 0.495294 0.991960 0.711085 0.768610
2000-01-02 0.888596 0.249757 0.791731 0.795984
2000-01-01 0.898498 0.550647 0.053829 0.529634
In[76]: data.sort_values(by='A')
Out[76]:
A B C D
2000-01-06 0.437817 0.122992 0.990374 0.318606
2000-01-03 0.495294 0.991960 0.711085 0.768610
2000-01-04 0.596663 0.518386 0.583147 0.833397
2000-01-02 0.888596 0.249757 0.791731 0.795984
2000-01-01 0.898498 0.550647 0.053829 0.529634
2000-01-05 0.994959 0.073198 0.152926 0.978413
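A small sketch of the sorting calls on a frame whose labels start out unsorted, which makes the effect of axis easier to see than in the session above. The frame and its labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'B': [3, 1, 2], 'A': [30, 10, 20]},
    index=['y', 'x', 'z'],
)

by_rows = df.sort_index()                              # axis=0: sort row labels
by_cols = df.sort_index(axis=1)                        # axis=1: sort column names
by_cols_desc = df.sort_index(axis=1, ascending=False)  # column names, descending
by_value = df.sort_values(by='A', ascending=False)     # rows ordered by column A

print(list(by_rows.index))         # ['x', 'y', 'z']
print(list(by_cols.columns))       # ['A', 'B']
print(list(by_cols_desc.columns))  # ['B', 'A']
print(list(by_value.index))        # ['y', 'z', 'x']
print(list(df.index))              # ['y', 'x', 'z'], the original is unchanged
```

Both methods return a new DataFrame by default, so the result has to be assigned to a name to be kept.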
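Ways to access elements of a DataFrame
data[…]: the basic form. A column name (or a list of column names) selects columns; a slice selects rows, either by position (data[2:4]) or by row label (data['20000102':'20000104']). It does not accept a row, column pair.
data.loc[…]: label-based; rows and columns are given by label, as [rows, columns]. Label slices include both endpoints.
data.iloc[…]: position-based; the brackets take integer positions. The end of a slice is excluded.
data.at[…]: like data.loc, but for a single value. In the session below the row label had to be a native pandas type: since the row index is made of timestamps, the row is given as a pd.Timestamp.
data.iat[…]: like data.iloc, but for a single value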
In[77]: data['A']
Out[77]:
2000-01-01 0.898498
2000-01-02 0.888596
2000-01-03 0.495294
2000-01-04 0.596663
2000-01-05 0.994959
2000-01-06 0.437817
Freq: D, Name: A, dtype: float64
In[78]: data.A
Out[78]:
2000-01-01 0.898498
2000-01-02 0.888596
2000-01-03 0.495294
2000-01-04 0.596663
2000-01-05 0.994959
2000-01-06 0.437817
Freq: D, Name: A, dtype: float64
In[79]: data[2:4]
Out[79]:
A B C D
2000-01-03 0.495294 0.991960 0.711085 0.768610
2000-01-04 0.596663 0.518386 0.583147 0.833397
In[80]: data['20000102':'20000104']
Out[80]:
A B C D
2000-01-02 0.888596 0.249757 0.791731 0.795984
2000-01-03 0.495294 0.991960 0.711085 0.768610
2000-01-04 0.596663 0.518386 0.583147 0.833397
In[81]: data.loc['20000102':'20000104']
Out[81]:
A B C D
2000-01-02 0.888596 0.249757 0.791731 0.795984
2000-01-03 0.495294 0.991960 0.711085 0.768610
2000-01-04 0.596663 0.518386 0.583147 0.833397
In[82]: data.iloc[2:4]
Out[82]:
A B C D
2000-01-03 0.495294 0.991960 0.711085 0.768610
2000-01-04 0.596663 0.518386 0.583147 0.833397
In[84]: data.loc[:,['B','C']]
Out[84]:
B C
2000-01-01 0.550647 0.053829
2000-01-02 0.249757 0.791731
2000-01-03 0.991960 0.711085
2000-01-04 0.518386 0.583147
2000-01-05 0.073198 0.152926
2000-01-06 0.122992 0.990374
In[85]: data.loc['20000102':'20000105',['B','C']]
Out[85]:
B C
2000-01-02 0.249757 0.791731
2000-01-03 0.991960 0.711085
2000-01-04 0.518386 0.583147
2000-01-05 0.073198 0.152926
In[86]: data.loc['20000102','B']
Out[86]: 0.2497566358873604
In[87]: data.at[pd.Timestamp('20000102'),'B'] # must be the native type (pd.Timestamp), otherwise an error is raised
Out[87]: 0.2497566358873604
In[88]: data.iloc[1]
Out[88]:
A 0.888596
B 0.249757
C 0.791731
D 0.795984
Name: 2000-01-02 00:00:00, dtype: float64
In[89]: data.iloc[1:3,2:4]
Out[89]:
C D
2000-01-02 0.791731 0.795984
2000-01-03 0.711085 0.768610
In[90]: data.iloc[1:3]
Out[90]:
A B C D
2000-01-02 0.888596 0.249757 0.791731 0.795984
2000-01-03 0.495294 0.991960 0.711085 0.768610
In[91]: data.iloc[1,1]
Out[91]: 0.2497566358873604
In[92]: data.iat[1,1]
Out[92]: 0.2497566358873604
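The difference between label-based and position-based access can be summarized in a short sketch. It again uses the values 0 to 23 rather than random numbers, an assumption made so the printed results are predictable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2000-01-01', periods=6)
data = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label slice: both endpoints are included, so three rows come back
by_label = data.loc['2000-01-02':'2000-01-04', ['B', 'C']]
print(by_label.shape)    # (3, 2)

# Position slice: the end position is excluded, so two rows come back
by_pos = data.iloc[1:3, 2:4]
print(by_pos.shape)      # (2, 2)

# Scalar access: all four forms read row 2000-01-02, column B
ts = pd.Timestamp('2000-01-02')
print(data.loc[ts, 'B'], data.at[ts, 'B'], data.iloc[1, 1], data.iat[1, 1])  # 5.0 5.0 5.0 5.0
```

at and iat only accept a single row and a single column, which is what distinguishes them from loc and iloc.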
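data[…]: with a condition in the brackets, keeps the values that satisfy the condition and shows the rest as NaN; the condition is written the same way as in NumPy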
In[105]: data[data>0.5]
Out[105]:
A B C D
2000-01-01 0.898498 0.550647 NaN 0.529634
2000-01-02 0.888596 NaN 0.791731 0.795984
2000-01-03 NaN 0.991960 0.711085 0.768610
2000-01-04 0.596663 0.518386 0.583147 0.833397
2000-01-05 0.994959 NaN NaN 0.978413
2000-01-06 NaN NaN 0.990374 NaN
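A sketch of the two common kinds of boolean filtering, on a small made-up frame. A condition on the whole frame masks individual cells, while a condition on one column keeps or drops whole rows.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list('ABCD'))

# Element-wise mask: the shape is kept, cells that fail become NaN
masked = data[data > 5]
print(masked.shape)                  # (3, 4)
print(masked.isna().sum().sum())     # 6, the cells holding 0..5

# where() is an equivalent spelling of the same mask
print(masked.equals(data.where(data > 5)))   # True

# Row filter: a boolean Series built from one column keeps whole rows
rows = data[data['A'] > 3]
print(rows.shape)                    # (2, 4)
```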
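Modifying values in the table: access them in one of the ways above, then assign
.isin(…): takes a list and tests, for each element of a column, whether it equals one of the values in the list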
In[95]: data2=data.copy()
In[96]: tag=['a']*2+['b']*2+['c']*2
In[97]: data2['TAG']=tag
In[100]: data2[data2.TAG.isin(['a','c'])]
Out[100]:
A B C D TAG
2000-01-01 0.898498 0.550647 0.053829 0.529634 a
2000-01-02 0.888596 0.249757 0.791731 0.795984 a
2000-01-05 0.994959 0.073198 0.152926 0.978413 c
2000-01-06 0.437817 0.122992 0.990374 0.318606 c
In[101]: data2.B=200
In[102]: data2
Out[102]:
A B C D TAG
2000-01-01 0.898498 200 0.053829 0.529634 a
2000-01-02 0.888596 200 0.791731 0.795984 a
2000-01-03 0.495294 200 0.711085 0.768610 b
2000-01-04 0.596663 200 0.583147 0.833397 b
2000-01-05 0.994959 200 0.152926 0.978413 c
2000-01-06 0.437817 200 0.990374 0.318606 c
In[103]: data2.iloc[:,2:5]=200
In[104]: data2
Out[104]:
A B C D TAG
2000-01-01 0.898498 200 200 200 200
2000-01-02 0.888596 200 200 200 200
2000-01-03 0.495294 200 200 200 200
2000-01-04 0.596663 200 200 200 200
2000-01-05 0.994959 200 200 200 200
2000-01-06 0.437817 200 200 200 200
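As a sketch of combining isin with assignment, the mask can be passed to loc so that only the matching rows of one column are changed. The frame below uses the values 0 to 23 and a default integer index, both assumptions made for illustration.

```python
import numpy as np
import pandas as pd

data2 = pd.DataFrame(np.arange(24.0).reshape(6, 4), columns=list('ABCD'))
data2['TAG'] = ['a'] * 2 + ['b'] * 2 + ['c'] * 2   # new column from a list

in_ac = data2['TAG'].isin(['a', 'c'])   # boolean Series, one entry per row
print(len(data2[in_ac]))      # 4 rows tagged a or c
print(len(data2[~in_ac]))     # 2 rows tagged b

# Assign only where the mask is True, and only in column B
data2.loc[in_ac, 'B'] = 200
print(data2['B'].tolist())    # [200.0, 200.0, 9.0, 13.0, 200.0, 200.0]
```

In the session above, iloc[:, 2:5] covers the columns C, D and TAG, which is why TAG is overwritten with 200 as well.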