pandas
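Unlike numpy, whose arrays behave like lists, pandas behaves more like a dictionary.
A Series attaches an index label to every value, much as a dictionary attaches a key to every value.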
import pandas as pd
import numpy as np
s = pd.Series([1,2,3,6,np.nan,44,1])
print(s)
#result
0 1.0
1 2.0
2 3.0
3 6.0
4 NaN
5 44.0
6 1.0
dtype: float64
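To make the dictionary comparison concrete, here is a minimal sketch. The labels 'x', 'y', 'z' and the values are made up for illustration and are not part of the notes above.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration.
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

value_y = s['y']        # look up a value by its label, like a dict key
has_z = 'z' in s        # membership tests check the index labels, not the values
labels = list(s.index)  # the labels themselves

print(value_y)
print(has_z)
print(labels)
```

When no index is passed, as in the example above, pandas falls back to the default integer labels 0, 1, 2, ...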
A DataFrame prints like a table, with row labels and column labels
import pandas as pd
import numpy as np
dates = pd.date_range('20160101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])
print(df)
print(df.columns)
#result
a b c d
2016-01-01 -1.430430 -0.421334 1.219365 -0.480058
2016-01-02 0.612175 1.118887 0.678260 -0.689190
2016-01-03 0.983088 3.028284 0.020579 -0.251909
2016-01-04 -0.669926 -0.019545 1.813316 1.129999
2016-01-05 0.436789 -0.832122 -0.713937 1.164483
2016-01-06 0.430476 2.399838 0.299447 0.523971
Index(['a', 'b', 'c', 'd'], dtype='object')
#the row index comes from dates and the column labels come from columns
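The parts of the DataFrame can also be inspected one at a time. This is a small sketch that rebuilds the same kind of random table; the numbers change on every run, so no output is shown.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20160101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])

print(df.index)    # row labels: the six dates
print(df.columns)  # column labels: a, b, c, d
print(df.shape)    # (rows, columns)

col_a = df['a']    # a single column comes back as a Series
print(type(col_a))
```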
Use describe to summarize the statistics of each numeric column
print(df.describe()) #analyze the values of each column
#result (from a separate random run, so it does not match the table above)
a b c d
count 6.000000 6.000000 6.000000 6.000000
mean 0.319191 0.191965 0.536882 -0.213288
std 0.802585 2.000006 0.635241 0.982923
min -1.037829 -2.291519 0.007462 -1.677567
25% -0.053965 -1.172490 0.161186 -0.757976
50% 0.650846 0.066436 0.330561 -0.139355
75% 0.876507 1.334585 0.611195 0.515499
max 0.988458 3.138598 1.743238 0.906949
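Because the table above is random, the statistics are hard to check by eye. A sketch with fixed data (np.arange, an assumption made here only so the numbers are predictable) shows that the rows of describe agree with the ordinary column statistics.

```python
import numpy as np
import pandas as pd

# Fixed data instead of random data, so the statistics are predictable.
df = pd.DataFrame(np.arange(24).reshape((6, 4)), columns=['a', 'b', 'c', 'd'])
stats = df.describe()

print(stats.loc['mean'])  # one row of the summary: the mean of every column
print(stats.loc['50%'])   # the 50% row is the median

# The summary rows agree with the individual methods.
same_mean = stats.loc['mean'].equals(df.mean())
print(same_mean)
```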
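Use sort_index to sort by labels
axis chooses what to sort: axis=0 sorts the row labels, axis=1 sorts the column labels
ascending chooses the order: True for ascending, False for descending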
print(df.sort_index(axis=1,ascending=False))
d c b a
2016-01-01 -0.656583 1.181235 -1.499221 -0.951707
2016-01-02 0.142750 0.854452 1.219795 0.876144
2016-01-03 1.213446 0.362272 -1.255121 -0.876265
2016-01-04 0.667960 -0.696974 -0.162850 0.005028
2016-01-05 -2.494620 -1.073663 0.380002 -1.473647
2016-01-06 1.118285 0.134734 1.144273 0.048522
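The example above only sorts the columns. A short sketch on a small made-up table shows both axes side by side, and that sort_index returns a new object rather than changing df.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20160101', periods=3)
df = pd.DataFrame(np.arange(6).reshape((3, 2)), index=dates, columns=['a', 'b'])

by_rows = df.sort_index(axis=0, ascending=False)  # newest date first
by_cols = df.sort_index(axis=1, ascending=False)  # columns in the order b, a

print(by_rows)
print(by_cols)
print(df)  # the original is left unchanged
```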
Selecting specific data
- loc : select by label
- iloc : select by integer position
- df[df.x < value] : select by condition (boolean indexing)
import pandas as pd
import numpy as np
dates = pd.date_range('20160101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['a','b','c','d'])
print(df.loc['20160101']) #select by label:loc
print(df.loc['20160101',['a','b']]) #select by label: one row, columns a and b
print(df.iloc[3:5,1]) #select by position: iloc
print(df[df.a>8]) #boolean indexing
a 0
b 1
c 2
d 3
Name: 2016-01-01 00:00:00, dtype: int32
a 0
b 1
Name: 2016-01-01 00:00:00, dtype: int32
2016-01-04 13
2016-01-05 17
Freq: D, Name: b, dtype: int32
a b c d
2016-01-04 12 13 14 15
2016-01-05 16 17 18 19
2016-01-06 20 21 22 23
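One difference between the two selectors is easy to miss: a loc slice includes its end label, while an iloc slice excludes its end position, like ordinary Python slicing. The sketch below uses the same df as above; the combined condition at the end is an extra illustration, not something from the notes.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20160101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['a', 'b', 'c', 'd'])

# Label slices include both ends: 3 rows (01-02 to 01-04) and 2 columns (a, b).
by_label = df.loc['20160102':'20160104', 'a':'b']

# Position slices exclude the end: rows 1, 2, 3 and columns 0, 1.
by_position = df.iloc[1:4, 0:2]

print(by_label.equals(by_position))  # both select the same block

# Conditions can be combined with & and |, each wrapped in parentheses.
both = df[(df.a > 8) & (df.d < 23)]
print(both)
```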
Reassigning values
import pandas as pd
import numpy as np
dates = pd.date_range('20160101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['a','b','c','d'])
df.iloc[2,2] = 111 # use the position to change the value
df.loc['20160101','b'] = 222 #use label to change the value
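df.loc[df.a>4,'a'] = 0 #boolean mask: set a to 0 wherever a > 4 (one loc call instead of chained indexing)
df['f'] = pd.Series([1,2,3,4,5,6],index=pd.date_range('20160101',periods=6))
#add a new column f; the values are matched to the rows by index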
print(df)
# result
a b c d f
2016-01-01 0 222 2 3 1
2016-01-02 4 5 6 7 2
2016-01-03 0 9 111 11 3
2016-01-04 0 13 14 15 4
2016-01-05 0 17 18 19 5
2016-01-06 0 21 22 23 6
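When a Series is assigned as a new column, pandas matches it to the rows by index label, not by order. A sketch of what happens when the Series covers only some of the rows (the column name g and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20160101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['a', 'b', 'c', 'd'])

# Boolean mask and column label in a single loc call.
df.loc[df.a > 4, 'a'] = 0

# Hypothetical column g with values for the first three dates only.
df['g'] = pd.Series([1, 2, 3], index=dates[:3])

print(df)  # rows without a matching label get NaN in column g
```

Because of the NaN values, column g is stored as float even though the values passed in were integers.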
Handling missing data
import pandas as pd
import numpy as np
dates = pd.date_range('20160101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['a','b','c','d'])
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
print(df.dropna(axis=0,how='any')) #how='any': drop every row that contains at least one NaN
print(df.fillna(value=0)) #replace every NaN with 0
print(df.isnull()) #check, cell by cell, whether the value is NaN
a b c d
2016-01-03 8 9.0 10.0 11
2016-01-04 12 13.0 14.0 15
2016-01-05 16 17.0 18.0 19
2016-01-06 20 21.0 22.0 23
a b c d
2016-01-01 0 0.0 2.0 3
2016-01-02 4 5.0 0.0 7
2016-01-03 8 9.0 10.0 11
2016-01-04 12 13.0 14.0 15
2016-01-05 16 17.0 18.0 19
2016-01-06 20 21.0 22.0 23
a b c d
2016-01-01 False True False False
2016-01-02 False False True False
2016-01-03 False False False False
2016-01-04 False False False False
2016-01-05 False False False False
2016-01-06 False False False False
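The other settings of dropna follow the same pattern. The sketch below is a variation on the example above: it builds the table as float from the start (an assumption made here so that NaN can be stored without changing the column dtype), then drops columns instead of rows, tries how='all', and reduces the isnull table to a single answer.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20160101', periods=6)
# Float data, so NaN fits without a dtype change.
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)),
                  index=dates, columns=['a', 'b', 'c', 'd'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

drop_cols = df.dropna(axis=1, how='any')  # drop columns b and c, which contain NaN
drop_all = df.dropna(axis=0, how='all')   # drop only rows that are entirely NaN: none here
any_missing = df.isnull().values.any()    # one True/False for the whole table
per_column = df.isnull().sum()            # number of NaN values in each column

print(drop_cols)
print(drop_all)
print(any_missing)
print(per_column)
```

On a large table the full isnull output is too big to read, so the single any_missing flag or the per_column counts are usually more practical.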