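A basic introduction to pandas
pandas works like a dictionary-style version of numpy: if a numpy array is like a list, a pandas object is like a dictionary,
# because rows and columns can each be given their own labels. The two libraries are usually used together.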
import numpy as np
import pandas as pd
s = pd.Series([1,3,6,np.nan,44,1])
print(s)
dates = pd.date_range('20180101',periods=6)
dates
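# A DataFrame is similar to a 2-D numpy matrix
# Note: np.random.randn() with no arguments returns a single float, which pandas broadcasts to every cell;
# pass np.random.randn(6, 4) instead to get an independent random value in each cell.
df = pd.DataFrame(np.random.randn(),index=dates,columns=['a','b','c','d'])  # index labels the rows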
print(df)
df2 = pd.DataFrame(np.arange(12).reshape((3,4)))
print(df2)
# Output
0 1.0
1 3.0
2 6.0
3 NaN
4 44.0
5 1.0
dtype: float64
a b c d
2018-01-01 -2.110566 -2.110566 -2.110566 -2.110566
2018-01-02 -2.110566 -2.110566 -2.110566 -2.110566
2018-01-03 -2.110566 -2.110566 -2.110566 -2.110566
2018-01-04 -2.110566 -2.110566 -2.110566 -2.110566
2018-01-05 -2.110566 -2.110566 -2.110566 -2.110566
2018-01-06 -2.110566 -2.110566 -2.110566 -2.110566
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
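To make the dictionary comparison from the start of this section concrete, here is a minimal sketch that relabels a small frame and then looks up one value by its row and column labels. The names `frame`, `r0`..`r2` and `a`..`d` are made up for illustration and do not come from the original notes.

```python
import numpy as np
import pandas as pd

# A small frame with default integer labels, built the same way as df2
frame = pd.DataFrame(np.arange(12).reshape((3, 4)))

# Give the rows and columns labels of our own (hypothetical names)
frame.index = ['r0', 'r1', 'r2']
frame.columns = ['a', 'b', 'c', 'd']

# Look up a single value by row label and column label,
# much like using a key in a dictionary
value = frame.loc['r1', 'c']
print(frame)
print(value)
```

The underlying numbers are unchanged; only the labels used to reach them differ.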
# Another way to create a DataFrame
# Pass a dictionary: each key becomes a column name and each value supplies that column's data.
df3 = pd.DataFrame({'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo'})
print(df3)
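print(df3.dtypes)  # data type of each column
print(df3.index)  # row labels
print(df3.columns)  # column labels
print(df3.values)  # underlying values as a numpy array
print(df3.describe())  # summary statistics; non-numeric columns such as E and F are skipped by default
print(df3.T)  # transpose, as in numpy
print(df3.sort_index(axis=1,ascending=False))  # sort the column labels (axis=1) in descending order
# Besides sorting by labels, you can also sort by values
print(df3.sort_values(by='E'))
# Output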
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
Int64Index([0, 1, 2, 3], dtype='int64')
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
[[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']]
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
0 1 2 \
A 1 1 1
B 2013-01-02 00:00:00 2013-01-02 00:00:00 2013-01-02 00:00:00
C 1 1 1
D 3 3 3
E test train test
F foo foo foo
3
A 1
B 2013-01-02 00:00:00
C 1
D 3
E train
F foo
F E D C B A
0 foo test 3 1.0 2013-01-02 1.0
1 foo train 3 1.0 2013-01-02 1.0
2 foo test 3 1.0 2013-01-02 1.0
3 foo train 3 1.0 2013-01-02 1.0
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
2 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
3 1.0 2013-01-02 1.0 3 train foo
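One way to see in advance which columns `describe()` will summarize is to select the numeric columns yourself with `select_dtypes`. The sketch below assumes a reduced version of `df3` (the timestamp column B is left out on purpose, and the name `frame` is hypothetical); it is an illustration, not part of the original notes.

```python
import numpy as np
import pandas as pd

# A reduced version of df3 with numeric, categorical and string columns
frame = pd.DataFrame({'A': 1.,
                      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                      'D': np.array([3] * 4, dtype='int32'),
                      'E': pd.Categorical(["test", "train", "test", "train"]),
                      'F': 'foo'})

# Keep only the columns whose dtype is numeric
numeric = frame.select_dtypes(include='number')
print(list(numeric.columns))

# describe() summarizes the same columns by default
summary = frame.describe()
print(list(summary.columns))
```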
Selecting data in pandas
import numpy as np
import pandas as pd
dates = pd.date_range('20180101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])  # index labels the rows
print(df)
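# Select a single column
print(df['A'],df.A)
print(df[0:3])  # rows at positions 0 through 2 (the end position is excluded)
print(df['20180102':'20180104'])  # rows labeled 2018-01-02 through 2018-01-04 (both ends included)
# Output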
A B C D
2018-01-01 0 1 2 3
2018-01-02 4 5 6 7
2018-01-03 8 9 10 11
2018-01-04 12 13 14 15
2018-01-05 16 17 18 19
2018-01-06 20 21 22 23
2018-01-01 0
2018-01-02 4
2018-01-03 8
2018-01-04 12
2018-01-05 16
2018-01-06 20
Freq: D, Name: A, dtype: int32 2018-01-01 0
2018-01-02 4
2018-01-03 8
2018-01-04 12
2018-01-05 16
2018-01-06 20
Freq: D, Name: A, dtype: int32
A B C D
2018-01-01 0 1 2 3
2018-01-02 4 5 6 7
2018-01-03 8 9 10 11
A B C D
2018-01-02 4 5 6 7
2018-01-03 8 9 10 11
2018-01-04 12 13 14 15
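The two slices above follow different end rules, which is easy to miss: a positional slice stops before its end, while a label slice includes it. The sketch below rebuilds the same frame and uses `iloc` and `loc` to make each rule explicit; the variable names `by_position` and `by_label` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Positional slice: positions 0, 1 and 2; position 3 is excluded
by_position = df.iloc[0:3]

# Label slice: both end labels are included
by_label = df.loc['2018-01-02':'2018-01-04']

print(by_position.index[-1])
print(by_label.index[-1])
```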
#select by label: loc
print(df.loc['20180101'])
print(df.loc[:,(['A','B'])])
print(df.loc['20180101',(['A','B'])])
# Output
A 0
B 1
C 2
D 3
Name: 2018-01-01 00:00:00, dtype: int32
A B
2018-01-01 0 1
2018-01-02 4 5
2018-01-03 8 9
2018-01-04 12 13
2018-01-05 16 17
2018-01-06 20 21
A 0
B 1
Name: 2018-01-01 00:00:00, dtype: int32
#select by position: iloc
print(df.iloc[3])
print(df.iloc[3:5,1:3])
print(df.iloc[[1,3,4],1:3])
# Output
A 12
B 13
C 14
D 15
Name: 2018-01-04 00:00:00, dtype: int32
B C
2018-01-04 13 14
2018-01-05 17 18
B C
2018-01-02 5 6
2018-01-04 13 14
2018-01-05 17 18
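`loc` and `iloc` are two routes to the same data, one by label and one by position. As a rough check under that assumption, the sketch below selects the same columns and the same row both ways and compares the results; the names `cols_by_label`, `cols_by_position`, `row_by_label` and `row_by_position` are hypothetical.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Columns B and C, first by label and then by position
cols_by_label = df.loc[:, ['B', 'C']]
cols_by_position = df.iloc[:, 1:3]
print(cols_by_label.equals(cols_by_position))

# The row for 2018-01-04, first by label and then by position
row_by_label = df.loc[pd.Timestamp('2018-01-04')]
row_by_position = df.iloc[3]
print(row_by_label.equals(row_by_position))
```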
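# select by mixed position and label: .ix was removed in pandas 1.0, so chain iloc with a column-label selection instead
print(df.iloc[:3][['A','C']])
# Boolean indexing: filter rows with a True/False condition
print(df)
print(df[df.A>8])  # keep the rows where column A is greater than 8
# Output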
A C
2018-01-01 0 2
2018-01-02 4 6
2018-01-03 8 10
A B C D
2018-01-01 0 1 2 3
2018-01-02 4 5 6 7
2018-01-03 8 9 10 11
2018-01-04 12 13 14 15
2018-01-05 16 17 18 19
2018-01-06 20 21 22 23
A B C D
2018-01-04 12 13 14 15
2018-01-05 16 17 18 19
2018-01-06 20 21 22 23
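It can help to look at the condition on its own: `df.A > 8` produces a boolean Series, and indexing with it keeps the rows marked True. The sketch below shows that intermediate step and, as an extension not covered in the notes above, combines two conditions with `&`; the names `mask`, `selected` and `both` are made up for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# The condition alone is a boolean Series aligned with the rows
mask = df.A > 8
print(mask)

# Indexing with the mask keeps only the rows where it is True
selected = df[mask]
print(selected)

# Two conditions: wrap each in parentheses and join them with &
both = df[(df.A > 8) & (df.B < 20)]
print(both)
```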