pandas is usually used together with NumPy.
1, Basic attributes
(1)pd.Series (a sequence)
A Series attaches an index label to each value.
Usage
import numpy as np
import pandas as pd
s=pd.Series([1,3,6,np.nan,44,1])
print(s)
Result
0 1.0
1 3.0
2 6.0
3 NaN
4 44.0
5 1.0
dtype: float64
(2)pd.DataFrame
dates=pd.date_range('20220101',periods=6)
df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])
print(df)
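pd.date_range generates six consecutive daily dates here.
np.random.randn(6,4) produces a 6-row, 4-column array of standard-normal random values.
index sets the row labels.
columns sets the column names.
index and columns can also be omitted, in which case the default integer labels (0, 1, 2, ...) are used.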
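To illustrate the default labels, here is a minimal sketch. The name df_default and the np.arange data are assumptions made for this example and do not come from the notes.

```python
import numpy as np
import pandas as pd

# No index or columns given: pandas falls back to integer labels 0, 1, 2, ...
df_default = pd.DataFrame(np.arange(12).reshape(3, 4))

print(df_default)
print(df_default.index)    # a RangeIndex covering 0..2
print(df_default.columns)  # a RangeIndex covering 0..3
```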
Output (the values are random, so they differ on every run)
a b c d
2022-01-01 1.326867 -0.389798 0.186750 1.986374
2022-01-02 0.664617 2.271001 -0.487486 0.421387
2022-01-03 -1.309948 1.147211 -1.573718 1.434102
2022-01-04 0.700662 -1.795573 -0.043631 2.298165
2022-01-05 -0.285750 -0.473514 1.030131 0.104888
2022-01-06 0.887182 1.187699 1.017360 -1.431845
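A DataFrame can also be defined from a dictionary, where each key becomes a column and scalar values are repeated for every row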
df=pd.DataFrame({'A':1.,
'B':pd.Timestamp('20220102'),
'C':pd.Series(1,index=list(range(4)),dtype='float32'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(["test","train","test","train"]),
'F':'foo'})
Result
A B C D E F
0 1.0 2022-01-02 1.0 3 test foo
1 1.0 2022-01-02 1.0 3 train foo
2 1.0 2022-01-02 1.0 3 test foo
3 1.0 2022-01-02 1.0 3 train foo
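Individual attributes can also be retrieved:
df.index returns all row labels
df.columns returns all column names
df.values returns all elements as a NumPy array
df.describe() prints summary statistics of the numeric columns, such as the count, mean, and standard deviation
df.T transposes the table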
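The attributes listed above can be tried on a small table. This is only a sketch: the name df_attr and the np.arange data are assumptions chosen so that the numbers are easy to check by hand.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20220101', periods=6)
df_attr = pd.DataFrame(np.arange(24).reshape(6, 4),
                       index=dates, columns=['a', 'b', 'c', 'd'])

print(df_attr.index)       # row labels (a DatetimeIndex here)
print(df_attr.columns)     # column names
print(df_attr.values)      # the underlying 6x4 NumPy array
print(df_attr.describe())  # count, mean, std, min, quartiles, max per column
print(df_attr.T)           # transposed: 4 rows, 6 columns
```

Note that describe() reports the standard deviation (std), not the variance.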
Sorting
import numpy as np
import pandas as pd
dates=pd.date_range('20220101',periods=6)
df=pd.DataFrame({'A':1.,
'B':pd.Timestamp('20220102'),
'C':pd.Series(1,index=list(range(4)),dtype='float32'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(["test","train","test","train"]),
'F':'foo'})
df2=df.sort_index(axis=1,ascending=False)
print(df2)
df2=df.sort_index(axis=0,ascending=False)
print(df2)
df2=df.sort_values(by='E')
print(df2)
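Result (sorted by column name descending, then by row index descending, then by the values in column E)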
F E D C B A
0 foo test 3 1.0 2022-01-02 1.0
1 foo train 3 1.0 2022-01-02 1.0
2 foo test 3 1.0 2022-01-02 1.0
3 foo train 3 1.0 2022-01-02 1.0
A B C D E F
3 1.0 2022-01-02 1.0 3 train foo
2 1.0 2022-01-02 1.0 3 test foo
1 1.0 2022-01-02 1.0 3 train foo
0 1.0 2022-01-02 1.0 3 test foo
A B C D E F
0 1.0 2022-01-02 1.0 3 test foo
2 1.0 2022-01-02 1.0 3 test foo
1 1.0 2022-01-02 1.0 3 train foo
3 1.0 2022-01-02 1.0 3 train foo
2, Selecting data
Original data
A B C D
2022-01-01 0 1 2 3
2022-01-02 4 5 6 7
2022-01-03 8 9 10 11
2022-01-04 12 13 14 15
2022-01-05 16 17 18 19
2022-01-06 20 21 22 23
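The notes do not show how this table was built. The values match np.arange(24) reshaped to 6 rows and 4 columns with six daily dates as the index, so the following sketch is an assumption about how df could be recreated for the selections below.

```python
import numpy as np
import pandas as pd

# Assumed construction: 0..23 laid out row by row, indexed by six daily dates.
dates = pd.date_range('20220101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])
print(df)
```

The dtype printed in the outputs below (int32) depends on the platform and NumPy version; on other setups it may be int64.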
(1)print(df['A'],df.A) both forms select column A
2022-01-01 0
2022-01-02 4
2022-01-03 8
2022-01-04 12
2022-01-05 16
2022-01-06 20
Freq: D, Name: A, dtype: int32 2022-01-01 0
2022-01-02 4
2022-01-03 8
2022-01-04 12
2022-01-05 16
2022-01-06 20
Freq: D, Name: A, dtype: int32
(2)print(df[0:3],df['20220102':'20220104'])
A B C D
2022-01-01 0 1 2 3
2022-01-02 4 5 6 7
2022-01-03 8 9 10 11 A B C D
2022-01-02 4 5 6 7
2022-01-03 8 9 10 11
2022-01-04 12 13 14 15
(3)print(df.loc['20220102'])
A 4
B 5
C 6
D 7
Name: 2022-01-02 00:00:00, dtype: int32
(4)print(df.loc[:,['A','B']])
A B
2022-01-01 0 1
2022-01-02 4 5
2022-01-03 8 9
2022-01-04 12 13
2022-01-05 16 17
2022-01-06 20 21
(5)print(df.iloc[3,1])
13
(6)print(df.iloc[3:5,1:3]) slicing by position
B C
2022-01-04 13 14
2022-01-05 17 18
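(7)print(df.loc[df.index[:3],['A','B']]) mixes position-based rows with label-based columns; the older form df.ix[:3,['A','B']] no longer works because .ix was removed in pandas 1.0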
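Because df.ix is not available in pandas 1.0 and later, the mixed selection in item (7) has to be written with loc or iloc. The sketch below shows two possible rewrites; the variable names are made up for this example, and df is rebuilt under the assumption that it was created with np.arange.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20220101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Option 1: turn the row positions into labels, then select everything by label.
first_three_by_label = df.loc[df.index[:3], ['A', 'B']]

# Option 2: turn the column labels into positions, then select everything by position.
first_three_by_pos = df.iloc[:3, df.columns.get_indexer(['A', 'B'])]

print(first_three_by_label)
```

Both options return the first three rows of columns A and B.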
(8)print(df[df.A>8]) keeps only the rows where column A is greater than 8
A B C D
2022-01-04 12 13 14 15
2022-01-05 16 17 18 19
2022-01-06 20 21 22 23
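Boolean filtering as in item (8) works in two steps: the comparison produces a True/False Series, and indexing with that Series keeps the True rows. A minimal sketch, assuming df was built with np.arange as below (mask, filtered, and both are names invented for this example):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20220101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])

# Step 1: the comparison yields one boolean per row.
mask = df['A'] > 8

# Step 2: indexing with the mask keeps the rows where it is True (A = 12, 16, 20).
filtered = df[mask]

# Conditions can be combined with & (and) or | (or); each needs its own parentheses.
both = df[(df['A'] > 8) & (df['D'] < 23)]

print(filtered)
print(both)
```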
3, Import and export
pd.read_…… (import)
df.to_…… (export)
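As one possible instance of the read_/to_ pair, the sketch below round-trips a small table through CSV text. Choosing CSV is an assumption made for this example (the notes do not name a format), and df_out, csv_text, and df_in are illustrative names.

```python
import io

import numpy as np
import pandas as pd

df_out = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

# Export: with no path argument, to_csv returns the CSV text as a string.
# index=False leaves the row labels out of the file.
csv_text = df_out.to_csv(index=False)

# Import: read_csv accepts a file path or any file-like object.
df_in = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(df_in)
```

In practice a file path such as a .csv file name would normally be passed to both calls instead of the in-memory string.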
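4, Combining
Concatenation (concat)
pd.concat([,,,……],axis=,ignore_index=True/False) ignore_index=True renumbers the resulting index from 0
pd.concat([……],join='outer'/'inner') 'outer' keeps all columns / 'inner' keeps only the columns the tables share
pd.concat([……],axis=1).reindex(a.index) keeps only the rows in a's index (older pandas wrote this as join_axes=[a.index]; that parameter was removed in pandas 1.0)
pd.concat([a,b,c]) concatenates a, b, and c (older pandas wrote this as a.append([b,c]); DataFrame.append was removed in pandas 2.0)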
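The concat options above can be compared on two small tables. This is a sketch under assumed data: the frames a, b, and b2 and their column names are invented for the example.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.zeros((2, 3)), columns=['x', 'y', 'z'])
b = pd.DataFrame(np.ones((2, 3)), columns=['y', 'z', 'w'])

# Stack rows; ignore_index=True renumbers them 0..3.
# The default join='outer' keeps every column and fills the gaps with NaN.
stacked = pd.concat([a, b], axis=0, ignore_index=True)

# join='inner' keeps only the columns both tables share (y and z).
common = pd.concat([a, b], join='inner', ignore_index=True)

# Side-by-side concat restricted to a's row labels; this stands in for
# the join_axes parameter that older pandas versions offered.
b2 = pd.DataFrame(np.ones((3, 2)), index=[1, 2, 3], columns=['p', 'q'])
side = pd.concat([a, b2], axis=1).reindex(a.index)

print(stacked)
print(common)
print(side)
```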
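Merging (merge)
pd.merge(left,right,on=['key1','key2'],how=……) how is one of 'left','right','outer','inner'
suffixes=['_boy','_girl'] distinguishes columns that have the same name in both tables by marking which table each one came from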
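A small merge shows how on, how, and suffixes interact. The tables left and right and their contents are assumptions invented for this sketch.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1'],
                     'key2': ['K0', 'K1', 'K0'],
                     'age': [1, 2, 3]})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0'],
                      'age': [4, 5, 6]})

# 'inner' keeps only key pairs present in both tables.
# Both tables have an 'age' column, so the suffixes tell them apart.
inner = pd.merge(left, right, on=['key1', 'key2'], how='inner',
                 suffixes=('_boy', '_girl'))

# 'outer' keeps every key pair from either table and fills the gaps with NaN.
outer = pd.merge(left, right, on=['key1', 'key2'], how='outer',
                 suffixes=('_boy', '_girl'))

print(inner)
print(outer)
```

With how='left' or how='right', all key pairs from that one table are kept instead.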