1. MultiIndex
import numpy as np
import pandas as pd

# Build a two-level row index from pairs of labels
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
# stack() moves the column labels into a new innermost level of the row index
stacked = df.stack()
print(stacked.index)
MultiIndex([('bar', 'one', 'A'),
('bar', 'one', 'B'),
('bar', 'two', 'A'),
('bar', 'two', 'B'),
('baz', 'one', 'A'),
('baz', 'one', 'B'),
('baz', 'two', 'A'),
('baz', 'two', 'B'),
('foo', 'one', 'A'),
('foo', 'one', 'B'),
('foo', 'two', 'A'),
('foo', 'two', 'B'),
('qux', 'one', 'A'),
('qux', 'one', 'B'),
('qux', 'two', 'A'),
('qux', 'two', 'B')],
names=['first', 'second', None])
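As a complement to the stacking step above, here is a minimal sketch of the round trip, assuming a smaller frame with fixed integer values (the names `small`, `stacked_small` and `restored` are made up for this illustration). It shows that `unstack()` moves the innermost index level back into the columns.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with a two-level row index
index = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second']
)
small = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['A', 'B'])

stacked_small = small.stack()        # Series whose index now has three levels
restored = stacked_small.unstack()   # innermost level goes back to the columns

print(stacked_small.index.nlevels)
print(restored)
```

By default `unstack()` works on the last index level, which is the one `stack()` created.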
2. Pivot tables
Look at only part of the data
df = pd.DataFrame({'A': ['one', 'one', 'two', 'thee']*3,
'B': ['A', 'B', 'C']*4,
'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar']*2,
'D': np.random.randn(12),
'E': np.random.randn(12)
})
print(df)
A B C D E
0 one A foo 1.383200 -0.146617
1 one B foo 0.055231 -0.654110
2 two C foo -0.341115 0.623954
3 thee A bar 2.878976 0.833007
4 one B bar -0.690180 1.630808
5 one C bar 0.686234 0.954150
6 two A foo 0.304506 1.406060
7 thee B foo -0.105632 0.320874
8 one C foo -0.915833 0.175327
9 one A bar -0.392937 -0.310135
10 two B bar -0.714334 0.831486
11 thee C bar 0.808777 -2.476674
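# Use A and B as the row index and C as the column index, showing the values of D
print(df.pivot_table(values=['D'], index=['A', 'B'], columns=['C']))
A and B are now the row index. Find the row with A = one and B = A, then the bar column under C; the value there is -0.392937, which is row 9 of df.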
                D
C             bar       foo
A    B
one  A  -0.392937  1.383200
     B  -0.690180  0.055231
     C   0.686234 -0.915833
thee A   2.878976       NaN
     B        NaN -0.105632
     C   0.808777       NaN
two  A        NaN  0.304506
     B  -0.714334       NaN
     C        NaN -0.341115
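The lookup described above can also be done in code. This is a minimal sketch on a hypothetical four-row frame (`toy` and its values are invented for the illustration): the row key is the (A, B) pair and the column key is the (values name, C) pair. Combinations that do not occur in the data come back as NaN.

```python
import pandas as pd

# Hypothetical data: each (A, B, C) combination appears at most once
toy = pd.DataFrame({
    'A': ['one', 'one', 'two', 'two'],
    'B': ['A', 'B', 'A', 'B'],
    'C': ['foo', 'bar', 'foo', 'foo'],
    'D': [1.0, 2.0, 3.0, 4.0],
})
table = toy.pivot_table(values=['D'], index=['A', 'B'], columns=['C'])

# Row key is (A, B); column key is ('D', C)
value = table.loc[('one', 'B'), ('D', 'bar')]
missing = table.loc[('two', 'A'), ('D', 'bar')]

print(value)             # 2.0
print(pd.isna(missing))  # True, because there is no (two, A, bar) row
```

When several rows share the same (A, B, C) combination, `pivot_table` aggregates them, using the mean by default.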
3. Time series
rng = pd.date_range('20200601', periods=600, freq='s')
s = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
print(s.resample('2Min').sum())  # resample into 2-minute bins and sum each bin
2020-06-01 00:00:00 32180
2020-06-01 00:02:00 30638
2020-06-01 00:04:00 31144
2020-06-01 00:06:00 28279
2020-06-01 00:08:00 29429
Freq: 2T, dtype: int32
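Because the series above is random, the sums are hard to check by eye. A minimal sketch with a constant series (the names `ones` and `counts` are made up for this illustration) makes the binning visible: 600 one-second samples fall into five 2-minute bins of 120 samples each.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('20200601', periods=600, freq='s')
ones = pd.Series(np.ones(len(rng), dtype=int), index=rng)

# Each 2-minute bin covers 120 one-second samples
counts = ones.resample('2min').sum()
print(counts)
```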
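to_timestamp(): converts a period-indexed series to a timestamp index
pd.Timestamp('20181020') - pd.Timestamp('20180920'): date arithmetic; the result is a Timedelta of 30 days
pd.Timestamp('20181020') + pd.Timedelta(days=5): the date five days later, 2018-10-25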
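The three operations above can be tried together. This is a minimal sketch; the monthly period series `ps` is a hypothetical example, chosen only to give `to_timestamp()` something to convert.

```python
import pandas as pd

# Subtracting two timestamps gives a Timedelta
delta = pd.Timestamp('20181020') - pd.Timestamp('20180920')
# Adding a Timedelta to a timestamp gives a new Timestamp
later = pd.Timestamp('20181020') + pd.Timedelta(days=5)

# Hypothetical series indexed by monthly periods
periods = pd.period_range('2018-01', periods=3, freq='M')
ps = pd.Series([1, 2, 3], index=periods)
ts = ps.to_timestamp()  # index becomes timestamps at the start of each period

print(delta.days)   # 30
print(later)
print(ts.index)
```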
4. Categorical data
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], "raw_grade": ['a', 'b', 'b', 'a', 'a', 'd']})
df['grade'] = df.raw_grade.astype('category')  # add a categorical column with the same values as raw_grade
print(df)
id raw_grade grade
0 1 a a
1 2 b b
2 3 b b
3 4 a a
4 5 a a
5 6 d d
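print(df.grade.cat.categories)  # list the categories
# Rename the categories; direct assignment to .cat.categories was removed in pandas 2.0
df['grade'] = df.grade.cat.rename_categories(['very good', 'good', 'bad'])
print(df)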
id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 d bad
df.sort_values(by='grade', ascending=True)  # sort by grade, following the category order
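Sorting a categorical column follows the order of its categories, not alphabetical order. The sketch below rebuilds the same small frame under a different name (`grades`, made up for this illustration) and compares the two orderings.

```python
import pandas as pd

grades = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                       'raw_grade': ['a', 'b', 'b', 'a', 'a', 'd']})
grades['grade'] = grades['raw_grade'].astype('category')
grades['grade'] = grades['grade'].cat.rename_categories(['very good', 'good', 'bad'])

# Category order: 'very good' < 'good' < 'bad'
by_category = grades.sort_values(by='grade')['grade'].tolist()
# Plain strings sort alphabetically instead
by_text = grades['grade'].astype(str).sort_values().tolist()

print(by_category)
print(by_text)
```

Note that `sort_values` returns a new DataFrame and leaves the original unchanged unless the result is assigned back.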
import matplotlib.pyplot as plt

s = pd.Series(np.random.randn(1000), index=pd.date_range('20000101', periods=1000))
s = s.cumsum()  # cumulative sum, giving a random walk
plt.plot(s)  # plot the series
plt.show()
5. Reading and writing data
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df.to_csv('data.csv')  # write to disk
pd.read_csv('data.csv')  # read the data back
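One detail of the round trip above is worth a sketch: `to_csv` also writes the row index, so a plain `read_csv` returns it as an extra unnamed column. Passing `index_col=0` restores it as the index. The temporary path and the names `out`, `plain` and `restored` are assumptions made for this illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

out = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))
path = os.path.join(tempfile.mkdtemp(), 'data.csv')  # hypothetical location
out.to_csv(path)

plain = pd.read_csv(path)                  # index comes back as a data column
restored = pd.read_csv(path, index_col=0)  # first column used as the index

print(plain.columns.tolist())
print(restored.columns.tolist())
```

Alternatively, `out.to_csv(path, index=False)` skips writing the index in the first place.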