1.pandas shift
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(16).reshape(4,4),columns=['AA','BB','CC','DD'],index =['a','b','c','d'])
df
| AA | BB | CC | DD |
---|
a | 0 | 1 | 2 | 3 |
---|
b | 4 | 5 | 6 | 7 |
---|
c | 8 | 9 | 10 | 11 |
---|
d | 12 | 13 | 14 | 15 |
---|
1.1 上下移动
df.shift(1)
| AA | BB | CC | DD |
---|
a | NaN | NaN | NaN | NaN |
---|
b | 0.0 | 1.0 | 2.0 | 3.0 |
---|
c | 4.0 | 5.0 | 6.0 | 7.0 |
---|
d | 8.0 | 9.0 | 10.0 | 11.0 |
---|
df.shift(-1)
| AA | BB | CC | DD |
---|
a | 4.0 | 5.0 | 6.0 | 7.0 |
---|
b | 8.0 | 9.0 | 10.0 | 11.0 |
---|
c | 12.0 | 13.0 | 14.0 | 15.0 |
---|
d | NaN | NaN | NaN | NaN |
---|
1.2 左右移动
df.shift(1,axis = 1)
| AA | BB | CC | DD |
---|
a | NaN | 0.0 | 1.0 | 2.0 |
---|
b | NaN | 4.0 | 5.0 | 6.0 |
---|
c | NaN | 8.0 | 9.0 | 10.0 |
---|
d | NaN | 12.0 | 13.0 | 14.0 |
---|
df.shift(-2,axis = 1)
| AA | BB | CC | DD |
---|
a | 2.0 | 3.0 | NaN | NaN |
---|
b | 6.0 | 7.0 | NaN | NaN |
---|
c | 10.0 | 11.0 | NaN | NaN |
---|
d | 14.0 | 15.0 | NaN | NaN |
---|
1.3 groupby 之后再shift,(分组之后再shift)
df = pd.DataFrame({'id':[1,1,1,2,2,3],'value':[1,2,3,4,5,6]})
df
| id | value |
---|
0 | 1 | 1 |
---|
1 | 1 | 2 |
---|
2 | 1 | 3 |
---|
3 | 2 | 4 |
---|
4 | 2 | 5 |
---|
5 | 3 | 6 |
---|
for k,v in df.groupby(['id'])['value']:
print(k)
print(v)
1
0 1
1 2
2 3
Name: value, dtype: int64
2
3 4
4 5
Name: value, dtype: int64
3
5 6
Name: value, dtype: int64
df.groupby(['id'])['value'].shift(1)
0 NaN
1 1.0
2 2.0
3 NaN
4 4.0
5 NaN
Name: value, dtype: float64
2.pandas rolling
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
df
-
相当于统计的但是窗口为3的期间内,的和
-
window:表示时间窗的大小,注意有两种形式(int or offset)。如果使用int,则数值表示计算统计量的观测值的数量即向前几个数据。如果是offset类型,表示时间窗的大小。pandas offset相关可以参考这里。
-
min_periods:最少需要有值的观测点的数量,对于int类型,默认与window相等。对于offset类型,默认为1。
-
center:是否使用window的中间值作为label,默认为false。只能在window是int时使用。
df.rolling(3, min_periods=1).sum()
df.rolling(3, min_periods=2).sum()
- center相当于是滑动窗口的位置再最中间,然后来统计结果
df.rolling(3, min_periods=1, center=True).sum()
df = pd.DataFrame({'1': ['A1', 'A2', 'A1', 'A2', 'A2', 'A1', 'A2'],
'2': ['B1', 'B1', 'B1', 'B1', 'B1', 'B1', 'B1'],
'num': [1,2,1,3,4,2,1]},
index = [pd.Timestamp('20130101 09:00:00'),
pd.Timestamp('20130101 09:00:01'),
pd.Timestamp('20130101 09:00:02'),
pd.Timestamp('20130101 09:00:03'),
pd.Timestamp('20130101 09:00:04'),
pd.Timestamp('20130101 09:00:05'),
pd.Timestamp('20130101 09:00:06')])
df
| 1 | 2 | num |
---|
2013-01-01 09:00:00 | A1 | B1 | 1 |
---|
2013-01-01 09:00:01 | A2 | B1 | 2 |
---|
2013-01-01 09:00:02 | A1 | B1 | 1 |
---|
2013-01-01 09:00:03 | A2 | B1 | 3 |
---|
2013-01-01 09:00:04 | A2 | B1 | 4 |
---|
2013-01-01 09:00:05 | A1 | B1 | 2 |
---|
2013-01-01 09:00:06 | A2 | B1 | 1 |
---|
for k,v in df.groupby(['1', '2'])['num']:
print(k)
print(v)
('A1', 'B1')
2013-01-01 09:00:00 1
2013-01-01 09:00:02 1
2013-01-01 09:00:05 2
Name: num, dtype: int64
('A2', 'B1')
2013-01-01 09:00:01 2
2013-01-01 09:00:03 3
2013-01-01 09:00:04 4
2013-01-01 09:00:06 1
Name: num, dtype: int64
df.groupby(['1', '2'])['num'].rolling('3s').sum()
1 2
A1 B1 2013-01-01 09:00:00 1.0
2013-01-01 09:00:02 2.0
2013-01-01 09:00:05 2.0
A2 B1 2013-01-01 09:00:01 2.0
2013-01-01 09:00:03 5.0
2013-01-01 09:00:04 7.0
2013-01-01 09:00:06 5.0
Name: num, dtype: float64
3.pandas rank
list1 = [1, 3, 1,0,7,4,0]
list2 = [3, 3, 2,0,4,4,5]
list3 = [3, 3, 3,3,4,4,6]
df1 = pd.DataFrame({'col1':list1,'col2':list2,'col3':list3})
print(df1)
col1 col2 col3
0 1 3 3
1 3 3 3
2 1 2 3
3 0 0 3
4 7 4 4
5 4 4 4
6 0 5 6
df1=df1.set_index('col2')
df1
| col1 | col3 |
---|
col2 | | |
---|
3 | 1 | 3 |
---|
3 | 3 | 3 |
---|
2 | 1 | 3 |
---|
0 | 0 | 3 |
---|
4 | 7 | 4 |
---|
4 | 4 | 4 |
---|
5 | 0 | 6 |
---|
for k,v in df1.groupby(['col3']):
print(k)
print(v)
3
col1 col3
col2
3 1 3
3 3 3
2 1 3
0 0 3
4
col1 col3
col2
4 7 4
4 4 4
6
col1 col3
col2
5 0 6
- 就是看分组了之后,这个数在这个组里面排第几,是按从小到大排列还是从大到小排列
df1 = df1.groupby(['col3']).rank(method='min',ascending=False)
df1
| col1 |
---|
col2 | |
---|
3 | 2.0 |
---|
3 | 1.0 |
---|
2 | 2.0 |
---|
0 | 4.0 |
---|
4 | 1.0 |
---|
4 | 2.0 |
---|
5 | 1.0 |
---|
df1 = df1.groupby(['col3']).rank(method='max',ascending=False)
df1
| col1 |
---|
col2 | |
---|
3 | 3.0 |
---|
3 | 1.0 |
---|
2 | 3.0 |
---|
0 | 4.0 |
---|
4 | 1.0 |
---|
4 | 2.0 |
---|
5 | 1.0 |
---|