如果我理解正确:
对于(1)执行以下操作:
通过从你给出的数值、随机日期和访问次数中进行抽样,制作一些假数据:In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')
In [180]: visits = Series(poisson(1000, size=100), name='date')
In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')
In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})
In [183]: df.head()
Out[183]:
date string visits
0 2001-11-15 00:00:00 current 997
1 2001-11-15 00:00:00 current 974
2 2012-10-02 00:00:00 stem 982
3 2001-12-01 00:00:00 stem 984
4 2001-01-01 00:00:00 current 989
In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')
In [187]: resamp.head()
Out[187]:
visits
string date
current 2001-01-31 2996
2001-02-28 NaN
2001-03-31 NaN
2001-04-30 NaN
2001-05-31 3016
之所以有NaN是因为在那几个月里没有使用该查询字符串的访问。
对于(2),按日期分组,然后除以和:In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())
In [189]: g.head()
Out[189]:
visits
string date
current 2001-01-31 0.177