Group daily data by month in Python / pandas, then normalize

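This post shows how to use a pandas DataFrame in Python to group data containing dates and visit counts by month and then normalize it. It first builds sample data, then demonstrates grouping by month and summing with resample, and normalizing by dividing each group by its total with groupby and apply. It also shows how to drop missing values and reorder columns.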

If I understand you correctly:

For (1), do this:

Make some fake data by sampling from the values you provided, plus some random dates and visit counts:

In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')

In [180]: visits = Series(poisson(1000, size=100), name='visits')

In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')

In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})

In [183]: df.head()
Out[183]:
                  date   string  visits
0  2001-11-15 00:00:00  current     997
1  2001-11-15 00:00:00  current     974
2  2012-10-02 00:00:00     stem     982
3  2001-12-01 00:00:00     stem     984
4  2001-01-01 00:00:00  current     989
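The session above relies on names that were already in the interactive namespace (`Series`, `poisson`, `now`, and an existing `df`). As a minimal self-contained sketch of the same idea, the block below builds comparable fake data with explicit imports. The seed, the exact list of dates, and the use of `numpy.random.default_rng` are assumptions made for illustration; the query strings are the ones that appear in the outputs of this post.

```python
import numpy as np
import pandas as pd

# Fixed seed so the fake data is reproducible (an assumption, not in the original).
rng = np.random.default_rng(0)

# Query strings as they appear in the outputs of this post.
strings = ['current', 'molecular', 'neuron', 'nucleus', 'stem']

# A handful of dates to sample from (illustrative values).
dates = pd.to_datetime(
    ['2001-01-01', '2001-05-01', '2001-11-15', '2001-12-01', '2012-10-02']
)

n = 100
df = pd.DataFrame({
    'date': rng.choice(dates.to_numpy(), size=n),    # random dates
    'string': rng.choice(strings, size=n),           # random query strings
    'visits': rng.poisson(1000, size=n),             # Poisson visit counts around 1000
})

print(df.head())
```

Because each column is sampled independently, several rows share the same date and query string, which is what makes the later monthly sum meaningful.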

In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')

In [187]: resamp.head()
Out[187]:
                    visits
string  date
current 2001-01-31    2996
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31    3016

The NaN values appear because that query string had no visits in those months.
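The `how='sum'` argument used above comes from an older pandas release; as far as I know, recent releases expect the aggregation as a method call instead. One alternative that avoids `resample` entirely is to group on the query string together with a monthly period derived from the date. The sketch below uses a tiny hand-made frame (the values are made up for illustration) so the result is easy to check by eye.

```python
import pandas as pd

# Tiny illustrative data set: two query strings, two months.
df = pd.DataFrame({
    'date': pd.to_datetime(
        ['2001-01-01', '2001-01-20', '2001-05-01', '2001-01-03', '2001-05-07']
    ),
    'string': ['current', 'current', 'current', 'stem', 'stem'],
    'visits': [10, 20, 5, 30, 15],
})

# Map every daily timestamp to its calendar month.
month = df['date'].dt.to_period('M').rename('month')

# Sum visits per (query string, month).
monthly = df.groupby(['string', month])['visits'].sum()

print(monthly)
```

Unlike the resampled result above, this only contains the (string, month) pairs that actually occur in the data, so there are no NaN rows for empty months to drop afterwards.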

For (2), group by date and then divide by the sum:

In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())

In [189]: g.head()
Out[189]:
                    visits
string  date
current 2001-01-31   0.177
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31   0.188
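The normalization step divides each monthly total by the sum over all query strings for the same month. A sketch of the same computation using `transform('sum')` instead of `apply` is shown below; `transform` returns a result aligned to the original index, which sidesteps differences between pandas versions in how `apply` labels its output. The data values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(
        ['2001-01-01', '2001-01-20', '2001-05-01', '2001-01-03', '2001-05-07']
    ),
    'string': ['current', 'current', 'current', 'stem', 'stem'],
    'visits': [10, 20, 5, 30, 15],
})

# Monthly totals per query string.
month = df['date'].dt.to_period('M').rename('month')
monthly = df.groupby(['string', month])['visits'].sum()

# Total visits per month, broadcast back onto every (string, month) row.
month_total = monthly.groupby(level='month').transform('sum')

# Share of each query string within its month.
share = monthly / month_total

print(share)
```

Here January has 60 visits in total, split 30/30, and May has 20, split 5/15, so the shares are 0.5 and 0.5 for January and 0.25 and 0.75 for May.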

Just to convince you that (2) does what you want:

In [176]: h = g.sortlevel('date').head()

In [177]: h
Out[177]:
                      visits
string    date
current   2001-01-31   0.077
molecular 2001-01-31   0.228
neuron    2001-01-31   0.073
nucleus   2001-01-31   0.234
stem      2001-01-31   0.388

In [178]: h.sum()
Out[178]:
visits    1
dtype: float64
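The same check can be done by hand with the January 2001 totals from the `resamp.dropna()` output in this post (2996, 3984, 3961, 3060, 2911). The sketch below divides each by their sum and confirms that the shares match the January values of `g` and add up to 1. The values in `h` above differ from these, presumably because that part of the session (note the earlier `In [176]` prompt) was run on a different random sample.

```python
import pandas as pd

# January 2001 monthly totals, copied from the resamp.dropna() output in this post.
jan = pd.Series(
    {'current': 2996, 'molecular': 3984, 'neuron': 3961,
     'nucleus': 3060, 'stem': 2911},
    name='visits',
)

# Each query string's share of all January visits.
share = jan / jan.sum()

print(share.round(3))
print(share.sum())
```

Rounded to three decimals the shares are 0.177, 0.236, 0.234, 0.181 and 0.172, which are the January values shown for `g`.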

If you want to drop the NaN rows from resamp and turn it into a flat DataFrame, do the following:

In [196]: resamp.dropna()
Out[196]:
                      visits
string    date
current   2001-01-31    2996
          2001-05-31    3016
          2001-11-30    5959
          2001-12-31    3998
          2013-09-30    1077
molecular 2001-01-31    3984
          2001-05-31    1911
          2001-11-30    3054
          2001-12-31    1020
          2012-10-31     977
          2013-09-30    1947
neuron    2001-01-31    3961
          2001-05-31    2069
          2001-11-30    5010
          2001-12-31    2065
          2012-10-31    6973
          2013-09-30     994
nucleus   2001-01-31    3060
          2001-05-31    3035
          2001-11-30    2924
          2001-12-31    4144
          2012-10-31    2004
          2013-09-30    7881
stem      2001-01-31    2911
          2001-05-31    5994
          2001-11-30    6072
          2001-12-31    4916
          2012-10-31    1991
          2013-09-30    3977

In [197]: resamp.dropna().reset_index()
Out[197]:
       string                 date  visits
 0    current  2001-01-31 00:00:00    2996
 1    current  2001-05-31 00:00:00    3016
 2    current  2001-11-30 00:00:00    5959
 3    current  2001-12-31 00:00:00    3998
 4    current  2013-09-30 00:00:00    1077
 5  molecular  2001-01-31 00:00:00    3984
 6  molecular  2001-05-31 00:00:00    1911
 7  molecular  2001-11-30 00:00:00    3054
 8  molecular  2001-12-31 00:00:00    1020
 9  molecular  2012-10-31 00:00:00     977
10  molecular  2013-09-30 00:00:00    1947
11     neuron  2001-01-31 00:00:00    3961
12     neuron  2001-05-31 00:00:00    2069
13     neuron  2001-11-30 00:00:00    5010
14     neuron  2001-12-31 00:00:00    2065
15     neuron  2012-10-31 00:00:00    6973
16     neuron  2013-09-30 00:00:00     994
17    nucleus  2001-01-31 00:00:00    3060
18    nucleus  2001-05-31 00:00:00    3035
19    nucleus  2001-11-30 00:00:00    2924
20    nucleus  2001-12-31 00:00:00    4144
21    nucleus  2012-10-31 00:00:00    2004
22    nucleus  2013-09-30 00:00:00    7881
23       stem  2001-01-31 00:00:00    2911
24       stem  2001-05-31 00:00:00    5994
25       stem  2001-11-30 00:00:00    6072
26       stem  2001-12-31 00:00:00    4916
27       stem  2012-10-31 00:00:00    1991
28       stem  2013-09-30 00:00:00    3977
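To make the two steps explicit, here is a small sketch of `dropna()` followed by `reset_index()` on a hand-built frame with the same (string, date) MultiIndex. The three rows are taken from the outputs above; building the frame directly with `pd.MultiIndex.from_tuples` is only for illustration.

```python
import numpy as np
import pandas as pd

# Three rows copied from the outputs above, one of them an empty month (NaN).
idx = pd.MultiIndex.from_tuples(
    [('current', pd.Timestamp('2001-01-31')),
     ('current', pd.Timestamp('2001-02-28')),
     ('stem', pd.Timestamp('2001-01-31'))],
    names=['string', 'date'],
)
resamp = pd.DataFrame({'visits': [2996, np.nan, 2911]}, index=idx)

# dropna() removes the empty month; reset_index() turns the two index
# levels back into ordinary columns and renumbers the rows from 0.
flat = resamp.dropna().reset_index()

print(flat)
```

The index levels come first in the flattened frame, in level order, which is why the columns appear as string, date, visits.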

You can of course do the same for g:

In [198]: g.dropna()
Out[198]:
                      visits
string    date
current   2001-01-31   0.177
          2001-05-31   0.188
          2001-11-30   0.259
          2001-12-31   0.248
          2013-09-30   0.068
molecular 2001-01-31   0.236
          2001-05-31   0.119
          2001-11-30   0.133
          2001-12-31   0.063
          2012-10-31   0.082
          2013-09-30   0.123
neuron    2001-01-31   0.234
          2001-05-31   0.129
          2001-11-30   0.218
          2001-12-31   0.128
          2012-10-31   0.584
          2013-09-30   0.063
nucleus   2001-01-31   0.181
          2001-05-31   0.189
          2001-11-30   0.127
          2001-12-31   0.257
          2012-10-31   0.168
          2013-09-30   0.496
stem      2001-01-31   0.172
          2001-05-31   0.374
          2001-11-30   0.264
          2001-12-31   0.305
          2012-10-31   0.167
          2013-09-30   0.251

In [199]: g.dropna().reset_index()
Out[199]:
       string                 date  visits
 0    current  2001-01-31 00:00:00   0.177
 1    current  2001-05-31 00:00:00   0.188
 2    current  2001-11-30 00:00:00   0.259
 3    current  2001-12-31 00:00:00   0.248
 4    current  2013-09-30 00:00:00   0.068
 5  molecular  2001-01-31 00:00:00   0.236
 6  molecular  2001-05-31 00:00:00   0.119
 7  molecular  2001-11-30 00:00:00   0.133
 8  molecular  2001-12-31 00:00:00   0.063
 9  molecular  2012-10-31 00:00:00   0.082
10  molecular  2013-09-30 00:00:00   0.123
11     neuron  2001-01-31 00:00:00   0.234
12     neuron  2001-05-31 00:00:00   0.129
13     neuron  2001-11-30 00:00:00   0.218
14     neuron  2001-12-31 00:00:00   0.128
15     neuron  2012-10-31 00:00:00   0.584
16     neuron  2013-09-30 00:00:00   0.063
17    nucleus  2001-01-31 00:00:00   0.181
18    nucleus  2001-05-31 00:00:00   0.189
19    nucleus  2001-11-30 00:00:00   0.127
20    nucleus  2001-12-31 00:00:00   0.257
21    nucleus  2012-10-31 00:00:00   0.168
22    nucleus  2013-09-30 00:00:00   0.496
23       stem  2001-01-31 00:00:00   0.172
24       stem  2001-05-31 00:00:00   0.374
25       stem  2001-11-30 00:00:00   0.264
26       stem  2001-12-31 00:00:00   0.305
27       stem  2012-10-31 00:00:00   0.167
28       stem  2013-09-30 00:00:00   0.251

Finally, if you want to put the columns in a different order, use reindex:

In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
Out[210]:
    visits     string                 date
 0   0.177    current  2001-01-31 00:00:00
 1   0.188    current  2001-05-31 00:00:00
 2   0.259    current  2001-11-30 00:00:00
 3   0.248    current  2001-12-31 00:00:00
 4   0.068    current  2013-09-30 00:00:00
 5   0.236  molecular  2001-01-31 00:00:00
 6   0.119  molecular  2001-05-31 00:00:00
 7   0.133  molecular  2001-11-30 00:00:00
 8   0.063  molecular  2001-12-31 00:00:00
 9   0.082  molecular  2012-10-31 00:00:00
10   0.123  molecular  2013-09-30 00:00:00
11   0.234     neuron  2001-01-31 00:00:00
12   0.129     neuron  2001-05-31 00:00:00
13   0.218     neuron  2001-11-30 00:00:00
14   0.128     neuron  2001-12-31 00:00:00
15   0.584     neuron  2012-10-31 00:00:00
16   0.063     neuron  2013-09-30 00:00:00
17   0.181    nucleus  2001-01-31 00:00:00
18   0.189    nucleus  2001-05-31 00:00:00
19   0.127    nucleus  2001-11-30 00:00:00
20   0.257    nucleus  2001-12-31 00:00:00
21   0.168    nucleus  2012-10-31 00:00:00
22   0.496    nucleus  2013-09-30 00:00:00
23   0.172       stem  2001-01-31 00:00:00
24   0.374       stem  2001-05-31 00:00:00
25   0.264       stem  2001-11-30 00:00:00
26   0.305       stem  2001-12-31 00:00:00
27   0.167       stem  2012-10-31 00:00:00
28   0.251       stem  2013-09-30 00:00:00
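As a small sketch of the column reordering, the block below applies `reindex(columns=...)` to a two-row frame (values copied from the output above) and compares it with plain column selection by list, which gives the same result when every requested column exists.

```python
import pandas as pd

# Two rows copied from the flattened output above.
flat = pd.DataFrame({
    'string': ['current', 'stem'],
    'date': pd.to_datetime(['2001-01-31', '2001-01-31']),
    'visits': [0.177, 0.172],
})

new_order = ['visits', 'string', 'date']

# Reorder with reindex, as in the session above.
reordered = flat.reindex(columns=new_order)

# Equivalent here: select the columns in the desired order.
selected = flat[new_order]

print(reordered)
```

One difference worth knowing: if a name in the list is misspelled, `reindex` quietly adds a column full of NaN, whereas selecting with `flat[new_order]` raises a `KeyError`.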
