Python group by month: filtering a DataFrame in Pandas by hour, day and month, grouped by year...

Being new to Pandas, I had to dig a lot to find a solution to this problem. I would like to know whether there is a better way to solve it, taking into account that I still need to handle the boundary cases.

I have a set of 10-minute measurements of "Power" from 2009 to 2012 and want to select a window of hours and a window of days around a given day and month, for every year (i.e. filter by hour, day and month, grouped by year).

What I have come up with is as follows:

import pandas as pd
import numpy as np
import datetime

dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

def filter(df, day, month, hour, daysWindow, hoursWindow):
    """
    Filter a DataFrame by a date window and hour window grouped by years

    @type df: DataFrame
    @param df: DataFrame with dates and values
    @type day: int
    @param day: Day to focus on
    @type month: int
    @param month: Month to focus on
    @type hour: int
    @param hour: Hour to focus on
    @type daysWindow: int
    @param daysWindow: Number of days to perform the days window selection
    @type hoursWindow: int
    @param hoursWindow: Number of hours to perform the hours window selection
    @rtype: DataFrame
    @return: Returns a DataFrame with the filtered rows
    """
    df_filtered = None
    grouped = df.groupby(lambda x : x.year)
    for year, groupYear in grouped:
        groupedMonthDay = groupYear.groupby(lambda x : (x.month, x.day))
        for monthDay, groupMonthDay in groupedMonthDay:
            # Tuple comparison on (month, day): only valid while day +/- daysWindow stays inside the month
            if monthDay >= (month,day - daysWindow) and monthDay <= (month,day + daysWindow):
                # Note: .ix was removed in pandas 1.0; on current versions use .iloc here
                new_df = groupMonthDay.ix[groupMonthDay.index.indexer_between_time(datetime.time(hour - hoursWindow), datetime.time(hour + hoursWindow))]
                if df_filtered is None:
                    df_filtered = new_df
                else:
                    # Note: DataFrame.append was removed in pandas 2.0; on current versions use pd.concat
                    df_filtered = df_filtered.append(new_df)
    return df_filtered

df_filtered = filter(df,day=8, month=10, hour=8, daysWindow=1, hoursWindow=1)
print(len(df))
print(len(df_filtered))

This produces the following output:

>>>
157825
117
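Both numbers can be checked by hand. The snippet below is a small sanity check that is not part of the original post, and its variable names are chosen only for illustration. It assumes the same 10-minute index as above: 13 samples lie between 07:00 and 09:00 inclusive, the window covers 3 days, and only 2009, 2010 and 2011 contain 8 October (the data stop on 1 August 2012), so 13 x 3 x 3 = 117.

```python
import pandas as pd

# Same index as in the question: 10-minute samples from 2009-08-01 to 2012-08-01.
dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="10min")

# 1096 days * 144 samples per day, plus the final midnight sample.
total_samples = len(dates)

# Samples between 07:00 and 09:00 inclusive on one day (hour=8, hoursWindow=1).
samples_per_day = len(
    pd.date_range("2009-10-08 07:00", "2009-10-08 09:00", freq="10min")
)

# Years whose data actually contain 8 October; 2012 stops on 1 August.
on_oct_8 = (dates.month == 10) & (dates.day == 8)
years_covered = sorted({int(y) for y in dates[on_oct_8].year})

# 3 days (7, 8 and 9 October) in each covered year.
expected_rows = 3 * samples_per_day * len(years_covered)

print(total_samples, samples_per_day, years_covered, expected_rows)
```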

Of course, this code still needs work on the boundary cases, for example when selecting an hour like 1 with an hoursWindow of 2:

>>> filter(df,day=8, month=10, hour=1, daysWindow=1, hoursWindow=2)
Traceback (most recent call last):
  File "", line 1, in <module>
  File "D:\tmp\test_filtro.py", line 40, in filter
    new_df = groupMonthDay.ix[groupMonthDay.index.indexer_between_time(datetime.time(hour - hoursWindow), datetime.time(hour + hoursWindow))]
ValueError: hour must be in 0..23

A similar issue would occur when selecting a day like 1 or 30.
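To make both boundary problems concrete, here is a minimal illustration using only the standard library. It is not from the original post and the variable names are hypothetical. It shows that the hour case fails loudly, while the day case fails silently: the tuple comparison simply never matches the last day of the previous month.

```python
import datetime

# Hour boundary: hour=1 with hoursWindow=2 asks for datetime.time(-1).
try:
    datetime.time(1 - 2)
    hour_error = None
except ValueError as exc:
    hour_error = str(exc)

# Day boundary: day=1 with daysWindow=1 yields the lower bound (10, 0).
month, day, days_window = 10, 1, 1
lower = (month, day - days_window)
upper = (month, day + days_window)

# 30 September is the day before 1 October, but as a (month, day) tuple
# it sorts below the lower bound, so it is silently left out.
sept_30 = (9, 30)
sept_30_selected = lower <= sept_30 <= upper

print(hour_error)
print(lower, upper, sept_30_selected)
```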

How could this code be improved?

Solution

The updated code for the filter function ensures there are no boundary issues:

import pandas as pd
import numpy as np
import datetime

dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

def filter(df, day, month, hour, minute=0, daysWindow=1, hoursWindow=1):
    """
    Filter a DataFrame by a date window and hour window grouped by years

    @type df: DataFrame
    @param df: DataFrame with dates and values
    @type day: int
    @param day: Day to focus on
    @type month: int
    @param month: Month to focus on
    @type hour: int
    @param hour: Hour to focus on
    @type minute: int
    @param minute: Minute used for both ends of the hours window
    @type daysWindow: int
    @param daysWindow: Number of days to perform the days window selection
    @type hoursWindow: int
    @param hoursWindow: Number of hours to perform the hours window selection
    @rtype: DataFrame
    @return: Returns a DataFrame with the filtered rows
    """
    df_filtered = None
    grouped = df.groupby(lambda x : x.year)
    for year, groupYear in grouped:
        # Date arithmetic lets the days window cross month boundaries
        date = datetime.date(int(year), month, day)
        dateStart = pd.Timestamp(date - datetime.timedelta(days=daysWindow))
        # Midnight after the last day of the window; label slicing includes this endpoint
        dateEnd = pd.Timestamp(date + datetime.timedelta(days=daysWindow+1))
        df_filtered_days = df.loc[dateStart:dateEnd]
        # Clamp the hours window to the valid range 0..23
        timeStart = datetime.time(max(0, hour-hoursWindow), minute)
        timeEnd = datetime.time(min(23, hour+hoursWindow), minute)
        # .iloc replaces .ix, which was removed in pandas 1.0
        new_df = df_filtered_days.iloc[df_filtered_days.index.indexer_between_time(timeStart, timeEnd)]
        if df_filtered is None:
            df_filtered = new_df
        else:
            # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
            df_filtered = pd.concat([df_filtered, new_df])
    return df_filtered

df_filtered = filter(df,day=8, month=10, hour=1, daysWindow=1, hoursWindow=2)
print(len(df))
print(len(df_filtered))

Output is:

>>>
157825
174
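As a possible alternative (a sketch, not the answerer's code), the same selection can be written with DataFrame.between_time and a half-open day range. The function name filter_window and its parameter names are made up for this example. One deliberate difference: the code above slices up to and including midnight after the last day, so the 00:00 sample of the following day is also selected whenever the hours window starts at 0. The half-open range below leaves that sample out, which gives 171 rows (3 years x 3 days x 19 samples) instead of 174 for the same call.

```python
import datetime

import numpy as np
import pandas as pd


def filter_window(df, day, month, hour, days_window=1, hours_window=1):
    """Hypothetical helper: rows within +/- days_window days of (month, day)
    in every year, restricted to +/- hours_window hours around hour."""
    # Clamp the hour window to 0..23 (truncated at midnight, not wrapped).
    time_start = datetime.time(max(0, hour - hours_window))
    time_end = datetime.time(min(23, hour + hours_window))

    frames = []
    for year in sorted(set(df.index.year)):
        center = pd.Timestamp(year=int(year), month=month, day=day)
        start = center - pd.Timedelta(days=days_window)
        end = center + pd.Timedelta(days=days_window + 1)  # exclusive bound
        in_days = df[(df.index >= start) & (df.index < end)]
        selected = in_days.between_time(time_start, time_end)
        if not selected.empty:
            frames.append(selected)

    # Return an empty frame with the same columns if nothing matched.
    return pd.concat(frames) if frames else df.iloc[0:0]


dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1) * 1500, index=dates, columns=["Power"])

result = filter_window(df, day=8, month=10, hour=1, days_window=1, hours_window=2)
print(len(df), len(result))
```

Whether the extra midnight sample should be kept is a judgment call; if the original behaviour is wanted, the comparison `df.index < end` can be changed to `df.index <= end`.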
