Python DataFrame: selecting rows at monthly increments from daily data

Let's get right into the question. The following is the daily data:

              AAA    BBB    CCC
date
2012-04-16  44.48  28.48  17.65
2012-04-17  44.59  28.74  17.65
2012-04-18  44.92  28.74  17.72
2012-04-19  44.92  28.62  17.72
2012-04-20  45.09  28.68  17.71
2012-04-23  45.09  28.40  17.76
2012-04-24  45.09  28.51  17.73
2012-04-25  45.01  28.76  17.73
2012-04-26  45.40  28.94  17.76
2012-04-27  45.57  29.02  17.79
2012-04-30  45.45  28.90  17.80
2012-05-01  45.79  29.07  17.80
2012-05-02  45.71  28.98  17.77
2012-05-03  45.44  28.81  17.79
2012-05-04  45.05  28.48  17.79
2012-05-07  45.05  28.48  17.79
2012-05-08  45.00  28.40  17.93
2012-05-09  44.87  28.30  17.94
2012-05-10  44.93  28.34  17.85
2012-05-11  44.86  28.30  17.96
...           ...    ...    ...

I want to select the rows starting from the first row with a monthly increment, that is, the rows whose index is 2012-04-16, 2012-05-16, 2012-06-16, ... . I can just use relativedelta and manually add them but I'm wondering if there is a more efficient method. I tried resampling, but I could only choose the first or last of each month as in df.resample('M').first().

What makes the problem more complicated is that some of the dates are missing: the index contains business days, but not U.S. business days. There are several ways to handle this:

1. Choose the exact date or the closest earlier one. If no such date exists, look at later dates.
2. Choose the exact date or the closest later one. If no such date exists, look at earlier dates.
3. Choose the date closest to the exact date, whether earlier or later; I can use min(df.index, key=lambda x: abs(x - (df.index[0] + relativedelta(months=1)))).

And in each of these cases, I wonder which method is the most efficient and easy to read. In the last code example, the month is a variable so I'm not sure if I can make it as a lambda procedure and use 'apply'.

Thanks in advance.

Solution

Before we look at your data, let's first see how to create a DatetimeIndex for a specific day of each month. Since pd.date_range with a monthly frequency returns the last day of each month, we can add a fixed number of days to land on that day of the following month:

idx = pd.date_range('2018-04-01', '2018-07-01', freq='1M') + pd.DateOffset(days=16)

DatetimeIndex(['2018-05-16', '2018-06-16', '2018-07-16'],
              dtype='datetime64[ns]', freq=None)
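For reference, here is a self-contained sketch of the same construction. It passes the pd.offsets.MonthEnd() object rather than a string alias, which is my own substitution: the alias is spelled 'M' in older pandas releases and 'ME' in newer ones, and the offset object sidesteps that difference. The variable name month_ends is only illustrative.

```python
import pandas as pd

# Month-end stamps between the two dates: 2018-04-30, 2018-05-31, 2018-06-30.
# The offset object is used instead of a string alias (an assumption made
# here for portability across pandas versions).
month_ends = pd.date_range("2018-04-01", "2018-07-01", freq=pd.offsets.MonthEnd())

# A month-end plus 16 days lands on the 16th of the following month.
idx = month_ends + pd.DateOffset(days=16)

print(idx)
```

Note that the resulting dates start in the month after the first month-end, which is why the start date matters later on.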

Now let's take an example dataframe which has some 16th days missing:

              AAA    BBB    CCC
date
2012-04-16  44.48  28.48  17.65
2012-04-17  44.59  28.74  17.65
2012-05-15  45.79  29.07  17.80
2012-05-16  45.71  28.98  17.77
2012-05-17  45.44  28.81  17.79
2012-06-15  44.87  28.30  17.94
2012-06-17  44.95  28.50  17.98
2012-07-14  44.65  28.25  17.87
2012-07-17  44.55  28.75  17.75

As you mention, there are several ways to select a row when the exact day is missing: search backwards, search forwards, or take the nearest date in either direction. You need to consider which is most appropriate in the context of your project. Below is a solution which sticks to Pandas functionality and avoids custom lambda functions.

Define dataframe with DatetimeIndex

First create a dataframe with only required indices specified:

offset = pd.DateOffset(days=16)
start_date = df.index[0]-pd.DateOffset(months=1)
idx = pd.date_range(start_date, df.index[-1], freq='1M') + offset
df_idx = pd.DataFrame(index=idx)
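The effect of moving the start date back by one month can be checked in isolation. The sketch below uses the first and last dates of the sample frame as plain timestamps; the names first, last, without_shift and with_shift are illustrative, and pd.offsets.MonthEnd() stands in for the 'M' alias as an assumption for portability across pandas versions.

```python
import pandas as pd

# First and last dates of the sample frame, written out by hand.
first, last = pd.Timestamp("2012-04-16"), pd.Timestamp("2012-07-17")
offset = pd.DateOffset(days=16)
month_end = pd.offsets.MonthEnd()

# Starting at the first date: the earliest month-end is 2012-04-30,
# so the earliest target becomes 2012-05-16 and April is lost.
without_shift = pd.date_range(first, last, freq=month_end) + offset

# Starting one month earlier: the earliest month-end is 2012-03-31,
# so the earliest target is 2012-04-16.
with_shift = pd.date_range(first - pd.DateOffset(months=1), last, freq=month_end) + offset

print(without_shift)
print(with_shift)
```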

Notice we need to subtract a month from the start argument, so that the first month is not omitted after adding 16 days. Now you can use pd.merge_asof with a variety of options.

Match backwards / forwards / nearest via merge_asof

Specify direction argument as 'backward' (default), 'forward' or 'nearest' as appropriate. For example, using 'forward':

print(pd.merge_asof(df_idx, df, left_index=True, right_index=True, direction='forward'))

              AAA    BBB    CCC
2012-04-16  44.48  28.48  17.65
2012-05-16  45.71  28.98  17.77
2012-06-16  44.95  28.50  17.98
2012-07-16  44.55  28.75  17.75
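To compare the three directions side by side, the whole pipeline can be put into one runnable sketch. The sample frame is retyped from the table above, the results dictionary is just a convenience of this sketch, and pd.offsets.MonthEnd() again replaces the 'M' alias as an assumption for portability.

```python
import pandas as pd

# Sample frame from above; the 16th is missing in June and July.
df = pd.DataFrame(
    {
        "AAA": [44.48, 44.59, 45.79, 45.71, 45.44, 44.87, 44.95, 44.65, 44.55],
        "BBB": [28.48, 28.74, 29.07, 28.98, 28.81, 28.30, 28.50, 28.25, 28.75],
        "CCC": [17.65, 17.65, 17.80, 17.77, 17.79, 17.94, 17.98, 17.87, 17.75],
    },
    index=pd.DatetimeIndex(
        ["2012-04-16", "2012-04-17", "2012-05-15", "2012-05-16", "2012-05-17",
         "2012-06-15", "2012-06-17", "2012-07-14", "2012-07-17"],
        name="date",
    ),
)

# Target dates: the 16th of every month covered by the data.
idx = pd.date_range(
    df.index[0] - pd.DateOffset(months=1), df.index[-1], freq=pd.offsets.MonthEnd()
) + pd.DateOffset(days=16)
df_idx = pd.DataFrame(index=idx)

# One merge per direction; the result keeps the target dates as its index.
results = {
    direction: pd.merge_asof(
        df_idx, df, left_index=True, right_index=True, direction=direction
    )
    for direction in ("backward", "forward", "nearest")
}

for direction, picked in results.items():
    print(direction)
    print(picked, end="\n\n")
```

With this data, 'backward' takes the rows dated 2012-06-15 and 2012-07-14 for the last two targets, while 'forward' takes 2012-06-17 and 2012-07-17. For 'nearest', 2012-06-16 is equally far from both neighbours, so check which one your pandas version picks before relying on it.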

This now may be sufficient for your needs.

Edit: If you want to keep the index from the dataframe, you can reverse the direction of the merge and use 'backward' instead of 'forward':

res = pd.merge_asof(df.reset_index(),
                    df_idx.reset_index().rename(columns={'index': 'date_idx'}),
                    left_on='date', right_on='date_idx', direction='backward')
res['diff'] = (res['date'] - res['date_idx']).dt.days.abs()
grouper = res['date'].dt.strftime('%Y-%m')
res = res[res['diff'] == res.groupby(grouper)['diff'].transform('min')]
print(res)

        date    AAA    BBB    CCC   date_idx  diff
0 2012-04-16  44.48  28.48  17.65 2012-04-16     0
3 2012-05-16  45.71  28.98  17.77 2012-05-16     0
6 2012-06-17  44.95  28.50  17.98 2012-06-16     1
8 2012-07-17  44.55  28.75  17.75 2012-07-16     1
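If keeping the original dates is the main goal, another possible route (not part of the answer above) is to look up row positions with Index.get_indexer and select them with iloc. The sketch below assumes a sorted, duplicate-free DatetimeIndex; pick_rows is a hypothetical helper, and the target dates are written out by hand for brevity.

```python
import pandas as pd

# One column of the sample frame is enough to show the row selection.
df = pd.DataFrame(
    {"AAA": [44.48, 44.59, 45.79, 45.71, 45.44, 44.87, 44.95, 44.65, 44.55]},
    index=pd.DatetimeIndex(
        ["2012-04-16", "2012-04-17", "2012-05-15", "2012-05-16", "2012-05-17",
         "2012-06-15", "2012-06-17", "2012-07-14", "2012-07-17"],
        name="date",
    ),
)

# The 16th of each month, written out by hand.
targets = pd.DatetimeIndex(["2012-04-16", "2012-05-16", "2012-06-16", "2012-07-16"])


def pick_rows(frame, targets, method):
    """Hypothetical helper: return the rows of `frame` matched to `targets`.

    method: 'ffill' (exact or earlier), 'bfill' (exact or later) or 'nearest'.
    """
    pos = frame.index.get_indexer(targets, method=method)
    pos = pos[pos >= 0]  # -1 marks a target with no match in that direction
    return frame.iloc[pos]


later = pick_rows(df, targets, "bfill")
earlier = pick_rows(df, targets, "ffill")

print(later)
print(earlier)
```

Here the result keeps the dates that actually occur in the data (for example 2012-06-17 rather than 2012-06-16), at the cost of no longer showing which target each row was matched to.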
