需求
1、对于每个product id,满足:6月有数,9月有数 且(7月8月任意一个月为空/0 OR 7+8的sales < 50%*9的sales)
2、则 apply:7+8+9的数,按32% 30% 38%分配在这三个月,volume round到INT, value round到保留2位小数
3、其中:sales = Value;sales_volume= Volume
code question
– groupby().apply() 会重复输出第一行,返回却是正确的,没有重复项
– 如何查找相同id,相同月份,行的数量
对id和月份进行groupby
– 一个df中,78月两个月的行数据,假如可能出现sales列7有数据8没有,怎么计算sale列的sum
df[(df==7) | (df==8)]['sales'].sum()
解决思路:
1、完成第一步筛选任务,获得filter_df
1、筛选出同时具有6月和9月的ID
2、删除同时具有7和8月且两月之和大于9月的ID
2、完成第二步按比例赋值
1、构建一个789月的month_df_789,具有month_dt,ratio
2、month_df_789与筛选出来的single_df做merge,再为single_df添加两列s_sum,v_sum
1、以month_df_789为左连接,构建出temp_merge_df格式的数据
2、再将m_df的列ratio乘以新两列,结果赋给sales,sales_volume
3、修改类型,复制给final_df
"""
Email: 704669594@qq.com
data: 2019/4/21/0021
"""
import pandas as pd
# 筛选出符合条件的数据,使用切片选出有点不合适,采用groupby.apply筛选
# 本来先过滤count小于2的id
count = 0
def filter_month(single_df):
global count
count +=1
if count % 10000 == 0:
print(f'process {count} data')
# single_df.month_dt.tolist()
# 一种方式,tolist后,用in的方法
sdmd = single_df.month_dt
sim = '2017-06-01'
sem = '2017-07-01'
eim = '2017-08-01'
nim = '2017-09-01'
# 筛选出同时具有6月和9月的ID
if any(sdmd == sim) and any(sdmd == nim):
# 同时具有7和8月 且 两月之和大于9月 直接返回df
if any(sdmd == sem) and any(sdmd == eim):
if any(single_df[(sdmd == sem) | (sdmd == eim)]['sales'].sum() > 0.5 * single_df[sdmd == nim]['sales']):
return single_df
# 其他情况都需要处理
"""
1、构建一个789月的month_df_789,具有month_dt,ratio
2、month_df_789与筛选出来的single_df做merge,再为single_df添加两列s_sum,v_sum
1、以month_df_789为左连接,构建出temp_merge_df格式的数据
2、再将m_df的列ratio乘以新两列,结果赋给sales,sales_volume
"""
month_df_789 = pd.DataFrame({
'month_dt': ['2017-07-01', '2017-08-01', '2017-09-01'],
'ratio': [0.32, 0.3, 0.38]})
temp_merge_df = pd.merge(month_df_789, single_df, on=['month_dt'], how='left')
temp_merge_df['s_sum'] = temp_merge_df.sales.sum()
temp_merge_df['v_sum'] = temp_merge_df.sales_volume.sum()
temp_merge_df['sales'] = temp_merge_df['s_sum'] * temp_merge_df['ratio']
temp_merge_df['sales_volume'] = temp_merge_df['v_sum'] * temp_merge_df['ratio']
# 确定最终结果,添加6月份,填充product_id
temp_result_df = temp_merge_df[['month_dt', 'product_id', 'sales', 'sales_volume']].copy()
temp_result_df = temp_result_df.append(single_df[sdmd == sim], sort=False)
temp_result_df['product_id'] = temp_result_df['product_id'].ffill()
return temp_result_df
def run():
# total_data和sample_data
link = r'C:\lllong33\typora-lllong33\[2]jupyter_notebook_code\Apr_18_temp_need\toBen.csv'
# link = r'C:\lllong33\typora-lllong33\[2]jupyter_notebook_code\Apr_18_temp_need\new_test_dbs.csv'
# 1、获取数据
# handle:将id置为str类型,取出需要的4列数据,去重(这里不能保证月份唯一)
df = pd.read_csv(link)
df['product_id'] = df['product_id'].round(0).astype(str)
need_df = df[['month_dt', 'product_id', 'sales', 'sales_volume']].copy()
print('去重前数据量:', len(need_df))
need_df.drop_duplicates(inplace=True)
print('去重后数据量:', len(need_df))
# 2、对相同id数据,进行筛选和赋值
process_df = need_df.groupby('product_id').apply(filter_month)
print('process data after length: ', len(process_df))
# 3.1、question: 在里面处理类型,读出来还会变化,只能在外面改变类型
process_df['sales'] = process_df['sales'].round(2)
process_df['sales_volume'] = process_df['sales_volume'].round(0).astype(int)
# 3.2、将groupby转为DataFrame
process_df.reset_index(drop=True, inplace=True)
# 3.3、merge处理,将其他列数据加上,最后存在sales和sales_volume对比列
result_df = pd.merge(process_df, df, on=['product_id', 'month_dt'], how='left')
# 3.4、九月份数据一定存在,78月可能不存在,则用bfill
result_df = result_df.bfill()
# 写入file
result_df.to_csv(r'C:\lllong33\typora-lllong33\[2]jupyter_notebook_code\Apr_18_temp_need\final_to_ben.csv', index=None, encoding='utf_8_sig')
print('finish...')
if __name__ == '__main__':
run()