pandas 练习1

最新推荐文章于 2022-10-27 12:57:15 发布

lllong33

最新推荐文章于 2022-10-27 12:57:15 发布

阅读量163

点赞数

分类专栏： Pandas

本文链接：https://blog.csdn.net/lllong33/article/details/89430705

版权

Pandas 专栏收录该内容

3 篇文章 0 订阅

订阅专栏

需求

1、对于每个product id，满足：6月有数，9月有数且(7月8月任意一个月为空/0 OR 7+8的sales < 50%*9的sales)

2、则 apply:7+8+9的数，按32% 30% 38%分配在这三个月，volume round到INT, value round到保留2位小数

3、其中：sales = Value；sales_volume= Volume

code question

– groupby().apply() 会重复输出第一行，返回却是正确的，没有重复项

– 如何查找相同id，相同月份，行的数量
对id和月份进行groupby

– 一个df中，78月两个月的行数据，假如可能出现sales列7有数据8没有，怎么计算sale列的sum
df[(df==7) | (df==8)]['sales'].sum()

解决思路：

1、完成第一步筛选任务，获得filter_df
	1、筛选出同时具有6月和9月的ID
	2、删除同时具有7和8月且两月之和大于9月的ID

2、完成第二步按比例赋值
	1、构建一个789月的month_df_789，具有month_dt，ratio
	2、month_df_789与筛选出来的single_df做merge，再为single_df添加两列s_sum,v_sum
		1、以month_df_789为左连接，构建出temp_merge_df格式的数据
		2、再将m_df的列ratio乘以新两列，结果赋给sales,sales_volume
	3、修改类型，复制给final_df

"""
Email: 704669594@qq.com
data: 2019/4/21/0021
"""
import pandas as pd

# 筛选出符合条件的数据，使用切片选出有点不合适，采用groupby.apply筛选
# 本来先过滤count小于2的id
count = 0
def filter_month(single_df):
    global count
    count +=1
    if count % 10000 == 0:
        print(f'process {count} data')
    #     single_df.month_dt.tolist()
    #     一种方式，tolist后，用in的方法
    sdmd = single_df.month_dt
    sim = '2017-06-01'
    sem = '2017-07-01'
    eim = '2017-08-01'
    nim = '2017-09-01'
    # 筛选出同时具有6月和9月的ID
    if any(sdmd == sim) and any(sdmd == nim):
        # 同时具有7和8月 且 两月之和大于9月 直接返回df
        if any(sdmd == sem) and any(sdmd == eim):
            if any(single_df[(sdmd == sem) | (sdmd == eim)]['sales'].sum() > 0.5 * single_df[sdmd == nim]['sales']):
                return single_df

        # 其他情况都需要处理
        """
        1、构建一个789月的month_df_789，具有month_dt，ratio
		2、month_df_789与筛选出来的single_df做merge，再为single_df添加两列s_sum,v_sum
			1、以month_df_789为左连接，构建出temp_merge_df格式的数据
			2、再将m_df的列ratio乘以新两列，结果赋给sales,sales_volume
        """
        month_df_789 = pd.DataFrame({
            'month_dt': ['2017-07-01', '2017-08-01', '2017-09-01'],
            'ratio': [0.32, 0.3, 0.38]})
        temp_merge_df = pd.merge(month_df_789, single_df, on=['month_dt'], how='left')
        temp_merge_df['s_sum'] = temp_merge_df.sales.sum()
        temp_merge_df['v_sum'] = temp_merge_df.sales_volume.sum()
        temp_merge_df['sales'] = temp_merge_df['s_sum'] * temp_merge_df['ratio']
        temp_merge_df['sales_volume'] = temp_merge_df['v_sum'] * temp_merge_df['ratio']

        # 确定最终结果，添加6月份，填充product_id
        temp_result_df = temp_merge_df[['month_dt', 'product_id', 'sales', 'sales_volume']].copy()
        temp_result_df = temp_result_df.append(single_df[sdmd == sim], sort=False)
        temp_result_df['product_id'] = temp_result_df['product_id'].ffill()


        return temp_result_df



def run():
    # total_data和sample_data
    link = r'C:\lllong33\typora-lllong33\[2]jupyter_notebook_code\Apr_18_temp_need\toBen.csv'
    # link = r'C:\lllong33\typora-lllong33\[2]jupyter_notebook_code\Apr_18_temp_need\new_test_dbs.csv'

    # 1、获取数据
    # handle：将id置为str类型，取出需要的4列数据，去重(这里不能保证月份唯一)
    df = pd.read_csv(link)
    df['product_id'] = df['product_id'].round(0).astype(str)
    need_df = df[['month_dt', 'product_id', 'sales', 'sales_volume']].copy()
    print('去重前数据量：', len(need_df))
    need_df.drop_duplicates(inplace=True)
    print('去重后数据量：', len(need_df))

    # 2、对相同id数据，进行筛选和赋值
    process_df = need_df.groupby('product_id').apply(filter_month)
    print('process data after length: ', len(process_df))

    # 3.1、question: 在里面处理类型，读出来还会变化，只能在外面改变类型
    process_df['sales'] = process_df['sales'].round(2)
    process_df['sales_volume'] = process_df['sales_volume'].round(0).astype(int)

    # 3.2、将groupby转为DataFrame
    process_df.reset_index(drop=True, inplace=True)

    # 3.3、merge处理，将其他列数据加上，最后存在sales和sales_volume对比列
    result_df = pd.merge(process_df, df, on=['product_id', 'month_dt'], how='left')
    # 3.4、九月份数据一定存在，78月可能不存在，则用bfill
    result_df = result_df.bfill()

    # 写入file
    result_df.to_csv(r'C:\lllong33\typora-lllong33\[2]jupyter_notebook_code\Apr_18_temp_need\final_to_ben.csv', index=None, encoding='utf_8_sig')

    print('finish...')
if __name__ == '__main__':
    run()

lllong33

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
pandas 练习1

需求1、对于每个product id，满足：6月有数，9月有数且(7月8月任意一个月为空/0 OR 7+8的sales < 50%*9的sales)2、则 apply:7+8+9的数，按32% 30% 38%分配在这三个月，volume round到INT, value round到保留2位小数3、其中：sales = Value；sales_volume= Volume"""E...
复制链接

扫一扫