Group Your Data by Index

A few days ago, I was working on a risk-management project when I came across a simple problem I had never seen before.

The Problem

I needed to group the data by the different IDs in a transaction dataset. The dataset contains 8,250 transactions for 1,125 customers of a bank.

Here are the first five rows:

As you can see, several rows share the same ID, and I needed to group all the transactions according to the ID that created them.

Without a solution in mind, I started looking through my data science books, especially those on data analysis and feature engineering, but I didn't find anything.

So I turned to Stack Overflow, where I eventually found something that looked like my situation, though I had to adapt it to the characteristics of the project.

You can check out the project on GitHub here.

The Solution

To get around this problem you need Pandas' well-known groupby.

First, we are going to group the prod_code values according to their ID (if you want to know more, you can visit the GitHub repository linked above), collecting the various prod_code values for each ID into a list with the following code:

df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
df_0.head(10)
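If you don't have the bank dataset at hand, here is a minimal, self-contained sketch of the same groupby-into-list step using invented data (the column names `id` and `prod_code` mirror the article; the values are made up):

```python
import pandas as pd

# Toy stand-in for the transaction dataset: several rows per customer id
payment_data = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3],
    'prod_code': [10, 12, 10, 15, 15, 20],
})

# Collect every prod_code for each id into a list (duplicates are kept)
df_0 = payment_data.groupby('id')['prod_code'].apply(list).reset_index()
print(df_0)  # id 1 -> [10, 12], id 2 -> [10, 15, 15], id 3 -> [20]
```

Note that `apply(list)` keeps one entry per transaction, so repeated product codes for the same customer stay in the list.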

This is the output:

As we can see, we now have all the prod_code values (one for every transaction made) grouped by the ID.

The next feature I needed to group was new_balance, where I decided to take the mean of all the new balances for every ID:

df_1 = payment_data.groupby(['id'])['new_balance'].mean().reset_index()
df_1.head(10)
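A grouped column exposes aggregations like `.mean()` directly. Here is a minimal, self-contained sketch of this averaging step with invented balances (same column names as the article, made-up values):

```python
import pandas as pd

# Toy balances: several transactions per customer id (values invented)
payment_data = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'new_balance': [100.0, 300.0, 50.0, 150.0],
})

# Average new_balance per id
df_1 = payment_data.groupby('id')['new_balance'].mean().reset_index()
print(df_1)  # id 1 -> 200.0, id 2 -> 100.0
```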

Now, I had to group all the different types of overall payment: OVD_t1, OVD_t2, and OVD_t3. First, I grouped the sum of each type by its corresponding ID, and then merged all the types together into one dataframe.

For this I had to use the function reduce:

from functools import reduce
import pandas as pd

df_OVD_t1 = payment_data.groupby(['id'])['OVD_t1'].sum().reset_index()
df_OVD_t2 = payment_data.groupby(['id'])['OVD_t2'].sum().reset_index()
df_OVD_t3 = payment_data.groupby(['id'])['OVD_t3'].sum().reset_index()
OVD_dfs = [df_OVD_t1, df_OVD_t2, df_OVD_t3]
df_2 = reduce(lambda left, right: pd.merge(left, right, on='id'), OVD_dfs)
df_2.head(10)
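The reduce pattern generalizes to any number of dataframes that share a key: it folds the list by merging pairwise on that key. A minimal, self-contained sketch with invented per-ID summaries (column names mirror the article, values are made up):

```python
from functools import reduce
import pandas as pd

# Three toy per-id summaries that share the 'id' key (values invented)
df_a = pd.DataFrame({'id': [1, 2], 'OVD_t1': [0, 3]})
df_b = pd.DataFrame({'id': [1, 2], 'OVD_t2': [1, 0]})
df_c = pd.DataFrame({'id': [1, 2], 'OVD_t3': [2, 2]})

# Fold the list of frames into one, merging pairwise on 'id'
merged = reduce(lambda left, right: pd.merge(left, right, on='id'),
                [df_a, df_b, df_c])
print(merged)  # columns: id, OVD_t1, OVD_t2, OVD_t3
```

Because each merge uses the default inner join, only IDs present in every frame survive; here all frames share the same IDs, so nothing is dropped.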

The final feature I needed to group was pay_normal. Here, the only operation I applied was a sum, because I needed the total number of times that the payment was normal for each client.

df_3 = payment_data.groupby(['id'])['pay_normal'].sum().reset_index()
df_3.head(10)

Merging all the Dataframes into One

The final step of this task was to merge all the dataframes I created into a single one. For this, I used the reduce function again, in the same way as when grouping the OVD types earlier.

dfs = [df_0, df_1, df_2, df_3]
df_final = reduce(lambda left,right: pd.merge(left,right,on='id'), dfs)
df_final.head(10)

Conclusion

And that's it. I hope this tutorial helps you, and thanks for reading this far.

If you liked the article, I invite you to follow me on dev.to, GitHub and LinkedIn!

Translated from: https://medium.com/swlh/group-your-data-by-index-f5760f0f3e3f
