Python output feature correlation matrix_Grouped feature matrix in Python #2

It's not too different from before. We can start with the sample data:

DataFrame1:

Name  No.      Comment
Bob   2123320  Doesn't Matter
Joe   2832883  Whatever
John  2139300  Irrelevant
Bob   2123320  Something
John  2234903  Regardless

DataFrame2:

Name  No.      Report
Bob   2123320  Great
Joe   2832883  Solid
John  2139300  Awesome
Bob   2123320  Good
John  2234903  Perfect
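For anyone who wants to reproduce this, the two sample tables can be built roughly like so. The variable names `df1` and `df2` are my own choice for this sketch and do not come from the original post.

```python
import pandas as pd

# Sample data copied from the two tables above.
# df1 / df2 are assumed names, used only for illustration.
df1 = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903],
    "Comment": ["Doesn't Matter", "Whatever", "Irrelevant",
                "Something", "Regardless"],
})

df2 = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903],
    "Report": ["Great", "Solid", "Awesome", "Good", "Perfect"],
})

# Number of distinct No.'s per name; only John has more than one.
distinct_per_name = df1.groupby("Name")["No."].nunique()
print(distinct_per_name)
```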

I am looking for a way to make a new Excel file that looks like this (Expected Outcome):

        -------------------------2139300-------------------------  -2234903--
Name    Irrelevant  Whatever  Regardless  Awesome  Solid  Perfect  Irrelevant \
John             1         0           0        1      0        0           0

        -------------------2234903-------------------
Name    Whatever  Regardless  Awesome  Solid  Perfect
John           0           1        0      0        1

(Note: It doesn't need to have the No. values as header titles; I just added them for clarity and for the explanation below.)

Basically, very much like the previous question, this looks at each name and counts how many distinct No.'s it has, then keeps only the people who have a certain number of distinct No.'s. Now I have a set of "Comments" and "Reports" I wish to look for ({Irrelevant, Whatever, Regardless} and {Awesome, Solid, Perfect} respectively; note that this is only a subset of all Comments/Reports), and for each of these I want a 1 or 0 depending on whether it appears, but separately for each No. Put another way, for each No. I want a "group" of columns titled {Irrelevant, Whatever, Regardless} and {Awesome, Solid, Perfect}, and each column should hold a 1 if that value appeared for the person under that specific No. and a 0 if it didn't.
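The first step described here, keeping only the people with enough distinct No.'s, could be sketched as follows. The helper name `names_with_min_distinct` and its `min_distinct` parameter are hypothetical, made up for this illustration.

```python
import pandas as pd

def names_with_min_distinct(df, min_distinct=2, name_col="Name", no_col="No."):
    """Return the rows whose name has at least `min_distinct` distinct numbers.

    Hypothetical helper for illustration; not part of pandas.
    """
    # transform("nunique") broadcasts each group's distinct count back to its rows
    distinct = df.groupby(name_col)[no_col].transform("nunique")
    return df[distinct >= min_distinct]

demo = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903],
})

kept = names_with_min_distinct(demo)
print(kept)
```

Bob appears twice but always with the same No., so he is dropped along with Joe; only John's two rows remain.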

In this matrix, for example, we only see John because he is the only one with more than 1 distinct No. In the first group of columns only Irrelevant and Awesome have a value of 1 and the rest have 0; in the second group only Regardless and Perfect have 1s. In other words, all of my desired Comments/Reports ({Irrelevant, Whatever, Regardless} and {Awesome, Solid, Perfect}) are listed for one No., each marked 1 or 0 depending on whether it appeared. The same desired Comments/Reports are then repeated in a new "group" of columns for the next No., marked according to what appeared for that No.
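To make the target concrete, here is a minimal sketch of one way to build exactly this matrix, restricted to the desired subset of labels and kept in the requested column order. It assumes the Comments and Reports have already been stacked into a single column, which I call `Label`; that column name and the function `grouped_indicator_matrix` are assumptions of this sketch, not something from the original data.

```python
import pandas as pd

def grouped_indicator_matrix(df, wanted, name_col="Name", no_col="No.",
                             label_col="Label"):
    """One row per name, one group of `wanted` columns per number (1/0 flags).

    Hypothetical helper for illustration.
    """
    # Count occurrences per (name, number) and label, then cap at 1.
    table = pd.crosstab([df[name_col], df[no_col]], df[label_col])
    table = table.reindex(columns=wanted, fill_value=0).clip(upper=1)
    # Move the numbers into the columns: (label, number) -> (number, label).
    wide = table.unstack(no_col, fill_value=0).swaplevel(0, 1, axis=1)
    # Fix the column order: numbers ascending, labels in the order given.
    nos = sorted(wide.columns.get_level_values(0).unique())
    cols = pd.MultiIndex.from_product([nos, wanted], names=[no_col, label_col])
    return wide.reindex(columns=cols, fill_value=0)

# Comments and Reports from the sample data, stacked into one "Label" column.
long = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John"] * 2,
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903] * 2,
    "Label": ["Doesn't Matter", "Whatever", "Irrelevant", "Something",
              "Regardless", "Great", "Solid", "Awesome", "Good", "Perfect"],
})

# Keep only names with more than one distinct number (John here).
long = long[long.groupby("Name")["No."].transform("nunique") > 1]

wanted = ["Irrelevant", "Whatever", "Regardless", "Awesome", "Solid", "Perfect"]
matrix = grouped_indicator_matrix(long, wanted)
print(matrix)
```

Reindexing against the full product of numbers and wanted labels is what guarantees that every group repeats the same columns, even for labels that never occur under a given No.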

Let me know if anything is unclear and I truly do appreciate your help.

Thank you.

Solution

Try:

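import pandas as pd

# Assumption: df1 and df2 hold the two sample tables. Stack them into one
# frame so that Reports are treated the same way as Comments.
df_all = pd.concat([df1, df2.rename(columns={'Report': 'Comment'})],
                   ignore_index=True)

# Keep only the names that have more than one distinct No.
mask = df_all.groupby('Name')['No.'].transform('nunique') > 1

df_out = (df_all[mask]
          .set_index(['Name', 'No.'])['Comment'].str.get_dummies()
          # one column per Comment/Report seen anywhere, 0 where absent
          .reindex(df_all['Comment'].drop_duplicates(), fill_value=0, axis=1)
          .groupby(level=[0, 1]).sum()   # replaces the removed .sum(level=...)
          .unstack(fill_value=0)
          .swaplevel(0, 1, axis=1)
          .sort_index(axis=1))

print(df_out)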

Output:

No.     2139300                                                               \
Comment Awesome Doesn't Matter Good Great Irrelevant Perfect Regardless Solid
Name
John          1              0    0     0          1       0          0     0

No.                        2234903                                     \
Comment Something Whatever Awesome Doesn't Matter Good Great Irrelevant
Name
John            0        0       0              0    0     0          0

No.
Comment Perfect Regardless Solid Something Whatever
Name
John          1          1     0         0        0
