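Sometimes the column (feature) names need to become the values of a variable, that is, the dataset has to be reshaped from wide to long format (loosely, "transposed"). A typical scenario:
# The dataset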
In [5]: test
Out[5]:
tweet_id doggo floofer pupper puppo
0 675003128568291329 None None None None
1 786233965241827333 None None None None
2 683481228088049664 None None pupper None
3 675497103322386432 None None None None
# Set the index first, then use .stack() to go from wide to long, and name the resulting Series
In [6]: s1 = test.set_index('tweet_id').stack().rename('stage')
In [7]: s1
Out[7]:
tweet_id
675003128568291329 doggo None
floofer None
pupper None
puppo None
786233965241827333 doggo None
floofer None
pupper None
puppo None
683481228088049664 doggo None
floofer None
pupper pupper
puppo None
675497103322386432 doggo None
floofer None
pupper None
puppo None
Name: stage, dtype: object
# Reset the MultiIndex back into ordinary columns
In [8]: s2 = s1.reset_index()
In [9]: s2
Out[9]:
tweet_id level_1 stage
0 675003128568291329 doggo None
1 675003128568291329 floofer None
2 675003128568291329 pupper None
3 675003128568291329 puppo None
4 786233965241827333 doggo None
5 786233965241827333 floofer None
6 786233965241827333 pupper None
7 786233965241827333 puppo None
8 683481228088049664 doggo None
9 683481228088049664 floofer None
10 683481228088049664 pupper pupper
11 683481228088049664 puppo None
12 675497103322386432 doggo None
13 675497103322386432 floofer None
14 675497103322386432 pupper None
15 675497103322386432 puppo None
# Drop the level_1 column and keep only the rows whose stage is not the string 'None'
In [10]: s2.drop(['level_1'], axis=1, inplace=True)
In [11]: s3 = s2[s2.stage != 'None']
In [12]: s3
Out[12]:
tweet_id stage
10 683481228088049664 pupper
# Merge with the original dataset
In [14]: result = pd.merge(test, s3, how='left', on='tweet_id')
In [15]: result
Out[15]:
tweet_id doggo floofer pupper puppo stage
0 675003128568291329 None None None None NaN
1 786233965241827333 None None None None NaN
2 683481228088049664 None None pupper None pupper
3 675497103322386432 None None None None NaN
# Drop the intermediate columns to get the final result
In [16]: result.drop(['doggo','floofer','pupper','puppo'], axis=1)
Out[16]:
tweet_id stage
0 675003128568291329 NaN
1 786233965241827333 NaN
2 683481228088049664 pupper
3 675497103322386432 NaN
# The original test DataFrame is left unchanged
In [17]: test
Out[17]:
tweet_id doggo floofer pupper puppo
0 675003128568291329 None None None None
1 786233965241827333 None None None None
2 683481228088049664 None None pupper None
3 675497103322386432 None None None None
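The session above can be condensed into one script. This is only a sketch: it assumes the missing markers are the literal string 'None' (which is what the filter in In [11] compares against), and the names stage_cols, long_form and stages are introduced here for illustration.

```python
import pandas as pd

# Sample data rebuilt from the session; 'None' is assumed to be a string,
# not Python's None object.
test = pd.DataFrame({
    'tweet_id': [675003128568291329, 786233965241827333,
                 683481228088049664, 675497103322386432],
    'doggo': ['None'] * 4,
    'floofer': ['None'] * 4,
    'pupper': ['None', 'None', 'pupper', 'None'],
    'puppo': ['None'] * 4,
})

# Hypothetical helper name for the four wide columns.
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Wide to long: one row per (tweet_id, original column) pair.
long_form = (
    test.set_index('tweet_id')
        .stack()
        .rename('stage')
        .reset_index()
)

# Keep only real stage values; selecting the two needed columns by name
# also discards the auto-generated level column.
stages = long_form.loc[long_form['stage'] != 'None', ['tweet_id', 'stage']]

# Left merge keeps every tweet; tweets without a stage get NaN.
result = (
    test.merge(stages, how='left', on='tweet_id')
        .drop(columns=stage_cols)
)
print(result)
```

Note that if a tweet carried more than one stage, the left merge would return one row per stage for that tweet.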
There should be a simpler way to do this. To be added later.
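One candidate for a simpler route is pandas' melt, which does the wide-to-long step in a single call. The sketch below is not the original author's method. It assumes the missing markers are the literal string 'None', and the names source_col, long_form and stages are made up for illustration.

```python
import pandas as pd

# Same sample data as in the session; 'None' is assumed to be a string.
test = pd.DataFrame({
    'tweet_id': [675003128568291329, 786233965241827333,
                 683481228088049664, 675497103322386432],
    'doggo': ['None'] * 4,
    'floofer': ['None'] * 4,
    'pupper': ['None', 'None', 'pupper', 'None'],
    'puppo': ['None'] * 4,
})

# melt keeps tweet_id as the identifier and stacks the other columns:
# 'source_col' holds the old column name, 'stage' holds the cell value.
long_form = test.melt(id_vars='tweet_id',
                      var_name='source_col',
                      value_name='stage')

# Keep only rows that carry a real stage value.
stages = long_form.loc[long_form['stage'] != 'None', ['tweet_id', 'stage']]

# Left merge against the id column alone, so there are no wide columns
# to drop afterwards; tweets without a stage get NaN.
result = test[['tweet_id']].merge(stages, how='left', on='tweet_id')
print(result)
```

Merging against test[['tweet_id']] rather than the full frame removes the final drop step. As with the stack version, a tweet with several stages would appear once per stage.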