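One-hot encoding with pandas
One-hot encode a string column from a CSV file read with pandas, then count how many times each encoded value occurs.
Suppose the data looks like this: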
>>> file
                               hdid          time   eventid  is_black
0  00000ec16ad8603567608b7bce582e57  1.568535e+09  20025229         0
1  00000ec16ad8603567608b7bce582e57  1.568535e+09  20026513         0
2  00000ec16ad8603567608b7bce582e57  1.568535e+09  20035569         0
3  00000ec16ad8603567608b7bce582e57  1.568535e+09  20023769         0
4  00000ec16ad8603567608b7bce582e57  1.568535e+09  20035569         0
5  00000ec16ad8603567608b7bce582e57  1.568535e+09  20035569         0
6  00000ec16ad8603567608b7bce582e57  1.568535e+09  20028335         0
7  00000ec16ad8603567608b7bce582e57  1.568535e+09  20025229         0
8  00000ec16ad8603567608b7bce582e57  1.568535e+09  20023769         0
9  00000ec16ad8603567608b7bce582e57  1.568535e+09  20023768         0
One-hot encode the eventid column (first 10 rows only):
>>> tmp2 = pd.get_dummies(file["eventid"][:10])
>>> tmp2
   20023768  20023769  20025229  20026513  20028335  20035569
0         0         0         1         0         0         0
1         0         0         0         1         0         0
2         0         0         0         0         0         1
3         0         1         0         0         0         0
4         0         0         0         0         0         1
5         0         0         0         0         0         1
6         0         0         0         0         1         0
7         0         0         1         0         0         0
8         0         1         0         0         0         0
9         1         0         0         0         0         0
Restore the other columns (the assignment aligns on the row index):
>>> tmp2[["hdid","is_black"]] = file[["hdid","is_black"]]
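Count how many times each encoded value occurs:
>>> tmp2.groupby(["hdid","is_black"]).sum()
This relies on groupby: rows are grouped by hdid and is_black, and summing the 0/1 one-hot columns within each group gives the number of occurrences of each eventid.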
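The whole procedure can be sketched end to end as a script. This is a minimal sketch under assumptions: the small DataFrame below is made-up sample data standing in for the CSV (the hdid values "a" and "b" are hypothetical placeholders), and `dtype=int` is passed explicitly because recent pandas versions (2.0 and later, as far as I know) return boolean columns from `get_dummies` by default, whereas the output above shows 0/1.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV contents.
file = pd.DataFrame({
    "hdid": ["a", "a", "a", "b", "b"],
    "eventid": [20025229, 20035569, 20035569, 20025229, 20023769],
    "is_black": [0, 0, 0, 1, 1],
})

# One-hot encode eventid; dtype=int keeps the columns as 0/1 integers.
onehot = pd.get_dummies(file["eventid"], dtype=int)

# Reattach the grouping columns; the assignment aligns on the shared row index.
onehot[["hdid", "is_black"]] = file[["hdid", "is_black"]]

# Summing the 0/1 columns per group gives the count of each eventid.
counts = onehot.groupby(["hdid", "is_black"]).sum()
print(counts)
```

Each row of `counts` corresponds to one (hdid, is_black) pair, and each column to one eventid value.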