Data Mining: User Profiling

I. Building Word-Vector Features

1.1 Converting the Raw Data Encoding

import csv
from itertools import islice

import pandas as pd

HEADER = ['ID', 'age', 'Gender', 'Education', 'QueryList']


def convert_to_csv(data_path, n_rows=10000):
    """Convert the first n_rows lines of a GB18030 tab-separated file to CSV."""
    out_path = data_path + '-1w.csv'
    # Read as GB18030, dropping undecodable bytes. Write as GBK so that the
    # pd.read_csv(..., encoding='gbk') call below can read the result back.
    with open(data_path, 'r', encoding='gb18030', errors='ignore') as f, \
            open(out_path, 'w', encoding='gbk', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(HEADER)
        for line in islice(f, n_rows):
            # Remove the line terminator before splitting on tabs.
            data = line.rstrip('\r\n').split('\t')
            if len(data) < 4:
                continue  # skip malformed lines
            # Columns 0-3 are ID, age, Gender and Education. The remaining
            # columns are queries, kept as one tab-separated QueryList field.
            writedata = data[:4] + ['\t'.join(data[4:])]
            try:
                writer.writerow(writedata)
            except UnicodeEncodeError:
                continue  # skip rows with characters GBK cannot encode
    return out_path


# Training data
trainname = convert_to_csv(r'data\user_tag_query.10W.TRAIN')
# Test data
testname = convert_to_csv(r'data\user_tag_query.10W.TEST')

data = pd.read_csv(trainname, encoding='gbk')
print(data.shape)
data.head()
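
To make the encoding arguments used above more concrete, here is a minimal sketch that runs on an in-memory sample instead of the real data file. The sample line, the stray byte, and the helper name `gbk_encodable` are all made up for illustration. It shows that decoding with `errors='ignore'` silently drops an invalid byte, and that some characters valid in GB18030 cannot be written out as GBK.

```python
# Hypothetical sample line: ID, age, gender, education, then two queries.
sample = 'U001\t1\t2\t3\t天气预报\t火车票查询\n'

# Encode as GB18030 and add one invalid byte (0xFF) before the newline,
# imitating a corrupted record in a raw file.
raw = sample.rstrip('\n').encode('gb18030') + b'\xff' + b'\n'

# errors='ignore' drops the invalid byte instead of raising an exception.
decoded = raw.decode('gb18030', errors='ignore')


def gbk_encodable(text):
    """Return True if every character in text can be encoded as GBK."""
    try:
        text.encode('gbk')
        return True
    except UnicodeEncodeError:
        return False


print(decoded == sample)
print(gbk_encodable('天气预报'))
# U+20000 is a rare CJK character outside the range GBK covers.
print(gbk_encodable('\U00020000'))
```

Rows containing characters like the last one are the reason a conversion step may end up with slightly fewer rows than it read.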


1.2 Generating the Corresponding Data Tables

data.age.to_csv(r'data\train_age.csv', index = False)
data.Gender.to_csv(r'data\train_gender.csv', index = False)
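
As a small illustration of what a single-column `to_csv` call like the ones above writes, the sketch below uses a made-up two-row frame and an in-memory buffer instead of a file on disk. The frame contents are hypothetical, and `header=True` is passed explicitly as an assumption, because the default header behavior of `Series.to_csv` has changed across pandas versions.

```python
import io

import pandas as pd

# Made-up rows that mimic some columns of the converted training table.
toy = pd.DataFrame({'ID': ['U001', 'U002'], 'age': [1, 3], 'Gender': [2, 1]})

# Write only the age column, without the row index, to an in-memory buffer.
buf = io.StringIO()
toy.age.to_csv(buf, index=False, header=True)

# One header line followed by one line per row.
lines = buf.getvalue().splitlines()
print(lines)
```

With `index = False`, each output file holds only the label values, in the same row order as the source table.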