Electricity Bill Sensitivity Data Mining, Part 2: Constructing Text Features

Latest recommended article published on 2021-03-02 16:38:59

弎见

Category: data mining. Tags: data mining, machine learning, python, text features, dimensionality reduction

Article link: https://blog.csdn.net/sanjianjixiang/article/details/105963355

Previous part: Electricity Bill Sensitivity Data Mining, Part 1: Data Processing and Feature Engineering

4. Processing text features

4.1 Word segmentation with jieba

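import jieba

# jobinfo is presumably the work-order DataFrame prepared in Part 1.
print('Processing the text features in table 1...')
# Domain-specific terms registered with jieba so that each stays a single token.
mywords = ['户号', '分时', '抄表', '抄表示数', '工单', '单号', '工单号', '空气开关', '脉冲灯', '计量表', '来电', '报修']
for word in mywords:
    jieba.add_word(word)

# Load the stop-word list, one word per line.
stops = set()
with open(r'..\电费敏感预测\stopwords.txt', encoding='utf-8') as f:
    for word in f:
        word = word.strip()
        stops.add(word)

def fenci(line):
    """Segment a line with jieba, drop stop words, and join the rest with spaces."""
    res = []
    words = jieba.cut(line)
    for word in words:
        if word not in stops:
            res.append(word)
    return ' '.join(res)

print('Segmenting...')

# Store the segmented text in a new column.
jobinfo['contents'] = jobinfo.ACCEPT_CONTENT.apply(fenci)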
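
The stop-word filtering step can be tried in isolation, without the stop-word file or the jobinfo DataFrame. The sketch below is only an illustration: `load_stopwords`, `filter_tokens`, the `cut` parameter, and the sample words are hypothetical names and data that do not appear in the original post, and whitespace splitting stands in for the real segmenter.

```python
from typing import Callable, Iterable, Set


def load_stopwords(lines: Iterable[str]) -> Set[str]:
    """Build a stop-word set from an iterable of lines, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def filter_tokens(line: str, stops: Set[str],
                  cut: Callable[[str], Iterable[str]] = str.split) -> str:
    """Tokenise `line` with `cut`, drop stop words and whitespace-only
    tokens, and join what remains with single spaces."""
    kept = [tok for tok in cut(line) if tok.strip() and tok not in stops]
    return ' '.join(kept)


# Hypothetical stop words (including blank lines) and a pre-spaced sentence.
stops = load_stopwords(['的', '了', '', '  '])
result = filter_tokens('用户 反映 电费 的 问题', stops)
print(result)
```

To use the real segmenter, pass `cut=jieba.cut` once jieba is installed. The extra `tok.strip()` check is a design choice of this sketch: a segmenter can yield whitespace tokens when the input contains spaces, and dropping them keeps the joined output free of doubled spaces.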

4.2 Handling the digit strings that follow words such as phone number and account number

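import re

def hash_number(x):
    # Replace 11-digit mobile numbers (starting with 1) with a placeholder token.
    # Raw strings keep \s, \d and \Z from being read as string escapes.
    shouji_pattern = re.compile(r'\s1\d{10}\s|\s1\d{10}\Z')
    if shouji_pattern.findall(x):
        x = re.sub(shouji_pattern, ' 手机number ', x)

    # Replace 10-digit account numbers with a placeholder token.
    huhao_pattern = re.compile(r'\s\d{10}\s|\s\d{10}\Z')
    if huhao_pattern.findall(x):
        x = re.sub(huhao_pattern, ' 户号number ', x)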
        
    tuiding_pattern = re.
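
The source breaks off in the middle of the third pattern, so that pattern is not reconstructed here. For the two patterns that are shown, the following is a minimal sketch of a variant that uses lookarounds instead of matching the surrounding whitespace. This is an assumption about what may be desirable, not the author's method: because the original patterns consume the trailing space, the second of two numbers separated by a single space is not matched. The names `PHONE`, `ACCOUNT`, and `mask_numbers` and the sample string are hypothetical; the placeholder tokens are the ones used in the post.

```python
import re

# 11 digits starting with 1, bounded by whitespace or the string edges.
# The lookarounds check the boundaries without consuming them.
PHONE = re.compile(r'(?<!\S)1\d{10}(?!\S)')
# Exactly 10 digits, bounded the same way.
ACCOUNT = re.compile(r'(?<!\S)\d{10}(?!\S)')


def mask_numbers(text: str) -> str:
    """Replace phone numbers, then account numbers, with placeholder tokens."""
    text = PHONE.sub('手机number', text)
    text = ACCOUNT.sub('户号number', text)
    return text


# Made-up sample: two adjacent phone numbers and one account number.
sample = '手机 13812345678 13912345678 户号 0123456789'
masked = mask_numbers(sample)
print(masked)
```

Unlike the original patterns, this variant also matches a number at the very start of the string, and it leaves the surrounding spacing unchanged.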

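Contents of this part, as listed on the article page:

4. Processing text features
4.1 Word segmentation with jieba
4.2 Handling the digit strings that follow words such as phone number and account number
4.3 Adding the text features
5. Text feature selection
5.1 Building the dataset
5.2 Sparse matrix
5.3 Constructing tf-idf features
5.4 Dimensionality reduction via feature selection
Saving the text features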