Text Preprocessing

spespusliar

Tags: python, natural language processing

Article link: https://blog.csdn.net/weixin_45153966/article/details/121684569

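Text Preprocessing (continuously updated)

What is the core of text preprocessing, or in other words, what is it for?
Answer: turning words into something a model can be trained on.
Purpose of this article: to serve as a toolbox of ready-to-use snippets.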

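1. Quick-and-dirty text cleaning (English)
Replace every run of characters that are not English letters with a single space, and convert all letters to lowercase.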

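import re

with open('text.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
# lines is a list; each element is one line of the file as a string

# Replace each run of non-letter characters with a space, then trim and lowercase
simple_process_file = [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]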

Result: each line becomes a lowercase string that contains only letters separated by single spaces.
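To make the effect concrete, here is a small sketch that applies the same substitution to two in-memory strings instead of a file. The sample strings and the names sample_lines and cleaned are made up for illustration and are not from the original post.

```python
import re

# Hypothetical sample input standing in for the lines read from text.txt
sample_lines = ["Hello, World! 123\n", "It's 2021: NLP is fun.\n"]

# Same cleaning rule as above: non-letters -> one space, trim, lowercase
cleaned = [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in sample_lines]

print(cleaned)  # ['hello world', 'it s nlp is fun']
```

Note the side effect on contractions: the apostrophe in "It's" is not a letter, so the word is split into "it" and "s". Digits are dropped for the same reason.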
2. Tokenization (English)

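def tokenize(lines, token='word'):
    """Split each line into word tokens or character tokens."""
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        # Fail loudly instead of printing a message and silently returning None
        raise ValueError('unknown token type: ' + token)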

Applied to s1, it returns one list of tokens per line.
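As a rough illustration of the two modes, the sketch below runs the same two expressions used inside tokenize on a hypothetical pair of cleaned lines. The variable names sample, word_tokens and char_tokens are assumptions made for this example.

```python
# Hypothetical cleaned lines, standing in for the output of step 1
sample = ["hello world", "nlp is fun"]

# What token='word' produces: split on whitespace
word_tokens = [line.split() for line in sample]
# What token='char' produces: one token per character
char_tokens = [list(line) for line in sample]

print(word_tokens)     # [['hello', 'world'], ['nlp', 'is', 'fun']]
print(char_tokens[0])  # ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
```

In character mode the space itself becomes a token, which matters if the vocabulary is later built from character tokens.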
3. Build the vocabulary, i.e. a dictionary of the form {'word': id}
The id can be the number of times the word occurs, or an integer code starting from 0.

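from collections import Counter

import numpy as np

MAX_VOCAB_SIZE = 5  # desired vocabulary size; rare words are deliberately left out
# token is assumed to be the output of tokenize(); token[0] is the first line's tokens
# Keep the MAX_VOCAB_SIZE-1 most frequent words, leaving one slot for <unk>
vocab = dict(Counter(token[0]).most_common(MAX_VOCAB_SIZE - 1))
# <unk> collects every occurrence of a word that did not make it into the vocabulary
vocab["<unk>"] = len(token[0]) - np.sum(list(vocab.values()))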

This gives a vocabulary of the form {'word': count}, with an extra entry for unknown words.
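A small worked example may help check the arithmetic. It is only a sketch: the token list is invented, MAX_VOCAB_SIZE is set to 3 so that some words fall out, and the built-in sum is used in place of np.sum to keep the example dependency-free.

```python
from collections import Counter

MAX_VOCAB_SIZE = 3  # small on purpose so that some words are pushed into <unk>
# Hypothetical token list for a single line
tokens = ["the", "cat", "saw", "the", "dog", "and", "the", "cat"]

# Keep the 2 most frequent words: 'the' (3 times) and 'cat' (2 times)
vocab = dict(Counter(tokens).most_common(MAX_VOCAB_SIZE - 1))
# The remaining 8 - 5 = 3 tokens ('saw', 'dog', 'and') are counted as unknown
vocab["<unk>"] = len(tokens) - sum(vocab.values())

print(vocab)  # {'the': 3, 'cat': 2, '<unk>': 3}
```

The counts always add up to the total number of tokens, which is a quick sanity check after building the vocabulary.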
Next, the words can be used to build a vocabulary of the form {'word': id}, where id is an index starting from 0.

idx_to_word = list(vocab.keys())
word_to_idx = {word: i for i, word in enumerate(idx_to_word)}

idx_to_word is a list that returns the word stored at a given index.
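The two mappings are inverses of each other, which the following sketch illustrates on a hypothetical count vocabulary (the dictionary contents are made up for this example).

```python
# Hypothetical count vocabulary, shaped like the one built in this step
vocab = {"the": 3, "cat": 2, "<unk>": 3}

idx_to_word = list(vocab.keys())
word_to_idx = {word: i for i, word in enumerate(idx_to_word)}

print(idx_to_word)  # ['the', 'cat', '<unk>']
print(word_to_idx)  # {'the': 0, 'cat': 1, '<unk>': 2}

# Round trip: index -> word -> index gives back the same index
round_trip = [word_to_idx[idx_to_word[i]] for i in range(len(idx_to_word))]
```

The indices follow dict insertion order (guaranteed since Python 3.7), so the most frequent word gets index 0 and <unk>, added last, gets the highest index.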
4. Convert words to their ids

def encode_text(texts):  # input is a list of text strings
    encoded_text = []
    for text in texts:
        E = []  # ids for the words of one text
        for word in text.strip().split():
            if word in word_to_idx:
                E.append(word_to_idx[word])
            else:
                # Out-of-vocabulary words map to the <unk> id
                E.append(word_to_idx['<unk>'])
        encoded_text.append(E)
    return encoded_text
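For a quick check of the encoding step, here is a compact variant under stated assumptions: the mapping is a made-up three-entry word_to_idx, and encode_line is a hypothetical helper (not part of the original post) that uses dict.get to express the same fallback to <unk>.

```python
# Hypothetical mapping, shaped like word_to_idx from step 3
word_to_idx = {"the": 0, "cat": 1, "<unk>": 2}

def encode_line(text):
    """Map each whitespace-separated word to its id, using <unk> when missing."""
    unk_id = word_to_idx["<unk>"]
    # dict.get returns unk_id for any word that is not in the vocabulary
    return [word_to_idx.get(word, unk_id) for word in text.strip().split()]

encoded = [encode_line(t) for t in ["the cat saw the dog", "cat"]]
print(encoded)  # [[0, 1, 2, 0, 2], [1]]
```

Here "saw" and "dog" are not in the mapping, so both are encoded as 2, the id of <unk>.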