Basic Text Processing Skills

Latest recommended article published on 2024-04-12 13:26:36

Lquartz

Category: Study Notes. Tags: NLP, AI

Article link: https://blog.csdn.net/Lquartz/article/details/90216117


Contents

Basic Text Processing Skills
Language Model
- References

Basic Text Processing Skills

Basic Text Processing Pipeline

Chinese and English text share the same basic processing pipeline: segmentation, cleaning, normalization, feature extraction, and modeling.

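Characteristics of Chinese and English Preprocessing

Although the overall preprocessing pipeline is the same for Chinese and English, there are some essential differences. First, unlike English, Chinese text is not naturally split into words by spaces and punctuation, so a segmentation algorithm is needed to cut a passage into words. English has its own particular problems, such as spelling errors and lemmatization. Lemmatization is needed because an English word takes different forms in different contexts; these forms all represent the same word, but because the spelling changes they are treated as different words.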
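The effect of inflected forms on counting can be seen in a minimal sketch. The `LEMMAS` table and the example sentence below are hypothetical and hand-written for illustration; a real system would rely on a lemmatizer from an NLP library rather than a lookup table like this.

```python
from collections import Counter

# Hypothetical, hand-written lemma table (illustration only).
LEMMAS = {"runs": "run", "running": "run", "ran": "run"}

def normalize(tokens):
    """Lowercase each token and map known inflected forms to their lemma."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

tokens = "He runs and she ran while they were running".split()

raw_counts = Counter(tokens)               # each surface form counted separately
norm_counts = Counter(normalize(tokens))   # inflected forms merged into "run"

print(raw_counts["run"], norm_counts["run"])
```

Without normalization the three forms are counted as three unrelated words; after mapping them to one lemma they contribute to a single count.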

Text Preprocessing

The complete code is available on GitHub.

Reading the Text

We use the THUCNews subset from the previous post as the example.

ch_data_file = '../task1/cnews/cnews.train.txt'
with open(ch_data_file, 'r', encoding='utf-8') as f:
    ch_samples = [x.strip().split('\t') for x in f.readlines()]
    
print(len(ch_samples))
print(ch_samples[0])

Removing Non-Text Content

Regular expressions are used to strip the non-text parts of the data.

import re

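def filter_nontext(samples):
    # r1-r3 are alternative patterns kept for reference; only r4 is applied below.
    # r1 does not catch \\, \, or full-width Chinese parentheses （）, among others
    r1 = u'[a-zA-Z0-9’!"#$%&\'()*+,-./:;<=>?@，。?★、…【】《》？“”‘’！[\\]^_`{|}~]+'
    # Users can customize the filtered characters here; this rule is also incomplete
    r2 = r"[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+"
    # The backslashes match single and double backslashes, / matches single and double
    # forward slashes. The first bracket holds English symbols, the second holds Chinese
    # symbols; the | before the second bracket is required or filtering is incomplete.
    # Caution: "#-|" inside a character class is a range that also covers digits and letters.
    r3 =  "[.!//_,$&%^*()<>+\"'?@#-|:~{}]+|[——！\\\\，。=？、：“”‘’《》【】￥……（）]+" 
    # Remove brackets together with everything inside them, then leftover punctuation.
    # The original had "" here, which silently split the string; \" is assumed to be the intent.
    r4 =  "\\【.*?】+|\\《.*?》+|\\#.*?#+|[.!/_,$&%^*()<>+\"'?@|:~{}#]+|[——！\\\\，。=？、：“”‘’￥……（）《》【】]"

    html_tag = re.compile('<.*?>')  # compiled once, outside the loop
    clear_samples = []
    for sample in samples:
        sentence = sample[1]
        sentence = html_tag.sub(' ', sentence)  # strip HTML tags
        sentence = re.sub(r4, '', sentence)     # strip bracketed content and punctuation
        clear_samples.append([sample[0], sentence])
    return clear_samples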

clear_samples = filter_nontext(ch_samples)
print(len(clear_samples))
print(clear_samples[0])
print(clear_samples[1])
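To see the cleaning steps on a single string without the dataset, here is a small self-contained sketch. The patterns are deliberately simpler than the ones in `filter_nontext`, and the names `TAG_RE`, `BRACKET_RE`, `PUNCT_RE`, `clean`, and the sample string are made up for illustration.

```python
import re

TAG_RE = re.compile(r"<.*?>")              # non-greedy: matches one HTML tag at a time
BRACKET_RE = re.compile(r"【.*?】|《.*?》")   # a bracket pair together with its content
PUNCT_RE = re.compile(r"[，。！？、：,.!?:]+")  # a small set of Chinese and English punctuation

def clean(text):
    """Strip HTML tags, bracketed spans, and punctuation, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = BRACKET_RE.sub("", text)
    text = PUNCT_RE.sub("", text)
    return " ".join(text.split())

sample = "<p>【新闻】今天天气很好，适合学习！</p>"  # made-up example string
cleaned = clean(sample)
print(cleaned)
```

The non-greedy `.*?` matters: with a greedy `.*`, the tag pattern would swallow everything between the first `<` and the last `>`, including the text itself.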

Segmentation

English text can usually be tokenized directly with split().
Chinese text requires a dedicated segmentation algorithm, such as jieba.

import jieba 

def cut_samples(samples):
    new_samples = []
    for sample in samples:
        sentence = sample[1]
        sentence_seg = jieba.cut(sentence)
        new_samples.append([sample[0], list(sentence_seg)])  # keep the label, as filter_nontext does
    return new_samples

seg_samples = cut_samples(clear_samples)
print(len(seg_samples))
print(seg_samples[0])
print(seg_samples[1])
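To illustrate why Chinese needs a segmentation algorithm while English can rely on split(), the following sketch implements forward maximum matching, one of the simplest dictionary-based approaches. This is not how jieba works internally; the function name `forward_max_match` and the toy dictionary `VOCAB` are assumptions made for this example.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation.

    At each position take the longest substring (up to max_len characters)
    found in vocab; if none matches, emit a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

VOCAB = {"我", "喜欢", "深度", "学习", "深度学习"}  # hypothetical toy dictionary

english = "I love deep learning".split()            # whitespace is enough
chinese = forward_max_match("我喜欢深度学习", VOCAB)  # needs a dictionary lookup

print(english)
print(chinese)
```

Because the matcher is greedy, it prefers the longer entry 深度学习 over the two shorter words 深度 and 学习; the quality of the result depends entirely on the dictionary.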

Removing Stop Words

For a Chinese stop-word list, see the "Chinese stop words" resource; download the list and read it in.

# For English text, NLTK's built-in list can be used instead:
# from nltk.corpus import stopwords
# stop = set(stopwords.words('english'))
with open('ch_stopwords.txt', 'r', encoding='utf-8') as f:
    stop = [x.strip() for x in f.readlines()]
    stop = set(stop)
print(stop)

def filter_stopwords(samples):
    new_samples = []
    for sample in samples:
        sentence = sample[1]
        filter_sentence= [w for w in sentence if w not in stop]
        new_samples.append((sample[0], filter_sentence))  # keep the first field alongside the filtered tokens
    return new_samples

nostop_samples = filter_stopwords(seg_samples)
print(len(nostop_samples))
print(nostop_samples[0])
print(nostop_samples[1])

Word Frequency Statistics

from collections import Counter

def count(samples):
    cnt = Counter()
    for sample in samples:
        cnt.update(sample[1])  # in-place update; avoids building a new Counter per sample
    return cnt

cnt = count(nostop_samples)
print(cnt.most_common(100))

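Language Model

Simply put, a language model is a model that computes the probability of a sentence. An n-gram is a sequence of n consecutive words, and an n-gram language model is built on such sequences. n=1 is called a uni-gram, n=2 a bi-gram, and n=3 a tri-gram. Taking the text "I love deep learning" as an example:

Uni-gram: {I}, {love}, {deep}, {learning}
Bi-gram : {I, love}, {love, deep}, {deep, learning}
Tri-gram : {I, love, deep}, {love, deep, learning}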
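The lists above can be reproduced with a sliding window over the token sequence. This is a minimal sketch; the helper name `ngrams` is chosen for this example and is not from the original code.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, in order, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love deep learning".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sequence of m tokens yields m - n + 1 n-grams, so the four-word example gives three bi-grams and two tri-grams.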

Given a text sequence in n-gram representation $W=(w_1, w_2, \dots, w_n)$, where $w_i$ denotes the $i$-th unit of the sequence, the probability of the sentence $W$ can be written as:

$\begin{aligned} p(W)&=p(w_1,w_2,\dots,w_n) \\ &=p(w_1) \cdot p(w_2|w_1) \cdot p(w_3|w_1,w_2) \cdots p(w_n|w_1,w_2,\dots,w_{n-1}) \end{aligned}$
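Conditioning on the full history is hard to estimate from data, so a common simplification (an assumption added here, not stated in the original post) is to condition only on the previous word, $p(w_i|w_1,\dots,w_{i-1}) \approx p(w_i|w_{i-1})$, and to estimate each factor by a ratio of counts. The sketch below does this on a made-up three-sentence corpus; the `<s>` and `</s>` boundary markers and the function name `sentence_prob` are choices made for this example, and no smoothing is applied.

```python
from collections import Counter

corpus = [  # made-up toy corpus
    "I love deep learning",
    "I love NLP",
    "deep learning is fun",
]

unigram = Counter()
bigram = Counter()
for line in corpus:
    words = ["<s>"] + line.split() + ["</s>"]  # sentence boundary markers
    unigram.update(words)
    bigram.update(zip(words, words[1:]))

def sentence_prob(sentence):
    """Bi-gram maximum-likelihood estimate: product of count(prev, cur) / count(prev)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigram[prev] == 0:  # unseen context: avoid division by zero
            return 0.0
        prob *= bigram[(prev, cur)] / unigram[prev]
    return prob

print(sentence_prob("I love deep learning"))
print(sentence_prob("learning deep"))
```

Any sentence containing a bi-gram that never occurs in the corpus gets probability zero, which is why practical n-gram models add smoothing.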

References

Text Preprocessing Techniques Explained (文本预处理技术详解)
Language Model and word2vec
jieba Chinese word segmentation (结巴分词)
