NLP: Preprocessing Chinese and English Text

饭团爱吃饭

Last modified: 2024-04-24 17:31:39

Tags: natural language processing, artificial intelligence

First published: 2024-04-24 17:14:55

Article link: https://blog.csdn.net/weixin_48049326/article/details/138163780

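This post wraps the common preprocessing steps for Chinese and English text into a few functions. Calling them gives a quick way to clean raw text: strip URLs and emoji, segment the text into words, and drop stop words.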

Core processing functions

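# coding=utf-8
import re
import string                             # used to strip ASCII punctuation from English text

import jieba                              # Chinese word segmentation
# Libraries needed for English text
import nltk
nltk.download('stopwords')                # fetch the English stop-word list
nltk.download('punkt')                    # tokenizer data for word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize   # English tokenizer
from nltk.corpus import stopwords         # English stop words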

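#---------------------------------------------------(shared by the Chinese and English pipelines)
# Remove URLs from the raw string
def remove_urls(raw_sentence):
    # Match http(s):// links and bare www. links up to the next whitespace;
    # ordinary periods and colons in running text are left alone
    url_reg = r'(?:https?://|www\.)\S+'
    result = re.sub(url_reg, '', raw_sentence)
    return result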


# Remove emoji from the raw string
def remove_emoji(raw_sentence):
    # Matches every code point outside the Basic Multilingual Plane, which covers most emoji
    co = re.compile('[\U00010000-\U0010ffff]')
    return co.sub('', raw_sentence)


#------------------------------------------------------------------(helpers for Chinese text)
# Build the stop-word list
def stopwordslist(path):  # path to the stop-word file, one word per line
    with open(path, encoding='UTF-8') as f:
        words = [line.strip() for line in f]
    return words


# Segment Chinese text with jieba and drop stop words
def seg_depart_Chiness(raw_sentence, path):
    # Spaces are removed before segmentation
    sentence_depart = jieba.cut(raw_sentence.strip().replace(" ", ""))
    stop_words = set(stopwordslist(path))  # a set gives constant-time membership tests
    outstr_list = []
    for word in sentence_depart:
        if word not in stop_words:
            outstr_list.append(word)
    return outstr_list

#--------------------------------------------------------------------(helpers for English text)
# Tokenize English text and drop stop words
def seg_depart_English(raw_sentence):
    # Build a translation table from string.punctuation; the third argument lists characters to delete
    translator = str.maketrans('', '', string.punctuation)
    # Remove all ASCII punctuation from the string
    raw_sentence = raw_sentence.translate(translator)
    # Tokenize, then lowercase and filter out stop words
    sentence_depart = word_tokenize(raw_sentence)  # split the English text into tokens
    stop_words = set(stopwords.words('english'))   # load the English stop-word list
    outstr_list = [w.lower() for w in sentence_depart if w.lower() not in stop_words]
    return outstr_list
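
One caveat about `remove_emoji`: the character class `[\U00010000-\U0010ffff]` covers every code point outside the Basic Multilingual Plane, which is not quite the same thing as "emoji". The sketch below is a quick standard-library check of that behavior. The helper name `strip_supplementary` and the sample strings are illustrative and not part of the original code.

```python
import re

# Same character class as remove_emoji: everything above U+FFFF.
NON_BMP = re.compile('[\U00010000-\U0010ffff]')

def strip_supplementary(text):
    """Remove every code point above U+FFFF (hypothetical helper name)."""
    return NON_BMP.sub('', text)

# U+1F600 (grinning face) lies outside the BMP, so it is removed.
print(strip_supplementary('good\U0001F600'))

# U+2764 (heavy black heart) lies inside the BMP, so it survives.
print(strip_supplementary('love\u2764'))

# U+20BB7 is a CJK Extension B ideograph, not an emoji, but it is removed too.
print(strip_supplementary('\U00020BB7野家'))
```

If BMP symbols also need to go, or rare ideographs need to stay, the character class would have to be narrowed to explicit emoji ranges.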

Example: preprocessing Chinese text

# Path to the stop-word file
path = './stopWords.txt'
# Test
sentence = "大众、奥迪现在几乎全部的最新电子科技都用在奥迪身上了，A4L的配置新科技应用很明显…😀https://zhidao.autohome.com.cn/summary/692.html#pvareaid=3454447"
print("Original: " + sentence)
pro_sentence_1 = remove_urls(sentence)
print("URL removed: " + pro_sentence_1)
pro_sentence_2 = remove_emoji(pro_sentence_1)
print("Emoji removed: " + pro_sentence_2)
final_sentence = seg_depart_Chiness(pro_sentence_2, path)
print("Tokens: ", final_sentence)

Original: 大众、奥迪现在几乎全部的最新电子科技都用在奥迪身上了，A4L的配置新科技应用很明显…😀https://zhidao.autohome.com.cn/summary/692.html#pvareaid=3454447
URL removed: 大众、奥迪现在几乎全部的最新电子科技都用在奥迪身上了，A4L的配置新科技应用很明显…😀
Emoji removed: 大众、奥迪现在几乎全部的最新电子科技都用在奥迪身上了，A4L的配置新科技应用很明显…
Tokens:  ['大众', '奥迪', '最新', '电子科技', '奥迪', '身上', 'A4L', '配置', '新', '科技']
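
The example relies on a stop-word file at `./stopWords.txt`, which the post does not include. Judging from `stopwordslist`, it is a UTF-8 text file with one word per line, and since the Chinese path never strips punctuation explicitly, the punctuation marks are presumably listed in that file as well. A minimal sketch of the format follows; the entries are made up, a temporary file stands in for the real list, and `load_stopwords` is an illustrative variant rather than the author's function.

```python
import os
import tempfile

def load_stopwords(path):
    """Read one stop word per line, skipping blank lines (illustrative variant)."""
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

# Hypothetical entries; a real list would be much longer.
sample_words = ['的', '了', '都', '在', '，', '、']

# Write the sample list to a temporary file, one word per line.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('\n'.join(sample_words) + '\n')
    tmp_path = tmp.name

stop_words = load_stopwords(tmp_path)
os.remove(tmp_path)

# Filter a pre-segmented token list against the loaded stop words.
tokens = ['奥迪', '的', '配置', '，', '新', '科技']
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```

Which tokens survive depends entirely on the contents of the list, so results on the same sentence will differ from one stop-word file to another.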

Example: preprocessing English text

sentence = 'I LIKe CAT ?😀https://zhidao.autohome.com.cn/summary/692.html'
pro_sentence_1 = remove_urls(sentence)
print("URL removed: " + pro_sentence_1)
pro_sentence_2 = remove_emoji(pro_sentence_1)
print("Emoji removed: " + pro_sentence_2)
final_sentence = seg_depart_English(pro_sentence_2)
print("Tokens: ", final_sentence)

URL removed: I LIKe CAT ?😀
Emoji removed: I LIKe CAT ?
Tokens:  ['like', 'cat']
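
Because `seg_depart_English` deletes every character in `string.punctuation` before tokenizing, contractions and hyphenated words are fused rather than split. Note also that `string.punctuation` contains ASCII punctuation only. The check below uses just the standard library; the sample phrases and the space-substituting alternative are illustrative assumptions, not part of the original code.

```python
import string

# Same translation table as in seg_depart_English: delete ASCII punctuation.
translator = str.maketrans('', '', string.punctuation)

samples = ["don't stop", "state-of-the-art model", "e.g. U.S. prices"]
cleaned = [s.translate(translator) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))

# Alternative (an assumption, not the author's method): map each punctuation
# character to a space so that hyphenated words split into separate tokens.
spacer = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced_tokens = "state-of-the-art model".translate(spacer).split()
print(spaced_tokens)
```

Deleting punctuation keeps the token count low, while replacing it with spaces keeps the parts of compound words searchable; which is preferable depends on the downstream task.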

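The same functions can also be used to process a whole article; the details are in the downloadable code package that accompanies this post.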
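
The download is not reproduced here, so the following is only a sketch of how an article might be processed line by line. It uses a simplified, standard-library stand-in for the cleaning step (`clean_line`, a hypothetical name) and whitespace splitting as a placeholder tokenizer. In practice `remove_urls`, `remove_emoji` and one of the segmentation functions from this post would take their place.

```python
import re

URL_RE = re.compile(r'https?://\S+')               # assumed URL shape
NON_BMP_RE = re.compile('[\U00010000-\U0010ffff]')  # code points above U+FFFF

def clean_line(line):
    """Hypothetical stand-in for remove_urls followed by remove_emoji."""
    return NON_BMP_RE.sub('', URL_RE.sub('', line)).strip()

def preprocess_article(text, tokenize=str.split):
    """Clean and tokenize each non-empty line of a multi-line article."""
    results = []
    for line in text.splitlines():
        cleaned = clean_line(line)
        if cleaned:  # skip lines that are empty after cleaning
            results.append(tokenize(cleaned))
    return results

article = "I like cats \U0001F600\n\nSee https://example.com for more\n"
tokens_per_line = preprocess_article(article)
print(tokens_per_line)
```

Passing the tokenizer as an argument keeps the loop independent of language, so the same function could be called with a Chinese or an English segmenter.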
