Reference book: 胡盼盼, 《自然语言处理从入门到实战》
4.1.1 Text Normalization
Converting uppercase letters to lowercase:
# Input text
input_str = "The 5 biggest countries by population in 2019 are China, India, United States, Indonesia, and Brazil."
# Convert to lowercase
output_str = input_str.lower()
print(output_str)
# Output:
# the 5 biggest countries by population in 2019 are china, india, united states, indonesia, and
# brazil.
Handling digits: some digits contribute nothing to semantic understanding, such as list numbers and the numbering of headings, so they can be removed.
import re
# Input text
input_str = "The 5 biggest countries by population in 2019 are China, India, United States, Indonesia, and Brazil."
# Remove digits
output_str = re.sub(r"\d+", "", input_str)
print(output_str)
# Output:
# The biggest countries by population in are China, India, United States, Indonesia, and
# Brazil.
Handling punctuation:
import string
# Input text
input_str = "The 5 biggest countries by population in 2019 are China, India, United States, Indonesia, and Brazil."
# Remove punctuation (in Python 3, str.maketrans with three arguments builds a table
# that deletes every character in string.punctuation)
output_str = input_str.translate(str.maketrans("", "", string.punctuation))
print(output_str)
# Output:
# The 5 biggest countries by population in 2019 are China India United States Indonesia and
# Brazil
Handling whitespace: the original text often contains extra whitespace, for example at both ends of a sentence, between a heading and the body text, or between paragraphs, which needs to be removed.
# Input text
input_str = "\t The 5 biggest countries by population in 2019 are China, India, United States, Indonesia, and Brazil. \t"
# Strip leading and trailing whitespace
output_str = input_str.strip()
print(output_str)
# Output:
# The 5 biggest countries by population in 2019 are China, India, United States, Indonesia, and
# Brazil.
Stemming: for languages with inflected word forms, stemming extracts the stem of each word. It is mostly used in information retrieval to broaden search queries, and its granularity is relatively coarse.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# Stemming tool
stemmer = PorterStemmer()
# Input text
input_str = "There are several types of stemming algorithms"
# Tokenize, then stem each word
output_str = word_tokenize(input_str)
for word in output_str:
    print(stemmer.stem(word))
# Output:
# There are sever type of stem algorithm
Lemmatization: for languages with inflected word forms, lemmatization converts each word to its canonical (dictionary) form, for example converting "was" and "were" to "be". Lemmatization is mainly used in text mining and text analysis, and its granularity is relatively fine.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Lemmatization tool
lemmatizer = WordNetLemmatizer()
# Input text
input_str = "I had a dream"
# Tokenize, then lemmatize each word; pos="v" treats each word as a verb so that
# "had" is reduced to "have" (with the default noun POS, "had" would stay unchanged)
output_str = word_tokenize(input_str)
for word in output_str:
    print(lemmatizer.lemmatize(word, pos="v"))
# Output:
# I have a dream
Handling stop words:
To be filled in later: this step needs a stop word list; a few .txt files of commonly used Chinese stop words will be added here at a later time.
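In the meantime, here is a minimal sketch of stop word removal, assuming NLTK's built-in English stop word list (the Chinese stop word .txt files mentioned above are not included here, and the example sentence is only illustrative):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# English stop word list shipped with NLTK (requires nltk.download("stopwords"))
stop_words = set(stopwords.words("english"))
# Input text
input_str = "There is a tree near the river"
# Tokenize, then drop any token that appears in the stop word list
output_str = [word for word in word_tokenize(input_str) if word.lower() not in stop_words]
print(output_str)
# Output:
# ['tree', 'near', 'river']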