Reference book: 胡盼盼, 《自然语言处理从入门到实战》
4.1.1 Text Normalization
Converting uppercase letters to lowercase:
# Input text
input_str = "The 5 biggest countries by population in 2019 are China, India, United States, Indonesia, and Brazil."
# Convert to lowercase
output_str = input_str.lower()
print(output_str)
# Output:
# the 5 biggest countries by population in 2019 are china, india, united states, indonesia, and
# brazil.
Handling digits: some digits contribute nothing to semantic understanding, such as list numbers and the numbering of headings, so they can be removed.
import re
# Input text
input_str = "The 5 biggest countries by population in 2019 are China, India, United States, Indonesia, and Brazil."
# Remove digits
output_str = re.sub(r"\d+", "", input_str)
print(output_str)
# Output:
# The biggest countries by population in are China, India, United States, Indonesia, and
# Brazil.
Handling punctuation:
import string
# Input text
input_str = "The 5 biggest countries by population in 2019 are China, India, United States, Indonesia, and Brazil."
# Remove punctuation (in Python 3, str.maketrans with three arguments builds a table
# that deletes every character in string.punctuation)
output_str = input_str.translate(str.maketrans("", "", string.punctuation))
print(output_str)
# Output:
# The 5 biggest countries by population in 2019 are China India United States Indonesia and
# Brazil
Handling whitespace: the original text often contains extra whitespace, for example at both ends of a sentence, between a heading and the body text, or between paragraphs, which needs to be removed.
# Input text
input_str = "\t The 5 biggest countries by population in 2019 are China, India, United States, Indonesia, and Brazil. \t"
# Strip leading and trailing whitespace
output_str = input_str.strip()
print(output_str)
# Output:
# The 5 biggest countries by population in 2019 are China, India, United States, Indonesia, and
# Brazil.
Stemming: for languages with inflected word forms, stemming extracts the stem of each word. It is mostly used in information retrieval to broaden search queries, and its granularity is relatively coarse.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# Stemming tool
stemmer = PorterStemmer()
# Input text
input_str = "There are several types of stemming algorithms"
# Tokenize, then stem each word
output_str = word_tokenize(input_str)
for word in output_str:
    print(stemmer.stem(word))
# Output:
# There are sever type of stem algorithm
Lemmatization: for languages with inflected word forms, lemmatization converts each word to its canonical (dictionary) form, for example converting "was" and "were" to "be". Lemmatization is mainly used in text mining and text analysis, and its granularity is relatively fine.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Lemmatization tool
lemmatizer = WordNetLemmatizer()
# Input text
input_str = "I had a dream"
# Tokenize, then lemmatize each word; pos="v" treats each word as a verb so that
# "had" is reduced to "have" (with the default noun POS, "had" would stay unchanged)
output_str = word_tokenize(input_str)
for word in output_str:
    print(lemmatizer.lemmatize(word, pos="v"))
# Output:
# I have a dream
Handling stop words:
To be filled in later: this step needs a stop word list; a few .txt files of commonly used Chinese stop words will be added here at a later time.
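In the meantime, here is a minimal sketch of stop word removal, assuming NLTK's built-in English stop word list (the Chinese stop word .txt files mentioned above are not included here, and the example sentence is only illustrative):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# English stop word list shipped with NLTK (requires nltk.download("stopwords"))
stop_words = set(stopwords.words("english"))
# Input text
input_str = "There is a tree near the river"
# Tokenize, then drop any token that appears in the stop word list
output_str = [word for word in word_tokenize(input_str) if word.lower() not in stop_words]
print(output_str)
# Output:
# ['tree', 'near', 'river']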