文本的标准化处理主要涉及清除标点符号、统一大小写、数字的处理、扩展缩略词等文本的规范化操作
1.清除标点
import re
import string
from nltk import word_tokenize

text = """
I Love there things in this world.
Sun, Moon and You.
Sun for morning, Moon for
night, and You forever.
"""
# Tokenize the text into words and punctuation tokens
words = word_tokenize(text)
# ['I', 'Love', 'there', 'things', 'in', 'this', 'world', '.', 'Sun', ',', 'Moon', 'and', 'You', '.', 'Sun', 'for', 'morning', ',', 'Moon', 'for', 'night', ',', 'and', 'You', 'forever', '.']
# re.escape(s) returns s with every non-alphanumeric character backslash-escaped
# string.punctuation contains all ASCII (English) punctuation characters
regex_punctuation = re.compile('[%s]' % re.escape(string.punctuation))
# Strip punctuation from every token and drop tokens that become empty.
# Caveat: decimal points in numbers and separators inside names are removed too.
# NOTE: filter() is a lazy iterator in Python 3 — the original printed a
# <filter object>; build a list so the result can be printed and reused.
new_words = [w for w in (regex_punctuation.sub("", word) for word in words) if w != ""]
print(new_words)
# ['I', 'Love', 'there', 'things', 'in', 'this', 'world', 'Sun', 'Moon', 'and', 'You', 'Sun', 'for', 'morning', 'Moon', 'for', 'night', 'and', 'You', 'forever']
2.统一大小写
text = "I Love there things in this world. "
# str.lower()/str.upper() return NEW strings (Python strings are immutable);
# the original discarded the results, making both calls no-ops — bind them.
lower_text = text.lower()  # "i love there things in this world. "
upper_text = text.upper()  # "I LOVE THERE THINGS IN THIS WORLD. "
3.处理停用词
过滤掉大量出现又没有实际意义的词
from nltk.corpus import stopwords
# stopwords is a WordListCorpusReader instance; its words() method returns
# the stop-word list. Passing a language name (e.g. "english") restricts the
# result to that language; with no argument it returns the stop words of
# every language. Use the reader's fileids() method to see which languages
# ship with the local NLTK data.
# Build the English stop-word set for O(1) membership tests.
stop_words = set(stopwords.words("english"))
words = ['I', 'Love', 'there', 'things', 'in', 'this', 'world', 'Sun', 'Moon', 'and', 'You',
         'Sun', 'for', 'morning', 'Moon', 'for', 'night', 'and', 'You', 'forever']
# Keep only the tokens that do not appear in the stop-word set
new_words = list(filter(lambda token: token not in stop_words, words))
4.替换和校正标识符
# 对缩略词进行格式化,如将isn't替换为is not。一般在分词前进行替换,避免切分缩略词时出现问题
# Expand contractions (e.g. turn "isn't" into "is not"). Do this BEFORE
# tokenization so the tokenizer does not split contractions awkwardly.
import re

# (pattern, replacement) pairs, applied in order: specific whole-word forms
# first, then generic suffix rules. Replacements are raw strings so the
# backreference "\g<1>" stays literal — in a plain string "\g" is an invalid
# escape sequence (a SyntaxWarning on modern Python). The stray backslash in
# the original r"can\'t" pattern was also dropped.
replace_patterns = [
    (r"can't", "cannot"),
    (r"won't", "will not"),
    (r"i'm", "i am"),
    (r"isn't", "is not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
]


class RegexpReplacer(object):
    """Expand English contractions via an ordered list of regex rules."""

    def __init__(self, replace_patterns=replace_patterns):
        # Compile every pattern once so replace() can be called repeatedly.
        # (Attribute renamed from the misspelled "parrents".)
        self.patterns = [(re.compile(regex), repl) for regex, repl in replace_patterns]

    def replace(self, text):
        """Return *text* with every substitution rule applied, in order."""
        for pattern, repl in self.patterns:
            # Compiled patterns substitute directly; subn's count is unused.
            text = pattern.sub(repl, text)
        return text


replacer = RegexpReplacer()
text = "The hard part isn't making the decision. It's living with it."
print(replacer.replace(text))
# The hard part is not making the decision. It is living with it.
5.消除重复字符
# 如手抖将like写成了likeeee,需要规整为like
# 但是happy就是正常的单词不能处理, 这里我们借助语料库中wordnet来检查单词
from nltk.corpus import wordnet
class RepeatReplacer(object):
    """Collapse repeated characters until WordNet recognizes the word.

    Example: "likkeee" becomes "like", while a genuine word such as
    "happy" is returned untouched because WordNet already contains it.
    """

    def __init__(self):
        # A match means the word still contains a doubled character.
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # Substituting with this drops exactly one of the doubled characters.
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Iterative form of the original recursion: peel off one duplicate
        # per pass until WordNet knows the word (it has a synset) or no
        # duplicate characters are left to remove.
        while not wordnet.synsets(word):
            shorter = self.repeat_regexp.sub(self.repl, word)
            if shorter == word:
                return word
            word = shorter
        return word


replacer = RepeatReplacer()
print(replacer.replace("likkeee"))
# like
print(replacer.replace("happpyyy"))
# happy