A Quick Guide to NLTK for Natural Language Processing (Python 3)

算法黑哥

Category: Natural Language Processing. Tags: python, nltk, nlp

Article link: https://blog.csdn.net/weixin_41504611/article/details/103572211

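Installing the NLTK Toolkit

NLTK is a practical, long-established text-processing toolkit, used mainly for English data.

pip install nltk  # run in a terminal or command prompt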


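If a resource is missing, download it with nltk.download(). Running that call opens the NLTK downloader interface.

Select All Packages, then choose and download the tools you need.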
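Resources can also be downloaded by name instead of through the graphical downloader. The following is a minimal sketch: the helper name `download_resources` and the resource list are assumptions made for illustration, and the exact resource identifiers differ between NLTK versions (newer releases use names such as `punkt_tab`).

```python
import nltk

# Assumed resource names for the steps in this article;
# adjust them to match your NLTK version.
RESOURCES = ["punkt", "stopwords", "averaged_perceptron_tagger",
             "maxent_ne_chunker", "words"]

def download_resources(names=RESOURCES):
    """Download each named resource and return {name: True/False}."""
    return {name: bool(nltk.download(name, quiet=True)) for name in names}

# Usage (requires network access):
# status = download_resources()
# print(status)
```

A `False` value in the returned dictionary suggests that the name was not found or the download failed.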

Tokenization
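Tokenization can be sketched as follows. This is an illustrative example rather than the author's original code; the helper name `tokenize` and the sample sentence are assumptions. `word_tokenize` needs the Punkt tokenizer data, so the sketch falls back to the regex-based `wordpunct_tokenize` when that data is not installed.

```python
from nltk.tokenize import word_tokenize, wordpunct_tokenize

def tokenize(text):
    """Split text into word and punctuation tokens."""
    try:
        return word_tokenize(text)       # needs the Punkt tokenizer data
    except LookupError:
        return wordpunct_tokenize(text)  # regex-based fallback, no data needed

tokens = tokenize("NLTK makes text processing easy.")
print(tokens)  # ['NLTK', 'makes', 'text', 'processing', 'easy', '.']
```

The two tokenizers agree on this simple sentence but can differ on contractions and other punctuation-heavy input.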

The Text Object

help(nltk.text)

Create a Text object to make later operations easier.
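A minimal sketch of wrapping a token list in a `Text` object is shown here. The token list is made up for illustration; `count` and `index` are methods of `nltk.text.Text` and need no downloaded data.

```python
from nltk.text import Text

# A hand-written token list, used only for illustration
tokens = ["today", "is", "a", "good", "day", ",", "a", "very", "good", "day"]
text = Text(tokens)

print(len(text))           # number of tokens: 10
print(text.count("good"))  # occurrences of a word: 2
print(text.index("good"))  # position of the first occurrence: 3
```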

Stopwords

intersection computes the set intersection.
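One way to see which stopwords occur in a text is to intersect its tokens with the stopword list. In this sketch the helper names `english_stopwords` and `stopwords_in` are hypothetical, and loading the list requires the `stopwords` corpus to be downloaded first.

```python
from nltk.corpus import stopwords

def english_stopwords():
    """Return NLTK's English stopword list as a set (needs the 'stopwords' corpus)."""
    return set(stopwords.words("english"))

def stopwords_in(tokens, stop_set):
    """Return the stopwords that actually occur in tokens (set intersection)."""
    return {tok.lower() for tok in tokens}.intersection(stop_set)

# Usage (after downloading the 'stopwords' corpus):
# print(stopwords_in(["Today", "is", "a", "good", "day"], english_stopwords()))
```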

Filtering Out Stopwords
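Filtering can be sketched with a list comprehension that keeps only tokens outside the stopword set. The helper name `remove_stopwords` is an assumption; when no set is passed in, it loads NLTK's English list, which requires the `stopwords` corpus.

```python
from nltk.corpus import stopwords

def remove_stopwords(tokens, stop_set=None):
    """Return tokens with stopwords removed, preserving order."""
    if stop_set is None:
        # Needs the 'stopwords' corpus to be downloaded
        stop_set = set(stopwords.words("english"))
    # Compare in lowercase because NLTK's stopword list is lowercase
    return [tok for tok in tokens if tok.lower() not in stop_set]

# Usage (after downloading the 'stopwords' corpus):
# print(remove_stopwords(["Today", "is", "a", "good", "day"]))
```

Converting the list to a set makes each membership check constant time, which matters for longer texts.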

Part-of-Speech Tagging
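Part-of-speech tagging with `nltk.pos_tag` can be sketched as follows. The helper names `tag_tokens` and `words_with_tag` are hypothetical, and `pos_tag` needs the averaged perceptron tagger data to be downloaded before it can run.

```python
from nltk import pos_tag

def tag_tokens(tokens):
    """Return (word, tag) pairs; needs the averaged perceptron tagger data."""
    return pos_tag(tokens)

def words_with_tag(tagged, prefix):
    """Pick the words whose tag starts with prefix, e.g. 'NN' for nouns."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

# Usage (after downloading the tagger data):
# tagged = tag_tokens(["The", "dog", "barked"])
# print(words_with_tag(tagged, "NN"))
```

Matching on a tag prefix groups related tags, so "NN" covers NN, NNS, and NNP in one pass.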

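POS Tag	Meaning
CC	Coordinating conjunction
CD	Cardinal number
DT	Determiner
EX	Existential "there"
FW	Foreign word
IN	Preposition or subordinating conjunction
JJ	Adjective
JJR	Adjective, comparative
JJS	Adjective, superlative
LS	List item marker
MD	Modal verb
NN	Noun, singular
NNS	Noun, plural
NNP	Proper noun
PDT	Predeterminer
POS	Possessive ending
PRP	Personal pronoun
PRP$	Possessive pronoun
RB	Adverb
RBR	Adverb, comparative
RBS	Adverb, superlative
RP	Particle
UH	Interjection
VB	Verb, base form
VBD	Verb, past tense
VBG	Verb, gerund or present participle
VBN	Verb, past participle
VBP	Verb, non-3rd person singular present
VBZ	Verb, 3rd person singular present
WDT	Wh-determiner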

Chunking
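Chunking groups POS-tagged tokens into phrases according to a chunk grammar. The sketch below uses `RegexpParser` with a simple noun-phrase rule; the sentence is tagged by hand and the grammar is an assumption chosen for illustration, so no downloaded data is needed.

```python
from nltk.chunk import RegexpParser

# Hand-tagged sentence, so no tagger data is required
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# NP: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
tree = parser.parse(sentence)

# Collect the text of every NP subtree
noun_phrases = [" ".join(word for word, _tag in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(tree)
print(noun_phrases)  # ['the little yellow dog', 'the cat']
```

Calling `tree.draw()` opens a graphical view of the parse when a display is available.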

Named Entity Recognition
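Named entity recognition with `nltk.ne_chunk` can be sketched as below. The helper names `entities_from_tree` and `extract_entities` are hypothetical, and `extract_entities` needs the tokenizer, tagger, and named-entity chunker resources to be downloaded before it can run.

```python
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def entities_from_tree(tree):
    """Return (entity text, entity label) pairs from an ne_chunk-style tree."""
    return [(" ".join(word for word, _tag in node.leaves()), node.label())
            for node in tree if isinstance(node, Tree)]

def extract_entities(sentence):
    """Tokenize, tag, and chunk a sentence; needs downloaded NLTK resources."""
    return entities_from_tree(ne_chunk(pos_tag(word_tokenize(sentence))))

# Usage (after downloading the required resources):
# print(extract_entities("Edison went to Tsinghua University today."))
```

`ne_chunk` returns a tree whose top-level children are either plain (word, tag) tuples or subtrees labelled with an entity type, which is why the helper keeps only the `Tree` nodes.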

Data Cleaning Example

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Input data
s = '    RT @Amila #Test\nTom\'s newly listed Co  &amp; Mary\'s unlisted     Group to supply tech for nlTK.\nh $TSLA $AAPL https:// t.co/x34afsfQsh'

# Cache the English stopword list
cache_english_stopwords = stopwords.words('english')

def text_clean(text):
    print('Original text:', text, '\n')

    # Remove HTML entities (e.g. &amp;), hashtags, and @mentions
    text_no_special_entities = re.sub(r'\&\w*;|#\w*|@\w*', '', text)
    print('After removing special entities:', text_no_special_entities, '\n')

    # Remove ticker symbols (e.g. $TSLA)
    text_no_tickers = re.sub(r'\$\w*', '', text_no_special_entities)
    print('After removing ticker symbols:', text_no_tickers, '\n')

    # Remove hyperlinks
    text_no_hyperlinks = re.sub(r'https?:\/\/.*\/\w*', '', text_no_tickers)
    print('After removing hyperlinks:', text_no_hyperlinks, '\n')

    # Remove very short words (one or two characters), such as abbreviations
    text_no_small_words = re.sub(r'\b\w{1,2}\b', '', text_no_hyperlinks)
    print('After removing short words:', text_no_small_words, '\n')

    # Collapse repeated whitespace and strip leading spaces
    text_no_whitespace = re.sub(r'\s\s+', ' ', text_no_small_words)
    text_no_whitespace = text_no_whitespace.lstrip(' ')
    print('After removing extra whitespace:', text_no_whitespace, '\n')

    # Tokenize
    tokens = word_tokenize(text_no_whitespace)
    print('Tokens:', tokens, '\n')

    # Remove stopwords (the comparison is case-sensitive)
    list_no_stopwords = [i for i in tokens if i not in cache_english_stopwords]
    print('After removing stopwords:', list_no_stopwords, '\n')

    # Join the remaining tokens back into a string
    text_filtered = ' '.join(list_no_stopwords)  # ''.join() would join without spaces between words.
    print('Filtered text:', text_filtered)
    return text_filtered

text_clean(s)
