Common NLTK Modules for Tokenizing English Text

First published 2024-09-10 10:34:13 · Last modified 2025-04-10 20:04:28

License: CC 4.0 BY-SA

Tags: #nlp #python #natural-language-processing

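NLTK is commonly used for working with corpora, classifying text, and analyzing linguistic structure.

https://www.nltk.org/ (the official NLTK site, which includes tutorials)

At the time of writing, NLTK supported Python 3.7 and later. Newer releases may require a more recent Python version, so check the installation page on the official site.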

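Installation takes two steps:

(1) pip install nltk

(2) Download the NLTK data package from Gitee and place it in one of the directories NLTK searches.

nltk.find('.')  # shows the directories NLTK searches when looking for resources (the same list is available as nltk.data.path)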
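
After a manual download, the data is only useful if it sits under one of the directories NLTK searches. The sketch below gives a simplified, standard-library-only picture of that check. `locate_resource` and the directory names are hypothetical, and NLTK's real lookup (which also handles zip archives) is more involved. In practice the list of directories would come from `nltk.data.path`.

```python
import os
import tempfile


def locate_resource(search_dirs, relative_name):
    """Return the first existing path for relative_name under search_dirs, else None.

    Hypothetical helper: it only checks unpacked files and directories.
    """
    for base in search_dirs:
        candidate = os.path.join(base, *relative_name.split("/"))
        if os.path.exists(candidate):
            return candidate
    return None


# Demo with a throwaway directory standing in for an nltk_data folder.
with tempfile.TemporaryDirectory() as fake_nltk_data:
    os.makedirs(os.path.join(fake_nltk_data, "tokenizers", "punkt"))
    found = locate_resource(["/nonexistent/nltk_data", fake_nltk_data], "tokenizers/punkt")
    missing = locate_resource([fake_nltk_data], "corpora/stopwords")
    found_ok = found is not None and found.endswith("punkt")
    print(found_ok, missing)
```

If a resource comes back as missing, the usual cause is that the downloaded package was unpacked one directory level too deep or too shallow.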

1. Sentence segmentation module:

import nltk
from nltk.tokenize import sent_tokenize  # sentence tokenizer for English

# Text to split into sentences
paragraph = 'You must follow me carefully. I shall have to controvert one or two ideas that are almost universally accepted. The geometry, for instance, they taught you at school is founded on a misconception.'

tokenized_text = sent_tokenize(paragraph)
print(tokenized_text)

Output of tokenized_text:
['You must follow me carefully.', 'I shall have to controvert one or two ideas that are almost universally accepted.', 'The geometry, for instance, they taught you at school is founded on a misconception.']
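
To see why `sent_tokenize` needs the downloaded data (it relies on the pre-trained Punkt model), it can help to compare it with a naive rule. The following is only a sketch using the standard library. `naive_sent_split` is a hypothetical helper, not part of NLTK, and it simply splits after `.`, `!` or `?` when whitespace follows.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


paragraph = ("You must follow me carefully. I shall have to controvert one or two "
             "ideas that are almost universally accepted.")
print(naive_sent_split(paragraph))

# The heuristic wrongly splits after the abbreviation "Dr."
print(naive_sent_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

The naive rule works on the simple paragraph but breaks on abbreviations, which is the kind of case a trained sentence tokenizer is meant to handle.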

2. Word tokenization module:

from nltk import word_tokenize  # import the word tokenizer


text = 'You must follow me carefully.'
tokenized_word = word_tokenize(text)
print(tokenized_word)

Output of tokenized_word:
['You', 'must', 'follow', 'me', 'carefully', '.']
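
As a rough illustration of what word tokenization does, here is a standard-library sketch that separates runs of word characters from individual punctuation marks. `simple_word_tokenize` is a hypothetical helper and is not the algorithm `word_tokenize` uses. It happens to give the same result on the sentence above but differs on contractions.

```python
import re


def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_word_tokenize("You must follow me carefully."))
# ['You', 'must', 'follow', 'me', 'carefully', '.']

print(simple_word_tokenize("Don't panic."))
# ['Don', "'", 't', 'panic', '.']
```

For the second sentence, NLTK's `word_tokenize` keeps the contraction in a more useful form (`'Do'`, `"n't"`), which is one reason to prefer it over a plain regular expression.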

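3. Removing punctuation from text:

import string   # Python's built-in string module, which provides the English punctuation set

punctuation = string.punctuation  # English punctuation characters
text = 'You must follow me carefully.'  # text to process

# Build the translation table: mapping each punctuation mark to a space effectively removes the punctuation
# str.translate() replaces each character according to that table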
text_1 = text.translate(str.maketrans(punctuation, ' ' * len(punct