A Complete Tutorial on Text Preprocessing in NLP

CRMEB定制开发

Category: Study Notes. Tags: nlp, text preprocessing

Article link: https://blog.csdn.net/qq_39221436/article/details/124244361


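Implementing Text Preprocessing
In the Python code below, we remove noise from the raw text of a Twitter sentiment analysis dataset. After that, we remove stop words and apply stemming and lemmatization.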
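To make those three later steps concrete, here is a toy sketch that uses only the standard library. The stop-word set, the suffix list, and the lemma table are tiny hand-written assumptions for illustration only. They are not NLTK's data, and real code would more likely rely on the NLTK `stopwords` corpus, `LancasterStemmer`, and `WordNetLemmatizer` that this article imports.

```python
import re

# Illustrative assumptions: tiny hand-written resources, not NLTK's corpora.
TOY_STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were", "and", "on", "in", "of", "to"}
TOY_SUFFIXES = ("ing", "ed", "s")   # crude suffix list for the toy stemmer
TOY_LEMMAS = {"mice": "mouse", "ran": "run", "better": "good"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the toy stop-word set."""
    return [t for t in tokens if t not in TOY_STOP_WORDS]

def toy_stem(token):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def toy_lemmatize(token):
    """Look the token up in the toy lemma table; unknown tokens pass through."""
    return TOY_LEMMAS.get(token, token)

tokens = tokenize("The mice were running on the tables")
content = remove_stop_words(tokens)
print(content)
print([toy_stem(t) for t in content])
print([toy_lemmatize(t) for t in content])
```

The toy stemmer turns "running" into the non-word "runn", while the lemma lookup maps "mice" to the dictionary form "mouse". That is the practical difference between the two steps: stemming cuts strings by rule, lemmatization maps to a known base form.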

Import all the dependencies.

# Notebook shell command; outside a notebook, run "pip install contractions" in a terminal
! pip install contractions
import nltk
import contractions
import inflect
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from bs4 import BeautifulSoup
import re, string, unicodedata

Removing noise.

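The first step is to remove noise from the data. In text, noise means anything that is not part of the human-language content. It takes many forms, such as special characters, parentheses, square brackets, extra whitespace, URLs, and punctuation.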
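As a rough illustration of the character-level noise listed above, the following standard-library sketch drops square-bracketed spans, strips punctuation, and collapses whitespace. The function names and the sample string are assumptions made for this example, and so is the choice to delete the content inside square brackets. URLs and HTML are deliberately left out here, since they need to be handled before punctuation is stripped.

```python
import re
import string

def remove_bracketed(text):
    """Drop square-bracketed spans such as "[photo]" together with their content."""
    return re.sub(r"\[[^\]]*\]", "", text)

def remove_punctuation(text):
    """Delete every ASCII punctuation character, including parentheses and '#'."""
    return text.translate(str.maketrans("", "", string.punctuation))

def collapse_whitespace(text):
    """Replace runs of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

def denoise(text):
    # Brackets go first: once punctuation is stripped, "[" and "]" are gone
    # and the bracketed span can no longer be recognized.
    text = remove_bracketed(text)
    text = remove_punctuation(text)
    return collapse_whitespace(text)

sample = "Great   phone!! [photo] (really)  #happy"
print(denoise(sample))
```

The order of the steps matters: whitespace is collapsed last because each earlier removal can leave extra gaps behind.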

Below is the sample text we are working with.

[Image: sample text]

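As you can see, the text contains many HTML tags and a URL, and both need to be removed. We use BeautifulSoup for the tags and a regular expression for the URL. The snippet below removes both.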

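# Remove HTML tags and keep only the visible text
def html_remover(data):
  beauti = BeautifulSoup(data, 'html.parser')
  return beauti.get_text()

# Remove URLs (http or https) up to the next whitespace
def url_remover(data):
  return re.sub(r'https?://\S+', '', data)

# Strip HTML first, then URLs
def web_associated(data):
  text = html_remover(data)
  # The source is cut off after the line above; the next two lines are an assumed completion
  text = url_remover(text)
  return text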
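For readers who want to try the same two steps without installing BeautifulSoup, here is a minimal sketch that swaps in the standard library's `html.parser.HTMLParser`. It is an alternative technique, not the article's method; the names `_TextCollector`, `strip_html`, and `strip_urls` and the sample tweet are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect only the text nodes of a document, skipping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(data):
    """Return the text content of an HTML fragment."""
    collector = _TextCollector()
    collector.feed(data)
    collector.close()  # flush any text the parser is still buffering
    return "".join(collector.parts)

def strip_urls(data):
    """Remove http/https URLs up to the next whitespace character."""
    return re.sub(r"https?://\S+", "", data)

sample = "<p>Loving the <b>new</b> update! https://t.co/abc123 </p>"
cleaned = strip_urls(strip_html(sample))
print(cleaned.strip())
```

Stripping the HTML before the URLs is the safer order: a closing tag that directly follows a URL contains no whitespace, so the `\S+` pattern would otherwise swallow part of the tag.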
