Data Cleaning Example

Latest recommended article published on 2024-05-18 15:37:01

不写代码的程序员~zs

Category: Natural Language Processing. Tags: python, natural language processing, knowledge graph, deep learning, nlp

Article link: https://blog.csdn.net/m0_57064565/article/details/119446368

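In natural language processing, a newly obtained dataset usually cannot be used as is. It first has to be preprocessed into the form the task needs.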

The basic data cleaning operations are walked through below:

Code:

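import re

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Both NLTK resources must be downloaded once beforehand, e.g.
# nltk.download('stopwords') and nltk.download('punkt')
# (recent NLTK releases ask for 'punkt_tab' instead of 'punkt').
s = '     RT @Amila #Test\nTom\'s newly listed Co &amp; Mary\'s unlisted   Group to supply tech for nlTK.\nh $TSLA $AAPL http://t.co/x3wu2u32ush'
cache_english_stopwords = stopwords.words('english')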

print("Original text:", s, '\n')

# Remove HTML entities (&amp;), hashtags (#Test) and @mentions
text_no_special_entities = re.sub(r'&\w*;|#\w*|@\w*', '', s)
print("After removing entities, hashtags and mentions:", text_no_special_entities, '\n')

# Remove stock ticker symbols such as $TSLA
text_no_tickers = re.sub(r'\$\w*', '', text_no_special_entities)
print("After removing tickers:", text_no_tickers, '\n')

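# Remove hyperlinks. Note that .* is greedy: on a line with several URLs,
# the text between the first and the last one is removed as well.
text_no_hyperlinks = re.sub(r'https?:\/\/.*\/\w*', '', text_no_tickers)
print("After removing hyperlinks:", text_no_hyperlinks, '\n')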

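# Remove words of one or two characters (this also strips the "s" of
# possessives such as Tom's, because the apostrophe is a word boundary)
text_no_small_words = re.sub(r'\b\w{1,2}\b', '', text_no_hyperlinks)
print("After removing short words:", text_no_small_words, '\n')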

# Collapse runs of whitespace into a single space and trim the ends
text_no_whitespace = re.sub(r'\s\s+', ' ', text_no_small_words)
text_no_whitespace = text_no_whitespace.strip(' ')
print("After removing extra whitespace:", text_no_whitespace, '\n')

# Split the cleaned text into tokens
tokens = word_tokenize(text_no_whitespace)
print("Tokens:", tokens, '\n')

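# Remove stop words (the comparison is an exact, case-sensitive match)
list_no_stopwords = [token for token in tokens if token not in cache_english_stopwords]
print("After removing stop words:", list_no_stopwords, '\n')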

# Join the remaining tokens back into one string
text_filtered = ' '.join(list_no_stopwords)
print('Filtered text:', text_filtered)
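
The regex steps above can also be collected into one helper, which makes the order of the substitutions explicit and easy to test. The sketch below is a minimal version under a few assumptions: the name `clean_tweet` is hypothetical, tokenization and stop-word removal are left out so that it runs with the standard library alone, and the URL pattern `https?://\S+` is swapped in for the one used in the script so that a match stops at the first whitespace.

```python
import re

# Patterns applied in order. Apart from the URL pattern (a swapped-in
# assumption), they mirror the substitutions in the script above.
_PATTERNS = [
    (re.compile(r'&\w*;|#\w*|@\w*'), ''),   # HTML entities, hashtags, @mentions
    (re.compile(r'\$\w*'), ''),             # stock tickers such as $TSLA
    (re.compile(r'https?://\S+'), ''),      # URLs, up to the next whitespace
    (re.compile(r'\b\w{1,2}\b'), ''),       # words of one or two characters
    (re.compile(r'\s\s+'), ' '),            # runs of whitespace -> one space
]


def clean_tweet(text):
    """Apply each substitution in turn, then trim surrounding spaces."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text.strip(' ')


sample = ("     RT @Amila #Test\nTom's newly listed Co &amp; Mary's unlisted   "
          "Group to supply tech for nlTK.\nh $TSLA $AAPL http://t.co/x3wu2u32ush")
cleaned = clean_tweet(sample)
print(cleaned)

# Two URLs on one line: the text between them survives.
two_links = clean_tweet("see http://t.co/abc and http://t.co/def now")
print(two_links)
```

For the sample string this prints `Tom' newly listed Mary' unlisted Group supply tech for nlTK.`, which shows the side effect of the short-word step on possessives. If that is unwanted, one option is to drop the short-word substitution and rely on stop-word removal instead.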

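
Stop-word removal in the script compares each token with the list exactly, and the NLTK English list is lowercase, so a capitalized token such as "The" passes through. A minimal sketch of a case-insensitive filter follows. The names `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical, and the small hand-written set only stands in for `stopwords.words('english')` so that the example runs without the NLTK data.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is longer.
SAMPLE_STOPWORDS = {'the', 'for', 'to', 'a', 'of', 'and'}


def remove_stopwords(tokens, stop_words=SAMPLE_STOPWORDS):
    """Drop tokens whose lowercase form is in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ['The', 'Group', 'will', 'supply', 'tech', 'for', 'nlTK']
filtered = remove_stopwords(tokens)
print(filtered)
```

Using a set rather than a list also makes each membership check constant time, which matters once the input grows beyond a single sentence.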
