A Walkthrough of Cleaning Text Data with Python
Data does not always come in tabular form. As we enter the era of big data, data arrives in increasingly diverse formats, including images, text, graphs, and more.
Because these formats vary so much from one dataset to the next, preprocessing the data into a machine-readable form is an essential step.
In this article, I want to show you how to preprocess text data using Python. As mentioned in the title, all you need are the NLTK and re libraries.
The Process in Detail
Lowercase the Text
Before we start processing the text, it is best to lowercase all characters first. We do this to avoid any case-sensitivity issues later in the pipeline.
Suppose we want to remove stop words from a string, and the technique we use is to keep the non-stop words and join them back into one sentence. If we have not lowercased the text first, capitalized stop words will not be detected, and we end up with essentially the same string. That is why lowercasing is crucial; the short snippet below demonstrates the pitfall.
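To see the pitfall concretely, here is a minimal sketch (assuming the NLTK stopwords corpus, introduced in a later section, is already downloaded), showing that the NLTK list only contains lowercase entries:

from nltk.corpus import stopwords
stop_words = stopwords.words("english")
# The list only contains lowercase entries, so "The" slips through
print("the" in stop_words)
>>> True
print("The" in stop_words)
>>> False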
Doing this in Python is easy. The code looks like this:
# Example
x = "Watch This Airport Get Swallowed Up By A Sandstorm In Under A Minute http://t.co/TvYQczGJdy"
# Lowercase the text
x = x.lower()
print(x)
>>> watch this airport get swallowed up by a sandstorm in under a minute http://t.co/tvyqczgjdy
Remove Unicode Characters
Some tweets may contain Unicode characters that become unreadable when rendered in ASCII. Mostly, these characters are used for emoji and other non-ASCII symbols. To remove them, we can use code like the following:
# Example
x = "Reddit Will Now QuarantineÛ_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP"
# Remove unicode characters
x = x.encode('ascii', 'ignore').decode()
print(x)
>>> Reddit Will Now Quarantine_ http://t.co/pkUAMXw6pm #onlinecommunities #reddit #amageddon #freespeech #Business http://t.co/PAWvNJ4sAP
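Note that encode('ascii', 'ignore') simply drops every non-ASCII character. If you would rather keep accented letters by mapping them to their closest ASCII equivalents first, one alternative sketch uses the standard library's unicodedata module (an extra option, not part of the original pipeline):

import unicodedata
x = "café résumé naïve"
# Decompose accented characters, then drop only the combining marks
x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode()
print(x)
>>> cafe resume naive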
Remove Stop Words
After that, we can remove the words that are stop words. A stop word is a word that makes no significant contribution to the meaning of a text, so we can safely drop it. To retrieve the stop word list, we can download the corpus from the NLTK library. Here is the code to do that:
import nltk
from nltk.corpus import stopwords
# Just download all NLTK data (this includes the stopwords corpus)
nltk.download()
stop_words = stopwords.words("english")
# Example
x = "America like South Africa is a traumatised sick country - in different ways of course - but still messed up."
# Remove stop words
x = ' '.join([word for word in x.split(' ') if word not in stop_words])
print(x)
>>> America like South Africa traumatised sick country - different ways course - still messed up.
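The default English list will not cover dataset-specific filler words. Since stop_words is a plain Python list, you can extend it yourself; the extra words below are just illustrative assumptions for Twitter data:

stop_words = stopwords.words("english")
# Hypothetical Twitter-specific fillers, added for illustration
stop_words.extend(["rt", "via"])
x = "rt America like South Africa is messed up via someone"
x = ' '.join([word for word in x.split(' ') if word not in stop_words])
print(x)
>>> America like South Africa messed someone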
Remove Mentions, Hashtags, Links, and Similar Terms
Besides removing Unicode characters and stop words, there are several terms we should remove, including mentions, hashtags, links, punctuation, and so on.
Removing these would be a challenge if we relied only on predefined, fixed characters. Therefore, we need a technique called regular expressions (regex) to match the terms we want.
A regular expression is a special string that describes a pattern; it can match any word associated with that pattern. Using the Python library called re, we can search for those patterns or remove them. The examples below show each operation:
import re
# Remove mentions
x = "@DDNewsLive @NitishKumar and @ArvindKejriwal can't survive without referring @@narendramodi . Without Mr Modi they are BIG ZEROS"
x = re.sub("@\S+", " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS
# Remove URLs
x = "Severe Thunderstorm pictures from across the Mid-South http://t.co/UZWLgJQzNS"
x = re.sub(r"https*\S+", " ", x)
print(x)
>>> Severe Thunderstorm pictures from across the Mid-South
# Remove Hashtags
x = "Are people not concerned that after #SLAB's obliteration in Scotland #Labour UK is ripping itself apart over #Labourleadership contest?"
x = re.sub("#\S+", " ", x)
print(x)
>>> Are people not concerned that after obliteration in Scotland UK is ripping itself apart over contest?
# Remove apostrophes and the word characters that follow them
x = "Notley's tactful yet very direct response to Harper's attack on Alberta's gov't. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli"
x = re.sub(r"\'\w+", '', x)
print(x)
>>> Notley tactful yet very direct response to Harper attack on Alberta gov. Hell YEAH Premier! http://t.co/rzSUlzMOkX #ableg #cdnpoli
# Remove punctuation
import string
x = "In 2014 I will only smoke crqck if I becyme a mayor. This includes Foursquare."
x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
print(x)
>>> In 2014 I will only smoke crqck if I becyme a mayor  This includes Foursquare
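As a side note, punctuation can also be stripped without regex: the standard library's str.translate with a table that maps every punctuation character to a space behaves the same as the re.sub call above (a sketch, not part of the original code):

import string
x = "In 2014 I will only smoke crqck if I becyme a mayor. This includes Foursquare."
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
print(x.translate(table))
>>> In 2014 I will only smoke crqck if I becyme a mayor  This includes Foursquare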
# Remove numbers (and any word characters attached to them)
x = "C-130 specially modified to land in a stadium and rescue hostages in Iran in 1980... http://t.co/tNI92fea3u http://t.co/czBaMzq3gL"
x = re.sub(r'\w*\d+\w*', '', x)
print(x)
>>> C- specially modified to land in a stadium and rescue hostages in Iran in ... http://t.co/ http://t.co/
# Collapse runs of whitespace into a single space
x = " and can't survive without referring . Without Mr Modi they are BIG ZEROS"
x = re.sub(r'\s{2,}', " ", x)
print(x)
>>> and can't survive without referring . Without Mr Modi they are BIG ZEROS
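When cleaning many tweets, it can help to compile these patterns once and apply them in a fixed order. Here is a minimal consolidation sketch; strip_twitter_terms is a hypothetical helper name, and the patterns are the same ones used above:

import re
PATTERNS = [
    (re.compile(r'@\S+'), ' '),       # mentions
    (re.compile(r'https*\S+'), ' '),  # URLs
    (re.compile(r'#\S+'), ' '),       # hashtags
]
def strip_twitter_terms(text):
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    # Collapse the spaces left behind
    return re.sub(r'\s{2,}', ' ', text).strip()
print(strip_twitter_terms("@user check https://t.co/abc #tag now"))
>>> check now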
Putting It All Together
Now that we have seen each preprocessing step, let's apply the whole process to a list of texts. If you look closely at the steps, you will notice that each method is related to the others. Therefore, we have to wrap them in a single function so we can apply all the steps to each text in sequence. Before we apply the preprocessing steps, here is a preview of the sample text:
Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Forest fire near La Ronge Sask. Canada
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
13,000 people receive #wildfires evacuation orders in California
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
To preprocess the list of texts, we take a few steps, in order:
1. Create a function that contains all of the preprocessing steps and returns the preprocessed string;
2. Apply the function to every entry of the list using the pandas method called apply.
The code looks like this:
# # In case of import errors
# ! pip install nltk
# ! pip install textblob
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# # In case any corpus is missing,
# # download all NLTK data
nltk.download()
df = pd.read_csv('train.csv')
stop_words = stopwords.words("english")
wordnet = WordNetLemmatizer()
def text_preproc(x):
    # Lowercase everything
    x = x.lower()
    # Remove stop words
    x = ' '.join([word for word in x.split(' ') if word not in stop_words])
    # Remove unicode characters
    x = x.encode('ascii', 'ignore').decode()
    # Remove URLs
    x = re.sub(r'https*\S+', ' ', x)
    # Remove mentions
    x = re.sub(r'@\S+', ' ', x)
    # Remove hashtags
    x = re.sub(r'#\S+', ' ', x)
    # Remove apostrophes and the characters that follow them
    x = re.sub(r'\'\w+', '', x)
    # Remove punctuation
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
    # Remove words containing numbers
    x = re.sub(r'\w*\d+\w*', '', x)
    # Collapse runs of whitespace
    x = re.sub(r'\s{2,}', ' ', x)
    return x
df['clean_text'] = df.text.apply(text_preproc)
The results look like this:
deeds reason may allah forgive us
forest fire near la ronge sask canada
residents asked place notified officers evacuation shelter place orders expected
people receive evacuation orders california
got sent photo ruby smoke pours school
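One loose end: the script above initializes WordNetLemmatizer but never calls it. If you also want to reduce words to their dictionary form, here is a hedged sketch of how it could be wired in, for example as a final step inside text_preproc (this assumes the wordnet corpus, which nltk.download() covers):

from nltk.stem import WordNetLemmatizer
wordnet = WordNetLemmatizer()
x = "forest fires near schools"
# Lemmatize each token (by default, words are treated as nouns)
x = ' '.join([wordnet.lemmatize(word) for word in x.split()])
print(x)
>>> forest fire near school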
Final Thoughts
That is how you preprocess text data with Python. I hope you can apply it to solve problems related to text data.