From Word Blobs to Context: 5 Steps in NLP and Text Cleaning

You don’t have to speak or write a lot to get your point across.

Many people appreciate this direct approach, and it’s the same with computers when it comes to performing analytics. One of the goals of data cleaning is to feed the computer clear and concise information for the most meaningful results. This is why empty values are dropped, repetitive data is corrected, transformations are applied, and other techniques are performed to arrive at a refined version of the data.

Text data is no different, and in this article, I’ll be discussing natural language processing (NLP). I’ll be going over the text-cleaning process with data from a recent Twitter sentiment-analysis project of mine, using NLTK, a Python library for working with human-language data.

NLP: Natural Language Processing

Sentiment analysis through NLP has become a popular method of text analysis for research or business purposes. Use case examples include analyzing customer reviews on certain products, visualizing political polarity in a forum or comment thread, or opinion mining some social media feed.

Insight like this is extremely useful because it gives businesses the intelligence to make efficient decisions and helps social science researchers study human behavior through language.

Just like numerical data, however, text data has to be processed, or cleaned, before conducting any analytics. Python is a great language for this, with libraries like NLTK making the text-cleaning process smooth and easy.

NLTK, or the Natural Language Toolkit, is a Python library for text-processing techniques like stemming, tokenization, classification, and more. We’ll use it to go over five common NLP practices:

  • Lowercase conversion
  • Tokenization
  • Punctuation removal
  • Stop word removal
  • Stemming and lemmatization
Photo by Morning Brew on Unsplash

Sentiment on a COVID-19 Vaccine

We’ll be doing text cleaning on a data set of tweets regarding sentiment on a COVID-19 vaccine. These tweets were scraped by TwitterScraper, a Python library for scraping tweets based on keywords and hashtags.

If it’s your first time using NLTK, then you’ll need to download it into your virtual environment or notebook. See the following cell:

Uncomment ‘nltk.download()’ and run the code
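
The original code cell isn’t embedded here, so below is a minimal sketch of it; the targeted packages (‘punkt’, ‘stopwords’, ‘wordnet’) are my assumption based on the steps that follow:

```python
import nltk

# Uncomment on first use. nltk.download() opens the interactive downloader;
# the targeted calls fetch only what this article's steps rely on.
# nltk.download()
# nltk.download('punkt')      # tokenizer models
# nltk.download('stopwords')  # stop-word lists
# nltk.download('wordnet')    # dictionary used for lemmatization
```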

Convert to lowercase

“Converting all your data to lowercase helps in the process of pre-processing and in later stages in the NLP application, when you are doing parsing.”

Jalaj Thanaki via O’Reilly

This is the first step in our NLP process. Remember that the whole purpose of text processing is to make the data easier for the computer to understand. Converting our text data to lowercase helps toward that goal.

Convert text and hashtag columns’ data to lowercase

Using the string library and list comprehension, I convert the data within the text and hashtag columns using the .lower() function. Note that I have to loop through the hashtag column twice because it consists of lists within lists — or, in other words, it’s a nested list.
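
A sketch of that step, assuming the tweets live in a pandas DataFrame named df with text and hashtag columns (the variable and column names are illustrative):

```python
# Lowercase every tweet in the text column
df['text'] = [tweet.lower() for tweet in df['text']]

# Each cell in the hashtag column is itself a list, hence the second loop
df['hashtag'] = [[tag.lower() for tag in tags] for tags in df['hashtag']]
```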

Tokenization

“As tokens are the building blocks of Natural Language, the most common way of processing the raw text happens at the token level. For example, Transformer based models — the State of The Art (SOTA) Deep Learning architectures in NLP — process the raw text at the token level.”

Aravind Pai via Analytics Vidhya

Tokenization is simply breaking up our text data word by word or sentence by sentence. This further translates the data into a more computer-friendly syntax.

Word and sentence tokenization

For our tweet data, we’ll use NLTK’s tokenize class and implement the word_tokenize() function on our text column, breaking the text up word by word. I’ve also used the sent_tokenize() function to demonstrate sentence tokenization.
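
Roughly, and continuing with the assumed df from above:

```python
from nltk.tokenize import word_tokenize, sent_tokenize

# Break each tweet into a list of individual word tokens
df['text'] = [word_tokenize(tweet) for tweet in df['text']]

# Sentence tokenization, shown on a sample string for comparison
sample = "the trials are promising. a vaccine may arrive soon."
print(sent_tokenize(sample))
# ['the trials are promising.', 'a vaccine may arrive soon.']
```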

Remove punctuation

Photo by engin akyurt on Unsplash

Removing punctuation from our text data ultimately depends on the purpose of the model that the data is based on. In this case, for clustering tweets based on COVID-19 vaccine sentiment, punctuation wouldn’t serve much of a purpose, so we’ll remove it.

Using the re (regex) library, we remove all punctuation

Using the re library for pattern recognition, we create a regex object that’ll search for all forms of punctuation and nonalphanumeric terms. We then call the sub() function on our object to check whether each token (the individual elements/words from our word tokenization) is punctuation. If a token is not punctuation, it’s appended to our new list, no_punc.
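
A sketch of that logic; the exact pattern in the original snippet isn’t preserved, so this regex is an assumption:

```python
import re

# Regex object matching punctuation and other nonalphanumeric characters
punc_re = re.compile(r'[^a-zA-Z0-9]')

no_punc = []
for tokens in df['text']:
    stripped = [punc_re.sub('', tok) for tok in tokens]  # remove punctuation
    no_punc.append([tok for tok in stripped if tok])     # drop emptied tokens
df['text'] = no_punc
```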

Removing stop words

“One of the major forms of pre-processing is to filter out useless data. In natural language processing, useless words (data), are referred to as stop words.”

Pratima Upadhyay via GeeksforGeeks

In more detail, stop words are commonly used words like is, but, an, to, him, etc. and are considered useless data since they add no contextual meaning to the text data. Whether for building a database or performing clustering, preprocessed text data is generally free of stop words, and NLTK’s list of stop words is perfect for removing them from our tweets.

Import NLTK’s ‘stopwords’ class to check for English stop words

In this code, I’m simply checking our text data for any English stop words and appending those that aren’t to our new list, new_term_vector.
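
Something along these lines, using NLTK’s English stop-word list:

```python
from nltk.corpus import stopwords

english_stops = set(stopwords.words('english'))

new_term_vector = []
for tokens in df['text']:
    # keep only the tokens that aren't English stop words
    new_term_vector.append([tok for tok in tokens if tok not in english_stops])
df['text'] = new_term_vector
```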

Stemming and lemmatization

Photo by Alex Dukhanov on Unsplash

“Stemming and Lemmatization both generate the root form of the inflected words. The difference is that stem might not be an actual word whereas, lemma is an actual language word. Stemming follows an algorithm with steps to perform on the words which makes it faster.”

Hafsa Jabeen via DataCamp

In NLP, stemming is simply reducing a word by cutting off suffixes, like -er, -ing, -es, and so on. So runner becomes runn, sentences becomes sentenc, and growing becomes grow.

However, runn and sentenc are obviously not words. This is where lemmatization comes in since it involves referring to the English dictionary and matching stemmed words with their actual-language equivalents. This replaces suffixed words with their root words, making the data more computer-friendly.

First reduce each word to its stem, then move on to its lemma

We first import the PorterStemmer and WordNetLemmatizer classes from NLTK. We then create stemmer and lemmatizer objects and use them to perform the stem() and lemmatize() functions on every word within the text data.
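
A sketch of that pipeline, again on the assumed df:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

processed = []
for tokens in df['text']:
    # reduce each token to its stem, then map the stem to a WordNet lemma
    processed.append([lemmatizer.lemmatize(stemmer.stem(tok)) for tok in tokens])
df['text'] = processed
```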

Bonus: fine-tuning

Photo by Benjamin Dada on Unsplash

After processing the tweets, I noticed I still had words with web terms attached to them, like html, www, http, and others. So I created a new regex object that searches for these specific terms and implemented the same method I did when removing punctuation.

Removal of elements that consist of web terms
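
A sketch of that cleanup; the exact terms in the original pattern are an assumption based on the prose:

```python
import re

# Drop any token that still carries a leftover web term
web_re = re.compile(r'(https?|www|html)')

no_web = []
for tokens in df['text']:
    no_web.append([tok for tok in tokens if not web_re.search(tok)])
df['text'] = no_web
```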

Conclusion

Photo by Adi Goldstein on Unsplash

“The ultimate objective of NLP is to read, decipher, understand, and make sense of the human languages in a manner that is valuable.”

Dr. Michael J. Garbade via Becoming Human

In conclusion, NLP facilitates the computer’s interpretation of text data. Conversion to lowercase, tokenization, punctuation/stop-word removal, and stemming and lemmatization are some of the most important yet basic text-processing techniques used in natural language processing.

The best way to get more familiar with the subject is to try it out yourself! Pick a topic that really interests you, and do an NLP project on it. It can be analyzing customer reviews on Amazon or opinion mining on social media.

What other NLP techniques do you use when dealing with text data? Which ones do you think are the most useful to know? Which ones would you like to learn?

Resources

Credits to this video by Unfold Data Science, which is where I learned the NLP techniques used in my Twitter clustering project and this article.

A final code snippet: how to save our cleaned text data.

Use pandas’ to_csv() to save your dataframe
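
Roughly, continuing with the assumed df (the file name here is illustrative):

```python
# Persist the cleaned DataFrame for the next stage of the project
df.to_csv('cleaned_vaccine_tweets.csv', index=False)
```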

Translated from: https://medium.com/@marcelinomv98x/from-word-blobs-to-context-5-steps-in-nlp-and-text-cleaning-db9c15e11a4c
