All you need to know about text preprocessing for NLP and Machine Learning

This article takes a deep dive into text preprocessing for NLP, covering techniques such as lowercasing, stemming, lemmatization and stop word removal. It stresses the importance of preprocessing, points out that it is task dependent, and reminds readers that not all tasks need the same level of preprocessing. For small datasets, preprocessing can be critical, whereas for large, general-purpose datasets it may matter less. Finally, it offers general principles for choosing preprocessing steps based on the task and the characteristics of the data.

by Kavita Ganesan

Based on some recent conversations, I realized that text preprocessing is a severely overlooked topic. A few people I spoke to mentioned inconsistent results from their NLP applications only to realize that they were not preprocessing their text or were using the wrong kind of text preprocessing for their project.

With that in mind, I thought I’d shed some light on what text preprocessing really is, the different methods of text preprocessing, and a way to estimate how much preprocessing you may need. For those interested, I’ve also made some text preprocessing code snippets for you to try. Now, let’s get started!

What is text preprocessing?

To preprocess your text simply means to bring your text into a form that is predictable and analyzable for your task. A task here is a combination of approach and domain. For example, extracting top keywords with tfidf (approach) from Tweets (domain) is an example of a Task.

Task = approach + domain
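
To make the idea of a task concrete, here is a minimal sketch of the tfidf approach applied to a tiny, made-up set of tweets using scikit-learn. It assumes a fairly recent scikit-learn (where get_feature_names_out is available) and is only meant to show the shape of such a task, not a production pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny, made-up "domain": a handful of tweet-like strings
tweets = [
    "Deep learning is eating machine learning",
    "Machine learning in production is hard",
    "Preprocessing tweets is noisy work #nlp",
]

# The "approach": tfidf weighting over the corpus
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(tweets)

# Rank terms by their average tfidf weight and keep the top few as "keywords"
mean_scores = np.asarray(tfidf.mean(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
top_keywords = sorted(zip(terms, mean_scores), key=lambda x: -x[1])[:5]
print(top_keywords)
```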

One task’s ideal preprocessing can become another task’s worst nightmare. So take note: text preprocessing is not directly transferable from task to task.

Let’s take a very simple example: say you are trying to discover commonly used words in a news dataset. If your preprocessing step involves removing stop words just because some other task used it, then you are probably going to miss out on some of the common words, as you have ALREADY eliminated them. So really, it’s not a one-size-fits-all approach.

Types of text preprocessing techniques

There are different ways to preprocess your text. Here are some of the approaches that you should know about and I will try to highlight the importance of each.

Lowercasing

Lowercasing ALL your text data, although commonly overlooked, is one of the simplest and most effective forms of text preprocessing. It is applicable to most text mining and NLP problems, can help in cases where your dataset is not very large, and significantly helps with consistency of the expected output.

Quite recently, one of my blog readers trained a word embedding model for similarity lookups. He found that different variations in input capitalization (e.g. ‘Canada’ vs. ‘canada’) gave him different types of output or no output at all. This was probably happening because the dataset had mixed-case occurrences of the word ‘Canada’ and there was insufficient evidence for the neural network to effectively learn the weights for the less common version. This type of issue is bound to happen when your dataset is fairly small, and lowercasing is a great way to deal with sparsity issues.

Here is an example of how lowercasing solves the sparsity issue, where the same words with different cases map to the same lowercase form:
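
The original illustration was an image; here is a small plain-Python sketch of the same idea, using an invented word list:

```python
raw_words = ["Canada", "CANADA", "canada", "Italy", "ITALY", "italy"]

# Lowercasing maps all case variants onto a single canonical form,
# so "Canada", "CANADA" and "canada" all collapse into "canada"
lowercased = [w.lower() for w in raw_words]

print(sorted(set(raw_words)))   # 6 distinct tokens before lowercasing
print(sorted(set(lowercased)))  # 2 distinct tokens afterwards
```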

Another example where lowercasing is very useful is search. Imagine you are looking for documents containing “usa”. However, no results show up because “usa” was indexed as “USA”. Now, who should we blame? The U.I. designer who set up the interface or the engineer who set up the search index?

While lowercasing should be standard practice, I’ve also had situations where preserving the capitalization was important, for example when predicting the programming language of a source code file. The word System in Java is quite different from system in Python. Lowercasing the two makes them identical, causing the classifier to lose important predictive features. While lowercasing is generally helpful, it may not be applicable for all tasks.

Stemming

Stemming is the process of reducing inflection in words (e.g. troubled, troubles) to their root form (e.g. trouble). The “root” in this case may not be a real root word, but just a canonical form of the original word.

Stemming uses a crude heuristic process that chops off the ends of words in the hope of correctly transforming words into their root form. So the words “trouble”, “troubled” and “troubles” might actually be converted to troubl instead of trouble because the ends were just chopped off (ughh, how crude!).

There are different algorithms for stemming. The most common algorithm, which is also known to be empirically effective for English, is Porter’s algorithm. Here is an example of stemming in action with the Porter stemmer:
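
The original example was shown as an image; a rough equivalent using NLTK's PorterStemmer (assuming NLTK is installed), with an illustrative word list, looks like this:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["trouble", "troubled", "troubles", "troubling", "connection", "connected"]

# Porter's algorithm chops off common suffixes, so the output is not always a real word
for word in words:
    print(word, "->", stemmer.stem(word))
```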

Stemming is useful for dealing with sparsity issues as well as standardizing vocabulary. I’ve had success with stemming in search applications in particular. The idea is that, if say you search for “deep learning classes”, you also want to surface documents that mention “deep learning class” as well as “deep learn classes”, although the latter doesn’t sound right. But you get where we are going with this. You want to match all variations of a word to bring up the most relevant documents.

In most of my previous text classification work, however, stemming only marginally helped improve classification accuracy, as opposed to using better engineered features and text enrichment approaches such as word embeddings.

Lemmatization

Lemmatization on the surface is very similar to stemming, where the goal is to remove inflections and map a word to its root form. The only difference is that, lemmatization tries to do it the proper way. It doesn’t just chop things off, it actually transforms words to the actual root. For example, the word “better” would map to “good”. It may use a dictionary such as WordNet for mappings or some special rule-based approaches. Here is an example of lemmatization in action using a WordNet-based approach:
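
The original example was an image; a rough equivalent using NLTK's WordNet lemmatizer might look like the sketch below. It assumes NLTK is installed, and depending on your NLTK version you may need to download additional WordNet resources.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")   # one-time download; some NLTK versions also need "omw-1.4"

lemmatizer = WordNetLemmatizer()

# Without a part-of-speech hint, the default is noun, which often changes nothing
print(lemmatizer.lemmatize("better"))            # 'better'
# With the adjective tag, "better" maps to its actual root form
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("troubles", pos="n")) # 'trouble'
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
```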

In my experience, lemmatization provides no significant benefit over stemming for search and text classification purposes. In fact, depending on the algorithm you choose, it could be much slower compared to using a very basic stemmer and you may have to know the part-of-speech of the word in question in order to get a correct lemma. This paper finds that lemmatization has no significant impact on accuracy for text classification with neural architectures.

I would personally use lemmatization sparingly. The additional overhead may or may not be worth it. But you could always try it to see the impact it has on your performance metric.

Stopword Removal

Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, “are”, etc. The intuition behind using stop words is that, by removing low-information words from text, we can focus on the important words instead.

For example, in the context of a search system, if your search query is “what is text preprocessing?”, you want the search system to focus on surfacing documents that talk about “text preprocessing” over documents that talk about “what is”. This can be done by preventing all words from your stop word list from being analyzed. Stop words are commonly applied in search systems, text classification applications, topic modeling, topic extraction and others.

In my experience, stop word removal, while effective in search and topic extraction systems, turned out to be non-critical in classification systems. However, it does help reduce the number of features in consideration, which helps keep your models decently sized.

Here is an example of stop word removal in action. All stop words are replaced with a dummy character, W:
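
The original example was an image; here is a minimal sketch of the same idea using NLTK's English stop word list (assuming NLTK is installed), with a made-up sentence:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stop word lists

stop_words = set(stopwords.words("english"))
sentence = "this is a text full of content and we need to clean it up"

# Replace every stop word with the dummy character W
cleaned = [w if w not in stop_words else "W" for w in sentence.split()]
print(" ".join(cleaned))
# something like: "W W W text full W content W W need W clean W W"
```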

Stop word lists can come from pre-established sets or you can create a custom one for your domain. Some libraries (e.g. sklearn) allow you to remove words that appeared in X% of your documents, which can also give you a stop word removal effect.
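
For example, a sketch of this with scikit-learn's CountVectorizer and its max_df parameter, using a toy corpus and an arbitrary 85% threshold, might look like:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the house",
]

# max_df=0.85: ignore terms that appear in more than 85% of documents;
# very frequent words such as "the" behave like corpus-specific stop words
vectorizer = CountVectorizer(max_df=0.85)
X = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)   # "the" is dropped; rarer words remain
```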

Normalization

A highly overlooked preprocessing step is text normalization. Text normalization is the process of transforming a text into a canonical (standard) form. For example, the words “gooood” and “gud” can be transformed to “good”, their canonical form. Another example is mapping near-identical words such as “stopwords”, “stop-words” and “stop words” to just “stopwords”.

Text normalization is important for noisy texts such as social media comments, text messages and comments to blog posts where abbreviations, misspellings and use of out-of-vocabulary words (oov) are prevalent. This paper showed that by using a text normalization strategy for Tweets, they were able to improve sentiment classification accuracy by ~4%.

Here’s an example of words before and after normalization:
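
The original before/after table was an image; a dictionary-mapping sketch of the same idea (the easiest of the approaches mentioned later in this section), with an invented mapping table, might look like:

```python
# A hand-built normalization dictionary; in practice this would be much larger
# and tailored to your domain (social media, clinical notes, etc.)
norm_dict = {
    "gooood": "good",
    "gud": "good",
    "b4": "before",
    "2morrow": "tomorrow",
    "stop-words": "stopwords",
    "stop words": "stopwords",
}

def normalize(token: str) -> str:
    # Fall back to the original token when no mapping exists
    return norm_dict.get(token.lower(), token)

for raw in ["gooood", "gud", "2morrow", "text"]:
    print(raw, "->", normalize(raw))
```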

Notice how the variations map to the same canonical form.

In my experience, text normalization has even been effective for analyzing highly unstructured clinical texts where physicians take notes in non-standard ways. I’ve also found it useful for topic extraction where near synonyms and spelling differences are common (e.g. topic modelling, topic modeling, topic-modeling, topic-modelling).

Unfortunately, unlike stemming and lemmatization, there isn’t a standard way to normalize texts. It typically depends on the task. For example, the way you would normalize clinical texts would arguably be different from how you normalize sms text messages.

Some common approaches to text normalization include dictionary mappings (easiest), statistical machine translation (SMT) and spelling-correction based approaches. This interesting article compares the use of a dictionary-based approach and an SMT approach for normalizing text messages.

Noise Removal

Noise removal is about removing characters, digits and pieces of text that can interfere with your text analysis. Noise removal is one of the most essential text preprocessing steps. It is also highly domain dependent.

For example, in Tweets, noise could be all special characters except hashtags, since a hashtag signifies concepts that can characterize a Tweet. The problem with noise is that it can produce results that are inconsistent in your downstream tasks. Let’s take the example below:

Notice that all the raw words above have some surrounding noise in them. If you stem these words, you can see that the stemmed result does not look very pretty. None of them have a correct stem. However, with some cleaning as applied in this notebook, the results now look much better:

Noise removal is one of the first things you should be looking into when it comes to Text Mining and NLP. There are various ways to remove noise. This includes punctuation removal, special character removal, numbers removal, html formatting removal, domain specific keyword removal (e.g. ‘RT’ for retweet), source code removal, header removal and more. It all depends on which domain you are working in and what entails noise for your task. The code snippet in my notebook shows how to do some basic noise removal.
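
The notebook itself is not reproduced here, but a basic regex-based cleanup along those lines might look like the following sketch; the patterns and the example tweet are illustrative, not exhaustive.

```python
import re

def remove_noise(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)           # strip html tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # strip urls
    text = re.sub(r"\bRT\b", " ", text)            # strip the retweet marker
    text = re.sub(r"@\w+", " ", text)              # strip @mentions
    text = re.sub(r"[^a-zA-Z#\s]", " ", text)      # keep letters and hashtags only
    text = re.sub(r"\s+", " ", text)               # collapse whitespace
    return text.strip()

tweet = "RT @user: Check this out!!! https://example.com <b>so</b> cool #nlp"
print(remove_noise(tweet))   # "Check this out so cool #nlp"
```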

Text Enrichment / Augmentation

Text enrichment involves augmenting your original text data with information that you did not previously have. Text enrichment provides more semantics to your original text, thereby improving its predictive power and the depth of analysis you can perform on your data.

In an information retrieval example, expanding a user’s query to improve the matching of keywords is a form of augmentation. A query like text mining could become text document mining analysis. While this doesn’t make sense to a human, it can help fetch documents that are more relevant.
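
A very rough sketch of dictionary-based query expansion, with an expansion table invented purely for illustration, could look like this:

```python
# Hand-crafted expansion dictionary; a real system might build this from
# query logs, WordNet synonyms or embeddings instead
expansions = {
    "mining": ["mining", "document", "analysis"],
    "classes": ["classes", "class", "course"],
}

def expand_query(query: str) -> str:
    expanded = []
    for term in query.lower().split():
        expanded.extend(expansions.get(term, [term]))
    # Deduplicate while preserving order
    return " ".join(dict.fromkeys(expanded))

print(expand_query("text mining"))            # "text mining document analysis"
print(expand_query("deep learning classes"))  # "deep learning classes class course"
```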

You can get really creative with how you enrich your text. You can use part-of-speech tagging to get more granular information about the words in your text.

For example, in a document classification problem, the appearance of the word book as a noun could result in a different classification than book as a verb as one is used in the context of reading and the other is used in the context of reserving something. This article talks about how Chinese text classification is improved with a combination of nouns and verbs as input features.
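
As a quick sketch of how part-of-speech tags capture this book-as-verb versus book-as-noun distinction, NLTK's tagger can be used roughly as follows; note that the exact resource names to download and the tags produced can vary between NLTK versions.

```python
import nltk

# One-time downloads of the tokenizer and tagger models
# (newer NLTK releases may use slightly different resource names)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

print(nltk.pos_tag(nltk.word_tokenize("I want to book a flight")))
# 'book' should come out tagged as a verb (VB) here
print(nltk.pos_tag(nltk.word_tokenize("I read a good book")))
# 'book' should come out tagged as a noun (NN) here
```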

With the availability of large amounts of text, however, people have started using embeddings to enrich the meaning of words, phrases and sentences for classification, search, summarization and text generation in general. This is especially true in deep learning based NLP approaches, where a word-level embedding layer is quite common. You can either start with pre-established embeddings or create your own and use them in downstream tasks.
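
A minimal sketch of training your own embeddings with gensim is shown below; it assumes gensim 4.x parameter names, and the toy corpus is far too small to learn anything meaningful, so it only demonstrates the API shape.

```python
from gensim.models import Word2Vec

# A toy corpus: each "sentence" is a list of already-preprocessed tokens
sentences = [
    ["text", "preprocessing", "helps", "nlp", "models"],
    ["word", "embeddings", "capture", "word", "meaning"],
    ["embeddings", "help", "search", "and", "classification"],
]

# vector_size, window and min_count are the gensim 4.x parameter names
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Nearest neighbours in the learned embedding space
print(model.wv.most_similar("embeddings", topn=3))
```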

Other ways to enrich your text data include phrase extraction, where you recognize compound words as one (aka chunking), expansion with synonyms and dependency parsing.

Do you need it all?

Not really, but you do have to do some of it for sure if you want good, consistent results. To give you an idea of what the bare minimum should be, I’ve broken it down to Must Do, Should Do and Task Dependent. Everything that falls under task dependent can be quantitatively or qualitatively tested before deciding you actually need it.

Remember, less is more and you want to keep your approach as elegant as possible. The more overhead you add, the more layers you will have to peel back when you run into issues.

Must Do:

  • Noise removal
  • Lowercasing (can be task dependent in some cases)

Should Do:

  • Simple normalization (e.g. standardize near-identical words)

Task Dependent:

  1. Advanced normalization (e.g. addressing out-of-vocabulary words)
  2. Stop-word removal
  3. Stemming / lemmatization
  4. Text enrichment / augmentation

So, for any task, the minimum you should do is try to lowercase your text and remove noise. What entails noise depends on your domain (see section on Noise Removal). You can also do some basic normalization steps for more consistency and then systematically add other layers as you see fit.

General Rule of Thumb

Not all tasks need the same level of preprocessing. For some tasks, you can get away with the minimum. However, for others, the dataset is so noisy that, if you don’t preprocess enough, it’s going to be garbage-in-garbage-out.

Here’s a general rule of thumb. This will not always hold true, but works for most cases. If you have a lot of well written texts to work with in a fairly general domain, then preprocessing is not extremely critical; you can get away with the bare minimum (e.g. training a word embedding model using all of Wikipedia texts or Reuters news articles).

However, if you are working in a very narrow domain (e.g. Tweets about health foods) and data is sparse and noisy, you could benefit from more preprocessing layers, although each layer you add (e.g. stop word removal, stemming, normalization) needs to be quantitatively or qualitatively verified as a meaningful layer. Here’s a table that summarizes how much preprocessing you should be performing on your text data:

I hope the ideas here steer you towards the right preprocessing steps for your projects. Remember, less is more. A friend of mine once mentioned to me how he made a large e-commerce search system more efficient and less buggy just by throwing out layers of unneeded preprocessing.

Original article: https://www.freecodecamp.org/news/all-you-need-to-know-about-text-preprocessing-for-nlp-and-machine-learning-bc1c5765ff67/
