
by Praveen Dubey

An introduction to Bag of Words and how to code it in Python for NLP

Bag of Words (BOW) is a method to extract features from text documents. These features can be used for training machine learning algorithms. It creates a vocabulary of all the unique words occurring in all the documents in the training set.

In simple terms, it’s a collection of words used to represent a sentence along with the count of each word, mostly disregarding the order in which the words appear.

BOW is an approach widely used with:

  1. Natural language processing
  2. Information retrieval from documents
  3. Document classification

On a high level, it involves the following steps: tokenize the text, build a vocabulary of all the unique words, and generate a vector for each document from the word counts.

Generated vectors can be input to your machine learning algorithm.

Let’s start with an example: take some sentences and generate vectors for them.

Consider the below two sentences.

1. "John likes to watch movies. Mary likes movies too."
2. "John also likes to watch football games."

These two sentences can also be represented as a collection of words.

1. ['John', 'likes', 'to', 'watch', 'movies.', 'Mary', 'likes', 'movies', 'too.']
2. ['John', 'also', 'likes', 'to', 'watch', 'football', 'games']

Further, for each sentence, remove duplicate occurrences of each word and use the word count to represent it.

1. {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
2. {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,   "games":1}

Assuming these sentences are part of a document, below is the combined word frequency for our entire document. Both sentences are taken into account.

{"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1,  "also":1,"football":1,"games":1}

The above vocabulary from all the words in a document, with their respective word count, will be used to create the vectors for each of the sentences.

The length of the vector will always be equal to the vocabulary size. In this case the vector length is 10.

In order to represent our original sentences in a vector, each vector is initialized with all zeros — [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

We then iterate through each word in our vocabulary, compare it with the words in the sentence, and increment the corresponding vector element each time that word occurs in the sentence.

"John likes to watch movies. Mary likes movies too." → [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
"John also likes to watch football games." → [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]

For example, in sentence 1 the word likes sits in the second position of our vocabulary and occurs two times, so the second element of the vector for sentence 1 is 2: [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
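
A minimal sketch of this construction for the toy example above (the vocabulary order follows the combined word counts; this is not the code developed later in the article):

vocab = ["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]
sentence_1 = ["John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too"]

# One element per vocabulary word, holding how often it occurs in the sentence
vector = [sentence_1.count(word) for word in vocab]
print(vector)  # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]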

The size of the vector always grows with the size of our vocabulary.

A big document where the generated vocabulary is huge may result in a vector with lots of 0 values. This is called a sparse vector. Sparse vectors require more memory and computational resources when modeling. The vast number of positions or dimensions can make the modeling process very challenging for traditional algorithms.
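
A common workaround is to store only the nonzero entries. Here is a tiny dictionary-based sketch of the idea; in practice libraries such as scipy.sparse provide efficient sparse matrix types:

dense = [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]

# Keep only the positions that actually carry a count
sparse = {index: count for index, count in enumerate(dense) if count != 0}
print(sparse)  # {0: 1, 1: 2, 2: 1, 3: 1, 4: 2, 5: 1, 6: 1}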

Coding our BOW algorithm

The input to our code will be multiple sentences and the output will be the vectors.

The input array is this:

["Joe waited for the train", "The train was late", "Mary and Samantha took the bus",
"I looked for Mary and Samantha at the bus station",
"Mary and Samantha arrived at the bus station early but waited until noon for the bus"]
步骤1:标记句子 (Step 1: Tokenize a sentence)

We will start by removing stopwords from the sentences.

Stopwords are words which do not carry enough significance to be useful to our algorithm. We would not want these words to take up space in our database or valuable processing time. We can remove them easily by keeping a list of words that we consider to be stop words.

Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded.

import re

def word_extraction(sentence):
    ignore = ['a', "the", "is"]
    # Split on any non-word character, drop ignored words and lowercase the rest
    words = re.sub(r"[^\w]", " ", sentence).split()
    cleaned_text = [w.lower() for w in words if w not in ignore]
    return cleaned_text
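
As a quick check, calling it on the first sentence from our input array drops the stop word "the" and lowercases everything:

print(word_extraction("Joe waited for the train"))
# ['joe', 'waited', 'for', 'train']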

For a more robust handling of stopwords, you can use the Python nltk library. It has a set of predefined words per language. Here is an example:

import nltk
from nltk.corpus import stopwords
set(stopwords.words('english'))
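
A minimal sketch of how that list could replace the hand-written ignore list, assuming the stopwords corpus has been downloaded (word_extraction_nltk is just an illustrative name):

import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the corpus
stop_words = set(stopwords.words('english'))

def word_extraction_nltk(sentence):
    words = re.sub(r"[^\w]", " ", sentence).split()
    return [w.lower() for w in words if w.lower() not in stop_words]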

Step 2: Apply tokenization to all sentences

def tokenize(sentences):
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)
    # Keep each word once and sort alphabetically to get a stable vocabulary
    words = sorted(list(set(words)))
    return words

The method iterates over all the sentences, adds the extracted words to an array, then removes duplicates and sorts the result.

The output of this method will be:

['and', 'arrived', 'at', 'bus', 'but', 'early', 'for', 'i', 'joe', 'late', 'looked', 'mary', 'noon', 'samantha', 'station', 'the', 'took', 'train', 'until', 'waited', 'was']

Step 3: Build vocabulary and generate vectors

Use the methods defined in steps 1 and 2 to create the document vocabulary and extract the words from the sentences.

import numpy

def generate_bow(allsentences):
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))

    for sentence in allsentences:
        words = word_extraction(sentence)
        bag_vector = numpy.zeros(len(vocab))
        # Increment the vector position of every vocabulary word found in the sentence
        for w in words:
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1
        print("{0}\n{1}\n".format(sentence, numpy.array(bag_vector)))

Here is the defined input and execution of our code:

allsentences = ["Joe waited for the train train", "The train was late", "Mary and Samantha took the bus",
"I looked for Mary and Samantha at the bus station",
"Mary and Samantha arrived at the bus station early but waited until noon for the bus"]
generate_bow(allsentences)

The output vectors for each of the sentences are:

Output:

Joe waited for the train train
[0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 1. 0.]

The train was late
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1.]

Mary and Samantha took the bus
[1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0.]

I looked for Mary and Samantha at the bus station
[1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0.]

Mary and Samantha arrived at the bus station early but waited until noon for the bus
[1. 1. 1. 2. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0.]

As you can see, each sentence was compared against the word list generated from all the sentences, and the corresponding vector element was incremented for every match. These vectors can be used in ML algorithms for document classification and predictions.
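
If you would rather feed the vectors to a model than print them, a small variation of the function above can return a NumPy matrix; bow_matrix is a hypothetical helper that reuses tokenize and word_extraction:

import numpy

def bow_matrix(allsentences):
    vocab = tokenize(allsentences)
    matrix = numpy.zeros((len(allsentences), len(vocab)))
    for row, sentence in enumerate(allsentences):
        for w in word_extraction(sentence):
            if w in vocab:
                matrix[row, vocab.index(w)] += 1
    return vocab, matrix  # one row per sentence, one column per vocabulary word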

We wrote our code and generated vectors, but now let’s understand bag of words a bit more.

Insights into bag of words

The BOW model only considers whether a known word occurs in a document or not. It does not care about the meaning, context, or order in which the words appear.

This gives the insight that similar documents will have word counts similar to each other. In other words, the more similar the words in two documents, the more similar the documents can be.
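
One common way to quantify this, not covered in the original article, is cosine similarity between the count vectors; using the two vectors from the earlier example:

import numpy

def cosine_similarity(a, b):
    # Cosine of the angle between two count vectors; 1.0 means identical word distributions
    return numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

v1 = numpy.array([1, 2, 1, 1, 2, 1, 1, 0, 0, 0])  # "John likes to watch movies. Mary likes movies too."
v2 = numpy.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1])  # "John also likes to watch football games."
print(cosine_similarity(v1, v2))  # roughly 0.52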

Limitations of BOW

  1. Semantic meaning: the basic BOW approach does not consider the meaning of a word in the document. It completely ignores the context in which it’s used. The same word can mean different things depending on the context or the nearby words.

  2. Vector size: for a large document, the vector size can be huge, resulting in a lot of computation and time. You may need to ignore words based on their relevance to your use case.

This was a small introduction to the BOW method. The code showed how it works at a low level. There is much more to understand about BOW. For example, instead of splitting our sentences into single words (1-grams), you can split them into pairs of consecutive words (bi-grams or 2-grams). At times, a bi-gram representation works much better than 1-grams. These variants are often described using N-gram notation. I have listed some research papers in the resources section for more in-depth knowledge.

You do not have to code BOW whenever you need it. It is already part of many available frameworks, such as CountVectorizer in scikit-learn.

Our previous code can be replaced with:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(allsentences)
print(X.toarray())
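
Continuing from the snippet above, CountVectorizer can also produce the n-gram representation mentioned earlier through its ngram_range parameter; note that it applies its own tokenization and lowercasing, so the exact features differ from our hand-rolled code:

bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams together
X2 = bigram_vectorizer.fit_transform(allsentences)
print(bigram_vectorizer.get_feature_names_out())  # available in recent scikit-learn versions
print(X2.toarray())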

It’s always good to understand how the libraries in frameworks work, and understand the methods behind them. The better you understand the concepts, the better use you can make of frameworks.

Thanks for reading the article. The code shown is available on my GitHub.

You can follow me on Medium, Twitter, and LinkedIn. For any questions, you can reach out to me by email (praveend806 [at] gmail [dot] com).

Resources to read more on bag of words

  1. Wikipedia-BOW
  2. Understanding Bag-of-Words Model: A Statistical Framework
  3. Semantics-Preserving Bag-of-Words Models and Applications

Translated from: https://www.freecodecamp.org/news/an-introduction-to-bag-of-words-and-how-to-code-it-in-python-for-nlp-282e87a9da04/
