NLP Word Extraction
In this article, we will look at how to tokenize never-before-seen words. TensorFlow's tokenizer can easily convert known words into tokens, but what happens when you give it words it has never seen before?
The TensorFlow tokenizer is a very powerful tool, and as shown below, it is very easy to get started with.
The tokenizer can be used to convert a set of training data (sentences) into a dictionary in which each unique word is assigned a distinct ID. Let's look at how to create such a dictionary from words.
In TensorFlow, this dictionary is called a word index.
```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Let's add custom sentences
sentences = [
    "Apples are red",
    "Apples are round",
    "Oranges are round",
    "Grapes are green"
]

# Tokenize the sentences
myTokenizer = Tokenizer(num_words=100)
myTokenizer.fit_on_texts(sentences)
print(myTokenizer.word_index)
```
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/efa19d92f0c7e72d9866122b36f54092.png)
Training set
- Apples are red
- Apples are round
- Oranges are round
- Grapes are green
Now this word index can be used to convert our training set into sequences of token IDs.
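To see how this works with never-before-seen words, a minimal sketch: by default, `texts_to_sequences` silently drops any word that is not in the word index, but passing an `oov_token` when constructing the `Tokenizer` reserves an ID for out-of-vocabulary words instead. (The token name `"<OOV>"` is a common convention, not a requirement.)

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "Apples are red",
    "Apples are round",
    "Oranges are round",
    "Grapes are green"
]

# oov_token reserves ID 1 for out-of-vocabulary words;
# without it, unseen words would simply be dropped.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

# "Bananas" and "yellow" were never seen during fitting,
# so both are mapped to the <OOV> token's ID.
sequences = tokenizer.texts_to_sequences(["Bananas are yellow"])
print(sequences)
```

Every word in a new sentence now gets some ID, so the resulting sequences all have the same length as their source sentences, which matters when you later pad them for training.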