Word Embeddings Deep Dive: A Hands-On Approach


Introduction

I’m sure most of you will stumble upon the term “Word Embeddings” sooner or later as you progress on your Natural Language Processing journey. Word embeddings have become one of the most significant building blocks of today’s state-of-the-art language models. It’s crucial that we understand what they represent, how they are computed under the hood, and what sets them apart. So let’s begin by understanding what they really mean and lay out the characteristics and features behind their extensive usage and popularity in the NLP community. Starting with the basic foundations of word embeddings, we’ll gradually explore the depths as we advance through the article. The full code shared in this article is available on Github.


Word Embeddings

Let’s take a look at what Wikipedia has to say about word embeddings —


Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.


In other words — word embeddings are vectorized, fixed-length, distributed, dense representations of words that interpret a word’s textual meaning by mapping it to a vector of real values. I know that’s a lot to take in all at once. We’ll break the definition into parts and focus on one part at a time.


Word embeddings are fixed-length vectors — meaning all the words in our vocabulary are represented by a vector (of real numbers) of a fixed, predefined size that we decide on. Word embeddings of size 3 would look like: ‘cold’ — [0.2, 0.5, 0.08], ‘house’ — [0.05, 0.1, 0.8]

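To make this concrete, here is a minimal sketch (not part of the original article’s code) of what such fixed-length embeddings might look like as NumPy arrays; the actual values are made up purely for illustration:

import numpy as np

# Toy 3-dimensional embeddings; the values are invented for illustration only.
embeddings = {
  'cold': np.array([0.2, 0.5, 0.08]),
  'house': np.array([0.05, 0.1, 0.8]),
}

# Every word in the vocabulary maps to a vector of the same fixed length.
assert all(vector.shape == (3,) for vector in embeddings.values())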

Distributed representations — Word embeddings are based on the distributional hypothesis, which states that words that occur in similar contexts tend to have similar meanings. Distributed representations try to comprehend a word’s meaning by considering the company it keeps (context words).


Dense representations — this is one of the most prominent features of word embeddings and a big reason they became so popular. Traditionally, One Hot Encoding has been used for mapping a word to numerical values. One-hot encoding is the process of converting vocabulary words into binary vectors. If the vocabulary size is 5 with the words {cat, food, house, milk, water}, then cat would be encoded as a binary vector of [1, 0, 0, 0, 0], milk would be [0, 0, 0, 1, 0], and so on.


As you might have noticed already, we’re only setting a single element of the entire vector, at the word’s index. As the vocabulary size increases, we’d end up using an extremely long sparse vector to encode a single word, which results in performance and storage penalties because of the curse of dimensionality. In addition to that, such a representation is incapable of capturing semantic relationships between words, which is essential when dealing with textual data.

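As a quick illustration (a sketch, not part of the original code), here is how one-hot vectors for the toy vocabulary above could be built, and why their length grows with the vocabulary size:

import numpy as np

vocab = ['cat', 'food', 'house', 'milk', 'water']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
  # A sparse binary vector with a single 1 at the word's index.
  vector = np.zeros(len(vocab), dtype=int)
  vector[word_to_index[word]] = 1
  return vector

print(one_hot('cat'))   # [1 0 0 0 0]
print(one_hot('milk'))  # [0 0 0 1 0]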

Need for Word Embeddings

To overcome the limitations of one-hot encoding, traditional information retrieval methods have also been tried and implemented in the hope of combating the curse of dimensionality — TF-IDF, Latent Semantic Analysis (LSA), etc. However, both TF-IDF and LSA use a document-centric approach, which limits them to a subclass of NLP problems, and they still can’t effectively capture a word’s meaning in a dense representation.

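For comparison, here is roughly how a TF-IDF representation could be produced with scikit-learn (an illustrative snippet assuming a recent scikit-learn version, not part of the original article’s code). Note that the result is a sparse document-by-term matrix, which highlights the document-centric, high-dimensional nature discussed above:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
  'the cat ate the mouse',
  'the cat drank the milk',
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# One sparse row per document, one column per vocabulary term.
print(tfidf.shape)
print(vectorizer.get_feature_names_out())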

Word embeddings eliminate all the above shortcomings and equip us with enriched powerful representations that are capable of capturing contextual and semantic similarities between words.


Now that we’ve taken a look at the idea and motivation behind word embeddings, we’ll take a closer look at one of the most significant and widely used algorithms for learning word embeddings — Word2Vec. We’ll go through it in detail, outline its salient features and characteristics, and then walk through a thorough implementation using TensorFlow.


Word2Vec

Word2Vec is a prediction-based algorithm for generating word embeddings, originally proposed at Google by Mikolov et al. For a deeper understanding of the concepts involved, I’d suggest you dig into their research paper — Efficient Estimation of Word Representations in Vector Space.


It proposes two novel architectures for learning distributed and dense representations of words:


  • Continuous Bag of Words Model (CBOW)
  • The Skip Gram Model

The CBOW Model

In the CBOW model, we try to predict the distributed representation of the target word (the middle word) from the context words (the surrounding words) which lie on either side of the target word within the context window (whose size is configurable but is usually set to 5). For example, for the sentence Pack my box with five dozen liquor jugs, a context window of size 2 would yield the following (context_window, target_word) pairs — ([five, liquor], dozen), ([dozen, jugs], liquor), and so on.

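Here is a small sketch of how such (context_window, target_word) pairs could be generated. The helper name cbow_pairs and the one-word-per-side window convention are my own, chosen to reproduce the pairs listed above:

def cbow_pairs(sentence, half_window=1):
  # Pair the words surrounding each position with the middle (target) word.
  words = sentence.lower().split()
  pairs = []
  for i, target in enumerate(words):
    context = words[max(0, i - half_window):i] + words[i + 1:i + 1 + half_window]
    pairs.append((context, target))
  return pairs

for context, target in cbow_pairs('Pack my box with five dozen liquor jugs'):
  print(context, '->', target)  # e.g. ['five', 'liquor'] -> dozen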

[Figure: The CBOW Architecture]

The Skip-gram Model

The Skip-gram model is similar to the CBOW model, but instead of predicting the current word given its context, it tries to predict the context words from the current word. For example, for the sentence Pack my box with five dozen liquor jugs, a context window of size 2 would yield the following (current_word, context_window) pairs — (dozen, [five, liquor]), (liquor, [dozen, jugs]), and so on.

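And the mirror image for skip-gram (again a sketch with an illustrative helper name), pairing each current word with its surrounding context words:

def skipgram_pairs(sentence, half_window=1):
  # Pair each current word with the words surrounding it.
  words = sentence.lower().split()
  pairs = []
  for i, current in enumerate(words):
    context = words[max(0, i - half_window):i] + words[i + 1:i + 1 + half_window]
    pairs.append((current, context))
  return pairs

for current, context in skipgram_pairs('Pack my box with five dozen liquor jugs'):
  print(current, '->', context)  # e.g. dozen -> ['five', 'liquor']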

[Figure: The Skip-gram Architecture]

Skip-gram vs. CBOW — Which Is Preferred?

According to Mikolov:


Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words


CBOW learns to predict the word with maximum probability given its context. The context words are averaged and fed to the network to predict the most probable word. For example, for “the cat ate the mouse”, the network would try to predict “ate” from the averaged input of “the cat the mouse”. In this context and other relevant contexts, over time the model smooths itself to predict frequent words like “ate” and pays much less attention to “gobbled”, since it occurs rarely. Because of this, the quality of the distributed representations of such rare words suffers.


Skip-gram, on the other hand, learns to predict context words from the target word. Instead of averaging the input context words, each pair is fed to the model separately to predict the other word in that pair: predicting “the” from “ate”, “ate” from “mouse”, and so on. This style of training doesn’t enforce competition between “ate” and “gobbled”, since both would be used in their respective contexts to predict context words. A detailed discussion on the topic can be found here.


We’re going to implement the word2vec algorithm using the Skip-gram architecture coupled with negative sampling (which will be explained later in the article). So let’s dive straight into the implementation!


Skip-gram with Negative Sampling

We’re going to use the text8 dataset for the purpose of this article. Text8 is the first 100,000,000 bytes of plain text from Wikipedia. It’s mainly used for testing purposes. Let’s start with loading data:


import gzip

import gensim.downloader as api


def load_data():
  # Download the text8 corpus via gensim's downloader and read the raw text
  text8_zip_file_path = api.load('text8', return_path=True)
  with gzip.open(text8_zip_file_path, 'rb') as file:
    file_content = file.read()
  wiki = file_content.decode()
  return wiki


wiki = load_data()

Preprocessing Data

Stopword removal — We begin by removing stopwords, as they bring little to no value to our task of learning word embeddings.


Subsampling words — In a large corpus, the most frequent words can easily occur hundreds of millions of times, and such words usually don’t bring much information to the table. It is essential to cut down their frequency to mitigate the negative impact they add. For example, co-occurrences of “English” and “Spanish” benefit the model much more than co-occurrences of “English” and “the” or “Spanish” and “of”. To counter the imbalance between rare and frequent words, Mikolov et al. came up with the following heuristic formula for the probability of dropping a particular word:


P(drop | w) = 1 - sqrt(t / f(w))

where t is a threshold value (heuristically set to 1e-5) and f(w) is the relative frequency of the word.

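As a quick worked example: with t = 1e-5, a word that makes up 1% of the corpus (f(w) = 0.01) would be dropped with probability 1 - sqrt(1e-5 / 0.01) ≈ 0.97, while a word with f(w) = 1e-5 or lower would never be dropped.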

Filtering words — The frequency of a word tells us a lot about its importance and significance for our model. Words occurring only once can’t really be represented correctly because of the lack of context words associated with them. To exclude such noise from our data (as we don’t have much information about their surroundings), we keep only words occurring at least five times.


import random
from collections import Counter

import numpy as np
from nltk.corpus import stopwords  # requires the NLTK stopwords corpus: nltk.download('stopwords')


def get_drop_prob(x, threshold_value):
  # Probability of dropping a word with relative frequency x (subsampling heuristic)
  return 1 - np.sqrt(threshold_value/x)


def subsample_words(words, word_counts):
  threshold_value = 1e-5
  total_count = len(words)
  freq_words = {word: (word_counts[word]/total_count) for word in set(words)}
  subsampled_words = [word for word in words if random.random() < (1 - get_drop_prob(freq_words[word], threshold_value))]
  return subsampled_words


def preprocess_text(text):
  # Replace punctuation with tokens so we can use them in our model
  text = text.lower()
  text = text.strip()
  text = text.replace('.', ' <PERIOD> ')
  text = text.replace(',', ' <COMMA> ')
  text = text.replace('"', ' <QUOTATION_MARK> ')
  text = text.replace(';', ' <SEMICOLON> ')
  text = text.replace('!', ' <EXCLAMATION_MARK> ')
  text = text.replace('?', ' <QUESTION_MARK> ')
  text = text.replace('(', ' <LEFT_PAREN> ')
  text = text.replace(')', ' <RIGHT_PAREN> ')
  text = text.replace('--', ' <HYPHENS> ')
  text = text.replace(':', ' <COLON> ')
  words = text.split()


  # Remove stopwords
  stopwords_eng = set(stopwords.words('english'))
  words = [word for word in words if word not in stopwords_eng]
  # Remove all the words with frequency less than 5
  word_counts = Counter(words)
  print("Count of words: %s" % (len(words)))
  filtered_words = [word for word in words if word_counts[word] >= 5]
  print("Count of filtered words: %s" % (len(filtered_words)))
  # Subsample words with threshold of 10^-5
  subsampled_words = subsample_words(filtered_words, word_counts)
  print("Count of subsampled words: %s" % (len(subsampled_words)))


  return word_counts, subsampled_words

Preparing a TensorFlow Dataset Using Skipgrams

Generating skipgrams — First, we tokenize our pre-processed textual data and convert it into corresponding vectorized tokens. After that, we make use of the skipgrams utility offered by Keras for generating (word, context) pairs. As its description reads:


Generates skip-gram word pairs. It transforms a sequence of word indexes (list of integers) into tuples of words of the form:


  • (word, word in the same window), with label 1 (positive samples).


  • (word, random word from the vocabulary), with label 0 (negative samples).

Read more about Skip-gram in this gnomic paper by Mikolov et al.: Efficient Estimation of Word Representations in Vector Space


Negative Sampling — For every input we give to the network, we would normally train it using the output of a softmax layer over the entire vocabulary. That means for each input we’re making very small changes to millions of weights, even though we only have one true example. This makes training the network very inefficient and infeasible. The problem of predicting context words can instead be framed as a set of independent binary classification tasks, where the goal is to independently predict the presence (or absence) of context words. The following snippet generates pairs of (target, context) words, also known as skipgrams, and for each true (target, context) pair we also randomly sample a negative (target, ~context) pair. For further reading, refer to this paper by Mikolov et al.

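To get a feel for what the Keras skipgrams utility returns, here is a tiny illustrative call on a made-up sequence of token indices (the exact pairs will vary from run to run, since the sampling is random):

from tensorflow.keras.preprocessing.sequence import skipgrams

# A made-up sequence of token indices, just to inspect the output format.
toy_sequence = [1, 2, 3, 4, 5]
toy_pairs, toy_labels = skipgrams(toy_sequence, vocabulary_size=6,
                                  window_size=2, negative_samples=1.0)
for (target, context), label in zip(toy_pairs, toy_labels):
  print(target, context, label)  # label 1 = true context pair, 0 = negative sample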

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import skipgrams

# Illustrative values for the dataset pipeline (not prescribed by the article)
BATCH_SIZE = 1024
BUFFER_SIZE = 10000

# preprocessed_words is the subsampled word list returned by preprocess_text() above
tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_words)
vectorized_words = [tokenizer.word_index[word] for word in preprocessed_words]
# word_index is 1-based, so add 1 to leave room for index 0 (padding)
VOCAB_SIZE = len(tokenizer.word_index) + 1


pairs, labels = skipgrams(vectorized_words, VOCAB_SIZE, window_size=3, negative_samples=1.0, shuffle=True)
target_words = [p[0] for p in pairs]
context_words = [q[1] for q in pairs]


SAMPLE_SIZE = len(labels)
labels_sample = labels[:SAMPLE_SIZE]
target_words_sample = target_words[:SAMPLE_SIZE]
context_words_sample = context_words[:SAMPLE_SIZE]
train_size = int(len(labels_sample) * 0.9)
train_target_words, train_context_words, train_labels = target_words_sample[:train_size], context_words_sample[:train_size], labels_sample[:train_size]
test_target_words, test_context_words, test_labels = target_words_sample[train_size:], context_words_sample[train_size:], labels_sample[train_size:]


train_dataset = tf.data.Dataset.from_tensor_slices((train_target_words, train_context_words, train_labels)).shuffle(BUFFER_SIZE)
train_dataset = train_dataset.batch(BATCH_SIZE, drop_remainder=True)
test_dataset = tf.data.Dataset.from_tensor_slices((test_target_words, test_context_words, test_labels)).shuffle(BUFFER_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE, drop_remainder=True)

Building the Model

Now let’s build the model using the model subclassing API. In the majority of cases, the Sequential and Functional APIs are more appropriate, but you can still use model subclassing if you prefer a more object-oriented style.


For each training input, we feed a pair of words (target word, context word) into the model, together with a binary label (0 or 1) indicating whether the input tuple is a negative sample or a true sample. Both input words are fed to the embedding layer to generate encoded representations of size equal to the embedding dimension. The crucial point to note here is that we’re sharing the embedding layer between both inputs.


These dense encoded vectors are then multiplied element-wise to construct a merged representation, which in turn goes through dense and dropout layers before the model finally tries to predict the positive or negative nature of the input sample.


class SkipGramModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim):
      super(SkipGramModel, self).__init__()
      self.shared_embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=1, name='word_embeddings')
      self.flatten = tf.keras.layers.Flatten(name='flatten')
      self.dense1 = tf.keras.layers.Dense(64, activation=tf.nn.relu, name='dense_one')
      self.dropout1 = tf.keras.layers.Dropout(0.2, name = 'dropout1')
      self.dense2 = tf.keras.layers.Dense(32, activation=tf.nn.relu, name='dense_two')
      self.dropout2 = tf.keras.layers.Dropout(0.2, name = 'dropout2')
      self.pred = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid, name='predictions')


    def call(self, target_word, context_word, training=True):
      # Both inputs go through the same shared embedding layer
      x = self.shared_embedding(target_word)
      y = self.shared_embedding(context_word)
      x = self.flatten(x)
      y = self.flatten(y)
      shared = tf.multiply(x, y)
      dense_output1 = self.dense1(shared)
      if training: dense_output1 = self.dropout1(dense_output1)
      dense_output2 = self.dense2(dense_output1)
      if training: dense_output2 = self.dropout2(dense_output2)
      output = self.pred(dense_output2)
      return tf.reshape(output, [-1])
[Figure: Model Architecture (Image by author)]

Training and Results

With the model created, we can jump right into training. The model’s fit() method usually meets the requirements for training, but a custom training loop gives you finer control over optimization and other tasks associated with training. You can pick either one depending on how complex your training is going to be. Here we have employed a custom training loop.

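The loop below assumes the model, loss function, optimiser and accuracy metrics have already been created. A minimal setup might look like the following (EMBEDDING_DIM is an illustrative choice, not prescribed by the article; EPOCHS matches the five epochs reported below):

EPOCHS = 5
EMBEDDING_DIM = 128  # illustrative value

model = SkipGramModel(VOCAB_SIZE, EMBEDDING_DIM)
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimiser = tf.keras.optimizers.Adam()
train_acc_metric = tf.keras.metrics.BinaryAccuracy()
val_acc_metric = tf.keras.metrics.BinaryAccuracy()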

@tf.function
def train_step(target_words, context_words, labels):
    with tf.GradientTape() as tape:
      preds = model(target_words, context_words)
      loss = loss_fn(labels, preds)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimiser.apply_gradients(zip(gradients, model.trainable_variables))
    train_acc_metric.update_state(labels, preds)
    return loss


@tf.function
def test_step(target_words, context_words, labels):
    preds = model(target_words, context_words, training=False)
    loss = loss_fn(labels, preds)
    val_acc_metric.update_state(labels, preds)
    return loss


for epoch in range(EPOCHS):
  start_time = time.time()
  print("Starting epoch: %d " % (epoch,))
  cumm_loss = 0
  for step, (target_words, context_words, labels) in enumerate(train_dataset):
    train_loss = train_step(target_words, context_words, labels)
    cumm_loss += train_loss
  train_acc = train_acc_metric.result()
  print("Training acc over epoch: %.4f" % (float(train_acc),))
  train_acc_metric.reset_states()
  print("Cumulative loss: %.4f " % (cumm_loss,))


  test_cumm_loss = 0
  for step, (target_words, context_words, labels) in enumerate(test_dataset):
    test_loss = test_step(target_words, context_words, labels)
    test_cumm_loss += test_loss
  val_acc = val_acc_metric.result()
  print("Validation acc over epoch: %.4f" % (float(val_acc),))
  val_acc_metric.reset_states()
  print("Cumulative test loss: %f " % (test_cumm_loss,))
  print("Time taken: %.2fs" % (time.time() - start_time))

Starting epoch: 0
Training acc over epoch: 0.6382
Validation acc over epoch: 0.7458
Time taken: 374.43s

Starting epoch: 1
Training acc over epoch: 0.8682
Validation acc over epoch: 0.8237
Time taken: 368.18s

Starting epoch: 2
Training acc over epoch: 0.9438
Validation acc over epoch: 0.8494
Time taken: 374.33s

Starting epoch: 3
Training acc over epoch: 0.9701
Validation acc over epoch: 0.8604
Time taken: 382.57s

Starting epoch: 4
Training acc over epoch: 0.9800
Validation acc over epoch: 0.8656
Time taken: 376.69s

Word Embeddings Projector

For visualizing word embeddings, TensorFlow offers a brilliant platform that can be used to load and visualize saved weight vectors with just a couple of lines of code! Here’s how we do it. First, extract and store the weights of the embedding layer. Then write the word embeddings, as shown below, into two files: vecs.tsv, which stores the actual vectors, and meta.tsv, which contains the associated metadata for visualization.


import io

word_embeddings_layer = model.layers[0]
weights = word_embeddings_layer.get_weights()[0]
print("Word Embeddings shape: %s" % (weights.shape,))


out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')


for num, word in tokenizer.index_word.items():
  vec = weights[num] # skip 0, it's padding.
  out_m.write(word + "\n")
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
out_v.close()
out_m.close()

After that, hop over to http://projector.tensorflow.org/ and load the files created in the previous step. That’s it! TensorFlow takes care of the rest. Let’s take a look at the word embeddings we learned above after training for 5 epochs.


[Figure: Nearest words for “climate” (Image by author)]
[Figure: Nearest words for “parliament” (Image by author)]
[Figure: Nearest words for “molecules” (Image by author)]

The results and the test-set accuracy are quite significant and promising, considering the model was trained in about half an hour, without any GPU support, on only the first 5 million bytes of the corpus! As shown in the images above, “climate” is encoded as nearest to nautical, warm, cooler, temperatures, salinity, and moisture, among others. “Parliament” is most similar to bicameral, constituencies, ministers, seats, senators, etc., while “molecules” is closely related to arsenic, compounds, ammonium, synthetic, calcium, etc. Interested readers can further explore and enhance the word embeddings by playing around with a more complex model architecture and larger data!


The full version of the code snippets shared in this article, along with images and learned word embeddings is made available on Github. If you like this article or have any feedback, please let me know in the comments section below!


Translated from: https://towardsdatascience.com/word-embeddings-deep-dive-hands-on-approach-a710eb03e4c5
