PaddlePaddle Deep Learning Tutorial: Word Embedding with Global Vectors (GloVe) Explained

Article link: https://blog.csdn.net/gitblog_00995/article/details/148578506

awesome-DeepLearning: introductory, advanced, and specialized deep learning courses, academic cases, industry practice cases, a deep learning knowledge encyclopedia, and an interview question bank. Project address: https://gitcode.com/gh_mirrors/aw/awesome-DeepLearning

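Introduction

In natural language processing, word embeddings are a foundation for building language models. GloVe (Global Vectors for Word Representation) is a classic word embedding method that combines the strengths of global corpus statistics and local context windows, and it performs well on many NLP tasks. This article explains the principles of the GloVe model, its implementation details, and how to use it with the PaddlePaddle framework.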

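GloVe Model Basics

1. From co-occurrence statistics to word vectors

The core idea of GloVe is to learn word vectors from word-word co-occurrence statistics gathered over the entire corpus. Unlike the skip-gram model, which trains on one local context window at a time, GloVe models the global co-occurrence count matrix directly, which helps it capture semantic relationships between words.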

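2. Building the co-occurrence matrix

Each element X_ij of the co-occurrence matrix X is the number of times word w_j appears in the context of word w_i. With a symmetric context window the matrix is symmetric: if w_j appears in the context of w_i, then w_i also appears in the context of w_j.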
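
The symmetry can be checked on a toy sentence. The following is a minimal pure-Python sketch, not the tutorial's implementation: the helper name `count_cooccurrences`, the example sentence, and the window size of 2 are all assumptions chosen for illustration.

```python
from collections import defaultdict

def count_cooccurrences(tokens, window_size=2):
    """Count (center, context) pairs inside a symmetric window (hypothetical helper)."""
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                counts[(center, tokens[j])] += 1.0
    return counts

tokens = "the cat sat on the mat".split()
counts = count_cooccurrences(tokens, window_size=2)

# With a symmetric window, each pair is counted once from each side,
# so the count for (a, b) equals the count for (b, a).
is_symmetric = all(counts[(b, a)] == value
                   for (a, b), value in list(counts.items()))
print(counts[("the", "sat")], counts[("sat", "the")], is_symmetric)
```

An asymmetric window (for example, counting only words to the right of the center word) would break this property, which is why the symmetry claim depends on the window definition.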

Mathematical Principles of GloVe

1. Loss function design

GloVe uses a carefully designed weighted least-squares loss:

$$ J = \sum_{i,j=1}^V f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2 $$

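where:

w_i and w̃_j are the vector representations of the center word and the context word, respectively
b_i and b̃_j are bias terms
f(X_ij) is a weighting function that balances the influence of high-frequency and low-frequency words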

2. The role of the weighting function

The design of the weighting function f(x) is a key innovation of GloVe:

$$ f(x) = \begin{cases} (x/x_{\text{max}})^\alpha & \text{if } x < x_{\text{max}} \\ 1 & \text{otherwise} \end{cases} $$

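Typical settings are x_max = 100 and α = 0.75. This design keeps the model from focusing too heavily on high-frequency words without ignoring low-frequency words entirely.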
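
To make the weighting concrete, here is a small pure-Python sketch of f(x) and of the loss contribution of a single word pair. The function names `glove_weight` and `glove_pair_loss` and the toy vectors are assumptions for illustration; only the formulas and the values x_max = 100 and α = 0.75 come from the text above.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): grows as (x / x_max) ** alpha and is capped at 1 once x reaches x_max."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """One term of J: f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij) ** 2."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    error = dot + b_i + b_j - math.log(x_ij)
    return glove_weight(x_ij, x_max, alpha) * error ** 2

# A rare pair gets a small weight, a frequent pair gets the maximum weight of 1.
print(glove_weight(1.0), glove_weight(50.0), glove_weight(500.0))

# Toy vectors whose dot product equals log(5), so this pair contributes no loss.
print(glove_pair_loss([1.0, 0.0], [math.log(5.0), 0.0], 0.0, 0.0, 5.0))
```

Because log X_ij is undefined for X_ij = 0, this sketch is only meaningful for pairs that actually co-occur, which matches how the training code later in this article keeps only non-zero entries.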

Implementing GloVe in PaddlePaddle

1. Data preprocessing

To implement GloVe in PaddlePaddle, first build the vocabulary and compute the co-occurrence matrix:

import paddle
import numpy as np
from collections import defaultdict

# Build the vocabulary: map each distinct word to an integer index
def build_vocab(texts):
    vocab = defaultdict(int)
    for text in texts:
        for word in text.split():
            vocab[word] += 1
    return {word:i for i,word in enumerate(vocab.keys())}

# Compute the co-occurrence matrix; each pair is weighted by 1 / distance
def build_cooccurrence_matrix(texts, vocab, window_size=5):
    vocab_size = len(vocab)
    cooccurrence = np.zeros((vocab_size, vocab_size))
    for text in texts:
        words = text.split()
        word_ids = [vocab[w] for w in words]
        for i, center_id in enumerate(word_ids):
            start = max(0, i - window_size)
            end = min(len(word_ids), i + window_size + 1)
            for j in range(start, end):
                if j != i:
                    context_id = word_ids[j]
                    cooccurrence[center_id][context_id] += 1.0 / abs(j - i)
    return cooccurrence

2. Model definition

Define the GloVe model with PaddlePaddle:

class GloVeModel(paddle.nn.Layer):
    def __init__(self, vocab_size, embedding_dim):
        super(GloVeModel, self).__init__()
        self.center_embeddings = paddle.nn.Embedding(
            vocab_size, embedding_dim)
        self.context_embeddings = paddle.nn.Embedding(
            vocab_size, embedding_dim)
        self.center_biases = paddle.nn.Embedding(
            vocab_size, 1)
        self.context_biases = paddle.nn.Embedding(
            vocab_size, 1)
        
    def forward(self, center_words, context_words, cooccurrence):
        center_embeds = self.center_embeddings(center_words)
        context_embeds = self.context_embeddings(context_words)
        center_bias = self.center_biases(center_words)
        context_bias = self.context_biases(context_words)
        
        dot_product = paddle.sum(center_embeds * context_embeds, axis=1)
        prediction = dot_product + center_bias.squeeze() + context_bias.squeeze()
        
        # Weighted loss: f(x) = min(x / x_max, 1) ** alpha with x_max=100, alpha=0.75
        weights = paddle.clip(cooccurrence / 100.0, max=1.0) ** 0.75
        loss = weights * paddle.square(prediction - paddle.log(cooccurrence))
        return paddle.mean(loss)

3. Model training

A typical training procedure for the GloVe model:

def train_glove(vocab, cooccurrence, embedding_dim=100, epochs=25):
    vocab_size = len(vocab)
    model = GloVeModel(vocab_size, embedding_dim)
    optimizer = paddle.optimizer.Adam(parameters=model.parameters())
    
    # Prepare training data: keep only pairs with a non-zero co-occurrence count
    center_words = []
    context_words = []
    cooccurrences = []
    for i in range(vocab_size):
        for j in range(vocab_size):
            if cooccurrence[i,j] > 0:
                center_words.append(i)
                context_words.append(j)
                cooccurrences.append(cooccurrence[i,j])
    
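    center_words = paddle.to_tensor(center_words, dtype='int64')
    context_words = paddle.to_tensor(context_words, dtype='int64')
    # float32 matches the dtype of the embedding parameters
    cooccurrences = paddle.to_tensor(cooccurrences, dtype='float32')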
    
    # Training loop (full batch: all non-zero pairs in every step)
    for epoch in range(epochs):
        loss = model(center_words, context_words, cooccurrences)
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        print(f"Epoch {epoch}, Loss: {loss.numpy()}")
    
    # Combine the center-word and context-word vectors by summing them
    embeddings = model.center_embeddings.weight + model.context_embeddings.weight
    return embeddings.numpy()
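
Once the embeddings are returned, a common sanity check is to look up nearest neighbors by cosine similarity. The sketch below is an illustrative assumption rather than part of the original tutorial: `cosine_similarity` and `most_similar` are hypothetical helpers, and the toy vocabulary and two-dimensional vectors are made up instead of trained.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vocab, embeddings, top_k=3):
    """Return the top_k (word, similarity) pairs closest to `word` (hypothetical helper)."""
    query = embeddings[vocab[word]]
    scores = [(other, cosine_similarity(query, embeddings[idx]))
              for other, idx in vocab.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]

# Made-up vectors standing in for the rows returned by train_glove.
vocab = {"king": 0, "queen": 1, "apple": 2}
embeddings = [[1.0, 0.9], [0.9, 1.0], [-1.0, 0.2]]

neighbors = most_similar("king", vocab, embeddings, top_k=2)
print(neighbors)
```

The same helpers should work on the NumPy array returned by `train_glove`, since each row can be iterated like a list, although a vectorized NumPy version would be faster for a large vocabulary.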