GloVe Implementation with deeplearning4j

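GloVe is similar to word2vec and is said to perform even better. It can be used for natural language processing tasks. I came across it while learning deeplearning4j, so I am writing it down here.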

This post uses the same data as the previous word2vec post, so let's see how it does. Training takes far longer than word2vec. The code is as follows:

package com.meituan.deeplearning4j;

import org.deeplearning4j.models.glove.Glove;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Arrays;

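public class GloVeRaw {
	public static void main(String[] args) throws FileNotFoundException {
		// Training corpus, one sentence per line (a local path on the author's machine)
		String filePath = "/Users/shuubiasahi/Desktop/bayies/deeplearning/part-00000";
		SentenceIterator iter = new BasicLineIterator(new File(filePath));

		// Default tokenizer, with the common preprocessor applied to every token
		TokenizerFactory t = new DefaultTokenizerFactory();
		t.setTokenPreProcessor(new CommonPreprocessor());

		Glove glove = new Glove.Builder()
				.iterate(iter)
				.tokenizerFactory(t)
				.alpha(0.75)        // exponent of the weighting function
				.learningRate(0.1)
				.epochs(25)         // number of training epochs
				.xMax(100)          // cutoff of the weighting function
				.batchSize(1000)    // samples per training batch
				.shuffle(true)      // shuffle batches before training
				.symmetric(true)    // build word pairs in both directions
				.build();

		glove.fit();

		// The query tokens below are words from the training corpus and are kept as-is
		System.out.println("10 words nearest to 微信: " + glove.wordsNearest("微信", 10));
		System.out.println(Arrays.toString(glove.getWordVector("微信")));
		System.out.println("Similarity between 微信 and 腾讯聊天账号: " + glove.similarity("微信", "腾讯聊天账号"));
		System.out.println("10 words nearest to 腾讯聊天账号: " + glove.wordsNearest("腾讯聊天账号", 10));

		System.exit(0);
	}
}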
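
The `alpha(0.75)` and `xMax(100)` settings correspond to the weighting function described in the original GloVe paper, which down-weights rare co-occurrences and caps the weight of frequent ones at 1. Below is a minimal Python sketch of that function, assuming DL4J's parameters of the same name play this role. The function name `glove_weight` is made up for illustration and is not part of any library.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max) ** alpha below x_max, 1.0 otherwise.

    x is a co-occurrence count; x_max and alpha mirror the values passed
    to the builder above (an assumption about how DL4J uses them).
    """
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0


if __name__ == "__main__":
    # Show how the weight grows with the count and saturates at x_max.
    for count in (1, 10, 100, 1000):
        print(count, round(glove_weight(count), 4))
```

A larger `x_max` lets frequent pairs keep gaining weight for longer, while a smaller `alpha` flattens the curve so that rare pairs count for relatively more.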






GloVe (Global Vectors for Word Representation) is an algorithm for producing word vector representations. It combines global corpus statistics with word co-occurrence counts taken from local context windows. The basic steps of a Python implementation of the GloVe algorithm are:

1. Import the required libraries

```python
import numpy as np
from collections import Counter
```

2. Define a function that computes the co-occurrence matrix

```python
def co_occurrence_matrix(corpus, window_size):
    words = corpus.split()
    # Vocabulary in order of first appearance
    vocab = list(Counter(words).keys())
    word_to_index = {word: i for i, word in enumerate(vocab)}
    co_matrix = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
    for i, w_i in enumerate(words):
        # Count every neighbour within window_size positions of word i
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                co_matrix[word_to_index[w_i], word_to_index[words[j]]] += 1
    return co_matrix, vocab
```

3. Define a function that computes the GloVe matrix

```python
def glove_matrix(co_matrix, embedding_dim=50, learning_rate=0.05, epochs=100):
    # Simplified version: one shared matrix W serves as both word and context vectors
    np.random.seed(0)
    W = np.random.uniform(-0.5, 0.5, (co_matrix.shape[0], embedding_dim))
    b = np.random.uniform(-0.5, 0.5, co_matrix.shape[0])
    x_max = 100
    alpha = 0.75
    for epoch in range(epochs):
        grad_w = np.zeros_like(W)
        grad_b = np.zeros_like(b)
        for i in range(co_matrix.shape[0]):
            for j in range(co_matrix.shape[1]):
                if co_matrix[i][j] > 0:
                    # Model prediction for log X_ij
                    w_ij = np.dot(W[i], W[j]) + b[i] + b[j]
                    # Weighting function f(X_ij)
                    f_wij = (co_matrix[i][j] / x_max) ** alpha if co_matrix[i][j] < x_max else 1
                    delta = f_wij * (w_ij - np.log(co_matrix[i][j]))
                    grad_w[i] += delta * W[j]
                    grad_w[j] += delta * W[i]
                    grad_b[i] += delta
                    grad_b[j] += delta
        # Full-batch gradient step once per epoch
        W -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return W
```

4. Use the functions to compute the word vectors

```python
corpus = "apple banana orange apple apple banana"
co_matrix, vocab = co_occurrence_matrix(corpus, window_size=2)
W = glove_matrix(co_matrix, embedding_dim=50, learning_rate=0.05, epochs=100)
word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = {i: word for i, word in enumerate(vocab)}
word_vecs = {}
for word, i in word_to_index.items():
    word_vecs[word] = W[i]
```

This gives us a dictionary containing the word vector for each word.
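
Once the vectors are in a dictionary, a natural next step is to compare words, much like the `similarity` and `wordsNearest` calls in the Java example. The sketch below assumes cosine similarity as the measure. The helper names `cosine_similarity` and `nearest_words` and the toy vectors are invented for illustration and are not trained values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_words(word, word_vecs, top_n=10):
    """Return up to top_n (word, similarity) pairs, most similar first, excluding the query word."""
    target = word_vecs[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in word_vecs.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Hand-made 2-d vectors, only to show the call pattern
    toy_vecs = {"apple": [1.0, 0.0], "banana": [0.9, 0.1], "orange": [0.0, 1.0]}
    print(nearest_words("apple", toy_vecs, top_n=2))
```

The same functions accept a dictionary of NumPy rows in place of `toy_vecs`, since they only iterate over the vector elements.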