TensorFlow 2.0 word2vec => item2vec: gensim is faster than TensorFlow

from gensim.models import Word2Vec
import numpy as np
import pandas as pd
import collections
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

embedding_size = 32  # dimensionality of the embedding vectors
max_vocabulary_size = 50000  # total number of distinct items kept in the vocabulary
min_occurrence = 10  # drop all items that occur fewer than this many times
skip_window = 3  # how many items to consider to the left and to the right
num_skips = 2  # how many times to reuse an input to generate labels (unused by gensim; left over from the TensorFlow version)
num_sampled = 64  # number of negative samples (unused by gensim; left over from the TensorFlow version)

# Load the MovieLens ratings
data_file = "C:/project/data/movielens-m1/ratings.dat"
orig_data = pd.read_csv(data_file, sep="::", engine="python",  # the multi-character separator requires the python engine
                        names=["user_id", "item_id", "score", "timestamp"],
                        dtype={"user_id": int, "item_id": str, "score": int, "timestamp": int})
# Concatenate each user's item_ids into a single comma-separated string
grouped_data = orig_data.groupby("user_id")["item_id"].apply(",".join).reset_index()
grouped_data.columns = ["user_id", "item_ids"]
grouped_data["item_ids_array"] = grouped_data["item_ids"].apply(lambda s: s.split(","))
sentences = grouped_data["item_ids_array"]  # each user's item-id list acts as one "sentence" for Word2Vec

# class gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, size=100,
# alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1,
# workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75,
# cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None,
# sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), max_final_vocab=None)
model = Word2Vec(size=embedding_size, window=skip_window, min_count=min_occurrence, max_vocab_size=max_vocabulary_size)
# train(self, sentences=None, corpus_file=None, total_examples=None, total_words=None,
# epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2,
# report_delay=1.0, compute_loss=False, callbacks=())
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=500, start_alpha=0.1, end_alpha=0.02, compute_loss=True)  # train() needs total_examples (the sentence count), not the vocabulary size
model.save("gensim/item2vec.model")
# Omit total_vec: forcing it to max_vocabulary_size would write a wrong header count when the filtered vocabulary is smaller
model.wv.save_word2vec_format("gensim/item2vec.txt")
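
Once training finishes, the saved model can be reloaded and queried for similar items. A minimal sanity-check sketch, assuming gensim 3.x and the save path above; the movie id "1" is a hypothetical query and must have survived the min_count filter:

```python
from gensim.models import Word2Vec

# Reload the trained item2vec model
model = Word2Vec.load("gensim/item2vec.model")

# Print the 10 items whose embeddings are closest to item "1" (hypothetical id)
for item_id, similarity in model.wv.most_similar("1", topn=10):
    print(item_id, similarity)
```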
Below is LSTM text-classification code for TensorFlow 2.0 that uses word2vec for the word embeddings:

```python
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from gensim.models import Word2Vec
import numpy as np

# Load the pretrained word2vec model
w2v_model = Word2Vec.load('word2vec.model')

# Word-vector dimensionality and maximum sequence length
embedding_dim = 100
max_length = 100

# Define the LSTM model, seeding the Embedding layer with the word2vec weights
model = Sequential()
model.add(Embedding(input_dim=len(w2v_model.wv.vocab),
                    output_dim=embedding_dim,
                    input_length=max_length,
                    weights=[w2v_model.wv.vectors]))
model.add(LSTM(units=64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Load the data
x_train = np.load('x_train.npy')
y_train = np.load('y_train.npy')
x_test = np.load('x_test.npy')
y_test = np.load('y_test.npy')

# Train the model
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, batch_size=32)
```

In the code above, we load a pretrained word2vec model with gensim and pass its weight matrix into the LSTM model as the Embedding layer's initial weights. Before training, the texts must already have been converted to integer sequences; here we simply load them from .npy files with numpy. Finally, we train the model with fit.

Here is the word2vec training code in detail:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the corpus: one whitespace-tokenized sentence per line
sentences = LineSentence('corpus.txt')

# Train the model
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

# Save the model
model.save('word2vec.model')
```

In the code above, we train a word2vec model with gensim's Word2Vec class. We first load the corpus with the LineSentence class, then train with Word2Vec; parameters such as the vector dimensionality, window size, and minimum word frequency can be set at construction time. Finally, we persist the model with the save method.
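
The x_train.npy / x_test.npy files above are assumed to already hold padded integer index sequences. A minimal sketch of that conversion step, assuming a gensim 3.x model and whitespace-tokenized texts (the helper name and sample documents are hypothetical):

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.sequence import pad_sequences

w2v_model = Word2Vec.load('word2vec.model')
vocab = w2v_model.wv.vocab  # gensim 3.x: token -> Vocab object carrying its row .index

def texts_to_index_sequences(tokenized_texts, max_length=100):
    # Map each token to its row index in w2v_model.wv.vectors, dropping OOV tokens,
    # then pad/truncate every sequence to max_length.
    # Note: padding value 0 collides with the index of one real token; a dedicated
    # pad row would be cleaner, but is omitted here to keep the sketch short.
    sequences = [[vocab[tok].index for tok in text if tok in vocab]
                 for text in tokenized_texts]
    return pad_sequences(sequences, maxlen=max_length, padding='post')

# Hypothetical tokenized documents
docs = [["this", "movie", "was", "great"], ["terrible", "plot"]]
x = texts_to_index_sequences(docs)
np.save('x_train.npy', x)  # matches the file name used by the LSTM example
```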