# tf_w2v_sg_demo.py
# -*- coding: utf-8 -*-
import time
import numpy as np
import tensorflow as tf
import random
from collections import Counter
# 2. Load data
# Read the whole pre-tokenized corpus (space-separated tokens) into one string.
with open('data/Javasplittedwords', encoding='utf-8') as corpus_file:
    text = corpus_file.read()
# 3. Preprocessing
# 3.1 Drop rare words: keep only tokens occurring more than 50 times.
# NOTE(review): split(' ') leaves newline characters attached to tokens if the
# corpus contains line breaks — confirm the data file is purely space-separated.
words = text.split(' ')
words_count = Counter(words)
words = [w for w in words if words_count[w] > 50]
# 3.2 Build word <-> index lookup tables.
# Enumerate the vocabulary in sorted order so the word->int mapping is
# deterministic across runs: iterating a raw set is subject to string hash
# randomization, which would silently invalidate any embeddings/checkpoints
# saved by a previous run.
vocab = set(words)
vocab_to_int = {w: i for i, w in enumerate(sorted(vocab))}
# Derive the inverse table from the forward one so the two can never disagree.
int_to_vocab = {i: w for w, i in vocab_to_int.items()}
print("total words: {}".format(len(words)))
print("unique words: {}".format(len(vocab)))
# 3.3 Encode the corpus as a list of integer word ids.
int_words = [vocab_to_int[w] for w in words]
# 4. Subsampling
# Downsample very frequent, stop-word-like tokens (e.g. "the", "of", "for").
# Removing them speeds up training and reduces noise in the learned vectors.
t = 1e-5          # subsampling threshold from the word2vec formulation
threshold = 0.9   # drop-probability cutoff (not used in this chunk; presumably applied later)
# Occurrence count of every word id in the encoded corpus.
int_word_counts = Counter(int_words)
total_count = len(int_words)
# Relative frequency of each word id.
word_freqs = {w: c / total_count for w, c in int_word_counts.items()}
# Drop probability per word id: P(drop) = 1 - sqrt(t / freq).
prob_drop = {w: 1 - np.sqrt(t / freq) for w, freq in word_freqs.items()}
<