Training and using word2vec in Python with gensim

superhy_scut

Article link: https://blog.csdn.net/qdhy199148/article/details/51754631

The original word2vec (word vector) code was written in C. There is also a Python implementation, which ships as part of gensim, an excellent framework.

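I used these features in graph-mind, my own open-source semantic network project (really just a small toy I wrote myself). You can use the further wrappers I built there to carry out some of these operations with almost no effort. Below I share how to call them and a few things I learned while writing the code.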

1. Some class member variables:

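import multiprocessing

def __init__(self, modelPath, _size=100, _window=5, _minCount=1, _workers=multiprocessing.cpu_count()):
    self.modelPath = modelPath  # file on disk where the trained model is stored
    self._size = _size          # dimensionality of the word vectors
    self._window = _window      # context window size used during training
    self._minCount = _minCount  # minimum word frequency kept in the vocabulary
    self._workers = _workers    # number of training workers (defaults to the CPU core count)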

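modelPath is the on-disk file where the trained word2vec model is stored (I never feel comfortable keeping the model only in memory). _size is the dimensionality of the word vectors, and _window is the size of the context window scanned during training. _minCount is the minimum word frequency: words that occur fewer times than this in the corpus are left out of the vocabulary, and the default is fine to start with. _workers is the number of worker threads used for training (corrections are welcome if a more precise explanation is needed); it defaults to the number of processor cores on the current machine. For now it is enough just to remember these parameters.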
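To make the window and minimum-count parameters concrete, here is a minimal pure-Python sketch of what they control. It is a simplified illustration, not gensim's actual implementation: the helper names `build_vocab` and `context_pairs` are hypothetical, and real word2vec training adds details (such as randomly shrinking the window) that are left out here.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


def context_pairs(sentence, vocab, window=5):
    """Yield (center, context) pairs within `window` positions of each word."""
    # Assumption of this sketch: out-of-vocabulary words are dropped
    # before the window is applied.
    kept = [w for w in sentence if w in vocab]
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, kept[j]


sentences = [["i", "like", "deep", "learning"], ["i", "like", "nlp"]]
vocab = build_vocab(sentences, min_count=2)
pairs = list(context_pairs(sentences[0], vocab, window=1))
```

With `min_count=2` only "i" and "like" survive, so the rare words never produce training pairs. In gensim itself the corresponding constructor arguments are `window`, `min_count` and `workers`; the vector dimension is `size` in older releases and `vector_size` from gensim 4.0 on.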

2. Initializing and training the word2vec model for the first time

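The core function for this is initTrainWord2VecModel. It takes two parameters, corpusFilePath and safe_model, which are the path of the training corpus and whether to use "safe mode" for the initial training. Safe mode is explained later; first, the code: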

def initTrainWord2VecModel(self, corpusFilePath, safe_model=False):
        '''
        Initialize and train a new w2v model.
        (corpusFilePath can be the path of a corpus file or directory, or a file
        passed in directly; in some cases it can also be the sentences themselves.
        About safe_model:
            if safe_model is True, training refreshes the model through incremental
        updates, which keeps the OS memory usage safe but is slower;
            if safe_model is False, training loads all corpus lines into one
        sentences list and trains on them in a single pass.)
        '''
        extraSegOpt().reLoadEncoding()
        
        fileType = localFileOptUnit.checkFileState(corpusFilePath)
        if fileType == u'error':
            warnings.warn('load file error!')
            return None
        else:
            model = None
            if fileType == u'opened':
                print('training model from singleFile!')
                model &
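The two training styles described in the docstring above differ mainly in how the corpus is fed to the model. The sketch below illustrates that difference with the standard library only. It is not the author's implementation, which is cut off in the source: `LineCorpus`, `load_all` and `iter_batches` are hypothetical names, and whitespace tokenization is an assumption.

```python
import os
import tempfile


class LineCorpus:
    """Re-iterable corpus that yields one tokenized line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()  # assumption: whitespace-separated tokens
                if tokens:             # skip empty lines
                    yield tokens


def load_all(path):
    """Non-safe style: materialize every sentence in one list."""
    return list(LineCorpus(path))


def iter_batches(corpus, batch_size):
    """Safe style: hand out small batches so only one batch is in memory."""
    batch = []
    for sentence in corpus:
        batch.append(sentence)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Small demo corpus written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("a b c\n\nd e\nf g h i\n")
    demo_path = tmp.name

all_sentences = load_all(demo_path)
batches = list(iter_batches(LineCorpus(demo_path), batch_size=2))
os.remove(demo_path)
```

In a gensim setting, each batch could then be passed to an incremental update, for example `build_vocab(batch, update=True)` followed by `train(...)` on the same batch. Whether that matches the author's safe mode is an assumption. The trade-off is the one the docstring names: bounded memory in exchange for slower training.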
