Training word vectors with the fastText tool

Acquiring the data
The dataset is a partial dump of English Wikipedia pages.
$ wget -c http://mattmahoney.net/dc/enwik9.zip
Unzip it to obtain the enwik9 file:
$ unzip enwik9.zip

Format of the enwik9 data:

(nlp) [root@bhs data]# more enwik9
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.6alpha</generator>
    <case>first-letter</case>
      <namespaces>
      <namespace key="-2">Media</namespace>
      <namespace key="-1">Special</namespace>
      <namespace key="0" />

Preprocessing: filter out the XML/HTML markup

The wikifil.pl script (Matt Mahoney's Wikipedia filter) strips the markup, leaving only lowercase text:

$ perl wikifil.pl enwik9 > fil9
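The core of that filtering can be approximated in Python. Below is a simplified, hypothetical re-implementation for illustration only (the real wikifil.pl also spells out digits and handles wiki-specific syntax):

```python
import re

def simple_wiki_filter(text: str) -> str:
    """Roughly mimic wikifil.pl: drop markup, keep lowercase letters and spaces."""
    text = re.sub(r"<[^>]*>", " ", text)   # strip XML/HTML tags
    text = text.lower()
    text = re.sub(r"[^a-z]+", " ", text)   # replace every non-letter run with a space
    return re.sub(r"\s+", " ", text).strip()

print(simple_wiki_filter("<title>Anarchism</title> originated, as a term of abuse!"))
# -> anarchism originated as a term of abuse
```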

Contents of the filtered file fil9:

(nlp) [root@bhs data]# more fil9 
 anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe
Training the word vectors
>>> import fasttext
>>> model = fasttext.train_unsupervised('data/fil9', "cbow", dim=300, epoch=1, lr=0.1, thread=8)


Parameter explanations:
Unsupervised training mode: 'skipgram' or 'cbow'; the default is 'skipgram'. In practice, skipgram makes better use of subword information than cbow.
Embedding dimension dim: default 100; as the corpus grows, larger embedding dimensions tend to work better.
Number of epochs epoch: default 5.
Learning rate lr: default 0.05; empirically, values in the range [0.01, 1] are recommended.
Number of threads thread: default 12; generally set this to the number of CPU cores on your machine.
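To make the difference between the two modes concrete, here is a toy illustration (pure Python, not fastText internals) of the training pairs each mode derives from a context window: CBOW predicts the center word from its surrounding context, while skipgram predicts each context word from the center word.

```python
def cbow_pairs(tokens, window=2):
    """(context list -> center word) pairs, as CBOW frames the task."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, center))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(center word -> context word) pairs, as skipgram frames the task."""
    pairs = []
    for i, center in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((center, ctx))
    return pairs

tokens = "the cat sat on the mat".split()
print(cbow_pairs(tokens)[2])      # (['the', 'cat', 'on', 'the'], 'sat')
print(skipgram_pairs(tokens)[:2]) # [('the', 'cat'), ('the', 'sat')]
```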

After training, retrieve the vector for a given word:

>>> model.get_word_vector('the')
array([-2.40050092e-01, -1.95936069e-01, -6.32880390e-01, -1.04872927e-01,
        3.69175017e-01,  9.78435855e-04, -4.04583544e-01, -6.08086407e-01,
       -7.89057538e-02,  3.68165284e-01, -2.00652070e-02, -3.89455110e-01,
       -6.22386694e-01, -7.05020785e-01, -1.90653861e+00, -1.14761412e-01,
       -7.23317981e-01,  1.72821879e+00,  9.29246664e-01,  1.74631700e-01,
       -3.97295058e-01, -4.57086593e-01,  2.73532718e-01,  1.75126567e-01,
       -7.02150285e-01,  8.08026910e-01, -4.60327297e-01, -6.47872865e-01,
       -7.70961940e-01, -9.13587883e-02, -1.46508798e-01,  3.21023762e-01,
        5.42958558e-01, -4.32848155e-01, -6.22740090e-01, -2.15316027e-01,
        3.03125858e-01,  1.76334515e-01, -4.12740290e-01,  1.21726441e+00,
        1.28870988e+00,  2.21328810e-01,  3.02456737e-01,  3.43191952e-01,
       -3.35683167e-01,  1.07879341e-01, -4.15570438e-01,  3.49437058e-01,
       -6.76526368e-01,  5.84482014e-01, -1.87016428e-01,  7.80748248e-01,
       -6.39917850e-01, -2.61718899e-01,  7.39042580e-01,  1.30696505e-01,
        1.07203627e+00, -5.26710272e-01, -6.92395389e-01,  2.39853457e-01,
        3.57318670e-01,  7.51733541e-01,  1.51219666e-01, -4.12798464e-01,
       -7.44705021e-01, -6.84746504e-01,  1.70669988e-01,  1.07712710e+00,
       -3.39158356e-01, -1.52987885e+00,  1.09384096e+00, -4.54250097e-01,
       -1.10697113e-02, -8.11372876e-01,  1.53196323e+00,  6.07076824e-01,
       -1.51879653e-01,  1.63555220e-01, -5.44529796e-01,  7.85924137e-01,
       -1.05206549e+00, -7.44569063e-01, -1.45621169e+00, -3.48930240e-01,
       -7.31964931e-02,  3.96538168e-01,  3.45483865e-03, -9.91007984e-02,
       -1.69376388e-01,  8.20461452e-01, -5.13499826e-02, -1.43764064e-01,
       -4.71283525e-01, -7.47936010e-01,  1.07978940e+00,  1.44067216e+00,
       -2.76158541e-01,  8.05234194e-01, -2.64122784e-01,  7.14614749e-01,
       -2.75607497e-01,  1.58570260e-01, -6.65432870e-01, -1.48548639e+00,
       -7.12232769e-01,  1.64294231e+00, -3.74271363e-01,  5.19163013e-01,
        2.20753342e-01,  8.14575478e-02, -9.91819263e-01, -6.68195859e-02,
        8.42194974e-01, -1.11236417e+00, -6.30728841e-01,  4.92352635e-01,
        6.29572868e-01,  9.86076951e-01, -2.66191602e-01, -8.61157060e-01,
        1.28934574e+00,  8.20774019e-01,  1.35654598e-01, -6.03160799e-01,
       -5.12832046e-01,  5.01171872e-02,  3.00273299e-01, -2.32950017e-01,
       -5.14157414e-01,  5.98636568e-01,  3.87961686e-01,  6.07027352e-01,
        1.03336227e+00, -1.43141612e-01,  9.02619362e-01,  1.59850210e-01,
       -9.36119556e-01, -5.02088785e-01,  2.66743630e-01,  4.10576701e-01,
        5.07164657e-01, -1.68225312e+00,  6.84738338e-01,  8.21013212e-01,
       -4.25240606e-01,  3.73900384e-01,  2.87017405e-01,  5.80932498e-01,
        6.54161334e-01, -2.55300999e-01, -2.44218796e-01,  4.71038580e-01,
       -1.77474529e-01,  3.14508677e-01, -4.93917987e-02, -2.12814584e-01,
       -6.06304288e-01, -1.14386928e+00, -2.10981041e-01,  2.88650483e-01,
       -6.71522983e-04,  2.41575018e-01,  4.21473056e-01,  9.90042269e-01,
       -7.08687842e-01,  3.59151095e-01,  1.08175015e+00, -7.78073967e-01,
        1.85326010e-01, -1.05325794e+00,  8.06448832e-02,  1.42552006e+00,
        2.14579463e-01, -4.41795528e-01, -8.25945795e-01,  8.31834257e-01,
       -8.19991380e-02, -3.02930534e-01,  3.60262334e-01, -3.02095413e-01,
        6.99657798e-01,  5.72522581e-01, -9.07035947e-01,  9.56989467e-01,
        8.43555331e-01, -4.61010575e-01, -5.30865610e-01, -7.02177510e-02,
        2.56720960e-01, -8.36370230e-01,  1.74392194e-01, -9.26310122e-02,
       -4.75168705e-01, -6.82441652e-01, -1.07436347e+00, -2.89227068e-01,
        2.20277280e-01, -4.95301664e-01, -5.89004159e-01, -1.04854929e+00,
        7.71572351e-01, -5.07593155e-01,  4.99838978e-01, -7.44045004e-02,
        8.68184566e-01, -5.18415451e-01,  1.14308774e-01, -4.56971407e-01,
        1.28684020e+00,  3.77036870e-01, -5.87315679e-01, -1.17326510e+00,
       -1.03616095e+00,  1.05946529e+00, -4.26467508e-01, -1.44005561e+00,
       -1.90337822e-02,  3.56762230e-01, -9.78742242e-01, -2.81119525e-01,
        1.69721901e-01, -8.83528590e-01, -7.94641912e-01, -3.74035954e-01,
       -7.42775261e-01,  1.12977457e+00,  3.35744411e-01,  7.18433738e-01,
       -1.05081677e+00, -1.16853333e+00, -6.60338283e-01,  8.72176737e-02,
        4.12553102e-01,  6.79606199e-01,  1.79481819e-01, -8.99737030e-02,
        7.10959613e-01,  4.14395690e-01, -4.17709440e-01,  4.23936486e-01,
       -9.90137815e-01, -3.57081085e-01,  6.18476510e-01, -6.82233572e-01,
       -7.94733614e-02,  1.12701046e+00,  2.22074240e-01,  2.54456490e-01,
       -8.35089803e-01, -3.95485133e-01,  5.55131473e-02,  2.22720695e+00,
        4.47311640e-01, -3.69377166e-01,  5.74232519e-01, -4.07226592e-01,
        2.82875270e-01,  4.44723785e-01, -4.82321173e-01, -4.66798931e-01,
        2.34293818e-01, -1.01615584e+00,  7.19777197e-02,  1.28374135e+00,
        4.40646291e-01,  6.18108749e-01, -6.26284063e-01,  4.26211715e-01,
        3.16519409e-01,  1.90637514e-01,  1.25102079e+00, -7.60308728e-02,
       -6.56022847e-01,  1.26399517e+00,  3.50388348e-01,  1.04573660e-01,
        7.95926332e-01,  3.88161466e-02, -2.80209988e-01,  3.53194773e-01,
       -2.18353438e+00, -2.05612525e-01,  7.24893510e-01,  8.70515406e-02,
       -1.79120656e-02,  4.59726661e-01, -8.10881495e-01,  7.10043609e-02,
       -2.39707962e-01, -1.18325806e+00, -1.49373269e+00, -1.07335567e+00,
        9.18981433e-01,  5.30046403e-01,  2.00748533e-01,  5.67615449e-01,
        9.48272884e-01, -1.65072358e+00, -2.71418661e-01, -3.29914004e-01],
      dtype=float32)
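Since get_word_vector returns a plain NumPy array, comparing two words reduces to cosine similarity between their vectors. A minimal sketch, using small hypothetical stand-ins in place of real 300-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With a trained model you would use real vectors, e.g.:
#   v1 = model.get_word_vector('the')
#   v2 = model.get_word_vector('a')
# Toy stand-ins here:
v1 = np.array([0.2, -0.4, 0.1], dtype=np.float32)
v2 = np.array([0.3, -0.5, 0.0], dtype=np.float32)
print(cosine_similarity(v1, v2))
```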

Checking the results
# Look up the nearest neighbors of 'music'; the results are all music-related words.
>>> model.get_nearest_neighbors('music')

[(0.8908010125160217, 'emusic'), (0.8464668393135071, 'musicmoz'), (0.8444250822067261, 'musics'), (0.8113634586334229, 'allmusic'), (0.8106718063354492, 'musices'), (0.8049437999725342, 'musicam'), (0.8004694581031799, 'musicom'), (0.7952923774719238, 'muchmusic'), (0.7852965593338013, 'musicweb'), (0.7767147421836853, 'musico')]
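Under the hood, get_nearest_neighbors ranks the vocabulary by cosine similarity to the query word's vector. A simplified, hypothetical version over a toy vocabulary (note that fastText itself also excludes the query word from its results):

```python
import numpy as np

def nearest_neighbors(query_vec, vocab_vectors, k=3):
    """Return the k (similarity, word) pairs most similar to query_vec."""
    scores = []
    for word, vec in vocab_vectors.items():
        sim = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        scores.append((float(sim), word))
    return sorted(scores, reverse=True)[:k]

# Toy vocabulary; with fastText you would build this from model.get_words()
# and model.get_word_vector(w).
vocab = {
    "music":  np.array([0.9, 0.1, 0.0]),
    "musics": np.array([0.8, 0.2, 0.1]),
    "rock":   np.array([0.7, 0.3, 0.0]),
    "banana": np.array([0.0, 0.1, 0.9]),
}
print(nearest_neighbors(vocab["music"], vocab, k=2))
```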

Saving and loading the model
Save the model with save_model:
>>> model.save_model("fil9_model.bin")

Load it back with fasttext.load_model:
>>> model = fasttext.load_model("fil9_model.bin")

