A. FastText — complete version of the code below
Two approaches:
1. gensim
Installation on Windows 10: pip install gensim
(1) Train word vectors
(2) Find similar words
from gensim.models import FastText
# data_token is a pre-tokenized corpus; substitute your own
data_token = [['update', 'orion', '9'],
              ['import', 'file', "n't", 'work', 'anymore', 'orion.eclipse.org']]
# min_count=1 here so the tiny toy corpus is not filtered out entirely
# (in gensim >= 4 the size parameter is named vector_size)
model_FastText = FastText(data_token, size=5, window=5, min_count=1, workers=4, sg=1)
# Parameter notes
# min_count: ignore words that occur fewer than min_count times
# size: dimensionality of the word vectors
# window: context window size
# alpha: initial learning rate
# min_alpha: learning rate decays linearly down to min_alpha
# sg: 1 = skip-gram, 0 = CBOW
# hs: 1 = hierarchical softmax, 0 = negative sampling
model_FastText.wv['data']                 # vector for the word "data"
model_FastText.wv.most_similar("crash")   # words most similar to "crash"
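most_similar ranks vocabulary words by cosine similarity to the query word's vector. A minimal pure-Python sketch of that underlying computation (the toy vectors below are made-up numbers, not real model output):

```python
import math

# Toy word vectors (made-up values, standing in for model.wv lookups).
vectors = {
    "crash":  [0.9, 0.1, 0.0],
    "fail":   [0.8, 0.2, 0.1],
    "import": [0.0, 0.9, 0.4],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar(word, topn=2):
    # Rank every other word by cosine similarity to `word`.
    sims = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(sims, key=lambda p: p[1], reverse=True)[:topn]

print(most_similar("crash"))  # "fail" ranks above "import"
```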
(3) Updating the model with new corpora online; code adapted from the article 极简使用︱Gensim-FastText 词向量训练以及OOV(out-of-word)问题有效解决
from gensim.models import FastText
# Online (incremental) training of fastText
sentences_1 = [["cat", "say", "meow"], ["dog", "say", "woof"]]
sentences_2 = [["dude", "say", "wazzup!"]]
# Train an initial model on sentences_1
model = FastText(min_count=1)
model.build_vocab(sentences_1)
model.train(sentences_1, total_examples=model.corpus_count, epochs=model.epochs)
# Update the model with sentences_2
model.build_vocab(sentences_2, update=True)
model.train(sentences_2, total_examples=model.corpus_count, epochs=model.epochs)
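The OOV robustness mentioned above comes from FastText representing each word as a bag of character n-grams (plus the word itself), so even an unseen word decomposes into subwords with trained vectors. A simplified sketch of that subword extraction — real fastText additionally hashes the n-grams into a fixed number of buckets; minn/maxn mirror its 3/6 defaults:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams as FastText builds them: the word is wrapped in
    '<' and '>' boundary markers, then all substrings of length minn..maxn
    are collected (a simplified sketch; real fastText hashes these into
    a fixed bucket table)."""
    w = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

print(char_ngrams("cat", minn=3, maxn=3))  # trigrams only: ['<ca', 'cat', 'at>']
```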
2. fasttext
Installation: download the fasttext package file, then run pip install on the downloaded file (do not rename it)
Reference: Windows下fasttext文本分类
(1) Convert the raw dataset into fastText's input format (a crucial step)
import fasttext
import pandas as pd
# Load the raw dataset
data = pd.read_csv('F:/VirtualEnvs-Projct/nlp/data/eclipse_ECD.csv')
dataD = pd.DataFrame(data)
# Tokenize one summary string into space-separated tokens
def token_word(summary):
    tmp = fasttext.tokenize(summary)
    # Join the token list directly; the original munged str(tmp) with a chain
    # of replace() calls, which corrupted tokens containing quotes or commas
    return ' '.join(tmp)
# Map the dataset's severity labels to numeric classes
def get_label(severity):
    if severity == 'blocker' or severity == 'critical' or severity == 'major':
        label = 1
    # 'minor' here; the original repeated 'blocker', already caught above
    elif severity == 'minor' or severity == 'trivial' or severity == 'enhancement':
        label = 0
    else:
        label = 2
    return '__label__' + str(label)
# apply runs the same function on every element of a column
dataD['Token_Summary'] = dataD.Summary.apply(token_word)
dataD['label'] = dataD.Severity.apply(get_label)
dataD['token'] = dataD['label'] + ' ' + dataD['Token_Summary']
# Write the lines out in the file format fasttext expects
def write_data(sentences, fileName):
    print("writing data to fasttext format...")
    # with-statement so the file is closed even on error (original never closed it)
    with open(fileName, 'w', encoding='utf-8') as out:
        for sentence in sentences:
            out.write(sentence + "\n")
    print("done!")
write_data(dataD['token'], 'file.txt')
# Split into training and test sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(dataD['token'], test_size=0.2, random_state=0)
write_data(train, 'train.txt')
write_data(test, 'test.txt')
Each line of the resulting file has the form __label__<n> followed by the tokenized summary.
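The label mapping and line format built above can be sketched end to end without pandas. The three rows below are a made-up stand-in for the Summary/Severity columns of eclipse_ECD.csv, and the second branch is taken to cover the mild severities (minor/trivial/enhancement):

```python
# Hypothetical mini-dataset standing in for the (Summary, Severity) columns.
rows = [
    ("update orion 9", "blocker"),
    ("typo in tooltip", "trivial"),
    ("import file does not work", "normal"),
]

def get_label(severity):
    # Three-way mapping: severe -> 1, mild -> 0, everything else -> 2.
    if severity in ("blocker", "critical", "major"):
        label = 1
    elif severity in ("minor", "trivial", "enhancement"):
        label = 0
    else:
        label = 2
    return "__label__" + str(label)

# One record per line: "__label__<n> <tokenized summary>".
lines = [get_label(sev) + " " + summary for summary, sev in rows]
for line in lines:
    print(line)
```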
(2) Training and prediction with fasttext
model = fasttext.train_supervised('F:/VirtualEnvs-Projct/nlp/train.txt')
model.test('F:/VirtualEnvs-Projct/nlp/test.txt')  # returns (number of test examples, precision, recall)
# Predict the label of a sentence
model.predict(['data loss'])
# Result:
# ([['__label__2']], [array([0.9201928], dtype=float32)])
# Predict the top-k labels with their probabilities
model.predict(['data loss'], k=2)
# Result:
# ([['__label__2', '__label__0']],
#  [array([0.9201928 , 0.05301081], dtype=float32)])
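predict() returns two parallel lists — labels and probabilities, one entry per input sentence, each holding the top-k results. A small hypothetical helper (not part of the fasttext API) to turn that output back into numeric classes; plain lists stand in for the numpy arrays shown above, since float() accepts either:

```python
def parse_prediction(labels, probs):
    """Pair every '__label__<n>' string with its probability, per sentence.
    Hypothetical helper; input shapes match fasttext's predict() output."""
    results = []
    for labs, ps in zip(labels, probs):
        results.append([(int(l.replace("__label__", "")), float(p))
                        for l, p in zip(labs, ps)])
    return results

# Parsing the k=2 result shown above for one sentence:
print(parse_prediction([["__label__2", "__label__0"]], [[0.9201928, 0.05301081]]))
```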
train_supervised parameters: