【NLP】11 Other Sentence-Vector Generation Methods: the TF-IDF Model, and Averaging the Tencent AI Lab Chinese Embedding Corpus

Copied on the Linux server but can't paste into Windows?
Fix for Remote Desktop refusing to copy/paste or transfer files: restart the rdpclip.exe process. To look up the process on the Linux side:

ps -ef |grep rdpclip
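
On the Windows side, the clipboard helper can then be restarted from cmd; a minimal sketch (rdpclip.exe ships with Windows):

taskkill /F /IM rdpclip.exe
rdpclip.exe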

1. TF-IDF training

from gensim.models import TfidfModel
from gensim.corpora import Dictionary
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

path = '/mnt/Data1/ysc/Data.txt'
txt = open(path, 'r', encoding='utf-8')

print('Building dictionary')
dictionary = Dictionary()   # start from an empty dictionary
i = 0
for line in txt.readlines():
    tmp_list = line.strip('\n').strip(' ').split(' ')
    dictionary.add_documents([tmp_list], prune_at=3000000)
    i += 1
    if i % 10000 == 0:
        # prune periodically so the dictionary stays within memory limits
        dictionary.filter_extremes(no_below=2, no_above=0.8)
txt.close()
dictionary.filter_extremes(no_below=2, no_above=0.8)
print('Saving dictionary')
dictionary.save('/mnt/Data1/ysc/Tfidf.dic')
txt = open(path, 'r', encoding='utf-8')
print('Computing term frequencies')
corpus = [dictionary.doc2bow(line.strip('\n').strip(' ').split(' ')) for line in txt.readlines()]
txt.close()
print('Saving term frequencies')
out_path = '/mnt/Data1/ysc/corpus.txt'
txt = open(out_path, 'a', encoding='utf-8')
for item in corpus:
    txt.write(str(item) + '\n')
txt.close()
print(type(corpus))
print(corpus[0])
print('Training model')
tf_idf_model = TfidfModel(corpus, normalize=False)  # raw tf*idf weights, no length normalization
print('Saving model')
tf_idf_model.save('/mnt/Data1/ysc/tfidf.model')
2021-03-26 18:45:57,971 : INFO : keeping 100000 tokens which were in no less than 2 and no more than 1888665 (=80.0%) documents
2021-03-26 18:45:58,052 : INFO : resulting dictionary: Dictionary(100000 unique tokens: ['一', '一个', '一串', '一些', '一份']...)
2021-03-26 18:45:58,053 : INFO : saving Dictionary object under /mnt/Data1/ysc/Tfidf.dic, separately None
Saving dictionary
Computing term frequencies
2021-03-26 18:45:58,102 : INFO : saved /mnt/Data1/ysc/Tfidf.dic
Saving term frequencies
2021-03-26 18:52:37,529 : INFO : collecting document frequencies
2021-03-26 18:52:37,529 : INFO : PROGRESS: processing document #0
<class 'list'>
[(0, 9), (1, 9), ..., (94745, 1), (94939, 1)] 
Training model
2021-03-26 18:52:38,025 : INFO : PROGRESS: processing document #10000

...
2021-03-26 18:53:23,966 : INFO : PROGRESS: processing document #2360000
2021-03-26 18:53:23,982 : INFO : calculating IDF weights for 2360831 documents and 100000 features (378829414 matrix non-zeros)
Saving model
2021-03-26 18:53:24,109 : INFO : saving TfidfModel object under /mnt/Data1/ysc/tfidf.model, separately None
2021-03-26 18:53:24,330 : INFO : saved /mnt/Data1/ysc/tfidf.model
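
As a quick sanity check, the saved dictionary and model can be reloaded and applied to a tokenized sentence; a minimal sketch (the tokens here are made-up examples):

from gensim.corpora import Dictionary
from gensim.models import TfidfModel

dictionary = Dictionary.load('/mnt/Data1/ysc/Tfidf.dic')
tfidf = TfidfModel.load('/mnt/Data1/ysc/tfidf.model')

bow = dictionary.doc2bow(['亮度', '高'])   # hypothetical tokens
print(tfidf[bow])                          # [(token_id, tf*idf weight), ...]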

Generating sentence vectors:

import jieba

from gensim.models import TfidfModel
from gensim.corpora import Dictionary

model = TfidfModel.load('/mnt/Data1/ysc/TF-IDF/tfidf.model')
dictionary = Dictionary.load('/mnt/Data1/ysc/TF-IDF/Tfidf.dic')

from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load('/mnt/Data1/ysc/TF-IDF/vectors.kv')

def get_sentence_vec(sentence):
    words = ' '.join(jieba.cut(sentence)).split(' ')   # tokenize with jieba
    bow = dictionary.doc2bow(words)
    vecsum = [0] * word_vectors.vector_size
    cnt = 0
    for item in model[bow]:                            # (token_id, tf-idf weight) pairs
        try:
            wordvec = word_vectors[dictionary[item[0]]]        # KeyError if the word has no embedding
            wordvec = [i * item[1] for i in wordvec]           # weight the word vector by its tf-idf score
            vecsum = [i + j for i, j in zip(vecsum, wordvec)]
            cnt += 1
        except KeyError:
            print(dictionary[item[0]] + ' not in vocab!')
    vecsum = [i / cnt for i in vecsum]                 # average; fails when cnt == 0 (fixed below)
    return vecsum
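
A quick usage example (the sentence is made up; the result has word_vectors.vector_size dimensions):

vec = get_sentence_vec('亮度高')
print(len(vec))   # equals word_vectors.vector_size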

Note that when the dictionary was built, words appearing in more than 80% of documents were filtered out. These are not necessarily stopwords; '我' ('I'), for instance, gets removed.
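
A quick, hedged check of what was dropped, assuming the dictionary saved above:

from gensim.corpora import Dictionary
dictionary = Dictionary.load('/mnt/Data1/ysc/Tfidf.dic')
print('我' in dictionary.token2id)   # expected False: dropped by no_above=0.8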

import jieba
import numpy

from gensim.models import TfidfModel
from gensim.corpora import Dictionary

model = TfidfModel.load('/mnt/Data1/ysc/TF-IDF/tfidf.model')
dictionary = Dictionary.load('/mnt/Data1/ysc/TF-IDF/Tfidf.dic')

from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load('/mnt/Data1/ysc/TF-IDF/vectors.kv')

def get_sentence_vec(sentence):
    words = ' '.join(jieba.cut(sentence)).split(' ')
    bow = dictionary.doc2bow(words)
    vecsum = [0] * word_vectors.vector_size
    cnt = 0
    for item in model[bow]:
        try:
            wordvec = word_vectors[dictionary[item[0]]]
            wordvec = [i * item[1] for i in wordvec]
            vecsum = [i + j for i, j in zip(vecsum, wordvec)]
            cnt += 1
        except KeyError:
            if item[0] == 88727: continue       # dictionary[88727] == '' (empty token)
            print(item[0])
            print(dictionary[item[0]] + ' not in vocab!')
    if cnt == 0: return numpy.array(vecsum)     # no usable word: return the all-zero vector
    vecsum = [i / cnt for i in vecsum]
    return numpy.array(vecsum)


path = '/mnt/Data1/ysc/'
file = open(path + 'Data_Small.txt', 'r', encoding='utf-8')
output = open(path + 'Vec_Small_Tf-idf.txt', 'a', encoding='utf-8')

for line in file.readlines():
    vec = get_sentence_vec(line[:-4])   # strip the trailing label (assumes 3 letters; 'NORM' leaves a stray 'N', but those lines are skipped downstream)
    emotion = line[-4:-1]               # 'POS' / 'NEG' / 'ORM' (tail of 'NORM')
    if vec.any():                       # skip all-zero vectors
        output.write(str(vec).replace('\n', '') + ' ' + emotion + '\n')   # numpy repr, newlines removed
    else:
        print('All zeros!')
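
Writing str(vec) forces the regex-and-eval parsing seen further below when the file is read back; a hypothetical alternative is a plain space-separated format that round-trips without regexes:

import numpy

def write_vec(output, vec, emotion):
    # hypothetical writer: one float per column, label last
    output.write(' '.join('%.6f' % x for x in vec) + ' ' + emotion + '\n')

def read_vec(line):
    # hypothetical reader: split, convert, peel the label off the end
    parts = line.split()
    return numpy.array(parts[:-1], dtype=float), parts[-1]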

The dataset contains 609,677 sentences; 3,764 of them came back with an empty (all-zero) sentence vector.

Test-set sentences whose sentence vector came back empty:

140亮度高
507声场还是有些局促
961字太大
970字太大
1042层次分明
1360可能屏小
1441反应速度快
1684方方正正
1708不过功放一般
2190通话质量可以
2214色彩艳丽
2440机器配置可以
2449除了亮度高

280色彩艳丽

14路感可以
20声噪大
423低油耗
668小排量
669大马力
670低油耗
823低油耗

479价格低
484机形虽不大
509超清
690价格不菲
819价格合理
1359屏不比屏差
1397直出比差
1581反应速度快
2133色彩艳丽
2163但是网速慢啊
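
To see why these sentences come back empty, one can check, token by token, whether anything survives the dictionary and has an embedding; a hedged diagnostic using the names from the loading code above:

def explain_sentence(sentence):
    # token-by-token coverage: does it survive the dictionary, does it have an embedding?
    for w in jieba.cut(sentence):
        print(w, '| in dictionary:', w in dictionary.token2id,
                 '| has embedding:', w in word_vectors)

explain_sentence('字太大')   # every token filtered out or OOV -> empty sentence vector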

The code is as follows:

import re

path = '/mnt/Data1/ysc/TF-IDF/Vec_Small_Tf-idf.txt'
file = open(path, 'r', encoding='utf-8')
train_data = []
train_label = []
i = 0
for line in file.readlines():
    if line[-4:-1] == 'POS':
        train_label.append(1)
    elif line[-4:-1] == 'NEG':
        train_label.append(-1)
    elif line[-4:-1] == 'ORM':   # neutral ('NORM') lines are skipped
        continue
    try:
        # undo numpy's str() formatting: collapse padding after '[', insert commas, then eval the list
        line = re.sub(r'\[[ ]+', '[', line[:-5])
        train_data.append(eval(re.sub(r'[ ]+', ', ', line.replace(' ]', ']'))))
        # train_data.append(eval(re.sub(r'[ ]+', ', ', (line[:-5]).replace('[  ','[ ').replace('[ ', '[').replace(' ]', ']'))))
    except:
        print(line)

    i += 1
    if i % 10000 == 0: print(i)
file.close()
print(len(train_data) == len(train_label))
print('Total training sentence vectors: %d' % len(train_data))

path = '/mnt/Data1/ysc/TF-IDF/Vec_test_Tf-idf.txt'
file = open(path, 'r', encoding='utf-8')
test_data = []
test_label = []

for line in file.readlines():
    if line[-4:-1] == 'POS':
        test_label.append(1)
    elif line[-4:-1] == 'NEG':
        test_label.append(-1)
    elif line[-4:-1] == 'ORM':
        continue
    line = re.sub(r'\[[ ]+', '[', line[:-5])
    test_data.append(eval(re.sub(r'[ ]+', ', ', line.replace(' ]', ']'))))
file.close()
print(len(test_data) == len(test_label))
print('Total test sentence vectors: %d' % len(test_data))


def svm(X_train, y_train, X_test, y_test):  # support vector machine
    from sklearn.svm import LinearSVC  # linear support vector classifier
    svm = LinearSVC()  # defaults; pass max_iter to raise the iteration cap
    svm.fit(X_train, y_train)  # train the model
    print('Accuracy of svm on training set:{:.2f}'.format(svm.score(X_train, y_train)))  # training accuracy
    print('Accuracy of svm on test set:{:.2f}'.format(svm.score(X_test, y_test)))  # test accuracy
    predict = svm.predict(X_test)  # predicted labels
    return predict


def cal_accuracy(predict, testing_labels):  # accuracy from predicted vs. true labels
    if len(predict) != len(testing_labels):
        print('Error!')
        return
    correct_classification = 0  # number of correctly classified samples
    for i in range(0, len(predict)):  # for each test sample
        if testing_labels[i] == predict[i]:
            correct_classification += 1
    return correct_classification / len(predict)  # accuracy


predict = svm(train_data, train_label, test_data, test_label)
print(cal_accuracy(predict, test_label))
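
The SGD numbers below come from an analogous routine; a minimal sketch, assuming sklearn's SGDClassifier (for the early-stopping run, set early_stopping=True with validation_fraction=0.1):

def sgd(X_train, y_train, X_test, y_test):  # stochastic gradient descent classifier
    from sklearn.linear_model import SGDClassifier
    sgd = SGDClassifier(max_iter=10000)  # add early_stopping=True, validation_fraction=0.1 for the early-stopped run
    sgd.fit(X_train, y_train)
    print('Accuracy of sgd on training set:{:.2f}'.format(sgd.score(X_train, y_train)))
    print('Accuracy of sgd on test set:{:.2f}'.format(sgd.score(X_test, y_test)))
    return sgd.predict(X_test)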

SVM results (max_iter=10000):

Accuracy of svm on training set:0.69
Accuracy of svm on test set:0.77
0.7656846283010228

SGD results:

Accuracy of sgd on training set:0.62
Accuracy of sgd on test set:0.64
0.6368493359792398

SGD with early stopping (0.1) results:

Accuracy of sgd on training set:0.59
Accuracy of sgd on test set:0.60
0.6020454892382843

SVM results (max_iter=1000):

Accuracy of svm on training set:0.65
Accuracy of svm on test set:0.62
0.6177682796519616

As you can see, these results are abysmal as well…

2. Averaging the Tencent AI Lab Chinese embedding corpus to generate sentence vectors

Same training procedure; data for which no word vectors, and hence no sentence vector, exist were removed (listed below; a sketch of the averaging itself follows the list):

经朋 POS
TAT NORM
, NORM
TT NORM
I LOVE YOU....... POS
.. NORM
我无语...... NEG
, NORM
.... NORM
... NORM
13716034788 NORM
TAT NEG
..................................... NORM
噴笑 POS
Good NORM
地摊货,,,,,,,,, NEG
SUMN IS COMING POS
美尚雯婕 POS
SO COOOOOOL POS
2008Teen Choice Awards Arrivals POS
Macbook Pro POS
Amy Winehouse POS
Teardrop POS
Knock Knock POS
15921107722 POS
ZUHAIR MURAD POS
SNOOPY X LACOSTE POS
........................ POS
bonniebonnie POS
Happy Birthday To You POS
V5 POS
2or4or7or9 POS
AUV ,继克爷 POS
Wesley Sneijder POS
Monchhichi POS
Give Me Five O POS
........... POS
Gwen Stefani POS
Happy National Day POS
y3st3RdAy On3c3 MoR3 POS
Dmop X Sanrio X Ground Zero Party 20AUG2010 POS
guanguijie POS
小新小新HELLO KITTY POS
叽驥 ITTY POS
608806608806.... POS
JoyStick 起貼 Lauch Party POS
You Jump I Jump POS
togethermore POS
Sexy Nikita ...Maggie Q POS
既生翔,何生鹏 POS
Salvatore Ferragamo POS
SSSSSSSSSSSSS POS
Louis Scola 42points 12rebounds POS
Occasional POS
Apple Magic Mouse POS
Angelina Jolie POS
kakakakaka POS
Fifth Avenue Shoe Repair Winter 2010Collection POS
whatdoyouthink POS
操温鬼暖枫.... POS
CooooooooooolDubBigSpin POS
300027300002 POS
......... POS
longyinhuxiao POS
...TWILIGHT ... POS
01619416999 POS
1949Roadmaster Riviera Hardtop POS
Visiting CW11 Morning Show POS
PartyQueen POS
Lan Somerhalder POS
Cute ...... POS
LALALA ..... POS
LV Louis Vuitton POS
Promotionals POS
Happeeee Smile .... POS
201010241111 POS
Emma Roberts POS
ohnooooooooooooooooo NEG
Freja Gucci S S 11Backstage NEG
yiwangfanni bendeyaoside benbenbenbenbenbieliwole NEG
............................ NEG
Milan Fashion Week NEG
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ........................... NEG
I LOVE STARBUCKS NEG
C6 OR N86 NEG
dulldulldull ......... NEG
OH MY Lady Yangyang NEG
Black Friday NEG
Auto Dos NEG
.......... NEG
yangqi ..... NEG
153529858901535298589215352985896 NEG
LOVE LIFE NEG
1369139978613522025847 NEG
2010LOST TWO NEG
HAPPY BIRTHDAY TO MYSELF NEG
EAST WEST HOME IS BEST NEG
blekx blekx blekxx NEG
13400728099 NEG
Depanche Mode ... NEG
.................... NEG
景皓哥 NEG
NO ONE LOVES ME NEG
................. NEG
Goodbye My Car NEG
13718195241 NEG
chaojihaoxiao NEG
GOING BACK ....TML CONTINUE ...........................CRAZY NEG
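
The averaging itself is straightforward; a minimal sketch, assuming the Tencent embeddings were converted into a gensim KeyedVectors file (the path here is hypothetical):

import jieba
import numpy as np
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load('/mnt/Data1/ysc/Tencent/vectors.kv')

def get_sentence_vec_avg(sentence):
    vecs = [word_vectors[w] for w in jieba.cut(sentence) if w in word_vectors]
    if not vecs:                                    # no token has an embedding
        return np.zeros(word_vectors.vector_size)   # callers filter out all-zero vectors
    return np.mean(vecs, axis=0)                    # unweighted average of word vectors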

On sending files from the server to Windows: the software installed last time had an auto-start-on-boot problem, so I uninstalled it and, following an article, used the SSH server built into Windows 10 for the connection.

services.msc
netstat -ant
Active Connections

  Proto  Local Address          Foreign Address        State           Offload State

  TCP    0.0.0.0:22             0.0.0.0:0              LISTENING       InHost

Port 22 (SSH) is listening, which means the SSH server service started successfully.

Then open cmd as administrator and enter:

net user sshuser ysc123 /add
net user sshuser ysc123 /active
net user
-------------------------------------------------------------------------------
Administrator            BvSsh_VirtualUsers       DefaultAccount
Guest                    sshd                     sshuser
WDAGUtilityAccount       Yang SiCheng
The command completed successfully.

If connecting raises an error, see the referenced article: clearing the contents of /home/ysc/.ssh/known_hosts on the server fixes it.
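
With the Windows SSH server running, files can then be pushed straight from the Linux box with scp (the IP, user, and paths here are placeholders):

scp /mnt/Data1/ysc/Vec_Small_Tf-idf.txt sshuser@192.168.1.100:C:/Users/sshuser/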

Training-set sentences removed because no sentence vector could be generated:

字太大
字太大

声噪大

屏不比屏差

SVM (max_iter=1000):

Accuracy of svm on training set:0.75
Accuracy of svm on test set:0.78
0.7750076010945576

SVM (max_iter=10000):

Accuracy of svm on training set:0.75
Accuracy of svm on test set:0.78
0.7750076010945576

SGD (max_iter=10000):

Accuracy of sgd on training set:0.75
Accuracy of sgd on test set:0.78
0.7780480389176041

SGD (max_iter=10000, early stopping):

Accuracy of sgd on training set:0.75
Accuracy of sgd on test set:0.78
0.7771359075706902

The results turn out to be about the same as the earlier Word2Vec ones.

Summary

  1. TF-IDF is not as effective as expected: like Doc2Vec it performs poorly, and it even falls short of Doc2Vec.
  2. Averaging the Tencent AI Lab embedding corpus for sentence vectors did not beat the earlier models either; one possible reason is that these vectors are only 200-dimensional.