Word2vec Code Walkthrough 3: Model Construction
The code is at the end.
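1.1 InPut[3] explained:
Function prototype (called as np.around in InPut[5]): numpy.around(a, decimals=0, out=None)
Parameters:
a is the input list or array;
decimals is the number of decimal places n to keep after rounding; the default is 0, and a negative value -n rounds at the n-th digit to the left of the decimal point;
out is an optional parameter that is rarely needed; it is an array in which the rounded result is stored.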
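The three parameters above can be tried on a small array. This is a minimal sketch; the values and variable names are illustrative and do not come from the original notebook.

```python
import numpy as np

values = np.array([3.14159, 2.71828, 1234.5678])

# decimals=4 keeps four digits after the decimal point (as in InPut[5])
rounded_4 = np.around(values, 4)

# decimals defaults to 0: round to the nearest integer, result stays float
rounded_0 = np.around(values)

# decimals=-2 rounds at the second digit to the left of the decimal point
rounded_neg2 = np.around(values, -2)

# out: write the result into an existing array of the same shape
buf = np.empty_like(values)
np.around(values, 2, out=buf)

print(rounded_4)
print(rounded_0)
print(rounded_neg2)
print(buf)
```

Note that np.around rounds exact halves to the nearest even value, so np.around(0.5) gives 0.0 and np.around(1.5) gives 2.0.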
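1.2 InPut[4] explained:
shape reports the length of each dimension of an array. For example, shape[0] is the length of the first dimension, which corresponds to the number of rows. For an array with two dimensions (rows and columns), a.shape returns a tuple holding the number of rows and the number of columns, just as you would expect. The output (17808, 39, 150) shows that X_train is a three-dimensional array: judging from the loop in InPut[3], that is 17808 documents, 39 words per document, and a 150-dimensional vector per word.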
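A small sketch of how shape behaves, using a toy array as a stand-in for X_train (the sizes 5, 39 and 150 are chosen only for illustration):

```python
import numpy as np

# Toy stand-in for X_train: 5 samples, 39 words each, 150-dim vectors
toy = np.zeros((5, 39, 150))

print(toy.shape)     # tuple with one entry per dimension
print(toy.shape[0])  # length of the first dimension
print(toy.ndim)      # number of dimensions

# np.shape also accepts a nested Python list, as in InPut[4]
nested = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
nested_shape = np.shape(nested)
print(nested_shape)
```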
1.3 InPut[5] explained:
See 1.1.
1.4 InPut[6] explained:
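t_y=pd.read_csv('w_Y.csv', header=None, index_col=None)
The pandas read_csv function reads the file and returns its contents as a DataFrame.
header: header=None tells pandas that the file has no header row, so the first line is treated as data and the columns are numbered 0, 1, 2, ...
index_col: after reading a file, the resulting DataFrame's index defaults to 0, 1, 2, ... We can set the index afterwards with set_index, but we can also designate a column as the index at read time. index_col=None keeps the default index.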
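The effect of header and index_col can be seen on a tiny in-memory file. This is a sketch under assumptions: the CSV text below is made up, and io.StringIO stands in for a file on disk such as w_Y.csv.

```python
import io
import pandas as pd

# Hypothetical label file: one label per line, no header row
csv_text = "1\n0\n1\n"

labels = pd.read_csv(io.StringIO(csv_text), header=None, index_col=None)
# header=None: the first line is data, columns are numbered 0, 1, ...
# index_col=None: the index is the default 0, 1, 2, ...
label_array = labels.values  # 2-D NumPy array, one row per line

# For contrast, index_col=0 uses the first column as the index
table = pd.read_csv(io.StringIO("a,1\nb,0\n"), header=None, index_col=0)

print(labels)
print(label_array.shape)
print(table)
```

In InPut[6], t_y.values plays the same role as label_array here: it turns the DataFrame into a NumPy array.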
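1.5 InPut[7] explained:
Computes several evaluation metrics from the true and predicted labels: precision, recall, SN (sensitivity), SP (specificity), GM (geometric mean of recall and SP), and the counts TP, TN, FP and FN. Recall and SN are the same quantity, TP/(TP+FN).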
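As a worked example of these definitions, the sketch below counts TP, TN, FP and FN with NumPy boolean masks and derives the ratios from them. It is not the function from the notebook: confusion_counts, safe_div and the sample labels are hypothetical names and data introduced for illustration.

```python
import math
import numpy as np

def confusion_counts(labels, preds):
    """Count TP, TN, FP, FN for binary labels coded as 0/1."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    tp = int(np.sum((labels == 1) & (preds == 1)))
    tn = int(np.sum((labels == 0) & (preds == 0)))
    fp = int(np.sum((labels == 0) & (preds == 1)))
    fn = int(np.sum((labels == 1) & (preds == 0)))
    return tp, tn, fp, fn

def safe_div(num, den):
    """Return num/den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

# Made-up labels: 4 positives and 4 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
sn = safe_div(tp, tp + fn)         # sensitivity, identical to recall
sp = safe_div(tn, tn + fp)         # specificity
precision = safe_div(tp, tp + fp)
gm = math.sqrt(sn * sp)            # geometric mean of SN and SP

print(tp, tn, fp, fn)
print(sn, sp, precision, gm)
```

With these labels there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so SN, SP, precision and GM all come out to 0.75.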
References
[1] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a) "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781.
[2] https://blog.csdn.net/runmin1/article/details/89174511
[3] https://blog.csdn.net/jiaxinhong/article/details/81771885
[4] https://zhuanlan.zhihu.com/p/340441922
The code is here.
InPut[3]:
X_train = []
for i in range(0, 17808):
    s = []
    # for word in document3[i]:
    #     s.append(model3.wv[word])
    for word in document3[i]:
        s.append(model3.wv[word])
    # for word in document5[i]:
    #     s.append(model5.wv[word])
    X_train.append(s)
# print(np.shape(X_train))
# for word2 in document4:
#     s = []
#     for i2 in word2:
#         s.append(model4.wv[i2])
#     X_train.append(s)
# print(np.shape(X_train))
# for word3 in document5:
#     s = []
#     for i3 in word3:
#         s.append(model5.wv[i3])
#     X_train.append(s)
# print(np.shape(X_train))
InPut[4]:
np.shape(X_train)
Output:
(17808, 39, 150)
InPut[5]:
import numpy as np
# Round every vector component to 4 decimal places
X_train = np.around(X_train, 4)
InPut[6]:
import pandas as pd
import numpy as np
#t_x=pd.read_csv('wanmeir.csv', header=None, index_col=None)
t_y=pd.read_csv('w_Y.csv', header=None, index_col=None)
xx= np.array(X_train)
xx_y=t_y.values
#xx = np.expand_dims(xx, axis=2)
#tt_x=pd.read_csv('fufufud.csv', header=None, index_col=None)
#tt_y=pd.read_csv('fuy.csv', header=None, index_col=None)
#test_x=tt_x.values
#test_y=tt_y.values
#test_x = np.expand_dims(test_x, axis=2)
InPut[7]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
import math
def performance(labelArr, predictArr):
    # labelArr[i] is the actual label, predictArr[i] is the predicted label (0 or 1)
    TP = 0.; TN = 0.; FP = 0.; FN = 0.
    for i in range(len(labelArr)):
        if labelArr[i] == 1 and predictArr[i] == 1:
            TP += 1.
        if labelArr[i] == 1 and predictArr[i] == 0:
            FN += 1.
        if labelArr[i] == 0 and predictArr[i] == 1:
            FP += 1.
        if labelArr[i] == 0 and predictArr[i] == 0:
            TN += 1.
    # Each ratio falls back to 0 when its denominator is 0
    if (TP + FN) == 0:
        SN = 0
    else:
        SN = TP / (TP + FN)  # Sensitivity = TP/P and P = TP + FN
    if (FP + TN) == 0:
        SP = 0
    else:
        SP = TN / (FP + TN)  # Specificity = TN/N and N = TN + FP
    if (TP + FP) == 0:
        precision = 0
    else:
        precision = TP / (TP + FP)
    if (TP + FN) == 0:
        recall = 0
    else:
        recall = TP / (TP + FN)  # same value as SN
    GM = math.sqrt(recall * SP)  # geometric mean of recall and specificity
    # MCC = (TP*TN-FP*FN)/math.sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
    return precision, recall, SN, SP, GM, TP, TN, FP, FN