Multi-class classification of the M10 corpus with Doc2Vec (Python)

Corpus: a citation-network corpus of papers, divided into 10 classes.
It contains 3 txt files: one with document IDs + document titles, one with the citation relations between document IDs, and one with the document class labels.
Corpus download: m10

Unlike word2vec, doc2vec trains directly on documents, so what comes out is one vector per document.
The process has three main steps: extract the document information, train the model, and classify.

The first step, extraction, is straightforward: take each document's title as its words, then give every document a tag before feeding everything into training (the document ID would work too; here sequential integers 0, 1, 2, ... are used, matching the code below).
gensim's doc2vec originally provided LabeledSentence for this; a namedtuple is used here instead, which is essentially the same thing: each element holds the title words + the document tag + the document's class label.

from collections import namedtuple
from gensim import utils as ut

# one entry per document: title words + tag + class label
NetworkSentence = namedtuple('NetworkSentence', ['words', 'tags', 'label'])

def readdata(path):
    fileinfor = []
    labels = set()
    f1 = open(path + '/docs.txt', encoding='utf-8')
    f2 = open(path + '/labels.txt', encoding='utf-8')
    i = 0
    for l1 in f1:
        tokens = ut.to_unicode(l1.lower()).split()
        words = tokens[1:]   # document title words
        tags = [i]           # document tag: a sequential integer ID
        i = i + 1
        l2 = f2.readline()   # labels.txt is read line-by-line in step with docs.txt
        tokens2 = ut.to_unicode(l2).split()
        label = tokens2[1]   # document class label
        labels.add(label)
        fileinfor.append(NetworkSentence(words, tags, label))
    f1.close()
    f2.close()

    return fileinfor, list(labels)
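
For reference, readdata expects one document per line with the ID in the first column; here is a hypothetical sketch of the two file formats inferred from the code above (the actual M10 lines may differ):

    # docs.txt   : "<docID> <title word> <title word> ..."
    # labels.txt : "<docID> <label>"
    docs, classlabels = readdata('data/M10')
    print(docs[0].words, docs[0].tags, docs[0].label)
    # e.g. (illustrative output): ['some', 'title', 'words'] [0] '3'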

The second step is training the doc2vec model, shuffling the corpus between passes to improve accuracy.
The parameter dm=0 selects the PV-DBOW model.
(Here hs=1, i.e. the Hierarchical Softmax variant, which works much better; with hs=0, i.e. negative sampling, the results are not great, although they are similar to those reported in the paper......)

from random import shuffle
from gensim.models.doc2vec import Doc2Vec

def traindo2vec(docs):
    # dm=0 -> PV-DBOW; hs=1 -> hierarchical softmax
    model = Doc2Vec(dm=0, size=300, dm_mean=0, window=8, hs=1, negative=2,
                    min_count=5, workers=4)  # PV-DBOW
    print('Building Vocabulary')
    model.build_vocab(docs)
    for i in range(9):
        print("pass %d" % i)
        shuffle(docs)  # reshuffle the corpus before each training pass
        model.train(docs, total_examples=model.corpus_count, epochs=model.iter)
    return model
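
Note that the code above targets the pre-4.0 gensim API (size=, model.iter, model.docvecs). A minimal equivalent sketch for gensim >= 4.0, using only the documented renames (vector_size, model.epochs, model.dv):

from random import shuffle
from gensim.models.doc2vec import Doc2Vec

def traindo2vec_v4(docs):
    # same PV-DBOW setup, gensim >= 4.0 parameter names
    model = Doc2Vec(dm=0, vector_size=300, window=8, hs=1, negative=2,
                    min_count=5, workers=4)
    model.build_vocab(docs)
    for i in range(9):
        shuffle(docs)
        model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
    return model  # document vectors live in model.dv rather than model.docvecs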

After training, each document is one vector, stored in model.docvecs: docvecs[0] is the vector of the first document. So the training and test vectors are obtained as follows:

    from sklearn.model_selection import train_test_split

    path = 'data/M10'
    docs, classlabels = readdata(path)
    do2model = traindo2vec(docs)

    train, test = train_test_split(docs, train_size=0.7, random_state=0)  # random train/test split

    print("%d documents in total" % len(docs))
    print("%d classes in total" % len(classlabels))

    # docvecs holds one vector per document; the tag locates the vector inside the model
    train_vec = [do2model.docvecs[doc.tags[0]] for doc in train]
    test_vec = [do2model.docvecs[doc.tags[0]] for doc in test]

    train_y = [doc.label for doc in train]
    test_y = [doc.label for doc in test]
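
As an aside, a document that was not in the training set can still be embedded with gensim's infer_vector; a minimal sketch (the title string here is made up):

    new_words = "efficient clustering of citation networks".lower().split()
    new_vec = do2model.infer_vector(new_words)  # vector for an unseen document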

Finally, train the classifier:

    print('Training set: %d samples, test set: %d samples' % (len(train_vec), len(test_vec)))
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score
    from sklearn.metrics import accuracy_score
    clf = LinearSVC()
    clf.fit(train_vec, train_y)
    result = clf.predict(test_vec)
    print(result)

    macro_f1 = f1_score(test_y, result, average='macro')
    micro_f1 = f1_score(test_y, result, average='micro')
    acc = accuracy_score(test_y, result)
    print(acc)
    print(macro_f1, micro_f1)
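
If a per-class breakdown is wanted as well, sklearn's classification_report works on the same predictions:

    from sklearn.metrics import classification_report
    print(classification_report(test_y, result))  # per-class precision/recall/F1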

Final results (with hs=1, i.e. the Huffman-tree / hierarchical softmax variant):
with document title information only, the results are mediocre;
combining DeepWalk with doc2vec works fairly well, at roughly 72% (a sketch of one way to do this follows).
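
The post does not show the combination code; one common way to combine the two is to concatenate each document's doc2vec vector with its DeepWalk node embedding before training the classifier. A minimal sketch, assuming a hypothetical dict deepwalk_vec that maps document tags to node embeddings trained separately on the citation graph:

    import numpy as np

    # deepwalk_vec: hypothetical {tag: np.ndarray}, from a DeepWalk model
    # trained on the citation-relation file (not shown in this post)
    train_vec2 = [np.hstack([do2model.docvecs[d.tags[0]], deepwalk_vec[d.tags[0]]])
                  for d in train]
    test_vec2 = [np.hstack([do2model.docvecs[d.tags[0]], deepwalk_vec[d.tags[0]]])
                 for d in test]
    clf2 = LinearSVC().fit(train_vec2, train_y)
    print(accuracy_score(test_y, clf2.predict(test_vec2)))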
