Author-Topic Model


蔡艺君小朋友


Category: python. Tags: atmodel

Article link: https://blog.csdn.net/qq_32482091/article/details/80876776


Author-Topic Model (ATMODEL)

The output produced in my recent Java blog post is fed directly into the Python code here.
Bugs I ran into:

1. BUG1

perwordbound = at_model.bound(at_model.corpus, author2doc=at_model.author2doc,
                              doc2author=at_model.doc2author) / corpus_words

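ValueError: bound cannot be called with authors not seen during training.
Cause: author.txt contains malformed records such as the following, where an author name and its document IDs are split across lines and one line begins with a full-width comma (，):

湖北大学,2,5,6,8
湖北大学
,9,28
湖北中医药大学
，7,9,56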
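
A quick way to locate such records before training is to scan the file line by line. The sketch below is not part of the original script. It assumes that every valid record has the form `author,id,id,...` on a single line, with ASCII commas and numeric IDs, which is inferred from the samples above. The function name `find_malformed_lines` is hypothetical.

```python
def find_malformed_lines(lines):
    """Return (line_number, text, reason) for lines that break the assumed format."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # ignore blank lines
        if "，" in text:
            # A full-width comma is not split by str.split(",")
            problems.append((number, text, "full-width comma"))
            continue
        author, *doc_ids = text.split(",")
        if not author.strip():
            problems.append((number, text, "missing author name"))
        elif not doc_ids:
            problems.append((number, text, "no document IDs"))
        elif not all(d.strip().isdigit() for d in doc_ids):
            problems.append((number, text, "non-numeric document ID"))
    return problems


# The sample records from above, used only to exercise the check.
sample = ["湖北大学,2,5,6,8", "湖北大学", ",9,28", "湖北中医药大学", "，7,9,56"]
for number, text, reason in find_malformed_lines(sample):
    print(number, repr(text), reason)
```

Running this over the real author.txt (for example with `open("author.txt", encoding="utf-8")` as the `lines` argument, assuming the file is UTF-8) should list the lines to repair by hand before building `author2doc`.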

2. BUG2

for a, a_doc_ids in author2doc.items():
    for i, doc_id in enumerate(a_doc_ids):
        author2doc[a][i] = doc_id_dict[doc_id]

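The line author2doc[a][i] = doc_id_dict[doc_id] fails with the message KeyValue:’ ’ (presumably a KeyError for an empty or blank key).
Cause: some line in author.txt ends with a trailing comma, so splitting it yields an empty document ID, as in the first line below.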

中南大学,11,12,
北京大学,15
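
One way to avoid the empty ID is to drop empty fields while splitting each line. This is a minimal sketch and not the original script's parser: it assumes the `author,id,id,...` layout shown above, and `parse_author_line` is a hypothetical helper name.

```python
def parse_author_line(line):
    """Split 'author,id,id,...' into (author, [ids]), skipping empty fields."""
    fields = [field.strip() for field in line.strip().split(",")]
    author = fields[0]
    # A trailing comma produces an empty last field; filter it out.
    doc_ids = [field for field in fields[1:] if field]
    return author, doc_ids


author2doc = {}
for line in ["中南大学,11,12,", "北京大学,15"]:
    author, doc_ids = parse_author_line(line)
    author2doc.setdefault(author, []).extend(doc_ids)

print(author2doc)  # {'中南大学': ['11', '12'], '北京大学': ['15']}
```

With the empty field removed, every remaining ID can be looked up in `doc_id_dict`, provided the ID actually exists there.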

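The code below has the following problems:

It contains redundant parts.
The exported sim.csv shows garbled text when opened in Excel but displays correctly in Notepad++; the encoding has to be changed to UTF-8-BOM in Notepad++.
Its functionality is incomplete.
It is not memory-friendly, but I do not know how to adjust the current file format.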
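
On the sim.csv problem in the list above: the manual re-encoding step in Notepad++ can likely be avoided by writing the byte order mark at export time. The sketch below is independent of the script that follows and uses only the standard library. The helper name `write_csv_for_excel`, the demo file name, and the row values are placeholders, not the author's data.

```python
import csv
import os
import tempfile


def write_csv_for_excel(path, rows):
    """Write rows as CSV with a UTF-8 byte order mark so Excel detects the encoding."""
    # "utf-8-sig" prepends the BOM; newline="" lets the csv module control line endings.
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        csv.writer(handle).writerows(rows)


# Placeholder rows, only to show the call.
demo_path = os.path.join(tempfile.gettempdir(), "sim_demo.csv")
write_csv_for_excel(demo_path, [["author", "similarity"], ["北京大学", 0.87]])
```

If the export goes through pandas instead, passing `encoding="utf-8-sig"` to `DataFrame.to_csv` has the same effect.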

# -*- coding:utf-8 -*-
import os
import re
from gensim.corpora import Dictionary
from gensim.models import AuthorTopicModel
from gensim.models import atmodel
from pprint import pprint
from sklearn.manifold import TSNE
# from bokeh.io import output_notebook
from bokeh.models import HoverTool
from bokeh.plotting import figure, output_file, show, ColumnDataSource
from gensim import matutils
import pandas as pd
from pandas import DataFrame
import xlrd
import openpyxl
from xlutils.copy import copy
import csv
