pyndri_dictionary

Latest recommended article published on 2021-04-06 19:00:25

小饼干超人

Category: pyndri

Article link: https://blog.csdn.net/m0_37586991/article/details/89648097

Counting term frequencies in a document collection

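import pyndri

# Path to the Indri index built over the Robust2004 collection.
index_path = '../../Dataset/Robust2004/robust2004_idx'

with pyndri.open(index_path) as index:
    # Vocabulary mappings plus the document frequency of each term.
    token2id, id2token, id2df = index.get_dictionary()
    # Collection-wide term frequencies, keyed by term ID.
    id2tf = index.get_term_frequencies()

    print('Index contains %d documents.' % index.document_count())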
>>>
Index contains 528155 documents.

token2id, id2token, id2df, and id2tf are all dictionaries, and each of the four has 845146 entries.

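token2id        term -> term ID
id2token        term ID -> term
id2df           term ID -> document frequency (number of documents the term appears in)
id2tf           term ID -> term frequency (total occurrences across the collection)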
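
The four mappings compose naturally: to get from a term string to its statistics, resolve the ID through token2id first, then index id2df and id2tf with it. The sketch below illustrates this with small hand-built dictionaries that copy a few of the sample entries shown in this post. `term_stats` is a hypothetical helper, not part of pyndri; with a real index the dictionaries would come from `index.get_dictionary()` and `index.get_term_frequencies()`.

```python
# Toy stand-ins for the dictionaries returned by pyndri. The values are
# copied from the sample output in this post; a real index has far more entries.
token2id = {'the': 1, 'of': 2, 'to': 3}
id2token = {1: 'the', 2: 'of', 3: 'to'}
id2df = {1: 513196, 2: 505628, 3: 494494}
id2tf = {1: 1679390, 2: 8079187, 3: 6570031}


def term_stats(term, token2id, id2df, id2tf):
    """Return (document frequency, term frequency) for a term string.

    Hypothetical helper: returns None when the term is not in the vocabulary.
    """
    term_id = token2id.get(term)
    if term_id is None:
        return None
    return id2df[term_id], id2tf[term_id]


print(term_stats('of', token2id, id2df, id2tf))   # (505628, 8079187)
print(term_stats('zzz', token2id, id2df, id2tf))  # None
```

Using `dict.get` for the first lookup keeps out-of-vocabulary terms from raising a KeyError.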

print(token2id)
>>> 
{'the':1,'of':2,'to':3,'and':4,'in':5,...}

print(id2token) 
>>>
{1:'the',2:'of',3:'to',4:'and',5:'in',...}

print(id2df)
>>>
{1:513196,2:505628,3:494494,4:491889,5:494175,...}

print(id2tf)
>>>
{1:1679390,2:8079187,3:6570031,4:5991150,5:5216661,...}
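
Because id2tf maps term IDs to collection frequencies, ranking the vocabulary by frequency comes down to selecting the largest items. A minimal sketch, assuming small stand-in dictionaries that hold the five entries printed above; `top_terms` is a hypothetical helper name, not a pyndri function.

```python
import heapq

# Stand-in data: the first five entries from the sample output in this post.
id2token = {1: 'the', 2: 'of', 3: 'to', 4: 'and', 5: 'in'}
id2tf = {1: 1679390, 2: 8079187, 3: 6570031, 4: 5991150, 5: 5216661}


def top_terms(id2tf, id2token, k):
    """Return the k (term, frequency) pairs with the highest term frequency."""
    best = heapq.nlargest(k, id2tf.items(), key=lambda item: item[1])
    return [(id2token[term_id], tf) for term_id, tf in best]


for term, tf in top_terms(id2tf, id2token, 3):
    print(term, tf)
```

`heapq.nlargest` avoids sorting the whole dictionary, which matters when it holds 845146 entries and k is small.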

print(len(token2id))
>>>
845146

index.document_count() returns the total number of documents in the collection.

>>> index.document_count()
528155
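
The document count is often combined with id2df to weight terms, for example as an inverse document frequency. The sketch below uses the plain formulation idf = log(N / df); this is one common variant and an assumption here, not something the original example computes. The numbers are the sample values shown in this post, and `idf` is a hypothetical helper.

```python
import math

document_count = 528155          # value reported by index.document_count()
id2df = {1: 513196, 2: 505628}   # sample entries from id2df


def idf(term_id, id2df, document_count):
    """Plain inverse document frequency: log(N / df)."""
    return math.log(document_count / id2df[term_id])


for term_id in id2df:
    print(term_id, round(idf(term_id, id2df, document_count), 4))
```

Terms that occur in nearly every document, like the two above, end up with a weight close to zero.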

Reference:
pyndri/examples/dictionary.py
