tfidf算法 python_Python和tfidf算法,使其更快?

I am implementing the tf-idf algorithm in a web application using Python, however it runs extremely slow. What I basically do is:

1) Create 2 dictionaries:

First dictionary: key (document id), value (list of all found words (incl. repeated) in doc)

Second dictionary; key (document id), value (set containing unique words of the doc)

Now, there is a petition of a user to get tfidf results of document d. What I do is:

2) Loop over the unique words of the second dictionary for the document d, and for each unique word w get:

2.1) tf score (how many times w appears in d: loop over the the list of words of the first dictionary for the document)

2.2) df score (how many docs contain w: looping over the set of words of all documents (second dictionary) and check if w is contained). I am using a set since it seems to be faster for checking if a set contains a word compared to a list.

Step 2.2 is terribly slow. For example, having 1000 documents, and for a document with 2313 unique words, it takes around 5 minutes to output the results.

Is there any other way to make step 2.2 faster? Are dictionaries that slow for iterating?

解决方案

Well, you have to re-think and re-design somehow the way you hold your data, or in other words, implement an "orthodox" version of your "inverted index".

Your bottleneck is the "on-the-fly" calculation of the document frequency (DF) for the terms. It would be a clever idea for this to be dynamic, so every time you update your corpus (collection of documents), do some processing and update the DFs for every term in a document (and of course, save the results in a persistent way, aka a database etc..) .

The only structure you need is a nested dictionary like that

{ "term1" : { "DF" : x, "some_doc_id" : tf , "some_other_doc_id" : tf, etc } ,

"term2" : ...

etc..

}

properly updated every time you "feed" your corpus.

And of course, keep somewhere your corpus cardinality...

As a hobby and part of my work, I am implementing a python - redis backed small search engine. You might get some other ideas as well. Take a look here.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值