tfidf算法 python_Python和tfidf算法，使其更快？-CSDN博客

本文链接：https://blog.csdn.net/weixin_39591916/article/details/111850227

I am implementing the tf-idf algorithm in a web application using Python, however it runs extremely slow. What I basically do is:

1) Create 2 dictionaries:

First dictionary: key (document id), value (list of all found words (incl. repeated) in doc)

Second dictionary; key (document id), value (set containing unique words of the doc)

Now, there is a petition of a user to get tfidf results of document d. What I do is:

2) Loop over the unique words of the second dictionary for the document d, and for each unique word w get:

2.1) tf score (how many times w appears in d: loop over the the list of words of the first dictionary for the document)

2.2) df score (how many docs contain w: looping over the set of words of all documents (second dictionary) and check if w is contained). I am using a set since it seems to be faster for checking if a set contains a word compared to a list.

Step 2.2 is terribly slow. For example, having 1000 documents, and for a document with 2313 unique words, it takes around 5 minutes to output the results.

Is there any other way to make step 2.2 faster? Are dictionaries that slow for iterating?

解决方案

Well, you have to re-think and re-design somehow the way you hold your data, or in other words, implement an "orthodox" version of your "inverted index".

Your bottleneck is the "on-the-fly" calculation of the document frequency (DF) for the terms. It would be a clever idea for this to be dynamic, so every time you update your corpus (collection of documents), do some processing and update the DFs for every term in a document (and of course, save the results in a persistent way, aka a database etc..) .

The only structure you need is a nested dictionary like that

{ "term1" : { "DF" : x, "some_doc_id" : tf , "some_other_doc_id" : tf, etc } ,

"term2" : ...

etc..

}

properly updated every time you "feed" your corpus.

And of course, keep somewhere your corpus cardinality...

As a hobby and part of my work, I am implementing a python - redis backed small search engine. You might get some other ideas as well. Take a look here.