TF-IDF Implementation with C++
TF-IDF weighting is widely used in text mining. It measures the importance of a word to a document in a corpus. Recently I was working on music recommendation algorithms, and I found that many papers use TF-IDF to measure the lyric similarity between songs. I searched and did not find a TF-IDF library, so I decided to write one myself.
Basics
The TF-IDF weight is calculated from two components: Term Frequency (TF) and Inverse Document Frequency (IDF). In its standard form, the TF-IDF weight of a term j in document i is defined as

$$w_{ij} = tf_{ij} \times \log\frac{N}{n_j}$$

where tf_ij is the frequency of term j in document i, N is the total number of documents, and n_j is the number of documents that contain term j.
For more details, please refer to TF-IDF Tutorial.
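To make the definition concrete, here is a minimal sketch of the weight computation over documents that have already been split into words. It assumes the common form w_ij = tf_ij * log(N / n_j), with tf_ij taken as the raw count and the natural logarithm as the base; both are assumptions, and other variants exist. The names Document, Weights and computeTfIdf are hypothetical and only serve this illustration.

```cpp
#include <cmath>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type names used only for this sketch.
using Document = std::vector<std::string>;       // one tokenized document
using Weights  = std::map<std::string, double>;  // term -> TF-IDF weight

// Computes w_ij = tf_ij * log(N / n_j) for every term j of every document i.
// Assumptions: tf_ij is the raw count of the term, and log is the natural log.
std::vector<Weights> computeTfIdf(const std::vector<Document>& docs) {
    const double N = static_cast<double>(docs.size());

    // n_j: number of documents that contain term j (counted once per document).
    std::map<std::string, int> docFreq;
    for (const Document& doc : docs) {
        std::set<std::string> unique(doc.begin(), doc.end());
        for (const std::string& term : unique) {
            ++docFreq[term];
        }
    }

    std::vector<Weights> result;
    result.reserve(docs.size());
    for (const Document& doc : docs) {
        // tf_ij: raw count of each term in this document.
        std::map<std::string, int> termFreq;
        for (const std::string& term : doc) {
            ++termFreq[term];
        }

        Weights weights;
        for (const auto& entry : termFreq) {
            const double idf = std::log(N / docFreq.at(entry.first));
            weights[entry.first] = entry.second * idf;
        }
        result.push_back(std::move(weights));
    }
    return result;
}
```

Under this definition, a term that occurs in every document gets a weight of 0, because log(N / N) = 0. For two toy documents {"love", "love", "song"} and {"song", "rain"}, the word "song" therefore gets weight 0 in both, while "love" gets 2 * log(2) in the first.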
Code
Assume we have 25 text files, where each file is one document. The code here computes the TF-IDF weights for these 25 documents.
First, we need to split the contents of each input file into words. Here I use boost::tokenizer, which can split a std::string
b