Scalable Recognition with a Vocabulary Tree
1. Text Retrieval Approach
The text retrieval approach:
(1) Parse an article into words;
(2) Reduce each word to its stem: variants such as “walk”, “walking”, and “walks” all share the stem “walk”;
(3) Exclude words such as “the” and “an”, which are extremely common in articles and contribute almost nothing to text retrieval;
(4) Represent each article as a histogram vector in which each element is the frequency of some word (actually some stem), for example the TF (Term Frequency), where t is the total number of stems and n_i is the number of words in the article that share stem i;
(5) Weight the stems, because different words contribute differently to retrieval, for example by the IDF (Inverse Document Frequency), where w_i is the weight of stem i;
(6) Finally, represent the article as the vector of its weighted stem frequencies.
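The six steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the exact formulation used in the paper: TF is assumed to be n_i divided by the total number of stem occurrences in the article, IDF is assumed to be w_i = log(N / N_i), with N the number of articles and N_i the number of articles containing stem i, and the final vector is assumed to be their product. The stop-word list and the toy `stem` function are hypothetical stand-ins for a real stop list and stemmer.

```python
import math
from collections import Counter

# Hypothetical stop-word list for illustration only (step 3).
STOP_WORDS = {"the", "an", "a"}


def stem(word):
    """Toy suffix stripper (step 2); a real system would use a proper stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def parse(article):
    """Steps 1-3: split on whitespace, drop stop words, reduce words to stems."""
    words = article.lower().split()
    return [stem(w) for w in words if w not in STOP_WORDS]


def term_frequency(stems):
    """Step 4 (assumed form): n_i divided by the total stem count of the article."""
    counts = Counter(stems)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()}


def inverse_document_frequency(docs):
    """Step 5 (assumed form): w_i = log(N / N_i) over the whole collection."""
    n_docs = len(docs)
    doc_counts = Counter(s for stems in docs for s in set(stems))
    return {s: math.log(n_docs / n) for s, n in doc_counts.items()}


def tf_idf(stems, idf):
    """Step 6 (assumed form): weight each stem frequency by w_i."""
    tf = term_frequency(stems)
    return {s: f * idf[s] for s, f in tf.items()}


articles = ["the dog walks", "an owl is walking", "the dog sleeps"]
docs = [parse(a) for a in articles]
idf = inverse_document_frequency(docs)
vectors = [tf_idf(d, idf) for d in docs]
```

Each entry of `vectors` is a sparse dictionary mapping a stem to its weighted frequency. Stems that occur in every article receive weight log(1) = 0 under this assumed IDF, which mirrors the motivation for excluding very common words in step (3).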