BM25 vs Lucene Default Similarity Comparing Precision and Recall
In the previous article I looked into the difference between BM25 and the tf-idf similarity as they are defined. The default similarity in Lucene and hence also Elasticsearch is however not a pure tf-idf implementation. It does actually have a form of document length normalization too. Thanks to1 for mentioning this! This makes it a little more difficult to compare it with the solutions described in academic papers and the conclusion that BM25 is mostly superior to tf-idf might not be applicable to the tf-idf implemation in Lucene. To investigate this issue I will do a comparison of precision and recall using Wikipedia articles.
Using the Wikipedia river indexing the articles is pretty straight forward, but if you plan on indexing the entire dataset there are a few gotchas. Due to the sheer size of the dataset just downloading the data takes a few hours and you will likely have to leave it working over night. At my first attempt I used the default image, but this of course was updated to a new image midway in the download and it failed. The solution was simple, use a specific image. This also had the added benefit of making it easier to reproduce the exact same documents for indexing them with a different similarity model. Should you want to do something similar, I recommend you start querying right away as the river does not overload your cluster and it’s better to figure out that you should have changed the mappings or some other index setting before having wasted all that time waiting.
The river settings:
Index settings with default similarity:
Index settings BM25 similarity:
As you can see changing the similarity for a specific field is as simple as changing the mapping.
For this test I used Elasticsearch 0.90.7 and elasticsearch-river-wikipedia 1.3.0.
I started out experimenting with a basic match query, adressing all fields and comparing the results I got with what I could find in the search box on wikipedia.org. Two things became clear, the search on wikipedia.org mainly tries to match your keywords with titles and the document collection also has many documents that simply represent alternate titles and a redirect link to another document. Considering Wikipedia’s encyclopedic style, the matching of search queries with titles is a natural corollary of the strong aboutness that the title has for such articles. The redirect titles are a good source of alternate titles and spelling, but the fact that the redirects appear as pages in the document collection is more of an implementation detail regarding mediawiki and when indexed with Elasticsearch, it would better fit as a bundle with the targeted document.
The Wikipedia river actually detects the redirects and marks them with the boolean redirect field making it easy to filter them out. I decided to filter them away as for some documents the redirect document ended up with a better score as it contained the desired terms and where that much shorter. To better get to know the data I also tried some queries on just the titles and it was easy to find the articles I where looking for. If I were to create a new searchbox for Wikipedia with Elasticsearch I would definitely give a boost to a title or redirect title match.
Now to address the real issue at hand. I needed a way to create a query set where each query has a preknown answer. Having witnessed the aboutness that the titles represent for their articles I constructed my test by simply drawing 1000 random articles and using the title as the query keywords and the document text as the answer.
I then wrote a script that executed the below query replacing <title> for each chosen article.
For this test the redirects would ideally not have been indexed at all, but the above filters handles this. The term filter uses the redirect field created by the river and the regexp catches some remaining redirects that where not detected by the river.
The query is limited to the 10 top ranking hits (default in Elasticsearch). Usually similarity models are compared by precision, the percentage of relevant queries within the result, and recall, the percentage of all relevant documents that are included in the result, but the limit of 10 hits and only one relevant document for each query makes these values a little skewed as all queries will either have a recall of zero or one hundred percent and a precision of either zero or ten percent. Instead of the usual academic criteria I will for this non-academic test use percentage of queries that did not find the desired document as an indicator of recall and the average rank of the desired document for those queries who did find it as an indicator of precision.
The result of the script is an indication of precision and recall for both similarities when matching Wikipedia document titles with document text (titles excluded).
|Similarity||Average rank||Documents not found|
Clearly BM25 performed far better than the default similarity for this case, but it is important to keep in mind the 10 hits limit on the result size when interpreting these results. If one where to retrieve more hits, it is likely that the percentage of documents not found would drop for both similarities, but then the average rank would also grow.
By processing the hits of every query until the desired number of hits is found or not one could calculate the actual precision and recall, but this would still only be circumstantial proof as the result would only be in regards to matching titles to articles in the English version of Wikipedia. As this experiment would be more in line with practical search than a formal proof I chose to also use the 10 hit limit as this is about as many hits as I would expect the average user to bother to consider reading before refining his query if the desired document was not found.
In other words the result of this experiment is not a general proof that BM25 is always better than the default similarity, but it does suggest there can be a significant potential in using BM25 over the default similarity, at the very least for some cases. I strongly recommend taking the time to test both, with documents from your own use of Elasticsearch.