数据库的全文检索性能对比

最新推荐文章于 2024-04-20 21:44:04 发布

fxsjy

最新推荐文章于 2024-04-20 21:44:04 发布

阅读量864

点赞数

文章标签：全文检索数据库 dataset database postgresql returning

本文链接：https://blog.csdn.net/fxsjy/article/details/4748632

版权

http://stackoverflow.com/questions/1037991/how-much-more-performant-is-postgres-than-mysql-on-fulltext-search

I ran benchmarks a few years ago on large datasets and found that :

MySQL FULLTEXT

Is pretty slow. Another drawback is that it forces MyISAM on you which brings a lot of problems. Also index updates are quite slow once the index reaches a certain size : when you insert a new row, a substantial part of the index is re-generated, sometimes a few hundred megabytes of index are rewritten just to insert a forum post. In other words, it's OK for a small forum with a few MBytes of posts, but there is a reason Wikipedia doesn't use it...

PostgreSQL fulltext

Is about 10-100x faster than MySQL fulltext, is a lot more powerful, gist is fast on inserts/updates, no problem with locks, in other words it's a totally decent solution.

However searches get slow when the data set is larger than RAM because of MVCC, postgres needs to check the visibility of rows by hitting the heap. Note this may change in a future version. If your query returns 10 rows, no problem. However, if you want to SELECT WHERE (fulltext query) ORDER BY date LIMIT 10 and the fulltext matches 10.000 rows, it can get pretty slow. Still faster than MySQL but not the performance you'd want.

Xapian : I tested this, there are also Lucene and Sphinx which have good reputation.

Xapian does not have to conform to the same restrictions as a database, so it can make a lot more opimizations. For instance, it's a single-writer multiple-reader concurrency model, so you'll need some sort of update queue to update your index in the background. It also has its own on-disk format. The result is that it is incredibly fast, even when the dataset is much larger than RAM, and especially on complicated queries matching lots of rows, with sorts, and returning only the most relevant ones.

The index is huge too, it probably contains lots of duplicated stuff. The consequence is that it doesn't need to seek to retrieve the stuff.

Basically once Postgres started to hit the IO-seek wall, MySQL was long dead, and Xapian kept blazing fast.

But it is not as nicely integrated in the database, so it is more work to use. It is only worth it if you have a huge dataset. If this is your case, try it, it's amazing. If your dataset fits in RAM, postgres will just work with a lot less hassle. Also if you want to combine fulltext and database queries, well, integration becomes important.