This blog is a result of lessons learned while working with the new Integrated Full Text Search in SQL Server 2008. It is not intended to be an in-depth tutorial on how to implement it. Instead, it will concentrate on best practices. The basics of how the new full text search feature works is found in Books On Line.
If you would like a good introductory tutorial, start with SQL 2008 Books Online, then read this whitepaper: http://msdn.microsoft.com/en-us/library/cc721269(SQL.100).aspx. It is one of the best whitepapers I’ve seen recently on a product feature.
Also for more introductory lessons here is an excellent link: http://sqlblogcasts.com/blogs/simons/archive/2008/02/19/SQL-Server-2008---iFTS-Introduction.aspx
Simon Sabin also has a nice series of posts on the new iFTS features: http://www.sqlskills.com/blogs/simon/2008/02/20/SQLServer2008IFTSNewFeatures.aspx
The new iFTS feature is not rocket science, but you should expect to spend some time learning about the new vocabulary, like Fragments, Master Merge, etc. You will also spend some time learning the new DMVs (like sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document) .
Performance
Here is what we noticed about the performance of iFTS in a production environment.
The full text searches and queries are very fast if you can keep the full text indexes up to date. There are two choices – automatic or manual index updates. Most applications should be able to run with automatic tracking turned on. The higher volume transaction systems may start to experience some blocking on inserts and updates at times. The blocks are short and do not last more than a few milliseconds most of the time. However, if your transaction volume is high enough where the blocking starts to impact the user experience, then turn off automatic updates and run manual merges. A merge will take the word fragments and merge them with the master fragment.
To get the best performance from your full text index, do the following new iFTS Best Practices:
1. Put the full text index in its own filegroup. This will avoid fragmenting the main data file(s).
2. Use Varchar(max) instead of image or text for the fields that you want to use as the base for your full text index.
3. Turn off auto updates if you see blocking on inserts and updates and do manual merges.
4. Avoid running Master Merges and a rebuild/reorg on another index on the same table at the same time. Make sure a Master Merge job does not overlap with a job that rebuilds a non clustered index on the same table.
Another interesting thing we learned is that if you have to rebuild the entire full text index for any reason, the full text queries will still run. You will notice that they may be very fast at first. This is because a full rebuild will drop the index and all the fragments, so the queries will be running against an empty index at first. They will continue to run as the index is built. The good thing is that you should rarely, if ever, have to do a full rebuild of the index. Instead, any major work should be done with ALTER INDEX START FULL POPULATION. This not a perfect solution since it does UPDATE statements to the index, which does its normal exclusive lock. It does the update in chunks which means that the locks are only briefly held. However, you will experience more blocking than normal for the full text queries while the update is running.
iFilters
Different document types need iFilters installed. Most common document types have available iFilters. The one exception at the time of this writing is the 64 bit PDF filter. There is one available but it has bugs that cause aborts. Hopefully Adobe can deliver the fixes soon.
Limitations
As good as the performance is on the new iFTS feature, there are still some limitations that will cause concern for some projects. First, the table must have a single column primary key. Second, the SWITCH command for partitions does not work. Third, there are no good scale out solutions with iFTS. Multiple indexes cannot appear as one and iFTS does not work with distributed partitioned views. We are working on improving these limitations for the next version of SQL Server. If you need a higher volume solution, ask your local Microsoft contact about FAST.
One other item you will be interested in is that the full text index size in SQL 2008 is approximately 50% bigger than the full text indexes in SQL 2005. This is generally not a problem, just a fact that you should know.
Handling multiple languages in a single document is a hard problem. Which word breaker do you use to shred the original document, and which language are you going to specify for the query? For example, if you have a document with Korean and English and you use the Korean word breaker to process the document, then if you search the document for English words it will only find the exact words and not any other forms of the words (like ing and s).
Memory Settings
The new full text daemon (FDHost) runs as a windows program and is not part of the SQL Server process, which means that they will compete for memory. This is normally not a concern because SQL Server generally plays well with other processes running on the same server. However, we noticed that if you have very large documents, like greater than 100MB, or even hundreds of 2MB documents being parsed at the same time, then you may want to make some configuration changes.
There are two options in this case:
1. Sp_fulltext_Service can be used to set the block size. Try increasing the block size to 512k.
2. Use sp_configure to set the maximum memory for SQL Server down a bit. How much down will depend on how many documents you intend to process simultaneously.
Other notes about FDHost
· The Katmai FDhost runs outside of SQL server memory space. There are a lot of similarities between Yukon MSFTEFD and Katmai FDHost processes.
· The Filter daemon manager (runs within SQL process) manages the FDHost process and it will launch one FDHost process for single thread and one for multi-thread FDHost processing. These two do not necessarily start at the same time.
o Single or multithreaded fdhost depends entirely of the iFilter used: if the document filter supports multithreading, then we load it into a multithreaded ready fdhost, otherwise not.
o Named Pipes must be enabled on the SQL server box because Filter daemon manager uses named pipes to communicate with FDHost.
· FDHost processes are not shared across multiple Katmai iFTS instances on the same server.
· During the full population of the FT indexes, the FDHost process may need a lot of memory. It’s a good practice to:
o Schedule the full FT index population off business hours when the SQL server is not being used by other processes or queries.
o You may need to manually reduce the SQL Server max memory to give more memory to FDHost as noted above. However, if SQL Server is not being heavily used, FDHost will get the memory left on the box.
· To improve the performance between FDHost manager and FDHost, users can change some configuration settings (such as ISM – internal shared memory, batch size, etc.) to improve the consumption process of the chunks of data sent to the FDHost.
Miscellaneous notes:
· The New Database dialog box in Management Studio has an option grayed out. Just below the name and owner there is a grayed out check box. In the released version of SQL Server 2008 the full text options are on by default. This was left in place in case any customers had references to it in scripts.
· iFTS support for Danish, Polish and Turkish is not on by default. These word breakers are delivered with SQL Server. Since they are written by 3rd party companies it was decided to leave them off by default.
Upgrade
If you are upgrading from SQL 2005, then you have a decision to make on whether to just upgrade the full text index or start over and rebuild it. The best practice recommendation is that if there is a new word breaker available for your language, then you should completely rebuild the full text index after upgrading, if you want to use the new word breakers and stemmers. http://msdn.microsoft.com/en-us/library/cc721269(SQL.100).aspx has a list of which word breakers are exactly the same and which are new.
Kevin Cox and Xiaoyu Li