Arthur Fuller discusses a very slick alternative to using string indexes known as hash buckets or hash keys. Find out how using hash keys can produce a gain in performance and save you disk space.
A common SQL Server application requirement is to support searches based on one or more long strings. (In this article, we refer to anything with more than 20 characters as a long string.) Suppose that your front-end application wants to allow the user to supply two long strings; you fire a stored procedure that performs a search on these two strings, and then you optimize the search by supplying an index on the two relevant columns. In a small table, you might not notice the impact; however, if the table contains 50 million rows, it will impact storage requirements and search performance.
A great way to avoid the hit in disk space or performance is to use a very slick alternative to using string indexes known as hash keys (which refers to one individual hash) or hash buckets (which are a collection of hash keys).
What is a hash?
A hash is the integer result of an algorithm applied to a given string. There are numerous hashing algorithms, but the one that is used most often is the built-in SQL function Checksum(). In general, you feed said function a string, and it returns an integer (although there's no guarantee the integer will be unique, particularly in large tables).
Hash tables in database design
Suppose that we have these columns in the table of interest:
A multi-column index on both of these columns would consume 50 + 50 characters per row, plus the overhead—given 50 million rows, this amounts to a sizable problem.
A hash key based on these two columns is vastly smaller, i.e., four bytes per row. It's even better if we don't store the hash keys except in the index on this column. Instead, we should create a calculated column whose formula is the hash key of these two columns, and then index that column and disregard the index on the pair of strings.
The user (whether it's a human or an application) queries the values of interest; then we convert the parameters into a hash key and search the hash index. The set of duplicates will be dramatically smaller than the set of rows the engine must visit to do an exact match on the query values. We then combine the search for the hash key with a comparison of the two columns of interest, isolating a small subset of rows, and then examining the two columns for matches. A search based on an integer column is dramatically faster than a search based on a lengthy string key, and more so if it is a compound key.
Hash key algorithms using the Checksum function
Try running this sample code, which demonstrates how you obtain the hash key for any given value or combination of values:
SELECT Name, GroupName, Checksum(Name,GroupName)AS HashKey
ORDER BY HashKey
This results in the rows in the following table (which are clipped to 10 for brevity).
In a live environment, you would create a calculated column, which we'll call Name_GroupName_hb. Assuming the front-end passes in the target values for Name and GroupName, here is how you might approach this problem:
CREATE PROCEDURE DemoHash
-- USE AdventureWorks
DECLARE @id as int SET @id = Checksum(@Name,@GroupName)
SELECT * FROM Adventureworks.HumanResources.Department
WHERE HashKey = @id
AND Name = @Name
AND GroupName = @GroupName
Let's imagine that in a 50-million row table we get back 100 rows with the specified hash key. Since there are no other indexes available on these two columns, the query optimizer uses the hash bucket index. This quickly isolates the 100 rows of interest. Then we visit these rows, examining their Name and GroupName columns for exact matches. We increase performance significantly and, at the same time, save vast amounts of disk space.
The example assumes that the search targets exist in a single table. Given a search that selects targets from multiple tables, we can apply the same technique. All we have to do is create a table function that joins the tables, and then create an index that hashes the columns from the various tables.
On relatively small tables, creating an indexed hash bucket may not result in a dramatic performance increase, but it will save you disk space. This tip will be most beneficial if you're working with larger tables.
Arthur Fuller has been developing database applications for more than 20 years. He frequently works with Access ADPs, Microsoft SQL 2000, MySQL, and .NET.