Boost performance and use less disk space with hash keys

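Using hash buckets instead of string indexes can significantly improve search performance on large databases and save disk space. Converting long strings into hash keys shrinks the index and takes advantage of faster searches on integer columns.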


by Arthur Fuller | Published: 3/6/06

Takeaway:
Arthur Fuller discusses a very slick alternative to using string indexes known as hash buckets or hash keys. Find out how using hash keys can produce a gain in performance and save you disk space.

A common SQL Server application requirement is to support searches based on one or more long strings. (In this article, we refer to anything with more than 20 characters as a long string.) Suppose that your front-end application wants to allow the user to supply two long strings; you fire a stored procedure that performs a search on these two strings, and then you optimize the search by supplying an index on the two relevant columns. In a small table, you might not notice the impact; however, if the table contains 50 million rows, it will impact storage requirements and search performance.

A great way to avoid the hit in disk space and performance is a slick alternative to string indexes: hash keys (a hash key is one individual hash) or hash buckets (a hash bucket is a collection of hash keys).

What is a hash?

A hash is the integer result of an algorithm applied to a given string. There are numerous hashing algorithms, but the one used most often in SQL Server is the built-in function Checksum(). You feed the function a string, and it returns an integer. There is no guarantee that the integer will be unique, particularly in large tables, so different strings can share the same hash.
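As a quick illustration, the query below calls Checksum() on a single literal and on a pair of values taken from the sample data later in this article. The column aliases are made up for this sketch, and the integers returned can vary with data type and collation, so none are shown here.

```sql
-- Checksum() accepts one or more expressions and returns an int.
SELECT
    Checksum('Tool Design') AS SingleValueHash,
    Checksum('Tool Design', 'Research and Development') AS TwoValueHash
```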

Hash tables in database design

Suppose that the table of interest has two string columns, Name and GroupName, each declared as Varchar(50).

A multi-column index on both of these columns would consume 50 + 50 characters per row, plus the overhead—given 50 million rows, this amounts to a sizable problem.
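To put rough numbers on this, the following back-of-the-envelope query compares the key bytes alone for 50 million rows: the worst case of two full 50-character values against a 4-byte int such as Checksum() returns. It ignores row overhead, row locators, and page fill, so treat the results as orders of magnitude only. The variable and alias names are made up for this sketch.

```sql
-- Key bytes only; real index sizes will be larger.
DECLARE @Rows bigint
SET @Rows = 50000000

SELECT
    @Rows * (50 + 50) / 1048576 AS StringKeyMB,  -- worst case: both string columns full
    @Rows * 4 / 1048576 AS HashKeyMB             -- one 4-byte int per row
```

With integer division this works out to roughly 4,768 MB of string key data against roughly 190 MB of hash keys.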

A hash key based on these two columns is vastly smaller, i.e., four bytes per row. It's even better if we don't store the hash keys except in the index on this column. Instead, we should create a calculated column whose formula is the hash key of these two columns, and then index that column and disregard the index on the pair of strings.
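A minimal sketch of that setup, assuming the HumanResources.Department table from the AdventureWorks sample database used later in this article. The column name Name_GroupName_hb is the one the article introduces further down; the index name IX_Department_Name_GroupName_hb is hypothetical.

```sql
USE AdventureWorks
GO  -- GO is a batch separator understood by client tools such as sqlcmd, not a T-SQL statement

-- Computed column: not persisted, so the hash values are stored only in the index.
ALTER TABLE HumanResources.Department
    ADD Name_GroupName_hb AS Checksum(Name, GroupName)
GO

-- Nonclustered index on the 4-byte hash instead of on the two strings.
CREATE INDEX IX_Department_Name_GroupName_hb
    ON HumanResources.Department (Name_GroupName_hb)
GO
```

SQL Server requires certain SET options to be in effect when indexing a computed column, so the session settings may need checking if the CREATE INDEX statement is rejected.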

The user (whether a human or an application) supplies the values of interest; we convert those parameters into a hash key and search the hash index. The set of rows sharing that hash key will be dramatically smaller than the set of rows the engine would otherwise have to visit to do an exact match on the query values. We therefore combine the hash key search with a comparison of the two columns of interest: the hash index isolates a small subset of rows, and only those rows are examined for exact matches. A search on an integer column is typically much faster than a search on a lengthy string key, and more so if it is a compound key.

Hash key algorithms using the Checksum function

Try running this sample code, which demonstrates how you obtain the hash key for any given value or combination of values:

USE AdventureWorks

SELECT Name, GroupName, Checksum(Name, GroupName) AS HashKey
FROM AdventureWorks.HumanResources.Department
ORDER BY HashKey

This results in the rows in the following table (which are clipped to 10 for brevity).

Name	GroupName	HashKey
Tool Design	Research and Development	-2142514043
Production	Manufacturing	-2110292704
Shipping and Receiving	Inventory Management	-1405505115
Purchasing	Inventory Management	-1264922199
Document Control	Quality Assurance	-922796840
Information Services	Executive General and Administration	-904518583
Quality Assurance	Quality Assurance	-846578145
Sales	Sales and Marketing	-493399545
Production Control	Manufacturing	-216183716
Marketing	Sales and Marketing	-150901473

In a live environment, you would create a calculated column, which we'll call Name_GroupName_hb. Assuming the front-end passes in the target values for Name and GroupName, here is how you might approach this problem:

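CREATE PROCEDURE DemoHash
(
  @Name Varchar(50),
  @GroupName Varchar(50)
)
AS
-- Parameter types should match the column types so that Checksum()
-- produces the same value here as it does in the computed column.
DECLARE @id int
SET @id = Checksum(@Name, @GroupName)

-- The integer predicate narrows the search through the hash index;
-- the string comparisons then discard any rows that merely share the hash.
SELECT * FROM AdventureWorks.HumanResources.Department
WHERE Name_GroupName_hb = @id
    AND Name = @Name
    AND GroupName = @GroupName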
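Calling the procedure might look like this, assuming it was created in the current database and that the hash column it filters on, together with its index, already exists. The argument values are taken from the sample result table above.

```sql
-- Look up one department by its two string values.
EXEC DemoHash
    @Name = 'Tool Design',
    @GroupName = 'Research and Development'
```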

Let's imagine that in a 50-million row table we get back 100 rows with the specified hash key. Since there are no other indexes available on these two columns, the query optimizer uses the hash bucket index. This quickly isolates the 100 rows of interest. Then we visit these rows, examining their Name and GroupName columns for exact matches. We increase performance significantly and, at the same time, save vast amounts of disk space.

The example assumes that the search targets exist in a single table. The same technique applies to a search that selects targets from multiple tables. One way to do this in SQL Server is an indexed view: a schema-bound view that joins the tables and exposes a Checksum() column over the columns of interest, which can then be indexed.

Conclusion

On relatively small tables, creating an indexed hash bucket may not result in a dramatic performance increase, but it will save you disk space. This tip will be most beneficial if you're working with larger tables.

Arthur Fuller has been developing database applications for more than 20 years. He frequently works with Access ADPs, Microsoft SQL 2000, MySQL, and .NET.