Boost performance and use less disk space with hash keys

转载 2006年05月22日 13:20:00

by  Arthur Fuller  |  More from Arthur Fuller  |  Published: 3/6/06

Arthur Fuller discusses a very slick alternative to using string indexes known as hash buckets or hash keys. Find out how using hash keys can produce a gain in performance and save you disk space.

A common SQL Server application requirement is to support searches based on one or more long strings. (In this article, we refer to anything with more than 20 characters as a long string.) Suppose that your front-end application wants to allow the user to supply two long strings; you fire a stored procedure that performs a search on these two strings, and then you optimize the search by supplying an index on the two relevant columns. In a small table, you might not notice the impact; however, if the table contains 50 million rows, it will impact storage requirements and search performance.

A great way to avoid the hit in disk space or performance is to use a very slick alternative to using string indexes known as hash keys (which refers to one individual hash) or hash buckets (which are a collection of hash keys).

What is a hash?

A hash is the integer result of an algorithm applied to a given string. There are numerous hashing algorithms, but the one that is used most often is the built-in SQL function Checksum(). In general, you feed said function a string, and it returns an integer (although there's no guarantee the integer will be unique, particularly in large tables).

Hash tables in database design

Suppose that we have these columns in the table of interest:


A multi-column index on both of these columns would consume 50 + 50 characters per row, plus the overhead—given 50 million rows, this amounts to a sizable problem.

A hash key based on these two columns is vastly smaller, i.e., four bytes per row. It's even better if we don't store the hash keys except in the index on this column. Instead, we should create a calculated column whose formula is the hash key of these two columns, and then index that column and disregard the index on the pair of strings.

The user (whether it's a human or an application) queries the values of interest; then we convert the parameters into a hash key and search the hash index. The set of duplicates will be dramatically smaller than the set of rows the engine must visit to do an exact match on the query values. We then combine the search for the hash key with a comparison of the two columns of interest, isolating a small subset of rows, and then examining the two columns for matches. A search based on an integer column is dramatically faster than a search based on a lengthy string key, and more so if it is a compound key.

Hash key algorithms using the Checksum function

Try running this sample code, which demonstrates how you obtain the hash key for any given value or combination of values:

USE AdventureWorks
SELECT Name, GroupName, Checksum(Name,GroupName)AS HashKey
FROM Adventureworks.HumanResources.Department

This results in the rows in the following table (which are clipped to 10 for brevity).

Name GroupName HashKey
Tool Design Research and Development
Production Manufacturing
Shipping and Receiving Inventory Management
Purchasing Inventory Management
Document Control Quality Assurance
Information Services Executive General and Administration
Quality Assurance Quality Assurance
Sales Sales and Marketing
Production Control Manufacturing
Marketing Sales and Marketing

In a live environment, you would create a calculated column, which we'll call Name_GroupName_hb. Assuming the front-end passes in the target values for Name and GroupName, here is how you might approach this problem:

  @Name Varchar(50),
  @GroupName Varchar(50)
-- USE AdventureWorks
DECLARE @id as int SET @id = Checksum(@Name,@GroupName)
SELECT * FROM Adventureworks.HumanResources.Department
WHERE HashKey = @id
    AND Name = @Name
    AND GroupName = @GroupName

Let's imagine that in a 50-million row table we get back 100 rows with the specified hash key. Since there are no other indexes available on these two columns, the query optimizer uses the hash bucket index. This quickly isolates the 100 rows of interest. Then we visit these rows, examining their Name and GroupName columns for exact matches. We increase performance significantly and, at the same time, save vast amounts of disk space.

The example assumes that the search targets exist in a single table. Given a search that selects targets from multiple tables, we can apply the same technique. All we have to do is create a table function that joins the tables, and then create an index that hashes the columns from the various tables.


On relatively small tables, creating an indexed hash bucket may not result in a dramatic performance increase, but it will save you disk space. This tip will be most beneficial if you're working with larger tables.

Arthur Fuller has been developing database applications for more than 20 years. He frequently works with Access ADPs, Microsoft SQL 2000, MySQL, and .NET.


Performance of Using Keys in SELECT with FOR ALL ENTRIES

You would get insight from Performance perspective when not using the Keys in SELECT with FOR ALL EN...

14.10 InnoDB Disk IO and File Space Management

MYSQL5.6官方文档 14.10 InnoDB Disk I/O and File Space Management 中文只是我个人阅读时的理解,歧义的地方请以英文内容为准。 word版连...

Disable Keyboard Shortcuts and Combination Keys with C# (3): Disable Ctrl + Alt + Del

In this article how to disable 'Ctrl + Alt + Del' and 'Ctrl + Shift + Esc' combination keys with C# ...

Hadoop Streaming Made Simple using Joins and Keys with Python

There are a lot of different ways to write MapReduce jobs!!! Sample code for this post https://gi...

Disable Keyboard Shortcuts and Combination Keys with C# (2): Disable Win + L

In this article how to disable 'Win + L' combination keys with C# programming is introduced.

How to use boost::property_tree to load and write JSON

Boost's Property Tree Property Tree is a sublibrary of boost that allow you handling tree of propert...

Linux Performance Monitoring with Vmstat and Iostat Commands


Apache/Tomcat with Failover and Load Balancing in 20 minutes or less…

In order to get this done, you’ll need Apache, Tomcat, and the MOD_JK connector library. I’ve includ...

play with bootstrap and less


Difference Between Performance Testing, Load Testing and Stress Testing – With Examples

1) Performance Testing: Performance testing is the testing, which is performed, to ascertain how th...