Boost performance and use less disk space with hash keys

转载 2006年05月22日 13:20:00

by  Arthur Fuller  |  More from Arthur Fuller  |  Published: 3/6/06

Arthur Fuller discusses a very slick alternative to using string indexes known as hash buckets or hash keys. Find out how using hash keys can produce a gain in performance and save you disk space.

A common SQL Server application requirement is to support searches based on one or more long strings. (In this article, we refer to anything with more than 20 characters as a long string.) Suppose that your front-end application wants to allow the user to supply two long strings; you fire a stored procedure that performs a search on these two strings, and then you optimize the search by supplying an index on the two relevant columns. In a small table, you might not notice the impact; however, if the table contains 50 million rows, it will impact storage requirements and search performance.

A great way to avoid the hit in disk space or performance is to use a very slick alternative to using string indexes known as hash keys (which refers to one individual hash) or hash buckets (which are a collection of hash keys).

What is a hash?

A hash is the integer result of an algorithm applied to a given string. There are numerous hashing algorithms, but the one that is used most often is the built-in SQL function Checksum(). In general, you feed said function a string, and it returns an integer (although there's no guarantee the integer will be unique, particularly in large tables).

Hash tables in database design

Suppose that we have these columns in the table of interest:


A multi-column index on both of these columns would consume 50 + 50 characters per row, plus the overhead—given 50 million rows, this amounts to a sizable problem.

A hash key based on these two columns is vastly smaller, i.e., four bytes per row. It's even better if we don't store the hash keys except in the index on this column. Instead, we should create a calculated column whose formula is the hash key of these two columns, and then index that column and disregard the index on the pair of strings.

The user (whether it's a human or an application) queries the values of interest; then we convert the parameters into a hash key and search the hash index. The set of duplicates will be dramatically smaller than the set of rows the engine must visit to do an exact match on the query values. We then combine the search for the hash key with a comparison of the two columns of interest, isolating a small subset of rows, and then examining the two columns for matches. A search based on an integer column is dramatically faster than a search based on a lengthy string key, and more so if it is a compound key.

Hash key algorithms using the Checksum function

Try running this sample code, which demonstrates how you obtain the hash key for any given value or combination of values:

USE AdventureWorks
SELECT Name, GroupName, Checksum(Name,GroupName)AS HashKey
FROM Adventureworks.HumanResources.Department

This results in the rows in the following table (which are clipped to 10 for brevity).

Name GroupName HashKey
Tool Design Research and Development
Production Manufacturing
Shipping and Receiving Inventory Management
Purchasing Inventory Management
Document Control Quality Assurance
Information Services Executive General and Administration
Quality Assurance Quality Assurance
Sales Sales and Marketing
Production Control Manufacturing
Marketing Sales and Marketing

In a live environment, you would create a calculated column, which we'll call Name_GroupName_hb. Assuming the front-end passes in the target values for Name and GroupName, here is how you might approach this problem:

  @Name Varchar(50),
  @GroupName Varchar(50)
-- USE AdventureWorks
DECLARE @id as int SET @id = Checksum(@Name,@GroupName)
SELECT * FROM Adventureworks.HumanResources.Department
WHERE HashKey = @id
    AND Name = @Name
    AND GroupName = @GroupName

Let's imagine that in a 50-million row table we get back 100 rows with the specified hash key. Since there are no other indexes available on these two columns, the query optimizer uses the hash bucket index. This quickly isolates the 100 rows of interest. Then we visit these rows, examining their Name and GroupName columns for exact matches. We increase performance significantly and, at the same time, save vast amounts of disk space.

The example assumes that the search targets exist in a single table. Given a search that selects targets from multiple tables, we can apply the same technique. All we have to do is create a table function that joins the tables, and then create an index that hashes the columns from the various tables.


On relatively small tables, creating an indexed hash bucket may not result in a dramatic performance increase, but it will save you disk space. This tip will be most beneficial if you're working with larger tables.

Arthur Fuller has been developing database applications for more than 20 years. He frequently works with Access ADPs, Microsoft SQL 2000, MySQL, and .NET.


首先创建模板 在模板里创建探索规则 Mounted filesystem discovery vfs.fs.discovery filters 为{#FSTYPE}  matches @File ...
  • linuxlsq
  • linuxlsq
  • 2016年06月26日 10:21
  • 5246

ceph 报警 [ low disk space] 解决

参考报警信息 [root@hh-yun-puppet-129021 ~]# ceph health detail HEALTH_WARN mon.hh-yun-ceph-cinder026-12807...
  • signmem
  • signmem
  • 2016年03月03日 09:32
  • 3422

boost.unordered 基于hash table的四种容器(即hash_map,unordered_mutimap,hash_set,hash_multiset)

boost.unordered  即基于hash table的四种容器(即hash_map,unordered_mutimap,hash_set,hash_multiset) 接口类似于std:ma...
  • u011676589
  • u011676589
  • 2013年08月29日 17:32
  • 2708

Improve performance using hash keys

use CheckSum function to turn a string key to a hash key(integer), this will improve performance
  • EricYeung
  • EricYeung
  • 2008年03月11日 09:40
  • 364

linux下如何查看disk space硬盘应用情况

主要是运用 df 指令,df是disk filesystem的缩写 其后面还可以加很多有用的tag df -h...
  • u012204343
  • u012204343
  • 2014年07月15日 05:00
  • 1358

Ubuntu --- not enough free disk space

Ubuntu系统更新时出现not enough free disk space.   原因是系统的就内核占满了/boot 的空间,只要将旧内核删除就ok了   首先,命令 uname -r  查看当前...
  • u011221820
  • u011221820
  • 2017年02月10日 12:08
  • 806

centos low disk space问题

今天在虚拟机的centos7 上安装harbor时,拷贝文件时出现: 查看后发现是磁盘空间不足的问题,使用df命令查看,发现/dev/mapper/cl-root 的使用率是93%,  ...
  • lucy06
  • lucy06
  • 2017年12月21日 13:19
  • 48

MyEclipse has detected that less than 5% of the 29MB of PS Survivor Space (Heap memory) space remain

问题: 根据提示,打开myeclipse安装路径下的myeclipse.ini,原文件内容如图: 按照提示修改后的myeclipse.ini如下图:...
  • yangyao_iphone
  • yangyao_iphone
  • 2015年03月16日 11:43
  • 1462

Ubuntu 解决更新软件包的时候出现的 “Low Disk Space” 存储空间不足问题

  • github_35160620
  • github_35160620
  • 2017年01月19日 17:25
  • 3294


  • sharpbladepan
  • sharpbladepan
  • 2006年02月12日 22:42
  • 7110
您举报文章:Boost performance and use less disk space with hash keys