查找是我们现实生活中经常需要做的事情。例如用字典查一个英文单词的释意,首先我们会定位到这个单词所在的页,然后再看与该单词关联的解释。这个过程中,单词是key,而单词+单词的释意则是一个record。信息的组织方式很大程度上影响了查找的方法,《编程珠玑》和《数据结构与算法分析》一样,介绍了针对两种信息组织方式的不同的查找方法。
首先是链表/数组的线性组织方式。在这种情况下,如果数据是排序的,那么二分查找显然是最高效的了。下面是我写的两种二分查找:
//假如有n个value,该查找将返回任意一个
int bSearch( int* array, int n, int value ){
int l = 0, r = n-1;
int m;
while( l<=r ){
m = (l+r)/2;
if( *(array+m)<value ) l=m+1;
else{
if( *(array+m) == value ) return m;
else r=m-1;
}
}
return -1;
}
//该查找将返回第一个值为value的key
int bSearch2( int* array, int n, int value ){
int l=0, r = n-1;
int m;
while( l<r ){
m = (l+r)/2;
if( *(array+m)<value ) l=m+1;
else r=m;
}
if( *(array+l) == value )
return l;
else
return -1;
}
如果是非排序的,那么只能用线性查找了。《数据结构与算法分析》里面还讨论了一种自组织链表,他的目的在于尽量把越经常被查找的记录放在链表的越前面,这样平均查找一个key的需要访问其他多余key的期望值就减少了。但是我们又几乎没有办法预知访问的模式的(access pattern)的,因此只好通过动态地改变链表中key的分布来达到该目的,这种链表就是自组织链表(self organized list)。如何动态地改变链表中key分布?有三种模式:
(1)Count。对每个key都记录它被访问过的次数,访问次数越多的,就被它排到越前面,猜测被访问得最多的记录很可能再次被访问。这种模式的缺点在于,首先必须使用额外的空间来记录key被访问过的次数,其次,访问次数的记录并不能很好的反映访问频率随时间的变化。例如一个记录在一小段时间内被访问了很多次从而到了链表的最前面,可是过后便不再被访问,它也将需要经过很长一段时间才从会慢慢远离队首。
(2)MoveToFront。每当一个key被访问,它将被放到链表的最前面去,猜测刚刚被访问过的值很可能会被再次访问。这种方式如果用链表来实现会比较容易,如果用数组,将可能导致很多次数据转移。
(3)Transpose。每次key被访问一次,它就跟它前面的key交换一下位置。如果他被访问过很多次,他将最终被移到队首来。这种模式在某些情况下会很糟糕,比方说记录A和B都在队尾,交替地访问A和B将使他们两个一直在队尾徘徊。
根据书中的思想,我实现三种自组织链表,跟书中一样,我也使用了模板:
#include<iostream>
using namespace std;
//self organized list for
//count
//move to front
//transpose
class KeyComp{
public:
static bool eq( const int i, const int j ){
return i==j;
}
static bool eq( const char a, const char b ){
return a==b;
}
static bool eq( const char* a, const char* b ){
while( (*a!='\0')&&(*b!='\0')&&(*a++==*b++) );
return *a == *b;
}
};
//总结:
//模板将被编译两次,第一次检查语法,第二次具现化。不同模板参数将产生一个不同的具现化
//当编译器遭遇CountList定义时,并不知道它继承什么样的class,因为它的父类SelfOrganizedList<T,KeyComp>还未被具现化
//因此,在第一次编译时,编译器就会报错,说明num,keys未声明。
//有三种解决方法,分别如CountList,MoveToFrontList和Transpose所示。
template<class T, class Comp>
class SelfOrganizedList{
protected:
T* keys;
int num;
SelfOrganizedList( int n, T* k ){
keys = k; num = n;
}
int searchHelp( T value ){
T hold = keys[num];
keys[num] = value;
int i=-1;
while( !Comp::eq( keys[++i], value ));
keys[num] = hold;
return i;
}
public:
virtual T* search( T value )=0;
void printKeys(){
for( int i=0; i<num; i++ )
cout<<keys[i]<<" ";
cout<<endl;
}
virtual ~SelfOrganizedList(){}
};
template<class T, class Comp>
class CountList: public SelfOrganizedList<T, Comp>{
private:
int* counts;
//方法一:使用using声明将被掩盖的base class名称带入一个derived class作用域
protected:
using SelfOrganizedList<T,Comp>::num;
using SelfOrganizedList<T,Comp>::keys;
public:
CountList( int n, T* k ):SelfOrganizedList<T,Comp>(n, k){
counts = new int[num+1];
counts++;
for( int i=0; i<num; i++ ) counts[i]=0;
};
T* search( T value ){
int i=searchHelp( value );
if( i == this->num )
return 0;
counts[i]++;
counts[-1] = counts[i];
T hold = keys[i];
while( counts[i-1]<counts[-1] ){
counts[i]=counts[i-1];
keys[i]=keys[i-1];
i--;
}
counts[i]=counts[-1];
keys[i]=hold;
return &keys[i];
}
~CountList(){
delete []counts;
}
};
template<class T, class Comp>
class MoveToFrontList:public SelfOrganizedList<T, Comp>{
//方法二:每次使用父类的成员时,使用this指针。
public:
MoveToFrontList( int n, T* k ):SelfOrganizedList<T, Comp>( n, k ){
}
T* search( T value ){
int i=searchHelp( value );
if( i == this->num )
return 0;
T hold = this->keys[i];
while( i>0 )
this->keys[i]=this->keys[--i];
this->keys[0] = hold;
return &this->keys[0];
}
};
template<class T, class Comp> class Transpose : public SelfOrganizedList<T, Comp>{
//方法三:明确指出该成员位于base class内
public:
Transpose( int n, T* k ) : SelfOrganizedList<T, Comp>( n, k ){
}
T* search( T value ){
int i = searchHelp( value );
if( i == SelfOrganizedList<T, Comp>::num )
return 0;
if( i>0 ){
T hold = SelfOrganizedList<T, Comp>::keys[i-1];
SelfOrganizedList<T, Comp>::keys[i-1] = SelfOrganizedList<T, Comp>::keys[i];
SelfOrganizedList<T, Comp>::keys[i] = hold;
}
return &SelfOrganizedList<T, Comp>::keys[i-1];
}
};
int main(){
int NUM = 8, SRC = 12;
char keys[] = { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H' };
char pattern[] = { 'F', 'D', 'F', 'G', 'E', 'G', 'F', 'A', 'D', 'F', 'G', 'E' };
SelfOrganizedList<char, KeyComp>* test;
// test = new CountList<char, KeyComp>( NUM, keys );
// test = new MoveToFrontList<char, KeyComp>( NUM, keys );
test = new Transpose<char, KeyComp>( NUM, keys );
char* res;
for( int i=0; i<SRC; i++ )
if( (res=test->search(pattern[i])) != 0 )
cout<<*res<<" ";
else
cout<<"not searched"<<endl;
cout<<endl;
test->printKeys();
return 0;
}
除了排序和非排序的线性组织方法,其实还有一种,就是通过key与其下标的映射函数来插入和查找key,这就是哈希函数(hashing function)。一般情况下,我们很难找到key和下标(位置)之间的一一映射,如果两个Key计算出来的下标相同,我们就说发生了冲突。好的哈希函数应该尽量让发生冲突的机率小一些。
但冲突总是存在的。冲突解决方法基本可以分成两种,一种称为open hashing,open hashing通过将计算出相同下标的Key放在同一个链表上来解决冲突,查找key的时候,首先计算该key的下标,然后通过哈希表中该下标指向的链表,对key进行(线性)查找。就像我们前面的桶排序一下,先把key分在不同的桶里,然后再对小数据进行查找或排序。
另一种称为close hashing,close hasing将所有的记录都放在哈希表上。把key通过哈希函数h(k)计算出来的下标称为home。如果home已经被其他记录占据,close hashing通过为它找到下一个下标而将其仍旧存放在哈希表中。有两种方法获取下一个下标。第一种是bucket hashing.把哈希表分成几个bucket,每个bucket有多个slot,哈希函数计算出来的下标将是指bucket的序号。除了有标号的bucket之外,还有一个overflow bucket,当有标号的bucket里的slot被用完了,元素将被放到overflow bucket里面去。因此,查找一个key时,如果在它对应的bucket里找不到它,而且该bucket是满的,那么必须去overflow buckte里面找。而我们知道overflow bucket里面的Key分布是没有规律的。
第二种是probing(探询)。本质上就是为一个key值设计一个探询函数 p(K, i),如果它的home已经被占据,将根据这个探询函数来为它计算另一个位置,如果这个位置又被占了,再计算第二个,...,第i个... pos=(home+p(K,i))%M,M是下标的范围。最简单的探询函数是线性探询p(K,i)=i,也就是,如果Key的home被占了,那就找沿着它往下挨个找,直到找到一个空的位置。好的探询函数应该使所有的空槽被选中的机会相等,但是是像p(K,i)=i这样的线性探询,将会使探询序列聚合在一起,例如对于home=1的情况,其接下来的可能位置将是2, 3, 4,...,对于Home=2,其接下来的可能位置将是3, 4, 5, ..,跟home=1的序列将很快撞到一起,从而影响性能。这种现象在hashing过程中称为primary clustering。应该尽量使不同Home的候选序列尽早地分散开来。例如使用一个随机数列Perm[]={2,5,32,...},p(K,i)=Perm[i],这样对于home= 1,其候选序列为{3,6,33....}而home=2的候选队列为{4, 7, 34,...}。或者使用二次探询函数:p(K,i)=i^2,则home=1的候选队列为{2,5,10,...},home=2的候选队列为{3,6,11,...}。
伪随机探询和二次探询解决了primary clustering, 但是对于home位置相同的Key,它们的候选队列将是一样的,这形成了second clustering问题。解决的办法是使用复式散列(double hashing):p(K,i)=i*h_2(K).
关于哈希表的大小,《STL源码分析》里侯捷这样说到:“如果我们假设表格(哈希表)大小为质数,而且永远保持负载系数在0.5以下(只用了一半以下的糟)(也就是说超过0.5就重新配置并重新整理表格),那么就可以确定每插入一个新元素所需要的探测次数不多于2.
对于哈希表来说,主要有三个操作:插入(insert),查找(search),以及删除(remove)一个元素。下面实现了前面提到的OpenHashing, CloseHashing中的BucketHashing和ProbeHashing:
首先是虚基类Hashing:
template<class Elem, class Key, class KEComp, class EEComp> class Hashing{
protected:
int M, current;
int (*h)(const Key& k); //hash function
Key (*getKey)( const Elem& e ); //getting key of an element
Hashing( int m, int (*hf)(const Key&), Key (*getkey)(const Elem&) ){
M = m; current = 0;
h = hf;
getKey = getkey;
}
public:
virtual bool insert( const Elem& ) = 0;
virtual bool search( const Key&, Elem& ) const = 0;
virtual bool remove( const Key&, Elem& ) = 0;
virtual void clear() = 0;
virtual void print() = 0;
virtual ~Hashing(){};
int size(){
return current;
}
};
接着是OpenHashing:
template<class Elem, class Key, class KEComp, class EEComp>
class OpenHashing: public Hashing<Elem, Key, KEComp, EEComp>{
private:
struct node{
Elem e;
node* next;
node(){ next = 0; }
node( Elem ee, node* n ){ e = ee; next = n; }
};
node* HT;
//searchHelp返回指向元素的前一个node的指针,这样在删除的时候就比较方便
node* searchHelp( const Key& k, Elem& e ) const{
int pos = h(k);
node* loc = &HT[pos];
while( loc->next != 0 && !KEComp::eq( k, loc->next->e ) )
loc = loc->next;
if( loc->next == 0 )
return 0;
e = loc->next->e;
return loc;
}
//释放从HT链出去的node
void clearHelp(){
node *p1, *p2;
for( int i=0; i<M; i++ ){
p1 = &HT[i];
while( p1!=0 && p1->next != 0 ){
p2 = p1->next;
p1 = p2->next;
delete p2;
}
if( p1 != 0 )
delete p1;
}
}
protected:
using Hashing<Elem, Key, KEComp, EEComp>::current;
using Hashing<Elem, Key, KEComp, EEComp>::M;
using Hashing<Elem, Key, KEComp, EEComp>::h;
using Hashing<Elem, Key, KEComp, EEComp>::getKey;
public:
OpenHashing( int s, int (*hf)(const Key& k), Key (*getkey)( const Elem& ) ):
Hashing<Elem, Key, KEComp, EEComp>(s, hf, getkey){
HT = new node[M];
}
bool insert( const Elem& e ){
Key k = getKey( e );
int pos = h(k);
node* loc = &HT[pos];
while( loc->next != 0 && !EEComp::eq( e, loc->next->e ) )
loc = loc->next;
if( loc->next == 0 ){
node* t = new node( e, 0 );
loc->next = t;
current ++;
return true;
}else{
//duplicate records
return false;
}
}
bool search( const Key& k, Elem& e ) const {
Elem temp;
node* res = searchHelp( k, temp );
if( res == 0 )
return false;
e = temp;
return true;
}
bool remove( const Key& k, Elem& e ){
node* res = searchHelp( k, e );
if( res == 0 )
return false;
current --;
node* p = res->next;
res->next = p->next;
delete p;
return true;
}
void clear(){
clearHelp();
current = 0;
for( int i=0; i<M; i++ )
HT[i].next = 0;
}
void print(){
node* cur;
for( int i=0; i<M; i++ ){
cur = &HT[i];
while( cur->next != 0 ){
cout<<cur->next->e<<" ";
cur = cur->next;
}
cout<<endl;
}
}
~OpenHashing(){
clearHelp();
delete HT;
}
};
接着是虚基类CloseHashing:
template<class Elem, class Key, class KEComp, class EEComp>
class CloseHashing :public Hashing<Elem, Key, KEComp, EEComp>{
protected:
Elem* HT; //hashing table
Elem EMPTY, TRASH;
using Hashing<Elem, Key, KEComp, EEComp>::current;
using Hashing<Elem, Key, KEComp, EEComp>::M;
using Hashing<Elem, Key, KEComp, EEComp>::h;
using Hashing<Elem, Key, KEComp, EEComp>::getKey;
CloseHashing( int s, int (*h)(const Key&), Key (*getkey)(const Elem&), Elem e, Elem t ):
Hashing<Elem, Key, KEComp, EEComp>( s, h, getkey ){
EMPTY = e;
TRASH = t;
HT = new Elem[M];
for( int i=0; i<M; i++ )
HT[i] = EMPTY;
}
~CloseHashing(){
delete HT;
}
void clear(){
for( int i=0; i<M; i++ )
HT[i] = EMPTY;
current = 0;
}
void print(){
for( int i=0; i<M; i++ )
cout<<HT[i]<<" ";
cout<<endl;
}
};
然后是CloseHashing的实现:BucketHashing和ProbeHashing:
template<class Elem, class Key, class KEComp, class EEComp>
class BucketHashing : public CloseHashing<Elem, Key, KEComp, EEComp>{
private:
int num; //number of buckets
int rec; //max number of records in a bucket
int overflow; //begin place of overflow bucket
//search in the range [start, end), otherwise overflow bucket is checked
int searchHelp( const Key& k, Elem& e, int start, int end ) const{
int i=start;
while( i<end && (!EEComp::eq(HT[i], EMPTY)) && (!KEComp::eq(k, HT[i])) )
i++;
if( i<end ){
if( KEComp::eq( k, HT[i] ) ){
e = HT[i];
return i;
}
return -1;
}
if( start<overflow )
return searchHelp( k, e, overflow, M );
return -1;
}
//insert to the range [start, end), otherwise overflow bucket is checked
bool insertHelp( const Elem& e, int start, int end ){
int i=start;
int empty_slot = -1;
while( i<end && (!EEComp::eq(HT[i], EMPTY))){
if( EEComp::eq( HT[i], e ) ) return false;
if( EEComp::eq(HT[i], TRASH) && empty_slot<0 ) empty_slot = i;
i++;
}
if( i<end ){
if( empty_slot>=0 )
i = empty_slot;
HT[i] = e;
current ++;
return true;
}
i = overflow;
while( i<M && (!EEComp::eq(HT[i], EMPTY)) ){
if( EEComp::eq(HT[i],e) ) return false;
if( EEComp::eq(HT[i], TRASH) && empty_slot<0 ) empty_slot = i;
i++;
}
if( i<M ){
if( empty_slot>=0 )
i = empty_slot;
HT[i] = e;
current ++;
return true;
}
return false;
}
public:
//n: number of buckets in the hash table
//r: max number of record in a bucket
//o: max number of record in the overflow bucket
BucketHashing( int n, int r, int o, int (*h)(const Key&), Key (*getkey)(const Elem&), Elem e, Elem d ):
CloseHashing<Elem, Key, KEComp, EEComp>( n*r+o, h, getkey, e, d ){
num = n;
rec = r;
overflow = n*r; //the begin place of overflow bucket;
}
bool insert( const Elem& e ){
if( current == M ) //table is full
return false;
Key k = getKey( e );
int home = h(k);
return insertHelp( e, home*rec, home*rec+rec );
}
bool search( const Key& k, Elem& e ) const{
int home = h(k);
return (searchHelp( k, e, home*rec, home*rec+rec )>=0);
}
bool remove( const Key&k, Elem& e ){
int home = h(k);
int res;
if( (res=searchHelp( k, e, home*rec, home*rec+rec )) > 0 ){
HT[res] = TRASH;
current --;
return true;
}
return false;
}
};
ProbeHashing:
//closing hashing, probe.
template<class Elem, class Key, class KEComp, class EEComp>
class ProbeHashing :public CloseHashing<Elem, Key, KEComp, EEComp>{
private:
int (*p)( const Key& k, int i ); //the probing function
int searchHelp( const Key& k, Elem& e ) const;
protected:
using CloseHashing<Elem, Key, KEComp, EEComp>::HT;
using CloseHashing<Elem, Key, KEComp, EEComp>::EMPTY;
using CloseHashing<Elem, Key, KEComp, EEComp>::TRASH;
public:
ProbeHashing( int sz, Elem e, Elem d, int (*hf)(const Key&), int (*pf)(const Key&, int), Key (*getkey)(const Elem&) )
:CloseHashing<Elem,Key, KEComp, EEComp>(sz, hf, getkey, e, d ){
p = pf;
}
bool insert( const Elem& e );
bool search( const Key& e, Elem& ) const;
bool remove( const Key& e, Elem& e );
};
template<class Elem, class Key, class KEComp, class EEComp>
bool ProbeHashing<Elem, Key, KEComp, EEComp>::insert( const Elem& e ){
Key k = getKey( e );
int home = h( k );
int pos = home;
bool hasTrashPlace = false;
for( int i=1; !EEComp::eq(EMPTY, HT[pos]); i++ ){
pos = (home + p(k,i))%M;
if( EEComp::eq(e, HT[pos]) ) return false;
if( EEComp::eq( TRASH, HT[pos] ) ) hasTrashPlace = true;
}
if( !hasTrashPlace ){
HT[pos] = e;
current ++;
return true;
}
pos = home;
for( int i=1; !(EEComp::eq(TRASH, HT[pos])); i++ )
pos = (home + p(k,i))%M;
HT[pos] = e;
current ++;
return true;
}
template<class Elem, class Key, class KEComp, class EEComp>
int ProbeHashing<Elem, Key, KEComp, EEComp>::searchHelp( const Key& k, Elem& e ) const{
int home = h(k);
int pos = home;
int i=1;
while( (!KEComp::eq(k, HT[pos]))&&(!EEComp::eq(EMPTY, HT[pos])) )
pos = (home+p(k, i++));
if( KEComp::eq(k, HT[pos]) ){
e = HT[pos];
return pos;
}
else return -1;
}
template<class Elem, class Key, class KEComp, class EEComp>
bool ProbeHashing<Elem, Key, KEComp, EEComp>::search( const Key& k, Elem& e ) const{
Elem temp;
if( searchHelp( k, temp ) >= 0 ){
e = temp;
return true;
}
return false;
}
template<class Elem, class Key, class KEComp, class EEComp>
bool ProbeHashing<Elem, Key, KEComp, EEComp>::remove( const Key& k, Elem& e ){
int pos = searchHelp( k, e );
if( pos>=0 ){
HT[pos]=TRASH;
current--;
return true;
}
return false;
}
最后是测试代码:
class Compare{
public:
static bool eq( const char a, const char b ){ return a==b; }
static bool eq( const char* a, const char* b ){
while( *a!='\0' && *b!='\0' && *a++==*b++ );
return *a == *b;
}
static bool eq( const int a, const int b ){ return a==b; }
};
//EXAMPLE:
//以int为例子,假设key的值在[0,100],那么range以外的值都可以拿来当EMPTY和TRASH
int EMPTY = -1;
int TRASH = -2;
int M_SIZE = 11;
int BUCKET_SIZE = 3;
int OVERFLOW_BUCKET = 10;
//int的值本身就是key
int getkey( const int& a ){
return a;
}
//简单的哈希函数
int inth( const int& a ){
return a%M_SIZE;
}
//线性探询函数
int intp( const int& a, int i ){
return i;
}
int main(){
Hashing<int, int, Compare, Compare>* test;
//test = new BucketHashing<int, int, Compare, Compare>( M_SIZE, BUCKET_SIZE, OVERFLOW_BUCKET, inth, getkey, EMPTY, TRASH );
test = new ProbeHashing<int, int, Compare, Compare>( M_SIZE, EMPTY, TRASH, inth, intp, getkey );
//test = new OpenHashing<int, int, Compare, Compare>( M_SIZE, inth, getkey );
test->insert( 23 );
test->insert( 12 );
test->insert( 1 );
test->insert( 34 );
test->print();
cout<<"---------------------------------------"<<endl;
int a=-1;
test->search(34, a);
cout<<a<<endl;
test->search(30, a);
cout<<a<<endl;
test->search(1, a);
cout<<a<<endl;
cout<<"--------------------------------------:"<<endl;
test->remove( 23, a );
test->remove( 1, a );
test->search( 34, a );
cout<<a<<endl;
test->print();
cout<<"---------------------------------------"<<endl;
test->insert( 10 );
test->insert( 1 );
test->insert( 34 );
test->print();
cout<<"---------------------------------------"<<endl;
cout<<test->size()<<endl;
test->clear();
test->print();
cout<<"---------------------------------------"<<endl;
}
除了链表/数组的线性组织方式,另一种方式即树的组织方式,我在《基于树的索引结构介绍》(http://philoscience.iteye.com/admin/blogs/1112759 )中总结过,不过至今未实现其代码,呵呵,接下来就来试试。