Data Structures: Complete Study Notes (Part 3): Hashing

Most recent recommended article published on 2024-10-09 17:48:02

Wiayetr


Article link: https://blog.csdn.net/Wiayetr/article/details/136267733


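This article covers the core concepts of the hash table data structure: choosing a hash function, resolving collisions (separate chaining, and open addressing with linear probing, quadratic probing, and double hashing), and rehashing. It also mentions extendible hashing for large data sets.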


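These notes are based on "Data Structures and Algorithm Analysis in C" and can serve as a study reference.
Having the book at hand makes them easier to follow, but it is not required.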

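5.1 General Idea

The ideal hash table data structure is a fixed-size array containing keys.
A hash function maps each key to a cell index in the range 0 to TableSize - 1; ideally, distinct keys map to distinct cells.
Three problems remain: choosing the function, deciding what to do when two keys hash to the same value (a collision), and deciding on the size of the table.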

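5.2 Hash Function

//Type returned by the hash function
typedef unsigned int Index;

//Keys are usually strings; this version adds up the character codes of the whole string
//A simple hash function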
Index Hash( const char *Key, int TableSize )
{
    unsigned int HashVal = 0;

    while( *Key != '\0' )
        HashVal += *Key++;
    
    return HashVal % TableSize;
}

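However, when the table is large, this function only maps keys into a small range. For example, with keys of at most 8 ASCII characters the sum cannot exceed 127 × 8 = 1016, so a table with about 10,000 cells would leave most cells unused.
Next, a much better hash function.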

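//A good hash function
Index Hash(const char* Key, int TableSize)
{
    unsigned int HashVal = 0;

    //Left shift by 5 is the same as multiplying by 32 (2 to the 5th power),
    //so the loop evaluates a polynomial in 32 over the characters
    while(*Key != '\0')
        HashVal = (HashVal << 5) + *Key++;

    return HashVal % TableSize;
}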

However, if keys are very long, computing the hash takes a long time. A common practice is to hash only part of the key.

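When an element is inserted at a cell that already holds another element, the two keys have the same hash value and the collision has to be resolved. The following sections discuss how.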

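5.3 Separate Chaining

Keep all elements that hash to the same value in a list.
Below is a code implementation of separate chaining.

The ListNode declaration is the same as for a linked list. The hash table structure contains an array of linked lists (along with the number of lists in the array), which are allocated dynamically when the table is initialized. The HashTable type here is a pointer to that structure.
The TheList field is actually a pointer to a pointer to a ListNode structure. Note the use of typedef, which keeps the code clearer.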

#include <stdio.h>
#include <stdlib.h>

#ifndef _HashSep_H
#define _HashSep_H

struct ListNode;
typedef struct ListNode* Position;
struct HashTbl;
typedef struct HashTbl* HashTable;
typedef int ElementType;
typedef Position List;
typedef int Index;

HashTable InitializeTable(int TableSize);
void DestroyTable(HashTable H);
Position Find(ElementType Key, HashTable H);
void Insert(ElementType Key, HashTable H);
ElementType Retrieve(Position P);

#endif


struct ListNode
{
    ElementType Element;
    Position Next;
};


//TheList is an array of lists; each list starts with a header node
struct HashTbl
{
    int TableSize;
    List *TheList;
};

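//Trial division: a is prime if no i with i * i <= a divides it
int IsPrime(int a)
{
    if(a < 2)
        return 0;
    //Start at 2 (dividing by 0 is undefined); i <= a / i avoids overflow of i * i
    for (int i = 2; i <= a / i; i++)
    {
        if(a % i == 0)
            return 0;
    }

    return 1;
}

//Smallest prime greater than or equal to a
int NextPrime(int a)
{
    int i;
    //Advance until a prime is found; the loop body is intentionally empty
    for (i = a; !IsPrime(i) ; i++)
        ;
    return i;
}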

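HashTable InitializeTable(int TableSize)
{
    HashTable H;
    int i;

    if(TableSize < 5 )
    {
        printf("Table is too small\n");
        return NULL;
    }

    //Allocate the table structure
    H = malloc(sizeof(struct HashTbl));
    if(H == NULL)
        return NULL;

    //Use a prime table size
    H->TableSize = NextPrime(TableSize);

    //Allocate the array of lists
    H->TheList = malloc(sizeof(List) * H->TableSize);
    if(H->TheList == NULL)
    {
        free(H);
        return NULL;
    }

    //Allocate one header node per list
    for (i = 0; i < H->TableSize; i++)
    {
        H->TheList[i] = malloc(sizeof(struct ListNode));
        if(H->TheList[i] == NULL)
        {
            //Release everything allocated so far before giving up
            while(i > 0)
                free(H->TheList[--i]);
            free(H->TheList);
            free(H);
            return NULL;
        }
        H->TheList[i]->Next = NULL;
    }

    return H;
}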


//Hash function for this listing. ElementType is int here, so the string
//version from section 5.2 does not fit; the key is reduced modulo the table size
Index Hash(ElementType Key, int TableSize)
{
    return (unsigned int)Key % TableSize;
}

Position Find(ElementType Key, HashTable H)
{
    Position P;
    List L;

    L = H->TheList[Hash(Key, H->TableSize)];
    P = L->Next;
    while(P != NULL && P->Element != Key)
        P = P->Next;
    
    return P;
}

void Insert(ElementType Key, HashTable H)
{
    Position Pos, NewCell;
    List L;

    //Insert only if the key is not already present
    Pos = Find(Key, H);
    if(Pos == NULL)
    {
        NewCell = malloc(sizeof(struct ListNode));
        if(NewCell == NULL)
            return;    //Out of memory: leave the table unchanged
        else
        {
            //Link the new cell in right after the header node
            L = H->TheList[ Hash(Key, H->TableSize) ];
            NewCell->Next = L->Next;
            NewCell->Element = Key;
            L->Next = NewCell;
        }
    }
}
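The header above declares DestroyTable and Retrieve, but the listing does not define them. The following is one possible sketch, assuming the same layout as above (an array of lists, each starting with a header node). The types are repeated so that the fragment compiles on its own.

```c
#include <assert.h>
#include <stdlib.h>

typedef int ElementType;
struct ListNode { ElementType Element; struct ListNode *Next; };
typedef struct ListNode *Position;
struct HashTbl { int TableSize; Position *TheList; };
typedef struct HashTbl *HashTable;

/* Return the element stored at a position obtained from Find. */
ElementType Retrieve(Position P)
{
    assert(P != NULL);
    return P->Element;
}

/* Free every node of every list (header nodes included),
   then the array of lists, then the table structure itself. */
void DestroyTable(HashTable H)
{
    int i;

    if (H == NULL)
        return;

    for (i = 0; i < H->TableSize; i++) {
        Position P = H->TheList[i];
        while (P != NULL) {
            Position Next = P->Next;   /* save the link before freeing */
            free(P);
            P = Next;
        }
    }
    free(H->TheList);
    free(H);
}
```

Retrieve asserts on NULL because Find returns NULL for a missing key, so callers are expected to check the result of Find first.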

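A drawback of this algorithm is that it needs pointers, and allocating new cells takes time.
We define the load factor $\lambda$ as the ratio of the number of elements in the hash table to the table size.
$\lambda \approx 1$
This is the ideal case for separate chaining: the table size is about equal to the number of elements expected.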
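As a small worked example of the load factor, the sketch below computes $\lambda$ and the usual textbook estimates for separate chaining: an unsuccessful search examines about $\lambda$ nodes on average, and a successful search about $1 + \lambda/2$. The function names are chosen for this illustration only.

```c
#include <assert.h>

/* Load factor: number of stored elements divided by the table size. */
double LoadFactor(int NumElements, int TableSize)
{
    assert(NumElements >= 0 && TableSize > 0);
    return (double)NumElements / TableSize;
}

/* Unsuccessful search: the whole list is scanned, about lambda nodes. */
double ChainNodesUnsuccessful(double Lambda)
{
    return Lambda;
}

/* Successful search: the matching node plus about half of the others. */
double ChainNodesSuccessful(double Lambda)
{
    return 1.0 + Lambda / 2.0;
}
```

For 10007 elements in a table of 10007 lists, $\lambda = 1$ and a successful search examines about 1.5 nodes on average, which is why the table size matters less than the load factor.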

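5.4 Open Addressing

If a collision occurs, try alternative cells until an empty one is found.
The cells $h_0(X), h_1(X), h_2(X), \ldots$ are tried in succession, where
$h_i(X) = (Hash(X) + F(i)) \bmod TableSize$
Here $F$ is the collision resolution strategy, with $F(0) = 0$.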
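The probe formula can be sketched directly in code. In this illustration the strategy $F$ is passed as a function pointer; `Strategy`, `LinearF`, `QuadraticF` and `Probe` are hypothetical names, not part of the original listings.

```c
#include <assert.h>

typedef unsigned int Index;
typedef unsigned int (*Strategy)(unsigned int i);

/* F(i) = i, the linear probing strategy. */
unsigned int LinearF(unsigned int i)
{
    return i;
}

/* F(i) = i * i, the quadratic probing strategy. */
unsigned int QuadraticF(unsigned int i)
{
    return i * i;
}

/* h_i(X) = (Hash(X) + F(i)) mod TableSize, where Home stands for Hash(X). */
Index Probe(Index Home, unsigned int i, Strategy F, int TableSize)
{
    assert(F != 0 && TableSize > 0);
    return (Home + F(i)) % (unsigned int)TableSize;
}
```

For a key whose home cell is 9 in a table of size 10, the first alternative cell is 0 under both strategies, and the second is 1 under linear probing but 3 under quadratic probing.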

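5.4.1 Linear Probing

$F$ is a linear function of $i$, typically $F(i) = i$: cells are probed one after another until an empty cell is found.
This can take a long time, and it can cause "clustering": occupied cells sit next to each other and form blocks.
A mathematical analysis of this probing strategy shows that it still performs well at $\lambda = 0.5$.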
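A minimal sketch of linear probing, together with the standard textbook estimates for the expected number of probes: roughly $\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$ for insertions and unsuccessful searches, and $\frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$ for successful searches. The table is a plain int array with `EMPTY` marking free cells; these names and the restriction to non-negative keys are assumptions made for the illustration.

```c
#include <assert.h>

#define EMPTY (-1)

/* Insert a non-negative Key with F(i) = i.
   Returns the cell used, or -1 if the table is full. */
int LinearInsert(int Table[], int TableSize, int Key)
{
    int i, Pos;

    assert(TableSize > 0 && Key >= 0);
    for (i = 0; i < TableSize; i++) {
        Pos = (Key % TableSize + i) % TableSize;
        if (Table[Pos] == EMPTY || Table[Pos] == Key) {
            Table[Pos] = Key;
            return Pos;
        }
    }
    return -1;
}

/* Expected probes for an insertion or an unsuccessful search. */
double LinearProbesUnsuccessful(double Lambda)
{
    return 0.5 * (1.0 + 1.0 / ((1.0 - Lambda) * (1.0 - Lambda)));
}

/* Expected probes for a successful search. */
double LinearProbesSuccessful(double Lambda)
{
    return 0.5 * (1.0 + 1.0 / (1.0 - Lambda));
}
```

Inserting 89, 18, 49, 58, 69 into a table of size 10 puts them in cells 9, 8, 0, 1, 2: the last three keys pile up behind each other, which is the clustering effect. At $\lambda = 0.5$ the formulas give 2.5 and 1.5 probes.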

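5.4.2 Quadratic Probing

$F(i) = i^2$ is the popular choice.
Theorem: if quadratic probing is used and the table size is prime, then a new element can always be inserted as long as the table is at least half empty.
The proof is omitted here. Note that once the table is even one cell more than half full, an insertion could fail. The table size must also be prime.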
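The role of the prime table size can be checked numerically. The sketch below (the function name is made up for this illustration) counts how many different cells the offsets $i^2 \bmod TableSize$ can reach.

```c
#include <assert.h>
#include <stdlib.h>

/* Count the distinct values of i * i mod TableSize for i = 0 .. TableSize - 1.
   Returns -1 if the scratch array cannot be allocated. */
int CountDistinctProbes(int TableSize)
{
    int i, Count = 0;
    char *Seen;

    assert(TableSize > 0);
    Seen = calloc((size_t)TableSize, 1);
    if (Seen == NULL)
        return -1;

    for (i = 0; i < TableSize; i++) {
        int Pos = (int)(((long long)i * i) % TableSize);
        if (!Seen[Pos]) {
            Seen[Pos] = 1;
            Count++;
        }
    }
    free(Seen);
    return Count;
}
```

For the prime size 17 the probe sequence reaches 9 different cells, more than half the table, which matches the theorem. For size 16 it reaches only 4 cells, so an insertion can fail even when most of the table is empty.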

Below is a code implementation of this method.

#include <stdio.h>
#include <stdlib.h>

#ifndef _HashQuad_H
#define _HashQuad_H

typedef unsigned int Index;
typedef Index Position;
typedef int ElementType;

struct HashTbl;
typedef struct HashTbl *HashTable;

HashTable InitializeTable(int TableSize);
void DestroyTable(HashTable H);
Position Find(ElementType Key, HashTable H);
void Insert(ElementType Key, HashTable H);
ElementType Retrieve(Position P, HashTable H);
HashTable Rehash(HashTable H);

#endif

enum KindOfEntry{Legitimate, Empty, Deleted};

struct HashEntry
{
    ElementType Element;
    enum KindOfEntry Info;
};

typedef struct HashEntry Cell;

struct HashTbl
{
    int TableSize;
    Cell *TheCells;
};

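//Trial division: a is prime if no i with i * i <= a divides it
int IsPrime(int a)
{
    if(a < 2)
        return 0;
    //Start at 2 (dividing by 0 is undefined); i <= a / i avoids overflow of i * i
    for (int i = 2; i <= a / i; i++)
    {
        if(a % i == 0)
            return 0;
    }

    return 1;
}

//Smallest prime greater than or equal to a
int NextPrime(int a)
{
    int i;
    //Advance until a prime is found; the loop body is intentionally empty
    for (i = a; !IsPrime(i) ; i++)
        ;
    return i;
}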

HashTable InitializeTable(int TableSize)
{
    HashTable H;
    int i;

    H = malloc( sizeof( struct HashTbl ) );
    if(H == NULL)
        return NULL;

    H->TableSize = NextPrime(TableSize);

    H->TheCells = malloc(sizeof(Cell) * H->TableSize);
    if(H->TheCells == NULL)
    {
        free(H);
        return NULL;
    }

    //Mark every cell as empty
    for ( i = 0; i < H->TableSize; i++)
        H->TheCells[i].Info = Empty;
    
    return H;
}

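//Hash is not defined in the original listing. Since ElementType is int here,
//this version simply reduces the key modulo the table size (an assumption)
Index Hash(ElementType Key, int TableSize)
{
    return (unsigned int)Key % TableSize;
}

Position Find(ElementType Key, HashTable H)
{
    Position CurrentPos;
    int CollisionNum;

    CollisionNum = 0;
    CurrentPos = Hash(Key, H->TableSize);
    //Stop at an empty cell (Key is absent) or at the cell that holds Key
    while( H->TheCells[CurrentPos].Info != Empty && H->TheCells[CurrentPos].Element != Key )
    {
        //i*i - (i-1)*(i-1) = 2i - 1, so adding 2i - 1 gives the next
        //quadratic probe without a multiplication
        CurrentPos += 2 * ++CollisionNum - 1;
        if(CurrentPos >= H->TableSize)
            CurrentPos -= H->TableSize;
    }

    return CurrentPos;
}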
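The header declares Insert, but the listing stops at Find. Below is a compact, self-contained sketch of how Insert and a lazy Delete could sit on top of Find. `MakeTable` and `Delete` are hypothetical helpers that are not in the original code, and the integer hash (key modulo table size) is an assumption as well.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int Position;
typedef int ElementType;
enum KindOfEntry { Legitimate, Empty, Deleted };
typedef struct { ElementType Element; enum KindOfEntry Info; } Cell;
typedef struct HashTbl { int TableSize; Cell *TheCells; } *HashTable;

/* Hypothetical helper: the caller passes a prime size directly. */
HashTable MakeTable(int PrimeSize)
{
    int i;
    HashTable H = malloc(sizeof(*H));

    if (H == NULL)
        return NULL;
    H->TableSize = PrimeSize;
    H->TheCells = malloc(sizeof(Cell) * (size_t)PrimeSize);
    if (H->TheCells == NULL) {
        free(H);
        return NULL;
    }
    for (i = 0; i < PrimeSize; i++)
        H->TheCells[i].Info = Empty;
    return H;
}

/* Quadratic probing; assumes the table is at least half empty. */
Position Find(ElementType Key, HashTable H)
{
    unsigned int Size = (unsigned int)H->TableSize;
    Position Cur = (unsigned int)Key % Size;
    unsigned int Collisions = 0;

    while (H->TheCells[Cur].Info != Empty && H->TheCells[Cur].Element != Key) {
        Cur += 2 * ++Collisions - 1;   /* next offset: i*i - (i-1)*(i-1) */
        if (Cur >= Size)
            Cur -= Size;
    }
    return Cur;
}

/* Store Key at the cell Find returns, unless it is already present. */
void Insert(ElementType Key, HashTable H)
{
    Position Pos = Find(Key, H);

    if (H->TheCells[Pos].Info != Legitimate) {
        H->TheCells[Pos].Info = Legitimate;
        H->TheCells[Pos].Element = Key;
    }
}

/* Lazy deletion: mark the cell so that probe sequences passing
   through it still continue to the cells behind it. */
void Delete(ElementType Key, HashTable H)
{
    Position Pos = Find(Key, H);

    if (H->TheCells[Pos].Info == Legitimate)
        H->TheCells[Pos].Info = Deleted;
}
```

With a table of size 11, the keys 58 and 69 both hash to cell 3, so 69 ends up in cell 4. After `Delete(58, H)`, looking up 69 still works because the cell of 58 is marked Deleted rather than Empty.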

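5.4.3 Double Hashing

A popular choice is $F(i) = i \cdot hash_2(X)$, that is, a second hash function is applied to $X$ and the probe advances by that distance. The result depends heavily on how the second hash function is chosen; in particular, $hash_2(X)$ must never evaluate to zero.
In practice quadratic probing is usually used; double hashing is more complex and slower by comparison.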
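As an illustration, one commonly cited second hash function is $hash_2(X) = R - (X \bmod R)$ with $R$ a prime smaller than the table size. This choice is not stated in the notes above; it is used here as an assumption because it can never be zero. The function names are made up for this sketch.

```c
#include <assert.h>

/* Second hash function: R - (X mod R), which lies in 1 .. R. */
unsigned int Hash2(unsigned int X, unsigned int R)
{
    assert(R > 0);
    return R - (X % R);
}

/* i-th probe position: (Hash(X) + i * hash2(X)) mod TableSize,
   with Hash(X) = X mod TableSize. */
unsigned int DoubleHashProbe(unsigned int X, unsigned int i,
                             unsigned int TableSize, unsigned int R)
{
    assert(TableSize > 0);
    return (X % TableSize + i * Hash2(X, R)) % TableSize;
}
```

With a table size of 10 and $R = 7$, key 49 has home cell 9 and step 7, so its first alternative cell is 6. Key 69 also has home cell 9 but step 1, so the two keys follow different probe sequences even though they collide initially.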

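5.5 Rehashing

For open addressing with quadratic probing, if the table gets too full, operations start to take too long and Insert may fail.
One solution is to build another table about twice as big (with an associated new hash function), scan the original hash table, compute the new hash value of each element, and insert it into the new table.
Below is an implementation of rehashing: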

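HashTable Rehash(HashTable H)
{
    int i;
    HashTable NewTable;

    //InitializeTable rounds the requested size up to the next prime
    NewTable = InitializeTable(2 * H->TableSize);
    if(NewTable == NULL)
        return H;    //Keep using the old table if allocation fails

    //Reinsert every live element; Empty and Deleted cells are skipped
    for(i = 0; i < H->TableSize; i++)
        if(H->TheCells[i].Info == Legitimate)
            Insert(H->TheCells[i].Element, NewTable);

    //Release the old cell array and the old table structure
    free(H->TheCells);
    free(H);

    return NewTable;
}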
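The notes do not say when Rehash should be called. One possible policy, sketched below as an assumption, is to rehash as soon as the table is more than half full, which keeps quadratic probing inside the range where insertion is guaranteed to succeed. The `Count` field, `RawInsert` and `InsertWithRehash` are additions made for this self-contained illustration.

```c
#include <assert.h>
#include <stdlib.h>

enum KindOfEntry { Legitimate, Empty, Deleted };
typedef struct { int Element; enum KindOfEntry Info; } Cell;
typedef struct HashTbl { int TableSize; int Count; Cell *TheCells; } *HashTable;

int IsPrime(int a)
{
    int i;
    if (a < 2) return 0;
    for (i = 2; i <= a / i; i++)
        if (a % i == 0) return 0;
    return 1;
}

int NextPrime(int a)
{
    while (!IsPrime(a)) a++;
    return a;
}

HashTable InitializeTable(int TableSize)
{
    int i;
    HashTable H = malloc(sizeof(*H));
    if (H == NULL) return NULL;
    H->TableSize = NextPrime(TableSize);
    H->Count = 0;
    H->TheCells = malloc(sizeof(Cell) * (size_t)H->TableSize);
    if (H->TheCells == NULL) { free(H); return NULL; }
    for (i = 0; i < H->TableSize; i++)
        H->TheCells[i].Info = Empty;
    return H;
}

/* Quadratic probing with key modulo table size as the hash. */
unsigned int Find(int Key, HashTable H)
{
    unsigned int Size = (unsigned int)H->TableSize;
    unsigned int Cur = (unsigned int)Key % Size, Collisions = 0;
    while (H->TheCells[Cur].Info != Empty && H->TheCells[Cur].Element != Key) {
        Cur += 2 * ++Collisions - 1;
        if (Cur >= Size) Cur -= Size;
    }
    return Cur;
}

/* Plain insert without any growth check; counts new elements. */
void RawInsert(int Key, HashTable H)
{
    unsigned int Pos = Find(Key, H);
    if (H->TheCells[Pos].Info != Legitimate) {
        H->TheCells[Pos].Info = Legitimate;
        H->TheCells[Pos].Element = Key;
        H->Count++;
    }
}

/* Move all live elements into a table about twice as large. */
HashTable Rehash(HashTable H)
{
    int i;
    HashTable NewTable = InitializeTable(2 * H->TableSize);
    if (NewTable == NULL) return H;
    for (i = 0; i < H->TableSize; i++)
        if (H->TheCells[i].Info == Legitimate)
            RawInsert(H->TheCells[i].Element, NewTable);
    free(H->TheCells);
    free(H);
    return NewTable;
}

/* Insert, then grow once the table is more than half full.
   The caller must continue with the returned table. */
HashTable InsertWithRehash(int Key, HashTable H)
{
    RawInsert(Key, H);
    if (2 * H->Count > H->TableSize)
        H = Rehash(H);
    return H;
}
```

Starting from a table of size 7, inserting 13, 15, 24 leaves the size unchanged; the fourth key, 6, pushes the table past half full and triggers a rehash into a table of size 17, the first prime at or above 14.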

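5.6 Extendible Hashing

Finally, consider the case where the amount of data is too large: main memory cannot hold all the data to be processed, so the work has to be done on disk.
One option is extendible hashing, whose idea is somewhat similar to a B-tree.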
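The notes stop at the general idea. In the usual textbook description, a directory with $2^D$ entries is indexed by the leading $D$ bits of the key, and each entry points to a disk block (leaf); when a leaf overflows and cannot be split under the current $D$, the directory doubles. The sketch below only shows the directory lookup. The function names and the fixed key width `KeyBits` are assumptions made for this illustration.

```c
#include <assert.h>

/* Directory slot for Key: its leading D bits, for keys KeyBits wide. */
unsigned int DirectoryIndex(unsigned int Key, int D, int KeyBits)
{
    assert(KeyBits > 0 && KeyBits <= 32);
    assert(D >= 0 && D <= KeyBits && D < 32);
    if (D == 0)
        return 0;
    return (Key >> (KeyBits - D)) & ((1u << D) - 1u);
}

/* Number of directory entries when D leading bits are used. */
unsigned int DirectorySize(int D)
{
    assert(D >= 0 && D < 32);
    return 1u << D;
}
```

For the 6-bit key 100100 (0x24) and $D = 2$, the leading bits are 10, so the key belongs to directory entry 2. After the directory doubles to $D = 3$, the same key is found through entry 4 (bits 100).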
