Hash Functions, Advanced (Part 2)

qlcms_hj

Original article: https://blog.csdn.net/qlcms_hj/article/details/40709137

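1. Theory

A hash can be viewed as a data structure, comparable to an array or a linked list. It can also be viewed as an algorithmic process that scatters data across slots so that it can be accessed quickly.

Using a hash well comes down to three things:

(1) the algorithm that generates the hashcode

(2) the hashing algorithm that maps a hashcode to a position

(3) the collision-handling strategy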

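1.1. Generating the hashcode

Different objects need different keys, so the algorithm should, as far as possible, give different objects different hashcode values.

For example, suppose we need hashcodes for all-letter strings of at most 3 characters. The following two algorithms behave very differently: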

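/* Additive hash: sums the character codes, so character order is lost. */
unsigned int gethashcode_sum(const char *str)
{
	unsigned int hashcode = 0;
	
	while(*str)
	{
		/* cast so that a signed char never contributes a negative value */
		hashcode = hashcode + (unsigned char)*str++;
	}
	
	return (hashcode & 0x7FFFFFFF);
}

/* Multiplicative hash: the running value is scaled by 24 at each step, so position matters. */
unsigned int gethashcode_mul(const char *str)
{
	unsigned int hashcode = 0;
	
	while(*str)
	{
		hashcode = 24 * hashcode + (unsigned char)*str++;
	}
	
	return (hashcode & 0x7FFFFFFF);
}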

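The first algorithm produces duplicate hashcodes with high probability: addition ignores character order, so strings such as "ab" and "ba" get the same value. It is therefore not a good algorithm.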
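To make the difference concrete, here is a minimal Java sketch of the same two algorithms. The class and method names (HashCodeCompare, sumHash, mulHash) are made up for this illustration and do not come from any library.

```java
final class HashCodeCompare {
    // Additive hash: the order of the characters is lost.
    static int sumHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h += s.charAt(i);
        }
        return h & 0x7FFFFFFF;
    }

    // Multiplicative hash with the multiplier 24 from the C code above.
    static int mulHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 24 * h + s.charAt(i);
        }
        return h & 0x7FFFFFFF;
    }

    public static void main(String[] args) {
        System.out.println(sumHash("ab") + " " + sumHash("ba")); // 195 195
        System.out.println(mulHash("ab") + " " + mulHash("ba")); // 2426 2449
    }
}
```

The multiplicative version is better, not perfect: with mixed-case letters it can still collide, for example mulHash("Ay") and mulHash("Ba") both give 1681, because the multiplier 24 is smaller than the spread of the character codes.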

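1.2. The hashing algorithm

A good hashcode algorithm gives different objects different hashcodes. Hashing then uses the hashcode to place the object at a specific position, so the hashing algorithm should map different hashcodes to different positions as far as possible. In other words, the fewer collisions the better.

For example, consider the following two implementations: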

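/* Division (remainder) method. Expects hashcode >= 0 and tablelength > 0. */
int getindex(int hashcode, int tablelength)
{
	return (hashcode % tablelength);
}

/* Direct addressing: the index is a linear function of the hashcode. */
int getindex_direct(int hashcode, int a, int b)
{
	return (a * hashcode + b);
}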

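The first is the division (remainder) method. The key point is the choice of divisor; the usual rule is to pick the largest prime that is less than or equal to the table length.

The second is direct addressing. It is simple but rarely used, and fits only specific applications. For example, the input is a large data set of people born between 1980 and 2000, and we want to count how many were born in each year.

hashcode algorithm: hashcode = value - 1980
hashing algorithm: index = hashcode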
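The birth-year case can be sketched in Java as follows. BirthYearCounter and countByYear are hypothetical names chosen for this example, and rejecting out-of-range years with an exception is an assumption, since the text does not say how such input should be handled.

```java
final class BirthYearCounter {
    static final int FIRST_YEAR = 1980;
    static final int LAST_YEAR = 2000;

    // Returns counts where counts[i] is the number of people born in FIRST_YEAR + i.
    static int[] countByYear(int[] birthYears) {
        int[] counts = new int[LAST_YEAR - FIRST_YEAR + 1];
        for (int year : birthYears) {
            if (year < FIRST_YEAR || year > LAST_YEAR) {
                throw new IllegalArgumentException("year out of range: " + year);
            }
            int hashcode = year - FIRST_YEAR; // hashcode = value - 1980
            int index = hashcode;             // index = hashcode (a = 1, b = 0)
            counts[index]++;
        }
        return counts;
    }
}
```

Because every possible key has its own slot, there are no collisions at all. The price is that the table must be as large as the key range, which is why the method only suits small, dense ranges like this one.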

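1.3. Collision resolution strategies

Even if all hashcodes are guaranteed to be distinct, collisions are unavoidable when a large amount of data has to be stored in a table much smaller than the data set. A good collision-resolution method is therefore needed to keep operations efficient.

Common strategies are:

Open addressing: on a collision, probe again starting from the previous result, e.g. f_i(key) = (f(key) + d_i) % len, where d_i is the offset for the i-th probe. The sequence d_i can be chosen freely; d_i = 1, 2, 3, ... gives linear probing.
Chaining: on a collision, the colliding entries are linked together in a linked list at that position.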

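/* Open addressing with linear probing.
 * Assumes <stdbool.h> and a global int array[len] whose empty slots hold -1. */
bool putvalue(int hashcode, int value)
{
	int index = 0, mark = 0;
	
	index = getindex(hashcode, len);
	/* step to the next slot until a free one is found or every slot has been tried */
	while((array[index] != -1) && (mark < len))
	{
		index = getindex(index + 1, len);
		mark++;
	}
	
	if(mark != len)
	{
		array[index] = value;
		return true;
	}
	else
	{
		/* all len slots are occupied */
		return false;
	}
}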

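/* Chaining: each slot holds the head of a singly linked list.
 * Assumes <stdbool.h>, <stdlib.h>, struct Node { int value; struct Node *next; }
 * and a global struct Node *array[len] whose empty slots are NULL. */
bool putvalue_chain(int hashcode, int value)
{
	int index = 0;
	struct Node *pnode = NULL, *node = NULL;
	
	/* malloc node and set value */
	node = malloc(sizeof(*node));
	if(node == NULL)
	{
		return false;
	}
	node->value = value;
	node->next = NULL;
	
	index = getindex(hashcode, len);
	pnode = array[index];
	if(pnode == NULL)
	{
		/* first entry in this slot */
		array[index] = node;
		return true;
	}
	
	/* walk to the tail of the chain and append */
	while(pnode->next != NULL)
	{
		pnode = pnode->next;
	}
	pnode->next = node;
	
	return true;
}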
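The code above only shows insertion. With open addressing, a lookup has to follow the same probe sequence as the insert, which the following Java sketch illustrates. LinearProbingTable and its methods are hypothetical names. The sketch assumes -1 marks an empty slot as in the first putvalue above, a table length greater than 0, and no deletions.

```java
import java.util.Arrays;

final class LinearProbingTable {
    private static final int EMPTY = -1; // sentinel, so -1 itself cannot be stored
    private final int[] slots;

    LinearProbingTable(int length) {
        slots = new int[length];
        Arrays.fill(slots, EMPTY);
    }

    private int indexFor(int hashcode) {
        return (hashcode & 0x7FFFFFFF) % slots.length;
    }

    // Stores value in the first free slot at or after its home slot.
    boolean put(int hashcode, int value) {
        int index = indexFor(hashcode);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[index] == EMPTY) {
                slots[index] = value;
                return true;
            }
            index = (index + 1) % slots.length;
        }
        return false; // every slot is occupied
    }

    // Follows the same probe sequence as put.
    boolean contains(int hashcode, int value) {
        int index = indexFor(hashcode);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[index] == EMPTY) {
                return false; // reached a gap, so the value was never stored
            }
            if (slots[index] == value) {
                return true;
            }
            index = (index + 1) % slots.length;
        }
        return false;
    }
}
```

Stopping at the first empty slot is only correct because nothing is ever removed. A table that supports deletion would need tombstone markers or a re-insertion step, which this sketch leaves out.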

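2. Examples

We take the hash implementations in the JDK and the C++ STL as examples and analyse how hashing is implemented.

2.1. Hashtable in the JDK

(1) The array that stores the entries, and its initialisation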

/**
     * The hash table data.
     */
    private transient Entry<?,?>[] table;

public Hashtable(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal Capacity: "+
                                               initialCapacity);
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal Load: "+loadFactor);

        if (initialCapacity==0)
            initialCapacity = 1;
        this.loadFactor = loadFactor;
        table = new Entry<?,?>[initialCapacity];
        threshold = (int)Math.min(initialCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
    }

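threshold is a limit derived from the capacity and the load factor. Once the number of entries reaches it, adding another entry enlarges the table and rehashes all existing entries into the new one.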
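A small Java sketch of the arithmetic may help. ThresholdDemo and its methods are hypothetical. The threshold formula is copied from the constructor above, while the growth rule (double the capacity and add one) and the value of MAX_ARRAY_SIZE are what I recall from the JDK 8 sources, so treat them as assumptions; the overflow handling in the real rehash() is omitted.

```java
final class ThresholdDemo {
    // Assumed to match the constant in JDK 8's Hashtable.
    static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    // Same expression as in the Hashtable constructor.
    static int threshold(int capacity, float loadFactor) {
        return (int) Math.min(capacity * loadFactor, MAX_ARRAY_SIZE + 1);
    }

    // Assumed growth rule of rehash(): double the capacity and add one.
    static int nextCapacity(int oldCapacity) {
        return (oldCapacity << 1) + 1;
    }

    public static void main(String[] args) {
        int capacity = 11;        // default capacity of new Hashtable<>()
        float loadFactor = 0.75f; // default load factor
        System.out.println(threshold(capacity, loadFactor)); // 8
        capacity = nextCapacity(capacity);
        System.out.println(capacity);                        // 23
        System.out.println(threshold(capacity, loadFactor)); // 17
    }
}
```

With the defaults, the table holds 8 entries before the first rehash, then grows to 23 slots with a threshold of 17.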

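(2) The object and the hashcode algorithm

Object: <K key, V value>. The key serves as the lookup key: its hashcode is hashed to a position in table, and the <key, value> entry is then stored there.

The default hashCode implementation in Object is a native method, so its source cannot be read in the Java code. The override in String is a useful reference instead: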

public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

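Note: String objects with the same value must have the same hashcode. This holds because the value is computed only from the characters. The auxiliary field hash is a cache: 0 means the hashcode has not been computed yet, and a String created with new String(str) takes over the cached value from str, so the loop does not have to run again.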
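As a quick check of that formula, the following Java sketch recomputes the 31-based hash without the cache and compares it with String.hashCode(). StringHashDemo and polyHash are hypothetical names used only here.

```java
final class StringHashDemo {
    // Recomputes h = 31 * h + c over all characters, without any caching.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("ab"));   // 3105
        System.out.println("ab".hashCode());  // 3105
        // A copy made with new String(...) has the same hashcode as the original.
        System.out.println(new String("ab").hashCode() == "ab".hashCode()); // true
    }
}
```

Unlike the masked C versions earlier, this value can be negative for longer strings because int arithmetic overflows, which is why Hashtable clears the sign bit before taking the remainder.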

(3) Insertion: hashing algorithm and collision strategy

public synchronized V put(K key, V value) {
        // Make sure the value is not null
        if (value == null) {
            throw new NullPointerException();
        }

        // Makes sure the key is not already in the hashtable.
        Entry<?,?> tab[] = table;
        int hash = key.hashCode();
        int index = (hash & 0x7FFFFFFF) % tab.length;
        @SuppressWarnings("unchecked")
        Entry<K,V> entry = (Entry<K,V>)tab[index];
        for(; entry != null ; entry = entry.next) {
            if ((entry.hash == hash) && entry.key.equals(key)) {
                V old = entry.value;
                entry.value = value;
                return old;
            }
        }

        addEntry(hash, key, value, index);
        return null;
    }

private void addEntry(int hash, K key, V value, int index) {
        modCount++;

        Entry<?,?> tab[] = table;
        if (count >= threshold) {
            // Rehash the table if the threshold is exceeded
            rehash();

            tab = table;
            hash = key.hashCode();
            index = (hash & 0x7FFFFFFF) % tab.length;
        }

        // Creates the new entry.
        @SuppressWarnings("unchecked")
        Entry<K,V> e = (Entry<K,V>) tab[index];
        tab[index] = new Entry<>(hash, key, value, e);
        count++;
    }

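Hashing algorithm: the division (remainder) method, index = (hash & 0x7FFFFFFF) % tab.length. The mask clears the sign bit so that the index is never negative.

Collision strategy: chaining. addEntry puts the new entry at the head of the chain in its slot.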
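The role of the mask is easiest to see with negative hashcodes. The Java sketch below isolates the index expression from put; IndexDemo and indexFor are hypothetical names, and the table length 11 is only an example value.

```java
final class IndexDemo {
    // Same expression as in Hashtable.put: clear the sign bit, then take the remainder.
    static int indexFor(int hash, int tableLength) {
        return (hash & 0x7FFFFFFF) % tableLength;
    }

    public static void main(String[] args) {
        System.out.println(-1 % 11);                          // -1, not usable as an array index
        System.out.println(indexFor(-1, 11));                 // 1
        System.out.println(Math.abs(Integer.MIN_VALUE) % 11); // -2, Math.abs overflows here
        System.out.println(indexFor(Integer.MIN_VALUE, 11));  // 0
    }
}
```

A plain hash % length can be negative in Java, and Math.abs does not help for Integer.MIN_VALUE. Masking with 0x7FFFFFFF gives a non-negative value for every int.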

(4) Lookup and retrieval

public synchronized boolean containsKey(Object key) {
        Entry<?,?> tab[] = table;
        int hash = key.hashCode();
        int index = (hash & 0x7FFFFFFF) % tab.length;
        for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
            if ((e.hash == hash) && e.key.equals(key)) {
                return true;
            }
        }
        return false;
    }

@SuppressWarnings("unchecked")
    public synchronized V get(Object key) {
        Entry<?,?> tab[] = table;
        int hash = key.hashCode();
        int index = (hash & 0x7FFFFFFF) % tab.length;
        for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
            if ((e.hash == hash) && e.key.equals(key)) {
                return (V)e.value;
            }
        }
        return null;
    }
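From the caller's side, these methods behave as in the short Java example below. HashtableUsage and sample are hypothetical names; the Hashtable calls themselves are the standard java.util API.

```java
import java.util.Hashtable;

final class HashtableUsage {
    // Builds a small table to experiment with.
    static Hashtable<String, Integer> sample() {
        Hashtable<String, Integer> ages = new Hashtable<>();
        ages.put("alice", 30);
        ages.put("bob", 25);
        return ages;
    }

    public static void main(String[] args) {
        Hashtable<String, Integer> ages = sample();
        System.out.println(ages.get("alice"));         // 30
        System.out.println(ages.containsKey("carol")); // false
        System.out.println(ages.get("carol"));         // null, the key is absent
        System.out.println(ages.put("bob", 26));       // 25, the value that was replaced
    }
}
```

Since put rejects null values, a null result from get always means that the key is absent. A null key fails as well, because key.hashCode() is called on it directly.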