String Hashing: Basics and Applications

Shmilky

Tags: string hashing

Article link: https://blog.csdn.net/weixin_43870697/article/details/99684095

Contents

String Hashing

Example Problems

A:POJ-3461 Oulipo

B:POJ-2406 Power Strings

C:POJ-2752 Seek the Name, Seek the Fame

D:HDU-1880 魔咒词典 (Spell Dictionary)

E:POJ-1743 Musical Theme

F:SCU-4438 Censor

G:HDU-1280 前m大的数 (The m Largest Numbers)

H:HDU-1496 Equations
String Hashing

1. Introduction

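A hash algorithm uses a hash function H to map one kind of data (such as a string) to another (usually an integer). Some problems can be solved with a map, but once the input gets large, string hashing becomes necessary.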

2. String Hashing

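Finding the positions or the number of occurrences of a pattern T (length m) in a text S (length n) is the string matching problem. The naive (brute-force) algorithm tries every starting position and spends O(m) on each comparison, for O(nm) in total. String matching can of course be solved with KMP, but this section introduces string hashing instead.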

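String hashing converts each string into a number. We then scan the text and check whether the hash of the length-m substring starting at position i equals the hash of the pattern. Each check takes O(1), so the whole scan takes O(n). The question is how to precompute the hash values.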

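Choose two coprime constants base and mod. Let the pattern be T = abcdefg...z (this does not mean T has only 26 characters; a, b, ..., z simply stand for the characters of T in order). Its hash is H(T) = (a*base^(m-1) + b*base^(m-2) + c*base^(m-3) + ... + z) % mod. This treats the string as a number written in radix base, so for each problem base must be larger than the value of every digit (otherwise two different strings can produce the same number), just as every digit of a decimal number is less than 10.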

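For example, for the string C = "ABDB", H(C) = ('A'*base^3 + 'B'*base^2 + 'D'*base + 'B') % mod, with the first character carrying the highest power as in the formula above. (I usually use the characters' ASCII codes directly; mapping 'A' to 1 also works.)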
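The formula can be evaluated left to right with Horner's rule. The following is a minimal sketch, not code from the original article: the names `poly_hash`, `BASE` and `MOD` and the concrete constants are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Illustrative constants (assumptions): a base larger than any ASCII code
// and a large prime modulus.
const std::uint64_t BASE = 131;
const std::uint64_t MOD = 1000000007ULL;

// H(T) = (T[0]*BASE^(m-1) + T[1]*BASE^(m-2) + ... + T[m-1]) % MOD,
// computed with Horner's rule: h = h*BASE + next character.
std::uint64_t poly_hash(const std::string& t) {
    std::uint64_t h = 0;
    for (unsigned char c : t) {
        assert(c < BASE);           // every "digit" must be smaller than the base
        h = (h * BASE + c) % MOD;   // h < MOD, so h * BASE + c fits in 64 bits
    }
    return h;
}
```

With these constants, `poly_hash("ABDB")` is 65*131^3 + 66*131^2 + 68*131 + 66 = 147267515, which is still below the modulus.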

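How do we compare the hash of the length-m substring starting at position i with the hash of the pattern? Let H(i) be the hash of the first i characters, with H(0) = 0. The hash of the substring from position i to position j (1 <= i <= j) is then H(j) - H(i-1)*base^(j-i+1), taken modulo mod, because H(i-1)*base^(j-i+1) is exactly what the first i-1 characters contribute to H(j). With this formula we precompute an array H[i], the hash of the first i characters, and an array power[i] = base^i. Including the comparisons, the total time is O(n+m).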
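A minimal sketch of the two precomputed arrays and the substring query, using an explicit modulus. The struct name `PrefixHash`, the method `get` and the constants are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative constants (assumptions, not from the article).
const std::uint64_t PH_BASE = 131;
const std::uint64_t PH_MOD = 1000000007ULL;

// Prefix hashes with 1-based positions: H[i] is the hash of the first i characters.
struct PrefixHash {
    std::vector<std::uint64_t> H, power;

    explicit PrefixHash(const std::string& s)
        : H(s.size() + 1, 0), power(s.size() + 1, 1) {
        for (std::size_t i = 1; i <= s.size(); ++i) {
            H[i] = (H[i - 1] * PH_BASE + static_cast<unsigned char>(s[i - 1])) % PH_MOD;
            power[i] = power[i - 1] * PH_BASE % PH_MOD;
        }
    }

    // Hash of positions i..j (1-based, inclusive):
    // (H(j) - H(i-1) * base^(j-i+1)) mod PH_MOD.
    std::uint64_t get(std::size_t i, std::size_t j) const {
        assert(1 <= i && i <= j && j < H.size());
        std::uint64_t sub = H[i - 1] * power[j - i + 1] % PH_MOD;
        return (H[j] + PH_MOD - sub) % PH_MOD;   // add PH_MOD first so the result never goes negative
    }
};
```

For the string "abcabc", `get(1, 3)` and `get(4, 6)` return the same value because both ranges spell "abc".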

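In practice we can rely on the natural overflow of an unsigned type (I usually use unsigned long long). Arithmetic then wraps around modulo 2^64, so no explicit % mod is needed, and subtraction needs no special handling either.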
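The natural-overflow variant can be sketched as follows for counting pattern occurrences. The function name `count_occurrences` and the base 131 are assumptions for illustration; the example problems below use their own variable names and a suffix-based layout.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef unsigned long long ull;

// Count (possibly overlapping) occurrences of pattern t in text s.
// All arithmetic is on unsigned 64-bit values, so it wraps modulo 2^64.
int count_occurrences(const std::string& t, const std::string& s) {
    const ull base = 131;   // illustrative base, larger than any ASCII code
    std::size_t n = s.size(), m = t.size();
    if (m == 0 || m > n) return 0;

    std::vector<ull> H(n + 1, 0), power(n + 1, 1);
    for (std::size_t i = 1; i <= n; ++i) {
        H[i] = H[i - 1] * base + static_cast<unsigned char>(s[i - 1]);
        power[i] = power[i - 1] * base;
    }
    ull ht = 0;   // hash of the pattern
    for (std::size_t i = 0; i < m; ++i)
        ht = ht * base + static_cast<unsigned char>(t[i]);

    int count = 0;
    for (std::size_t i = 1; i + m - 1 <= n; ++i) {
        // Hash of s[i..i+m-1] (1-based); the subtraction wraps correctly
        // without any "+ mod" fix-up.
        if (H[i + m - 1] - H[i - 1] * power[m] == ht) ++count;
    }
    return count;
}
```

Equal hashes are treated as a match here without comparing the characters, which is the usual contest shortcut; a collision would produce a false positive.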

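Hashes can still collide. A base larger than the alphabet size keeps collisions rare; I usually take 131 or 233317. To reduce the risk further, use double hashing: compute two hashes with two different mod values and treat two strings as equal only if both hashes match.
See the example problems for concrete code.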
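Double hashing can be sketched like this. The function name `double_hash` and both (base, modulus) pairs are assumptions for illustration; the article only asks for two different mod values, and using two different bases as well is an extra choice made here.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Returns a pair of independent polynomial hashes of s.
// The concrete constants are illustrative assumptions.
std::pair<std::uint64_t, std::uint64_t> double_hash(const std::string& s) {
    const std::uint64_t base1 = 131, mod1 = 1000000007ULL;
    const std::uint64_t base2 = 233317, mod2 = 998244353ULL;
    std::uint64_t h1 = 0, h2 = 0;
    for (unsigned char c : s) {
        h1 = (h1 * base1 + c) % mod1;   // both products stay far below 2^64
        h2 = (h2 * base2 + c) % mod2;
    }
    // Two strings are treated as equal only if both components match.
    return std::make_pair(h1, h2);
}
```

Since `std::pair` supports `==` and `<`, the result can be compared directly or used as a key in `std::map`.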

Example Problems

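A:POJ-3461 Oulipo: This problem also appeared in my KMP write-up. It is a basic exercise: count how many times the pattern occurs in the text. A straightforward hash does the job. Code: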

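#include <cstdio>
#include <cstring>

using namespace std;
typedef unsigned long long ll;   // unsigned: overflow wraps modulo 2^64
const int base = 31;
const int maxn = 1000050;

char sub[maxn],str[maxn];        // sub: pattern, str: text
ll xp[maxn];                     // xp[i] = base^i
ll hs[maxn];                     // hs[i] = hash of the suffix str[i..n-1]; not named "hash", which can clash with std::hash

int main()
{
    int T,i;
    scanf("%d",&T);

    xp[0]=1;
    for(i=1;i<maxn;i++)
        xp[i]=xp[i-1]*base;

    while(T--)
    {
        // scanf writes the terminating '\0' itself, so the buffers need no memset
        scanf("%s",sub);
        scanf("%s",str);
        int L=strlen(sub);
        int n=strlen(str);

        // hash of the pattern, built from the last character to the first;
        // the input is uppercase, so 'A'..'Z' map to 1..26
        ll sub_num=0;
        for(i=L-1;i>=0;i--)
        {
            sub_num=sub_num*base+(sub[i]-'A')+1;
        }

        // suffix hashes of the text, built the same way
        hs[n]=0;
        for(i=n-1;i>=0;i--)
        {
            hs[i]=hs[i+1]*base+(str[i]-'A')+1;
        }

        int ans=0;
        for(i=0;i<=n-L;i++)     // mind the bound: i<=n-L, i.e. i<n-L+1
        {
            // hs[i]-hs[i+L]*xp[L] is the hash of str[i..i+L-1]
            if(sub_num==hs[i]-hs[i+L]*xp[L])
                ans++;
        }
        printf("%d\n",ans);
    }
    return 0;
}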

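B:POJ-2406 Power Strings: Given several strings, determine for each one the largest number of copies of a single substring that concatenate to the whole string. For example, ababab consists of at most 3 copies of ab. The hash approach enumerates the length of the repeating block and checks each candidate. See the code: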

#include <iostream>
#include <cstring>
#include <cstdio>
using namespace std;
typedef unsigned long long ll;
const int maxn = 1e6 + 100;
const int temp = 31;
char s[maxn];
ll f[maxn], Hash[maxn];

int main() {
    f[0] = 1;
    for (int i = 1; i < maxn; i++)
        f[i] = f[i - 1] * temp;
    while (scanf("%s",s) == 1) {   // == 1 so that EOF (-1) also ends the loop
        if (s[0] == '.') break;
        int l = strlen(s);
        Hash[l] = 0;
        for (int i = l - 1; i >= 0; i--) {
            Hash[i] = Hash[i + 1] * temp + (s[i] - 'a') + 1;
        }
        int ans = 0;
        for (int i = 1; i <= l; i++) {          // i: candidate block length
            if (l % i != 0) continue;
            ll ha = Hash[0] - Hash[i] * f[i];    // hash of the first block s[0..i-1]
            int k = 0;
            for (k = i; k < l; k = k + i) {      // compare every later block with the first one
                if (ha != Hash[k] - Hash[k + i] * f[i]) break;
            }
            if (k == l) {
                ans = l / i;
                break;
            }
        }
        printf("%d\n",ans);
    }
    return 0;
}
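As a variant of the block-by-block check above, a string of length n has period p exactly when its first n-p characters equal its last n-p characters, so each divisor p can be tested with a single hash comparison. The sketch below assumes this observation and uses a hypothetical helper name `max_repetitions`; it is not the author's code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef unsigned long long ull;

// Hypothetical helper: largest k such that s is k copies of one block.
int max_repetitions(const std::string& s) {
    const ull base = 131;   // illustrative base
    std::size_t n = s.size();
    if (n == 0) return 0;

    // H[i] = hash of the first i characters (wraps modulo 2^64).
    std::vector<ull> H(n + 1, 0), power(n + 1, 1);
    for (std::size_t i = 1; i <= n; ++i) {
        H[i] = H[i - 1] * base + static_cast<unsigned char>(s[i - 1]);
        power[i] = power[i - 1] * base;
    }

    for (std::size_t p = 1; p <= n; ++p) {   // p: candidate block length
        if (n % p != 0) continue;
        std::size_t len = n - p;
        ull prefix = H[len];                       // hash of s[0..n-p-1]
        ull suffix = H[n] - H[p] * power[len];     // hash of s[p..n-1]
        if (prefix == suffix) return static_cast<int>(n / p);
    }
    return 1;   // not reached: p == n always matches
}
```

The smallest matching p gives the largest number of copies, so the loop returns at the first hit.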

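C:POJ-2752 Seek the Name, Seek the Fame: Given a string s, find every length at which the prefix of s equals the suffix of s. Approach: enumerate the length and check whether the hashes of the prefix and the suffix agree.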

#include <cstdio>
#include <cstring>
using namespace std;
typedef unsigned long long ll;   // unsigned: overflow wraps modulo 2^64 (signed overflow is undefined)
const int N=400005;
const int seed=31;
ll Hash[N];   // Hash[i] = hash of the first i characters, Hash[0] = 0
ll f[N];      // f[i] = seed^i
char s[N];
int sl;
int main()
{
    while(~scanf("%s",s))
    {
        sl=strlen(s);
        f[0]=1;
        for(int i=1;i<=sl;i++) f[i]=f[i-1]*seed;
        Hash[0]=0;
        for(int i=1;i<=sl;i++) Hash[i]=Hash[i-1]*seed+(s[i-1]-'a'+1);
        bool flag=true;
        for(int i=1;i<=sl;i++)   // i: length of the prefix and suffix being compared
        {
            // Hash[sl]-Hash[sl-i]*f[i] is the hash of the last i characters;
            // with 1-based prefix hashes no index ever drops below 0
            if(Hash[i]==Hash[sl]-Hash[sl-i]*f[i])
            {
               if(flag)
               {
                   printf("%d",i);
                   flag=false;
               }
               else printf(" %d",i);
            }
        }
        printf("\n");
    }
    return 0;
}
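The same prefix-versus-suffix comparison can be packaged as a function that returns the matching lengths. This is a sketch under the usual contest assumption that equal hashes mean equal strings; the name `border_lengths` is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef unsigned long long ull;

// Hypothetical helper: all k (in increasing order) such that the first k
// characters of s equal the last k characters of s.
std::vector<int> border_lengths(const std::string& s) {
    const ull base = 131;   // illustrative base
    std::size_t n = s.size();
    std::vector<ull> H(n + 1, 0), power(n + 1, 1);
    for (std::size_t i = 1; i <= n; ++i) {
        H[i] = H[i - 1] * base + static_cast<unsigned char>(s[i - 1]);
        power[i] = power[i - 1] * base;
    }

    std::vector<int> result;
    for (std::size_t k = 1; k <= n; ++k) {
        ull prefix = H[k];                          // hash of s[0..k-1]
        ull suffix = H[n] - H[n - k] * power[k];    // hash of s[n-k..n-1]
        if (prefix == suffix) result.push_back(static_cast<int>(k));
    }
    return result;
}
```

The full length n always appears in the result, since the whole string is both a prefix and a suffix of itself.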

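D:HDU-1880 魔咒词典 (Spell Dictionary): One of Harry Potter's required courses at the school of magic is learning spells. The wizarding world is said to have 100000 different spells, and Harry can hardly remember them all, yet to fight powerful enemies he must be able to call up any spell he needs at a critical moment, so he needs your help. You are given a spell dictionary. When Harry hears a spell, your program must tell him what the spell does; when Harry needs a certain function but does not know which spell to use, your program must find the matching spell for him. If what he asks for is not in the dictionary, output "what?".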

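#include <cstdio>
#include <cstring>
#include <algorithm>
using namespace std;
const int maxn=1e5+10;
char str1[maxn][30],str2[maxn][100];   // str1: spell with its brackets, str2: its function
char line[200];                        // current input line
int len=0;
struct node
{
    unsigned has;   // hash value; unsigned so that overflow wraps instead of being undefined
    int i;          // index of the entry in str1/str2
    bool friend operator<(node a,node b)
    {
        return a.has<b.has;
    }
}cnt1[maxn],cnt2[maxn];   // cnt1 holds the spell hashes, cnt2 the function hashes
unsigned gethash(const char *str)   // hash of one string
{
    unsigned sum=0,seed=131;
    for(int i=0;str[i];i++)
        sum=sum*seed+(unsigned char)str[i];
    return sum;
}
bool readline()   // read one line into line[] without the line break (gets was removed in C++14)
{
    if(!fgets(line,sizeof(line),stdin)) return false;
    line[strcspn(line,"\r\n")]='\0';
    return true;
}
void solve()   // hash every spell and every function
{
    for(int i=0;i<len;i++)
    {
        cnt1[i].has=gethash(str1[i]);
        cnt2[i].has=gethash(str2[i]);
        cnt1[i].i=cnt2[i].i=i;
    }
    sort(cnt1,cnt1+len);   // sorted so that we can binary search below
    sort(cnt2,cnt2+len);
}
int main()
{
    while(readline()&&strcmp(line,"@END@"))
    {
        char *p=strchr(line,']');   // split "[spell] function" at the closing bracket, so a spell may contain spaces
        if(!p) continue;
        int sl=p-line+1;
        memcpy(str1[len],line,sl);
        str1[len][sl]='\0';
        strcpy(str2[len],p[1]?p+2:p+1);   // skip "] "
        len++;
    }
    solve();
    int n=0;
    if(readline()) sscanf(line,"%d",&n);
    for(int i=0;i<n&&readline();i++)
    {
        node temp;
        temp.has=gethash(line);
        bool found=false;
        if(line[0]=='[')
        {
            // the query is a spell: search the spell table
            int pos=lower_bound(cnt1,cnt1+len,temp)-cnt1;
            // walk over the entries with this hash and confirm with strcmp in case of a collision
            for(;pos<len&&cnt1[pos].has==temp.has&&!found;pos++)
                if(!strcmp(str1[cnt1[pos].i],line))
                {
                    printf("%s\n",str2[cnt1[pos].i]);   // the spell exists: print its function
                    found=true;
                }
        }
        else
        {
            // the query is a function: same idea on the function table
            int pos=lower_bound(cnt2,cnt2+len,temp)-cnt2;
            for(;pos<len&&cnt2[pos].has==temp.has&&!found;pos++)
                if(!strcmp(str2[cnt2[pos].i],line))
                {
                    const char *sp=str1[cnt2[pos].i];
                    // print the spell without its brackets, leaving the stored string untouched
                    printf("%.*s\n",(int)strlen(sp)-2,sp+1);
                    found=true;
                }
        }
        if(!found) printf("what?\n");
    }
    return 0;
}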

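E:POJ-1743 Musical Theme: Given a sequence of notes, find the longest theme: a segment that occurs at least twice, where the second occurrence may be a transposition of the first (the same constant added to every note) and the two occurrences must not overlap. A theme must be at least 5 notes long. Taking the differences between adjacent notes turns transposed occurrences into identical substrings, so the code binary searches the length and uses a hash table to look for a repeated window.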

#include <cstdio>
#include <cstring>
#include <iostream>
using namespace std;
typedef unsigned long long ulint;
const ulint seed = 30007uLL;
#define maxn 20020
#define mod 100003
ulint H[maxn], xp[maxn];
int s[maxn], N;

void initHash()
{
    H[0] = s[0];
    for(int i = 1; i < N; i++)
        H[i] = H[i - 1]*seed + s[i];
}

ulint askHash(int l, int r)
{
    if(l == 0) return H[r];
    return H[r] - H[l - 1]*xp[r - l + 1];
}



ulint h[mod];
int bg[mod], nx[mod], pos[mod];

bool check(int len)
{
    memset(bg, 0, sizeof(bg));

    ulint ht;
    int e = 0;

    for(int i = 0, l, r; i + len - 1 < N; i++)
    {
        l = i, r = i + len - 1;
        ht = askHash(l, r);

        for(int p = bg[ht % mod]; p; p = nx[p])
            if(h[p] == ht && i - pos[p] > len)   // > len: a window of len differences covers len+1 notes, and the notes must not overlap
                return true;

        h[++e] = ht;           //chained hash table: worth memorizing these arrays as a template
        nx[e] = bg[ht % mod];
        bg[ht % mod] = e;
        pos[e] = i;
    }

    return false;
}

int main()
{
    xp[0] = 1;
    for(int i = 1; i < maxn; i++)
    {
        xp[i] = xp[i-1] * seed;
    }

    while(scanf("%d", &N) && N)
    {
        for(int i = 0; i < N; i++)
        {
            scanf("%d", &s[i]);
        }
        N--;
        for(int i = 0; i < N; i++)
        {
            s[i] = 89 + s[i+1] - s[i];//+89 keeps every difference positive
        }

        initHash();

        int ans = 0, l = 1, r = N / 2, m;
        while(l <= r)
        {
            m = (l + r) >> 1;

            if(check(m))
            {
                l = m + 1;
                ans = m;
            }
            else r = m - 1;
        }
        cout << ((ans >= 4)? ans+1: 0) << endl;  //ans counts differences: 4 equal differences mean a theme of 5 notes
    }
    return 0;
}
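The binary search plus hashing idea can also be sketched with `std::unordered_map` in place of the hand-written chained table. This is an illustrative alternative, not the author's code: the function name `longest_theme` and the base 233 are assumptions, and the +89 offset assumes notes in the range 1..88, as the offset in the code above suggests.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

typedef unsigned long long ull;

// Hypothetical helper: length (in notes) of the longest theme that occurs
// twice without overlapping, allowing transposition; 0 if shorter than 5.
int longest_theme(const std::vector<int>& notes) {
    if (notes.size() < 10) return 0;   // two disjoint themes of 5 notes need 10 notes
    const ull base = 233;              // illustrative base, larger than any shifted difference
    int n = static_cast<int>(notes.size()) - 1;   // number of differences

    // H[i] = hash of the first i differences (wraps modulo 2^64).
    std::vector<ull> H(n + 1, 0), power(n + 1, 1);
    for (int i = 1; i <= n; ++i) {
        ull d = static_cast<ull>(notes[i] - notes[i - 1] + 89);
        H[i] = H[i - 1] * base + d;
        power[i] = power[i - 1] * base;
    }

    // True if two equal windows of len differences exist whose notes do not overlap.
    auto ok = [&](int len) {
        std::unordered_map<ull, int> first;   // hash -> earliest start index
        for (int i = 0; i + len <= n; ++i) {
            ull h = H[i + len] - H[i] * power[len];
            std::unordered_map<ull, int>::iterator it = first.find(h);
            if (it == first.end()) first[h] = i;
            else if (i - it->second > len) return true;   // len+1 notes each, no shared note
        }
        return false;
    };

    int lo = 4, hi = n / 2, ans = 0;   // at least 4 equal differences are required
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (ok(mid)) { ans = mid; lo = mid + 1; }
        else hi = mid - 1;
    }
    return ans ? ans + 1 : 0;
}
```

Storing only the earliest start index per hash is enough, because the earliest occurrence gives the largest possible gap to any later window.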

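F:SCU-4438 Censor: Given a string A and a string B, whenever A occurs in B, delete that occurrence from B; repeat until A no longer occurs, then output what is left of B. Note that a deletion can create a new occurrence: from aaabcbc, abc can be removed twice, leaving a. The code keeps the surviving characters on a stack together with their prefix hashes and pops len(A) characters whenever the top of the stack hashes to A. Code: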

#include <iostream>
#include <cstdio>
#include <cstring>
#include <cmath>
#include <string>
#include <algorithm>
using namespace std;
typedef unsigned long long ll;   // unsigned: overflow wraps modulo 2^64 (signed overflow is undefined)
const int N=5e6+10;
const int mod = (1 << 15) - 1;
const int seed=31;
ll Hash1,Hash2[N];
ll f[N];
char ans[N];
char s1[N],s2[N];
int len1,len2;
int main()
{
    while(~scanf("%s %s",s1,s2))
    {
        int tot=0;
        len1=strlen(s1);
        len2=strlen(s2);
        if(len1>len2){
            cout<<s2<<endl;
            continue;
        }
        Hash1=0;
        f[0]=1;
        for(int i=1;i<=len2;i++) f[i]=f[i-1]*seed;
        for(int i=0;i<len1;i++) Hash1=Hash1*seed+s1[i]-'a'+1;
        Hash2[0]=0;
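        for(int i=0;i<len2;i++)
        {
            ans[tot++]=s2[i];                            // push the next character
            Hash2[tot]=Hash2[tot-1]*seed+s2[i]-'a'+1;    // prefix hash of the characters kept so far
            // if the last len1 kept characters hash to s1, pop them
            if(tot>=len1&&(Hash2[tot]-Hash2[tot-len1]*f[len1]==Hash1))
                tot-=len1;
        }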
        ans[tot]='\0';           // print the result in one call instead of character by character
        printf("%s\n",ans);
    }
    return 0;
}
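The stack-with-prefix-hashes idea can be isolated in a small function. This is a sketch with hypothetical names (`censor`, `word`, `text`) and an illustrative base; like the code above, it trusts the hash and does not compare characters.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef unsigned long long ull;

// Hypothetical helper: repeatedly delete occurrences of word from text,
// including occurrences created by earlier deletions.
std::string censor(const std::string& word, const std::string& text) {
    const ull base = 131;   // illustrative base
    std::size_t m = word.size();
    if (m == 0 || m > text.size()) return text;

    ull hw = 0, pm = 1;     // hw: hash of word, pm: base^m
    for (std::size_t i = 0; i < m; ++i) {
        hw = hw * base + static_cast<unsigned char>(word[i]);
        pm *= base;
    }

    std::string out;              // characters kept so far (used as a stack)
    std::vector<ull> H(1, 0);     // H[k] = hash of the first k kept characters
    for (std::size_t i = 0; i < text.size(); ++i) {
        out.push_back(text[i]);
        H.push_back(H.back() * base + static_cast<unsigned char>(text[i]));
        std::size_t k = out.size();
        // If the last m kept characters hash to word, pop them.
        if (k >= m && H[k] - H[k - m] * pm == hw) {
            out.resize(k - m);
            H.resize(k - m + 1);
        }
    }
    return out;
}
```

Popping both `out` and `H` together keeps the prefix hashes consistent with the surviving characters, so matches that straddle a deleted region are found automatically.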

G:HDU-1280 前m大的数 (The m Largest Numbers): Given n numbers, add every pair of them and output the m largest of the resulting sums:

#include<stdio.h>
#include<cstring>
int main()
{
    int hash[10010];
    int a[3010];
    int n,m;
    int i,j;
    while(scanf("%d %d",&n,&m)!=EOF)
    {
        memset(hash,0,sizeof(hash));
        for(i=0;i<n;i++)
        {
            scanf("%d",&a[i]);
        }
        int xx=0;
        for(i=0;i<n;i++)
        {
            for(j=0;j<i;j++)
            {
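                hash[a[i]+a[j]]++;  //I am not sure why this solution is filed under hashing, perhaps because the array is called hash; the sum is simply used as the array index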
                if(a[i]+a[j]>xx)
                {
                    xx=a[i]+a[j];
                }
            }
        }
        for(i=xx;i>=0;i--)
        {
            while(hash[i]&&m>1)
            {
                printf("%d ",i);
                hash[i]--;
                m--;
            }
            if(m==1&&hash[i])
            {
                printf("%d\n",i);
                break;
            }
        }
    }
    return 0;
}
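The counting-array idea can be written as a function that returns the sums instead of printing them. A minimal sketch, assuming non-negative inputs; the name `top_pair_sums` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: the m largest pairwise sums a[i] + a[j] (i < j),
// in non-increasing order. Assumes every value in a is non-negative.
std::vector<int> top_pair_sums(const std::vector<int>& a, int m) {
    int maxv = 0;
    for (std::size_t i = 0; i < a.size(); ++i) maxv = std::max(maxv, a[i]);

    // cnt[s] = number of pairs whose sum is s; the sum itself is the index.
    std::vector<int> cnt(2 * maxv + 1, 0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < i; ++j)
            ++cnt[a[i] + a[j]];

    std::vector<int> out;
    for (int s = 2 * maxv; s >= 0 && static_cast<int>(out.size()) < m; --s)
        for (int c = cnt[s]; c > 0 && static_cast<int>(out.size()) < m; --c)
            out.push_back(s);
    return out;
}
```

If fewer than m pairs exist, the function simply returns all of the sums.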

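H:HDU-1496 Equations: Given a, b, c, d, consider a*x1^2 + b*x2^2 + c*x3^2 + d*x4^2 = 0, where x1..x4 are nonzero integers in [-100,100] and a, b, c, d are in [-50,50]. Count the solutions of this equation. Approach: rewrite the equation as a*x1^2 + b*x2^2 = -(c*x3^2 + d*x4^2). A first double loop enumerates all (x1, x2) and stores the left-hand values in a hash table. A second double loop enumerates all (x3, x4) and looks up the negated value in the table. Add up the counts and print the total.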

#include <iostream>
#include <stdio.h>
#include <string.h>
 
using namespace std;
 
const int N = 100;
const int N2 = N * N * N;
int sum1[N2 + 1], sum2[N2 + 1];
 
int main()
{
    int a, b, c, d;
    while(~scanf("%d%d%d%d", &a, &b, &c, &d)) {
        if((a > 0 && b > 0 && c > 0 && d > 0) || (a < 0 && b < 0 && c < 0 && d < 0)) {
            printf("0\n");
            continue;
        }
 
        memset(sum1, 0, sizeof(sum1));
        memset(sum2, 0, sizeof(sum2));
 
        int sum = 0;
        for(int i = 1; i <= N; i++)
            for(int j = 1; j <= N; j++) {
                int k = a * i * i + b * j * j;
                if(k >= 0)
                    sum1[k]++;
                else
                    sum2[-k]++;
            }
        for(int i = 1; i <= N; i++)
            for(int j = 1; j <= N; j++) {
                int k = c * i * i + d * j * j;
                if(k > 0)
                    sum += sum2[k];
                else
                    sum += sum1[-k];
            }
 
        // each x may be positive or negative, so every solution found stands for 2^4 solutions
        printf("%d\n", 16 * sum);
    }
 
    return 0;
}
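The same meet-in-the-middle count can be sketched with `std::unordered_map` instead of two fixed-size arrays, at the cost of some speed. The function name `count_solutions` and the extra `limit` parameter (100 in the problem) are assumptions added here so that small cases can be checked by hand.

```cpp
#include <unordered_map>

// Hypothetical helper: number of solutions of
// a*x1^2 + b*x2^2 + c*x3^2 + d*x4^2 = 0 with nonzero |xi| <= limit.
long long count_solutions(int a, int b, int c, int d, int limit = 100) {
    // left[v] = number of positive (x1, x2) pairs with a*x1^2 + b*x2^2 == v
    std::unordered_map<int, int> left;
    for (int i = 1; i <= limit; ++i)
        for (int j = 1; j <= limit; ++j)
            ++left[a * i * i + b * j * j];

    long long sum = 0;
    for (int i = 1; i <= limit; ++i)
        for (int j = 1; j <= limit; ++j) {
            // The right half must cancel the left half exactly.
            std::unordered_map<int, int>::const_iterator it =
                left.find(-(c * i * i + d * j * j));
            if (it != left.end()) sum += it->second;
        }

    // Only positive xi were enumerated; each xi can also be negated.
    return 16 * sum;
}
```

For example, with `limit` set to 1 every xi is +1 or -1, so the equation holds exactly when a + b + c + d = 0, giving 16 solutions in that case and 0 otherwise.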
