1005 Programming Pattern (35 points) (C++)

Programmers often have a preference among program constructs. For example, some may prefer if(0==a), while others may prefer if(!a). Analyzing such patterns can help to narrow down a programmer's identity, which is useful for detecting plagiarism.

Now given some text sampled from someone's program, can you find the person's most commonly used pattern of a specific length?

Input Specification:

Each input file contains one test case. For each case, there is one line consisting of the pattern length N (1≤N≤1048576), followed by one line no less than N and no more than 1048576 characters in length, terminated by a carriage return \n. The entire input is case sensitive.

Output Specification:

For each test case, print in one line the length-N substring that occurs most frequently in the input, followed by a space and the number of times it has occurred in the input. If there are multiple such substrings, print the lexicographically smallest one.

Whitespace characters in the input should be printed as they are. Also note that there may be multiple occurrences of the same substring overlapping each other.

Sample Input 1:

4
//A can can can a can.

Sample Output 1:

 can 4

Sample Input 2:

3
int a=~~~~~~~~~~~~~~~~~~~~~0;

Sample Output 2:

~~~ 19

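Problem summary: given a string and a substring length N, find the length-N substring that occurs most often in the string. If several substrings tie for the highest count, print the lexicographically smallest one. Note that occurrences of a substring may overlap.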
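Before optimizing, it can help to have a brute-force reference to check faster solutions against. The sketch below is not part of the original solutions; the function name `bruteForceMostFrequent` is made up for illustration. It copies every window into a `std::map`, so it is only suitable for short inputs, not for the full 1048576-character limit.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Brute-force reference (hypothetical helper): count every length-len window
// in an ordered map and return {substring, count}.
std::pair<std::string, int> bruteForceMostFrequent(const std::string& text, int len) {
    assert(len >= 1 && static_cast<std::size_t>(len) <= text.size());
    std::map<std::string, int> freq;
    for (std::size_t i = 0; i + len <= text.size(); ++i)
        ++freq[text.substr(i, len)];          // overlapping windows are counted separately
    std::pair<std::string, int> best{"", 0};
    for (const auto& kv : freq)               // std::map iterates keys in ascending order
        if (kv.second > best.second)          // strict '>' keeps the smallest key on ties
            best = {kv.first, kv.second};
    return best;
}
```

Calling `bruteForceMostFrequent("//A can can can a can.", 4)` on the first sample returns the pair `{" can", 4}`.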

Suffix array

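For background on how suffix arrays work, see the posts "Suffix Arrays Explained", "Understand Suffix Arrays in Five Minutes", and "Suffix Arrays: A Powerful Tool for String Processing" (highly recommended!).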

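Sorting all suffixes lexicographically yields three key arrays:
sa[i]: the starting index of the suffix with rank i.
rank[i]: the rank of the suffix that starts at index i.
height[i]: the length of the longest common prefix of Suffix[sa[i]] and Suffix[sa[i-1]], that is, of two suffixes adjacent in rank.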
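To make the three arrays concrete, here is a minimal sketch that builds them the slow way, by sorting suffixes directly. The struct `SuffixInfo` and the function `buildNaive` are hypothetical names used only for this illustration, and the O(n^2 log n) cost makes it suitable for small strings only.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

struct SuffixInfo {                      // hypothetical helper type for this illustration
    std::vector<int> sa, rank, height;
};

// Naive construction: sort suffix start indices by comparing the suffixes themselves.
SuffixInfo buildNaive(const std::string& s) {
    assert(!s.empty());
    const int n = static_cast<int>(s.size());
    SuffixInfo r;
    r.sa.resize(n);
    r.rank.resize(n);
    r.height.assign(n, 0);               // height[0] stays 0: rank 0 has no predecessor
    std::iota(r.sa.begin(), r.sa.end(), 0);
    std::sort(r.sa.begin(), r.sa.end(), [&](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    for (int i = 0; i < n; ++i) r.rank[r.sa[i]] = i;   // rank is the inverse of sa
    for (int i = 1; i < n; ++i) {
        int a = r.sa[i - 1], b = r.sa[i], k = 0;
        while (a + k < n && b + k < n && s[a + k] == s[b + k]) ++k;
        r.height[i] = k;                 // LCP of the suffixes ranked i-1 and i
    }
    return r;
}
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so `sa` is {5, 3, 1, 0, 4, 2}, `rank` is {3, 2, 5, 1, 4, 0}, and `height` is {0, 1, 3, 0, 0, 2}.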

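Here the suffix array is used to find the length-len substring that occurs most often and, among ties, is lexicographically smallest. Looking at the height array: if k consecutive entries satisfy height[i] >= len, then the k+1 suffixes involved all share the same prefix of length len, so that substring occurs k+1 times. The code tracks the size of the current run in tempcnt and the best count so far in cnt. Because the scan runs from front to back over suffixes that are already in lexicographic order, updating only when tempcnt is strictly greater than cnt means that, on a tie, the lexicographically smallest substring is kept.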
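The scan described above can be sketched in isolation as follows. This is an illustration under stated assumptions, not the author's code: `scanHeight` and `mostFrequent` are hypothetical names, and the driver builds `sa` and `height` naively so the block stays self-contained (fine for short strings only).

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Scan the height array for the longest run of entries >= len.
// Returns {start index of the winning substring, its occurrence count}.
std::pair<int, int> scanHeight(const std::vector<int>& sa,
                               const std::vector<int>& height, int len) {
    const int n = static_cast<int>(sa.size());
    int bestPos = -1, bestCnt = 0, run = 0;
    for (int i = 0; i < n; ++i) {
        // A run continues while adjacent suffixes share at least len characters.
        run = (i > 0 && height[i] >= len) ? run + 1 : 1;
        // Only suffixes with at least len characters contain a length-len substring.
        if (sa[i] + len <= n && run > bestCnt) {   // strict '>' keeps the smallest on ties
            bestCnt = run;
            bestPos = sa[i];
        }
    }
    return {bestPos, bestCnt};
}

// Hypothetical driver: naive sa/height construction followed by the scan.
std::pair<std::string, int> mostFrequent(const std::string& s, int len) {
    assert(len >= 1 && len <= static_cast<int>(s.size()));
    const int n = static_cast<int>(s.size());
    std::vector<int> sa(n), height(n, 0);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    for (int i = 1; i < n; ++i) {
        int k = 0;
        while (sa[i - 1] + k < n && sa[i] + k < n && s[sa[i - 1] + k] == s[sa[i] + k]) ++k;
        height[i] = k;
    }
    std::pair<int, int> r = scanHeight(sa, height, len);
    return {s.substr(r.first, len), r.second};
}
```

Starting the scan at i = 0 matters when the only suffix long enough is the one with rank 0, for example when len equals the length of the whole string.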

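#include <bits/stdc++.h>
using namespace std;
const int maxn = 1048576+50;
string str;
int n, m, Len;
int s[maxn]={}, sa[maxn]={}, c[maxn]={}, tax[maxn]={}, tp[maxn]={}, Rank[maxn]={}, height[maxn]={};
// Build sa by prefix doubling; every round is a two-key radix sort.
void getSA(){
    m = 256;    // number of distinct keys in the first round (one byte per character)
    int *x = tax, *y = tp;    // x: rank of each suffix, y: suffixes ordered by second key
    for(int i = 0; i < m; ++ i) c[i] = 0;
    for(int i = 0; i < n; ++ i) ++c[x[i]=s[i]];
    for(int i = 1; i < m; ++ i) c[i] += c[i-1];
    for(int i = n-1; i >= 0; -- i) sa[--c[x[i]]] = i;
    for(int k = 1; k <= n; k <<= 1){
        int p = 0;
        // Suffixes whose second half is empty come first in second-key order.
        for(int i = n-k; i < n; ++ i) y[p++] = i;
        for(int i = 0; i < n; ++ i) if(sa[i] >= k) y[p++] = sa[i]-k;
        // Stable counting sort by first key.
        for(int i = 0; i < m; ++ i) c[i] = 0;
        for(int i = 0; i < n; ++ i) ++c[x[y[i]]];
        for(int i = 1; i < m; ++ i) c[i] += c[i-1];
        for(int i = n-1; i >= 0; -- i) sa[--c[x[y[i]]]] = y[i];
        swap(x, y);    // y now holds the ranks of the previous round
        p = 1;
        x[sa[0]] = 0;
        for(int i = 1; i < n; ++ i){
            // A second half that starts past the end counts as -1. This avoids reading
            // y beyond index n-1 and never confuses "no second half" with rank 0.
            int a = sa[i-1]+k < n ? y[sa[i-1]+k] : -1;
            int b = sa[i]+k < n ? y[sa[i]+k] : -1;
            x[sa[i]] = (y[sa[i-1]]==y[sa[i]] && a==b) ? p-1 : p++;
        }
        if(p >= n) break;    // all ranks are distinct, the order is final
        m = p;
    }
}
// Compute height from sa and Rank, reusing the previous match length k.
void getHeight(){
    int k = 0;
    for(int i = 0; i < n; ++ i)
        Rank[sa[i]] = i;
    for(int i = 0; i < n; ++ i){
        if(Rank[i] == 0){    // the smallest suffix has no predecessor; sa[-1] must not be read
            height[0] = k = 0;
            continue;
        }
        if(k)  k--;
        int j = sa[Rank[i]-1];
        while(j+k<n && i+k<n && s[i+k] == s[j+k]) ++k;
        height[Rank[i]] = k;
    }
}
int main(){
    scanf("%d", &Len);
    // Skip only the rest of the first line. A "\n" in the scanf format would also
    // swallow leading whitespace of the text line, which must be kept.
    int ch;
    while((ch = getchar()) != '\n' && ch != EOF);
    getline(cin, str);
    n = str.length();
    for(int i = 0; i < n; ++ i)
        s[i] = (unsigned char)str[i];
    getSA();
    getHeight();
    int p = 0, cnt = 0, tempcnt = 0;
    // Start at i = 0 so that the suffix with rank 0 is counted as well.
    for(int i = 0; i < n; ++ i){
        if(i > 0 && height[i] >= Len)
            ++ tempcnt;      // same length-Len prefix as the previous suffix
        else
            tempcnt = 1;     // a new group starts here
        if(sa[i]+Len <= n && cnt < tempcnt){
            cnt = tempcnt;
            p = sa[i];
        }
    }
    printf("%s %d", str.substr(p, Len).c_str(), cnt);
}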

Hash

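The hash approach is easy to follow: map every length-n substring to a slot in a table and count occurrences slot by slot. A few points need attention:

1. There can be at most 1048576 distinct substrings, so the table needs more than 1048576 slots. Here the table has 10000010 entries and the hash value is reduced modulo 10000007 to get the index. The intent is for this modulus to be prime; note that 10000007 = 941 × 10627 is in fact composite.

2. A character has only 256 possible values, which is tiny compared with the up to 1048576 substrings. To make it unlikely that two different substrings get the same hash value, the substring is treated as a number in a larger base; I use base 1007.

3. Time limit: while updating ans, store the starting index of the best substring rather than the substring itself. Copying a substring on every update easily exceeds the time limit.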
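The key step behind points 2 and 3 is the rolling update: the hash of the next window is derived from the current one in constant time, without re-reading the whole substring. The sketch below illustrates this under a few assumptions: `windowHashes` is a hypothetical helper, the base defaults to the 1007 used in this post, and arithmetic wraps modulo 2^64 (unsigned overflow is well defined in C++).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Polynomial hash of every length-len window of s, one entry per start index.
std::vector<unsigned long long> windowHashes(const std::string& s, std::size_t len,
                                             unsigned long long base = 1007) {
    assert(len >= 1 && len <= s.size());
    unsigned long long h = 0, top = 1;            // top ends up as base^len
    for (std::size_t i = 0; i < len; ++i) {
        top *= base;
        h = h * base + static_cast<unsigned char>(s[i]);
    }
    std::vector<unsigned long long> out;
    for (std::size_t i = 0; i + len <= s.size(); ++i) {
        out.push_back(h);
        if (i + len < s.size())
            // Shift the window: append s[i+len], then remove the contribution of s[i].
            h = h * base + static_cast<unsigned char>(s[i + len])
                - static_cast<unsigned char>(s[i]) * top;
    }
    return out;
}
```

For "abcabc" with len 3 the windows are abc, bca, cab, abc, so the first and last entries are equal. Equal hashes do not prove equal substrings in general, so a solution that counts by hash alone accepts a small risk of collisions.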

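#include <bits/stdc++.h>
using namespace std;
const unsigned long long mod = 10000007;    // table index = hash % mod
const unsigned long long B = 1007;          // base of the polynomial hash
int mp[10000010] = {};                      // occurrence count per table slot
int main(){
    int n, maxn = 0, cnt, ans = 0;    // ans: start index of the best substring so far
    string s;
    scanf("%d", &n);
    // Skip only the rest of the first line, so leading whitespace of the text is kept.
    int ch;
    while((ch = getchar()) != '\n' && ch != EOF);
    getline(cin, s);
    int total = s.length();
    unsigned long long hash = 0, b = 1;    // b becomes B^n; arithmetic wraps modulo 2^64
    for(int i = 0; i < n; ++ i){
        b *= B;
        hash = hash * B + s[i];
    }
    for(int i = 0; i + n <= total; ++ i){
        // Counts are kept per slot, so different substrings that land in the
        // same slot would be counted together.
        cnt = ++ mp[hash % mod];
        if(cnt > maxn){
            maxn = cnt;
            ans = i;
        }
        else if(cnt == maxn && s.compare(i, n, s, ans, n) < 0)
            ans = i;    // same count: keep the lexicographically smaller substring
        if(i + n < total)
            hash = hash * B + s[i+n] - s[i] * b;    // slide the window by one character
    }
    string ans_str = s.substr(ans, n);
    printf("%s %d", ans_str.c_str(), maxn);
}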

 

 
