1005 Programming Pattern (35 points) (C++)

Latest recommended article published on 2023-08-22 16:47:44

Brielleqqqqqqjie

Category: PAT TOP Level题解 (PAT Top Level solutions)

Article link: https://blog.csdn.net/qq_41562704/article/details/104122945

Programmers often have a preference among program constructs. For example, some may prefer if(0==a), while others may prefer if(!a). Analyzing such patterns can help to narrow down a programmer's identity, which is useful for detecting plagiarism.

Now given some text sampled from someone's program, can you find the person's most commonly used pattern of a specific length?

Input Specification:

Each input file contains one test case. For each case, there is one line consisting of the pattern length N (1≤N≤1048576), followed by one line no less than N and no more than 1048576 characters in length, terminated by a carriage return \n. The entire input is case sensitive.

Output Specification:

For each test case, print in one line the length-N substring that occurs most frequently in the input, followed by a space and the number of times it has occurred in the input. If there are multiple such substrings, print the lexicographically smallest one.

Whitespace characters in the input should be printed as they are. Also note that there may be multiple occurrences of the same substring overlapping each other.

Sample Input 1:

4
//A can can can a can.

Sample Output 1:

 can 4

Sample Input 2:

3
int a=~~~~~~~~~~~~~~~~~~~~~0;

Sample Output 2:

~~~ 19

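Problem summary: given a string and a substring length N, find the length-N substring that occurs most often in the string. If two substrings occur the same number of times, output the lexicographically smaller one. Note that occurrences may overlap each other.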
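Before looking at the faster solutions, a brute-force reference can help check answers on small inputs. The following is only a sketch, not part of the submitted solutions: the function name `mostFrequentNaive` is my own, and it assumes 1 <= len <= text.size(). It copies every window into a `std::map`, so it is far too slow and memory-hungry for the full limits.

```cpp
#include <map>
#include <string>
#include <utility>

// Brute-force reference (hypothetical helper, for small inputs only):
// count every length-len window in an ordered map.
std::pair<std::string, int> mostFrequentNaive(const std::string& text, int len) {
    const std::size_t w = static_cast<std::size_t>(len);
    std::map<std::string, int> freq;
    for (std::size_t i = 0; i + w <= text.size(); ++i)
        ++freq[text.substr(i, w)];          // overlapping windows are counted separately

    std::pair<std::string, int> best("", 0);
    // std::map iterates keys in ascending order, so a strict '>' keeps
    // the lexicographically smallest substring among equal counts.
    for (const auto& kv : freq)
        if (kv.second > best.second) best = kv;
    return best;
}
```

Calling `mostFrequentNaive("//A can can can a can.", 4)` should reproduce the first sample, including the leading space in the substring.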

Suffix array

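For the principles behind suffix arrays, see: Suffix Arrays Explained; Understand Suffix Arrays in Five Minutes; Suffix Arrays: A Powerful Tool for String Processing (highly recommended!)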

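A suffix array sorts all suffixes in lexicographic order and yields three key arrays:
sa[i]: the starting index of the suffix whose rank is i.
rank[i]: the rank of the suffix that starts at index i.
height[i]: the length of the longest common prefix of Suffix[sa[i]] and Suffix[sa[i-1]], that is, of two suffixes adjacent in rank.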
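To make the three definitions concrete, here is a small sketch that builds them the slow way, by sorting suffix start positions directly. The names `SuffixInfo` and `buildNaive` are my own and are not used by the solution below; the sketch uses 0-based ranks and sets height[0] to 0 because the smallest suffix has no predecessor.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical container for the three arrays described above.
struct SuffixInfo {
    std::vector<int> sa, rank, height;
};

// Naive construction: O(n^2 log n) in the worst case, for illustration only.
SuffixInfo buildNaive(const std::string& s) {
    const int n = static_cast<int>(s.size());
    SuffixInfo r;
    r.sa.resize(n);
    r.rank.resize(n);
    r.height.assign(n, 0);

    // sa: suffix start positions sorted by the suffixes they denote.
    std::iota(r.sa.begin(), r.sa.end(), 0);
    std::sort(r.sa.begin(), r.sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });

    // rank is the inverse permutation of sa.
    for (int i = 0; i < n; ++i) r.rank[r.sa[i]] = i;

    // height[i]: common prefix length of the suffixes ranked i-1 and i.
    for (int i = 1; i < n; ++i) {
        int a = r.sa[i - 1], b = r.sa[i], k = 0;
        while (a + k < n && b + k < n && s[a + k] == s[b + k]) ++k;
        r.height[i] = k;
    }
    return r;
}
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so sa is 5 3 1 0 4 2, rank is 3 2 5 1 4 0, and height is 0 1 3 0 0 2.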

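Here we need the length-len substring with the highest frequency, taking the lexicographically smallest one on ties. Looking at the height array: if tempcnt consecutive entries satisfy h[i] >= len, the corresponding tempcnt+1 adjacent suffixes all share a prefix of length at least len, so that substring occurs tempcnt+1 times (the code below starts tempcnt at 1 so that it holds the occurrence count directly). The best count so far is kept in cnt. We scan from front to back, and because the suffixes are already in lexicographic order, we update only when a later run gives a tempcnt strictly greater than the recorded cnt, so the smallest substring wins ties.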
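The scan can be sketched in isolation as follows. This is a simplified illustration rather than the exact code of the solution: `scanHeight` is a hypothetical name, and it assumes `sa` and `height` were already built for `text` with 0-based ranks and 1 <= len <= text.size(). Suffixes shorter than len are skipped because they cannot contribute a length-len substring.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: find the most frequent length-len substring
// from a precomputed suffix array and height array.
std::pair<std::string, int> scanHeight(const std::string& text,
                                       const std::vector<int>& sa,
                                       const std::vector<int>& height,
                                       int len) {
    const int n = static_cast<int>(text.size());
    int best = 0, bestPos = 0, run = 0;
    for (int i = 0; i < n; ++i) {
        // A suffix shorter than len has no length-len prefix; it breaks any run.
        if (sa[i] + len > n) { run = 0; continue; }
        // Extend the run while adjacent suffixes share at least len characters.
        run = (i > 0 && height[i] >= len) ? run + 1 : 1;
        // Strict '>' keeps the first, i.e. lexicographically smallest, run on ties.
        if (run > best) { best = run; bestPos = sa[i]; }
    }
    return {text.substr(bestPos, len), best};
}
```

With the arrays for "banana" (sa = 5 3 1 0 4 2, height = 0 1 3 0 0 2) and len = 2, both "an" and "na" occur twice, and the scan returns "an" because its run is met first.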

#include <bits/stdc++.h>
using namespace std;
const int maxn = 1048576+50;
string str;
int n, m, Len;
int s[maxn]={}, sa[maxn]={}, c[maxn]={}, tax[maxn]={}, tp[maxn]={}, Rank[maxn]={}, height[maxn]={};
void getSA(){
    m = 128;
    int *x = tax, *y = tp;
    for(int i = 0; i < m; ++ i) c[i] = 0;
    for(int i = 0; i < n; ++ i) ++c[x[i]=s[i]];
    for(int i = 1; i < m; ++ i) c[i] += c[i-1];
    for(int i = n-1; i >= 0; -- i) sa[--c[x[i]]] = i;
    for(int k = 1; k <= n; k <<= 1){
        int p = 0;
        for(int i = n-k; i < n; ++ i) y[p++] = i;
        for(int i = 0; i < n; ++ i) if(sa[i] >= k) y[p++] = sa[i]-k;
        for(int i = 0; i < m; ++ i) c[i] = 0;
        for(int i = 0; i < n; ++ i) ++c[x[y[i]]];
        for(int i = 1; i < m; ++ i) c[i] += c[i-1];
        for(int i = n-1; i >= 0; -- i) sa[--c[x[y[i]]]] = y[i];
        swap(x, y);
        p = 1;
        x[sa[0]] = 0;
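        for(int i = 1; i < n; ++ i){
            // Treat a position past the end as rank -1 so the comparison never reads outside the arrays.
            int a = sa[i-1], b = sa[i];
            int ra = a+k < n ? y[a+k] : -1, rb = b+k < n ? y[b+k] : -1;
            x[sa[i]] = y[a]==y[b] && ra==rb ? p-1 : p++;
        }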
        if(p >= n) break;
        m = p;
    }
}
void getHeight(){
    int k = 0;
    for(int i = 0; i < n; ++ i)
        Rank[sa[i]] = i;
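    for(int i = 0; i < n; ++ i){
        // The smallest suffix has no predecessor; reading sa[-1] would be out of bounds.
        if(Rank[i] == 0){
            height[0] = k = 0;
            continue;
        }
        if(k)  k--;
        int j = sa[Rank[i]-1];
        while(j+k<n && i+k<n && s[i+k] == s[j+k]) ++k;
        height[Rank[i]] = k;
    }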
}
int main(){
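    // Read N, then discard only the rest of that line; "%d\n" would also eat leading whitespace of the text.
    scanf("%d", &Len);
    for(int ch = getchar(); ch != '\n' && ch != EOF; ch = getchar());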
    getline(cin, str);
    n = str.length();
    for(int i = 0; i < n; ++ i)
        s[i] = str[i];
    getSA();
    getHeight();
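    // A suffix shorter than Len holds no length-Len substring, so sa[0] only counts when it is long enough.
    int p = sa[0], cnt = (sa[0]+Len <= n) ? 1 : 0, tempcnt = 1;
    for(int i = 1; i < n; ++ i){
        if(height[i] >= Len)
            ++ tempcnt;
        else
            tempcnt = 1;
        // Strict '<' keeps the earliest (lexicographically smallest) run among ties.
        if(sa[i]+Len <= n && cnt < tempcnt){
            cnt = tempcnt;
            p = sa[i];
        }
    }
    printf("%s %d", str.substr(p, Len).c_str(), cnt);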
}

Hash

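The hash approach is easy to follow: map every length-n substring to a slot in a table and count occurrences slot by slot. A few points need attention:

1. There can be at most 1048576 distinct substrings, so the table needs more than 1048576 slots. Here the table size is 10000010 and the modulus used for the index is 10000007. (A prime modulus is the usual choice; note that 10000007 = 941 × 10627 is actually composite, although the code below still runs with it.)

2. There are only 256 possible character values, which is tiny compared with up to 1048576 substrings, so a hash built from small per-character values would collide often. To make it unlikely that two different substrings get the same hash, treat the substring as a number in a large base; I use base 1007. Counts are keyed by the hash value alone, so a collision would still merge the counts of two substrings.

3. Timeouts: while updating ans, store the starting index of the best substring rather than a copy of the substring itself; otherwise it is easy to exceed the time limit.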
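The rolling update behind point 2 can be sketched on its own. This is an illustration under my own naming (`directHash`, `windowHashes`, and the alias `u64` are not in the solution), and it removes the outgoing character before multiplying, which is algebraically equivalent to the form used in the code below. All arithmetic wraps modulo 2^64 through unsigned overflow.

```cpp
#include <string>
#include <vector>

using u64 = unsigned long long;

// Hash of a whole string as a base-`base` number, wrapping modulo 2^64.
u64 directHash(const std::string& t, u64 base = 1007) {
    u64 h = 0;
    for (unsigned char c : t) h = h * base + c;
    return h;
}

// Hashes of every length-len window of s, each derived from the previous one in O(1).
std::vector<u64> windowHashes(const std::string& s, std::size_t len, u64 base = 1007) {
    std::vector<u64> out;
    if (len == 0 || len > s.size()) return out;

    u64 top = 1;                                   // base^(len-1), weight of the leading character
    for (std::size_t i = 1; i < len; ++i) top *= base;

    u64 h = directHash(s.substr(0, len), base);
    out.push_back(h);
    for (std::size_t i = len; i < s.size(); ++i) {
        h -= static_cast<unsigned char>(s[i - len]) * top;   // drop the outgoing character
        h = h * base + static_cast<unsigned char>(s[i]);     // shift and append the incoming one
        out.push_back(h);
    }
    return out;
}
```

For "abracadabra" with len = 3, the windows at positions 0 and 7 are both "abr", so their entries in the result are equal, and each entry matches `directHash` of the corresponding substring.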

#include <bits/stdc++.h>
using namespace std;
const unsigned long long inf = 10000007;
const unsigned long long B = 1007;
int mp[10000010] = {};
int main(){
	int n, maxn = 0, cnt, ans = 0;
	string s;
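	// Read N, then discard only the rest of that line; "%d\n" would also eat leading whitespace of the text.
	scanf("%d", &n);
	for(int ch = getchar(); ch != '\n' && ch != EOF; ch = getchar());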
	getline(cin, s);
	unsigned long long hash = 0, b = 1;
	for(int i = 0; i < n; ++ i){
		b *= B;
		hash = hash * B + s[i];
	}
	for(int i = 0; i + n <= s.length(); ++ i){
		cnt = ++ mp[hash % inf];
		if(cnt > maxn){
			maxn = cnt;
			ans = i;
		}
		else if(cnt == maxn){
			for(int j = 0; j < n; ++ j)
				if(s[ans+j] != s[i+j]){
					if(s[ans+j] > s[i+j])
						ans = i;
					break;
				}
		}
		if(i + n < s.length())
			hash = hash * B + s[i+n] - s[i] * b;
	}
	string ans_str = s.substr(ans, n);
	printf("%s %d", ans_str.c_str(), maxn);
}