Week 9 lab assignment: find all substrings whose frequency exceeds a given threshold

1. Problem description:

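    Given a collection of strings (53 strings of equal length), design an algorithm that finds every substring whose frequency is greater than a given threshold, where the threshold is an input parameter. For example, take the substring "taat": 24 of the 53 strings in the collection contain it, so its frequency is 24/53 (about 0.453). If the threshold is set to 0.5, the substring is not output, because its frequency is less than 0.5. If the threshold is set to 0.4, the substring should be output, because 24/53 is greater than 0.4.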
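The frequency rule above can be sketched in a few lines. This is only an illustration: the helper names countContaining, frequency and exceedsThreshold are made up for this example and do not appear in the assignment.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of strings that contain `sub` at least once.
std::size_t countContaining(const std::vector<std::string>& strings, const std::string& sub)
{
    std::size_t count = 0;
    for (const std::string& s : strings)
        if (s.find(sub) != std::string::npos)
            ++count;
    return count;
}

// Frequency = (strings containing sub) / (total number of strings).
double frequency(const std::vector<std::string>& strings, const std::string& sub)
{
    if (strings.empty())
        return 0.0;
    return static_cast<double>(countContaining(strings, sub)) / strings.size();
}

// A substring is reported only when its frequency is strictly greater than the threshold.
bool exceedsThreshold(const std::vector<std::string>& strings, const std::string& sub, double threshold)
{
    return frequency(strings, sub) > threshold;
}
```

A string counts once no matter how many times the substring occurs inside it, which matches the 24/53 example.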

2. Problem analysis:

    First, the 53 equal-length strings have probably already been handed out by the instructor, something like this:
 

string T[53] = { "tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt",
    "tgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcatcgccaa",
    "gtactagagaactagtgcattagcttatttttttgttatcatgctaaccacccggcg",
    "aattgtgatgtgtatcgaagtgtgttgcggagtagatgttagaatactaacaaactc",
    "tcgataattaactattgacgaaaagctgaaaaccactagaatgcgcctccgtggtag",
    "aggggcaaggaggatggaaagaggttgccgtataaagaaactagagtccgtttaggt",
    "cagggggtggaggatttaagccatctcctgatgacgcatagtcagcccatcatgaat",
    "tttctacaaaacacttgatactgtatgagcatacagtataattgcttcaacagaaca",
    "cgacttaatatactgcgacaggacgtccgttctgtgtaaatcgcaatgaaatggttt",
    "ttttaaatttcctcttgtcaggccggaataactccctataatgcgccaccactgaca",
    "gcaaaaataaatgcttgactctgtagcgggaaggcgtattatgcacaccccgcgccg",
    "cctgaaattcagggttgactctgaaagaggaaagcgtaatatacgccacctcgcgac",
    "gatcaaaaaaatacttgtgcaaaaaattgggatccctataatgcgcctccgttgaga",
    "ctgcaatttttctattgcggcctgcggagaactccctataatgcgcctccatcgaca",
    "tttatatttttcgcttgtcaggccggaataactccctataatgcgccaccactgaca",
    "aagcaaagaaatgcttgactctgtagcgggaaggcgtattatgcacaccgccgcgcc",
    "atgcatttttccgcttgtcttcctgagccgactccctataatgcgcctccatcgaca",
    "aaacaatttcagaatagacaaaaactctgagtgtaataatgtagcctcgtgtcttgc",
    "tctcaacgtaacactttacagcggcgcgtcatttgatatgatgcgccccgcttcccg",
    "gcaaataatcaatgtggacttttctgccgtgattatagacacttttgttacgcgttt",
    "gacaccatcgaatggcgcaaaacctttcgcggtatggcatgatagcgcccggaagag",
    "aaaaacgtcatcgcttgcattagaaaggtttctggccgaccttataaccattaatta",
    "tctgaaatgagctgttgacaattaatcatcgaactagttaactagtacgcaagttca",
    "accggaagaaaaccgtgacattttaacacgtttgttacaaggtaaaggcgacgccgc",
    "aaattaaaattttattgacttaggtcactaaatactttaaccaatataggcatagcg",
    "ttgtcataatcgacttgtaaaccaaattgaaaagatttaggtttacaagtctacacc",
    "catcctcgcaccagtcgacgacggtttacgctttacgtatagtggcgacaatttttt",
    "tccagtataatttgttggcataattaagtacgacgagtaaaattacatacctgcccg",
    "acagttatccactattcctgtggataaccatgtgtattagagttagaaaacacgagg",
    "tgtgcagtttatggttccaaaatcgccttttgctgtatatactcacagcataactgt",
    "ctgttgttcagtttttgagttgtgtataacccctcattctgatcccagcttatacgg",
    "attacaaaaagtgctttctgaactgaacaaaaaagagtaaagttagtcgcgtagggt",
    "atgcgcaacgcggggtgacaagggcgcgcaaaccctctatactgcgcgccgaagctg",
    "taaaaaactaacagttgtcagcctgtcccgcttataagatcatacgccgttatacgt",
    "atgcaattttttagttgcatgaactcgcatgtctccatagaatgcgcgctacttgat",
    "ccttgaaaaagaggttgacgctgcaaggctctatacgcataatgcgccccgcaacgc",
    "tcgttgtatatttcttgacaccttttcggcatcgccctaaaattcggcgtcctcata",
    "ccgtttattttttctacccatatccttgaagcggtgttataatgccgcgccctcgat",
    "ttcgcatatttttcttgcaaagttgggttgagctggctagattagccagccaatctt",
    "tgtaaactaatgcctttacgtgggcggtgattttgtctacaatcttacccccacgta",
    "gatcgcacgatctgtatacttatttgagtaaattaacccacgatcccagccattctt",
    "aacgcatacggtattttaccttcccagtcaagaaaacttatcttattcccacttttc",
    "ttagcggatcctacctgacgctttttatcgcaactctctactgtttctccatacccg",
    "gccttctccaaaacgtgttttttgttgttaattcggtgtagacttgtaaacctaaat",
    "cagaaacgttttattcgaacatcgatctcgtcttgtgttagaattctaacatacggt",
    "cactaatttattccatgtcacacttttcgcatctttgttatgctatggttatttcat",
    "atataaaaaagttcttgctttctaacgtgaaagtggtttaggttaaaagacatcagt",
    "caaggtagaatgctttgccttgtcggcctgattaatggcacgatagtcgcatcggat",
    "ggccaaaaaatatcttgtactatttacaaaacctatggtaactctttaggcattcct",
    "taggcaccccaggctttacactttatgcttccggctcgtatgttgtgtggaattgtg",
    "ccatcaaaaaaatattctcaacataaaaaactttgtgtaatacttgtaacgctacat",
    "tggggacgtcgttactgatccgcacgtttatgatatgctatcgtactctttagcgag",
    "tcagaaatattatggtgatgaactgtttttttatccagtataatttgttggcataat",
};

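Of course, if you don't like this test data, we can generate a random set instead. After all, this one was typed out by hand by a senior student (note: his blog post is linked at the end, and the random string generation below also follows his example).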


const int SIZE = 53; 
const int LEN = 20;    // length of each string

char getRandChr()
{
	string s = "AEIOU";		// key point: draw only from AEIOU so the generated strings are not too random and still share substrings
	return s[ rand() % s.length() ];
}

string getRandStr()
{
	string str;
	// srand( (unsigned int)time( (time_t *)NULL ) );		// time-based random seed (disabled; if enabled, call it once at program start, not once per string)
	for( int i = 0; i < LEN; i++ )
	{
		str += getRandChr();
	}
	return str;
}


First, take a look at the generated strings. If they still look too random to you, you can reduce the five letters in getRandChr() to three.
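To make the alphabet easy to shrink, the generator can take it as a parameter. The following is a sketch only: randomString is a hypothetical name, and it uses the standard <random> facilities instead of the rand() call in the post. It assumes the alphabet is not empty.

```cpp
#include <cstddef>
#include <random>
#include <string>

// Hypothetical variant of getRandStr(): length and alphabet are parameters.
// Precondition (assumed): alphabet is not empty.
std::string randomString(std::size_t length, const std::string& alphabet, std::mt19937& rng)
{
    // Uniformly pick an index into the alphabet for every position.
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);
    std::string result;
    result.reserve(length);
    for (std::size_t i = 0; i < length; ++i)
        result += alphabet[pick(rng)];
    return result;
}
```

Calling randomString(20, "AEI", rng) with a std::mt19937 seeded once at program start gives 20-character strings over three letters, so repeated substrings become much more likely.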

All strings are in here:
AAOEUUIUOUUAAUEOUIOO
AAOUEAUOAUIAUEOUAOII
AEIEAAIUOOEIOIOUOUOE
AEUEUAUIAOIEOIEEOUUA
AOIUAEOAIIUOEOEOAEUA
AOOOAIIAOUOAIUIOOIEA
AOOUAAEUAEIAOAEEEIEI
AOOUIOIAEUOEUOAUIUOA
AUAOOOIIEUOOEAOOOUOA
AUUIOOAEIAEEUAEEOUOU
EAAIOEAIUOEAAUAAEEOO
EAEEAIAUEIAAEUEAIEII
EAEIAAAUAUUAEUOIEEOO
EAUEOIOAOAEUAOUOEAEU
EIAOAEEEUUEIEOIAAEAE
EIUAUUOOIUAAEIEEAIIE
EOAEAOOUUOOUIEUAOEEA
EOAOAIOAIIIAUEAOAAIU
EOEAIOEAOEAAIIEIUOIO
EOEUAOIAIAOEIIIIUUEU
EUIAAAAUIOIUAAUUAAEE
EUIOIIEEOAIEEOUIIUAU
EUOIAAUUEIAAOOUOUEIU
EUOUOOAOIIOOOOIEAIOU
EUOUUIAIAEAIAAEIOOUU
IAOEIAIUUUUUIUOAIIOI
IEAIIUAIUOOAIIOEIEAA
IEUEEIIIIEAEOUUAUIEA
IIIIAOEOOIAUAAOOIOOI
IIOOAUUOEOIEEAOOAEIA
IOUUEAEEEEEIEUAEEIEU
OAAIEOUIUAUUAAAOOAEO
OAAOIOIAOEEOEIEUEIUE
OAOOIAIAOIIUIAOUOIEE
OAUOUAIEOEEAUIAUIOII
OEIOOUEEOOIUIIIUOEUO
OIAOAEIIIEUAIAOUAAAO
OIUOOAOOAEOEUOUIIIOI
OIUUUIIUIOIOEAIOAOOI
OUAIOEAOEEIUIAOEIEAO
OUIOUOAAOOAAEUOEOUIE
UAIEAEEEEIOAAUOOAEIU
UAIOUEAIOEEOOIIIEIEA
UEAEAUUEOOUOUUUIAAOO
UEEAUAEAEUAOAOUEOUIU
UEEIIEUAOUEIAOOIUEIU
UIEUOAIEEAOIAUUEEEEO
UIUEEOIEEUIEIUAIAAOU
UIUOEUOIAAAUOIOIUOEU
UOIOUAIOEUOUOUEOAEUI
UOUEOIOAEOUOEAUOAAOI
UOUUEAEEEOUUEOIOOIUE
UUUEAOEIUAOOEIEUUOIU

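The main task is to find all substrings whose frequency is greater than a given threshold. First we need to obtain these substrings:

(1) Cut substrings out of the 53 strings by length (note: start from the longest, and skip any substring that has already been recorded)

(2) Check whether the substring meets the condition, i.e. its frequency is greater than the given threshold

(3) If the substring meets the condition, then all of its own substrings meet it too, because every string that contains the substring also contains each piece of it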
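Step (3) needs a way to list every substring of a qualifying substring. A minimal sketch of that step is shown here; allSubstrings is a hypothetical helper name, and the post's own code does the same thing inline with two nested loops.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: every distinct non-empty substring of s.
std::set<std::string> allSubstrings(const std::string& s)
{
    std::set<std::string> result;
    for (std::size_t start = 0; start < s.size(); ++start)
        for (std::size_t len = 1; start + len <= s.size(); ++len)
            result.insert(s.substr(start, len));  // the set drops duplicates
    return result;
}
```

For "abc" this yields a, ab, abc, b, bc and c. A string of length n has at most n(n+1)/2 distinct substrings, so for short strings the set stays small.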

	set<string> subStr;
	for( int i = LEN; i >= 1; i-- )		// start from the longest substrings: if a substring qualifies, all of its substrings qualify too
	{
		for( int j = 0; j < SIZE; j++ )		// visit each of the SIZE strings, j is the index
		{
			for( int k = 0; k < LEN - i + 1; k++ )	// k is the start position of the substring
			{
				int count = 0;		// number of strings that contain tmpStr
				string tmpStr = str[j].substr( k, i );	// cut out the substring
				if( subStr.find( tmpStr ) != subStr.end() )
					continue;		// skip substrings that are already recorded

				for( int ii = 0; ii < SIZE; ii++ )
					if( str[ii].find( tmpStr ) != string::npos )	
					{
						// cout << "Find " << tmpStr << " in " << str[ii] << endl;
						count++;
					}
				// frequency must be strictly greater than the threshold,
				// e.g. 0.4 * 53 = 21.2, so at least 22 strings must contain tmpStr
				if( count > threshold * SIZE )
				{
					// put tmpStr and all of its substrings into the set subStr
					// cout << "For " << tmpStr << "'s count is " << count << endl;
					for( int ii = tmpStr.length(); ii >= 1; ii--  )
					{
						for( int kk = 0; kk < tmpStr.length() - ii + 1; kk++ )
						{
							subStr.insert( tmpStr.substr( kk, ii ) );
						}
					}
				}

			}
		}
	}
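For comparison, here is an alternative sketch that is not the method used in this post: it counts, for every distinct substring, how many input strings contain it, and then filters by the threshold in one pass. The function name frequentSubstrings is made up for this example.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Alternative sketch (hypothetical, not the post's method).
std::set<std::string> frequentSubstrings(const std::vector<std::string>& strings, double threshold)
{
    std::map<std::string, std::size_t> containing;  // substring -> number of strings containing it
    for (const std::string& s : strings) {
        std::set<std::string> seen;  // distinct substrings of this one string
        for (std::size_t start = 0; start < s.size(); ++start)
            for (std::size_t len = 1; start + len <= s.size(); ++len)
                seen.insert(s.substr(start, len));
        for (const std::string& sub : seen)
            ++containing[sub];  // each string contributes at most once per substring
    }

    std::set<std::string> result;
    for (const auto& entry : containing)
        if (entry.second > threshold * strings.size())  // strictly greater than the threshold
            result.insert(entry.first);
    return result;
}
```

This trades memory for simplicity: it stores every distinct substring of every string, which is fine for 53 short strings but would not scale to long inputs.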


All that remains is to print the substrings, which are read out of the set.

set<string>::iterator iter;
cout << "All more than threshold strings are in here:" << endl;
for( iter = subStr.begin(); iter != subStr.end(); iter++ )
	cout << *iter << ' ';


If you also need the full source code, it is here:

https://paste.ubuntu.com/p/45hKmMX8S8/

Or copy it from here:

#include <iostream>
#include <cstdlib>		// rand()
#include <string>
#include <set>
#include <algorithm>	// sort()
#include <ctime>		// time(), only needed if the srand() call in getRandStr() is enabled

using namespace std;

const int SIZE = 53;
const int LEN = 20;

string 	getRandStr();
char 	getRandChr();
double 	getThrhd();
void 	initStrSet( string str[] );

int main()
{
	string str[ SIZE ];
	double threshold;	// threshold

	initStrSet( str );	// initialize the collection of strings (stored in a plain array, not in a set)
	cout << "All strings are in here:" << endl;
	for( int i = 0; i < SIZE; i++ )
		cout << str[i] << endl;

	threshold = getThrhd();	// read the threshold
	// threshold = 0.9;

	set<string> subStr;
	for( int i = LEN; i >= 1; i-- )		// start from the longest substrings: if a substring qualifies, all of its substrings qualify too
	{
		for( int j = 0; j < SIZE; j++ )		// visit each of the SIZE strings, j is the index
		{
			for( int k = 0; k < LEN - i + 1; k++ )	// k is the start position of the substring
			{
				int count = 0;		// number of strings that contain tmpStr
				string tmpStr = str[j].substr( k, i );	// cut out the substring
				if( subStr.find( tmpStr ) != subStr.end() )
					continue;		// skip substrings that are already recorded

				for( int ii = 0; ii < SIZE; ii++ )
					if( str[ii].find( tmpStr ) != string::npos )	
					{
						// cout << "Find " << tmpStr << " in " << str[ii] << endl;
						count++;
					}
				// frequency must be strictly greater than the threshold,
				// e.g. 0.4 * 53 = 21.2, so at least 22 strings must contain tmpStr
				if( count > threshold * SIZE )
				{
					// put tmpStr and all of its substrings into the set subStr
					// cout << "For " << tmpStr << "'s count is " << count << endl;
					for( int ii = tmpStr.length(); ii >= 1; ii--  )
					{
						for( int kk = 0; kk < tmpStr.length() - ii + 1; kk++ )
						{
							subStr.insert( tmpStr.substr( kk, ii ) );
						}
					}
				}

			}
		}
	}

	set<string>::iterator iter;
	cout << "All more than threshold strings are in here:" << endl;
	for( iter = subStr.begin(); iter != subStr.end(); iter++ )
		cout << *iter << ' ';

	return 0;
}

char getRandChr()
{
	string s = "AEIOU";		// key point: draw only from the five vowels AEIOU so the generated strings are not too random and still share substrings
	return s[ rand() % s.length() ];
}

string getRandStr()
{
	string str;
	// srand( (unsigned int)time( (time_t *)NULL ) );		// time-based random seed (disabled; if enabled, call it once at program start, not once per string)
	for( int i = 0; i < LEN; i++ )
	{
		str += getRandChr();
	}
	return str;
}

void initStrSet( string str[] )
{
	for( int i = 0; i < SIZE; i++ )
		str[i] = getRandStr();
	sort( str, str + SIZE );	// sort the SIZE strings
}

double getThrhd()
{
	double tmp;
	cout << "Please Input Threshold( 0-1 ): ";
	while (1) 
	{
    	cin >> tmp;
    	if ( tmp > 0 && tmp < 1 )
        {
        	cout << "Input SUCCESS!" << endl;
        	break;
        }
    	else
    	{
    		cout << "Input ERROR!" << endl;
        	cout << ">>>Please input threshold( 0-1 ) Again: ";
    	}
        	
	}
	return tmp;
}

 

This post draws on a blog post by a senior student:

https://blog.csdn.net/lrwwll/article/details/53033210
