Rosalind编程问题之计算随机序列出现并匹配待比对序列的概率。
跟Rosalind Java|Introduction to Random Strings有异曲同工之妙。
Matching Random Motifs
Problem:
Our aim in this problem is to determine the probability with which a given motif (a known promoter, say) occurs in a randomly constructed genome. Unfortunately, finding this probability is tricky; instead of forming a long genome, we will form a large collection of smaller random strings having the same length as the motif; these smaller strings represent the genome’s substrings, which we can then test against our motif.
Given a probabilistic event A, the complement of A is the collection Ac of outcomes not belonging to A. Because Ac takes place precisely when A does not, we may also call Ac “not A.”
For a simple example, if A is the event that a rolled die is 2 or 4, then Pr(A)=13. Ac is the event that the die is 1, 3, 5, or 6, and Pr(Ac)=23. In general, for any event we will have the identity that Pr(A)+Pr(Ac)=1.
Given: A positive integer N≤100000, a number x between 0 and 1, and a DNA string s of length at most 10 bp.
Sample input:
90000 0.6
ATAGCCGA
Return: The probability that if N random DNA strings having the same length as s are constructed with GC-content x (see “Introduction to Random Strings”), then at least one of the strings equals s. We allow for the same random string to be created more than once.
Sample output:
0.689
本题给出我们90000条长度和ATAGCCGA一样的序列,这些序列有相同的GC比例0.6。需要我们计算90000条序列中至少一条序列等于给出序列的概率(ATAGCCGA)。有些朋友可能提问,给出的GC比例为0.6,并不等于给出序列(ATAGCCGA)的GC比。在这里GC含量可以理解为GC出现的概率,从而反推AT出现的概率。否则两条GC比例不同但长度相同的序列永远不可能相等。
解题思路如下:
1.读取三个输入数据。
2.统计给出序列的GC数和AT数,根据GC比例计算G/C以及A/T出现的概率。
3.结合第二步,计算给出序列与未知序列相同的概率p。
4.结合第三步,计算给出序列与未知序列不相同的概率1-p。
5.至少一个匹配给出序列的概率等于1-(1-p)n,最后四舍五入到三位小数。
代码如下:
public class Matching_Random_Motifs {
public static void main(String[] args) {
//1.输入序列个数
Scanner sc = new Scanner(System.in);
System.out.println("请输入序列个数n:");
int n = sc.nextInt();
//输入GC比例
Scanner sr = new Scanner(System.in);
System.out.println("请输入GC比例:");
Double GC = sr.nextDouble();
//输入待比对序列
Scanner sk = new Scanner(System.in);
System.out.println("请输入示例序列:");
String DNA = sk.nextLine();
//2.计算有相同GC比例和长度的序列中等于待比对DNA的概率
int ATDNA = 0;
int GCDNA = 0;
for (int i = 0; i < DNA.length(); i++) {
switch (DNA.charAt(i)) {
//括号中的输出类型为char类型,匹配后面的字符时需要将字符加单引号转化为对应的ASCII码方可匹配。
case 'A':
case 'T':
ATDNA++;
break;
case 'C':
case 'G':
GCDNA++;
break;
default:
break;
}
}
double prob = Math.pow(GC / 2, GCDNA) * Math.pow((1 - GC) / 2, ATDNA);
double randomprob = 1 - Math.pow(1 - prob, n);
System.out.println((float) (Math.round(randomprob * 1000)) / 1000);//四舍五入时分子一定要转化成浮点型,否则会默认整数型输出并舍为0
}
}