Rosalind programming problems: finding a motif that two sequences share when it is interrupted by introns.
Finding a Spliced Motif
Problem:
A subsequence of a string is a collection of symbols contained in order (though not necessarily contiguously) in the string (e.g., ACG is a subsequence of TATGCTAAGATC). The indices of a subsequence are the positions in the string at which the symbols of the subsequence appear; thus, the indices of ACG in TATGCTAAGATC can be represented by (2, 5, 9).
As a substring can have multiple locations, a subsequence can have multiple collections of indices, and the same index can be reused in more than one appearance of the subsequence; for example, ACG is a subsequence of AACCGGTT in 8 different ways.
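The count of 8 can be checked with a small dynamic-programming sketch. The class and method names below (SubsequenceCounter, countEmbeddings) are hypothetical and not part of the Rosalind problem; BigInteger is used on the assumption that counts for 1 kbp inputs may not fit in a long.

```java
import java.math.BigInteger;
import java.util.Arrays;

class SubsequenceCounter {
    // Counts the distinct index collections at which t occurs as a subsequence of s.
    static BigInteger countEmbeddings(String s, String t) {
        // ways[j] = number of ways to spell the first j symbols of t
        // using only the part of s scanned so far.
        BigInteger[] ways = new BigInteger[t.length() + 1];
        Arrays.fill(ways, BigInteger.ZERO);
        ways[0] = BigInteger.ONE; // the empty prefix is matched in exactly one way
        for (int i = 0; i < s.length(); i++) {
            // Go right to left so that s[i] is used at most once per embedding.
            for (int j = t.length(); j >= 1; j--) {
                if (s.charAt(i) == t.charAt(j - 1)) {
                    ways[j] = ways[j].add(ways[j - 1]);
                }
            }
        }
        return ways[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(countEmbeddings("AACCGGTT", "ACG")); // prints 8
    }
}
```

Each of the two A's can pair with each of the two C's and each of the two G's, which is where 2 × 2 × 2 = 8 comes from.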
Given: Two DNA strings s and t (each of length at most 1 kbp) in FASTA format.
Sample input:
>Rosalind_14
ACGTACGTGACG
>Rosalind_18
GTA
Return: One collection of indices of s in which the symbols of t appear as a subsequence of s. If multiple solutions exist, you may return any one.
Sample output:
3 8 10
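The problem gives two sequences and asks us to locate, in the longer one, the position of every base of the shorter one. Equivalently, think of the shorter sequence as the CDS of the longer one: the longer sequence still contains introns, and we need to report the positions of the bases that make up the CDS. (The answer to this problem is not unique.)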
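Because several answers are acceptable, it can help to verify a candidate answer rather than compare it with the sample output. The following is a minimal sketch under that assumption; SplicedMotifChecker and isValidAnswer are hypothetical names introduced here for illustration.

```java
class SplicedMotifChecker {
    // Returns true if the 1-based indices are strictly increasing, lie inside s,
    // and the bases of s at those positions spell out t.
    static boolean isValidAnswer(String s, String t, int[] indices) {
        if (indices.length != t.length()) {
            return false;
        }
        int previous = 0; // last accepted 1-based position
        for (int k = 0; k < indices.length; k++) {
            int pos = indices[k];
            if (pos <= previous || pos > s.length()) {
                return false; // out of order or out of range
            }
            if (s.charAt(pos - 1) != t.charAt(k)) {
                return false; // wrong base at this position
            }
            previous = pos;
        }
        return true;
    }

    public static void main(String[] args) {
        String s = "ACGTACGTGACG";
        String t = "GTA";
        System.out.println(isValidAnswer(s, t, new int[] {3, 8, 10})); // prints true
        System.out.println(isValidAnswer(s, t, new int[] {3, 4, 5}));  // prints true
        System.out.println(isValidAnswer(s, t, new int[] {3, 4, 6}));  // prints false
    }
}
```

Both 3 8 10 (the sample output) and 3 4 5 (the leftmost match) pass the check, which illustrates why a program's output may differ from the sample and still be correct.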
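The approach is as follows:
1. Read the two sequences.
2. Walk the long and the short sequence with two pointers.
3. Whenever the two bases match, output the position of that base in the long sequence.
Below is the implementation: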
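import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class Finding_a_Spliced_Motif {
    public static void main(String[] args) {
        ArrayList<String> fasta = BufferedReader2("C:/Users/Administrator/Desktop/rosalind_sseq.txt", "fasta");
        ArrayList<Integer> index = new ArrayList<>();
        // Two-pointer scan
        int i = 0; // position in the first record, the main sequence
        int j = 0; // position in the second record, the subsequence
        // Stop once the whole subsequence is matched, or when the main sequence
        // runs out (in which case the second record is not a subsequence of the first)
        while (j < fasta.get(1).length() && i < fasta.get(0).length()) {
            if (fasta.get(1).charAt(j) == fasta.get(0).charAt(i)) {
                index.add(i + 1); // Rosalind expects 1-based positions
                j++; // advance in the subsequence
            }
            i++; // always advance in the main sequence
        }
        for (int k = 0; k < index.size(); k++) {
            System.out.print(index.get(k) + " ");
        }
    }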
    // Reads a FASTA file. Returns the header lines when choose is "tag",
    // otherwise the sequences; in both cases the result is an ArrayList.
    public static ArrayList<String> BufferedReader2(String path, String choose) {
        ArrayList<String> tag = new ArrayList<>();
        ArrayList<String> fasta = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line = reader.readLine();
            StringBuilder sb = new StringBuilder();
            while (line != null) {
                if (line.startsWith(">")) { // header line: starts a new record
                    tag.add(line);
                    // Store the previous record's sequence, joined without line breaks
                    if (sb.length() != 0) {
                        fasta.add(sb.toString());
                        sb.setLength(0); // clear the StringBuilder
                    }
                } else {
                    sb.append(line); // sequence line: append to the current record
                }
                // read next line
                line = reader.readLine();
            }
            fasta.add(sb.toString()); // the last record has no header after it
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (choose.equals("tag")) {
            return tag;
        }
        return fasta;
    }
}
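The two-pointer technique
The core idea of the two-pointer technique is to scan an array or collection with two pointers instead of one, moving them either in the same direction or toward each other. It takes advantage of ordering in the data, which in some cases simplifies the computation. The key to implementing it is choosing the conditions that move and stop the pointers. In this problem the match condition is that the two bases are equal, fasta.get(1).charAt(j) == fasta.get(0).charAt(i), and only a match advances j; the scan terminates once j reaches the end of the subsequence.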
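To make the two conditions easier to see in isolation, here is a minimal sketch of the same scan as a standalone method. TwoPointerSketch and findIndices are hypothetical names, and returning an empty list when t is not a subsequence of s is an assumption of this sketch, not something the problem requires.

```java
import java.util.ArrayList;
import java.util.List;

class TwoPointerSketch {
    // Returns the leftmost 1-based positions in s that spell out t,
    // or an empty list if t is not a subsequence of s.
    static List<Integer> findIndices(String s, String t) {
        List<Integer> indices = new ArrayList<>();
        int i = 0; // pointer into the main sequence s
        int j = 0; // pointer into the subsequence t
        // Termination: t is fully matched (j hits its end) or s is exhausted.
        while (i < s.length() && j < t.length()) {
            // Match condition: equal bases record a position and advance j.
            if (s.charAt(i) == t.charAt(j)) {
                indices.add(i + 1);
                j++;
            }
            i++; // i advances on every step, so the scan is linear in the length of s
        }
        return j == t.length() ? indices : new ArrayList<>();
    }

    public static void main(String[] args) {
        System.out.println(findIndices("ACGTACGTGACG", "GTA")); // prints [3, 4, 5]
    }
}
```

Both pointers only move forward, so each base of s is inspected at most once. Taking the first available match for each base of t is safe here because an earlier match never rules out a later one.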