Rosalind编程问题之基因组reads组装。
终于来到基因组装部分了~
这里分享最简单的组装逻辑之一,即根据重叠序列左右延伸出最长序列。
Genome Assembly as Shortest Superstring
Problem:
For a collection of strings, a larger string containing every one of the smaller strings as a substring is called a superstring.
By the assumption of parsimony, a shortest possible superstring over a collection of reads serves as a candidate chromosome.
Given: At most 50 DNA strings of approximately equal length, not exceeding 1 kbp, in FASTA format (which represent reads deriving from the same strand of a single linear chromosome).
The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length.
Sample input:
>Rosalind_56
ATTAGACCTG
>Rosalind_57
CCTGCCGGAA
>Rosalind_58
AGACCTGCCG
>Rosalind_59
GCCGGAATAC
Return: A shortest superstring containing all the given strings (thus corresponding to a reconstructed chromosome).
Sample output:
ATTAGACCTGCCGGAATAC
题目给出了我们多条序列,需要我们将其拼成一条长序列。这个过程对应了基因组组装中的reads to contig的过程,通过reads间的overlap区域将其叠加成contig。题目中给出的overlap cutoff是其序列长度一半以上。那么我们判断两条序列是否能拼接的标准便是两条序列是否有一半以上的序列重合。
因此需要我们定义一个长度可变的窗口(windows)范围在待比对序列长度的一半到全长。那么,当我们以某一个序列作为基准序列时,就可以将遍历到的序列按照可变窗口在前端和后端逐一比对,一旦比对成功即可将待比对序列的其余部分拼接到基准序列上。
解决步骤:
1.读取多序列文本,并获取其中的碱基序列作为输入文件。
2.在某一方向上,选取最长序列作为基准序列,逐一遍历其余序列,以可变窗口的方式判断是否拼接(StringBuilder)。
3.在另一方向重复上述过程。
操作代码如下:
public class AAAA_Genome_Assembly_as_Shortest_Superstring {
public static void main(String[] args) throws FileNotFoundException {
List<String> fasta = BufferedReader2("C:/Users/Administrator/Desktop/rosalind_long.txt", "fasta");//设置返回值是碱基序列。
//初始化最长序列
String start = new String();
for (int i = 0; i < fasta.size(); i++) {
String temp_start = fasta.get(i);
if (temp_start.length() > start.length()) {
start = temp_start;
}
}
StringBuilder sb = new StringBuilder(start);//初始化stringbuilder
int k = 0;
while (k < 50) {//循环50次保证每一条reads都被循环到
//以start为基准向后延伸
for (String s : fasta) {
//设置前缀和后缀滑动重叠框
for (int j = s.length() / 2 + 1; j < s.length(); j++) {
String suffix_start = start.substring(start.length() - j);
String prefix_seq = s.substring(0, j);
//当start后端与seq的前端匹配,且不能向后延伸时
if (suffix_start.equals(prefix_seq) && !start.substring(start.length() - j - 1).equals(s.substring(0, j + 1))) {
sb.append(s.substring(j));
start = sb.toString();
break;
}
}
}
k++;
}
k = 0;
while (k < 50) {//循环50次保证每一条reads都被循环到
//以start为基准向前延伸
for (String s : fasta) {
//设置前缀和后缀滑动重叠框
for (int j = s.length() / 2 + 1; j < s.length(); j++) {
String prefix_start = start.substring(0, j);
String suffix_seq = s.substring(s.length() - j);
//当start前端与seq的后端匹配,且不能向前延伸时
if (prefix_start.equals(suffix_seq) && !start.substring(0, j + 1).equals(s.substring(s.length() - j - 1))) {
sb.insert(0, s.substring(0, s.length() - j));//补充在原序列前的新序列是索引s.length()-j以前的部分。
start = sb.toString();
break;
}
}
}
k++;
}
System.out.println(sb);
}
public static ArrayList<String> BufferedReader2(String path, String choose) {//返回值类型是新建集合大类,此处是Set而非哈希。
BufferedReader reader;
ArrayList<String> tag = new java.util.ArrayList<>();
ArrayList<String> fasta = new java.util.ArrayList<>();
try {
reader = new BufferedReader(new FileReader(path));
String line = reader.readLine();
StringBuilder sb = new StringBuilder();
while (line != null) {//多次匹配带有“>”的行,\w代表0—9A—Z_a—z,需要转义。\W代表非0—9A—Z_a—z。
if (line.matches(">[\\w|\\W]*")) {
tag.add(line);
//定义字符串变量seq保存删除换行符的序列信息
if (sb.length() != 0) {
String seq = sb.toString();
fasta.add(seq);
sb.delete(0, sb.length());//清空StringBuilder中全部元素
}
} else {
sb.append(line);//重新向StringBuilder添加元素
}
// read next line
line = reader.readLine();
}
String seq = sb.toString();
fasta.add(seq);
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
if (choose.equals("tag")) {
return tag;
}
return fasta;
}
}