A FASTA file looks like this:
>ABC
GGTTCCAAACCGGTT
AATTGGGGGCCCGGGTT
AAATGGAAGGATTTCCC
AATTGGA
>DEF
AAAAATTTTTGGGGGC
CCCCCGGGAAT
CCCCAAAACACACACA
TTTTGGGAGCAGGCAG
>GHI
AATTCGCGGCATCGCATTCAGC
GCGACTACGACTACGATGCATCAG
CAGCATCG
>JKL
AATTAGGATTTGTGCTAGCATG
CGCGGCTCGCGGCCCCCCCGGAT
CGCGATTGGCATC
CAGTCGTAGCTACGTAGCT
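The text after the > symbol is a gene name or a contig ID. (A contig is a longer sequence assembled from the shorter reads produced by second-generation or first-generation sequencing. Assembly relies on the overlapping bases between reads, and the assembled sequence is called a contig.)
The lines that follow hold the base sequence, which may be wrapped across several lines.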
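The layout described above can be read line by line: a line starting with > opens a new record, and every other non-empty line is appended to the current record's sequence. A minimal sketch, assuming the whole input fits in memory; the class name FastaLines and the method parse are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FastaLines {
    // Hypothetical helper: maps each ID (header text without '>') to its joined sequence.
    static Map<String, String> parse(String fasta) {
        Map<String, StringBuilder> records = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String line : fasta.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                // Header line: start a new record keyed by the text after '>'.
                current = new StringBuilder();
                records.put(line.substring(1).trim(), current);
            } else if (current != null) {
                // Sequence line: append to the record opened by the last header.
                current.append(line);
            }
        }
        // Convert the builders to plain strings, keeping file order.
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : records.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String fasta = ">ABC\nGGTTCC\nAATT\n>DEF\nAAAAA\n";
        // Print each record as "ID<TAB>sequence".
        for (Map.Entry<String, String> e : parse(fasta).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

A LinkedHashMap is used here so that records come back in the order they appear in the file.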
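In Java, the Pattern.DOTALL flag makes . match line terminators as well, so a pattern can run across several lines. In the example below the wrapped sequence lines are covered by the character class [ATCG\s], which already includes the newline; DOTALL only affects the . in the header group.
For example:
import java.util.regex.Pattern;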
import java.util.regex.Matcher;
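public class PatMatcher {
    public static void main(String[] args) {
        // One FASTA record: a header line followed by two sequence lines.
        String s = ">adsds\nATGGGC\nAAAAGG\n";
        // Group 1: the header, matched lazily up to a whitespace character.
        // Group 2: the bases together with the line breaks between them.
        Pattern p = Pattern.compile("(>.*?\\s)([ATCG\\s]+)", Pattern.DOTALL);
        Matcher mat = p.matcher(s);
        while (mat.find()) {
            // Prints the sequence lines with their newlines still in place.
            System.out.print(mat.group(2));
        }
    }
}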
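Here s holds a single contig record: the ID is adsds, and the bases ATGGGCAAAAGG are split across two lines.
After creating a Matcher, mat.group(2) returns the base sequence. The line breaks are still part of the match, so the program prints ATGGGC and AAAAGG on separate lines.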
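To see what Pattern.DOTALL changes, it may help to compare the same expression with and without the flag. In this sketch (the class name DotallDemo and the helper matches are only illustrative), . refuses to cross a newline by default, while a character class containing \s spans lines either way, which is what lets group 2 cover a wrapped sequence:

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // True if the whole input matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "ATG\nGGC";
        // Default mode: '.' does not match the line terminator.
        System.out.println(matches("ATG.GGC", 0, twoLines));
        // DOTALL: '.' matches any character, including '\n'.
        System.out.println(matches("ATG.GGC", Pattern.DOTALL, twoLines));
        // A character class with \s matches '\n' regardless of DOTALL.
        System.out.println(matches("[ATGC\\s]+", 0, twoLines));
    }
}
```

The three lines printed are false, true and true.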
To parse a whole FASTA file, read it into a string and apply the same pattern:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReadFa {
    public static void main(String[] args) throws IOException {
        StringBuilder s = new StringBuilder();
        String line;
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader r = new BufferedReader(new FileReader(args[0]))) {
            // readLine() strips the line terminator, so put "\n" back.
            while ((line = r.readLine()) != null) {
                s.append(line).append("\n");
            }
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
            return;
        }
        Pattern p = Pattern.compile("(>.*?\\s)([ATGC\\s]+)", Pattern.DOTALL);
        Matcher mat = p.matcher(s.toString());
        while (mat.find()) {
            String seq_id = mat.group(1);    // header plus the whitespace after it
            String sequence = mat.group(2);  // bases, still wrapped across lines
            System.out.print(seq_id);
            System.out.print(sequence);
        }
    }
}
Compile and run it:
javac ReadFa.java
java ReadFa test_A.fasta
This prints the ID of each sequence followed by its bases.
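The matched groups still contain the > marker and the line breaks. If a single unwrapped string per record is wanted, one option is to strip them after matching. The sketch below reuses the pattern from ReadFa; the class name CleanRecords and the method toMap are hypothetical, and it assumes sequences contain only the upper-case letters A, T, G and C, since any other character (N, or lower-case bases) would end a match early:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CleanRecords {
    // Same expression as in ReadFa, compiled once.
    private static final Pattern RECORD =
            Pattern.compile("(>.*?\\s)([ATGC\\s]+)", Pattern.DOTALL);

    // Hypothetical helper: ID (without '>') -> sequence with all whitespace removed.
    static Map<String, String> toMap(String fasta) {
        Map<String, String> records = new LinkedHashMap<>();
        Matcher mat = RECORD.matcher(fasta);
        while (mat.find()) {
            // Drop the leading '>' and the trailing whitespace from the header.
            String id = mat.group(1).substring(1).trim();
            // Join the wrapped lines into one continuous sequence.
            String sequence = mat.group(2).replaceAll("\\s+", "");
            records.put(id, sequence);
        }
        return records;
    }

    public static void main(String[] args) {
        String fasta = ">ABC\nGGTTCC\nAATT\n>DEF\nAAAAA\nCC\n";
        // Print ID, sequence length and sequence, separated by tabs.
        toMap(fasta).forEach((id, seq) ->
                System.out.println(id + "\t" + seq.length() + "\t" + seq));
    }
}
```

Keeping the records in a map makes it easy to look up one contig by ID or to compute per-sequence statistics such as length.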