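Bioinformatics, like computing in general, moves quickly. The classic software we relied on years ago may already be obsolete today, so keeping up with the field is important for anyone working in bioinformatics.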
When we tried to run an alignment with ClustalW2, the online tool hosted at EBI, we found that it had been retired.
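For multiple sequence alignment of protein sequences, the site recommends Clustal Omega
(https://www.ebi.ac.uk/Tools/msa/clustalo/)
For multiple sequence alignment of nucleotide sequences, it recommends MUSCLE
(https://www.ebi.ac.uk/Tools/msa/muscle/)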
This post covers how to use MUSCLE, one of the replacements for ClustalW2.
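The online version is linked above and is simple enough that I will not cover it here. The rest of this post focuses on the Linux command-line version, which is convenient for quickly aligning large batches of sequences.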
Among MUSCLE's parameters, the two main ones are -in and -out, which specify the input file and the output file respectively.
Basic commands:
muscle -in seq.fa -out seq.aligned.fa
muscle -in seq.fa -phyiout seq.aligned.phy (the MUSCLE command commonly used when preparing input for a maximum-likelihood (ML) phylogenetic tree)
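Since the command-line version is meant for large batches, the basic command can be wrapped in a loop. The following is only a minimal sketch, assuming the MUSCLE 3.x syntax used in this post (-in / -out) and input files ending in .fa. The function names aligned_name and align_dir, the directory names, and the DRY_RUN switch are hypothetical and are not part of MUSCLE.

```shell
#!/usr/bin/env bash
# Sketch: align every FASTA file in a directory with MUSCLE 3.x.
# Assumptions (not from the original post): inputs end in .fa; the
# function names and the DRY_RUN switch are illustrative only.

# Print the output path for an input file and an output directory,
# e.g. in/seq1.fa + out -> out/seq1.aligned.fa
aligned_name() {
    local input="$1" outdir="$2"
    local base
    base="$(basename "$input" .fa)"
    printf '%s/%s.aligned.fa\n' "$outdir" "$base"
}

# Align all *.fa files in directory $1 and write the results to $2.
# With DRY_RUN=1 the commands are printed instead of executed.
align_dir() {
    local indir="$1" outdir="$2"
    local input output
    mkdir -p "$outdir"
    for input in "$indir"/*.fa; do
        [ -e "$input" ] || continue   # skip when the glob matches nothing
        output="$(aligned_name "$input" "$outdir")"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "muscle -in $input -out $output"
        else
            muscle -in "$input" -out "$output"
        fi
    done
}

# Example (prints the commands without running MUSCLE):
#   DRY_RUN=1 align_dir fasta_in fasta_out
```

Running with DRY_RUN=1 first is a cheap way to check the file names before starting a long batch; without it, muscle must be installed and on the PATH.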
All of MUSCLE's output formats are listed below. To use one, follow the commands above and substitute the corresponding output option.
-clwout filename: CLUSTALW format. By default, will write MUSCLE as the program name in the file header. If the -clwstrict option is specified, then the program name will be written as 'CLUSTAL W (1.81)'. This is useful if the output will be parsed by scripts that check the program name.
-fastaout filename: FASTA format (default).
-htmlout filename: HTML (web page) output. The alignment is colored using a color scheme from Erik Sonnhammer's Belvu editor.
-physout filename: PHYLIP sequential format.
-phyiout filename: PHYLIP interleaved format.
-msfout filename: MSF format, as used in the GCG package, is requested by using the -msf option. As with CLUSTALW format, this is easier for people to read than FASTA. As of MUSCLE 3.52, the MSF format has been tweaked to be more compatible with GCG. The following differences remain.
(a) MUSCLE truncates labels at the first white space or after 63 characters, whichever comes first. The GCG package apparently truncates after 10 characters. If this is a problem for you, please let me know and I'll add an option to truncate after 10 in a future version.
(b) MUSCLE allows duplicate sequence labels, while GCG forbids duplicates. If you use the -stable option of muscle, then the order of the input sequences is preserved and sequences can be unambiguously identified even if the labels differ.
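To illustrate "substitute the corresponding output option", the flags in the list above can be put behind a small lookup. This is only a sketch: the short format names (clw, fasta, html, phys, phyi, msf) and the functions muscle_out_flag and muscle_cmd are hypothetical helpers made up for this example, while the flags themselves are the ones listed above for MUSCLE 3.x.

```shell
#!/usr/bin/env bash
# Sketch: map a short format name to the MUSCLE 3.x output flag.
# The short names and function names are hypothetical labels for this example.

muscle_out_flag() {
    case "$1" in
        clw)   echo "-clwout" ;;     # CLUSTALW format
        fasta) echo "-fastaout" ;;   # FASTA format (default)
        html)  echo "-htmlout" ;;    # HTML output
        phys)  echo "-physout" ;;    # PHYLIP sequential
        phyi)  echo "-phyiout" ;;    # PHYLIP interleaved
        msf)   echo "-msfout" ;;     # GCG MSF format
        *)     echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Print the MUSCLE command for: input file, format name, output file.
muscle_cmd() {
    local flag
    flag="$(muscle_out_flag "$2")" || return 1
    echo "muscle -in $1 $flag $3"
}

# Example:
#   muscle_cmd seq.fa phyi seq.aligned.phy
```

The helper only prints the command, so the result can be inspected before it is run or pasted into a batch script.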