MCScanX Collinearity Analysis Software: Installation, Error Fixes, and Usage

awk_bioinfo

Original link: https://www.jianshu.com/p/41f7842140c6?tdsourcetag=s_pcqq_aiomsg

MCScanX: installation, error fixes, and basic usage

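1. Introduction:
(1) Overview:
MCScanX uses an improved version of the MCScan algorithm to detect collinear blocks within a genome or between genomes. It takes the BLASTP alignment results for the proteins (or nucleotide sequences) of two species, combines them with the positions of the corresponding genes in the genome (a processed gff file), and reports the collinear blocks between the two genomes. To find collinear blocks within a single genome, simply align that species' proteins against themselves.
Manual: http://chibba.pgml.uga.edu/mcscan2/documentation/manual.pdf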

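The package has two parts: (1) the MCScan algorithm and (2) downstream visualization and analysis. It currently runs on Mac OS (Xcode must be installed first) and Linux (requires the Java SE Development Kit and libpng). MCScanX ships three core programs, MCScanX, MCScanX_h, and duplicate_gene_classifier, located in the main folder, plus 12 downstream analysis programs in the downstream_analyses folder. Note: among the extended versions, MCScanX-transposed, released in 2013, detects transposed gene duplications within or between genomes.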

(2) Publication: Wang Y, Tang H, DeBarry J D, et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity[J]. Nucleic Acids Research, 2012, 40(7): e49.

2. Installation
pengzw@super-server:~$ wget http://chibba.pgml.uga.edu/mcscan2/MCScanX.zip
pengzw@super-server:~$ unzip MCScanX.zip -d ~/biosoft/ # install under ~/biosoft/
pengzw@super-server:~$ cd ~/biosoft/MCScanX
pengzw@super-server:~/biosoft/MCScanX$ make # may fail with the errors listed further down; fix the source files, then run make again

pengzw@super-server:~/biosoft/MCScanX$ make ## output like the following means the build succeeded
g++ struct.cc mcscan.cc read_data.cc out_utils.cc dagchainer.cc msa.cc permutation.cc -o MCScanX
g++ struct.cc mcscan_h.cc read_homology.cc out_homology.cc dagchainer.cc msa.cc permutation.cc -o MCScanX_h
g++ struct.cc dup_classifier.cc read_data.cc out_utils.cc dagchainer.cc cls.cc permutation.cc -o duplicate_gene_classifier
g++ dissect_multiple_alignment.cc -o downstream_analyses/dissect_multiple_alignment
g++ detect_collinear_tandem_arrays.cc -o downstream_analyses/detect_collinear_tandem_arrays
cd downstream_analyses/ && make
make[1]: Entering directory '/home/pengzw/biosoft/MCScanX/downstream_analyses'
javac -g family_circle_plotter.java
javac -g dot_plotter.java
javac -g family_tree_plotter.java
javac -g family_tree_plotter_show_length.java
javac -g bar_plotter.java
javac -g dual_synteny_plotter.java
javac -g circle_plotter.java
javac -g family_tree_plotter_chr.java
make[1]: Leaving directory '/home/pengzw/biosoft/MCScanX/downstream_analyses'

pengzw@super-server:~/biosoft/MCScanX$ echo 'PATH=$PATH:~/biosoft/MCScanX/' >> ~/.bashrc
pengzw@super-server:~/biosoft/MCScanX$ source ~/.bashrc
pengzw@super-server:~/biosoft/MCScanX$ MCScanX
Build errors (shown as a screenshot in the original post): these are attributed to MCScanX not supporting 64-bit systems out of the box. To build it on a 64-bit system, add the missing header includes as follows.

Error 1: "msa.cc:289:9: error: 'chdir' was not declared in this scope"
Fix: open msa.cc and add #include <unistd.h> at the top.

Error 2: "dissect_multiple_alignment.cc:252:44: error: 'getopt' was not declared in this scope"
Fix: open dissect_multiple_alignment.cc and add #include <getopt.h> at the top.

Error 3: "detect_collinear_tandem_arrays.cc:286:17: error: 'getopt' was not declared in this scope"
Fix: open detect_collinear_tandem_arrays.cc and add #include <getopt.h> at the top.

Error 4: "make[1]: javac: Command not found"
Fix: download the JDK from https://www.oracle.com/technetwork/java/javase/downloads/index.html and set up a Java environment.
If you have sudo privileges, just install it with apt, because I really am that lazy.

pengzw@super-server:~$ sudo apt install openjdk-8-jdk
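
Errors 1 to 3 above are all fixed the same way, by adding one include line at the top of a source file. The following is a minimal Python sketch of that edit, not part of MCScanX itself. The helper names `add_include` and `patch_tree` are hypothetical, and the sketch assumes the three .cc files sit directly in the MCScanX folder, as the g++ lines in the build log suggest.

```python
from pathlib import Path

# Source file -> header to add, taken from the fixes for errors 1-3.
FIXES = {
    "msa.cc": "unistd.h",
    "dissect_multiple_alignment.cc": "getopt.h",
    "detect_collinear_tandem_arrays.cc": "getopt.h",
}


def add_include(source: str, header: str) -> str:
    """Return source with '#include <header>' prepended, unless it is already there."""
    directive = f"#include <{header}>"
    if directive in source:
        return source
    return directive + "\n" + source


def patch_tree(root: str) -> None:
    """Hypothetical helper: apply FIXES in place to the files under root."""
    for name, header in FIXES.items():
        path = Path(root) / name
        path.write_text(add_include(path.read_text(), header))
```

Calling `patch_tree` with the expanded path of ~/biosoft/MCScanX and then running make again should be equivalent to the manual edits. Running it twice is harmless because an existing include is left alone.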
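3. Usage
(1) Prepare the file xyz.gff
The gff file MCScanX expects differs from a standard gff file: it has only four columns. In "sp#", sp is a two-letter abbreviation of the species name (longer abbreviations do not seem to affect the result) and # is the chromosome number. "gene" must be the gene name used for your protein sequences.

sp# gene starting_position ending_position
In a gff3 file the ninth column holds attributes joined by "=", so the four columns can be extracted with awk by specifying multiple field separators.

pengzw@super-server:~/reference/At$ awk -F "[= \t]" '$3 == "gene" {print $1"\t"$11"\t"$4"\t"$5}' Athaliana_167…
(2) Prepare the file xyz.blast with BLAST+
pengzw@super-server:~$ sudo apt install ncbi-blast+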

Or install it locally:

pengzw@super-server:~$ mkdir biosoft
pengzw@super-server:~$ wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.8.1+-x64-linux.tar.gz
pengzw@super-server:~$ tar zxvf ncbi-blast-2.8.1+-x64-linux.tar.gz -C ~/biosoft/
pengzw@super-server:~$ cd ~/biosoft/ncbi-blast-2.8.1+/bin
pengzw@super-server:~/biosoft/ncbi-blast-2.8.1+/bin$ ls # executables are shown in green
blastdb_aliastool blast_formatter blastx dustmasker makeblastdb psiblast segmasker update_blastdb.pl
blastdbcheck blastn convert2blastmask get_species_taxids.sh makembindex rpsblast tblastn windowmasker
blastdbcmd blastp deltablast legacy_blast.pl makeprofiledb rpstblastn tblastx
pengzw@super-server:~/biosoft/ncbi-blast-2.8.1+/bin$ echo 'PATH=$PATH:~/biosoft/ncbi-blast-2.8.1+/bin/' >> ~/.bashrc
pengzw@super-server:~/biosoft/ncbi-blast-2.8.1+/bin$ source ~/.bashrc
pengzw@super-server:~/biosoft/ncbi-blast-2.8.1+/bin$ makeblastdb
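Build the BLAST database (index):
makeblastdb -in refpep.fa -dbtype prot -parse_seqids -out refpep.db
options:
-in: sequence file to be formatted; must be in FASTA format
-dbtype: database type, prot or nucl
-parse_seqids: parse the sequence identifiers; usually worth adding
-out: name of the database index
…

BLAST alignment:
blastp -query yourpep.fa -db refpep.db -out xyz.blast -evalue 1e-10 -num_threads 24 -outfmt 6 -num_alignments 5
options:
-evalue: E-value threshold; 1e-10 is the value recommended on the official site
-num_threads 24: use 24 threads
-num_alignments 5: keep the 5 best alignments
-outfmt 6: output format; 6 is the tab-delimited tabular format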

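Notes:
1. Preprocess the sequences so that only one transcript is kept per gene.
2. Make sure the IDs in the gff and BLAST files match; otherwise there is no result (0 matches).
3. blastp format 6 has 12 columns:

1        2     3          4       5         6    7          8          9       10      11       12
queryID  dbID  identity%  length  mismatch  gap  querypos1  querypos2  dbpos1  dbpos2  e-value  bit-score
4. To analyze collinearity both within and between genomes, first merge the two genomes: cat 1genome.fa 2genome.fa > all.fa, then build the index from all.fa and run the alignment with all.fa. The gff files must be merged as well: cat 1genome.gff 2genome.gff > all.gff.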
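
Note 2 only holds if the gene column of the four-column gff carries the same IDs that appear in the BLAST output. As a rough alternative to the awk one-liner in step (1), here is a minimal Python sketch of the gff3 conversion. It makes several assumptions: the gene name lives in a `Name` attribute (falling back to `ID`), chromosomes are named like Chr1, and the "sp#" label is built by swapping the "Chr" prefix for a species abbreviation. The function name `gff3_to_mcscanx` is hypothetical.

```python
import re


def gff3_to_mcscanx(gff3_lines, species_prefix, id_key="Name"):
    """Return (sp#, gene, start, end) tuples for every 'gene' feature."""
    rows = []
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # keep gene features only, as the awk filter does
        # Column 9 is a ';'-separated list of key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        gene_id = attrs.get(id_key) or attrs.get("ID")
        if gene_id is None:
            continue
        # Assumption: "Chr1" becomes "at1" for species_prefix "at".
        chrom = species_prefix + re.sub(r"^chr", "", cols[0], flags=re.IGNORECASE)
        rows.append((chrom, gene_id, cols[3], cols[4]))
    return rows


sample = [
    "##gff-version 3",
    "Chr1\tphytozome\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010;Name=AT1G01010",
    "Chr1\tphytozome\tmRNA\t3631\t5899\t.\t+\t.\tID=AT1G01010.1;Parent=AT1G01010",
]
for row in gff3_to_mcscanx(sample, "at"):
    print("\t".join(row))
```

Parsing the attribute column by key rather than by field position keeps the conversion working when the attribute order differs between annotation releases.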

(3) Detect collinear blocks with MCScanX:
./MCScanX dir/xyz # xyz.blast and xyz.gff must be in the same folder
options:
[Usage] ./bin/mcscan2 prefix_fn [options]
-k MATCH_SCORE, final score=MATCH_SCORE+NUM_GAPS*GAP_PENALTY
(default: 50)
-g GAP_PENALTY, gap penalty (default: -1)
-s MATCH_SIZE, number of genes required to call a collinear block
(default: 5)
-e E_VALUE, alignment significance (default: 1e-05)
-m MAX_GAPS, maximum gaps allowed (default: 25)
-a only builds the pairwise blocks (.aligns file)
-b patterns of collinear blocks. 0:intra- and inter-species (default); 1:intra-species; 2:inter-species
-h print this help page
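
When MCScanX is driven from a script, the options above can be assembled programmatically. This is a sketch only: the function `build_mcscanx_command` and its parameter names are hypothetical, while the flags and defaults are copied from the help text above. It also assumes the binary is invoked as ./MCScanX, as in the command in this step.

```python
def build_mcscanx_command(prefix, match_score=50, gap_penalty=-1, match_size=5,
                          e_value=1e-05, max_gaps=25, block_pattern=0,
                          pairwise_only=False, executable="./MCScanX"):
    """Build an argument list for MCScanX from the options in its help text."""
    if block_pattern not in (0, 1, 2):
        raise ValueError("block_pattern must be 0, 1 or 2")
    cmd = [
        executable, prefix,
        "-k", str(match_score),    # MATCH_SCORE
        "-g", str(gap_penalty),    # GAP_PENALTY
        "-s", str(match_size),     # genes required to call a collinear block
        "-e", str(e_value),        # alignment significance
        "-m", str(max_gaps),       # maximum gaps allowed
        "-b", str(block_pattern),  # 0: both, 1: intra-species, 2: inter-species
    ]
    if pairwise_only:
        cmd.append("-a")           # only build the pairwise blocks
    return cmd


print(" ".join(build_mcscanx_command("dir/xyz", block_pattern=2)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing a list instead of a shell string avoids quoting problems when the prefix path contains spaces.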

[Figure: MCScanX run output]

4. Results
Note: if you see "0 matches imported (xxxxx discarded)", the gene names in your gff file do not correspond to the gene names in the BLAST results.
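
A quick way to diagnose the "0 matches" case is to list the BLAST IDs that never occur in the gff. The sketch below assumes the four-column gff described in step (1) and the 12-column tabular BLAST output; the function name `unmatched_blast_ids` is hypothetical.

```python
def unmatched_blast_ids(gff_lines, blast_lines):
    """Return the set of query/subject IDs in the BLAST lines that the gff lacks."""
    # Column 2 of the MCScanX gff is the gene name.
    gff_ids = {
        cols[1] for cols in (line.split() for line in gff_lines) if len(cols) >= 4
    }
    missing = set()
    for line in blast_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue
        for gene in cols[:2]:  # queryID and dbID
            if gene.strip() not in gff_ids:
                missing.add(gene.strip())
    return missing


gff = ["at1\tAT1G01010\t3631\t5899", "at1\tAT1G01020\t6788\t9130"]
blast = [
    "AT1G01010.1\tAT1G01020\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t400",
]
print(sorted(unmatched_blast_ids(gff, blast)))
```

In the sample data the query ID carries a transcript suffix (.1) that the gff gene name lacks, which is a typical cause of the mismatch.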

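The program writes three output files:
Filename            Description
xyz.collinearity    Collinear block data. The blocks can be within one species or between species.
xyz.html            Viewable in a web browser; shows the collinearity status along each chromosome. Gray marks the chromosome sequence, red marks tandem genes on the chromosome, and yellow marks collinear genes.
xyz.tandem          Tandem gene data: two or more homologous genes located right next to each other in the genome.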
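
To work with xyz.collinearity outside the bundled tools, the blocks can be read into plain Python structures. This sketch rests on an assumption about the file layout that should be checked against your own output: each block starts with a line beginning "## Alignment", other lines beginning with "#" are parameters or statistics, and each gene pair line is tab-separated with the two gene names in the second and third fields. The function name `read_collinear_blocks` is hypothetical.

```python
def read_collinear_blocks(lines):
    """Group gene pairs by collinear block (layout assumed, see lead-in)."""
    blocks = []
    for line in lines:
        if line.startswith("## Alignment"):
            # New block; keep the header text for score, e_value, N, etc.
            blocks.append({"header": line[2:].strip(), "pairs": []})
        elif line.startswith("#") or not line.strip():
            continue  # parameter/statistics lines and blanks
        elif blocks:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                blocks[-1]["pairs"].append((fields[1].strip(), fields[2].strip()))
    return blocks


sample_collinearity = [
    "############### Parameters ###############",
    "# MATCH_SCORE: 50",
    "## Alignment 0: score=250.0 e_value=1e-10 N=2 at1&at2 plus",
    "  0-  0:\tAT1G01010\tAT2G01010\t  1e-50",
    "  0-  1:\tAT1G01020\tAT2G01020\t  2e-40",
]
for block in read_collinear_blocks(sample_collinearity):
    print(block["header"], len(block["pairs"]))
```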
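Core programs:
MCScanX                      Detects collinear blocks and aligns them to reference chromosomes.
MCScanX_h                    Similar to MCScanX, but the input is a tab-delimited file of pairwise homologous genes.
duplicate_gene_classifier    Classifies genes by duplication type.
5. Downstream analysis and visualization
The 12 downstream analysis programs: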
1.detect_collinear_tandem_arrays
2.dissect_multiple_alignment
3.add_ka_and_ks_to_collinearity.pl
4.group_collinear_genes.pl
5.detect_collinearity_within_gene_families.pl
6.origin_enrichment_analysis.pl
7.dot_plotter.java
8.dual_synteny_plotter.java
9.circle_plotter.java
10.bar_plotter.java
11.family_circle_plotter.java
12.family_tree_plotter.java

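(1) duplicate_gene_classifier (gene classification):
The codes 0, 1, 2, 3, and 4 stand for five categories:
0: singleton (not duplicated)
1: dispersed (duplicates that are not type 2, 3, or 4)
2: proximal (duplicates close together on a chromosome but not adjacent)
3: tandem (tandem duplicates)
4: WGD/segmental (collinear genes in collinear blocks)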
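
The numeric codes are easier to read once tallied by name. The sketch below assumes the classifier writes a two-column, whitespace-separated file (gene name, then code 0 to 4), which is believed to be named xyz.gene_type; verify this against your own output. The function name `count_gene_types` is hypothetical.

```python
from collections import Counter

# Code -> category, as listed above.
TYPE_NAMES = {
    0: "singleton",
    1: "dispersed",
    2: "proximal",
    3: "tandem",
    4: "WGD/segmental",
}


def count_gene_types(lines):
    """Count genes per duplication category from 'gene<TAB>code' lines."""
    counts = Counter()
    for line in lines:
        cols = line.split()
        if len(cols) < 2 or not cols[1].isdigit():
            continue  # skip headers or malformed lines
        counts[TYPE_NAMES.get(int(cols[1]), "unknown")] += 1
    return counts


sample_types = ["AT1G01010\t0", "AT1G01020\t4", "AT1G01030\t4", "AT1G01040\t3"]
for name, n in sorted(count_gene_types(sample_types).items()):
    print(name, n)
```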

(2) detect_collinearity_within_gene_families.pl: extract the results for a gene family
1) Prepare gene_family_file: a tab-delimited txt file
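
The original post shows the family file only as a screenshot. The sketch below therefore assumes one family per line, with the family name first and the member genes after it, all separated by tabs. Check this layout against the MCScanX manual before relying on it. The function name `format_gene_family_lines` and the sample family are hypothetical.

```python
def format_gene_family_lines(families):
    """Turn {family_name: [gene, ...]} into tab-delimited lines (assumed layout)."""
    return ["\t".join([name, *genes]) for name, genes in families.items()]


families = {"MADS": ["AT1G01010", "AT2G01010", "AT3G01010"]}
with open("gene_family_file.txt", "w") as handle:
    for line in format_gene_family_lines(families):
        handle.write(line + "\n")
```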

2) Usage of detect_collinearity_within_gene_families.pl, which yields the duplicated gene pairs:

perl detect_collinearity_within_gene_families.pl -i gene_family_file.txt -d xyz.collinearity -o output_file
3) Classify the duplicated gene pairs of the gene family:
Install:

wget http://chibba.pgml.uga.edu/mcscan2/transposed/MCScanX-transposed.zip
unzip MCScanX-transposed.zip -d ~/biosoft/
cd ~/biosoft/MCScanX-transposed
make
Extract the results:

perl MCScanX-transposed.pl -i data -t at -c al,br,cp,pt,vv -o result/at_result -x 3
Plotting:
Follow the manual to draw the figures.
If the results are unsatisfactory, export the analysis results and plot them with Circos.