Roary: Input Data, Parameter Settings, and Output Files

Ai.大萝北

Last modified 2024-12-13 20:40:45

First published 2021-10-13 10:56:13

Article link: https://blog.csdn.net/luobiubiu/article/details/120739380

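1. Data input

(1) Roary takes GFF files as input, and each file must contain the nucleotide sequence as well as the annotation. GFF files downloaded from NCBI do not include the nucleotide sequence, so they cannot be used directly. GFF files produced by Prokka do include it, so one workable route is to download the fna files from NCBI, annotate them with Prokka, and then run Roary on the resulting GFF files.
(2) Put all GFF files that take part in the analysis into one folder.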
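
The two steps above can be sketched as a small shell function. This is only a minimal sketch: the function name annotate_genomes, the PROKKA_BIN variable and the directory arguments are illustrative assumptions, and it assumes Prokka is installed and that the genome files end in .fna. Only the basic Prokka options --outdir and --prefix are used.

```shell
# Sketch: annotate every .fna file with Prokka and collect the GFF files
# in one folder for Roary. All names below are illustrative, not part of Roary.
set -eu

annotate_genomes() {
    local fna_dir=$1    # folder with the downloaded .fna files
    local work_dir=$2   # one Prokka output folder per genome is created here
    local gff_dir=$3    # folder that will hold all GFF files for Roary
    local prokka=${PROKKA_BIN:-prokka}
    local fna sample

    mkdir -p "$work_dir" "$gff_dir"
    for fna in "$fna_dir"/*.fna; do
        [ -e "$fna" ] || continue              # skip if no .fna file matched
        sample=$(basename "$fna" .fna)
        # Prokka writes <prefix>.gff (annotation plus sequence) into --outdir
        "$prokka" --outdir "$work_dir/$sample" --prefix "$sample" "$fna"
        cp "$work_dir/$sample/$sample.gff" "$gff_dir/"
    done
}

# Example call (uncomment to run):
# annotate_genomes genomes prokka_out gff_input
```

After this, running roary inside gff_input (or passing gff_input/*.gff) picks up every annotated genome.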

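2. Parameter settings

(1) Roary uses blastp to identify orthologs among the sequences in the GFF files and produces the pan genome and core genome results. It can also align the core genes (with PRANK by default, or with MAFFT for speed), and the resulting alignment can then be used to build a phylogenetic tree.
(2) Options: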

Usage:   roary [options] *.gff

Options: -p INT    number of threads [1]
         -o STR    clusters output filename [clustered_proteins]
         -f STR    output directory [.]
         -e        create a multiFASTA alignment of core genes using PRANK
         -n        fast core gene alignment with MAFFT, use with -e
         -i        minimum percentage identity for blastp [95]
         -cd FLOAT percentage of isolates a gene must be in to be core [99]
         -qc       generate QC report with Kraken
         -k STR    path to Kraken database for QC, use with -qc
         -a        check dependancies and print versions
         -b STR    blastp executable [blastp]
         -c STR    mcl executable [mcl]
         -d STR    mcxdeblast executable [mcxdeblast]
         -g INT    maximum number of clusters [50000]
         -m STR    makeblastdb executable [makeblastdb]
         -r        create R plots, requires R and ggplot2
         -s        dont split paralogs
         -t INT    translation table [11]
         -ap       allow paralogs in core alignment
         -z        dont delete intermediate files
         -v        verbose output to STDOUT
         -w        print version and exit
         -y        add gene inference information to spreadsheet, doesnt work with -e
         -iv STR   Change the MCL inflation value [1.5]
         -h        this help message

Default usage – create a pan genome without a core alignment

roary *.gff

Quickly generate a core gene alignment using 8 threads:

roary -e --mafft -p 8 *.gff

Save results to a different directory:

roary -f output_dir *.gff

Change the minimum blastp percentage identity. It is not advised to go below 90% unless you know what you're doing.

roary -i 90 *.gff

Run a QC check to see if all the samples are what you think they are

roary -qc -k /path/to/kraken/db *.gff

Don't split clusters containing paralogs:

roary -s *.gff

Check that the software is installed correctly.

roary -a
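
As a rough illustration of how the options combine, the sketch below runs Roary with a MAFFT core alignment (-e -n) and then builds a tree from core_gene_alignment.aln. Tree building is not done by these Roary options; here it is assumed that FastTree is installed separately and used for that step. The function name core_tree and the ROARY_BIN and FASTTREE_BIN variables are illustrative assumptions.

```shell
# Sketch: pan genome with a MAFFT core alignment, then a tree from that
# alignment with FastTree (an external tool, assumed to be installed).
# All names below are illustrative, not part of Roary.
set -eu

core_tree() {
    local gff_dir=$1         # folder holding all input GFF files
    local out_dir=$2         # assumed not to exist yet, so results land exactly here
    local threads=${3:-8}
    local roary=${ROARY_BIN:-roary}
    local fasttree=${FASTTREE_BIN:-FastTree}

    # -e -n: core gene alignment with MAFFT; -f: output directory
    "$roary" -e -n -p "$threads" -f "$out_dir" "$gff_dir"/*.gff
    # -nt: nucleotide alignment; -gtr: generalized time-reversible model
    "$fasttree" -nt -gtr "$out_dir/core_gene_alignment.aln" \
        > "$out_dir/core_gene_tree.newick"
}

# Example call (uncomment to run):
# core_tree gff_input roary_out 8
```

The resulting core_gene_tree.newick can be opened in any Newick tree viewer.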

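3. Output files

Output files generated by Roary:

File                                Contents
accessory_binary_genes.fa           Binary presence/absence of accessory genes, in FASTA format
accessory_binary_genes.fa.newick    Newick tree built from the binary presence/absence of accessory genes
accessory_graph.dot                 Graph of the accessory genes, in DOT format
accessory.header.embl               Header information for the accessory genes, in EMBL format
accessory.tab                       Accessory gene information
blast_identity_frequency.Rtab       Frequency of blastp percentage identities, as a table readable in R
clustered_proteins                  Clustered proteins
core_accessory_graph.dot            Graph of the core and accessory genes, in DOT format
core_accessory.header.embl          Header information for the core and accessory genes, in EMBL format
core_accessory.tab                  Core and accessory genes and the genomes they occur in
core_alignment_header.embl          Header information for the core alignment, in EMBL format
core_gene_alignment.aln             Multiple sequence alignment of the core genes (written when -e is used)
core_gene_alignment.aln.reduced     Core gene alignment with redundant data removed
gene_presence_absence.csv           Presence or absence of each gene in each genome, in CSV format
gene_presence_absence.Rtab          Presence or absence of each gene in each genome, as a 0/1 matrix in Rtab format
number_of_conserved_genes.Rtab      Number of genes shared by all genomes as the number of genomes increases
number_of_genes_in_pan_genome.Rtab  Total number of genes in the pan genome as the number of genomes increases
number_of_new_genes.Rtab            Number of new genes added as the number of genomes increases
number_of_unique_genes.Rtab         Number of unique genes as the number of genomes increases
pan_genome_reference.fa             Pan genome reference sequences
summary_statistics.txt              Gene counts from the pan genome analysis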
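
To show how the presence/absence matrix can be read, here is a small sketch that counts how many genes occur in every genome. It assumes gene_presence_absence.Rtab is tab-separated, with a header row (gene name, then one column per genome) and 0/1 values below it. The function name count_shared_genes is an illustrative assumption. Note that it counts genes present in 100% of genomes, which is stricter than Roary's default core threshold of 99% (-cd).

```shell
# Sketch: count genes present in every genome from the 0/1 matrix.
# count_shared_genes is an illustrative name, not part of Roary.
set -eu

count_shared_genes() {
    local rtab=$1
    awk -F'\t' '
        NR == 1 { n = NF - 1; next }   # header: gene column, then one column per genome
        {
            present = 0
            for (i = 2; i <= NF; i++) present += $i   # sum the 0/1 flags of this gene
            if (present == n) shared++
            total++
        }
        END {
            printf "genomes=%d total_genes=%d genes_in_all_genomes=%d\n", n, total, shared
        }
    ' "$rtab"
}

# Example call (uncomment to run):
# count_shared_genes gene_presence_absence.Rtab
```

The numbers can be compared against summary_statistics.txt as a quick sanity check.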

References

Roary official website

Roary analysis, Sequencing Laboratory, Hangzhou Center for Disease Control and Prevention