genome comparison commend 1

Xsisikk

已于 2023-05-15 09:19:52 修改

阅读量400

点赞数 1

分类专栏：比较基因组文章标签： linux 运维服务器 ubuntu

于 2023-04-22 16:08:01 首次发布

本文链接：https://blog.csdn.net/weixin_58269397/article/details/130305380

版权

比较基因组专栏收录该内容

1 篇文章 1 订阅

订阅专栏

本文详细介绍了比较基因组学的分析流程，包括基因序列的下载与格式修正，使用OrthoFinder进行直系同源基因聚类，以及基于单拷贝基因构建物种进化树的方法。整个过程涉及多个生物信息学工具的使用，如OrthoFinder、GFF3清理、MAFFT和RAxML。

摘要由CSDN通过智能技术生成

仅本人练习使用！！后续会逐渐修改！！

一文读懂比较基因组学 - 简书
 比较基因组学分析（Comparative Genomics Analysis）_比较基因组分析_荞麦agan的博客-CSDN博客『比较基因组学』直系同源基因寻找和基因家族扩张与收缩分析 - 知乎

比较基因组学分析（Comparative Genomics Analysis）_比较基因组分析_荞麦agan的博客-CSDN博客

1. 准备比较基因分析多物种的FASTA序列文件

系统进化树的构建 - 简书

比较基因组分析-数据准备 - 简书

mkdir a.preparing_data && cd a.preparing_data

1) 下载基因组文件

通过NCBI的Genome数据库(https://www.ncbi.nlm.nih.gov/genome)和JGI数据库查找目标物种的信息，并下载pep.fasta，genome.fasta，genome.gff3等。

2) 手工生成文件source.txt

该文件前2列为：物种缩写名（推荐为属名前2个字母+种名前3个字母）、双名法拉丁文物种名。

（可做）该文件前第一列用于后续的数据文件名前缀、第二列用于后续将结果中的物种名缩写转换成物种名全称、第三列用于数据的批量化下载。此外，手动编辑文件source.txt时，可以生成更多的列，记录其它基因组信息。

3) 对NCBI的GFF文件进行格式修正。

默认下NCBI的GFF格式不太正确，且其基因ID编号比较乱。

对NCBI的GFF3文件进行整理，仅保留含有mRNA信息的基因，修正尾部的CDS（有些NCBI的注释结果中最后一个CDS是不包含终止密码子的），修正同一条链上存在重叠的基因情况，并对基因id进行重命名。

最后，得到所有基因的Protein和CDS序列。若一个基因存在多个可变剪接，则仅保留CDS最长的转录本信息。

# 批量
for i in `cut -f 1 source.txt`
do
    echo "gzip -dc $i.genomic.gff.gz > $i.genome.gff; gzip -dc $i.genomic.fna.gz > $i.genome.fasta; GFF3Clear --coverage 0.3  --gene_code_length 5 --gene_prefix $i --genome $i.genome.fasta $i.genome.gff > $i.geneModels.gff3 2> $i.GFF3Clear.log; gff3ToGtf.pl $i.genome.fasta $i.geneModels.gff3 > $i.geneModels.gtf 2> $i.gff3ToGtf.log; eukaryotic_gene_model_statistics.pl $i.geneModels.gtf $i.genome.fasta $i > $i.statistics"
done > command.geneModels.list
ParaFly -c command.geneModels.list -CPU 6

生信基因功能分析工具：Orthofinder使用教程 - 简书

使用OrthoFinder进行基因家族分析_徐洲更hoptop的博客-CSDN博客

https://www.cnblogs.com/abysw/p/14120408.html

OrthoFinder 进行直系同源基因分析教程 - 简书

OrthoFinder寻找同源基因并建树（1） - 简书

OrthoFinder 安装和使用 - 知乎

用OrthoFinder做直系同源推断 | 生信技工

OrthoFinder2：直系同源基因的寻找以及Orthogroup构建 - 简书

OrthoFinder介绍:比较基因组学系统发育关系推断（单拷贝基因构树）_哔哩哔哩_bilibili

2. 使用OrthoFinder进行同源基因聚类分析

mkdir b.OrthoFinder && cd b.OrthoFinder

1) 准备输入数据

输入一个文件夹，包含各物种全基因组Proteins序列的Fasta文件。

# pep
mkdir compliantFasta && cd compliantFasta
for i in `cut -f 1 ../source.txt`
do
    echo "perl -p -e 's/>/>$i\|/' ../$i.pep.fasta > $i.fasta"
done | sh
cd ..
# CDS
mkdir compliantFasta_CDS && cd compliantFasta_CDS
for i in `cut -f 1 ../../a.preparing_data/source.txt`
do
    echo "perl -p -e 's/>/>$i\|/' ../../a.preparing_data/$i.CDS.fasta > $i.fasta"
done | sh
cd ..

2) 使用OrthoFinder进行直系同源基因分析

# 进入conda环境
sudo su
password
conda activate orthofinder
# 进入相应文件夹
cd b.OrthoFinder

① 默认设置下输出文件夹在输入文件夹里面，使用-o参数设置输出文件夹路径。使用-op参数设置运行diamond前终止，以手动运行diamond并设置更严格的BLASTP阈值。

orthofinder -f compliantFasta -o OrthoFinder -op > command.diamond.list

② 修改diamond参数，手动BLASTP分析并过滤比对结果

perl -p -i -e 'if (m/diamond blastp/ ) { s/$/ --outfmt 5 --max-target-seqs 500 --id 10/; s/--compress 1//; s/.txt/.xml/; s/ -e \S+/ --evalue 1e-5/; s/ -p 1 / -p 4 /; } else { s/^.*\n$//; }' command.diamond.list
ParaFly -c command.diamond.list -CPU 2
perl -e 'while (<>) { print "parsing_blast_result.pl --no-header --max-hit-num 500 --evalue 1e-9 --CIP 0.3 --query-coverage 0.5 --db-query 0.5 $1.xml | gzip -c - > $1.txt.gz\n" if m/(\S+).xml/; }' command.diamond.list3 > command.parsing_blast_result.list3
ParaFly -c command.parsing_blast_result.list3 -CPU 18

③ 继续运行OrthoFinder命令进行直系同源基因分析

OrthoFinderWorkingDir=`head -n 1 command.diamond.list3 | perl -ne 'print $1 if m/-d (\S+)\//'`
orthofinder -b $OrthoFinderWorkingDir -og

④ 得到直系同源基因聚类结果groups.txt

/media/aa/DATA/SZQ2/command/clf/bin/orthomcl_mcl2Groups.pl OrthoFinder3/*/WorkingDirectory/OrthoFinder/*/Orthogroups/Orthogroups.txt
mv groups.OG.txt groups.OG3.txt
mv groups.PG.txt groups.PG3.txt
mv groups.filtered.txt groups.filtered3.txt
# 软连接
ln -sf groups.OG3.txt groups3.txt

⑤ 进行同源基因数量的统计

cat groups.OG3.txt groups.PG3.txt groups.filtered3.txt > groups.All3.txt
/media/aa/DATA/SZQ2/command/clf/bin/orthomcl_genes_number_stats.pl groups.All3.txt compliantFasta3/compliantFasta > genes_number_stats3.txt
cd ..

3. 使用单拷贝同源基因构建物种树

「生信」同源基因分析——OrthoMCL - 简书

寻找同源基因工具OrthoMCL与OrthoFinder的安装与使用 - 简书

orthomcl 安装实战 - 简书

OrthoMCL介绍 | 陈连福的生信博客

Orthomcl使用说明_海蜒的博客-CSDN博客

使用 RAxML 构建进化树 | 陈连福的生信博客

RAxML 构建进化树 - 简书

RAxML建最大似然树教程（Linux） - 简书

最大似然法（ML）构建进化树的几种软件比较 - 知乎

科学网—raxmlHPC-HYBRID-SSE3 使用笔记 - 李雷廷的博文

mkdir c.species_tree && cd c.species_tree

1) 根据orthomcl结果提取单拷贝同源基因

# CDS
orthomcl_extract_ortholog_seqs.pl --out_directory orthologGroups_CDS3 --species_ratio 1 --single_copy_species_ratio 0.5 --copy_num 10 --max_seq_length 6000 ../b.OrthoFinder/groups3.txt ../b.OrthoFinder/compliantFasta_CDS3/compliantFasta_CDS
# 粘贴结果
touch orthomcl.results1-3.txt
sudo  chmod -R 777 *     
# pep
orthomcl_extract_ortholog_seqs.pl --out_directory orthologGroups_Protein3 --species_ratio 1 --single_copy_species_ratio 0.5 --copy_num 10 --max_seq_length 6000 --compliantFasta_dir_for_calculating_seq_length  ../b.OrthoFinder/compliantFasta_CDS3/compliantFasta_CDS ../b.OrthoFinder/groups3.txt ../b.OrthoFinder/compliantFasta3/compliantFasta
# 粘贴结果
touch orthomcl.results2-3.txt
sudo  chmod -R 777 *
# 修改文件
perl -p -i -e 's/\*$//; s/\*/X/g' orthologGroups_Protein3/*.fasta
ls orthologGroups_Protein3/ | perl -pe 's/.fasta//' > orthologGroups3.txt

2) 对单拷贝同源进行进行多序列比对

for i in `cat orthologGroups3.txt`
do
    echo "linsi orthologGroups_Protein3/$i.fasta > orthologGroups_Protein3/$i.fasta.align"
done > command.mafft.list3
ParaFly -c command.mafft.list3 -CPU 48

3) 将Protein序列比对结果转换为Codon序列比对结果

for i in `cat orthologGroups3.txt`
do
    echo "proteinAlignment2CDSAlignemnt.pl orthologGroups_Protein3/$i.fasta.align orthologGroups_CDS3/$i.fasta > orthologGroups_CDS3/$i.fasta.align"
done > command.proteinAlignment2CDSAlignemnt.list3
ParaFly -c command.proteinAlignment2CDSAlignemnt.list3 -CPU 48
# 解锁文件权限
sudo  chmod -R 777 *

4) 对Codon序列比对结果进行保守区块提取

# 进入conda环境
conda activate pfam_scan
# 运行
for i in `cat orthologGroups3.txt`
do
    echo "Gblocks orthologGroups_CDS3/$i.fasta.align -t=c; if [ -f orthologGroups_CDS3/$i.fasta.align-gb ]; then echo $i completed; fi"
    echo "Gblocks orthologGroups_Protein3/$i.fasta.align -t=p; if [ -f orthologGroups_Protein3/$i.fasta.align-gb ]; then echo $i completed; fi"
done > command.Gblocks.list3
ParaFly -c command.Gblocks.list3 -CPU 48

5) 将各个同源基因家族的多序列比对结果转换为Phylip格式

for i in `cat orthologGroups3.txt`
do
    perl -p -e 's/(^>\w+).*/$1/; s/ //g' orthologGroups_CDS3/$i.fasta.align-gb > orthologGroups_CDS3/$i.fasta.align-gb.fasta
done
# 运行convertFasta2Phylip.sh
for i in `cat orthologGroups3.txt`
do
    echo "/media/aa/DATA/JJH/software/standard-RAxML-master/usefulScripts/convertFasta2Phylip.sh orthologGroups_CDS3/$i.fasta.align-gb.fasta > orthologGroups_CDS3/$i.phy"
done > command.convertFasta2Phylip.list3
ParaFly -c command.convertFasta2Phylip.list3 -CPU 48

6) 整合所有单拷贝同源基因的Codon序列比对结果，并使用ML算法构建物系统发育树

perl -e 'while (<orthologGroups_CDS3/*.align-gb>) { open IN, $_ or die $!; while (<IN>) { if (m/^>(\w+)/) { $seq_id = $1; } else { s/\s+//g; $seq{$seq_id} .= $_; } } } foreach (sort keys %seq) { print ">$_\n$seq{$_}\n"; }' > allSingleCopyOrthologsAlign.Codon3.fasta
# FastTree建树
mkdir FastTree3 && cd FastTree3
FastTree -nt -gtr -gamma ../allSingleCopyOrthologsAlign.Codon3.fasta > tree_abbr.FastTree
cp tree_abbr.FastTree tree_fullName.FastTree
cut -f 1,2 source.txt | perl -p -e 's/\t/\t\"/; s/$/\"/;' | perl -p -e 's#\s+#/#; s#^#perl -p -i -e "s/#; s#\n$#/" tree_fullName.FastTree\n#;' | perl -p -e 's#/\"#/\\\"#; s#\"/#\\\"/#;' | sh
cp tree* ../
cd ..
# RAxML建树
mkdir RAxML3 && cd RAxML3
/media/aa/DATA/JJH/software/standard-RAxML-master/usefulScripts/convertFasta2Phylip.sh ../allSingleCopyOrthologsAlign.Codon3.fasta > allSingleCopyOrthologsAlign.Codon.phy
# 进入conda环境
conda activate FigTree
# 数据量较大时，推荐按如下步骤使用raxmlHPC-PTHREADS-SSE3进行并行化计算，指定核苷酸或氨基酸替代模型。PROTGAMMALGX 的解释： "PROT" 表示氨基酸替代模型； GAMMA 表示使用 GAMMA 模型； X 表示使用最大似然法估计碱基频率。
raxmlHPC-PTHREADS-SSE3 -f a -x 12345 -p 12345 -# 100 -m GTRGAMMA -s allSingleCopyOrthologsAlign.Codon.phy -n out_codon -T 48
# 或者（Dacsp1,Treme1,Walse1为外群）
raxmlHPC-PTHREADS-SSE3 -f a -x 12345 -p 12345 -# 100 -m PROTGAMMALGX -s allSingleCopyOrthologsAlign.Codon.phy -o Dacsp1,Treme1,Walse1 -n out_codon2 -T 48
# 整理结果
cp RAxML_bipartitions.out_codon tree_abbr.RAxML
cp tree_abbr.RAxML tree_fullName.RAxML
cut -f 1,2 ../FastTree2/source.txt | perl -p -e 's/\t/\t\"/; s/$/\"/;' | perl -p -e 's#\s+#/#; s#^#perl -p -i -e "s/#; s#\n$#/" tree_fullName.RAxML\n#;' | perl -p -e 's#/\"#/\\\"#; s#\"/#\\\"/#;' | sh
# 一般使用RAxML_bipartitions.out_codon即可