前言:之前跑了共线性比对,得到了单拷贝直系同源基因,获取了四倍碱基位点进行建树,接下来咱们就跑mcmctree
输入文件1)含校准点的有根树 2)密码子在3个位点的多序列比对结果 3)mcmctree配置文件
流程:
cd ~/Qxy/knowngenome/orthofinder #先建几个文件夹^-^
mkdir mcmctree
cd mcmctree
一、含校准点的有根树
之前Orthofinder结果会得到单拷贝直系同源基因,24个石耳树+外群有642个,然后构建ML树
在figtree中打开树,在Tree选项那里,选择root trees以外群为根,选择order nodes,选择transform branches这个格式选择equal
然后选择Export tree导出树为nemick格式XX.trees
然后去timetree网站http://www.timetree.org/上查找分子时间,选择石耳属分化时间47.2MYA,U.decussata和其他石耳的分化时间为29.9MYA
调整好树格式和内容后上传至服务器
#共线性比对结果同理~(总之就是要有基因序列,比对,建树)
自己去改好格式再上传啊!!!
二、密码子在3个位点的多序列比对结果
根据看的教程和各种说明,理解是密码子在三个位点分别的比对结果,意思是氨基酸第一位列在一起,第二位列在一起,第三位列在一起,但是尝试操作后报错,说输入序列的phylip格式不对,因此还是用密码子去跑mcmctree
2.1用orthofinder比对结果去做:
2.1.1拿到25个物种的cds序列(拿到之后别的流程也可以直接用)
cd 进入对应目录,应该包含XX.cds-transcripts.fa文件
python3 ~/Qxy/qxyjiaoben/CDS2.py #提前改名字
sh ~/Qxy/qxyjiaoben/orthofinder_changename.sh #要提前更改一下名字
rm -rf modified_cds1.fasta #中间文件删掉
cp 一下转到同目录下
#CDS2.py该脚本作用是把空格后面的那个删掉,从FUN_000001-T1 FUN_000001变成FUN_000001-T1(内容如下)
vim CDS2.py #脚本内容如下
from Bio import SeqIO
# 读取FASTA文件
records = SeqIO.parse('*.cds-transcripts.fa', 'fasta')
# 修改序列的描述符
new_records = []
for record in records:
seq_id = record.id
new_seq = str(record.seq)[:-11] # 去掉最后11个字符
new_record = f">{seq_id}\n{new_seq}"
new_records.append(new_record)
# 将修改后的记录写入新的FASTA文件
output_file = 'modified_cds1.fasta'
with open(output_file, 'w') as f:
f.write('\n'.join(new_records))
#orthofinder_changename.sh 这是个老脚本了,之前Orthofinder的时候也用过(内容如下)
vim orthofinder_changename.sh
#!/bin/bash
# 设置物种名变量
species_name="Umbilicaria_antarctica" #这个应该是换哪个名就填哪个哈哈哈,也挺方便不是
# 读取蛋白文件中每一行文本
while read -r line; do
# 如果当前行包含 ">" 符号,说明是序列头部
if [[ "$line" == ">"* ]]; then
# 将序列 ID 和描述信息拆分出来,并在其前面加上物种名并连接
new_header=">${species_name}${line#>}"
echo "${new_header}" # 调试输出新的序列头部
else
# 否则说明是序列数据,直接输出即可
echo "${line}"
fi
done < "modified_cds1.fasta" > "Umbilicaria_antarctica_cds.fasta"
特殊情况!!!
由于Hypocenomyce_scalaris、Umbilicaria_muehlenbergii和Umbilicaria_vellea都是已有蛋白,非funannotate预测
so 需要借助TBtools软件,利用gff3(或gtf文件)和genome文件,提取出相应的cds序列
but 提取出来的序列名称和顺序有点不太对,要用下面的脚本稍作调整
python3 order.py #提前改名字
改完顺序接着更改名字前缀
sh orthofinder_changename.sh #要提前更改一下名字
#order.py去掉序列名称后面的.XX(有时候也不需要去,具体情况具体分析),去掉后面的+或-
(脚本内容如下,需要更改最后那里input_file_path和output_file_path路径!!!)
vim order.py
def clean_seq_name(seq_name):
# 去除名字中的.mRNA1和+/-符号
return seq_name.split('.')[0].split('+')[0].split('-')[0]
def clean_and_sort_fasta(input_file, output_file):
sequences = {}
with open(input_file, 'r') as input_f, open(output_file, 'w') as output_f:
seq_name = ''
seq_data = ''
for line in input_f:
if line.startswith('>'): # 名称行
if seq_name != '':
sequences[clean_seq_name(seq_name)] = seq_data
seq_name = line[1:].strip()
seq_data = ''
else: # DNA序列行
seq_data += line.strip()
if seq_name != '':
sequences[clean_seq_name(seq_name)] = seq_data
sorted_sequences = sorted(sequences.items(), key=lambda x: x[0]) # 按照序列名称排序
for seq_name, seq_data in sorted_sequences:
output_f.write('>{}\n'.format(seq_name))
output_f.write('{}\n'.format(seq_data))
print('处理完成!')
# 使用示例
input_file_path = '/ifs1/User/wuqi/Qxy/knowngenome/Umbilicaria_muehlenbergii/gene.fasta'
output_file_path = '/ifs1/User/wuqi/Qxy/knowngenome/Umbilicaria_muehlenbergii/sorted_gene.fasta'
clean_and_sort_fasta(input_file_path, output_file_path)
这样我们就拿到了25个物种的CDS序列,在/ifs1/User/wuqi/Qxy/knowngenome/mcmctree/25cds目录下
2.1.2获得单拷贝基因对应的CDS序列
因为orthofinder结果得到的是氨基酸的比对序列,跑mcmctree需要的是密码子的比对需要,所以需要多倒腾几步
经常要用到的一个list文件,是Orthofinder结果里面Orthogroups下的Orthogroups_SingleCopyOrthologues.txt
文件绝对路径:
/ifs1/User/wuqi/Qxy/knowngenome/orthofinder/OrthoFinder/Results_Aug01/Orthogroups/Orthogroups_SingleCopyOrthologues.txt
(1)先获得序列名称name
vim pilianggetname.py #用该脚本批量获得某一目录下XX.2.3.4文件中的序列名称
import os
def process_files(input_dir):
for filename in os.listdir(input_dir):
if filename.endswith("_aligned.fasta.2.3.4"):
input_file = os.path.join(input_dir, filename)
output_file = os.path.join(input_dir, filename.replace("_aligned.fasta.2.3.4", "_aligned.name.txt"))
command = "seqkit fx2tab -n -i {} > {}".format(input_file, output_file)
os.system(command)
print("处理完成:{}".format(output_file))
# 使用示例
input_directory = "/ifs1/User/wuqi/Qxy/knowngenome/orthofinder/SingleCopyOrthologues_speciestree/alignments/seqkit"
process_files(input_directory)
运行上述代码会得到单拷贝同源基因对应的物种名文件,可以mkdir name,将其mv进去
(2)获得每个单拷贝基因的cds序列
去25cds目录下获得一个所有物种的cds序列allcds.fa
cd /ifs1/User/wuqi/Qxy/knowngenome/mcmctree/25cds
cat *_cds.fasta > allcds.fasta
ln -s allcds.fasta ./
vim getcds.py #该代码作用是通过序列名字,获取cds,这样我们就得到了每一个单拷贝基因的cds序列
import os
def process_files(input_dir):
for filename in os.listdir(input_dir):
if filename.endswith("_aligned.name.txt"):
input_file = os.path.join(input_dir, filename)
output_file = os.path.join(input_dir, filename.replace("_aligned.name.txt", "_cds.txt"))
command = "seqkit grep -f {} allcds.fasta > {}".format(input_file, output_file)
os.system(command)
print("处理完成:{}".format(output_file))
# 使用示例
input_directory = "/ifs1/User/wuqi/Qxy/knowngenome/orthofinder/SingleCopyOrthologues_speciestree/alignments/seqkit/name"
process_files(input_directory)
(3)比对得到alignment,去空白序列啥的
sh Muscle_align.sh
或者nohup mafft --auto 4dtv.fasta > aln.4dtv.fasta &
稍微改一改!
(4)比对结果转为phy格式并合并
先要给序列改名字,还要排序,然后再用代码转一下格式就行了!!!(忘记了)
2.2用共线性比对的结果去做:
2.1.1需要根据树文件顺序编写一个seq_name_order.list 文件,要一一对应(脚本目录那里)
2.1.2运行代码
首先需要gapfilitered文件夹下所有文件的名字的list gene_name.txt,运行下列代码
sh get_dictory_name.sh
sh getcds2.sh #将gapfiltered文件中序列名字改成简写的,输出结果是XX.rename.fa
python3 3.py #将gapfiltered文件中序列进行排序,根据seq_name_order.list,输出结果是XX.rename.fa.2
sh getcds3.sh #转phy格式第一步,输出结果是${i}.phy.tmp
sh getcds4.sh #转phy格式第二步,输出结果是${i}.phy.tmp.2
cat *.phy.tmp.2 > 25gongxianxing.phy
注:以上四步骤是层层递进的,环环相扣的,先改名字,再倒腾顺序,最后再转格式
运行完毕后就可以删掉过程文件了,只保留XX.mRNA1.fa、gene_name.txt和25gongxianxing.phy
rm -rf *.phy.tmp.2
rm -rf *.phy.tmp
rm -rf *.rename.fa.2
rm -rf *.rename.fa
下面是代码内容!
vim get_dictory_name.sh
#!/bin/bash
# 获取所有文件名,并去掉后缀
for file in *.fa; do
name="${file%.fa}" #如果运行失败的话,看看这两个后缀是不是写对了
echo "$name"
done > gene_name.txt
vim getcds2.sh
for i in `cat gene_name.txt`;do sed -E 's/Hypocenomyce_scalaris_genomic/H.scalaris/' ${i}.fa | sed -E 's/Umbilicaria_antarctica_genomic/U.antarctica/' | sed -E 's/Umbilicaria_aprina_genomic/U.aprina/' | sed -E 's/Umbilicaria_arctica_genomic/U.arctica/' |sed -E 's/Umbilicaria_caroliniana_genomic/U.caroliniana/' | sed -E 's/Umbilicaria_cylindrica_genomic/U.cylindrica/' | sed -E 's/Umbilicaria_decussata_genomic/U.decussata/' | sed -E 's/Umbilicaria_deusta_genomic/U.deusta/'|sed -E 's/Umbilicaria_esculenta_genomic/U.esculenta/' | sed -E 's/Umbilicaria_freyi_genomic/U.freyi/' | sed -E 's/Umbilicaria_grisea_genomic/U.grisea/' | sed -E 's/Umbilicaria_indica_genomic/U.indica/' | sed -E 's/Umbilicaria_loboperipherica_genomic/U.loboperipherica/' | sed -E 's/Umbilicaria_lyngei_genomic/U.lyngei/' | sed -E 's/Umbilicaria_maculata_genomic/U.maculata/' | sed -E 's/Umbilicaria_muehlenbergii_genomic/U.muehlenbergii/' | sed -E 's/Umbilicaria_phaea_genomic/U.phaea/' | sed -E 's/Umbilicaria_spodochroa_genomic/U.spodochroa/' | sed -E 's/Umbilicaria_squamosa_genomic/U.squamosa/' | sed -E 's/Umbilicaria_subpolyphylla_genomic/U.subpolyphylla/' | sed -E 's/Umbilicaria_thamnodes_genomic/U.thamnodes/' | sed -E 's/Umbilicaria_torrefacta_genomic/U.torrefacta/' | sed -E 's/Umbilicaria_vellea_genomic/U.vellea/' | sed -E 's/Umbilicaria_virginis_genomic/U.virginis/' | sed -E 's/Umbilicaria_yunnana_genomic/U.yunnana/' > ${i}.rename.fa;done
vim 3.py
import os
import re
# 指定文件夹路径
folder_path = '/ifs1/User/wuqi/Qxy/knowngenome/gongxianxing/0828bidui/gapfiltered/'
# 读取顺序列表
with open('/ifs1/User/wuqi/Qxy/qxyjiaoben/gene2paml/seq_name_order.list', 'r') as order_file:
order_list = order_file.read().splitlines()
# 遍历文件夹中的文件
for file_name in os.listdir(folder_path):
# 仅处理符合条件的文件
if file_name.endswith('.rename.fa'):
# 构建文件的完整路径
file_path = os.path.join(folder_path, file_name)
# 读取fasta文件
with open(file_path, 'r') as fasta_file:
fasta_data = fasta_file.read()
# 解析fasta文件中的序列和标识物种的行
sequences = re.findall(r'>(.*?)\n([^>]*)', fasta_data, re.MULTILINE | re.DOTALL)
# 根据顺序列表重新排序序列
sorted_sequences = []
for sp in order_list:
for header, seq in sequences:
if header.strip() == sp:
sorted_sequences.append((header, seq))
break
# 构建输出文件的路径
output_file_path = os.path.join(folder_path, f'{file_name}.2')
# 将重新排序后的序列输出到新的fasta文件(移除空行)
with open(output_file_path, 'w') as output_file:
for header, seq in sorted_sequences:
output_file.write(f'>{header}\n{seq.rstrip()}\n')
print(f'Sorted file: {output_file_path}')
vim getcds3.sh
for i in `cat gene_name.txt`;do cat ${i}.rename.fa.2 |tr '\n' '\t'|sed 's/>/\n/g' |sed 's/\t/ /'|sed 's/\t//g'| awk 'NF > 0' > ${i}.phy.tmp;done
vim getcds4.sh
for i in `cat gene_name.txt`;do awk '{print " "NR" "length($2)}' ${i}.phy.tmp|tail -n 1 | cat - ${i}.phy.tmp > ${i}.phy.tmp.2;done
三、mcmctree配置文件
自己写配置文件qxymcmctree.ctl,注意输入序列文件、树文件、输出文件
seed = -1
seqfile = 25gongxianxing.phy
treefile = tree2.trees
outfile = mcmctreeout1.txt
ndata = 20
seqtype = 0 * 0: nucleotides; 1:codons; 2:AAs
usedata = 3 * 0: no data; 1:seq like; 2:use in.BV; 3: out.BV #第一次是3,第二次是2
clock = 3 * 1: global clock; 2: independent rates; 3: correlated rates
RootAge = '<1.0' * safe constraint on root age, used if no fossil for root.
model = 4 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
alpha = 0.5 * alpha for gamma rates at sites
ncatG = 5 * No. categories in discrete gamma
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
BDparas = 1 1 0.1 * birth, death, sampling
kappa_gamma = 6 2 * gamma prior for kappa
alpha_gamma = 1 1 * gamma prior for alpha
rgene_gamma = 2 20 1 * gamma prior for overall rates for genes
sigma2_gamma = 1 10 1 * gamma prior for sigma^2 (for clock=2 or 3)
finetune = 1: 0.1 0.1 0.1 0.01 .5 * auto (0 or 1) : times, musigma2, rates, mixing, para
s, FossilErr
print = 1
burnin = 2000
sampfreq = 10
nsample = 20000
运行:mcmctree qxymcmctree.ctl #会得到一些结果
然后:更改out.BV为in.BV,更改ctl控制文件里的输入文件信息,改usedata = 2参数,然后mcmctree qxymcmctree.ctl再跑一遍
最终得到FigTree.tre文件,传输到本地用Figtree打开
调整树型,选择Root Length,Tip Scale,Node Lables,Scale Bar,Scale Axis。
完结撒花~~~还有一些写的不清楚,希望下次运行顺利~~~