上一篇讲了反转录转座子插入时间的计算原理,这篇主要是计算过程。
首先使用ltr_finder和ltr_harvest对基因组中的反转录转座子进行预测,之后用LTR_retriever整合预测结果,流程主要参考徐洲更先生的博文,输出结果如下:
图源:https://www.jianshu.com/p/f962d5c40fdf
其中*.fa.pass.list.gff3内包含有所有完整的反转录转座子,内容如下:
每个block表示一个反转录转座子,第一行为整个LTR-RT的区域,4、5行分别是5’LTR和3’LTR,根据这两行位置信息,将两段LTRs序列从fasta文件中提取出来。
def readgff3(gff3file):
with open(gff3file, "r") as g3f:
f = g3f.readlines()
out_list = []
small_pair = []
for line in f:
if line.startswith("#"):
pass
if not line.startswith("chr19"):
pass
else:
lin = line.strip().split("\t")
chr_name = lin[0]
type = lin[2]
start = lin[3]
end = lin[4]
if type == "repeat_region":
out_list.append(small_pair)
small_pair = []
if type == "long_terminal_repeat":
ltr_pair = ">" + chr_name + ":" + start + ".." + end
small_pair.append(ltr_pair)
del out_list[0]
out_list.append(small_pair)
return out_list
pair_ltr = readgff3("14_male.fa.pass.list.gff3")
#print(pair_ltr)
def readfasta(inputfasta):
with open(inputfasta, "r") as file:
sequence = {}
out_sequence = {}
name_line = []
out_name = []
f = file.readlines()
for line in f:
line = line.rstrip()
if line.startswith(">"):
name_line.append(line)
sequence[line] = ""
currseqname = line
else:
sequence[currseqname] += line
for item in name_line:
if item.startswith(">chr19"):
start = int(item.strip().split(":")[1].split("..")[0])
#if start < 300000:
out_name.append(item)
for key in sequence.keys():
if key in out_name:
out_sequence[key] = sequence[key]
return out_sequence, out_name
fasta_dic, fasta_name = readfasta("14_male.fa.LTRlib.redundant.fa")
#print(fasta_dic)
#print(len(fasta_dic), len(fasta_name))
def findfasta(gfflist, fastadic):
pair_dict = {}
gff_namedic = {}
ltr_fasta = []
for key in fastadic.keys():
#print(key)
name_number = key.split("_")[0]
gff_namedic[name_number] = key
for it in gfflist:
starts = int(it[0].split(":")[1].split("..")[0])
'''
if starts > 300000:
pass
else:'''
fiveltr = gff_namedic[it[0]]
five_fasta = fastadic[fiveltr]
ltr_fasta.append(fiveltr)
ltr_fasta.append(five_fasta)
threeltr = gff_namedic[it[1]]
pair_dict[fiveltr] = threeltr
three_fasta = fastadic[threeltr]
ltr_fasta.append(threeltr)
ltr_fasta.append(three_fasta)
return ltr_fasta, pair_dict
pair_ltr, pair_dictionary = findfasta(pair_ltr,fasta_dic)
print(len(pair_dictionary), len(pair_ltr))
def write_out(list, dict, file1, file2):
with open(file1, "w") as out_f:
for i in list:
line = i + "\n"
#line = "\t".join(str(a) for a in i) + "\n"
#print(type(line))
out_f.write(line)
out_f.close()
with open(file2, "w") as out2_f:
for key in dict.keys():
line1 = key + "\t"
out2_f.write(line1)
line2 = dict[key] + "\n"
out2_f.write(line2)
out2_f.close()
#write_out(l, "ltr_redundant_out.txt")
write_out(pair_ltr, pair_dictionary, "14_extract_all.txt", "14_extract_paral_all.txt")
提取结果如下:
在服务器上可以按行把文件拆成小文件,每个文件中包含一对LTRs,也可以在提取的时候就分开写入文件。
计算核酸演化速率的模型有很多种,我们主要使用Kimura 2-Parameter model:
r=K/2T
其中r是核酸演化速率,在近缘物种中较一致;K是每个位点的平均替代数,T是分化时间。想要进一步理解本公式可以参考北京大学的生物演化课程:https://zh.coursera.org/lecture/shengwu-yanhua/8-1-he-suan-yan-hua-de-su-lu-shang-eRCOU
那么当我们已知近缘物种的演化速率,根据K-2模型算出K,即可推算出两条序列的分化时间。K值的计算使用MEGA,由于我们有多条序列,需要成对比对,可以使用MEGA-cc来批量化处理。
下班,具体处理方式下次再更。