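I previously wrote a Python script that batch-generates orthogonal combination sequences of mCherry + split inteins and hands them to AF-multimer for structure prediction, so that orthogonality could be verified afterwards.
However, when validating activity after artificially splitting a contiguous intein, orthogonality does not need to be checked, so no pairwise combinations are required. Using the script above would produce a large number of redundant files.
I therefore wrote a separate script better suited to this case, shown below.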
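Requirements:
Input: the extein sequences (hard-coded in the script), plus the file active_seq.dat, which gives for each intein its name, its sequence, a comma-separated list of split sites, and the junction (linker) sequences in the form N-linker/C-linker.
Output: input files for AF2 structure prediction, one per split site, for example 000_Acel-Terl_25.fasta.
Contents of the input file active_seq.dat: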
mrc@mrc-Precision-3660:prediction_SplitInt_from_literature$ cat active_seq.dat
#Description: Intein name\n Intein sequences\n Split sites\n Junction sequences
Acel-Terl
CVYGDTMVETEDGKIKIEDLYKRLAMFRTNTNNIKILSPNGFSNFNGIQKVERNLYQHIIFDDDTEIKTSINHPFGKDKILARDVKVGDYLNSKKVLYNELVNENIFLYDPINVEKESLYITNGVVSHN
25
EFE/CEF
Aov-DnaE
CLSADTEILTVEYGFLPIGEIVGKAIECRVYSVDGNGNIYTQSIAQWHNRGEQEVFEYTLEDGSIIRATKDHKFMTTDGEMLPIDEIFARQLDLMQVQGLHVKITARKFVGRENVYDIGVEHHHNFAIKNGLIASN
101
AEY/CEF
……
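The four-line record layout shown above could be parsed into named records roughly as follows. This is only a minimal sketch, not the script used here: `InteinRecord` and `parse_records` are hypothetical names, the example record is toy data, and it assumes that split sites are comma-separated integers and that the junction line has the form `N-linker/C-linker`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InteinRecord:
    """One intein entry: four consecutive non-comment lines of the .dat file."""
    name: str
    seq: str
    split_sites: List[int]
    linker_n: str
    linker_c: str


def parse_records(text: str) -> List[InteinRecord]:
    """Parse the text of an active_seq.dat-style file (hypothetical helper)."""
    # Drop blank lines and '#' comment lines.
    lines = [ln.strip() for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per intein, got %d lines" % len(lines))
    records = []
    for i in range(0, len(lines), 4):
        name, seq, sites, junction = lines[i:i + 4]
        linker_n, linker_c = junction.split("/")
        records.append(InteinRecord(
            name=name,
            seq=seq,
            split_sites=[int(s) for s in sites.split(",")],
            linker_n=linker_n,
            linker_c=linker_c,
        ))
    return records


# Toy record (not a real intein), used only to illustrate the layout.
example_text = """#Description: toy example
Toy-Int
CAAAGGGHN
3,5
EFE/CEF
"""
example_records = parse_records(example_text)
```

Failing early on a line count that is not a multiple of four makes a missing line in the data file visible before any FASTA files are written.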
Code:
mrc@mrc-Precision-3660:prediction_SplitInt_from_literature$ cat gene_seq.py
#Python3
#maoruichao@2024.4.24
#Usage: python3 gene_seq.py
import os

## Extein sequences (N-terminal and C-terminal parts flanking the intein)
ExN_seq = 'MVSKGEEDNMAIIKEFMRFKVHMEGSVNGHEFEIEGEGEGRPYEGTQTAKLKVTKGGPLPFAWDILSPQFMYGSKAYVKHPADIPDYLKLSFPEGFKWERVMNFEDGGVVTVTQDSSLQDGEFIYKVKLRGTNFPSDGPVMQKKTMGWEASSERMYPED'
ExC_seq = 'GALKGEIKQRLKLKDGGHYDAEVKTTYKAKKPVQLPGAYNVNIKLDITSHNEDYTIVEQYERAEGRHSTGGMDELYK'

## Output directory; create it if missing so that open() below does not fail
out_dir = './active_seq_for_predict/'
os.makedirs(out_dir, exist_ok=True)

## Open and read file, dropping blank lines and '#' comment lines
with open('active_seq.dat', 'r') as file_Int:
    list_SeqInt = file_Int.readlines()
list_SeqInt_clean = [line.strip() for line in list_SeqInt if line.strip() and not line.startswith("#")]
#print (len(list_SeqInt_clean))

## Split the list into sub-lists of 4 elements; each sub-list holds the data for one intein
grouped_lists = [list_SeqInt_clean[i:i+4] for i in range(0, len(list_SeqInt_clean), 4)]

for int_num, int_data in enumerate(grouped_lists):
    print (int_num)
    # name, sequence, comma-separated split sites, 'N-linker/C-linker'
    int_name, int_seq, int_split_sites, int_junction = int_data
    # Intein length
    int_len = len(int_seq)
    # Linkers joining the exteins to the intein halves
    linker_list = int_junction.split('/')
    linker_N = linker_list[0]
    linker_C = linker_list[1]
    # Loop over every split site and generate the sequences for AF2 prediction
    for site_id in int_split_sites.split(','):
        print ("Split in site " + site_id)
        site_id = int(site_id)
        intN_seq = int_seq[:site_id]
        intC_seq = int_seq[site_id:]
        # Build the two input chains for AF2 prediction
        All_seqN = ExN_seq + linker_N + intN_seq
        All_seqC = intC_seq + linker_C + ExC_seq
        print (All_seqN,All_seqC)
        # Generate the FASTA file for AF2 prediction
        filename = out_dir + format(int_num,'03d') + '_' + int_name + '_' + str(site_id) + '.fasta'
        title1 = '>' + int_name + '_' + str(site_id) + '-N' + '|Ex_N+linker+Int_N'
        title2 = '>' + int_name + '_' + str(site_id) + '-C' + '|Int_C+linker+Ex_C'
        with open(filename, 'w') as newfile:
            newfile.write(title1 + '\n')
            newfile.write(All_seqN + '\n')
            newfile.write(title2 + '\n')
            newfile.write(All_seqC + '\n')
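The core step of the script, cutting the intein at a split site and attaching the linkers and exteins, can be isolated in a small function that is easy to check by hand. The following is a sketch under assumptions rather than part of the original script: `build_chains` and `to_fasta` are hypothetical helper names, the sequences are short toy strings, and the range check on the split site is an addition that the script itself does not perform.

```python
from typing import Tuple


def build_chains(ex_n: str, ex_c: str, int_seq: str, site: int,
                 linker_n: str, linker_c: str) -> Tuple[str, str]:
    """Return (N chain, C chain) for an intein cut after residue `site`."""
    # Added safeguard (not in the script): a site of 0 or len(int_seq)
    # would leave one intein half empty.
    if not 0 < site < len(int_seq):
        raise ValueError("split site %d outside 1..%d" % (site, len(int_seq) - 1))
    chain_n = ex_n + linker_n + int_seq[:site]   # Ex_N + linker + Int_N
    chain_c = int_seq[site:] + linker_c + ex_c   # Int_C + linker + Ex_C
    return chain_n, chain_c


def to_fasta(name: str, site: int, chain_n: str, chain_c: str) -> str:
    """Format both chains as a two-record FASTA string, using the script's headers."""
    tag = name + "_" + str(site)
    return (">" + tag + "-N|Ex_N+linker+Int_N\n" + chain_n + "\n"
            ">" + tag + "-C|Int_C+linker+Ex_C\n" + chain_c + "\n")


# Toy inputs, not real extein or intein sequences.
toy_n, toy_c = build_chains("MVS", "GAL", "CAAAGGGHN", 3, "EFE", "CEF")
toy_fasta = to_fasta("Toy-Int", 3, toy_n, toy_c)
```

With these toy inputs the N chain is `MVSEFECAA` and the C chain is `AGGGHNCEFGAL`, which makes it easy to confirm that the residue at the split position ends up in the intended half.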