隐形马尔可夫链实现维特比(Viterbi)算法(不使用任外带的库)
维特比算法的实现和应用:
大多数基因组序列的碱基组成是不均匀的。特别是,GC的组成可以因地区而异。例如,在人类基因组中有“CpG岛”,它显示出非常强的统计信号(在二核苷酸组成水平上甚至比单核苷酸组成水平更强),并且倾向于标记许多基因的5’端。因此,客观地将基因组序列分割成不同组成区域的问题就自然产生了。一种分割基因组的方法是使用HMM算法。
假设一个基因组可以被建模为一个双状态HMM,有两个状态a和b。假设这个HMM的参数如下。状态A对{A,C,G,T}的概率有at偏置发射分布{0.35,0.15,0.15,0.35}。状态B有gc偏置发射分布{0.15,0.35,0.35,0.15}。状态A切换到状态B的概率为0.001。状态B切换回状态A的概率是0.01。假设HMM的初始分布是均匀的;这两种状态都是0.5。
维特比解析:
使用Python为上述HMM实现一个Viterbi解析器,包括初始化、矩阵填充和回溯阶段。输出结果格式应当如下:
1 153 state A (or state 1)
154 252 state B (or state 2)
253 1651 state A
(… etc…)
#导入.fa文件并得到基因序列
import numpy as np
def get_sequecne(path_fa):
f = open(path_fa)
next(f)
lines = f.read().split()
sequence = []
for line in lines:
for val in line:
sequence.append(val.upper())
return sequence
#导入.hmm文件并且得到状态数(A和B),观测到的特征,状态转移矩阵和发射矩阵。
def get_parameters(path_hmm):
f = open(path_hmm)
lines = f.readlines()
index = 0
matrix = []
for line in lines:
line = line.strip().split()
if index == 0:
state_num = line[0]
feature_num = line[1]
feature = line[2]
elif index == 1:
prior_prob = list(map(float, line))
else:
matrix.append(list(map(float, line)))
index += 1
features = {}
index = 0
for val in feature:
if val not in features:
features[val] = index
index += 1
matrix = np.array(matrix)
trans_mat = np.array(matrix[:,:2])
emiss_mat = np.array(matrix[:,2:])
prior_prob = np.log10(np.array(prior_prob))
trans_mat = np.log10(trans_mat)
emiss_mat = np.log10(emiss_mat)
return state_num, features, prior_prob, trans_mat, emiss_mat
#构建概率矩阵,对其进行初始化并填充,并且根据维特比算法的公式得到nodes的矩阵
def vitervbi(sequence, state_num, trans_mat, emiss_mat):
prob_mat = np.zeros((len(sequence), int(state_num)))
nodes = np.zeros_like(prob_mat)
for i in range(int(state_num)):
prob_mat[0][i] = prior_prob[i] + emiss_mat[i][features[sequence[0]]]
for i in range(1,len(sequence)):
for j in range(int(state_num)):
to_list = []
for k in range(int(state_num)):
to_list.append(prob_mat[i-1][k]+trans_mat[k][j])
prob_mat[i][j] = np.max(to_list) + emiss_mat[j][features[sequence[i]]]
nodes[i][j] = np.argmax(to_list)
return prob_mat, nodes
#回溯,得到最优路径。
def backtrace(sequence, state_num):
prob_mat, nodes = vitervbi(sequence, state_num, trans_mat, emiss_mat)
List = []
for i in range(int(state_num)):
List.append(prob_mat[-1][i])
P = np.max(List)
T = np.argmax(List)
nodes_list = []
nodes_list.append(T)
index = 0
for j in range(len(sequence)-2,-1,-1):
N = nodes[j+1][int(nodes_list[index])]
nodes_list.append(N)
index += 1
nodes_list.reverse()
return nodes_list
#设置输出格式
def out_format(nodes_list):
index_0 = 0
index_1 = 0
start_node = 0
for i in range(1,len(nodes_list)):
current = nodes_list[i]
last = nodes_list[i-1]
if current == last:
pass
elif current != last:
if nodes_list[i-1]== 1:
print(start_node+1, '-', i, 'state B')
start_node = i
index_1 += 1
elif nodes_list[i-1]== 0:
print(start_node+1, '-', i, 'state A')
start_node = i
index_0 +=1
return index_0, index_1
if __name__ == '__main__':
sequence = get_sequecne(input('input the .fa file: (eg: example.fa) '))
state_num, features, prior_prob, trans_mat, emiss_mat = get_parameters(input('input the .hmm file: (eg: example.hmm) '))
index_0, index_1 = out_format(backtrace(sequence, state_num))
print('there are %s state B segments' % index_1)
运行后得到结果如下:
input the .fa file: (eg: example.fa) example.fa
input the .hmm file: (eg: example.hmm) example.hmm
1 - 52 state B
53 - 720 state A
721 - 768 state B
769 - 2257 state A
2258 - 2404 state B
2405 - 3472 state A
3473 - 3526 state B
3527 - 5443 state A
5444 - 5516 state B
5517 - 6331 state A
6332 - 6369 state B
…
…
992259 - 992868 state A
992869 - 993094 state B
993095 - 996916 state A
996917 - 997067 state B
there are 688 state B segments