Encoding strings (e.g. DNA and protein sequences) for machine learning and neural networks

When working with DNA or protein sequences, we usually need to convert the sequence into numbers so that it can be assembled into a matrix and fed into a model for training. In general, there are three common encoding schemes: ordinal encoding, one-hot encoding, and k-mer encoding.

1. Ordinal encoding

The first scheme, ordinal encoding, maps each of the four bases to a specific number: for example, 'acgt' becomes [0.25, 0.5, 0.75, 1.0], and any other character (such as 'n') can be encoded as 0.

First, clean the sequence and turn the string into a NumPy character array:

# function to convert a DNA sequence string to a numpy array
# converts to lower case, replaces any non-'acgt' character with 'z'
import numpy as np
import re
def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array

# create a label encoder with the 'acgtz' alphabet
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','c','g','t','z']))
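
LabelEncoder assigns integer labels in alphabetical order (a=0, c=1, g=2, t=3, z=4), which is exactly the mapping the next step relies on; a quick check:

>>> label_encoder.transform(np.array(['a','c','g','t','z']))
array([0, 1, 2, 3, 4])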

Map the integer labels to the final numeric codes:

# function to encode a DNA sequence string as an ordinal vector
# returns a numpy vector with a=0.25, c=0.50, g=0.75, t=1.00, z (anything else)=0.00
def ordinal_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    float_encoded = integer_encoded.astype(float)
    float_encoded[float_encoded == 0] = 0.25 # A
    float_encoded[float_encoded == 1] = 0.50 # C
    float_encoded[float_encoded == 2] = 0.75 # G
    float_encoded[float_encoded == 3] = 1.00 # T
    float_encoded[float_encoded == 4] = 0.00 # anything else, z
    return float_encoded

Test it:

>>> test_sequence = 'AACGCGCTTNN'
>>> ordinal_encoder(string_to_array(test_sequence))
array([0.25, 0.25, 0.5 , 0.75, 0.5 , 0.75, 0.5 , 1.  , 1.  , 0.  , 0.  ])
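
Since the point of the encoding is to build a numeric matrix for model input, equal-length sequences can simply be stacked row by row. A minimal sketch, where `sequences` is a hypothetical placeholder list of equal-length strings:

# stack several equal-length sequences into an (n_sequences, sequence_length) matrix
sequences = ['AACGCGCTTNN', 'TTACGGGATCN', 'CCCGATTGACA']  # placeholder data
X_ord = np.vstack([ordinal_encoder(string_to_array(s)) for s in sequences])
X_ord.shape  # (3, 11)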

Researchers Allen Chieng Hoon Choong and Nung Kion Lee have reported good training results with this encoding; see their paper "Evaluation of Convolutionary Neural Networks Modeling of DNA Sequences using Ordinal versus one-hot Encoding Method".

2. One-hot encoding

One-hot encoding turns each base into a 4-dimensional unit vector: A = [1,0,0,0], C = [0,1,0,0], G = [0,0,1,0], T = [0,0,0,1] (matching the column order produced by the code below). A 1000 bp sequence therefore becomes a 1000 x 4 matrix.

# function to one-hot encode a DNA sequence string
# non-'acgt' characters (mapped to 'z') become [0, 0, 0, 0]
# returns a L x 4 numpy array
from sklearn.preprocessing import OneHotEncoder
def one_hot_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    # force all 5 categories (a, c, g, t, z); scikit-learn < 1.2 takes sparse=False instead of sparse_output
    onehot_encoder = OneHotEncoder(sparse_output=False, dtype=int, categories=[np.arange(5)])
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    onehot_encoded = np.delete(onehot_encoded, -1, 1)  # drop the 'z' column so non-acgt bases are all zeros
    return onehot_encoded

Test it on one sequence:

>>> test_sequence = 'AACGCGGTTNN'
>>> one_hot_encoder(string_to_array(test_sequence))
array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int64)
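
The OneHotEncoder arguments have changed across scikit-learn releases (n_values was removed and sparse was renamed sparse_output), so as a version-independent alternative the same L x 4 matrix can be built with plain NumPy. A minimal sketch; one_hot_encoder_np is just an illustrative name introduced here:

# one-hot encode via a 5 x 4 lookup table:
# rows 0-3 (a, c, g, t) are unit vectors, row 4 (z) is all zeros
def one_hot_encoder_np(my_array):
    lookup = np.vstack([np.eye(4, dtype=int), np.zeros((1, 4), dtype=int)])
    integer_encoded = label_encoder.transform(my_array)
    return lookup[integer_encoded]

For the test sequence above this returns the same matrix as the sklearn version.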

3. Splitting the DNA into k-mers and training the NLP way

This method slides a fixed-length window along the sequence to produce overlapping k-mers, which are then treated like words in natural language processing. For example, splitting 'ATGCATGCA' into 6 bp k-mers gives 'ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA'.
First, extract the k-mers:

>>> def getKmers(sequence, size):
...     return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
# test the function
>>> mySeq = 'CATGGCCATCCCCCCCCGAGCGGGGGGGGGG'
>>> getKmers(mySeq, size=6)
['catggc',
 'atggcc',
 'tggcca',
 'ggccat',
 'gccatc',
 'ccatcc',
 'catccc',
 'atcccc',
 'tccccc',
 'cccccc',
 'cccccc',
 'cccccc',
 'cccccg',
 'ccccga',
 'cccgag',
 'ccgagc',
 'cgagcg',
 'gagcgg',
 'agcggg',
 'gcgggg',
 'cggggg',
 'gggggg',
 'gggggg',
 'gggggg',
 'gggggg',
 'gggggg']

Join the k-mers into a "sentence":

>>> words = getKmers(mySeq, size=6)
>>> sentence = ' '.join(words)
>>> sentence
'catggc atggcc tggcca ggccat gccatc ccatcc catccc atcccc tccccc cccccc cccccc cccccc cccccg ccccga cccgag ccgagc cgagcg gagcgg agcggg gcgggg cggggg gggggg gggggg gggggg gggggg gggggg'

Now test with two more sequences:

mySeq2 = 'GATGGCCATCCCCGCCCGAGCGGGGGGGG'
mySeq3 = 'CATGGCCATCCCCGCCCGAGCGGGCGGGG'
sentence2 = ' '.join(getKmers(mySeq2, size=6))
sentence3 = ' '.join(getKmers(mySeq3, size=6))

Encode the three sentences into a bag-of-words model:

# Creating the Bag of Words model
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer()
>>> X = cv.fit_transform([sentence, sentence2, sentence3]).toarray()
>>> X
array([[1, 1, 1, 1, 1, 1, 3, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0,
        0, 1, 1, 0, 0, 5, 1, 0, 1],
       [1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
        0, 1, 1, 0, 0, 3, 0, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
        1, 1, 1, 1, 1, 0, 0, 1, 1]], dtype=int64)
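
Each column of X counts one 6-mer from the fitted vocabulary; the column-to-k-mer mapping can be recovered from the vectorizer (get_feature_names_out requires scikit-learn >= 1.0; older releases use get_feature_names):

# recover which 6-mer each column of X corresponds to
vocab = cv.get_feature_names_out()      # older sklearn: cv.get_feature_names()
print(len(vocab))                       # 31 distinct 6-mers across the three sequences
counts_seq1 = dict(zip(vocab, X[0]))    # per-6-mer counts for the first sequence

From here X (optionally TF-IDF weighted) can be fed to any standard text classifier, e.g. MultinomialNB, exactly as in an NLP pipeline.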