Statistical Learning Notes: Applications in Genomics (Part 2)

Machine learning can be divided into:

1. Supervised learning: the goal is classification (predicting a label) or regression (predicting a response), using a provided set of labeled training examples.

2. Unsupervised learning: the goal is learning inherent patterns in the data (e.g., PCA and clustering; the CA and CCA methods that originated in ecological analysis arguably belong here too — their principles and derivations will be covered in detail in a later post).
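As a tiny illustration of the unsupervised side, PCA plus k-means can recover structure from data with no labels at all (a minimal sketch on synthetic data; the scikit-learn calls are real, the data are made up):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic "samples x features" groups; no labels given
X = np.vstack([rng.normal(0, 1, (50, 10)),
               rng.normal(5, 1, (50, 10))])

# PCA projects onto the two directions of greatest variance
X2 = PCA(n_components=2).fit_transform(X)

# k-means then groups the samples using only the data itself
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2)
print(X2.shape, np.bincount(clusters))  # (100, 2) and two groups of 50
```

With groups this well separated, the clustering recovers the two original groups exactly, even though it never saw a label.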

Ilya Sutskever, one of OpenAI's co-founders, is said to be a strong advocate of unsupervised learning.

As an aside, here is the reading list he compiled for programmers moving into AI:

1. The Annotated Transformer (nlp.seas.harvard.edu)
2. The First Law of Complexodynamics (scottaaronson.blog)
3. The Unreasonable Effectiveness of RNNs (karpathy.github.io)
4. Understanding LSTM Networks (colah.github.io)
5. Recurrent Neural Network Regularization (arxiv.org)
6. Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (cs.toronto.edu)
7. Pointer Networks (arxiv.org)
8. ImageNet Classification with Deep CNNs (proceedings.neurips.cc)
9. Order Matters: Sequence to sequence for sets (arxiv.org)
10. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (arxiv.org)
11. Deep Residual Learning for Image Recognition (arxiv.org)
12. Multi-Scale Context Aggregation by Dilated Convolutions (arxiv.org)
13. Neural Quantum Chemistry (arxiv.org)
14. Attention Is All You Need (arxiv.org)
15. Neural Machine Translation by Jointly Learning to Align and Translate (arxiv.org)
16. Identity Mappings in Deep Residual Networks (arxiv.org)
17. A Simple NN Module for Relational Reasoning (arxiv.org)
18. Variational Lossy Autoencoder (arxiv.org)
19. Relational RNNs (arxiv.org)
20. Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton (arxiv.org)
21. Neural Turing Machines (arxiv.org)
22. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (arxiv.org)
23. Scaling Laws for Neural LMs (arxiv.org)
24. A Tutorial Introduction to the Minimum Description Length Principle (arxiv.org)
25. Machine Super Intelligence Dissertation (vetta.org)
26. Page 434 onwards: Kolmogorov Complexity (lirmm.fr)
27. CS231n Convolutional Neural Networks for Visual Recognition (cs231n.github.io)

Take it slow. Let's start with one piece: a practical guide to using deep learning in genomics.

The guide works through a concrete example: using a neural network to predict whether a DNA sequence contains a motif. The data are 2000 short sequences of 50 bp each; for each one it is known experimentally whether it contains the motif (e.g., an enhancer), labeled 0 or 1. The data are split into a training set and a test set for validation, and the model's predictions are then interpreted. (The actual planted motif is CGACCGAACTCC. Of course, the neural network doesn't know this.) Notebook: https://colab.research.google.com/drive/17E4h5aAOioh5DiTo7MZg4hpL6Z_0FyWr
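The notebook ships its data as a text file, but the setup is easy to reproduce: plant the motif into half of a set of random 50-bp sequences. This is a sketch of the data-generating idea, not the notebook's actual code; the motif string is the one quoted in the text:

```python
import random

random.seed(0)
MOTIF = "CGACCGAACTCC"  # the planted motif quoted in the text
SEQ_LEN, N_SEQS = 50, 2000

def random_seq(n=SEQ_LEN):
    """A uniformly random DNA string of length n."""
    return "".join(random.choice("ACGT") for _ in range(n))

def with_motif(n=SEQ_LEN):
    """A random sequence with the motif planted at a random position."""
    s = random_seq(n)
    pos = random.randrange(n - len(MOTIF) + 1)
    return s[:pos] + MOTIF + s[pos + len(MOTIF):]

# Alternate positives and negatives: 1000 of each
sequences = [with_motif() if i % 2 == 0 else random_seq() for i in range(N_SEQS)]
labels = [1 - (i % 2) for i in range(N_SEQS)]  # 1 = contains the motif
```

The chance of the motif appearing spontaneously in a 50-bp random sequence is negligible (about 39 × 4⁻¹², a few in a million), so the labels are essentially clean.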

First, encode the data. The four bases are one-hot encoded in alphabetical order: A [1,0,0,0], C [0,1,0,0], G [0,0,1,0], T [0,0,0,1].
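Before the scikit-learn version below, the same mapping can be written by hand in a few lines (a minimal sketch; the `BASE_INDEX` dict and `one_hot` helper are ours, not the notebook's):

```python
import numpy as np

# Alphabetical indices, matching what LabelEncoder produces: A=0, C=1, G=2, T=3
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Return a (len(seq), 4) one-hot matrix for a DNA string."""
    mat = np.zeros((len(seq), 4))
    mat[np.arange(len(seq)), [BASE_INDEX[b] for b in seq]] = 1.0
    return mat

print(one_hot("ACGT"))  # the 4x4 identity matrix
```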

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests

# `sequences` is the list of 2000 50-bp strings loaded earlier in the
# notebook (hence the `requests` import above).

# LabelEncoder encodes a sequence of bases as a sequence of integers
# (alphabetical: A=0, C=1, G=2, T=3).
integer_encoder = LabelEncoder()
# OneHotEncoder converts that integer array to a sparse matrix with one
# row per base and one column per possible base.
one_hot_encoder = OneHotEncoder(categories=[range(4)])

input_features = []
for sequence in sequences:
    integer_encoded = integer_encoder.fit_transform(list(sequence))
    integer_encoded = np.array(integer_encoded).reshape(-1, 1)
    one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
    input_features.append(one_hot_encoded.toarray())

np.set_printoptions(threshold=40)

# np.stack stacks the per-sequence (50, 4) arrays along a new leading
# axis, giving a single (2000, 50, 4) array.
input_features = np.stack(input_features)

print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n', sequences[1][:10], '...', sequences[1][-10:])
print(len(input_features[0]), "==", len(sequences[1]), "and",
      len(input_features), "==", len(sequences))
print('One hot encoding of Sequence #1:\n', input_features[1].T)

Example sequence
-----------------------
DNA Sequence #1:
 GAGTTTATAT ... TGTCGCGTCG
50 == 50 and 2000 == 2000
One hot encoding of Sequence #1:
 [[0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [1. 0. 1. ... 0. 0. 1.]
 [0. 0. 0. ... 1. 0. 0.]]
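The notebook's next step (not shown above) is to one-hot encode the 0/1 labels the same way and split everything into training and test sets. A sketch, with placeholder arrays standing in for the real `input_features` and `labels`:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Placeholders standing in for the arrays built above
input_features = np.zeros((2000, 50, 4))
labels = np.random.default_rng(0).integers(0, 2, size=2000)

# One-hot encode the labels: 0 -> [1, 0], 1 -> [0, 1]
one_hot_encoder = OneHotEncoder(categories=[range(2)])
input_labels = one_hot_encoder.fit_transform(labels.reshape(-1, 1)).toarray()

# Hold out 25% of the sequences for testing
train_features, test_features, train_labels, test_labels = train_test_split(
    input_features, input_labels, test_size=0.25, random_state=42)
print(train_features.shape, test_labels.shape)  # (1500, 50, 4) (500, 2)
```

Each label row sums to 1, which is what a softmax output layer of width 2 will be trained against.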

