Machine learning can be divided into:
1. Supervised learning: the goal is classification (predicting a discrete label) or regression (predicting a continuous response), using a provided set of labeled training examples.
2. Unsupervised learning: the goal is learning the inherent patterns in the data (e.g., PCA and clustering; the CA and CCA analyses that originated in ecological analysis arguably also belong here, and their principles and formula derivations will be covered in detail later).
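To make the distinction concrete, here is a minimal runnable sketch using scikit-learn (the toy data and model choices are my own illustration, not part of the original article):

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Toy data: 200 points in two blobs, with labels y.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: fit a classifier using the provided labels y.
clf = LogisticRegression().fit(X, y)
print("classification accuracy:", clf.score(X, y))

# Unsupervised: ignore y and look for structure in X alone.
pca = PCA(n_components=2).fit(X)  # directions of maximal variance
cluster_ids = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # discovered groups
print("explained variance ratio:", pca.explained_variance_ratio_)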
Reportedly, Ilya Sutskever, one of the co-founders of OpenAI, is a strong advocate of unsupervised learning.
As an aside, here is the list of papers and learning resources he compiled for programmers moving into AI:
1.The Annotated Transformer (nlp.seas.harvard.edu)
2.The First Law of Complexodynamics (scottaaronson.blog)
3.The Unreasonable Effectiveness of RNNs (karpathy.github.io)
4.Understanding LSTM Networks (colah.github.io)
5.Recurrent Neural Network Regularization (arxiv.org)
6.Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (cs.toronto.edu)
7.Pointer Networks (arxiv.org)
8.ImageNet Classification with Deep CNNs (proceedings.neurips.cc)
9.Order Matters: Sequence to sequence for sets (arxiv.org)
10.GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (arxiv.org)
11.Deep Residual Learning for Image Recognition (arxiv.org)
12.Multi-Scale Context Aggregation by Dilated Convolutions (arxiv.org)
13.Neural Quantum Chemistry (arxiv.org)
14.Attention Is All You Need (arxiv.org)
15.Neural Machine Translation by Jointly Learning to Align and Translate (arxiv.org)
16.Identity Mappings in Deep Residual Networks (arxiv.org)
17.A Simple NN Module for Relational Reasoning (arxiv.org)
18.Variational Lossy Autoencoder (arxiv.org)
19.Relational RNNs (arxiv.org)
20.Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton (arxiv.org)
21.Neural Turing Machines (arxiv.org)
22.Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (arxiv.org)
23.Scaling Laws for Neural LMs (arxiv.org)
24.A Tutorial Introduction to the Minimum Description Length Principle (arxiv.org)
25.Machine Super Intelligence Dissertation (vetta.org)
26.PAGE 434 onwards: Kolmogorov Complexity (lirmm.fr)
27.CS231n Convolutional Neural Networks for Visual Recognition (cs231n.github.io)
Take it step by step. Let's start with a hands-on, reproducible guide to deep learning in genomics.
That article gives a practical example of using a neural network to predict a motif in DNA sequences: the data consist of 2000 short sequences of 50 bp each, and for each sequence it is known experimentally whether it contains the motif (e.g., an enhancer), labeled 0 or 1. The data are split into a training set and a test set for validation, and the trained model is interpreted at the end. The true motif embedded in the sequences is CGACCGAACTCC (of course, the neural network doesn't know this).
https://colab.research.google.com/drive/17E4h5aAOioh5DiTo7MZg4hpL6Z_0FyWr
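To illustrate what such a dataset looks like, here is a minimal simulation sketch (hypothetical: the Colab notebook loads precomputed sequence and label files rather than generating them like this):

import numpy as np

rng = np.random.default_rng(0)
MOTIF = "CGACCGAACTCC"  # the true motif, as stated above

def random_seq(length=50):
    return "".join(rng.choice(list("ACGT"), size=length))

sequences, labels = [], []
for i in range(2000):
    seq = random_seq()
    has_motif = i % 2 == 0  # half positive, half negative
    if has_motif:
        # Embed the motif at a random position within the 50-bp sequence.
        pos = rng.integers(0, 50 - len(MOTIF))
        seq = seq[:pos] + MOTIF + seq[pos + len(MOTIF):]
    sequences.append(seq)
    labels.append(int(has_motif))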
First, encode the data. With the four bases label-encoded in alphabetical order (A=0, C=1, G=2, T=3), the one-hot codes are: A [1,0,0,0], C [0,1,0,0], G [0,0,1,0], T [0,0,0,1].
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests  # used by the Colab notebook to download the sequence/label files

# `sequences` is a list of 2000 strings of 50 bp each; the Colab notebook
# downloads it with requests, or use the simulated list from the sketch above.

# The LabelEncoder encodes a sequence of bases as a sequence of integers.
integer_encoder = LabelEncoder()
# The OneHotEncoder converts an array of integers to a sparse matrix where
# each row corresponds to one possible value of each feature.
one_hot_encoder = OneHotEncoder(categories=[range(4)])
input_features = []
for sequence in sequences:
    integer_encoded = integer_encoder.fit_transform(list(sequence))
    integer_encoded = np.array(integer_encoded).reshape(-1, 1)
    one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
    input_features.append(one_hot_encoded.toarray())

np.set_printoptions(threshold=40)
# np.stack joins the list of (50, 4) arrays along a new first axis,
# so input_features becomes a single array of shape (2000, 50, 4).
input_features = np.stack(input_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n', sequences[1][:10], '...', sequences[1][-10:])
print(len(input_features[0]), "==", len(sequences[1]), "and", len(input_features), "==", len(sequences))
print('One hot encoding of Sequence #1:\n', input_features[1].T)
Example sequence
-----------------------
DNA Sequence #1:
 GAGTTTATAT ... TGTCGCGTCG
50 == 50 and 2000 == 2000
One hot encoding of Sequence #1:
 [[0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [1. 0. 1. ... 0. 0. 1.]
 [0. 0. 0. ... 1. 0. 0.]]
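The 0/1 labels can be one-hot encoded in the same way, and the data then split into a training set and a test set as described above. A minimal sketch of that step (variable names and the 25% hold-out are my own assumptions, following standard scikit-learn usage):

from sklearn.model_selection import train_test_split

# `labels` is the list of 0/1 motif indicators for the 2000 sequences.
one_hot_encoder = OneHotEncoder(categories=[range(2)])
labels_array = np.array(labels).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels_array).toarray()

# Hold out 25% of the sequences as a test set for validation.
train_features, test_features, train_labels, test_labels = train_test_split(
    input_features, input_labels, test_size=0.25, random_state=42)
print(train_features.shape, test_features.shape)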