Classification is a core machine learning task: we learn from existing data and then predict labels for new data. Task: assign an object to a class based on measurements on that object, such as its expression profile.
Machine learning:
1. Unsupervised learning:
Ignores known class labels (e.g., normal or cancer)
Sometimes cannot even separate the known classes
2. Supervised learning:
Extracts useful features based on known class labels to best separate the classes
Can overfit the data, so we need separate training and test sets
Unsupervised learning:
1. Clustering: the most important unsupervised learning method
No guarantee that samples from a known class will cluster together.
If they do not, try batch effect removal or different clustering choices: change the linkage, or select a subset of genes (semi-supervised)
2. K Nearest Neighbours (KNN)
Used earlier for missing-value estimation. For an observation X with unknown label, find the k observations in the training data closest to X (e.g., by correlation).
Predict the label of X by majority vote among the k nearest neighbours.
k can be chosen by how well it predicts the known samples (semi-supervised)
KNN can be extended by weighting each neighbour by the inverse of its distance from the test sample
KNN offers little insight into mechanism
3. Multidimensional Scaling (MDS)
Based on the distances between data points in the high-dimensional space (e.g., correlation-based)
Tries to approximate the pairwise distances between samples: gives a 2D or 3D representation that preserves the pairwise distance relationships as much as possible, as in the sketch below.
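A minimal sketch in base R, assuming x is a samples-by-genes matrix (cmdscale() performs classical MDS on a distance matrix):
d = dist(x)               # Euclidean distances between samples; 1 - correlation also works
mds = cmdscale(d, k=2)    # 2D coordinates approximating the pairwise distances
plot(mds[,1], mds[,2], xlab="MDS 1", ylab="MDS 2")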
4. Principal Component Analysis (PCA)
A linear transformation that projects the data onto a new coordinate system (linear combinations of the original variables) to capture as much of the variation in the data as possible.
The first principal component accounts for the greatest possible variance in the dataset.
The second principal component accounts for the next highest variance and is uncorrelated with (orthogonal to) the first principal component.
We start in a high-dimensional space; PCA shrinks it, beginning with a single first dimension formed as a linear combination of all the genes.
Finding the Projections:
Y = dᵀX = d1X1 + d2X2 + ... + dpXp
subject to d1² + d2² + ... + dp² = 1
PCA is achieved by singular value decomposition: X = UDVᵀ
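A minimal sketch with base R's prcomp(), which computes the projection via SVD internally, assuming x is a samples-by-genes matrix:
pca = prcomp(x, center=TRUE, scale.=TRUE)   # PCs are linear combinations of the genes
summary(pca)                                # variance explained by each component
plot(pca$x[,1], pca$x[,2], xlab="PC1", ylab="PC2")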
Supervised learning:
1. Logistic Regression: a more statistical style of machine learning: find a combination of genes, then give them different weights to predict the sample classification.
Data: (yi, xi), i = 1, 2, ..., n.
Model: P(Y=1 | X) = exp(b0 + b1X) / (1 + exp(b0 + b1X))
b0: the intercept (whether the curve sits to the left or to the right)
b1: the regression slope (how steep the slope is)
b0 + b1X = 0: the decision boundary, where P(1) = P(0) = 0.5
Logit: logistic regression differs from linear regression in that the change in probability is not linear in X; it is the logit, log(P/(1−P)), that is linear in X.
Sample classification: Y = 1 for cancer, 0 for normal
Model: logit(P(Y=1)) = β0 + β1X1 + β2X2 + ... + βpXp
β0: the intercept
β1, β2, ..., βp: weights given to selected genes, not all genes. Select the genes that are important and give them weights so that they best separate cancer from normal.
Note: the goal is to find a subset of p genes whose expression collectively predicts the classification of new samples.
2. Support Vector Machine ( SVM )
SVM: tries to maximize the separation distance between the samples that lie on the boundary. (The interior samples do not matter much; the aim is just to make sure the boundary between classes is well separated.)
SVM: finds the hyperplane that maximizes the margin. The margin is determined by the support vectors (the samples on the class edges); all other samples are irrelevant.
Cross Validation:
For every supervised machine learning method, to make sure we do not overfit, we use cross validation: test on samples that were left out of training.
When we do not have enough samples: do leave-one-out cross validation on the n data points (build the classifier on n−1 samples, test on the one left out), as sketched below.
When we have enough samples: do N-fold cross validation (divide the data into N equal subsets, build the classifier on N−1 subsets, compute the error rate on the one subset left out).
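A minimal leave-one-out sketch with caret, using the data frame df built in the Implementation section below (caret's trainControl() supports method="LOOCV"):
library(caret)
Ctrl.loo = trainControl(method="LOOCV")   # each sample is held out once
KNN.loo = train(x=df[,3:4], y=as.factor(df[,1]), method="knn", trControl=Ctrl.loo)
KNN.loo$results                           # accuracy for each candidate k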
Implementation:
1. Generate training data:
library(MASS)   # for mvrnorm
n = 200
S = cbind(c(1,0.2),c(0.2,1))              # common covariance matrix
x1 = mvrnorm(n=n, mu=c(1,1), Sigma=S)     # class 1 samples
x2 = mvrnorm(n=n, mu=c(-1,-1), Sigma=S)   # class 2 samples
x = rbind(x1,x2)
colnames(x) = c("x1","x2")
y1 = rep(c(1,-1), each=n)   # labels coded 1/-1 (KNN, LDA, SVM)
y2 = rep(c(1,0), each=n)    # labels coded 1/0 (logistic regression)
df = data.frame(cbind(y1,y2,x))
2. Implementation of KNN:
library(caret)
Ctrl = trainControl(method="repeatedcv", repeats=5)   # repeated 10-fold CV
KNN = train(x=df[,3:4], y=as.factor(df[,1]), method="knn",
            tuneGrid=data.frame(k=seq(from=1,to=15,by=2)), trControl=Ctrl)
3. Implementation of Logistic Regression:
# Logistic model: y2 is coded 0/1, as required by family=binomial
LR = glm(y2~x1+x2, data=df, family=binomial())
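A quick sanity check, as a sketch: classify the training samples with the 0.5 probability cutoff (the decision boundary) and compare against the true labels.
p = predict(LR, type="response")                    # fitted probabilities P(Y=1)
table(predicted=as.numeric(p > 0.5), actual=df$y2)  # in-sample confusion table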
4. Implementation of Linear Discriminant Analysis:
# LDA model (lda() is in the MASS package, loaded above)
LDA = lda(y1~x1+x2, data=df)
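A sketch of in-sample prediction (predict() on an lda fit returns the predicted classes in $class):
pred.lda = predict(LDA)$class
table(predicted=pred.lda, actual=df$y1)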
5. Implementation of SVM:
library("caret");
# SVM model
Ctrl = trainControl(method="repeatedcv",repeats=5);
SVM = train(as.factor(y1)~x1+x2,data=df,method="svmLinear",trControl=Ctrl);
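A sketch of checking the fitted model; the held-out CV accuracy is already stored in SVM$results, so this confusion matrix is in-sample only:
pred.svm = predict(SVM, newdata=df)
confusionMatrix(pred.svm, as.factor(df$y1))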