Microarray Classification

Classification is closely tied to machine learning: we learn from existing data and use what we learn to predict new cases. Task: assign an object to a class based on measurements on the object, such as its expression profile.

Machine learning:

1. Unsupervised learning:

Ignores known class labels (e.g., normal vs. cancer)

Sometimes cannot even separate the known classes

2. Supervised learning:

Extracts useful features based on known class labels to best separate the classes

Can overfit the data, so separate training and test sets are needed

Unsupervised learning:

1. Clustering: the most important unsupervised learning method

No guarantee that samples with known labels will cluster together.

Try batch-effect removal or different clustering methods: change the linkage, or select a subset of genes (semi-supervised), as in the sketch below.
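A minimal sketch (assuming a hypothetical genes-by-samples expression matrix expr) of trying different linkages with base R's hclust():

d = as.dist(1 - cor(expr))            # 1 - correlation as the distance between samples
hc1 = hclust(d, method = "complete")  # complete linkage
hc2 = hclust(d, method = "average")   # average linkage
par(mfrow = c(1, 2)); plot(hc1); plot(hc2)   # compare the two dendrograms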

2. K Nearest Neighbours (KNN)

Previously used for missing-value estimation. For an observation X with an unknown label, find the k observations in the training data closest to X (e.g., by correlation).

Predict the label of X by majority vote among the k nearest neighbours.

K can be determined by the predictability of known samples (semi-supervised).

KNN can be extended by weighting the neighbours by inverse distance from the test sample.

KNN offers little insight into mechanism.
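A from-scratch sketch of the majority vote (an assumed illustration, not course code; train, labels, and test are hypothetical inputs):

# train: samples x genes matrix; labels: one class label per row; test: one profile
knn_predict = function(train, labels, test, k = 5) {
  d = apply(train, 1, function(row) 1 - cor(row, test))  # correlation distance to each training sample
  votes = labels[order(d)[1:k]]                          # labels of the k nearest neighbours
  names(which.max(table(votes)))                         # majority vote
}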

3. Multidimensional Scaling (MDS)

Based on distances between data points in the high-dimensional space (e.g., correlation-based).

Tries to approximate the pairwise distances between samples: gives a 2D or 3D representation that preserves the pairwise distance relationships as closely as possible.
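A minimal sketch with base R's cmdscale(), again assuming a hypothetical genes-by-samples matrix expr:

d = as.dist(1 - cor(expr))    # pairwise distances between samples
coords = cmdscale(d, k = 2)   # 2D coordinates approximating those distances
plot(coords, xlab = "MDS 1", ylab = "MDS 2")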

4. Principal Component Analysis (PCA)

Linear transformation that projects the data onto a new coordinate system (linear combinations of the original variables) to capture as much of the variation in the data as possible.

The first principal component accounts for the greatest possible variance in the dataset.

The second principal component accounts for the next highest variance and is uncorrelated with (orthogonal to) the first principal component.

The data start in a high-dimensional space; we then shrink them down, first to a single dimension that is a linear combination of all the genes.

Finding the Projections:

    • Looking for a linear combination to transform the original data matrix X to:

          Y = d^T X = d1*X1 + d2*X2 + ... + dp*Xp

    • where d = (d1, d2, ..., dp)^T is a column vector of weights with

          d1² + d2² + ... + dp² = 1

    • Maximize the variance of the projection of the observations on the Y variable

PCA is achieved by singular value decomposition: X = U D V^T

    • X is the original data
    • U (N×N) is the relative projection of the points
    • V is the projection directions
        – v1 is a unit vector, the direction of the first projection
        – The eigenvector with the largest eigenvalue
        – A linear combination (relative importance) of each gene (if PCA is on samples)
    • D is the scaling factor (the eigenvalues)
        – Diagonal matrix, d1 >= d2 >= d3 >= ... >= 0
        – d_m measures the variance captured by the m-th principal component
    • u1*d1 is the distance along v1 from the origin (the first principal component)
        – Expression values projected onto v1
        – u1*d1 captures the largest variance of the original X
        – v2 is the 2nd projection direction, orthogonal to v1; u2*d2 captures the 2nd largest variance of X
Note:
PCA and MDS are both good dimension-reduction methods.
PCA is also useful for visualizing cluster structure, and it can be conducted on genes or on samples.
PCA is only powerful when the biological question is related to the highest variance in the dataset.
PCA for batch-effect detection: if samples separate by batch rather than by biology on the top principal components, a batch effect is present.
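A small sketch on simulated data (an assumed example) connecting prcomp() to the SVD X = U D V^T:

X = scale(matrix(rnorm(200), nrow = 20), center = TRUE, scale = FALSE)  # centered 20 x 10 matrix
s = svd(X)                       # X = U D V^T: s$u, s$d, s$v
pc = prcomp(X, center = FALSE)   # data already centered above
all.equal(abs(unname(pc$x)), abs(s$u %*% diag(s$d)))  # PC scores = U D, up to sign
s$d^2 / sum(s$d^2)               # fraction of variance captured by each component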
Supervised Learning:

1. Logistic Regression: a more statistical flavour of machine learning; find a combination of genes and give each a weight so as to predict the sample classification.

Data: (yi, xi), i = 1, 2, ..., n.

Model:

    P(Y = 1 | X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))

b0: the intercept (whether the curve sits to the left or to the right)

b1: the regression slope (how steep the curve is)

b0 + b1*X = 0: the decision boundary, where P(Y = 1) = P(Y = 0) = 0.5

Logit: logistic regression differs from linear regression in that the change in probability is not linear in X; it is the log-odds, logit(P) = log(P / (1 - P)) = b0 + b1*X, that is linear.
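A quick sketch (with assumed values for b0 and b1) of the S-shaped curve and the 0.5 decision boundary:

b0 = 0; b1 = 2
x = seq(-4, 4, by = 0.1)
p = plogis(b0 + b1 * x)                 # e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
plot(x, p, type = "l")                  # S-shaped, not linear in x
abline(h = 0.5, v = -b0 / b1, lty = 2)  # boundary where b0 + b1*x = 0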


Sample classification: Y = 1 for cancer, 0 for normal

Model:

    logit(P) = log(P / (1 - P)) = β0 + β1*X1 + β2*X2 + ... + βp*Xp

β0: the intercept

β1, β2, ..., βp: weights for the selected genes, not all genes. Select some genes that are important and give them weights so that cancer and normal are best separated.

Note: find a subset of p genes whose expression collectively predicts the classification of new samples.

2. Support Vector Machine (SVM)

SVM tries to maximize the separation between the samples that lie on the class boundary. (The interior samples do not matter much; the aim is just to make sure the boundaries are well separated.)

SVM finds the hyperplane that maximizes the margin. The margin is determined by the support vectors (the samples on the class edge); the other samples are irrelevant.

Cross Validation:

For any supervised machine learning method, to make sure we do not overfit, we use cross-validation to test on samples left out of training.

When we do not have many samples: do leave-one-out cross-validation on n data points (build the classifier on n-1 samples, test on the one left out).

When we have enough samples: do N-fold cross-validation (divide the data into N equal subsets, build the classifier on N-1 subsets, compute the error rate on the subset left out). A from-scratch sketch follows.
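A from-scratch sketch of leave-one-out CV (an assumed example: a data frame dat with a 0/1 response y and predictor columns):

loocv_error = function(dat) {
  wrong = 0
  for (i in seq_len(nrow(dat))) {
    fit  = glm(y ~ ., data = dat[-i, ], family = binomial())    # build on n-1 samples
    phat = predict(fit, newdata = dat[i, ], type = "response")  # test on the one left out
    wrong = wrong + ((phat > 0.5) != dat$y[i])
  }
  wrong / nrow(dat)  # leave-one-out error rate
}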


Implementation:

1. Generate training data:

library(MASS)                              # for mvrnorm()
n = 200
S = cbind(c(1,0.2),c(0.2,1))               # 2x2 covariance matrix
x1 = mvrnorm(n=n, mu=c(1,1), Sigma=S)      # class 1 centered at (1,1)
x2 = mvrnorm(n=n, mu=c(-1,-1), Sigma=S)    # class 2 centered at (-1,-1)
x = rbind(x1, x2)
colnames(x) = c("x1","x2")
y1 = rep(c(1,-1), each=n)                  # labels coded 1/-1 (LDA, SVM)
y2 = rep(c(1,0), each=n)                   # labels coded 1/0 (logistic regression)
df = data.frame(cbind(y1, y2, x))

2. Implementation of KNN:

library(caret)
# Repeated cross-validation (10-fold, 5 repeats) to tune k
Ctrl = trainControl(method="repeatedcv", repeats=5)
KNN = train(x=df[,3:4], y=as.factor(df[,1]), method="knn",
            tuneGrid=data.frame(k=seq(from=1,to=15,by=2)), trControl=Ctrl)
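A quick check of the fit (a sketch): the train object stores the cross-validated choice of k, and predict() returns class labels.

KNN$bestTune                            # the k selected by cross-validation
head(predict(KNN, newdata = df[,3:4]))  # predicted class labels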

3. Implementation of Logistic Regression:

# Logistic model (uses the 0/1 labels y2)
LR = glm(y2 ~ x1 + x2, data=df, family=binomial())
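A sketch of using the fitted model for classification:

phat = predict(LR, type = "response")  # fitted probabilities P(Y = 1)
mean((phat > 0.5) != df$y2)            # training error rate at the 0.5 cutoff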

4. Implementation of Linear Discriminant Analysis:

# LDA model (lda() comes from MASS, loaded above)
LDA = lda(y1 ~ x1 + x2, data=df)
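A sketch of extracting class predictions from the LDA fit:

pred = predict(LDA)        # list with $class, $posterior, $x
mean(pred$class != df$y1)  # training error rate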

5. Implementation of SVM:

library("caret");

# SVM model

Ctrl = trainControl(method="repeatedcv",repeats=5);

SVM = train(as.factor(y1)~x1+x2,data=df,method="svmLinear",trControl=Ctrl);
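And a sketch of the SVM's training accuracy:

mean(predict(SVM, newdata = df) == as.factor(df$y1))  # training accuracy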


