Classification is a core machine learning task: we learn from existing data and then predict labels for new data. Task: assign an object to a class based on measurements on that object, such as its expression profile.
Machine learning:
1. Unsupervised learning:
Ignores known class labels (e.g., normal or cancer)
Sometimes cannot even separate the known classes
2. Supervised learning:
Extracts useful features based on known class labels to best separate the classes
Can overfit the data, so we need separate training and test sets
Unsupervised learning:
1. Clustering: the most important unsupervised learning method
No guarantee that samples from a known class will cluster together.
If they do not, try batch effect removal or different clustering choices: change the linkage, or select a subset of genes (semi-supervised)
2. K Nearest Neighbours (KNN)
Used earlier for missing-value estimation. For an observation X with unknown label, find the k observations in the training data closest to X (e.g., by correlation).
Predict the label of X by majority vote among the k nearest neighbours.
k can be chosen by how well it predicts the known samples (semi-supervised)
KNN can be extended by weighting each neighbour by the inverse of its distance from the test sample
KNN offers little insight into mechanism
3. Multidimensional Scaling (MDS)
Based on the distances between data points in the high-dimensional space (e.g., correlation-based)
Tries to approximate the pairwise distances between samples: gives a 2D or 3D representation that preserves the pairwise distance relationships as much as possible, as in the sketch below.
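A minimal sketch in base R, assuming x is a samples-by-genes matrix (cmdscale() performs classical MDS on a distance matrix):
d = dist(x)               # Euclidean distances between samples; 1 - correlation also works
mds = cmdscale(d, k=2)    # 2D coordinates approximating the pairwise distances
plot(mds[,1], mds[,2], xlab="MDS 1", ylab="MDS 2")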
4. Principal Component Analysis (PCA)
A linear transformation that projects the data onto a new coordinate system (linear combinations of the original variables) to capture as much of the variation in the data as possible.
The first principal component accounts for the greatest possible variance in the dataset.
The second principal component accounts for the next highest variance and is uncorrelated with (orthogonal to) the first principal component.
We start in a high-dimensional space; PCA shrinks it, beginning with a single first dimension formed as a linear combination of all the genes.
Finding the Projections:
Y = dᵀX = d1X1 + d2X2 + ... + dpXp
subject to d1² + d2² + ... + dp² = 1
PCA is achieved by singular value decomposition: X = UDVᵀ
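A minimal sketch with base R's prcomp(), which computes the projection via SVD internally, assuming x is a samples-by-genes matrix:
pca = prcomp(x, center=TRUE, scale.=TRUE)   # PCs are linear combinations of the genes
summary(pca)                                # variance explained by each component
plot(pca$x[,1], pca$x[,2], xlab="PC1", ylab="PC2")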
Supervised learning:
1. Logistic Regression: a more statistical style of machine learning: find a combination of genes, then give them different weights to predict the sample classification.
Data: (yi, xi), i = 1, 2, ..., n.
Model: P(Y=1 | X) = exp(b0 + b1X) / (1 + exp(b0 + b1X))
b0: the intercept (whether the curve sits to the left or to the right)
b1: the regression slope (how steep the slope is)
b0 + b1X = 0: the decision boundary, where P(1) = P(0) = 0.5
Logit: logistic regression differs from linear regression in that the change in probability is not linear in X; it is the logit, log(P/(1−P)), that is linear in X.
Sample classification: Y = 1 for cancer, 0 for normal
Model: logit(P(Y=1)) = β0 + β1X1 + β2X2 + ... + βpXp
β0: the intercept
β1, β2, ..., βp: weights given to selected genes, not all genes. Select the genes that are important and give them weights so that they best separate cancer from normal.
Note: the goal is to find a subset of p genes whose expression collectively predicts the classification of new samples.
2. Support Vector Machine ( SVM )
SVM: tries to maximize the separation distance between the samples that lie on the boundary. (The interior samples do not matter much; the aim is just to make sure the boundary between classes is well separated.)
SVM: finds the hyperplane that maximizes the margin. The margin is determined by the support vectors (the samples on the class edges); all other samples are irrelevant.
Cross Validation:
For every supervised machine learning method, to make sure we do not overfit, we use cross validation: test on samples that were left out of training.
When we do not have enough samples: do leave-one-out cross validation on the n data points (build the classifier on n−1 samples, test on the one left out), as sketched below.
When we have enough samples: do N-fold cross validation (divide the data into N equal subsets, build the classifier on N−1 subsets, compute the error rate on the one subset left out).
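A minimal leave-one-out sketch with caret, using the data frame df built in the Implementation section below (caret's trainControl() supports method="LOOCV"):
library(caret)
Ctrl.loo = trainControl(method="LOOCV")   # each sample is held out once
KNN.loo = train(x=df[,3:4], y=as.factor(df[,1]), method="knn", trControl=Ctrl.loo)
KNN.loo$results                           # accuracy for each candidate k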
Implementation:
1. Generate training data:
library(MASS)   # for mvrnorm
n = 200
S = cbind(c(1,0.2),c(0.2,1))              # common covariance matrix
x1 = mvrnorm(n=n, mu=c(1,1), Sigma=S)     # class 1 samples
x2 = mvrnorm(n=n, mu=c(-1,-1), Sigma=S)   # class 2 samples
x = rbind(x1,x2)
colnames(x) = c("x1","x2")
y1 = rep(c(1,-1), each=n)   # labels coded 1/-1 (KNN, LDA, SVM)
y2 = rep(c(1,0), each=n)    # labels coded 1/0 (logistic regression)
df = data.frame(cbind(y1,y2,x))
2. Implementation of KNN:
library(caret)
Ctrl = trainControl(method="repeatedcv", repeats=5)   # repeated 10-fold CV
KNN = train(x=df[,3:4], y=as.factor(df[,1]), method="knn",
            tuneGrid=data.frame(k=seq(from=1,to=15,by=2)), trControl=Ctrl)
3. Implementation of Logistic Regression:
# Logistic model: y2 is coded 0/1, as required by family=binomial
LR = glm(y2~x1+x2, data=df, family=binomial())
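A quick sanity check, as a sketch: classify the training samples with the 0.5 probability cutoff (the decision boundary) and compare against the true labels.
p = predict(LR, type="response")                    # fitted probabilities P(Y=1)
table(predicted=as.numeric(p > 0.5), actual=df$y2)  # in-sample confusion table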
4. Implementation of Linear Discriminant Analysis:
# LDA model (lda() is in the MASS package, loaded above)
LDA = lda(y1~x1+x2, data=df)
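A sketch of in-sample prediction (predict() on an lda fit returns the predicted classes in $class):
pred.lda = predict(LDA)$class
table(predicted=pred.lda, actual=df$y1)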
5. Implementation of SVM:
library("caret");
# SVM model
Ctrl = trainControl(method="repeatedcv",repeats=5);
SVM = train(as.factor(y1)~x1+x2,data=df,method="svmLinear",trControl=Ctrl);
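A sketch of checking the fitted model; the held-out CV accuracy is already stored in SVM$results, so this confusion matrix is in-sample only:
pred.svm = predict(SVM, newdata=df)
confusionMatrix(pred.svm, as.factor(df$y1))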