# A First Look at R: Implementing PCA

### Data description

The human body consists of tens of trillions of cells, nearly every one of which contains DNA molecules that together make up the genome (Figure 1). The genome serves only as a storage unit for genetic information; to be used, parts of it must be copied into smaller molecules called RNA (Figure 1). Each RNA molecule is much smaller than the genome and carries the information of a single gene, whereas the genome holds the genetic information of every gene. The process of copying part of the genome into RNA is called transcription (Figure 1). After RNAs are transcribed from the genome, they are converted into polypeptides (or proteins), the actual machinery that runs cellular processes. This conversion is called translation to distinguish it from transcription (Figure 1).

Figure 1. Description of transcription process

A cell needs to transcribe a gene only when that gene is required, so genes are transcribed selectively. Which genes are transcribed is determined by control units called transcription factors, which are themselves proteins. In general, multiple transcription factors are needed to transcribe a single gene; they assemble into a single protein complex together with other mediator proteins. By bending the DNA molecule into a U-shaped structure, the transcription factor complex exerts a physical force that pushes the copying machinery forward along the DNA (Figure 2). Therefore, the quantity of each gene’s transcript is controlled by the quantities of the corresponding transcription factors.

Figure 2. Description of transcription factor’s action

As described above, the biology of transcription is well studied and the mechanism is fairly straightforward. The remaining problem is to determine which factors control which genes. Extracting this information with a regression model is a well-known problem in bioinformatics. We provide a genome-wide profile of RNA quantities for building such a model. The data are RNA quantities measured in 20 different people. For each person there is a single target gene (TG) as the dependent variable and nine transcription factors (TF) as the input variables.
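
To make the layout of the data concrete, here is a minimal sketch of a table with 20 samples, nine TF columns and one TG column, together with an ordinary least-squares fit. The real file is not reproduced here, so the values are simulated; the column names `TF1` to `TF9` and `TG`, and the choice of `TF1` and `TF4` as the driving factors, are assumptions made only for illustration.

```r
# Simulated stand-in for the real data: 20 samples, nine TF columns, one TG column.
# Column names (TF1..TF9, TG) and all values are hypothetical.
set.seed(1)
n_samples <- 20
tf <- matrix(rnorm(n_samples * 9), nrow = n_samples,
             dimnames = list(NULL, paste0("TF", 1:9)))

# Assume, for illustration only, that TF1 and TF4 drive the target gene.
tg <- 2 * tf[, "TF1"] - 1.5 * tf[, "TF4"] + rnorm(n_samples, sd = 0.3)
sim_table <- data.frame(tf, TG = tg)

# Ordinary least squares with TG as the response and all nine TFs as predictors.
fit <- lm(TG ~ ., data = sim_table)
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))
```

With only 20 samples and nine predictors, such a fit has few residual degrees of freedom, which is one reason to look at a lower-dimensional summary of the TF columns first.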

### Implementing PCA in R

#### Reading the sample data

> gene_table <- read.table("GeneExpr.Table.txt", sep = "\t", header = TRUE)  # tab-separated file with a header row
> gene_table  # print the table to check the import
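
After an import like the one above, it can help to check the shape and column types before running PCA. The sketch below does not use the real file: it writes a small simulated table to a temporary file and reads it back with the same `read.table` arguments. The column names and values are hypothetical.

```r
# Write a small hypothetical table to a temporary tab-separated file,
# then read it back with the same arguments as above.
set.seed(2)
sim_table <- data.frame(matrix(rnorm(20 * 10), nrow = 20))
names(sim_table) <- c(paste0("TF", 1:9), "TG")   # hypothetical column names
tmp_file <- tempfile(fileext = ".txt")
write.table(sim_table, tmp_file, sep = "\t", quote = FALSE, row.names = FALSE)

read_back <- read.table(tmp_file, sep = "\t", header = TRUE)
print(dim(read_back))   # number of rows and columns
str(read_back)          # column names and types

# PCA needs numeric columns without missing values.
all_numeric <- all(vapply(read_back, is.numeric, logical(1)))
any_missing <- anyNA(read_back)
unlink(tmp_file)        # remove the temporary file
```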

#### Principal component analysis

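> pca_princomp <- princomp(gene_table[, -10], cor = TRUE)  # PCA on the correlation matrix, column 10 excluded
> pca_prcomp <- prcomp(gene_table[, -10], scale. = TRUE)  # PCA via SVD on centered, standardized columns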

> biplot(pca_prcomp, col = c("black", "white"))  # samples in black; variable arrows in white, so they do not show on a white background
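
The two functions used above should describe the same components when `princomp` works on the correlation matrix and `prcomp` standardizes the columns. The following sketch checks this on simulated data; the matrix `sim_tf` and its column names are hypothetical stand-ins for the nine TF columns.

```r
# Hypothetical 20 x 9 matrix standing in for the nine TF columns.
set.seed(3)
sim_tf <- matrix(rnorm(20 * 9), nrow = 20,
                 dimnames = list(NULL, paste0("TF", 1:9)))
sim_tf[, "TF2"] <- sim_tf[, "TF1"] + rnorm(20, sd = 0.2)  # introduce some correlation

fit_princomp <- princomp(sim_tf, cor = TRUE)
fit_prcomp   <- prcomp(sim_tf, scale. = TRUE)

# Proportion of variance explained by each component, from both fits.
var_princomp <- fit_princomp$sdev^2 / sum(fit_princomp$sdev^2)
var_prcomp   <- fit_prcomp$sdev^2 / sum(fit_prcomp$sdev^2)
print(round(cumsum(var_prcomp), 3))

# The loadings should agree up to the sign of each component.
max_loading_diff <- max(abs(abs(unclass(fit_princomp$loadings)) -
                            abs(fit_prcomp$rotation)))
print(max_loading_diff)
```

The sign of a component is arbitrary, which is why the comparison uses absolute values.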

#### Obtaining the dimension-reduced data

> pca_data <- predict(pca_prcomp)  # scores of the original samples on the principal components

> plot(pca_data[, 1:2])  # scatter plot of the samples on the first two components
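
As a sketch of what `predict` returns for a `prcomp` fit, the code below rebuilds the scores by hand (center, scale, then multiply by the rotation matrix) and keeps only as many components as are needed to reach a chosen share of the variance. The data are simulated, and the 80% threshold is an arbitrary assumption for illustration, not a recommendation from the original text.

```r
# Hypothetical 20 x 9 matrix standing in for the nine TF columns.
set.seed(4)
sim_tf <- matrix(rnorm(20 * 9), nrow = 20,
                 dimnames = list(NULL, paste0("TF", 1:9)))
fit <- prcomp(sim_tf, scale. = TRUE)

# Without new data, predict() returns the scores stored in fit$x.
scores <- predict(fit)

# The same scores computed by hand: standardize, then rotate.
manual <- scale(sim_tf, center = fit$center, scale = fit$scale) %*% fit$rotation

# Keep the smallest number of components that explains at least 80% of the
# variance (threshold chosen arbitrarily for this sketch).
cum_var <- cumsum(fit$sdev^2) / sum(fit$sdev^2)
n_keep  <- which(cum_var >= 0.8)[1]
reduced <- scores[, seq_len(n_keep), drop = FALSE]
print(dim(reduced))
```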

