Import the example data
library(SNPassoc)
data(SNPs)
SNPs[1:8, 1:8]
id casco    sex blood.pre  protein snp10001 snp10002 snp10003
 1     1 Female      13.7 75640.52       TT       CC       GG
 2     1 Female      12.7 28688.22       TT       AC       GG
 3     1 Female      12.9 17279.59       TT       CC       GG
 4     1   Male      14.6 27253.99       CT       CC       GG
 5     1 Female      13.4 38066.57       TT       AC       GG
 6     1 Female      11.3  9872.46       TT       CC       GG
 7     1 Female      11.9 11132.90       TT       AC       GG
 8     1   Male      12.4 29973.43       TT       AC       GG
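Extract the SNP data and convert the format
The important point here is that the sample IDs go into row.names, so that every remaining column holds SNP genotypes only.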
myDat <- SNPs[, -(2:5)]
row.names(myDat) <- myDat$id
myDat <- myDat[, -1]
myDat[1:5, 1:5]
# str(myDat)
myDat <- as.matrix(myDat)
snp10001 snp10002 snp10003 snp10004 snp10005
      TT       CC       GG       GG       GG
      TT       AC       GG       GG       AG
      TT       CC       GG       GG       GG
      CT       CC       GG       GG       GG
      TT       AC       GG       GG       GG
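Use the synbreed package for the conversion; it can impute missing values and recode the genotypes.
Recoding alleles from character/factor/numeric into the number of copies of the minor alleles, i.e. 0, 1 and 2. In codeGeno, in the first step heterozygous genotypes are coded as 1. From the other genotypes, the less frequent genotype is coded as 2 and the remaining genotype as 0. In other words, genotypes are recoded according to their frequency: the more common homozygote becomes 0, the heterozygote 1, and the rarer homozygote 2.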
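The recoding rule can be illustrated on a single marker in base R. This is only a sketch of the rule as stated above, not synbreed's implementation: `recode_marker` is a hypothetical helper, it assumes two-letter genotype strings such as "CT", and it resolves ties between homozygotes alphabetically.

```r
# Hypothetical helper: recode one marker given as two-letter genotype strings.
recode_marker <- function(g) {
  g <- as.character(g)
  a1 <- substr(g, 1, 1)
  a2 <- substr(g, 2, 2)
  het <- !is.na(g) & a1 != a2          # heterozygotes: the two alleles differ
  hom <- !is.na(g) & !het              # homozygotes: the two alleles match
  coded <- rep(NA_integer_, length(g)) # missing genotypes stay NA
  coded[het] <- 1L
  hom_counts <- table(g[hom])
  if (length(hom_counts) > 0) {
    # The most frequent homozygote (first one on ties) is coded 0, the other 2.
    major <- names(hom_counts)[which.max(hom_counts)]
    coded[hom & g == major] <- 0L
    coded[hom & g != major] <- 2L
  }
  coded
}

g <- c("TT", "TT", "CT", "TT", "CC", NA, "CT", "TT")
recode_marker(g)
```

Here TT is the more common homozygote and is coded 0, CT is coded 1, CC is coded 2, and the missing genotype stays NA. For a whole genotype matrix, the same helper could be applied column by column with apply(myDat, 2, recode_marker).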
library(synbreed)
cp <- create.gpData(geno = myDat)
cp.dat <- codeGeno(gpData = cp, label.heter = "alleleCoding", maf = 0.01, nmiss = 0.1,
                   impute = TRUE, impute.type = "random", verbose = TRUE)
step 1 : 1 marker(s) removed with > 10 % missing values
step 2 : Recoding alleles
step 4 : 12 marker(s) removed with maf < 0.01
step 7 : Imputing of missing values
step 7d : Random imputing of missing values
step 8 : No recoding of alleles necessary after imputation
step 9 : 0 marker(s) removed with maf < 0.01
step 10 : No duplicated markers removed
End : 22 marker(s) remain after the check
Summary of imputation
total number of missing values : 37
number of random imputations : 37
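The two thresholds reported in the log (more than 10 % missing values, maf below 0.01) can be reproduced by hand on a small coded matrix. The sketch below is an illustration in base R, not synbreed's code: the matrix `coded` and the marker names m1 to m4 are made up, and both filters are applied in one pass to already-coded data, whereas the log shows codeGeno filtering on missingness before recoding.

```r
# Toy coded matrix: rows are individuals, columns are markers,
# values are the number of minor-allele copies (0, 1 or 2).
coded <- cbind(
  m1 = c(0, 1, 0, 2, 0, 1, 0, 0, 1, 0),
  m2 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),    # monomorphic marker
  m3 = c(0, NA, NA, 1, 0, 0, 2, 0, 1, 0),  # 20 % missing values
  m4 = c(1, 1, 0, 0, 0, 2, 1, 0, 0, 1)
)

# Fraction of missing genotypes per marker.
miss_rate <- colMeans(is.na(coded))

# Allele frequency per marker: mean copy number divided by 2 (diploid).
maf <- colMeans(coded, na.rm = TRUE) / 2

# Keep markers with at most 10 % missing values and maf of at least 0.01.
keep <- miss_rate <= 0.1 & maf >= 0.01
filtered <- coded[, keep, drop = FALSE]
colnames(filtered)
```

In this toy example m2 is dropped because it is monomorphic and m3 because of its missing rate, which leaves m1 and m4.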
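If you get an error saying there are more than two genotypes, the missing values have not been handled properly. In that case, write the data to a CSV file and read it back in, so that NA entries are parsed as missing values (na.strings = "NA").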
write.csv(myDat, "snps.csv")
# stringsAsFactors = TRUE lets summary() tabulate genotype counts (the default is FALSE since R 4.0.0)
ge <- read.csv("snps.csv", header = TRUE, row.names = 1, na.strings = "NA", stringsAsFactors = TRUE)
summary(ge)
ge <- as.matrix(ge)
gp <- create.gpData(geno = ge)
cp.dat <- codeGeno(gpData = gp, label.heter = "alleleCoding", maf = 0.01, nmiss = 0.1,
                   impute = TRUE, impute.type = "random", verbose = TRUE)
snp10001   snp10002   snp10003   snp10004   snp10005   snp10006   snp10007   snp10008
CC  : 12   AA  :  5   GG  :144   GG  :156   AA  :  3   AA  :157   CC  :157   CC  :104
CT  : 53   AC  : 78   NA's: 13   NA's:  1   AG  : 70                         CG  : 44
TT  : 92   CC  : 74                         GG  : 84                         GG  :  9

snp10009   snp100010  snp100011  snp100012  snp100013  snp100014  snp100015
AA  : 72   TT  :147   CC  :  1   CC  :  3   AA  :101   AA  : 27   AG  : 13
AG  : 79   NA's: 10   CG  :  2   CG  : 68   AG  : 35   AC  : 74   GG  :144
GG  :  5              GG  :154   GG  : 84   GG  :  9   CC  : 52
NA's:  1                         NA's:  2   NA's: 12   NA's:  4

snp100016  snp100017  snp100018  snp100019  snp100020  snp100021  snp100022
GG  :152   CC  :  5   CC  :  5   CC  : 32   AA  :  9   GG  :157   AA  :156
NA's:  5   CT  : 83   CT  : 84   CG  : 75   AG  : 43              NA's:  1
           TT  : 67   TT  : 67   GG  : 50   GG  :105
           NA's:  2   NA's:  1

snp100023  snp100024  snp100025  snp100026  snp100027  snp100028  snp100029
AA  :  5   CC  : 14   CC  :157   GG  :156   CC  : 68   CC  : 34   AA  : 14
AT  : 78   CT  : 51              NA's:  1   CG  : 82   CT  : 72   AG  : 48
TT  : 71   TT  : 91                         GG  :  5   TT  : 50   GG  : 94
NA's:  3   NA's:  1                         NA's:  2   NA's:  1   NA's:  1

snp100030  snp100031  snp100032  snp100033  snp100034  snp100035
AA  :157   TT  :102   AA  : 34   AA  : 34   CC  : 14   TT  :146
           NA's: 55   AG  : 70   AG  : 69   CT  : 48   NA's: 11
                      GG  : 52   GG  : 49   TT  : 94
                      NA's:  1   NA's:  5   NA's:  1
step 1 : 1 marker(s) removed with > 10 % missing values
step 2 : Recoding alleles
step 4 : 12 marker(s) removed with maf < 0.01
step 7 : Imputing of missing values
step 7d : Random imputing of missing values
step 8 : No recoding of alleles necessary after imputation
step 9 : 0 marker(s) removed with maf < 0.01
step 10 : No duplicated markers removed
End : 22 marker(s) remain after the check
Summary of imputation
total number of missing values : 37
number of random imputations : 37
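Step 7d in the log fills every missing value with a random draw. The following is a minimal sketch of one way to do that, assuming genotypes are drawn from Hardy-Weinberg proportions at each marker's observed allele frequency. `impute_random` is a hypothetical helper and the toy matrix is made up; the sampling scheme synbreed actually uses is described in ?codeGeno and may differ in detail.

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical helper: fill NAs in a 0/1/2 coded matrix, marker by marker.
impute_random <- function(coded) {
  for (j in seq_len(ncol(coded))) {
    miss <- is.na(coded[, j])
    # Skip markers with nothing to impute or with no observed data at all.
    if (!any(miss) || all(miss)) next
    p <- mean(coded[!miss, j]) / 2               # frequency of the counted allele
    probs <- c((1 - p)^2, 2 * p * (1 - p), p^2)  # Hardy-Weinberg proportions for 0, 1, 2
    coded[miss, j] <- sample(0:2, sum(miss), replace = TRUE, prob = probs)
  }
  coded
}

coded <- cbind(
  m1 = c(0, 1, NA, 2, 0, 1),
  m2 = c(0, 0, 1, NA, NA, 0)
)
imputed <- impute_random(coded)
sum(is.na(coded))    # number of values that needed imputing
sum(is.na(imputed))  # no missing values remain
```

Observed genotypes are left untouched; only the NA cells are replaced. This mirrors the imputation summary in the log, where the number of random imputations equals the total number of missing values.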
Check the converted result
gee <- cp.dat$geno
gee[1:5, 1:5]
  snp10001 snp10002 snp10005 snp10008 snp10009
1        0        0        0        0        0
2        0        1        1        0        1
3        0        0        0        0        0
4        1        0        0        0        0
5        0        1        0        0        1
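Beyond printing the first rows, a few quick checks can confirm that a coded matrix looks as expected. The sketch below retypes the 5 x 5 excerpt shown above under the hypothetical name `gee_head` so that it runs on its own; with the real data the same checks would be applied to cp.dat$geno.

```r
# The five rows and columns printed above, typed in by hand for illustration.
gee_head <- matrix(
  c(0, 0, 0, 0, 0,
    0, 1, 1, 0, 1,
    0, 0, 0, 0, 0,
    1, 0, 0, 0, 0,
    0, 1, 0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5),
                  c("snp10001", "snp10002", "snp10005", "snp10008", "snp10009"))
)

# After imputation there should be no missing values left.
stopifnot(!anyNA(gee_head))

# Only minor-allele counts 0, 1 and 2 are valid codes.
stopifnot(all(gee_head %in% c(0, 1, 2)))

# Genotype counts per marker; factor() keeps empty categories visible.
counts <- apply(gee_head, 2, function(x) table(factor(x, levels = 0:2)))
counts
```

Each column of `counts` gives the number of individuals coded 0, 1 and 2 for that marker, which makes it easy to spot a marker whose coding looks inverted or that has become monomorphic.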