How to convert raw SNP genotypes into a 0/1/2 matrix

Load the example data

library(SNPassoc)
data(SNPs)
SNPs[1:8,1:8]
id casco    sex blood.pre  protein snp10001 snp10002 snp10003
 1     1 Female      13.7 75640.52       TT       CC       GG
 2     1 Female      12.7 28688.22       TT       AC       GG
 3     1 Female      12.9 17279.59       TT       CC       GG
 4     1   Male      14.6 27253.99       CT       CC       GG
 5     1 Female      13.4 38066.57       TT       AC       GG
 6     1 Female      11.3  9872.46       TT       CC       GG
 7     1 Female      11.9 11132.90       TT       AC       GG
 8     1   Male      12.4 29973.43       TT       AC       GG

Extract the SNP columns and convert the format

The important point here is that the sample IDs go into the row names, so that the table itself contains nothing but SNP genotypes.

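# Drop the phenotype columns (2 to 5), keeping id and the SNP columns
myDat <- SNPs[, -(2:5)]
# Use the sample IDs as row names, then drop the id column
row.names(myDat) <- myDat$id
myDat <- myDat[, -1]
myDat[1:5, 1:5]
# str(myDat)
# Convert the data frame to a character matrix
myDat <- as.matrix(myDat)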
snp10001 snp10002 snp10003 snp10004 snp10005
      TT       CC       GG       GG       GG
      TT       AC       GG       GG       AG
      TT       CC       GG       GG       GG
      CT       CC       GG       GG       GG
      TT       AC       GG       GG       GG

Use the synbreed package for the conversion; it can impute missing values and recode the genotypes

Recoding alleles from character/factor/numeric into the number of copies of the minor alleles, i.e. 0, 1 and 2. In codeGeno, in the first step heterozygous genotypes are coded as 1. From the other genotypes, the less frequent genotype is coded as 2 and the remaining genotype as 0.
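In other words, genotypes are recoded according to their frequency: the more frequent homozygote becomes 0, the heterozygote becomes 1, and the less frequent homozygote becomes 2.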
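To make the recoding rule concrete, here is a minimal base R sketch for a single marker. It is not synbreed's implementation: `recode_snp` is a hypothetical helper written for illustration, it assumes two-letter genotype strings such as "TT" or "CT", and it breaks ties between equally frequent homozygotes alphabetically.

```r
# Hypothetical helper: recode one marker to 0/1/2 following the rule above.
recode_snp <- function(g) {
  g <- as.character(g)
  a1 <- substr(g, 1, 1)
  a2 <- substr(g, 2, 2)
  out <- rep(NA_integer_, length(g))

  # Step 1: heterozygous genotypes are coded as 1
  het <- !is.na(g) & a1 != a2
  out[het] <- 1L

  # Step 2: among homozygotes, the more frequent is 0 and the less frequent is 2
  hom <- !is.na(g) & a1 == a2
  tab <- table(g[hom])
  ord <- names(tab)[order(-as.integer(tab))]
  if (length(ord) > 2) stop("more than two homozygous genotypes")
  if (length(ord) >= 1) out[hom & g == ord[1]] <- 0L
  if (length(ord) == 2) out[hom & g == ord[2]] <- 2L
  out
}

recode_snp(c("TT", "CT", "TT", "CC", NA, "TT"))
# 0 1 0 2 NA 0
```

Missing genotypes stay `NA` here; in the workflow of this post they are filled in by the imputation step of `codeGeno`.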

library(synbreed)
cp <- create.gpData(geno = myDat)
cp.dat <- codeGeno(gpData = cp,label.heter = "alleleCoding", maf = 0.01, nmiss = 0.1,
                   impute = TRUE, impute.type = "random", verbose = TRUE)

   step 1  : 1 marker(s) removed with > 10 % missing values 
   step 2  : Recoding alleles 
   step 4  : 12 marker(s) removed with maf < 0.01 
   step 7  : Imputing of missing values 
   step 7d : Random imputing of missing values 
   step 8  : No recoding of alleles necessary after imputation 
   step 9  : 0 marker(s) removed with maf < 0.01 
   step 10 : No duplicated markers removed 
   End     : 22 marker(s) remain after the check

     Summary of imputation 
    total number of missing values                : 37 
    number of random imputations                  : 37 
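
The missing-value filter reported in step 1 (`nmiss = 0.1`) can be checked by hand. The following is a rough sketch on a made-up toy matrix; `toy`, `miss_rate` and `keep` are illustrative names, and the code only computes the per-marker missing rate rather than reproducing what `codeGeno` does internally.

```r
# Toy genotype matrix: 4 individuals (rows) x 3 markers (columns)
toy <- matrix(c("TT", "CT", NA,   "TT",
                "GG", NA,   NA,   "GG",
                "AA", "AG", "GG", "AA"),
              nrow = 4,
              dimnames = list(as.character(1:4), c("snpA", "snpB", "snpC")))

# Fraction of missing genotypes per marker
miss_rate <- colMeans(is.na(toy))
miss_rate

# Markers that would pass a 10 % missing-value threshold
keep <- names(miss_rate)[miss_rate <= 0.1]
keep
```

Running the same two lines on `myDat` should point to the marker that was dropped in step 1.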

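If you get an error saying that a marker has more than two genotypes, the cause is that missing values were not recognized as such. Write the data to a CSV file and read it back in, so that they are parsed as NA.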

write.csv(myDat,"snps.csv")
ge <- read.csv("snps.csv", header = TRUE, row.names = 1, na.strings = "NA")
summary(ge)
ge <- as.matrix(ge)
gp <- create.gpData(geno = ge)
cp.dat <- codeGeno(gpData = gp,label.heter = "alleleCoding", maf = 0.01, nmiss = 0.1,
                   impute = TRUE, impute.type = "random", verbose = TRUE)

 snp10001 snp10002 snp10003   snp10004   snp10005 snp10006 snp10007 snp10008
 CC:12    AA: 5    GG  :144   GG  :156   AA: 3    AA:157   CC:157   CC:104  
 CT:53    AC:78    NA's: 13   NA's:  1   AG:70                      CG: 44  
 TT:92    CC:74                          GG:84                      GG:  9  

 snp10009  snp100010  snp100011 snp100012 snp100013  snp100014 snp100015
 AA  :72   TT  :147   CC:  1    CC  : 3   AA  :101   AA  :27   AG: 13   
 AG  :79   NA's: 10   CG:  2    CG  :68   AG  : 35   AC  :74   GG:144   
 GG  : 5              GG:154    GG  :84   GG  :  9   CC  :52            
 NA's: 1                        NA's: 2   NA's: 12   NA's: 4            
 snp100016  snp100017 snp100018 snp100019 snp100020 snp100021 snp100022 
 GG  :152   CC  : 5   CC  : 5   CC:32     AA:  9    GG:157    AA  :156  
 NA's:  5   CT  :83   CT  :84   CG:75     AG: 43              NA's:  1  
            TT  :67   TT  :67   GG:50     GG:105                        
            NA's: 2   NA's: 1                                           
 snp100023 snp100024 snp100025 snp100026  snp100027 snp100028 snp100029
 AA  : 5   CC  :14   CC:157    GG  :156   CC  :68   CC  :34   AA  :14  
 AT  :78   CT  :51             NA's:  1   CG  :82   CT  :72   AG  :48  
 TT  :71   TT  :91                        GG  : 5   TT  :50   GG  :94  
 NA's: 3   NA's: 1                        NA's: 2   NA's: 1   NA's: 1  
 snp100030 snp100031  snp100032 snp100033 snp100034 snp100035 
 AA:157    TT  :102   AA  :34   AA  :34   CC  :14   TT  :146  
           NA's: 55   AG  :70   AG  :69   CT  :48   NA's: 11  
                      GG  :52   GG  :49   TT  :94             
                      NA's: 1   NA's: 5   NA's: 1             


   step 1  : 1 marker(s) removed with > 10 % missing values 
   step 2  : Recoding alleles 
   step 4  : 12 marker(s) removed with maf < 0.01 
   step 7  : Imputing of missing values 
   step 7d : Random imputing of missing values 
   step 8  : No recoding of alleles necessary after imputation 
   step 9  : 0 marker(s) removed with maf < 0.01 
   step 10 : No duplicated markers removed 
   End     : 22 marker(s) remain after the check

     Summary of imputation 
    total number of missing values                : 37 
    number of random imputations                  : 37 
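
As an alternative to the CSV round trip, the offending markers can be located and fixed in memory. This is only a sketch under an assumption: that the error comes from missing values stored as ordinary strings (for example "NA" or an empty string) and therefore counted as an extra genotype. The toy matrix, the name `n_geno` and the list of missing codes are made up for illustration.

```r
# Toy matrix in which missing values are stored as plain strings
toy <- matrix(c("TT", "CT", "CC", "NA",
                "GG", "GG", "",   "GG"),
              nrow = 4,
              dimnames = list(as.character(1:4), c("snpA", "snpB")))

# Number of distinct non-missing genotypes per marker
n_geno <- apply(toy, 2, function(x) length(unique(x[!is.na(x)])))
n_geno   # snpA shows 4 "genotypes" because the string "NA" is counted

# Turn the assumed missing codes into real NA values and count again
toy[toy %in% c("NA", "")] <- NA
n_geno_fixed <- apply(toy, 2, function(x) length(unique(x[!is.na(x)])))
n_geno_fixed
```

Any marker that still shows more than three distinct genotypes after this step needs a closer look before it is passed to `codeGeno`.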

Inspect the converted result

gee <- cp.dat$geno
gee[1:5,1:5]
  snp10001 snp10002 snp10005 snp10008 snp10009
1        0        0        0        0        0
2        0        1        1        0        1
3        0        0        0        0        0
4        1        0        0        0        0
5        0        1        0        0        1
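
A quick sanity check on a coded matrix is to confirm that it holds only 0, 1 and 2, and to compute the minor allele frequency as half the column mean. The sketch below uses a small hand-typed matrix holding the first two columns of the five rows shown above; `coded` and `maf` are illustrative names, and in practice `coded` would be `cp.dat$geno`.

```r
# First two markers of the five rows printed above
coded <- matrix(c(0, 0, 0, 1, 0,
                  0, 1, 0, 0, 1),
                nrow = 5,
                dimnames = list(as.character(1:5), c("snp10001", "snp10002")))

# Every entry should be a minor-allele count of 0, 1 or 2
stopifnot(all(coded %in% 0:2))

# Each individual carries two alleles, so MAF = mean count / 2
maf <- colMeans(coded) / 2
maf
```

These frequencies describe only the five rows used here, not the full data set.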