【生信进阶练习1000days】day1-Bioconductor的一些补充小用法与Working with Genomic Ranges

学习章节

https://bioconductor.github.io/BiocWorkshops/r-and-bioconductor-for-everyone-an-introduction.html#introduction-to-bioconductor

1. Bioconductor的一些补充小用法

使用valid()查看包安装的版本情况

> BiocManager::valid()

* sessionInfo()

R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] BiocManager_1.30.4 compiler_3.5.3     tools_3.5.3        yaml_2.2.0        

Bioconductor version '3.8'

  * 45 packages out-of-date
  * 0 packages too new

create a valid installation with

  BiocManager::install(c(
    "AnnotationFilter", "backports", "biomaRt", "Biostrings", "broom", "checkmate", "devtools",
    "digest", "dplyr", "DT", "ensembldb", "fs", "GenomicAlignments", "GenomicFeatures",
    "ggplot2", "ggrepel", "gridGraphics", "httpuv", "igraph", "knitr", "pillar", "processx",
    "progress", "ProtGenerics", "rcmdcheck", "RcppArmadillo", "remotes", "reprex", "rGREAT",
    "rlang", "robustbase", "rtracklayer", "rvest", "segmented", "shiny", "shinyFiles", "sys",
    "testthat", "tinytex", "usethis", "vegan", "WGCNA", "xfun", "xtable", "zip"
  ), update = TRUE, ask = FALSE)

more details: BiocManager::valid()$too_new, BiocManager::valid()$out_of_date

Warning message:
45 packages out-of-date; 0 packages too new 

avaliable() 用于搜索相关的包

## 例如这里输入TxDb.Hsapiens,它会自动匹配相关的包
BiocManager::available("TxDb.Hsapiens")
#> [1] "TxDb.Hsapiens.BioMart.igis"                 
#> [2] "TxDb.Hsapiens.UCSC.hg18.knownGene"          
#> [3] "TxDb.Hsapiens.UCSC.hg19.knownGene"          
#> [4] "TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts"
#> [5] "TxDb.Hsapiens.UCSC.hg38.knownGene"

通过网站https://support.bioconductor.org/ 来查询解决和包相关的问题

2. Working with Genomic Ranges

  • Importing data
  • Working with genomic ranges
  • Genomic annotations
  • Overlaps

rtracklayer包提供函数import() 帮助用户读取基因组格式的文件(例如:BED,GTF,VCF,FASTA) 并封装成Bioconductor下的对象。GenomicRanges包提供了多种函数,来在基因组坐标系下操纵各种数据。

2.1 importing data 导入bed文件
library(rtracklayer)
## 使用file.choose() 来选择文件
fname <- file.choose()   # CpGislands.Hsapiens.hg38.UCSC.bed
fname
file.exists(fname)
## 使用import()函数导入bed文件。导入后,将以GenomicRanges的对象描述这份CpG岛数据。
cpg <- import(fname)
cpg

注意 BED格式的文件,它们的坐标体系是0-based的,并且intervals是半开区间(start在范围内,end在范围后一个坐标里)
而Bioconductor得坐标是1-based,并且是闭区间(start和end坐标都包含在范围内),因此使用import()函数导入数据得时候,它会自动将bed文件坐标转换为Bioconductor文件对象的坐标。

2.2 Working with genomic ranges
## 使用函数 keepStandardChromosomes 保留标准染色体,标准染色体指的是chr1-22和x,y染色体
## 很多时候我们获取的数据里 可能染色体不止chr1-22与x,y ; 
## 可能还包含其它染色体,例如:chr22_KI270738v1_random这样的染色体。
## 通过keepStandardChromosomes 就可以去除这些其它的染色体,只保留标准染色体
cpg <- keepStandardChromosomes(cpg, pruning.mode = "coarse")
cpg

GenomicRanges对象包含两个部分

  • 必需的:seqnames(染色体号),start(起始位点),end(终止位点),strand (正链或负链)
  • 非必需的:另外的元素,例如本例中的name。

必需的元素内容,可以使用函数start(), end(), width()获取。非必需的元素,使用$符号获取内容。

head( start(cpg) )
#> [1] 155188537   2226774  36306230  47708823  53737730 144179072
head( cpg$name )

## 使用subset()函数获取染色体1号和2号上的cpg岛
subset(cpg, seqnames %in% c("chr1", "chr2"))
2.3 Genomic annotations
## 导入注释数据
library("TxDb.Hsapiens.UCSC.hg38.knownGene")
## 查看注释数据中的所有转录本
tx <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
tx
## 为了演示方便,目前只保留标准染色体
tx <- keepStandardChromosomes(tx, pruning.mode="coarse")
tx
2.4 Overlaps

可以使用findOverlaps(),nearest(),precede()follow()函数,来完成overlaps

## count the number of CpG islands that overlap each transcript
olaps <- countOverlaps(tx, cpg)
length(olaps)     # 1 count per transcript
#> [1] 182435
table(olaps)
## 将计算的overlaps加入GR对象 
tx$cpgOverlaps <- countOverlaps(tx, cpg)
tx
## subsetting the GRanges objects for transcripts satisfying particular conditions, in a coordinated fashion ## where the software does all the book-keeping to makes sure the correct ranges are selected.
subset(tx, cpgOverlaps > 10)

  • 1
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值