Chapter 9 Working with Range Data
A Crash Course in Genomic Ranges and Coordinate Systems
CrossMap is a command-line tool that converts many data formats (BED, GFF/GTF, SAM/BAM, Wiggle, VCF) between the coordinate systems of different assembly versions.
NCBI Genome Remapping Service is a web-based tool supporting a variety of genomes and formats.
LiftOver is also a web-based tool, hosted on the UCSC Genome Browser’s site, for converting coordinates between genome assemblies.
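• 0-based coordinate system, with half-closed, half-open intervals, written [start, end): the start position is included and the end position is excluded.
• 1-based coordinate system, with closed intervals, written [start, end]: both the start and end positions are included.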
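A minimal sketch of converting between the two systems, using base R only. The helper names (zero_to_one_based, one_to_zero_based, width_half_open, width_closed) are hypothetical and exist only for this illustration. BED is used as the example of a 0-based, half-open format.

```r
# Hypothetical helpers, written for these notes (not part of any package).

# 0-based, half-open [start, end)  ->  1-based, closed [start, end].
# Only the start shifts: the excluded 0-based end is numerically equal
# to the included 1-based end.
zero_to_one_based <- function(start0, end0) {
  list(start = start0 + 1L, end = end0)
}

# 1-based, closed  ->  0-based, half-open (the inverse of the above).
one_to_zero_based <- function(start1, end1) {
  list(start = start1 - 1L, end = end1)
}

# Width formulas differ between the two systems.
width_half_open <- function(start0, end0) end0 - start0
width_closed    <- function(start1, end1) end1 - start1 + 1L

# A BED-style interval covering ten bases.
bed <- list(start = 3L, end = 13L)
one <- zero_to_one_based(bed$start, bed$end)
back <- one_to_zero_based(one$start, one$end)

print(one)                                  # start 4, end 13
print(width_half_open(bed$start, bed$end))  # 10
print(width_closed(one$start, one$end))     # 10
```

The width is the same in both systems, which is a handy sanity check after any conversion.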
An Interactive Introduction to Range Data with GenomicRanges
Installing and Working with Bioconductor Packages
Bioconductor is an open source software project that creates R bioinformatics packages and serves as a repository for them.
GenomicRanges
Used to represent and work with genomic ranges
GenomicFeatures
Used to represent and work with ranges that represent gene models and other features of a genome (genes, exons, UTRs, transcripts, etc.)
Biostrings and BSgenome
Used for manipulating genomic sequence data in R (we’ll cover the subset of these packages used for extracting sequences from ranges)
rtracklayer
Used for reading in common bioinformatics formats like BED, GTF/GFF, and WIG
1. Installing Bioconductor packages with biocLite()
Install Bioconductor’s primary packages (be sure your R version is up to date first):
> source("http://bioconductor.org/biocLite.R")
> biocLite()
2. Install the GenomicRanges package
> biocLite("GenomicRanges")
In later sessions, load the BiocInstaller package with library(BiocInstaller) before calling biocLite(). biocLite() will notify you when some of your packages are out of date and need to be upgraded (which it can do automatically for you).
If you run into an unexpected error with a Bioconductor package, it’s a good idea to run biocUpdatePackages() and biocValid() before debugging.
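The biocLite() workflow above follows the book. In Bioconductor 3.8 and later, the BiocManager package replaces BiocInstaller and biocLite(). A rough equivalent of the steps above, assuming a recent R version and a working network connection, might look like this:

```r
# Assumption: Bioconductor >= 3.8, where BiocManager replaces biocLite().
# Install BiocManager from CRAN if it is not already available.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install the GenomicRanges package (replaces biocLite("GenomicRanges")).
BiocManager::install("GenomicRanges")

# Check that installed packages match the current Bioconductor release
# (replaces biocValid()).
BiocManager::valid()
```

Calling BiocManager::install() with no arguments offers to update out-of-date packages, which covers the role of biocUpdatePackages().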
For more detail, see the GenomicRanges reference manual and vignettes.
Storing Generic Ranges with IRanges
> rng <- IRanges(start=4, end=13)
> rng
IRanges of length 1
start end width
[1] 4 13 10
The most important fact to note: IRanges (and GenomicRanges) is 1-based and uses closed intervals. The 1-based system was adopted to be consistent with R’s 1-based indexing (recall that the first element in an R vector has index 1). A range can also be created from a start or an end together with a width:
> IRanges(start=4, width=3)
IRanges of length 1
start end width
[1] 4 6 3
> IRanges(end=5, width=5)
IRanges of length 1
start end width
[1] 1 5 5
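Because the intervals are 1-based and closed, the three values are tied together by width = end - start + 1, so any two of them determine the third. A short sketch that checks this with the start(), end(), and width() accessors (the variable names are made up for these notes, and the IRanges package is assumed to be installed):

```r
library(IRanges)

# Built from start + width: end is derived as start + width - 1.
from_start <- IRanges(start = 4, width = 3)

# Built from end + width: start is derived as end - width + 1.
from_end <- IRanges(end = 5, width = 5)

# Built from start + end: width is derived as end - start + 1.
from_both <- IRanges(start = 4, end = 13)

print(end(from_start))   # 6
print(start(from_end))   # 1
print(width(from_both))  # 10
```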
An IRanges object containing many ranges:
> x <- IRanges(start=c(4, 7, 2, 20), end=c(13, 7, 5, 23))
> x
IRanges of length 4
start end width
[1] 4 13 10
[2] 7 7 1
[3] 2 5 4
[4] 20 23 4
Each range can be given a name:
> names(x) <- letters[1:4]
> x
IRanges of length 4
start end width names
[1] 4 13 10 a
[2] 7 7 1 b
[3] 2 5 4 c
[4] 20 23 4 d
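Once ranges have names, an IRanges object can be subset much like an ordinary R vector: by name, by position, or by a logical condition. A brief sketch that rebuilds x from above (first_two and wide are names chosen for this illustration):

```r
library(IRanges)

# Rebuild the named ranges from the example above.
x <- IRanges(start = c(4, 7, 2, 20), end = c(13, 7, 5, 23))
names(x) <- letters[1:4]

# Subset by name.
first_two <- x[c("a", "b")]

# Subset by a logical vector: keep ranges wider than 3 positions.
wide <- x[width(x) > 3]

print(first_two)
print(names(wide))  # "a" "c" "d"
```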
The rest of Chapter 9 relies heavily on R, so I am skipping it for now.