GPos-class {GenomicRanges} | R Documentation |
The GPos class is a container for storing a set of genomic positions where most of the positions are typically (but not necessarily) adjacent. Because genomic positions can be seen as genomic ranges of width 1, the GPos class extends the GenomicRanges virtual class. Note that even though a GRanges instance can be used for storing genomic positions, using a GPos object will be much more memory-efficient, especially when the object contains long runs of adjacent positions in ascending order.
GPos(pos_runs) # constructor function
pos_runs |
A GRanges object (or any other GenomicRanges derivative)
where each range is interpreted as a run of adjacent ascending genomic
positions on the same strand.
If |
A GPos object.
GPos objects support the same set of getters as other GenomicRanges
derivatives (i.e. seqnames()
, ranges()
, start()
,
end()
, strand()
, mcols()
, seqinfo()
,
etc...), plus the pos()
getter which is equivalent to
start()
or end()
. See ?GenomicRanges
for the
list of getters supported by GenomicRanges derivatives.
IMPORTANT NOTES:
ranges()
returns an IPos object instead
of the IRanges object that one gets with other
GenomicRanges derivatives. To get an IRanges
object, you need to call ranges()
again on the
IPos object i.e. do ranges(ranges(x))
)
on GPos object x
.
Note that a GPos object cannot hold names i.e. names()
always returns NULL
on it.
Like GRanges objects, GPos objects support the following setters:
The seqnames()
and strand()
setters.
The mcols()
and metadata()
setters.
The family of setters that operate on the seqinfo component of
an object:
seqlevels()
,
seqlevelsStyle()
,
seqlengths()
,
isCircular()
,
genome()
,
and seqinfo()
.
These setters are defined and documented in the GenomeInfoDb
package.
However, there is no pos()
setter for GPos objects at the moment
(although one might be added in the future).
From GenomicRanges to GPos:
A GenomicRanges derivative x
in which all the ranges have a
width of 1 can be coerced to a GPos object with as(x, "GPos")
.
The names on x
are not propagated (a warning is issued if x
has names on it).
From GPos to GRanges:
A GPos object x
can be coerced to a GRanges object
with as(x, "GRanges")
. However be aware that the resulting object
can use thousands times (or more) memory than x
!
See "MEMORY USAGE" in the Examples section below.
From GPos to ordinary R objects:
Like with any other GenomicRanges derivative, as.character()
,
as.factor()
, and as.data.frame()
work on a GPos object
x
. Note however that as.data.frame(x)
returns a data frame
with a pos
column (containing pos(x)
) instead of the
start
, end
, and width
columns that one gets with other
GenomicRanges derivatives.
A GPos object can be subsetted exactly like a GRanges object.
GPos objects can be combined (a.k.a. appended) with c()
or
append()
.
Like with any other GRanges object, split()
and relist()
work on a GPos object.
Like for any Vector derivative, the length of a
GPos object cannot exceed .Machine$integer.max
(i.e. 2^31 on
most platforms). GPos()
will return an error if pos_runs
contains too many genomic positions.
Internal representation of GPos objects has changed in GenomicRanges
1.29.10 (Bioc 3.6). Update any old object x
with:
x <- updateObject(x, verbose=TRUE)
.
Hervé Pagès; based on ideas borrowed from Georg Stricker georg.stricker@in.tum.de and Julien Gagneur gagneur@in.tum.de
The IPos class in the IRanges package for a memory-efficient representation of integer positions (i.e. integer ranges of width 1).
GenomicRanges and GRanges objects.
The seqinfo
accessor and family in
the GenomeInfoDb package for accessing/modifying the seqinfo
component of an object.
GenomicRanges-comparison for comparing and ordering genomic positions.
findOverlaps-methods for finding overlapping genomic ranges and/or positions.
nearest-methods for finding the nearest genomic range/position neighbor.
The snpsBySeqname
,
snpsByOverlaps
, and
snpsById
methods for
SNPlocs objects defined in the BSgenome
package for extractors that return a GPos object.
SummarizedExperiment objects in the SummarizedExperiment package.
## --------------------------------------------------------------------- ## BASIC EXAMPLES ## --------------------------------------------------------------------- ## Example 1: gpos1 <- GPos(c("chr1:44-53", "chr1:5-10", "chr2:2-5")) gpos1 length(gpos1) seqnames(gpos1) pos(gpos1) # same as 'start(gpos1)' and 'end(gpos1)' strand(gpos1) as.character(gpos1) as.data.frame(gpos1) as(gpos1, "GRanges") as.data.frame(as(gpos1, "GRanges")) gpos1[9:17] ## Example 2: pos_runs <- GRanges("chrI", IRanges(c(1, 6, 12, 17), c(5, 10, 16, 20)), strand=c("+", "-", "-", "+")) gpos2 <- GPos(pos_runs) gpos2 strand(gpos2) ## Example 3: gpos3A <- gpos3B <- GPos(c("chrI:1-1000", "chrI:1005-2000")) npos <- length(gpos3A) mcols(gpos3A)$sample <- Rle("sA") sA_counts <- sample(10, npos, replace=TRUE) mcols(gpos3A)$counts <- sA_counts mcols(gpos3B)$sample <- Rle("sB") sB_counts <- sample(10, npos, replace=TRUE) mcols(gpos3B)$counts <- sB_counts gpos3 <- c(gpos3A, gpos3B) gpos3 ## Example 4: library(BSgenome.Scerevisiae.UCSC.sacCer2) genome <- BSgenome.Scerevisiae.UCSC.sacCer2 gpos4 <- GPos(seqinfo(genome)) gpos4 # all the positions along the genome are represented mcols(gpos4)$dna <- do.call("c", unname(as.list(genome))) gpos4 ## Note however that, like for any Vector derivative, the length of a ## GPos object cannot exceed '.Machine$integer.max' (i.e. 2^31 on most ## platforms) so the above only works with a "small" genome. ## For example it doesn't work with the Human genome: library(TxDb.Hsapiens.UCSC.hg19.knownGene) ## Not run: GPos(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene)) # error! ## End(Not run) ## You can use isSmallGenome() to check upfront whether the genome is ## "small" or not. isSmallGenome(genome) isSmallGenome(TxDb.Hsapiens.UCSC.hg19.knownGene) ## --------------------------------------------------------------------- ## MEMORY USAGE ## --------------------------------------------------------------------- ## Coercion to GRanges works... gr4 <- as(gpos4, "GRanges") gr4 ## ... but is generally not a good idea: object.size(gpos4) object.size(gr4) # 8 times bigger than the GPos object! ## Shuffling the order of the positions impacts memory usage: gpos4r <- rev(gpos4) object.size(gpos4r) # significantly gpos4s <- sample(gpos4) object.size(gpos4s) # even worse! ## AN IMPORTANT NOTE: In the worst situations, GPos still performs as ## good as a GRanges object. object.size(as(gpos4r, "GRanges")) # same size as 'gpos4r' object.size(as(gpos4s, "GRanges")) # same size as 'gpos4s' ## Best case scenario is when the object is strictly sorted (i.e. ## positions are in strict ascending order). ## This can be checked with: is.unsorted(gpos4, strict=TRUE) # 'gpos4' is strictly sorted ## --------------------------------------------------------------------- ## USING MEMORY-EFFICIENT METADATA COLUMNS ## --------------------------------------------------------------------- ## In order to keep memory usage as low as possible, it is recommended ## to use a memory-efficient representation of the metadata columns that ## we want to set on the object. Rle's are particularly well suited for ## this, especially if the metadata columns contain long runs of ## identical values. This is the case for example if we want to use a ## GPos object to represent the coverage of sequencing reads along a ## genome. ## Example 5: library(pasillaBamSubset) library(Rsamtools) # for the BamFile() constructor function bamfile1 <- BamFile(untreated1_chr4()) bamfile2 <- BamFile(untreated3_chr4()) gpos5 <- GPos(seqinfo(bamfile1)) library(GenomicAlignments) # for "coverage" method for BamFile objects cov1 <- unlist(coverage(bamfile1), use.names=FALSE) cov2 <- unlist(coverage(bamfile2), use.names=FALSE) mcols(gpos5) <- DataFrame(cov1, cov2) gpos5 object.size(gpos5) # lightweight ## Keep only the positions where coverage is at least 10 in one of the ## 2 samples: gpos5[mcols(gpos5)$cov1 >= 10 | mcols(gpos5)$cov2 >= 10] ## --------------------------------------------------------------------- ## USING A GPos OBJECT IN A SummarizedExperiment OBJECT ## --------------------------------------------------------------------- ## Because the GPos class extends the GenomicRanges virtual class, a GPos ## object can be used as the rowRanges component of a SummarizedExperiment ## object. ## As a 1st example, we show how the counts for samples sA and sB in ## 'gpos3' can be stored in a SummarizedExperiment object where the rows ## correspond to unique genomic positions and the columns to samples: library(SummarizedExperiment) counts <- cbind(sA=sA_counts, sB=sB_counts) mcols(gpos3A) <- NULL rse3 <- SummarizedExperiment(list(counts=counts), rowRanges=gpos3A) rse3 rowRanges(rse3) head(assay(rse3)) ## Finally we show how the coverage data from Example 5 can be easily ## stored in a lightweight SummarizedExperiment object: cov <- mcols(gpos5) mcols(gpos5) <- NULL rse5 <- SummarizedExperiment(list(cov=cov), rowRanges=gpos5) rse5 rowRanges(rse5) assay(rse5) ## Keep only the positions where coverage is at least 10 in one of the ## 2 samples: rse5[assay(rse5)$cov1 >= 10 | assay(rse5)$cov2 >= 10]