%\VignetteIndexEntry{An Introduction to IRanges} %\VignetteDepends{} %\VignetteKeywords{Ranges} %\VignettePackage{IRanges} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\IRanges}{\Rpackage{IRanges}} \title{An Introduction to \IRanges{}} \author{Patrick Aboyoun, Michael Lawrence, Herv\'e Pag\`es} \date{\today} \begin{document} \maketitle <>= options(width=72) @ \section{Introduction} The \IRanges{} package is designed to represent sequences, ranges representing indices along those sequences, and data related to those ranges. In this vignette, we will rely on simple, illustrative example datasets, rather than large, real-world data, so that each data structure and algorithm can be explained in an intuitive, graphical manner. We expect that packages that apply \IRanges{} to a particular problem domain will provide vignettes with relevant, realistic examples. The \IRanges{} package is available at bioconductor.org and can be downloaded via \Rfunction{biocLite}: <>= source("http://bioconductor.org/biocLite.R") biocLite("IRanges") @ <>= library(IRanges) @ \section{Vector objects} In the context of the \IRanges{} package, a sequence is an ordered finite collection of elements. The \IRanges{} packages represents two types of objects as sequences: (1) atomic sequences and (2) lists (or non-atomic sequences). The following subsections describe each in turn. All \IRanges{}-derived sequences inherit from the \Rclass{Vector} virtual class. \subsection{Atomic Vectors} In \R{}, atomic sequences are typically stored in atomic vectors. The \IRanges{} package includes an additional atomic sequence object type, \Rclass{Rle}, which compresses an atomic sequence through run-length encoding. We begin our discussion of atomic sequences using two \Rclass{Rle} vectors. <>= set.seed(0) lambda <- c(rep(0.001, 4500), seq(0.001, 10, length = 500), seq(10, 0.001, length = 500)) xVector <- Rle(rpois(1e7, lambda)) yVector <- Rle(rpois(1e7, lambda[c(251:length(lambda), 1:250)])) @ All atomic sequences in \R{} have three main properties: (1) a notion of length or number of elements, (2) the ability to extract elements to create new atomic sequences, and (3) the ability to be combined with one or more atomic sequences to form larger atomic sequences. The main functions for these three operations are \Rfunction{length}, \Rfunction{[}, and \Rfunction{c}. <>= length(xVector) xVector[1] zVector <- c(xVector, yVector) @ While these three methods may seem trivial, they provide a great deal of power and many atomic sequence manipulations can be constructed using them. \subsubsection{Vector Subsetting} As with ordinary \R{} atomic vectors, it is often necessary to subset one sequence from another. When this subsetting does not duplicate or reorder the elements being extracted, the result is called a \textit{subsequence}. In general, the \Rfunction{[} function can be used to construct a new sequence or extract a subsequence, but its interface is often inconvenient and not amenable to optimization. To compensate for this, the \IRanges{} package supports seven additional functions for sequence extraction: \begin{enumerate} \item \Rfunction{window} - Extracts a subsequence over a specified region. \item \Rfunction{subset} - Extracts the subsequence specified by a logical vector. \item \Rfunction{head} - Extracts a consecutive subsequence containing the first n elements. \item \Rfunction{tail} - Extracts a consecutive subsequence containing the last n elements. \item \Rfunction{rev} - Creates a new sequence with the elements in the reverse order. \item \Rfunction{rep} - Creates a new sequence by repeating sequence elements. \end{enumerate} The following code illustrates how these functions are used on an ordinary \R{} \Rclass{integer} vector: <>= xSnippet <- xVector[IRanges(4751, 4760)] xSnippet head(xSnippet) tail(xSnippet) rev(xSnippet) rep(xSnippet, 2) subset(xSnippet, xSnippet >= 5L) @ \subsubsection{Combining Vectors} The \IRanges{} package uses two generic functions, \Rfunction{c} and \Rfunction{append}, for combining two \Rclass{Vector} objects. The methods for \Rclass{Vector} objects follow the definition that these two functions are given the the \Rpackage{base} package. <>= c(xSnippet, rev(xSnippet)) append(xSnippet, xSnippet, after = 3) @ \subsubsection{Looping over Vectors and Vector subsets} In \R{}, \Rfunction{for} looping can be an expensive operation. To compensate for this, \IRanges{} uses three generics, \Rfunction{endoapply}, \Rfunction{lapply}, and \Rfunction{sapply}, for looping over sequences and two generics, \Rfunction{aggregate} and \Rfunction{shiftApply}, to perform calculations over subsequences. The \Rfunction{lapply} and \Rfunction{sapply} functions are familiar to many \R{} users since they are the standard functions for looping over the elements of an \R{} \Rclass{list} object. The \Rfunction{endoapply} function performs an endomorphism equivalent to \Rfunction{lapply}, i.e. returns a \Rclass{Vector} object of the same class as the input rather than a \Rclass{list} object. More will be given on these three functions in the Lists subsection. The \Rfunction{aggregate} function combines sequence extraction functionality of the \Rfunction{window} function with looping capabilities of the \Rfunction{sapply} function. For example, here is some code to compute medians across a moving window of width 3 using the function \Rfunction{aggregate}: <>= xSnippet aggregate(xSnippet, start = 1:8, width = 3, FUN = median) @ The \Rfunction{shiftApply} function is a looping operation involving two sequences whose elements are lined up via a positional shift operation. For example, the elements of \Robject{xVector} and \Robject{yVector} were simulated from Poisson distributions with the mean of element i from \Robject{yVector} being equivalent to the mean of element i + 250 from \Robject{xVector}. If we did not know the size of the shift, we could estimate it by finding the shift that maximizes the correlation between \Robject{xVector} and \Robject{yVector}. <>= cor(xVector, yVector) shifts <- seq(235, 265, by=3) corrs <- shiftApply(shifts, yVector, xVector, FUN = cor) @ % <>= plot(shifts, corrs) @ The result is shown in Fig.~\ref{figshiftcorrs}. \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-figshiftcorrs} \caption{\label{figshiftcorrs}% Correlation between \Robject{xVector} and \Robject{yVector} for various shifts.} \end{center} \end{figure} \subsubsection{Run Length Encoding} Up until this point we have used \R{} atomic vectors to represent atomic sequences, but there are times when these object become too large to manage in memory. When there are lots of consecutive repeats in the sequence, the data can be compressed and managed in memory through a run-length encoding where a data value is paired with a run length. For example, the sequence \{1, 1, 1, 2, 3, 3\} can be represented as values = \{1, 2, 3\}, run lengths = \{3, 1, 2\}. The \Rclass{Rle} class in \IRanges{} is used to represent a run-length encoded (compressed) sequence of \Rclass{logical}, \Rclass{integer}, \Rclass{numeric}, \Rclass{complex}, \Rclass{character}, or \Rclass{raw} values. One way to construct an \Rclass{Rle} object is through the \Rclass{Rle} constructor function: <>= xRle <- Rle(xVector) yRle <- Rle(yVector) xRle yRle @ When there are lots of consecutive repeats, the memory savings through an RLE can be quite dramatic. For example, the \Robject{xRle} object occupies less than one quarter of the space of the original \Robject{xVector} object, while storing the same information: <>= as.vector(object.size(xRle) / object.size(xVector)) identical(as.vector(xRle), xVector) @ The functions \Rfunction{runValue} and \Rfunction{runLength} extract the run values and run lengths from an \Rclass{Rle} object respectively: <>= head(runValue(xRle)) head(runLength(xRle)) @ The \Rclass{Rle} class supports many of the basic methods associated with \R{} atomic vectors including the Ops, Math, Math2, Summary, and Complex group generics. Here is a example of manipulating \Rclass{Rle} objects using methods from the Ops group: <>= xRle > 0 xRle + yRle xRle > 0 | yRle > 0 @ Here are some from the Summary group: <>= range(xRle) sum(xRle > 0 | yRle > 0) @ And here is one from the Math group: <>= log1p(xRle) @ As with the atomic vectors, the \Rfunction{cor} and \Rfunction{shiftApply} functions operate on \Rclass{Rle} objects: <>= cor(xRle, yRle) shiftApply(249:251, yRle, xRle, FUN = function(x, y) var(x, y) / (sd(x) * sd(y))) @ For more information on the methods supported by the \Rclass{Rle} class, consult the \Rcode{Rle} man page. \subsection{Lists} In many data analysis situation there is a desire to organize and manipulate multiple objects simultaneously. Typically this is done in \R{} through the usage of a list. While a list serves as a generic container, it does not confer any information about the specific class of its elements, provides no infrastructure to ensure type safety, and the S3 and S4 method dispatch mechanisms do not support method selection for lists with homogeneous object types. The \Rclass{List} virtual class defined in the \IRanges{} package addresses these issues. \Rclass{List} is a direct extension of \Rclass{Vector}. \subsubsection{Lists of Atomic Vectors} The first type of lists we consider are those containing atomic sequences such as \Rclass{integer} vectors or \Rclass{Rle} objects. We may wish to define a method that retrieves the length of each atomic sequence element, without special type checking. To enable this, we define collection classes such as \Rclass{IntegerList} and \Rclass{RleList}, which inherit from the \Rclass{List} virtual class, for representing lists of \Rclass{integer} vectors and \Rclass{Rle} objects respectively. <>= getClassDef("RleList") @ As the class definition above shows, the \Rclass{RleList} class is virtual with subclasses \Rclass{SimpleRleList} and \Rclass{CompressedRleList}. A \Rclass{SimpleRleList} class uses a regular \R{} list to store the underlying elements and the \Rclass{CompressedRleList} class stores the elements in an unlisted form and keeps track of where the element breaks are. The former ``simple list" class is useful when the Rle elements are long and the latter ``compressed list" class is useful when the list is long and/or sparse (i.e. a number of the list elements have length 0). In fact, all of the atomic vector types (raw, logical, integer, numeric, complex, and character) have similar list classes that derive from the \Rclass{List} virtual class. For example, there is an \Rclass{IntegerList} virtual class with subclasses \Rclass{SimpleIntegerList} and \Rclass{CompressedIntegerList}. Each of the list classes for atomic sequences, be they stored as vectors or \Rclass{Rle} objects, have a constructor function with a name of the appropriate list virtual class, such as \Rclass{IntegerList}, and an optional argument \Rfunarg{compress} that takes an argument to specify whether or not to create the simple list object type or the compressed list object type. The default is to create the compressed list object type. <>= args(IntegerList) cIntList1 <- IntegerList(x = xVector, y = yVector) cIntList1 sIntList2 <- IntegerList(x = xVector, y = yVector, compress = FALSE) sIntList2 ## sparse integer list xExploded <- lapply(xVector[1:5000], function(x) seq_len(x)) cIntList2 <- IntegerList(xExploded) sIntList2 <- IntegerList(xExploded, compress = FALSE) object.size(cIntList2) object.size(sIntList2) @ The \Rfunction{length} function returns the number of elements in a \Rclass{Vector}-derived object and, for a \Rclass{List}-derived object like ``simple list" or ``compressed list", the \Rfunction{elementNROWS} function returns an integer vector containing the lengths of each of the elements: <>= length(cIntList2) Rle(elementNROWS(cIntList2)) @ Just as with ordinary \R{} \Rclass{list} objects, \Rclass{List}-derived object support the \Rfunction{[[} for element extraction, \Rfunction{c} for combining, and \Rfunction{lapply}/\Rfunction{sapply} for looping. When looping over sparse lists, the ``compressed list" classes can be much faster during computations since only the non-empty elements are looped over during the \Rfunction{lapply}/\Rfunction{sapply} computation and all the empty elements are assigned the appropriate value based on their status. <>= system.time(sapply(xExploded, mean)) system.time(sapply(sIntList2, mean)) system.time(sapply(cIntList2, mean)) identical(sapply(xExploded, mean), sapply(sIntList2, mean)) identical(sapply(xExploded, mean), sapply(cIntList2, mean)) @ Unlist ordinary \R{} \Rclass{list} objects, \Rclass{AtomicList} objects support the \Rfunction{Ops} (e.g. \Rfunction{+}, \Rfunction{==}, \Rfunction{\&}), \Rfunction{Math} (e.g. \Rfunction{log}, \Rfunction{sqrt}), \Rfunction{Math2} (e.g. \Rfunction{round}, \Rfunction{signif}), \Rfunction{Summary} (e.g. \Rfunction{min}, \Rfunction{max}, \Rfunction{sum}), and \Rfunction{Complex} (e.g. \Rfunction{Re}, \Rfunction{Im}) group generics. <>= xRleList <- RleList(xRle, 2L * rev(xRle)) yRleList <- RleList(yRle, 2L * rev(yRle)) xRleList > 0 xRleList + yRleList sum(xRleList > 0 | yRleList > 0) @ Since these atomic lists inherit from \Rclass{List}, they can also use the looping function \Rfunction{endoapply} to perform endomorphisms. <>= safe.max <- function(x) { if(length(x)) max(x) else integer(0) } endoapply(sIntList2, safe.max) endoapply(cIntList2, safe.max) endoapply(sIntList2, safe.max)[[1]] @ \section{Data Tables} To Do: \Rclass{DataTable}, \Rclass{DataFrame}, \Rclass{DataFrameList}, \Rclass{SplitDataFrameList} \section{Vector Annotations} Often when one has a collection of objects, there is a need to attach metadata that describes the collection in some way. Two kinds of metadata can be attached to a \Rclass{Vector} object: \begin{enumerate} \item Metadata about the object as a whole: this metadata is accessed via the \Rfunction{metadata} accessor and is represented as an ordinary \Rclass{list}; \item Metadata about the individual elements of the object: this metadata is accessed via the \Rfunction{mcols} accessor (\Rfunction{mcols} stands for {\it metadata columns}) and is represented as a \Rclass{DataTable} object (i.e. as an instance of a concrete subclass of \Rclass{DataTable}, e.g. a \Rclass{DataFrame} object). This \Rclass{DataTable} object can be thought of as the result of binding together one or several vector-like objects (the metadata columns) of the same length as the \Rclass{Vector} object. Each row of the \Rclass{DataTable} object annotates the corresponding element of the \Rclass{Vector} object. \end{enumerate} \section{Vector Ranges} When analyzing sequences, we are often interested in particular consecutive subsequences. For example, the \Sexpr{letters} vector could be considered a sequence of lower-case letters, in alphabetical order. We would call the first five letters (\textit{a} to \textit{e}) a consecutive subsequence, while the subsequence containing only the vowels would not be consecutive. It is not uncommon for an analysis task to focus only on the geometry of the regions, while ignoring the underlying sequence values. A list of indices would be a simple way to select a subsequence. However, a sparser representation for consecutive subsequences would be a range, a pairing of a start position and a width, as used when extracting sequences with \Rfunction{window}. When analyzing subsequences in \IRanges{}, each range is treated as an observation. The virtual \Rclass{Ranges} class represents lists of ranges, or, equivalently and as a derivative \Rclass{IntegerList}, sequences of consecutive integers. The most commonly used implementation of \Rclass{Ranges} is \Rclass{IRanges}, which stores the starts and widths as ordinary integer vectors. To construct an \Rclass{IRanges} instance, we call the \Rfunction{IRanges} constructor. Ranges are normally specified by passing two out of the three parameters: start, end and width (see \Rcode{help(IRanges)} for more information). % <>= ir1 <- IRanges(start = 1:10, width = 10:1) ir2 <- IRanges(start = 1:10, end = 11) ir3 <- IRanges(end = 11, width = 10:1) identical(ir1, ir2) & identical(ir2, ir3) ir <- IRanges(c(1, 8, 14, 15, 19, 34, 40), width = c(12, 6, 6, 15, 6, 2, 7)) @ % All of the above calls construct an \Rclass{IRanges} instance with the same ranges, using different combinations of the \Rfunarg{start}, \Rfunarg{end} and \Rfunarg{width} parameters. Accessing the starts, widths and ends is supported by every \Rclass{Ranges} implementation. <>= start(ir) @ <>= end(ir) @ <>= width(ir) @ For \Rclass{IRanges} and some other \Rclass{Ranges} derivatives, subsetting is also supported, by numeric and logical indices. <>= ir[1:4] @ <>= ir[start(ir) <= 15] @ One may think of each range as a sequence of integer ranges, and \Rclass{Ranges} is, in fact, derived from \Rclass{IntegerList}. <>= ir[[1]] @ In order to illustrate range operations, we'll create a function to plot ranges. <>= plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black", sep = 0.5, ...) { height <- 1 if (is(xlim, "Ranges")) xlim <- c(min(start(xlim)), max(end(xlim))) bins <- disjointBins(IRanges(start(x), end(x) + 1)) plot.new() plot.window(xlim, c(0, max(bins)*(height + sep))) ybottom <- bins * (sep + height) - height rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col = col, ...) title(main) axis(1) } @ <>= plotRanges(ir) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ir-plotRanges} \caption{\label{fig-ir-plotRanges}% Plot of original ranges.} \end{center} \end{figure} \subsection{Normality} Sometimes, it is necessary to formally represent a subsequence, where no elements are repeated and order is preserved. Also, it is occasionally useful to think of a \Rclass{Ranges} object as a set, where no elements are repeated and order does not matter. While every \Rclass{Ranges} object, as a \Rclass{Vector} derivative, has an implicit ordering, one can enforce the same ordering for all such objects, so that ordering becomes inconsequential within that context. The \Rclass{NormalIRanges} class formally represents either a subsequence encoding or a set of integers. By definition a Ranges object is said to be \textit{normal} when its ranges are: (a) not empty (i.e. they have a non-null width); (b) not overlapping; (c) ordered from left to right; (d) not even adjacent (i.e. there must be a non empty gap between 2 consecutive ranges). There are three main advantages of using a \textit{normal} \Rclass{Ranges} object: (1) it guarantees a subsequence encoding or set of integers, (2) it is compact in terms of the number of ranges, and (3) it uniquely identifies its information, which simplifies comparisons. The \Rfunction{reduce} function reduces any \Rclass{Ranges} object to a \Rclass{NormalIRanges} by merging redundant ranges. <>= reduce(ir) plotRanges(reduce(ir)) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-reduce} \caption{\label{fig-ranges-reduce}% Plot of reduced ranges.} \end{center} \end{figure} \subsection{Lists of \Rclass{Ranges} objects} It is common to manipulate collections of \Rclass{Ranges} objects during an analysis. Thus, the \IRanges{} package defines some specific classes for working with multiple \Rclass{Ranges} objects. The \Rclass{RangesList} class asserts that each element is a \Rclass{Ranges} object and provides convenience methods, such as \Rfunction{start}, \Rfunction{end} and \Rfunction{width} accessors that return \Rclass{IntegerList} objects, aligning with the \Rclass{RangesList} object. To explicitly construct a \Rclass{RangesList}, use the \Rfunction{RangesList} function. <>= rl <- RangesList(ir, rev(ir)) @ % <>= start(rl) @ \subsection{Vector Extraction} As the elements of a \Rclass{Ranges} object encode consecutive subsequences, they may be used directly in sequence extraction. Note that when a \textit{normal} \Rclass{Ranges} is given as the index, the result is a subsequence, as no elements are repeated or reordered. If the sequence is a \Rclass{Vector} subclass (i.e. not an ordinary \Rclass{vector}), the canonical \Rfunction{[} function accepts a \Rclass{Ranges} instance. % <>= irextract <- IRanges(start = c(4501, 4901) , width = 100) xRle[irextract] @ % \subsection{Finding Overlapping Ranges} The function \Rfunction{findOverlaps} detects overlaps between two \Rclass{Ranges} objects. <>= ol <- findOverlaps(ir, reduce(ir)) as.matrix(ol) @ \subsection{Counting Overlapping Ranges} The function \Rfunction{coverage} counts the number of ranges over each position. <>= cov <- coverage(ir) plotRanges(ir) cov <- as.vector(cov) mat <- cbind(seq_along(cov)-0.5, cov) d <- diff(cov) != 0 mat <- rbind(cbind(mat[d,1]+1, mat[d,2]), mat) mat <- mat[order(mat[,1]),] lines(mat, col="red", lwd=4) axis(2) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-coverage} \caption{\label{fig-ranges-coverage}% Plot of ranges with accumulated coverage.} \end{center} \end{figure} \subsection{Finding Neighboring Ranges} The \Rfunction{nearest} function finds the nearest neighbor ranges (overlapping is zero distance). The \Rfunction{precede} and \Rfunction{follow} functions find the non-overlapping nearest neighbors on a specific side. \subsection{Transforming Ranges} Utilities are available for transforming a \Rclass{Ranges} object in a variety of ways. Some transformations, like \Rfunction{reduce} introduced above, can be dramatic, while others are simple per-range adjustments of the starts, ends or widths. \subsubsection{Adjusting starts, ends and widths} Perhaps the simplest transformation is to adjust the start values by a scalar offset, as performed by the \Rfunction{shift} function. Below, we shift all ranges forward 10 positions. % <>= shift(ir, 10) @ There are several other ways to transform ranges. These include \Rfunction{narrow}, \Rfunction{resize}, \Rfunction{flank}, \Rfunction{reflect}, \Rfunction{restrict}, and \Rfunction{threebands}. For example \Rfunction{narrow} supports the adjustment of start, end and width values, which should be relative to each range. These adjustments are vectorized over the ranges. As its name suggests, the ranges can only be narrowed. % <>= narrow(ir, start=1:5, width=2) @ The \Rfunction{restrict} function ensures every range falls within a set of bounds. Ranges are contracted as necessary, and the ranges that fall completely outside of but not adjacent to the bounds are dropped, by default. % <>= restrict(ir, start=2, end=3) @ The \Rfunction{threebands} function extends \Rfunction{narrow} so that the remaining left and right regions adjacent to the narrowed region are also returned in separate \Rclass{Ranges} objects. % <>= threebands(ir, start=1:5, width=2) @ The arithmetic operators \Rfunction{+}, \Rfunction{-} and \Rfunction{*} change both the start and the end/width by symmetrically expanding or contracting each range. Adding or subtracting a numeric (integer) vector to a \Rclass{Ranges} causes each range to be expanded or contracted on each side by the corresponding value in the numeric vector. <>= ir + seq_len(length(ir)) @ % The \Rfunction{*} operator symmetrically magnifies a \Rclass{Ranges} object by a factor, where positive contracts (zooms in) and negative expands (zooms out). % <>= ir * -2 # double the width @ WARNING: The semantic of these arithmetic operators might be revisited at some point. Please restrict their use to the context of interactive visualization (where they arguably provide some convenience) but avoid to use them programmatically. \subsubsection{Making ranges disjoint} A more complex type of operation is making a set of ranges disjoint, \textit{i.e.} non-overlapping. For example, \Rfunction{threebands} returns a disjoint set of three ranges for each input range. The \Rfunction{disjoin} function makes a \Rclass{Ranges} object disjoint by fragmenting it into the widest ranges where the set of overlapping ranges is the same. <>= disjoin(ir) plotRanges(disjoin(ir)) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-disjoin} \caption{\label{fig-ranges-disjoin}% Plot of disjoined ranges.} \end{center} \end{figure} A variant of \Rfunction{disjoin} is \Rfunction{disjointBins}, which divides the ranges into bins, such that the ranges in each bin are disjoint. The return value is an integer vector of the bins. <>= disjointBins(ir) @ \subsubsection{Other transformations} Other transformations include \Rfunction{reflect} and \Rfunction{flank}. The former ``flips'' each range within a set of common reference bounds. <>= reflect(ir, IRanges(start(ir), width=width(ir)*2)) @ % The \Rfunction{flank} returns ranges of a specified width that flank, to the left (default) or right, each input range. One use case of this is forming promoter regions for a set of genes. <>= flank(ir, width = seq_len(length(ir))) @ % \subsection{Set Operations} Sometimes, it is useful to consider a \Rclass{Ranges} object as a set of integers, although there is always an implicit ordering. This is formalized by \Rclass{NormalIRanges}, above, and we now present versions of the traditional mathematical set operations \textit{complement}, \textit{union}, \textit{intersect}, and \textit{difference} for \Rclass{Ranges} objects. There are two variants for each operation. The first treats each \Rclass{Ranges} object as a set and returns a \textit{normal} value, while the other has a ``parallel'' semantic like \Rfunction{pmin}/\Rfunction{pmax} and performs the operation for each range pairing separately. The \textit{complement} operation is implemented by the \Rfunction{gaps} and \Rfunction{pgap} functions. By default, \Rfunction{gaps} will return the ranges that fall between the ranges in the (normalized) input. It is possible to specify a set of bounds, so that flanking ranges are included. <>= gaps(ir, start=1, end=50) plotRanges(gaps(ir, start=1, end=50), c(1,50)) @ \begin{figure}[tb] \begin{center} \includegraphics[width=0.5\textwidth]{IRangesOverview-ranges-gaps} \caption{\label{fig-ranges-gap}% Plot of gaps from ranges.} \end{center} \end{figure} \Rfunction{pgap} considers each parallel pairing between two \Rclass{Ranges} objects and finds the range, if any, between them. Note that the function name is singular, suggesting that only one range is returned per range in the input. <>= @ The remaining operations, \textit{union}, \textit{intersect} and \textit{difference} are implemented by the \Rfunction{[p]union}, \Rfunction{[p]intersect} and \Rfunction{[p]setdiff} functions, respectively. These are relatively self-explanatory. <>= @ <>= @ <>= @ <>= @ <>= @ <>= @ % \subsection{Mapping Ranges Between Vectors} \section{Vector Views} The \IRanges{} package provides the virtual \Rclass{Views} class, which stores a sequence together with an \Rclass{IRanges} object defining ranges on the sequence. Each range is said to represent a \textit{view} onto the sequence. Here, we will demonstrate the \Rclass{RleViews} class, where the sequence is of class \Rclass{Rle}. Other \Rclass{Views} implementations exist, such as \Rclass{XStringViews} in the \Rpackage{Biostrings} package. \subsection{Creating Views} There are two basic constructors for creating views: the \Rfunction{Views} function based on indicators and the \Rfunction{slice} based on numeric boundaries. <>= xViews <- Views(xRle, xRle >= 1) xViews <- slice(xRle, 1) xViewsList <- slice(xRleList, 1) @ \subsection{Aggregating Views} While \Rfunction{sapply} can be used to loop over each window, the native functions \Rfunction{viewMaxs}, \Rfunction{viewMins}, \Rfunction{viewSums}, and \Rfunction{viewMeans} provide fast looping to calculate their respective statistical summaries. <>= head(viewSums(xViews)) viewSums(xViewsList) head(viewMaxs(xViews)) viewMaxs(xViewsList) @ \section{IRanges in Biological Sequence Analysis} The \IRanges{} packages was primarily designed with biological sequence analysis in mind and Table \ref{table:bioseq} shows how some biological sequence analysis concepts are represented in the \IRanges{} class system. \begin{table}[ht] \begin{center} \begin{tabular}{l|l} \hline Biological Entity & \Rclass{Vector} Subclass \\ \hline Genome browser track(s) & \Rclass{GRanges}/\Rclass{GRangesList} \\ Coverage across chromosomes/contigs & \Rclass{RleList} \\ Mapped ranges to genome & \Rclass{CompressedIRangesList} \\ Data (sans ranges) across chroms/contigs & \Rclass{SplitDataFrameList} \\ \hline \end{tabular} \end{center} \caption{\Rclass{Vector} subclasses for Biological Sequence Analysis} \label{table:bioseq} \end{table} \pagebreak[4] \section{Session Information} \begin{table*}[tbp] \begin{minipage}{\textwidth} <>= toLatex(sessionInfo()) @ \end{minipage} \caption{\label{tab:sessioninfo}% The output of \Rfunction{sessionInfo} on the build system after running this vignette.} \end{table*} \end{document}