%\VignetteIndexEntry{A short presentation of the basic classes defined in Biostrings 2}
%\VignetteKeywords{DNA, RNA, Sequence, Biostrings, Sequence alignment}
%\VignettePackage{Biostrings}
%
% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is
% likely to be overwritten.
%
\documentclass[11pt]{article}

%\usepackage[authoryear,round]{natbib}
%\usepackage{hyperref}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat}

\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{The \Rpackage{Biostrings}~2 classes (work in progress)}
\author{Herv\'e Pag\`es}
\maketitle

\tableofcontents

% ---------------------------------------------------------------------------
\section{Introduction}

This document briefly presents the new set of classes implemented in the
\Rpackage{Biostrings}~2 package.
Like the \Rpackage{Biostrings}~1 classes (found in \Rpackage{Biostrings}
v~1.4.x), they were designed to make manipulation of big strings (like DNA
or RNA sequences) easy and fast.
This is achieved by keeping the 3 following ideas from the
\Rpackage{Biostrings}~1 package:
(1) use R external pointers to store the string data,
(2) use bit patterns to encode the string data,
(3) provide the user with a convenient class of objects where each instance
can store a set of views {\it on the same} big string (these views being
typically the matches returned by a search algorithm).

However, there is a flaw in the \Rclass{BioString} class design that prevents
the search algorithms from returning correct information about the matches
(i.e. the views) that they found.
The new classes address this issue by replacing the \Rclass{BioString} class
(implemented in \Rpackage{Biostrings}~1) with 2 new classes:
(1) the \Rclass{XString} class used to represent a {\it single} string, and
(2) the \Rclass{XStringViews} class used to represent a set of views
{\it on the same} \Rclass{XString} object,
and by introducing new implementations and new interfaces for these 2 classes.

% ---------------------------------------------------------------------------
\section{The \Rclass{XString} class and its subsetting operator~\Rmethod{[}}

The \Rclass{XString} class is in fact a virtual class and therefore cannot be
instantiated. Only its subclasses (or subtypes) \Rclass{BString},
\Rclass{DNAString}, \Rclass{RNAString} and \Rclass{AAString} can.
These classes are direct extensions of the \Rclass{XString} class (no
additional slot).

A first \Rclass{BString} object:
<<>>=
library(Biostrings)
b <- BString("I am a BString object")
b
length(b)
@

A \Rclass{DNAString} object:
<<>>=
d <- DNAString("TTGAAAA-CTC-N")
d
length(d)
@
The differences with a \Rclass{BString} object are:
(1) only letters from the {\it IUPAC extended genetic alphabet} + the gap
letter ({\tt -}) are allowed and
(2) each letter in the argument passed to the \Rfunction{DNAString} function
is encoded in a special way before it is stored in the \Rclass{DNAString}
object.

Access to the individual letters:
<<>>=
d[3]
d[7:12]
d[]
b[length(b):1]
@
Only {\it in bounds} positive numeric subscripts are supported.
In fact the subsetting operator for \Rclass{XString} objects is not efficient
and one should always use the \Rmethod{subseq} method to extract a substring
from a big string:
<<>>=
bb <- subseq(b, 3, 6)
dd1 <- subseq(d, end=7)
dd2 <- subseq(d, start=8)
@

To {\it dump} an \Rclass{XString} object as a character vector (of length 1),
use the \Rmethod{toString} method:
<<>>=
toString(dd2)
@
Note that \Robject{length(dd2)} is equivalent to
\Robject{nchar(toString(dd2))} but the latter would be very inefficient
on a big \Rclass{DNAString} object.

{\it [TODO: Make a generic of the substr() function to work with
XString objects. It will be essentially doing toString(subseq()).]}

% ---------------------------------------------------------------------------
\section{The \Rmethod{==} binary operator for \Rclass{XString} objects}

The 2 following comparisons are \Robject{TRUE}:
<<>>=
bb == "am a"
dd2 != DNAString("TG")
@

When the 2 sides of \Rmethod{==} don't belong to the same class then the side
belonging to the ``lowest'' class is first converted to an object belonging
to the class of the other side (the ``highest'' class).
The class (pseudo-)order is \Rclass{character} < \Rclass{BString} <
\Rclass{DNAString}.
When both sides are \Rclass{XString} objects of the same subtype (e.g. both
are \Rclass{DNAString} objects) then the comparison is very fast because it
only has to call the C standard function {\tt memcmp()} and no memory
allocation or string encoding/decoding is required.
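
The conversion rule and the same-subtype case described above can be sketched
as follows. This is only an illustration, assuming the \Robject{b},
\Robject{d} and \Robject{dd1} objects created in the previous chunks are still
available; it does not describe any additional package behavior:
<<>>=
## character vs BString: the character side is first converted to a BString
subseq(b, 8, 14) == "BString"
## DNAString vs DNAString: same subtype, so no conversion is needed
dd1 == subseq(d, 1, 7)
dd1 != subseq(d, 2, 8)
@
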
The 2 following expressions provoke an error because the right member can't
be ``upgraded'' (converted) to an object of the same class as the left
member:
<<>>=
cat('> bb == ""')
cat('> d == bb')
@

When comparing an \Rclass{RNAString} object with a \Rclass{DNAString} object,
U and T are considered equal:
<<>>=
r <- RNAString(d)
r
r == d
@

% ---------------------------------------------------------------------------
\section{The \Rclass{XStringViews} class and its subsetting
operators~\Rmethod{[} and~\Rmethod{[[}}

An \Rclass{XStringViews} object contains a set of views {\it on the same}
\Rclass{XString} object called the {\it subject} string.
Here is an \Rclass{XStringViews} object with 4 views:
<<>>=
v4 <- Views(dd2, start=3:0, end=5:8)
v4
length(v4)
@
Note that the last 2 views are {\it out of limits}.

You can select a subset of views from an \Rclass{XStringViews} object:
<<>>=
v4[4:2]
@
The returned object is still an \Rclass{XStringViews} object, even if we
select only one element.
You need to use double-brackets to extract a given view as an
\Rclass{XString} object:
<<>>=
v4[[2]]
@
You can't extract a view that is {\it out of limits}:
<<>>=
cat('> v4[[3]]')
cat(try(v4[[3]], silent=TRUE))
@
Note that, when \Robject{start} and \Robject{end} are numeric vectors and
\Robject{i} is a {\it single} integer, \Robject{Views(b, start, end)[[i]]}
is equivalent to \Robject{subseq(b, start[i], end[i])}.

Subsetting also works with negative or logical values with the expected
semantics (the same as for R built-in vectors):
<<>>=
v4[-3]
v4[c(TRUE, FALSE)]
@
Note that the logical vector is recycled to the length of \Robject{v4}.
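
As a hedged illustration of the equivalence and of the recycling rule stated
above, the following sketch reuses the \Robject{b} and \Robject{v4} objects
from the previous chunks. The names \Robject{starts}, \Robject{ends},
\Robject{i} and \Robject{bv} are introduced here for illustration only:
<<>>=
## Illustrative names (not part of the package): starts, ends, i, bv
starts <- c(1, 3, 8)
ends <- c(1, 6, 14)
i <- 2
bv <- Views(b, start=starts, end=ends)
## Both sides extract letters 3 to 6 of b
bv[[i]] == subseq(b, starts[i], ends[i])
## c(TRUE, FALSE) is recycled to c(TRUE, FALSE, TRUE, FALSE),
## so this selects views 1 and 3 of v4
length(v4[c(TRUE, FALSE)])
@
All the views in \Robject{bv} are in bounds, so each of them can be extracted
with double-brackets.
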
% ---------------------------------------------------------------------------
\section{A few more \Rclass{XStringViews} objects}

12 views (all of the same width):
<<>>=
v12 <- Views(DNAString("TAATAATG"), start=-2:9, end=0:11)
@

This is the same as doing \Robject{Views(d, start=1, end=length(d))}:
<<>>=
as(d, "Views")
@
Hence the following will always return the \Robject{d} object itself:
<<>>=
as(d, "Views")[[1]]
@

3 \Rclass{XStringViews} objects with no view:
<<>>=
v12[0]
v12[FALSE]
Views(d)
@

% ---------------------------------------------------------------------------
\section{The \Rmethod{==} binary operator for \Rclass{XStringViews} objects}

This operator is the vectorized version of the \Rmethod{==} operator
defined previously for \Rclass{XString} objects:
<<>>=
v12 == DNAString("TAA")
@

To display all the views in \Robject{v12} that are equal to a given view,
you can type R cuties like:
<<>>=
v12[v12 == v12[4]]
v12[v12 == v12[1]]
@

This is \Robject{TRUE}:
<<>>=
v12[3] == Views(RNAString("AU"), start=0, end=2)
@

% ---------------------------------------------------------------------------
\section{The \Rmethod{start}, \Rmethod{end} and \Rmethod{width} methods}

<<>>=
start(v4)
end(v4)
width(v4)
@
Note that \Robject{start(v4)[i]} is equivalent to \Robject{start(v4[i])},
except that the former will not issue an error if \Robject{i} is out of
bounds (the same holds for the \Rmethod{end} and \Rmethod{width} methods).

Also, when \Robject{i} is a {\it single} integer, \Robject{width(v4)[i]}
is equivalent to \Robject{length(v4[[i]])} except that the former will not
issue an error if \Robject{i} is out of bounds or if view \Robject{v4[i]}
is {\it out of limits}.

\end{document}