XStringSet-class {Biostrings} | R Documentation |
The BStringSet class is a container for storing a set of BString objects and for making their manipulation easy and efficient.
Similarly, the DNAStringSet (or RNAStringSet, or AAStringSet) class is a container for storing a set of DNAString (or RNAString, or AAString) objects.
All those containers derive directly (and with no additional slots) from the XStringSet virtual class.
## Constructors:
BStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
DNAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
RNAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)
AAStringSet(x=character(), start=NA, end=NA, width=NA, use.names=TRUE)

## Accessor-like methods:

## S4 method for signature 'character'
width(x)

## S4 method for signature 'XStringSet'
nchar(x, type="chars", allowNA=FALSE)

## ... and more (see below)
x |
Either a character vector (with no NAs), or an XString, XStringSet or XStringViews object. |
start,end,width |
Either |
use.names |
|
type,allowNA |
Ignored. |
The BStringSet, DNAStringSet, RNAStringSet and AAStringSet functions are constructors that can be used to turn input x into an XStringSet object of the desired base type.
They also allow the user to "narrow" the sequences contained in x via proper use of the start, end and/or width arguments. In this context, "narrowing" means dropping a prefix and/or a suffix of each sequence in x.
The "narrowing" capabilities of these constructors can be illustrated by the following property: if x is a character vector (with no NAs), or an XStringSet (or XStringViews) object, then the following 3 transformations are equivalent:
BStringSet(x, start=mystart, end=myend, width=mywidth)
subseq(BStringSet(x), start=mystart, end=myend, width=mywidth)
BStringSet(subseq(x, start=mystart, end=myend, width=mywidth))
Note that, besides being more convenient, the first form is also more efficient on character vectors.
In the code snippets below, x is an XStringSet object.
length(x): The number of sequences in x.
width(x): A vector of non-negative integers containing the number of letters for each element in x. Note that width(x) is also defined for a character vector with no NAs and is equivalent to nchar(x, type="bytes").
names(x): NULL or a character vector of the same length as x containing a short user-provided description or comment for each element in x. These are the only data in an XStringSet object that can safely be changed by the user. All the other data are immutable! As a general recommendation, the user should never try to modify an object by accessing its slots directly.
alphabet(x): Return NULL, DNA_ALPHABET, RNA_ALPHABET or AA_ALPHABET depending on whether x is a BStringSet, DNAStringSet, RNAStringSet or AAStringSet object.
nchar(x): The same as width(x).
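The accessors above can be tried on a small object. This is only a sketch: the sequences and the names seqA, seqB, seqC and short are made up for illustration.

```r
library(Biostrings)

## A small, made-up DNAStringSet; names are taken from the input vector
## because use.names=TRUE is the default.
dna <- DNAStringSet(c(seqA="ACGT", seqB="GATTACA", seqC="TT"))

length(dna)     # number of sequences
width(dna)      # number of letters in each sequence
nchar(dna)      # documented above as the same as width(dna)
names(dna)      # the user-provided names

## The names are the only data that can safely be modified in place:
names(dna)[3] <- "short"
names(dna)

alphabet(dna)   # DNA_ALPHABET for a DNAStringSet

## width() is also defined for a plain character vector:
width(c("ab", "cde"))
```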
In the code snippets below, x is a character vector (with no NAs), or an XStringSet (or XStringViews) object.
subseq(x, start=NA, end=NA, width=NA): Applies subseq on each element in x. See ?subseq for the details.
Note that this is similar to what substr does on a character vector. However, there are some noticeable differences: (1) the arguments are start and stop for substr; (2) the SEW (start/end/width) interface of subseq is richer (e.g. support for negative start or end values); and (3) subseq checks that the specified start/end/width values are valid, i.e., unlike substr, it throws an error if they define "out of limits" subsequences or subsequences with a negative width.
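The differences between subseq and substr can be seen on a toy input. The following is a minimal sketch with made-up strings; try() is used only so that the expected error does not stop the script.

```r
library(Biostrings)

x <- BStringSet(c("abcdefghij", "0123456789"))

## Positive start/end work like the start/stop arguments of substr():
subseq(x, start=3, end=5)
substr(c("abcdefghij", "0123456789"), 3, 5)

## Negative values are counted from the end of each sequence
## (-1 is the position of the last letter):
subseq(x, start=-3)   # keep the last 3 letters
subseq(x, end=-4)     # drop the last 3 letters

## subseq() rejects "out of limits" requests, whereas substr()
## silently truncates:
res <- try(subseq(x, start=8, end=12), silent=TRUE)
inherits(res, "try-error")
substr("abcdefghij", 8, 12)
```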
narrow(x, start=NA, end=NA, width=NA, use.names=TRUE): Same as subseq. The only differences are: (1) narrow has a use.names argument; and (2) the kinds of objects that narrow and subseq work on (IRanges, XStringSet or XStringViews objects for narrow, XVector or XStringSet objects for subseq). But they both work and do the same thing on an XStringSet object.
threebands(x, start=NA, end=NA, width=NA): Like the method for IRanges objects, the threebands methods for character vectors and XStringSet objects extend the capability of narrow by returning the 3 sets of subsequences (the left, middle and right subsequences) associated with the narrowing operation. See ?threebands in the IRanges package for the details.
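As an illustration of narrow and threebands, here is a short sketch on made-up sequences. It assumes that threebands returns a list whose components are named left, middle and right, as the method for IRanges objects does; check names(bands) in your installation.

```r
library(Biostrings)

x <- DNAStringSet(c(a="AAACCCGGG", b="TTTACGTAA"))

## Keep positions 4 to 6 of each 9-letter sequence:
mid <- narrow(x, start=4, end=-4)
mid

## Same narrowing, but without propagating the names:
narrow(x, start=4, end=-4, use.names=FALSE)

## threebands() also returns what narrow() dropped on each side.
## Assumption: the components are named left/middle/right.
bands <- threebands(x, start=4, end=-4)
names(bands)
bands[["left"]]
bands[["middle"]]
bands[["right"]]
```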
subseq(x, start=NA, end=NA, width=NA) <- value: A vectorized version of the subseq<- method for XVector objects. See ?`subseq<-` for the details.
In the code snippets below, x and values are XStringSet objects, and i should be an index specifying the elements to extract.
x[i]: Return a new XStringSet object made of the selected elements.
x[[i]]: Extract the i-th XString object from x.
append(x, values, after=length(x)): Add sequences in values to x.
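Subsetting and appending can be sketched as follows. The sequences and the names s1 to s4 are hypothetical.

```r
library(Biostrings)

x <- DNAStringSet(c(s1="ACGT", s2="GGCC", s3="TTAA"))
values <- DNAStringSet(c(s4="AAAA"))

x[2:3]              # still a DNAStringSet, with 2 elements
x[c("s1", "s3")]    # subsetting by name
x[[2]]              # a single DNAString, no longer a set

## Append at the end (the default) or after a given position:
append(x, values)
inserted <- append(x, values, after=1)
names(inserted)
```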
In the code snippets below, x and y are XStringSet objects.
union(x, y): Union of x and y.
intersect(x, y): Intersection of x and y.
setdiff(x, y): Asymmetric set difference of x and y.
setequal(x, y): Set equality of x to y.
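A small sketch of the set operations, using made-up sequences. The order of the elements in the results is not asserted here, since it is not specified above.

```r
library(Biostrings)

x <- DNAStringSet(c("ACGT", "GGCC", "ACGT"))   # "ACGT" appears twice
y <- DNAStringSet(c("GGCC", "TTAA"))

u <- union(x, y)       # distinct sequences found in x or y
i <- intersect(x, y)   # sequences found in both
d <- setdiff(x, y)     # sequences in x that are not in y
u; i; d

setequal(x, y)                       # FALSE for these inputs
setequal(x, DNAStringSet(c("GGCC", "ACGT")))   # duplicates and order are ignored
```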
In the code snippets below, x is an XStringSet object.
unlist(x): Turns x into an XString object by combining the sequences in x together. Fast equivalent to do.call(c, as.list(x)).
as.character(x, use.names=TRUE): Converts x to a character vector of the same length as x. The use.names argument controls whether or not names(x) should be propagated to the names of the returned vector.
as.factor(x): Converts x to a factor, via as.character(x).
as.matrix(x, use.names=TRUE): Returns a character matrix containing the "exploded" representation of the strings. Can only be used on an XStringSet object with equal-width strings. The use.names argument controls whether or not names(x) should be propagated to the row names of the returned matrix.
toString(x): Equivalent to toString(as.character(x)).
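The coercion methods can be compared side by side on a toy object. This is a sketch only; the two equal-width sequences and their names a and b are made up.

```r
library(Biostrings)

x <- DNAStringSet(c(a="ACG", b="TTA"))

unlist(x)                          # one DNAString holding all the letters
as.character(x)                    # named character vector
as.character(x, use.names=FALSE)   # same, without the names
as.factor(x)                       # factor built from as.character(x)

## as.matrix() needs equal-width sequences: one row per sequence,
## one column per letter position.
m <- as.matrix(x)
dim(m)
m

toString(x)
```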
show(x): By default the show method displays 5 head and 5 tail lines. The number of lines can be altered by setting the global options showHeadLines and showTailLines. If the object length is less than the sum of the options, the full object is displayed. These options affect GRanges, GappedAlignments, Ranges and XString objects.
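The effect of the two display options can be checked with a short sketch like the one below. The 30-sequence object is made up, and the previous option values are restored at the end.

```r
library(Biostrings)

x <- DNAStringSet(rep(c("ACGT", "GGCC", "TTAA"), 10))   # 30 sequences
x                                   # default display: head and tail lines only

## Show 3 head lines and 2 tail lines instead:
old <- options(showHeadLines=3, showTailLines=2)
x

## Restore whatever was set before:
options(old)
```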
H. Pagès
XStringSet-comparison, XString-class, XStringViews-class, XStringSetList-class, subseq, narrow, substr, compact, XVectorList-class
## ---------------------------------------------------------------------
## A. USING THE XStringSet CONSTRUCTORS ON A CHARACTER VECTOR OR FACTOR
## ---------------------------------------------------------------------
## Note that there is no XStringSet() constructor, but an XStringSet
## family of constructors: BStringSet(), DNAStringSet(), RNAStringSet(),
## etc...
x0 <- c("#CTC-NACCAGTAT", "#TTGA", "TACCTAGAG")
width(x0)
x1 <- BStringSet(x0)
x1

## 3 equivalent ways to obtain the same BStringSet object:
BStringSet(x0, start=4, end=-3)
subseq(x1, start=4, end=-3)
BStringSet(subseq(x0, start=4, end=-3))

dna0 <- DNAStringSet(x0, start=4, end=-3)
dna0
names(dna0)
names(dna0)[2] <- "seqB"
dna0

## When the input vector contains a lot of duplicates, turning it into
## a factor first before passing it to the constructor will produce an
## XStringSet object that is more compact in memory:
library(hgu95av2probe)
x2 <- sample(hgu95av2probe$sequence, 999000, replace=TRUE)
dna2a <- DNAStringSet(x2)
dna2b <- DNAStringSet(factor(x2))  # slower but result is more compact
object.size(dna2a)
object.size(dna2b)

## ---------------------------------------------------------------------
## B. USING THE XStringSet CONSTRUCTORS ON A SINGLE SEQUENCE (XString
##    OBJECT OR CHARACTER STRING)
## ---------------------------------------------------------------------
x3 <- "abcdefghij"
BStringSet(x3, start=2, end=6:2)  # behaves like 'substring(x3, 2, 6:2)'
BStringSet(x3, start=-(1:6))
x4 <- BString(x3)
BStringSet(x4, end=-(1:6), width=3)

## Randomly extract 1 million 40-mers from C. elegans chrI:
extractRandomReads <- function(subject, nread, readlength)
{
    if (!is.integer(readlength))
        readlength <- as.integer(readlength)
    start <- sample(length(subject) - readlength + 1L, nread,
                    replace=TRUE)
    DNAStringSet(subject, start=start, width=readlength)
}
library(BSgenome.Celegans.UCSC.ce2)
rndreads <- extractRandomReads(Celegans$chrI, 1000000, 40)
## Notes:
## - This takes only 2 or 3 seconds versus several hours for a solution
##   using substring() on a standard character string.
## - The short sequences in 'rndreads' can be seen as the result of a
##   simulated high-throughput sequencing experiment. A non-realistic
##   one though because:
##     (a) It assumes that the underlying technology is perfect (the
##         generated reads have no technology induced errors).
##     (b) It assumes that the sequenced genome is exactly the same as the
##         reference genome.
##     (c) The simulated reads can contain IUPAC ambiguity letters only
##         because the reference genome contains them. In a real
##         high-throughput sequencing experiment, the sequenced genome
##         of course doesn't contain those letters, but the sequencer
##         can introduce them in the generated reads to indicate ambiguous
##         base-calling.
##     (d) The simulated reads come from the plus strand only of a single
##         chromosome.
## - See the getSeq() function in the BSgenome package for how to
##   circumvent (d) i.e. how to generate reads that come from the whole
##   genome (plus and minus strands of all chromosomes).

## ---------------------------------------------------------------------
## C. USING THE XStringSet CONSTRUCTORS ON AN XStringSet OBJECT
## ---------------------------------------------------------------------
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
probes

RNAStringSet(probes, start=2, end=-5)  # does NOT copy the sequence data!

## ---------------------------------------------------------------------
## D. USING THE XStringSet CONSTRUCTORS ON AN ORDINARY list OF XString
##    OBJECTS
## ---------------------------------------------------------------------
probes10 <- head(probes, n=10)
set.seed(33)
shuffled_nucleotides <- lapply(probes10, sample)
shuffled_nucleotides

DNAStringSet(shuffled_nucleotides)  # does NOT copy the sequence data!

## Note that the same result can be obtained in a more compact way with
## just:
set.seed(33)
endoapply(probes10, sample)

## ---------------------------------------------------------------------
## E. USING subseq() ON AN XStringSet OBJECT
## ---------------------------------------------------------------------
subseq(probes, start=2, end=-5)

subseq(probes, start=13, end=13) <- "N"
probes

## Add/remove a prefix:
subseq(probes, start=1, end=0) <- "--"
probes
subseq(probes, end=2) <- ""
probes

## Do more complicated things:
subseq(probes, start=4:7, end=7) <- c("YYYY", "YYY", "YY", "Y")
subseq(probes, start=4, end=6) <- subseq(probes, start=-2:-5)
probes

## ---------------------------------------------------------------------
## F. UNLISTING AN XStringSet OBJECT
## ---------------------------------------------------------------------
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)
unlist(probes)

## ---------------------------------------------------------------------
## G. COMPACTING AN XStringSet OBJECT
## ---------------------------------------------------------------------
## As a particular type of XVectorList objects, XStringSet objects can
## optionally be compacted. Compacting is done typically before
## serialization. See ?compact for more information.
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe)

y <- subseq(probes[1:12], start=5)
probes@pool
y@pool
object.size(probes)
object.size(y)

y0 <- compact(y)
y0@pool
object.size(y0)