fetchExtendedChromInfoFromUCSC {GenomeInfoDb} | R Documentation |
Fetch the chromosomes info for some UCSC genomes. Only supports hg38, hg19, hg18, panTro4, panTro3, panTro2, bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10, mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3, gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3, and sacCer2 at the moment.
fetchExtendedChromInfoFromUCSC(genome, goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath", quiet=FALSE)
genome |
A single string specifying the UCSC genome e.g. |
goldenPath_url |
A single string specifying the URL to the UCSC goldenPath location.
This URL is used internally to build the full URL to the 'chromInfo'
MySQL dump containing chromosomes information for |
quiet |
|
Chromosomes information (e.g. names and lengths) for any UCSC genome is stored in the UCSC database in the 'chromInfo' table, and is normally available as a MySQL dump at:
goldenPath_url/<genome>/database/chromInfo.txt.gz
fetchExtendedChromInfoFromUCSC
downloads and imports that table
into a data frame, keeps only the UCSC_seqlevel
and
UCSC_seqlength
columns (after renaming them), and adds the
circular
logical column.
Then, if this UCSC genome is based on an NCBI assembly (e.g. hg38 is based on GRCh38), the NCBI seqlevels and GenBank accession numbers are extracted from the NCBI assembly report and the UCSC seqlevels matched to them (using some guided heuristic). Finally the NCBI seqlevels and GenBank accession numbers are added to the returned data frame.
A data frame with 1 row per seqlevel in the UCSC genome, and at least 3 columns:
UCSC_seqlevel
: Character vector with no NAs. This is the
chrom
field of the UCSC 'chromInfo' table for the
genome. See Details section above.
UCSC_seqlength
: Integer vector with no NAs. This is the
size
field of the UCSC 'chromInfo' table for the
genome. See Details section above.
circular
: Logical vector with no NAs. This knowledge is
stored in the GenomeInfoDb package itself for the supported
genomes.
If the UCSC genome is *not* based on an NCBI assembly (e.g. gasAcu1, ce10,
sacCer2), there are no additional columns and a warning is emitted (unless
quiet
is set to TRUE
). In this case, the rows are sorted
by UCSC seqlevel rank as determined by rankSeqlevels()
.
If the UCSC genome is based on an NCBI assembly (e.g. sacCer3), the returned data frame has 3 additional columns:
NCBI_seqlevel
: Character vector. This information is
obtained from the NCBI assembly report for the genome. Will contain
NAs for UCSC seqlevels with no corresponding NCBI seqlevels (e.g.
for chrM in hg18 or chrUextra in dm3), in which case
fetchExtendedChromInfoFromUCSC
emits a warning (unless
quiet
is set to TRUE
).
SequenceRole
: Factor with levels assembled-molecule
,
alt-scaffold
, unlocalized-scaffold
,
unplaced-scaffold
, and pseudo-scaffold
. For
UCSC seqlevels with corresponding NCBI seqlevels this information
is obtained from the NCBI assembly report. Otherwise it is
obtained from a base of knowledge included in the GenomeInfoDb
package. Can contain NAs but no warning is emitted in that case.
GenBankAccn
: Character vector. This information is obtained
from the NCBI assembly report for the genome. Can contain NAs but no
warning is emitted in that case.
In this case, the rows are sorted first by level in the SequenceRole
column, that is, assembled-molecule
s first, then alt-scaffold
s,
etc, and NAs last. Then within each group they are sorted by UCSC seqlevel
rank as determined by rankSeqlevels()
.
fetchExtendedChromInfoFromUCSC
queries the UCSC Genome Browser as
well as the FTP site at NCBI and thus requires internet access.
Only supports the hg38, hg19, hg18, panTro4, panTro3, panTro2, bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10, mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3, gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3, and sacCer2 genomes at the moment. More will come...
H. Pagès
The seqlevels
getter and setter.
The rankSeqlevels
function for ranking sequence names.
The seqlevelsStyle
getter and setter.
The getBSgenome
utility in the
BSgenome package for searching the installed BSgenome
data packages.
## All the examples below require internet access! ## --------------------------------------------------------------------- ## A. BASIC EXAMPLE ## --------------------------------------------------------------------- ## The sacCer3 UCSC genome is based on an NCBI assembly (RefSeq Assembly ## ID is GCF_000146045.2): sacCer3_chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer3") sacCer3_chrominfo ## But the sacCer2 UCSC genome is not: sacCer2_chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer2") sacCer2_chrominfo ## --------------------------------------------------------------------- ## B. USING fetchExtendedChromInfoFromUCSC() TO PUT UCSC SEQLEVELS ON ## THE GRCh38 GENOME ## --------------------------------------------------------------------- ## Load the BSgenome.Hsapiens.NCBI.GRCh38 package: library(BSgenome) genome <- getBSgenome("GRCh38") # this loads the # BSgenome.Hsapiens.NCBI.GRCh38 package ## A quick look at the GRCh38 seqlevels: length(seqlevels(genome)) head(seqlevels(genome), n=30) ## Fetch the extended chromosomes info for the hg38 genome: hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38") dim(hg38_chrominfo) head(hg38_chrominfo, n=30) ## 2 sanity checks: ## 1. Check the NCBI seqlevels: stopifnot(setequal(hg38_chrominfo$NCBI_seqlevel, seqlevels(genome))) ## 2. Check that the sequence lengths in 'hg38_chrominfo' (which are ## coming from the same 'chromInfo' table as the UCSC seqlevels) ## are the same as in 'genome': stopifnot( identical(hg38_chrominfo$UCSC_seqlength, unname(seqlengths(genome)[hg38_chrominfo$NCBI_seqlevel])) ) ## Extract the hg38 seqlevels and put the GRCh38 seqlevels on it as ## the names: hg38_seqlevels <- setNames(hg38_chrominfo$UCSC_seqlevel, hg38_chrominfo$NCBI_seqlevel) ## Set the hg38 seqlevels on 'genome': seqlevels(genome) <- hg38_seqlevels[seqlevels(genome)] head(seqlevels(genome), n=30)