xmlEventParse {XML} | R Documentation |
This is the event-driven or SAX (Simple API for XML)
style parser which processes XML without building the tree
but rather identifies tokens in the stream of characters
and passes them to handlers which can make sense of them
in context.
This reads and processes the contents of an XML file or string by
invoking user-level functions associated with different
components of the XML tree. These components include
the beginning and end of XML elements, e.g.
<myTag x="1"> and </myTag> respectively,
comments, CDATA (escaped character data), entities, processing
instructions, etc.
This allows the caller to create the appropriate data structure from the
XML document contents rather than the default tree (see xmlTreeParse)
and so avoids having the entire document in memory.
This is important for large documents and where we would end up with
essentially two copies of the data in memory at once, i.e.
the tree and the R data structure containing the information taken
from the tree.
When dealing with classes of XML documents whose instances could be large,
this approach is desirable but a little more cumbersome to program
than the standard DOM (Document Object Model) approach provided
by xmlTreeParse.
Note that xmlTreeParse does allow a hybrid style of
processing that allows us to apply handlers to nodes in the tree
as they are being converted to R objects. This is a style of
event-driven or asynchronous calling.
In addition to the generic token event handlers such as
"begin an XML element" (the startElement handler), one can
also provide handler functions for specific tags/elements such
as <myTag> by including handler elements with the same name as the
XML element of interest, i.e. "myTag" = function(x, attrs).
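As a rough illustration of combining a generic handler with a tag-specific one, the sketch below parses a small inline document. The helper `makeHandlers`, the sample XML and the variable names are made up for this example. The handlers also accept `...` so that they tolerate any extra context arguments.

```r
library(XML)

# Hypothetical helper: pairs a generic startElement handler with a
# handler specific to <myTag>, and exposes what each one recorded.
makeHandlers <- function() {
  generic <- character()   # element names seen by the generic handler
  xValues <- character()   # x attributes seen by the myTag handler

  handlers <- list(
    # Generic handler for the start of an element.
    startElement = function(name, attrs, ...) {
      generic <<- c(generic, name)
    },
    # Tag-specific handler: its name matches the XML element name.
    myTag = function(x, attrs, ...) {
      xValues <<- c(xValues, attrs[["x"]])
    }
  )

  list(handlers = handlers,
       generic = function() generic,
       xValues = function() xValues)
}

doc <- '<root><myTag x="1"/><other/><myTag x="2"/></root>'
h <- makeHandlers()
invisible(xmlEventParse(doc, handlers = h$handlers,
                        asText = TRUE, useTagName = TRUE))
h$xValues()   # x attributes collected from the <myTag> elements
h$generic()   # element names that reached the generic handler
```

Keeping the results in the enclosing function's environment, rather than in global variables, follows the closure style that the handlers argument is designed around.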
When the event parser is reading text nodes,
it may call the text handler function with different
sub-strings of the text within the node.
Essentially, the parser collects up to n characters into a buffer,
passes this as a single string to the text handler, and then continues
collecting more text until the buffer is full again or there is no more text.
Each sub-string is passed to the text handler in turn.
If trim is TRUE, it removes leading and trailing white
space from the substring before calling the text handler. If the
resulting text is empty and ignoreBlanks is TRUE,
then the text handler function is not called.
So the key thing to remember about dealing with text is that the entire text of a node may come in multiple separate calls to the text handler. A common idiom is to have the text handler concatenate the values it is passed in separate calls and to have the end element handler process the entire text and reset the text variable to be empty.
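The idiom just described might be sketched as follows. The helper `collectItemText`, the element name `item` and the sample document are assumptions made for illustration only. The text handler appends each fragment to a buffer, and the end element handler joins the fragments and resets the buffer.

```r
library(XML)

# Hypothetical helper: builds handlers that gather the full text of every
# <item> element, however many pieces the parser delivers it in.
collectItemText <- function() {
  buffer <- character()   # text fragments seen inside the current element
  items  <- character()   # one completed string per <item>

  handlers <- list(
    startElement = function(name, attrs, ...) {
      buffer <<- character()          # start afresh for each element
    },
    text = function(x, ...) {
      buffer <<- c(buffer, x)         # may be called several times per node
    },
    endElement = function(name, ...) {
      if (name == "item")
        items <<- c(items, paste(buffer, collapse = ""))
      buffer <<- character()          # reset for the next element
    }
  )

  list(handlers = handlers, items = function() items)
}

doc <- "<list><item>alpha</item><item>beta</item></list>"
collector <- collectItemText()
# trim = FALSE so that fragments are concatenated exactly as delivered.
invisible(xmlEventParse(doc, handlers = collector$handlers,
                        asText = TRUE, trim = FALSE))
collector$items()
```

Passing trim = FALSE here is a deliberate choice: trimming each fragment separately could drop white space that falls at a fragment boundary inside a longer text node.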
xmlEventParse(file, handlers = xmlEventHandler(),
              ignoreBlanks = FALSE, addContext = TRUE,
              useTagName = TRUE, asText = FALSE, trim = TRUE,
              useExpat = FALSE, isURL = FALSE,
              state = NULL, replaceEntities = TRUE, validate = FALSE,
              saxVersion = 1, branches = NULL,
              useDotNames = length(grep("^\\.", names(handlers))) > 0,
              error = xmlErrorCumulator(), addFinalizer = NA)
file |
the source of the XML content.
This can be a string giving the name of a file or remote URL,
the XML itself, a connection object, or a function.
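If this is a string and asText is TRUE, it is treated as the XML content itself; otherwise it is taken as the name of a file or URL.
If a connection is given, the parser incrementally reads one line at
a time by calling the function readLines on that connection each time it needs more input.
If invoking readLines for each line is unsuitable, one can instead supply a function, which the parser calls whenever it needs more input.
Support for connections and functions in this form is only provided if one is using libxml2 and not libxml version 1. |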
handlers |
a closure object that contains functions which will be invoked
as the XML components in the document are encountered by the parser.
The standard function or handler names include startElement, endElement,
text, comment, entityDeclaration, startDocument and endDocument.
The call signature for the entityDeclaration function was changed in
version 1.7-0. Note that in earlier versions, the C routine did not
invoke any R function and so no code will actually break.
Also, we have renamed The new signature is
If we are dealing with an internal entity,
the content will be the string containing
the value of the entity.
If we are dealing with an external entity,
then |
ignoreBlanks |
a logical value indicating whether text elements made up entirely of white space should be included in the resulting ‘tree’. |
addContext |
logical value indicating whether the callback functions in ‘handlers’ should be invoked with contextual information about the parser and the position in the tree, such as node depth, path indices for the node relative to the root, etc. If this is TRUE, each callback function should accept a ... argument. |
useTagName |
a logical value.
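If this is TRUE, then at the start of each XML element the parser first looks for a handler function whose name matches the element name (e.g. "myTag") and falls back to the generic startElement handler if there is none.
If the value is FALSE, only the generic startElement handler is called. |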
asText |
logical value indicating that the first argument, ‘file’, should be treated as the XML text to parse, not the name of a file. This allows the contents of documents to be retrieved from different sources (e.g. HTTP servers, XML-RPC, etc.) and still use this parser. |
trim |
whether to strip white space from the beginning and end of text strings. |
useExpat |
a logical value indicating whether to use the expat SAX parser, or to default to the libxml. If this is TRUE, the library must have been compiled with support for expat. See supportsExpat. |
isURL |
indicates whether the file argument refers to a URL rather than a local file. |
state |
an optional S object that is passed to the
callbacks and can be modified to communicate state between
the callbacks. If this is given, the callbacks should accept
an argument named .state. |
replaceEntities |
logical value indicating whether to substitute entity references with their text directly. This should be left as FALSE. The text still appears as the value of the node, but there is more information about its source, allowing the parse to be reversed with full reference information. |
saxVersion |
an integer value which should be either 1 or 2.
This specifies which SAX interface to use in the C code.
The essential difference is the number of arguments passed to the
startElement handler function(s). |
validate |
a logical indicating whether to use a validating parser or not, in other words whether to check the contents against the DTD specification. If this is TRUE, warning messages will be displayed about errors in the DTD and/or document, but the parsing will proceed unless a terminal error is encountered. Currently, this has no effect as the libxml2 parser uses a document structure to do validation. |
branches |
a named list of functions.
Each element identifies an XML element name.
If an XML element of that name is encountered in
the SAX stream, the stream is processed until the
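end of that element and an internal node (see xmlTreeParse) is created
and passed to the function for that element.
Note that the branches mechanism works top-down and does not
work for nested tags. If one specifies an element name in the
branches argument, nested elements of that name within a branch do not trigger the function again.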
One can cause the parser to collect a branch without identifying
the node within the See the file This is a two step process. In the future, we might make it so that the R function handling the start-element event could directly collect the branch and continue its operations without having to call another function asynchronously. |
useDotNames |
a logical value
indicating whether to use the
newer format for identifying general element function handlers
with the '.' prefix, e.g. .text, .comment, .startElement.
If this is FALSE, the older names without the prefix (e.g. text, comment, startElement) are expected. |
error |
a function that is called when an XML error is encountered.
This is called with 6 arguments and is described in |
addFinalizer |
a logical value or identifier for a C routine that controls whether we register finalizers on the internal node. |
This is now implemented using the libxml parser. Originally, this was implemented via the Expat XML parser by Jim Clark (http://www.jclark.com).
The return value is the ‘handlers’ argument. It is assumed that this is a closure and that the callback functions have manipulated variables local to it and that the caller knows how to extract this.
The libxml parser can read URLs via http or ftp.
It does not require the support of wget as used
in other parts of R, but uses its own facilities
to connect to remote servers.
The idea for the hybrid SAX/DOM mode where we consume tokens in the stream to create an entire node for a sub-tree of the document was first suggested to me by Seth Falcon at the Fred Hutchinson Cancer Research Center. It is similar to the XML::Twig module in Perl by Michel Rodriguez.
Duncan Temple Lang
http://www.w3.org/XML, http://www.jclark.com/xml
xmlTreeParse
xmlStopParser
XMLParserContextFunction
fileName <- system.file("exampleData", "mtcars.xml", package="XML")

# Print the name of each XML tag encountered at the beginning of each
# tag.
# Uses the libxml SAX parser.
xmlEventParse(fileName,
              list(startElement=function(name, attrs){
                     cat(name,"\n")
                   }),
              useTagName=FALSE, addContext = FALSE)

## Not run:
# Parse the text rather than a file or URL by reading the URL's contents
# and making it a single string. Then call xmlEventParse
xmlURL <- "http://www.omegahat.net/Scripts/Data/mtcars.xml"
xmlText <- paste(scan(xmlURL, what="",sep="\n"),"\n",collapse="\n")
xmlEventParse(xmlText, asText=TRUE)
## End(Not run)

# Using a state object to share mutable data across callbacks
f <- system.file("exampleData", "gnumeric.xml", package = "XML")
zz <- xmlEventParse(f,
                    handlers = list(startElement=function(name, atts, .state) {
                                      .state = .state + 1
                                      print(.state)
                                      .state
                                    }),
                    state = 0)
print(zz)

# Illustrate the startDocument and endDocument handlers.
xmlEventParse(fileName,
              handlers = list(startDocument = function() {
                                cat("Starting document\n")
                              },
                              endDocument = function() {
                                cat("ending document\n")
                              }),
              saxVersion = 2)

if(libxmlVersion()$major >= 2) {

  startElement = function(x, ...) cat(x, "\n")

  xmlEventParse(file(f), handlers = list(startElement = startElement))

  # Parse with a function providing the input as needed.
  xmlConnection =
    function(con) {
      if(is.character(con))
        con = file(con, "r")

      # Open the connection only if it is not already open for reading.
      if(!isOpen(con, "r"))
        open(con, "r")

      function(len) {
        # A negative length signals that the parser is finished.
        if(len < 0) {
          close(con)
          return(character(0))
        }

        x = character(0)
        tmp = ""
        # Skip over empty lines, remembering them as newlines.
        while(length(tmp) > 0 && nchar(tmp) == 0) {
          tmp = readLines(con, 1)
          if(length(tmp) == 0)
            break
          if(nchar(tmp) == 0)
            x = append(x, "\n")
          else
            x = tmp
        }
        if(length(tmp) == 0)
          return(tmp)

        x = paste(x, collapse="")
        x
      }
    }

  ff = xmlConnection(f)
  xmlEventParse(ff, handlers = list(startElement = startElement))

  # Parse from a connection. Each time the parser needs more input, it
  # calls readLines(<con>, 1)
  xmlEventParse(file(f), handlers = list(startElement = startElement))

  # using SAX 2
  h = list(startElement = function(name, attrs, namespace, allNamespaces){
             cat("Starting", name,"\n")
             if(length(attrs))
               print(attrs)
             print(namespace)
             print(allNamespaces)
           },
           endElement = function(name, uri) {
             cat("Finishing", name, "\n")
           })
  xmlEventParse(system.file("exampleData", "namespaces.xml", package="XML"),
                handlers = h, saxVersion = 2)

  # This example is not very realistic but illustrates how to use the
  # branches argument. It forces the creation of complete nodes for
  # elements named <b> and extracts the id attribute.
  # This could be done directly on the startElement, but this just
  # illustrates the mechanism.
  filename = system.file("exampleData", "branch.xml", package="XML")
  b.counter = function() {
    nodes <- character()
    f = function(node) { nodes <<- c(nodes, xmlGetAttr(node, "id"))}
    list(b = f, nodes = function() nodes)
  }

  b = b.counter()
  invisible(xmlEventParse(filename, branches = b["b"]))
  b$nodes()

  filename = system.file("exampleData", "branch.xml", package="XML")

  invisible(xmlEventParse(filename, branches = list(b = function(node) {
                                                      print(names(node))})))
  invisible(xmlEventParse(filename, branches = list(b = function(node) {
                                                      print(xmlName(xmlChildren(node)[[1]]))})))
}

############################################
# Stopping the parser mid-way and an example of using XMLParserContextFunction.
startElement =
  function(ctxt, name, attrs, ...) {
    print(ctxt)
    print(name)
    if(name == "rewriteURI") {
      cat("Terminating parser\n")
      xmlStopParser(ctxt)
    }
  }
class(startElement) = "XMLParserContextFunction"
endElement = function(name, ...) cat("ending", name, "\n")

fileName = system.file("exampleData", "catalog.xml", package = "XML")
xmlEventParse(fileName, handlers = list(startElement = startElement,
                                        endElement = endElement))