htmlParse(file, ignoreBlanks = TRUE, handlers = NULL, replaceEntities = FALSE,
asText = FALSE, trim = TRUE, validate = FALSE, getDTD = TRUE,
isURL = FALSE, asTree = FALSE, addAttributeNamespaces = FALSE,
useInternalNodes = TRUE, isSchema = FALSE, fullNamespaceInfo = FALSE,
encoding = character(),
useDotNames = length(grep("^\\.", names(handlers))) > 0,
xinclude = TRUE, addFinalizer = TRUE,
error = htmlErrorHandler, isHTML = TRUE,
options = integer(), parentFirst = FALSE)
xmlSchemaParse(file, asText = FALSE, xinclude = TRUE, error = xmlErrorCumulator())
Arguments
file
The name of the file containing the XML contents. This can contain ~, which is expanded to the user’s home directory. It can also be a URL; see isURL. Additionally, the file can be gzip-compressed, in which case it is read directly without the user having to decompress (gunzip) it first.
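As a minimal sketch of passing a file name, the following writes a small HTML page to a temporary file and parses it from disk. The file contents and variable names are made up for illustration; xpathSApply and xmlValue are other functions from the same package.

```r
library(XML)

# Write a tiny, made-up HTML page to a temporary file.
path <- tempfile(fileext = ".html")
writeLines(c("<html><head><title>Demo page</title></head>",
             "<body><p>Hello</p></body></html>"), path)

# 'file' is a path here, so asText keeps its default of FALSE.
doc <- htmlParse(path)

# Pull the text of the <title> element out of the parsed document.
page_title <- xpathSApply(doc, "//title", xmlValue)

unlink(path)
```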
ignoreBlanks
logical value indicating whether text elements made up entirely of white space should be included in the resulting ‘tree’.
handlers
Optional collection of functions used to map the different XML nodes to R objects. Typically, this is a named list of functions, and a closure can be used to provide local data. This provides a way of filtering the tree as it is being created in R, adding or removing nodes, and generally processing them as they are constructed in the C code.
In a recent addition to the package (version 0.99-8), if this is specified as a single function object, we call that function for each node (of any type) in the underlying DOM tree. It is invoked with the new node and its parent node. This applies to regular nodes and also comments, processing instructions, CDATA nodes, etc. So this function must be sufficiently general to handle them all.
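The named-list form with a closure might look like the sketch below. It uses xmlTreeParse, which is not among the signatures shown above and is assumed here to accept the same handlers argument, with R-level nodes (useInternalNodes = FALSE). The helper make_item_counter and the sample document are hypothetical.

```r
library(XML)

# Hypothetical helper: the handler and the accessor share the local
# variable 'n', which is the "local data" a closure provides.
make_item_counter <- function() {
  n <- 0L
  list(
    handlers = list(
      # Called for each <item> element; returning the node keeps it in the tree.
      item = function(node, ...) {
        n <<- n + 1L
        node
      }
    ),
    count = function() n
  )
}

txt <- "<list><item>a</item><item>b</item><other>c</other></list>"
counter <- make_item_counter()

# asTree = TRUE asks for the tree rather than the handlers object.
tree <- xmlTreeParse(txt, asText = TRUE, handlers = counter$handlers,
                     asTree = TRUE, useInternalNodes = FALSE)

n_items <- counter$count()
```

Keeping the accessor outside the handlers list avoids it being mistaken for a handler for nodes named count.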
replaceEntities
logical value indicating whether to substitute entity references with their text directly. This should be left as FALSE. The text still appears as the value of the node, but there is more information about its source, allowing the parse to be reversed with full reference information.
asText
logical value indicating that the first argument, ‘file’, should be treated as the XML text to parse, not the name of a file. This allows the contents of documents to be retrieved from different sources (e.g. HTTP servers, XML-RPC, etc.) and still use this parser.
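A short sketch of asText: the HTML below is an in-memory string standing in for content fetched from some other source, so nothing is read from disk. The string and variable names are illustrative only.

```r
library(XML)

# Made-up HTML held in a character string rather than in a file.
html <- "<html><body><p>first</p><p>second</p></body></html>"

# asText = TRUE tells the parser that 'file' is the content itself.
doc <- htmlParse(html, asText = TRUE)

# Collect the text of every <p> element.
paragraphs <- xpathSApply(doc, "//p", xmlValue)
```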
trim
whether to strip white space from the beginning and end of text strings.
validate
logical indicating whether to use a validating parser, that is, whether to check the contents against the DTD specification. If this is TRUE, warning messages are displayed about errors in the DTD and/or document, but parsing proceeds unless a fatal error is encountered. This is ignored when parsing an HTML document.
getDTD
logical flag indicating whether the DTD (both internal and external) should be returned along with the document nodes. This changes the return type. This is ignored when parsing an HTML document.
isURL
indicates whether the file argument refers to a URL (accessible via ftp or http) or a regular file on the system. If asText is TRUE, this should not be specified. The function attempts to determine whether the data source is a URL by using
asTree
this only applies when one passes a value for the handlers argument, and is then used to determine whether the DOM tree or the handlers object should be returned.
addAttributeNamespaces
a logical value indicating whether to return the namespace in the names of the attributes within a node or to omit them. If this is TRUE, an attribute such as xsi:type="xsd:string" is reported with the name xsi:type. If it is FALSE, the name of the attribute is type.
useInternalNodes
a logical value indicating whether to call the converter functions with objects of class XMLInternalNode rather than XMLNode. This should make things faster, as the contents of the internal nodes are not converted to explicit R objects. It also allows one to access the parent and ancestor nodes. However, since the objects refer to volatile C-level objects, one cannot store these nodes for use in further computations within R. They “disappear” after the processing of the XML document is completed.
If this argument is TRUE and no handlers are provided, the return value is a reference to the internal C-level document pointer. This can be used to do post-processing via XPath expressions using
This is ignored when parsing an HTML document.
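The sketch below shows post-processing an internal document with XPath. It assumes that htmlParse, called with its defaults and no handlers, returns a reference to the internal document; getNodeSet and xmlGetAttr are other functions from the same package, and the HTML is invented for the example.

```r
library(XML)

html <- paste0("<html><body>",
               "<a href='https://example.org/a'>A</a>",
               "<a href='https://example.org/b'>B</a>",
               "<a name='anchor-only'>no link</a>",
               "</body></html>")

# No handlers are supplied, so the result is the internal document.
doc <- htmlParse(html, asText = TRUE)
is_internal <- inherits(doc, "XMLInternalDocument")

# Select only the <a> elements that carry an href attribute.
links <- getNodeSet(doc, "//a[@href]")
hrefs <- vapply(links, xmlGetAttr, character(1), name = "href")
```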
isSchema
a logical value indicating whether the document is an XML schema (TRUE) and should be parsed as such using the built-in schema parser in libxml.
fullNamespaceInfo
a logical value indicating whether to provide the namespace URI and prefix on each node or just the prefix. The latter (FALSE) is currently the default as that was the original way the package behaved. However, using TRUE is more informative and we will make this the default in the future.
This is ignored when parsing an HTML document.
encoding
a character string (scalar) giving the encoding for the document. This is optional, as the document should contain its own encoding information. However, if it doesn’t, the caller can specify this for the parser. If the XML/HTML document does specify its own encoding, that value is used regardless of any value specified by the caller. (That’s just the way it goes!) So this is to be used as a safety net in case the document does not have an encoding and the caller happens to know the actual encoding.
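A small sketch of the safety-net use: the HTML fragment below declares no encoding of its own, so the caller states it. The string is made up, and the example assumes the R session can represent UTF-8 text.

```r
library(XML)

# Made-up fragment with a non-ASCII character and no declared encoding.
html <- "<html><body><p>caf\u00e9</p></body></html>"

# The document carries no encoding information, so supply it explicitly.
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

value <- xpathSApply(doc, "//p", xmlValue)
```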
useDotNames
a logical value indicating whether to use the newer format for identifying general element handler functions with the ‘.’ prefix, e.g. .text, .comment, .startElement. If this is FALSE, the older names text, comment, startElement, … are used. The older format causes problems when the document really does contain nodes named text, comment or startElement, because a node-specific handler is then confused with the general handler of the same name. Using TRUE means that your list of handlers should have names that use the ‘.’ prefix for these general element handlers. This is the preferred way to write new code.
xinclude
a logical value indicating whether to process nodes of the form <xi:include> (XInclude elements) to insert content from other parts of (potentially different) documents. TRUE means resolve the external references; FALSE means leave the node as is. Of course, one can process these nodes oneself after the document has been parsed, using handler functions or working on the DOM. Please note that the syntax for inclusion using XPointer is not the same as XPath, and the results can be a little unexpected and confusing. See the libxml2 documentation for more details.
addFinalizer
a logical value indicating whether the default finalizer routine should be registered to free the internal xmlDoc when R no longer has a reference to this external pointer object. This is only relevant when useInternalNodes is TRUE.
error
a function that is invoked when the XML parser reports an error. When an error is encountered, this is called with 7 arguments. See
If parsing completes and no document is generated, this function is called again with only one argument, which is a character vector of length 0. This gives the function an opportunity to report all the errors and raise an exception, rather than doing this when it sees the first one.
This function can do what it likes with the information. It can raise an R error or let the parser continue and potentially find further errors.
The default value of this argument supplies a function that cumulates the errors
If this is NULL, the default error handler function in the package
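A custom error function might be sketched as follows. The helper make_error_log is hypothetical; it accepts the message as its first argument, ignores the remaining arguments via ..., and stores messages instead of raising an error. Whether any messages are reported for this particular input depends on the underlying libxml2 version.

```r
library(XML)

# Hypothetical helper: collects parser messages in a closure.
make_error_log <- function() {
  msgs <- character()
  list(
    handler = function(msg, ...) {
      # A zero-length 'msg' is the final call described above; skip it.
      if (length(msg) > 0) msgs <<- c(msgs, as.character(msg))
      invisible(NULL)
    },
    messages = function() msgs
  )
}

errlog <- make_error_log()

# Deliberately sloppy HTML: the <b> element is never closed.
doc <- htmlParse("<html><body><p>Unclosed <b>bold</p></body></html>",
                 asText = TRUE, error = errlog$handler)

collected <- errlog$messages()
```

Because the handler never calls stop(), the parser is left to continue and recover as best it can.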
isHTML
a logical value that allows this function to be used for parsing HTML documents. This causes validation and processing of a DTD to be turned off. This is currently experimental so that we can implement htmlParse with this same function.
options
an integer value or vector of values that are combined (OR’ed) together to specify options for the XML parser. This is the same as the options parameter for
parentFirst
a logical value for use when we have handler functions and are traversing the tree. This controls whether we process the node before processing its children, or process the children before their parent node.