The XPath Query Search Engine
The JAST 2.5 Toolkit provides standard
components for searching XML documents in memory, using an implementation
of the W3C XPath standard. This implementation supports the
abbreviated subset of the W3C XPath language, rather than
the full XPath syntax, for the sake of providing a simpler, more
efficient search engine. It supports searching DOM-trees in order to find
sets of elements or attributes with particular names or filtered content.
Most useful XPath queries may be executed using just this subset.
This part of the user guide describes how to create an XPath
query and how to initiate searching by matching a query against a DOM-tree,
or against a collection of sub-trees of a DOM-tree. XPath
and all classes supporting the search engine are found in the package:
uk.ac.sheffield.jast.xpath .
Once you have understood the basic XPath searching concepts presented in
this quick-start guide, please refer to the
JAST 2.5 package APIs for more detailed
information.
The XPath Query Syntax
The World-Wide Web Consortium (W3C) specifies a standard for conducting
searches in XML trees (whether in memory or on disk) called XPath.
This standard specifies the syntax of a language for constructing pattern
expressions, which denote paths through an XML tree. The pattern expressions
are matched against an XML tree and select one or more nodes (if the match
succeeds) that match the pattern. The result may be a set of elements, of
attributes, or of other kinds of content, according to the pattern.
The full XPath pattern language syntax is quite rich, including
specifiers for which axis to explore, what kinds of node to select and
what predicates to apply to the nodes. The full syntax also includes a large
subset of the functions normally found in a programming language. The W3C
also defines an abbreviated syntax, which focuses on matching elements,
attributes and text by their names and values. Examples of this more convenient
abbreviated syntax include the following:
. |
selects the current node, known as the context |
.. |
selects the parent node of the context node (a relative path)
|
/. |
selects the document node containing the context node (an absolute
path, starting from the document root) |
//. |
selects all nodes in the current document (an absolute pattern
starting from the document root) |
Film |
selects the Film children of the context node (a
relative path) |
Film/Director |
selects the Director children of the Film
children of the context node (a relative path) |
/Catalogue/Film |
selects the Film children of the Catalogue
root element of the document node (an absolute path) |
@year |
selects the year attribute of the context node (a
relative path) |
Film/@year |
selects the year attributes of the Film
children of the context node (a relative path) |
//Film |
selects all Film descendants of the document node
(an absolute path) |
Catalogue//Title |
selects all Title descendants of the Catalogue
child of the context node (a relative path) |
Catalogue//@rating |
selects the rating attributes of all descendants of
the Catalogue child of the context node (a relative
path) |
/Catalogue/* |
selects all the element children of the Catalogue
root element of the document node (an absolute path) |
Film[@year] |
selects all Film children of the context node that
have a year attribute (a relative path) |
Film[@year>1976] |
selects all Film children of the context node, whose
year attribute has a value greater than 1976 (a
relative path) |
/Catalogue/*[Director] |
selects all the element children of the Catalogue
root element that have a Director child node (an
absolute path) |
Film[Director='George Lucas'] |
selects all Film children of the context node, whose
Director child node has the value "George Lucas"
(a relative path) |
/Catalogue/Film[2] |
selects the second Film child of the Catalogue root
element (an absolute path)
|
Catalogue/TVShow[last()] |
selects the last TVShow child of the Catalogue child
element of the context node (a relative path)
|
/Catalogue/Film[position()>2] |
selects the third to last Film children inclusive of
the Catalogue root element (an absolute path)
|
The XPath Search Engine
The main class of interest is XPath , which represents
a compiled XPath pattern and also contains the search engine for matching
a pattern against an XML memory tree. JAST only supports matching XPaths
against DOM-trees held in memory; it does not support matching against XML
files on disk. An XPath object is
created using the constructor XPath(String) , supplying the
XPath pattern string as the argument. Behind the scenes, this invokes a
compiler called XPathCompiler , which compiles the pattern
string into a sequence of navigation rules and filters, known as
XPathRules . When you execute an XPath query, these rules
are applied, one at a time, to the current context, yielding a new
context. The matching process is top-down and breadth-first, finishing
when the rules are exhausted, or no further nodes match the pattern.
To execute an XPath query, a program should construct an instance of
XPath with the desired query pattern, and then invoke one
of its matching-methods on the starting context for the match.
XPath provides two methods: match(Content node)
to initiate a query on a single Content context-node (where
Content is the abstract superclass of all DOM-tree nodes),
and match(List<Content> nodes) to initiate a query on
some collection of Content nodes. For example:
// Find all Film nodes under the root Catalogue node
XPath findFilms = new XPath("/Catalogue/Film");
List<Content> films = findFilms.match(document); // node context
// Find all year attributes of the Films in this list
XPath findYears = new XPath("@year");
List<Content> years = findYears.match(films); // list context
// Find the text of any Director node in the document
XPath findText = new XPath("//Director/text()");
List<Content> names = findText.match(document);
The first XPath searches for all Film elements under the
root element Catalogue of the document . This
returns a list of nodes. These are the new context for the second XPath
search, which returns all year attributes of the
Film s returned by the first search. The third example
shows how to return a list of Text nodes storing the values
of all the Director elements anywhere in the document.
If the same XPath
search pattern needs to be used many times, it is most efficient to
create the XPath object once, and re-use it many times;
otherwise the XPathCompiler will be invoked again to
recompile the string search pattern.
While the developer need not be concerned about how the various
XPathRule classes work, these consist of rules that navigate
one step along: the self-axis, or the attribute-axis,
or the child-axis, or the parent-axis or the
descendants-or-self-axis. In general, a rule may advance one step
along its chosen axis. If it contains a Filter , then this will
apply a constraint to the set of applicable nodes along the chosen
axis, such that only those nodes passing the constraint are returned in the next
cycle. A Filter applies a test to each applicable node, such as
testing its element-type, or its name, or constraining its value. See the
XML Filter Guide for more details
of how filters are used generally within the JAST toolkit.
Level of Compliance to W3C Standards
The JAST implementation of XPath supports the W3C abbreviated syntax for
most simple XPath patterns. It supports searching along the self-axis,
parent-axis, child-axis, attribute-axis and
descendants-or-self axis. It supports absolute paths starting from
the root and relative paths starting from the context node. It supports
selecting the context node, the parent of the context, attributes of the context,
the children of the context and all inclusive descendants of the context
(including the context node). In addition to selecting element-children (by
name), it supports the general node() and specific text() ,
comment() and processing-instruction()
child-content selectors. The wildcard * may be given for
attribute or element names as a whole, (but not used as part of a name). Matching
alternative paths is also supported, using the path combinator |
(with surrounding spaces).
The JAST XPath implementation supports predicates testing the self-axis,
the parent-axis, the attribute-axis, the child-axis and
the descendants-or-self-axis. Predicates between [...] may
include any embedded XPath query pattern, optionally terminating with an inequality
comparison. Predicates may be followed by further query patterns, if a deeper
search is desired, starting from the nodes that were filtered by the predicate.
Predicates consisting simply of path-names test that the context has
descendants with these names. Predicates terminating with attribute-selectors test
that the context has descendants with these attributes. Predicates terminating with
other content-selectors (such as text() or comment() ) test
that the context has descendants with this type of content. Predicates ending with
an inequality comparison perform a value-restriction on the value of the last element,
attribute or content selected. Value-restriction predicates may use any of the six
inequality operators. Compound predicates with operators or and
and are supported (with surrounding spaces), where and
may also be expressed implicitly using a sequence of predicates [...][...] .
Negation may be expressed using the function not(...) containing a
nested predicate expression. Position-based filtering is supported with the function
[position()<n] (using any of the six inequality operators). Forwards
indexing is supported using an index [n] , and backwards indexing using
the function [last()] and [last()-n] , where n
is an XPath index from 1..N.
This implementation does not support the W3C full syntax for XPath. It only
supports those axes which have a short-form in the W3C abbreviated XPath syntax.
It therefore does not support the preceding, following,
preceding-sibling, following-sibling, ancestor,
ancestor-or-self or descendant-axes, which have no equivalent
short-form. It does not yet support precedence re-ordering through patterns
grouped with (...) . Arbitrary arithmetical and string functions are
not supported. These restrictions were deliberately chosen to limit the
complexity of this XPath processor.
|