JAST: Java Abstract Syntax Trees

Natural Java idioms for processing XML data

You are here: JAST Home / User Guide / XPath Query /
Department of Computer Science

The XPath Query Search Engine

The JAST 2.5 Toolkit provides standard components for searching XML documents in memory, using an implementation of the W3C XPath standard. This implementation supports the abbreviated subset of the W3C XPath language, rather than the full XPath syntax, for the sake of providing a simpler, more efficient search engine. It supports searching DOM-trees in order to find sets of elements or attributes with particular names or filtered content. Most useful XPath queries may be executed using just this subset.

This part of the user guide describes how to create an XPath query and how to initiate searching by matching a query against a DOM-tree, or against a collection of sub-trees of a DOM-tree. XPath and all classes supporting the search engine are found in the package: uk.ac.sheffield.jast.xpath. Once you have understood the basic XPath searching concepts presented in this quick-start guide, please refer to the JAST 2.5 package APIs for more detailed information.

The XPath Query Syntax

The World-Wide Web Consortium (W3C) specifies a standard for conducting searches in XML trees (whether in memory or on disk) called XPath. This standard specifies the syntax of a language for constructing pattern expressions, which denote paths through an XML tree. The pattern expressions are matched against an XML tree and select one or more nodes (if the match succeeds) that match the pattern. The result may be a set of elements, of attributes, or of other kinds of content, according to the pattern.

The full XPath pattern language syntax is quite rich, including specifiers for which axis to explore, what kinds of node to select and what predicates to apply to the nodes. The full syntax also includes a large subset of the functions normally found in a programming language. The W3C also defines an abbreviated syntax, which focuses on matching elements, attributes and text by their names and values. Examples of this more convenient abbreviated syntax include the following:

. selects the current node, known as the context
.. selects the parent node of the context node (a relative path)
/. selects the document node containing the context node (an absolute path, starting from the document root)
//. selects all nodes in the current document (an absolute pattern starting from the document root)
Film selects the Film children of the context node (a relative path)
Film/Director selects the Director children of the Film children of the context node (a relative path)
/Catalogue/Film selects the Film children of the Catalogue root element of the document node (an absolute path)
@year selects the year attribute of the context node (a relative path)
Film/@year selects the year attributes of the Film children of the context node (a relative path)
//Film selects all Film descendants of the document node (an absolute path)
Catalogue//Title selects all Title descendants of the Catalogue child of the context node (a relative path)
Catalogue//@rating selects the rating attributes of all descendants of the Catalogue child of the context node (a relative path)
/Catalogue/* selects all the element children of the Catalogue root element of the document node (an absolute path)
Film[@year] selects all Film children of the context node that have a year attribute (a relative path)
Film[@year>1976] selects all Film children of the context node, whose year attribute has a value greater than 1976 (a relative path)
/Catalogue/*[Director] selects all the element children of the Catalogue root element that have a Director child node (an absolute path)
Film[Director='George Lucas'] selects all Film children of the context node, whose Director child node has the value "George Lucas" (a relative path)
/Catalogue/Film[2] selects the second Film child of the Catalogue root element (an absolute path)
Catalogue/TVShow[last()] selects the last TVShow child of the Catalogue child element of the context node (a relative path)
/Catalogue/Film[position()>2] selects the third to last Film children inclusive of the Catalogue root element (an absolute path)

The XPath Search Engine

The main class of interest is XPath, which represents a compiled XPath pattern and also contains the search engine for matching a pattern against an XML memory tree. JAST only supports matching XPaths against DOM-trees held in memory; it does not support matching against XML files on disk. An XPath object is created using the constructor XPath(String), supplying the XPath pattern string as the argument. Behind the scenes, this invokes a compiler called XPathCompiler, which compiles the pattern string into a sequence of navigation rules and filters, known as XPathRules. When you execute an XPath query, these rules are applied, one at a time, to the current context, yielding a new context. The matching process is top-down and breadth-first, finishing when the rules are exhausted, or no further nodes match the pattern.

To execute an XPath query, a program should construct an instance of XPath with the desired query pattern, and then invoke one of its matching-methods on the starting context for the match. XPath provides two methods: match(Content node) to initiate a query on a single Content context-node (where Content is the abstract superclass of all DOM-tree nodes), and match(List<Content> nodes) to initiate a query on some collection of Content nodes. For example:

	    // Find all Film nodes under the root Catalogue node
	XPath findFilms = new XPath("/Catalogue/Film");
	List<Content> films = findFilms.match(document);   // node context

	    // Find all year attributes of the Films in this list
	XPath findYears = new XPath("@year");
	List<Content> years = findYears.match(films);      // list context

	    // Find the text of any Director node in the document
	XPath findText = new XPath("//Director/text()");
	List<Content> names = findText.match(document);
The first XPath searches for all Film elements under the root element Catalogue of the document. This returns a list of nodes. These are the new context for the second XPath search, which returns all year attributes of the Films returned by the first search. The third example shows how to return a list of Text nodes storing the values of all the Director elements anywhere in the document. If the same XPath search pattern needs to be used many times, it is most efficient to create the XPath object once, and re-use it many times; otherwise the XPathCompiler will be invoked again to recompile the string search pattern.

While the developer need not be concerned about how the various XPathRule classes work, these consist of rules that navigate one step along: the self-axis, or the attribute-axis, or the child-axis, or the parent-axis or the descendants-or-self-axis. In general, a rule may advance one step along its chosen axis. If it contains a Filter, then this will apply a constraint to the set of applicable nodes along the chosen axis, such that only those nodes passing the constraint are returned in the next cycle. A Filter applies a test to each applicable node, such as testing its element-type, or its name, or constraining its value. See the XML Filter Guide for more details of how filters are used generally within the JAST toolkit.

Level of Compliance to W3C Standards

The JAST implementation of XPath supports the W3C abbreviated syntax for most simple XPath patterns. It supports searching along the self-axis, parent-axis, child-axis, attribute-axis and descendants-or-self axis. It supports absolute paths starting from the root and relative paths starting from the context node. It supports selecting the context node, the parent of the context, attributes of the context, the children of the context and all inclusive descendants of the context (including the context node). In addition to selecting element-children (by name), it supports the general node() and specific text(), comment() and processing-instruction() child-content selectors. The wildcard * may be given for attribute or element names as a whole, (but not used as part of a name). Matching alternative paths is also supported, using the path combinator | (with surrounding spaces).

The JAST XPath implementation supports predicates testing the self-axis, the parent-axis, the attribute-axis, the child-axis and the descendants-or-self-axis. Predicates between [...] may include any embedded XPath query pattern, optionally terminating with an inequality comparison. Predicates may be followed by further query patterns, if a deeper search is desired, starting from the nodes that were filtered by the predicate.

Predicates consisting simply of path-names test that the context has descendants with these names. Predicates terminating with attribute-selectors test that the context has descendants with these attributes. Predicates terminating with other content-selectors (such as text() or comment()) test that the context has descendants with this type of content. Predicates ending with an inequality comparison perform a value-restriction on the value of the last element, attribute or content selected. Value-restriction predicates may use any of the six inequality operators. Compound predicates with operators or and and are supported (with surrounding spaces), where and may also be expressed implicitly using a sequence of predicates [...][...]. Negation may be expressed using the function not(...) containing a nested predicate expression. Position-based filtering is supported with the function [position()<n] (using any of the six inequality operators). Forwards indexing is supported using an index [n], and backwards indexing using the function [last()] and [last()-n], where n is an XPath index from 1..N.

This implementation does not support the W3C full syntax for XPath. It only supports those axes which have a short-form in the W3C abbreviated XPath syntax. It therefore does not support the preceding, following, preceding-sibling, following-sibling, ancestor, ancestor-or-self or descendant-axes, which have no equivalent short-form. It does not yet support precedence re-ordering through patterns grouped with (...). Arbitrary arithmetical and string functions are not supported. These restrictions were deliberately chosen to limit the complexity of this XPath processor.

Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom