JAST: Java Abstract Syntax Trees

Natural Java idioms for processing XML data

You are here: JAST Home / User Guide / XML Filters /
Department of Computer Science

The XML Node Filtering Kit

The JAST 2.5 Toolkit provides standard filters for selecting subsets of nodes attached to DOM-trees in memory. The main components to use are the abstract class Filter and its concrete descendants. Behind the scenes, filters are used internally as part of the XPath search engine and also as part of the XML document validation engine. If the programmer wishes, filters can be used explicitly in programs that select subsets of nodes from XML DOM-trees.

This part of the user guide describes the XML DOM-tree filtering kit. All classes referred to here are found in the package: uk.ac.sheffield.jast.filter. Once you have got the basic idea of how to conduct filtering from this quick-start guide, please refer to the JAST 2.5 package APIs for more detailed information.

Filtering the XML Memory Model

Filters can be used to select subsets of nodes from an XML DOM-tree. Every XML DOM-tree is rooted in some node that conforms to the base type Content. This class provides an API for retrieving all the children, or subsets of the children of a node. Relevant methods in the API of Content include the following, some of which accept Filter arguments:

	public List<Content> getContents();         // all contents
	public List<Content> getContents(Filter);   // a filtered subset
	public Content getContent(int);             // node at index
	public Content getContent(Filter, int);     // filtered at index
where the kind of Filter supplied determines what types of node are returned. Filtering may return nodes of one or more specific types, or with a given name, or constraint on the value, or any combination of these. The Content API also includes methods for removing single or multiple children.
	public List<Content> removeContents(Filter); // remove a subset
	public Content removeContent(Content);       // remove a node
	public Content removeContent(int);           // remove at index

Removing a node, or list of nodes, returns the node, or node list, that was removed. Removed nodes no longer have a parent node, so may be reattached elsewhere, if desired. (XML nodes may only be attached to at most one parent; and attempting to attach a node to multiple parents will raise an exception).

Filters can be used freely by the programmer. They provide the most efficient Java implementation of how to filter subsets of nodes. They are used internally within the JAST software. For example, the methods of the Element class that return named children are encoded using NameFilter objects, to select Element children that have the desired name. The following calls are nearly identical (apart from the returned List-type) in their implementation:

	    // Access children in the Element API
	List<Element> children = element.getChildren("Person");

	    // Approximately equivalent implementation 
	List<Content> contents = element.getContents(
			new NameFilter("Person", Content.ELEMENT));
The point is that a NameFilter automatically determines whether to compare against the short element-name, or the full prefixed name, and interns the name-symbol to be compared, such that this can be done via a fast identity-check, rather than with the slower Java equals() method that is usually used to compare two Strings. It is almost always more efficient to use a Filter than to construct your own tests within a Java program.

The Filter Class Library

The main API class to use is Filter and its descendants. All Filter subclasses provide a method:

	public boolean accept(Content);  // does the filter accept the node?
which returns true if the particular filter accepts the node and false otherwise. This method is implemented differently in each subclass of Filter, according to the test that the filter is seeking to apply. Filters may be used individually, or in combination, using methods that logically combine filters in the most efficient way. The most commonly used filters are:

  • NodeFilter - filters for nodes of a specified content-type, or set of content-types.
  • NameFilter - filters for named nodes with a specified name (and optional content-type).
  • NameSetFilter - filters for named nodes whose name comes from a specified set (with optional content-type).
  • PrefixFilter - filters for named nodes with a specified namespace prefix (and optional content-type).
  • TextFilter - filters for text or data nodes that contain non-empty text (with optional content-type).
  • ValueFilter - filters for nodes with a specified exact value (and optional content-type).
  • RangeFilter - filters for nodes whose value comes from a specified range (with optional content-type).
  • RestrictFilter - filters for nodes with a specified restriction on their value (with optional content-type).
  • WidthFilter - filters for nodes with a specified restriction on their field width (with optional content-type).
  • TypeFilter - filters for nodes whose value is of a specific XSD or DTD basic type (with optional content-type).
  • NumberFilter - filters for nodes whose value satisfies an IEEE number type and range (with optional content-type).
  • PatternFilter - filters for nodes whose value satisfies a general regular expression (with optional content-type).
  • PositionFilter - filters for nodes whose index position satisfies a range constraint (with optional content-type).
  • ChildFilter - filters for nodes whose children satisfy a specified filter pattern (with optional content-type).
  • PropertyFilter - filters for nodes whose attributes satisfy a specified filter pattern (with optional content-type).
  • XPathFilter - filters for nodes that match a predicate expressed by an XPath query pattern.

Filters use the most efficient algorithms for matching nodes. For example, NodeFilter uses a bit-mask pattern to select nodes whose content-type matches the pattern. Comparison uses fast bitwise arithmetic on content-type identifiers. NameFilter compiles the name pattern to decide whether to match against a short name, or full identifier including the XML namespace prefix. All symbols are interned and compared using basic equality for speed. ValueFilter compiles a comparison operator and a reference value into a closure that may be applied to Element or Attribute values, which are converted to the correct types before comparison, and may be applied efficiently multiple times to many nodes. We emphasise is that it is almost always more efficient to use a Filter than to construct your own tests within a Java program.

Creating Compound Filters

Filters may be combined with other filters using logical combination methods that return the most suitable filter. The root class Filter declares the following API, which is suitably implemented in each descendant:

	public Filter and(Filter other); 	// Satisfy this and other
	public Filter or(Filter other); 	// Satisfy this or other
	public Filter not(); 			// Satisfy the complement

By default, these methods will construct a compound filter, which wraps one or more component filters (but sometimes they modify the simple filter instead). Compound filters use short-circuit Boolean evaluation to determine the outcome as efficiently as possible (viz. evaluating as few of the component filters as is necessary to determine the outcome). The compound filters are the following:

  • NotFilter - filters for nodes that do not satisfy the single component filter.
  • AndFilter - filters for nodes that satisfy both of the two component filters.
  • OrFilter - filters for nodes that satisfy either of the two component filters.

While compound filters may be constructed explicitly, it is usually more worthwhile to use the logical combination methods instead. Sometimes, these methods merge the constraints of the pair of filters, returning a more efficient single filter than any manually created compound filter. For example, whereas and() will by default return the AndFilter compound filter, when any filter is combined with a NodeFilter, this simply results in a specialised version of the same filter, accepting a more restricted content-type. Similarly, whereas not() will by default return a NotFilter, asking for the complement of a NotFilter will return the original filter. Similar rules apply De Morgan's Law over compound filters to simplify the levels of nesting and also to merge child and attribute constraints when combining multiple ChildFilter or PropertyFilter filters.

Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom