The XML Node Filtering Kit
The JAST 2.5 Toolkit provides standard filters
for selecting subsets of nodes attached to DOM-trees in memory. The main
components to use are the abstract class Filter and its concrete
descendants. Behind the scenes, filters are used internally as part of the
XPath search engine and also as part of the XML document validation engine.
If the programmer wishes, filters can be used explicitly in programs that select
subsets of nodes from XML DOM-trees.
This part of the user guide describes the XML DOM-tree filtering kit. All
classes referred to here are found in the package:
uk.ac.sheffield.jast.filter .
Once you have got the basic idea of how to conduct filtering from this
quick-start guide, please refer to the
JAST 2.5 package APIs for more detailed
information.
Filtering the XML Memory Model
Filters can be used to select subsets of nodes from an XML DOM-tree.
Every XML DOM-tree is rooted in some node that conforms to the base type
Content . This class provides an API for retrieving all the
children, or subsets of the children of a node. Relevant methods in the API
of Content include the following, some of which accept
Filter arguments:
public List<Content> getContents(); // all contents
public List<Content> getContents(Filter); // a filtered subset
public Content getContent(int); // node at index
public Content getContent(Filter, int); // filtered at index
where the kind of Filter supplied determines what types of node
are returned. Filtering may return nodes of one or more specific types, or
with a given name, or constraint on the value, or any combination of these.
The Content API also includes methods for removing single or
multiple children.
public List<Content> removeContents(Filter); // remove a subset
public Content removeContent(Content); // remove a node
public Content removeContent(int); // remove at index
Removing a node, or list of nodes, returns the node, or node list, that
was removed. Removed nodes no longer have a parent node, so may be
reattached elsewhere, if desired. (XML nodes may only be attached to
at most one parent; and attempting to attach a node to multiple parents
will raise an exception).
Filters can be used freely by the programmer. They provide the most
efficient Java implementation of how to filter subsets of nodes. They
are used internally within the JAST software. For example, the methods
of the Element class that return named children are encoded
using NameFilter objects, to select Element
children that have the desired name. The following calls are nearly
identical (apart from the returned List -type) in their
implementation:
// Access children in the Element API
List<Element> children = element.getChildren("Person");
// Approximately equivalent implementation
List<Content> contents = element.getContents(
new NameFilter("Person", Content.ELEMENT));
The point is that a NameFilter automatically determines
whether to compare against the short element-name, or the full prefixed name,
and interns the name-symbol to be compared, such that this can be done via a
fast identity-check, rather than with the slower Java equals()
method that is usually used to compare two Strings. It is almost always
more efficient to use a Filter than to construct your own tests
within a Java program.
The Filter Class Library
The main API class to use is Filter and its descendants.
All Filter subclasses provide a method:
public boolean accept(Content); // does the filter accept the node?
which returns true if the particular filter accepts the node
and false otherwise. This method is implemented differently
in each subclass of Filter , according to the test that the
filter is seeking to apply.
Filters may be used individually, or in combination, using methods that
logically combine filters in the most efficient way. The most commonly
used filters are:
NodeFilter - filters for nodes of a specified
content-type, or set of content-types.
NameFilter - filters for named nodes with a specified name
(and optional content-type).
NameSetFilter - filters for named nodes whose name comes
from a specified set (with optional content-type).
PrefixFilter - filters for named nodes with a specified
namespace prefix (and optional content-type).
TextFilter - filters for text or data nodes that contain
non-empty text (with optional content-type).
ValueFilter - filters for nodes with a specified exact
value (and optional content-type).
RangeFilter - filters for nodes whose value comes from a
specified range (with optional content-type).
RestrictFilter - filters for nodes with a specified
restriction on their value (with optional content-type).
WidthFilter - filters for nodes with a specified
restriction on their field width (with optional content-type).
TypeFilter - filters for nodes whose value is of a
specific XSD or DTD basic type (with optional content-type).
NumberFilter - filters for nodes whose value satisfies
an IEEE number type and range (with optional content-type).
PatternFilter - filters for nodes whose value satisfies
a general regular expression (with optional content-type).
PositionFilter - filters for nodes whose index position
satisfies a range constraint (with optional content-type).
ChildFilter - filters for nodes whose children satisfy
a specified filter pattern (with optional content-type).
PropertyFilter - filters for nodes whose attributes
satisfy a specified filter pattern (with optional content-type).
XPathFilter - filters for nodes that match a predicate
expressed by an XPath query pattern.
Filters use the most efficient algorithms for matching nodes. For
example, NodeFilter uses a bit-mask pattern to select nodes
whose content-type matches the pattern. Comparison uses fast bitwise
arithmetic on content-type identifiers. NameFilter
compiles the name pattern to decide whether to match against a short
name, or full identifier including the XML namespace prefix. All
symbols are interned and compared using basic equality for speed.
ValueFilter compiles a comparison operator and a reference
value into a closure that may be applied to Element or
Attribute values, which are converted to the correct
types before comparison, and may be applied efficiently multiple times
to many nodes. We emphasise is that it is almost always more efficient
to use a Filter than to construct your own tests within
a Java program.
Creating Compound Filters
Filters may be combined with other filters using logical combination
methods that return the most suitable filter. The root class
Filter declares the following API, which is suitably
implemented in each descendant:
public Filter and(Filter other); // Satisfy this and other
public Filter or(Filter other); // Satisfy this or other
public Filter not(); // Satisfy the complement
By default, these methods will construct a compound filter, which
wraps one or more component filters (but sometimes they modify the simple
filter instead). Compound filters use short-circuit
Boolean evaluation to determine the outcome as efficiently as possible
(viz. evaluating as few of the component filters as is necessary to
determine the outcome). The compound filters are the following:
NotFilter - filters for nodes that do not satisfy the
single component filter.
AndFilter - filters for nodes that satisfy both of the
two component filters.
OrFilter - filters for nodes that satisfy either of the
two component filters.
While compound filters may be constructed explicitly, it is usually more
worthwhile to use the logical combination methods instead. Sometimes, these
methods merge the constraints of the pair of filters, returning a more
efficient single filter than any manually created compound filter.
For example, whereas and() will by default return the
AndFilter compound filter, when any filter is combined
with a NodeFilter , this simply results in a specialised
version of the same filter, accepting a more restricted content-type.
Similarly, whereas not() will by default return a
NotFilter , asking for the complement of a NotFilter
will return the original filter. Similar rules apply De Morgan's
Law over compound filters to simplify the levels of nesting and
also to merge child and attribute constraints when combining multiple
ChildFilter or PropertyFilter filters.
|