![]() |
The Simons
|
This document describes the design rationale used in developing the Collection class hierarchy in the Simons Component Library. Collections are a fundamental kind of component in any software library. Common kinds of collection include Vector, Set and Map. From the outset, it was decided that the collections should be provided as a hierarchy of related classes, in the true spirit of object-oriented programming. These would be specialised according to their similarities and differences, such that obvious semantic differences between kinds of collection were reflected in the class hierarchy. Commonality would be captured in intermediate abstract classes, which could function as interfaces, capable of manipulating compatible collections. The mechanism for visiting all the elements of a collection should be both simple and secure.
From the outset, it was decided that the collections should be provided in the same way as other classes, related in a class hierarchy according to their similarities and differences. This is in the object-oriented spirit of Smalltalk, but contrasts with the C++ standard library, which implements each collection from scratch. A single classification hierarchy was adopted:
Collection is the abstract superclass of all collections of objects. The Collection class is the root of the collections hierarchy, which includes the Sequence, Unordered and Sorted collections. It defines the interface for adding, counting and visiting the elements of a Collection. The semantics of adding an element varies in each subclass, and may be position-sensitive, value-sensitive or disregard duplicates. Counting returns the size of the Collection. Elements may be visited in one pass, by iterating over the Collection using an index in the range 0..n-1 where n is the size of the Collection. Element membership is tested in a single pass, by default, but this is reimplemented more efficiently in descendant classes. Comparison of two Collections and hashing on a Collection are abstract, deferred until subclasses. These operations expect to return a result based on the values, and possibly also the ordering, of elements. All the elements of one Collection may be added to another, according to the semantics of adding.
Sequence is the abstract superclass of all sequential collections. The Sequence class is an intermediate class in the collections hierarchy and the superclass of all sequences, such as Vector, List and String. In a Sequence, the inserted order of elements is significant. Adding new elements appends these to the end of the Sequence. Elements are extrinsically ordered, according to their index position in a Sequence. The notions of a first and last element are meaningful. Elements may be visited in order via an index in the range 0..n-1, where n is the size of the Sequence. Sequence defines the interface for accessing, inserting, removing and replacing elements at index positions. Comparing two Sequences performs a lexicographic comparison of their elements. Hashing on a Sequence is sensitive to both the values and the order of the elements.
Unordered is the abstract superclass of all unordered collections. The Unordered class is an intermediate class in the collections hierarchy and the superclass of all unordered collections, such as Set, Bag and Map. In an Unordered collection, the inserted order of elements is not significant. Adding new elements includes them anywhere in the Unordered collection. The notions of element membership and a count of the number of occurrences are meaningful. Elements may be visited in an arbitrary order via an index in the range 0..n-1, where n is the size of the Unordered collection. Unordered defines the interface for seeking and counting specific elements. Unordered defines the interface for adding, removing and retaining single elements and collections of elements. Comparing two Unordered collections performs a symmetric test for inclusion. Hashing on an Unordered collection is sensitive only to the values, but not the order, of the elements.
Sorted is the abstract superclass of all sorted collections. The Sorted class is an intermediate class in the collections hierarchy and the superclass of all sorted collections, such as SortedSet and SortedList. In a Sorted collection, the order of elements is significant, determined by their relative magnitude. Adding new elements inserts them in sorted position with respect to other elements. Elements are intrinsically ordered, that is, they are logically stored in ascending order, based on the comparison of one element with its neighbours. The notions of a smallest and largest element are meaningful. Elements may be visited in order via an index in the range 0..n-1 where n is the size of the Sorted collection. Sorted defines the interface for seeking and counting specific elements. Sorted defines the interface for adding and removing elements and extracting the smallest and largest elements. Comparing two Sorted collections performs a lexicographical comparison of elements. Hashing on a Sorted collection is sensitive to the values of the elements (since their order is intrinsic).
The Simons Component Library took a fundamental design decision,
to represent the commonality between collections directly in the class
hierarchy. This means that the obvious Sequence
generalisation
over a Vector
and a List
is represented directly, by
making Vector
and List
subclasses of
Sequence
. Not only do classes like Collection
and
Sequence
capture in one place all the algorithms common to these
kinds of collection, but it is possible to substitute a List
or a
Vector
where any kind of Sequence
was expected.
By contrast, the standard C++ library implements all collections from scratch,
for the sake of maximum efficiency. Obvious generalisations,
such as the sequence
generalisation over a vector
and a list
only exist "conceptually", as textual specifications
describing the kinds of operation that any sequence is expected to provide.
(Other than this, the only commonality is provided by generic functions
that accept iterators).
In the SCL, collections may be assigned to any compatible variable type.
For example, a variable of the type SequenceID
may receive any
subclass of Sequence
, including Vector
,
List
and even String
, which is a bona fide
Sequence
; but not Set
or Map
, which are
subclasses of Unordered
. Any compatible collection stored
in a SequenceID
variable can be accessed through the methods
declared in the Sequence
interface; and by virtue of inheritance,
through methods declared in the Collection
interface also.
Flexible styles of coding may be used. For example, a program that needs
to use some kind of Sequence
may express this through use of
SequenceID
variables everwhere, without committing to any
particular representation. Not only can a late decision be made to use eg a
List
representation, but the program can even decide to change
the representation of the sequence partway through its execution, if it were
determined that a Vector
representation might be more efficient.
Converting between different types of collection is simple, for example:
SequenceID seq = new List;
...
seq = new Vector(seq);
...
List
into a Vector
, faithfully
transferring all the elements (by reference) in the same order as before.
Every concrete collection type has a constructor that accepts another
collection as argument. Many algorithms may be expressed simply by changing
the type of collection. For example, to find out how many distinct
elements there are in a Vector
, simply change this to a
Set
. Or, to count the occurrences of each kind of element in a
List
, simply change this to a Bag
.
The methods of the collections are bound dynamically. This supports a style
of coding in which general algorithms provided in the abstract classes
are inherited by default, but may be overridden in specific subclasses
where a more efficient implementation can be provided. For example,
Vector
and List
inherit all but four of their
methods from Sequence
, since the majority of methods may be
defined in terms of the getAt
, putAt
,
addAt
and removeAt
operations (following the
Template Method design pattern). However, in addition to defining
these four basic operations, Vector
redefines those operations
which access the back of the Vector
, since these may be provided
more efficiently. Likewise, List
redefines those operations that
access the front of the List
. These kinds of sequence select
the most efficient methods for their representation dynamically, when
accessed through a SequenceID
variable.
The elements stored in SCL collections may be any kind of Object
and elements are always stored by reference (following the standard SCL policy that
all components have reference semantics). This means that it is much simpler
to manage elements than in the standard C++ library, which has a mixed regime
of value and reference semantics, depending on how type parameters are instantiated.
For example, a standard vector<data>
reserves primary storage
for objects of the type data
, which are deallocated when the
vector
goes out of scope. If, in the same program, a standard
list<data*>
needs to manipulate the same elements, it must do
so via pointers, hence the pointer-type data*
used to instantiate the
type parameter. However, it is possible for the vector
to be
destroyed, with all its elements, before the list
has finished
with them, such that the program has dangling pointers and crashes.
All this is avoided in the SCL by the standard reference-counting mechanism.
When a Vector
receives elements, their reference counts are incremented
automatically when they are assigned to the ObjectID
variables
that make up the storage block of the Vector
. The elements will
have a reference count of at least 1 until the Vector
is
destroyed. If, in the same program, a List
manipulates the same
elements, these are passed by reference to the List
and likewise
their reference counts are incremented when they are stored. If the same
element is stored in both a Vector
and a List
, it
will have a reference count of at least 2. So, it does not matter whether the
Vector
or List
is destroyed first, since the element
will survive until its reference count reaches 0. The element may in any
case be stored in some other variable in the program and survive when both
collections have been destroyed.
Collections may also store values of the basic types, provided these are
wrapped, or "boxed up" as objects. The basic types are: Boolean
,
Character
, Natural
, Integer
and
Decimal
. A corresponding family of wrappers may be used to
wrap values of these types. The basic types are treated implicitly
as though they were subtypes of Value
, a subclass of
Object
. It is possible to assign simple values directly to
variables of the type ValueID
, for example:
ValueID val1 = 3; // constructs an IntegerBox
ValueID val2 = true; // constructs a BooleanBox
Object
interface, such as comparison and hashing,
are available. The simplest way to add basic values to collections is therefore:
SequenceID seq = new Vector;
seq->add( ValueID(3) );
seq->add( ValueID(true) );
Integer elem1 = seq->getAt(0)->toInteger();
Boolean elem2 = seq->getAt(1)->toBoolean();
Object
.
These methods raise a type failure exception if the value is not in fact of
the declared type.
One of the chief goals of the Simons Component Library is to provide simple
and natural programming idioms. This is reflected in the policy regarding
element access and iteration. Collections all support access to their
elements using an Integer
index. The root Collection
class provides an abstract method item(i)
which returns the ith
element of a collection. This means that the natural iteration idiom is a
standard for-loop:
for (Integer i = 0; i < size(); ++i)
... item(i) ...
Sequence
collections, the index is literally the extrinsic order,
or the position at which the element was inserted. For the Sorted
collections, the index denotes the intrinsic order, or the rank of the element
once it has been sorted with respect to other elements in ascending order.
For the Unordered
collections, the index has no meaning other than
to identify an element at a different location in the collection.
Apart from providing the most natural programming idiom, there were two further benefits to be gained from providing element access through an index. Firstly, it permits checking for out-of-range access. All bounds checks can be implemented cheaply as one-sided comparisons made after casting the Integer index to a Natural number:
if (Natural(index) < size()) ... // OK, carry on
Providing efficient access to a collection's elements via an index was an
interesting programming challenge. In other collection libraries, constant time
access is usually only offered for those collections which allocate their storage
as a single array block. In the SCL, it was desired to provide constant, or
near-constant time access to elements also in those collections that were
built from linked structures or hash-tables. This was achieved using internal
memoisation, using small memo-objects to map from indices to storage locations.
For all collections, repeated access to the object at the same index is granted
in constant time. Access to the object at the next index requires one internal
iteration step. This is provided in constant time for linked structures and
in amortised constant time for hashed structures. Forward traversal through
any collection is therefore in linear time. This is expected to be the most
common traversal idiom. The only collections that directly support backward
traversal and simultaneous traversals using multiple indices are those with
block storage, such as Vector
or String
. Efficient
work-arounds are possible to visit other kinds of collection in the same
logical orders.
Collections with memo-objects maintain the integrity of the map from index to element position under all other operations. The memo is kept if it can be determined that the map is valid after an insertion or removal of an element, otherwise it is reset and the next element access via an index will result in a linear scan of the collection to that index position. Many efficient insertion idioms are possible, such as splicing one collection into another in forward or reverse order, at an index position. Operations which combine a collection with itself are allowed, and generate a temporary clone. Operations which are inefficient, and therefore not advised, are backwards traversal and multiple traversal by many indices. These will reset the memo after each step, and yield quadratic-time traversal. If reverse traversal is desired, it is more efficient to reverse-copy the collection and forwards traverse the copy (two linear traversals, therefore linear time). If multiple traversal is desired, it is more efficient to clone the collection and traverse two copies (again, two linear traversals, so linear time). This is encouraged, since the cost of copying elements by reference is small. Ultimately, all traversals may be accomplished efficiently in linear time.
This documentation was created by Dr Anthony J H Simons, Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, United Kingdom.