The Simons
Component Library

Introduction Class Hierarchy Class Listing Index of Classes Index of Methods Header Files

The Collections Hierarchy

Classification Compatibility Element Management Checked Iteration

This document describes the design rationale used in developing the Collection class hierarchy in the Simons Component Library. Collections are a fundamental kind of component in any software library. Common kinds of collection include Vector, Set and Map. From the outset, it was decided that the collections should be provided as a hierarchy of related classes, in the true spirit of object-oriented programming. These would be specialised according to their similarities and differences, such that obvious semantic differences between kinds of collection were reflected in the class hierarchy. Commonality would be captured in intermediate abstract classes, which could function as interfaces, capable of manipulating compatible collections. The mechanism for visiting all the elements of a collection should be both simple and secure.

Classification of the Collections

View the: Collection class hierarchy

From the outset, it was decided that the collections should be provided in the same way as other classes, related in a class hierarchy according to their similarities and differences. This is in the object-oriented spirit of Smalltalk, but contrasts with the C++ standard library, which implements each collection from scratch. A single classification hierarchy was adopted:

Collection - all collections
- Sequence - the sequential collections
- Unordered - the unordered collections
- Sorted - the sorted collections

Collection is the abstract superclass of all collections of objects. The Collection class is the root of the collections hierarchy, which includes the Sequence, Unordered and Sorted collections. It defines the interface for adding, counting and visiting the elements of a Collection. The semantics of adding an element varies in each subclass, and may be position-sensitive, value-sensitive or disregard duplicates. Counting returns the size of the Collection. Elements may be visited in one pass, by iterating over the Collection using an index in the range 0..n-1 where n is the size of the Collection. Element membership is tested in a single pass, by default, but this is reimplemented more efficiently in descendant classes. Comparison of two Collections and hashing on a Collection are abstract, deferred until subclasses. These operations expect to return a result based on the values, and possibly also the ordering, of elements. All the elements of one Collection may be added to another, according to the semantics of adding.

Sequence

View the: Sequence class hierarchy

Sequence is the abstract superclass of all sequential collections. The Sequence class is an intermediate class in the collections hierarchy and the superclass of all sequences, such as Vector, List and String. In a Sequence, the inserted order of elements is significant. Adding new elements appends these to the end of the Sequence. Elements are extrinsically ordered, according to their index position in a Sequence. The notions of a first and last element are meaningful. Elements may be visited in order via an index in the range 0..n-1, where n is the size of the Sequence. Sequence defines the interface for accessing, inserting, removing and replacing elements at index positions. Comparing two Sequences performs a lexicographic comparison of their elements. Hashing on a Sequence is sensitive to both the values and the order of the elements.

Unordered

View the: Unordered class hierarchy

Unordered is the abstract superclass of all unordered collections. The Unordered class is an intermediate class in the collections hierarchy and the superclass of all unordered collections, such as Set, Bag and Map. In an Unordered collection, the inserted order of elements is not significant. Adding new elements includes them anywhere in the Unordered collection. The notions of element membership and a count of the number of occurrences are meaningful. Elements may be visited in an arbitrary order via an index in the range 0..n-1, where n is the size of the Unordered collection. Unordered defines the interface for seeking and counting specific elements. Unordered defines the interface for adding, removing and retaining single elements and collections of elements. Comparing two Unordered collections performs a symmetric test for inclusion. Hashing on an Unordered collection is sensitive only to the values, but not the order, of the elements.

Sorted

To be implemented - work in progress

Sorted is the abstract superclass of all sorted collections. The Sorted class is an intermediate class in the collections hierarchy and the superclass of all sorted collections, such as SortedSet and SortedList. In a Sorted collection, the order of elements is significant, determined by their relative magnitude. Adding new elements inserts them in sorted position with respect to other elements. Elements are intrinsically ordered, that is, they are logically stored in ascending order, based on the comparison of one element with its neighbours. The notions of a smallest and largest element are meaningful. Elements may be visited in order via an index in the range 0..n-1 where n is the size of the Sorted collection. Sorted defines the interface for seeking and counting specific elements. Sorted defines the interface for adding and removing elements and extracting the smallest and largest elements. Comparing two Sorted collections performs a lexicographical comparison of elements. Hashing on a Sorted collection is sensitive to the values of the elements (since their order is intrinsic).

Compatibility of the Collections

The Simons Component Library took a fundamental design decision, to represent the commonality between collections directly in the class hierarchy. This means that the obvious Sequence generalisation over a Vector and a List is represented directly, by making Vector and List subclasses of Sequence. Not only do classes like Collection and Sequence capture in one place all the algorithms common to these kinds of collection, but it is possible to substitute a List or a Vector where any kind of Sequence was expected. By contrast, the standard C++ library implements all collections from scratch, for the sake of maximum efficiency. Obvious generalisations, such as the sequence generalisation over a vector and a list only exist "conceptually", as textual specifications describing the kinds of operation that any sequence is expected to provide. (Other than this, the only commonality is provided by generic functions that accept iterators).

In the SCL, collections may be assigned to any compatible variable type. For example, a variable of the type SequenceID may receive any subclass of Sequence, including Vector, List and even String, which is a bona fide Sequence; but not Set or Map, which are subclasses of Unordered. Any compatible collection stored in a SequenceID variable can be accessed through the methods declared in the Sequence interface; and by virtue of inheritance, through methods declared in the Collection interface also.

Flexible styles of coding may be used. For example, a program that needs to use some kind of Sequence may express this through use of SequenceID variables everwhere, without committing to any particular representation. Not only can a late decision be made to use eg a List representation, but the program can even decide to change the representation of the sequence partway through its execution, if it were determined that a Vector representation might be more efficient. Converting between different types of collection is simple, for example:


SequenceID seq = new List;

   ...

seq = new Vector(seq);
 
   ...

will convert the List into a Vector, faithfully transferring all the elements (by reference) in the same order as before. Every concrete collection type has a constructor that accepts another collection as argument. Many algorithms may be expressed simply by changing the type of collection. For example, to find out how many distinct elements there are in a Vector, simply change this to a Set. Or, to count the occurrences of each kind of element in a List, simply change this to a Bag.

The methods of the collections are bound dynamically. This supports a style of coding in which general algorithms provided in the abstract classes are inherited by default, but may be overridden in specific subclasses where a more efficient implementation can be provided. For example, Vector and List inherit all but four of their methods from Sequence, since the majority of methods may be defined in terms of the getAt, putAt, addAt and removeAt operations (following the Template Method design pattern). However, in addition to defining these four basic operations, Vector redefines those operations which access the back of the Vector, since these may be provided more efficiently. Likewise, List redefines those operations that access the front of the List. These kinds of sequence select the most efficient methods for their representation dynamically, when accessed through a SequenceID variable.

Dynamic Element Management

The elements stored in SCL collections may be any kind of Object and elements are always stored by reference (following the standard SCL policy that all components have reference semantics). This means that it is much simpler to manage elements than in the standard C++ library, which has a mixed regime of value and reference semantics, depending on how type parameters are instantiated. For example, a standard vector<data> reserves primary storage for objects of the type data, which are deallocated when the vector goes out of scope. If, in the same program, a standard list<data*> needs to manipulate the same elements, it must do so via pointers, hence the pointer-type data* used to instantiate the type parameter. However, it is possible for the vector to be destroyed, with all its elements, before the list has finished with them, such that the program has dangling pointers and crashes.

All this is avoided in the SCL by the standard reference-counting mechanism. When a Vector receives elements, their reference counts are incremented automatically when they are assigned to the ObjectID variables that make up the storage block of the Vector. The elements will have a reference count of at least 1 until the Vector is destroyed. If, in the same program, a List manipulates the same elements, these are passed by reference to the List and likewise their reference counts are incremented when they are stored. If the same element is stored in both a Vector and a List, it will have a reference count of at least 2. So, it does not matter whether the Vector or List is destroyed first, since the element will survive until its reference count reaches 0. The element may in any case be stored in some other variable in the program and survive when both collections have been destroyed.

Collections may also store values of the basic types, provided these are wrapped, or "boxed up" as objects. The basic types are: Boolean, Character, Natural, Integer and Decimal. A corresponding family of wrappers may be used to wrap values of these types. The basic types are treated implicitly as though they were subtypes of Value, a subclass of Object. It is possible to assign simple values directly to variables of the type ValueID, for example:


ValueID val1 = 3;     // constructs an IntegerBox

ValueID val2 = true;  // constructs a BooleanBox

Each simple value will be wrapped in a suitable object type, whose only purpose is to allow the value to be treated like an object. Only those operations in the Object interface, such as comparison and hashing, are available. The simplest way to add basic values to collections is therefore:


SequenceID seq = new Vector;

seq->add( ValueID(3) );

seq->add( ValueID(true) );

To retrieve basic values from a collection is equally simple:


Integer elem1 = seq->getAt(0)->toInteger();

Boolean elem2 = seq->getAt(1)->toBoolean();

since a conversion method to each basic type is defined in Object. These methods raise a type failure exception if the value is not in fact of the declared type.

Checked Iteration over Elements

One of the chief goals of the Simons Component Library is to provide simple and natural programming idioms. This is reflected in the policy regarding element access and iteration. Collections all support access to their elements using an Integer index. The root Collection class provides an abstract method item(i) which returns the ith element of a collection. This means that the natural iteration idiom is a standard for-loop:


for (Integer i = 0; i < size(); ++i)

   ... item(i) ...

The meaning of the index varies from one collection to the next. For the Sequence collections, the index is literally the extrinsic order, or the position at which the element was inserted. For the Sorted collections, the index denotes the intrinsic order, or the rank of the element once it has been sorted with respect to other elements in ascending order. For the Unordered collections, the index has no meaning other than to identify an element at a different location in the collection.

Apart from providing the most natural programming idiom, there were two further benefits to be gained from providing element access through an index. Firstly, it permits checking for out-of-range access. All bounds checks can be implemented cheaply as one-sided comparisons made after casting the Integer index to a Natural number:


if (Natural(index) < size()) ... // OK, carry on

Secondly, it means that all access to a collection's elements is mediated through the collection itself, rather than through separate iterators. This supports safer algorithms which may traverse and update collections at the same time. By contrast, the standard C++ library provides access to collections through iterators, generalisations of pointers into arrays. This means that all traversal idioms require a start-pointer and an off-the-end pointer as backstop; copying idioms require at least three pointers and all operations are typically unchecked. The fact that iterators break the encapsulation of collections and bypass their public interface was another reason for rejecting them in the SCL.

Providing efficient access to a collection's elements via an index was an interesting programming challenge. In other collection libraries, constant time access is usually only offered for those collections which allocate their storage as a single array block. In the SCL, it was desired to provide constant, or near-constant time access to elements also in those collections that were built from linked structures or hash-tables. This was achieved using internal memoisation, using small memo-objects to map from indices to storage locations. For all collections, repeated access to the object at the same index is granted in constant time. Access to the object at the next index requires one internal iteration step. This is provided in constant time for linked structures and in amortised constant time for hashed structures. Forward traversal through any collection is therefore in linear time. This is expected to be the most common traversal idiom. The only collections that directly support backward traversal and simultaneous traversals using multiple indices are those with block storage, such as Vector or String. Efficient work-arounds are possible to visit other kinds of collection in the same logical orders.

Collections with memo-objects maintain the integrity of the map from index to element position under all other operations. The memo is kept if it can be determined that the map is valid after an insertion or removal of an element, otherwise it is reset and the next element access via an index will result in a linear scan of the collection to that index position. Many efficient insertion idioms are possible, such as splicing one collection into another in forward or reverse order, at an index position. Operations which combine a collection with itself are allowed, and generate a temporary clone. Operations which are inefficient, and therefore not advised, are backwards traversal and multiple traversal by many indices. These will reset the memo after each step, and yield quadratic-time traversal. If reverse traversal is desired, it is more efficient to reverse-copy the collection and forwards traverse the copy (two linear traversals, therefore linear time). If multiple traversal is desired, it is more efficient to clone the collection and traverse two copies (again, two linear traversals, so linear time). This is encouraged, since the cost of copying elements by reference is small. Ultimately, all traversals may be accomplished efficiently in linear time.

This documentation was created by Dr Anthony J H Simons, Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, United Kingdom.

Generated on Mon Jan 09 14:14:34 2006 for The Simons Component Library by

1.3