
Tokenizer:

The Tokenizer identifies word boundaries in a text and returns them as byte offsets (or character positions), which are used as indices in the GDM database.
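As an illustration only, offset-based tokenization can be sketched as below. This is not the LaSIE implementation: the token pattern `TOKEN_RE` and the function `tokenize` are hypothetical names chosen for the example, and the offsets are Python character positions rather than byte offsets.

```python
import re

# Hypothetical token pattern (an assumption, not LaSIE's rules):
# a run of word characters, or a single non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return (start, end, token) triples; `end` is exclusive.

    Offsets are character positions in `text`, so text[start:end]
    always reproduces the token.
    """
    return [(m.start(), m.end(), m.group()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for start, end, token in tokenize("Hello, world."):
        print(start, end, token)
```

Storing offsets rather than copies of the tokens means later modules can refer back to the same spans of the original text. If byte offsets were needed instead, the positions would have to be computed over the encoded text, since the two differ for multi-byte characters.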

The Tokenizer also attempts to identify certain known text types - currently only Wall Street Journal and Reuters - the formats of which allow the identification of a document identifier, a header/body boundary, and section (or paragraph) boundaries. For Wall Street Journal texts, exclusion zones (as specified for MUC-6) are also recognised.

Unrecognised text types are treated as plain text, with no header/body boundary, and with blank lines assumed to indicate paragraph boundaries.
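The plain-text fallback described above can be sketched roughly as follows. The function name `paragraph_spans` is hypothetical, and the sketch assumes that a "blank line" is any line containing only whitespace; it is not taken from the LaSIE source.

```python
def paragraph_spans(text):
    """Return (start, end) character offsets of paragraphs in plain text.

    A paragraph is a maximal run of non-blank lines; blank lines
    (empty or whitespace-only, by assumption) separate paragraphs.
    `end` is exclusive and excludes the final line terminator.
    """
    spans = []
    start = None  # offset where the current paragraph began, if any
    end = 0       # offset just past the last non-blank line seen
    pos = 0       # offset of the line currently being examined
    for line in text.splitlines(keepends=True):
        if line.strip():
            if start is None:
                start = pos
            end = pos + len(line.rstrip("\r\n"))
        elif start is not None:
            # A blank line closes the paragraph in progress.
            spans.append((start, end))
            start = None
        pos += len(line)
    if start is not None:
        spans.append((start, end))
    return spans


if __name__ == "__main__":
    sample = "First line\nsecond line\n\n\nThird para\n"
    for s, e in paragraph_spans(sample):
        print(s, e, repr(sample[s:e]))
```

Consecutive blank lines are treated as a single boundary, so they never produce empty paragraphs.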



Gillian Callaghan 2000-03-29