The Tokenizer also attempts to identify certain known text types, currently only Wall Street Journal and Reuters. The formats of these text types allow it to identify a document identifier, a header/body boundary, and section (or paragraph) boundaries. For Wall Street Journal texts, exclusion zones (as specified for MUC-6) are also recognised.
Unrecognised text types are treated as plain text, with no header/body boundary, and with blank lines assumed to indicate paragraph boundaries.
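
The plain-text fallback described above might be sketched as follows. This is a minimal illustration in Python, not the Tokenizer's actual implementation: the function name `split_paragraphs`, the use of character offsets as output, and the treatment of whitespace-only lines as blank are all assumptions made for the example.

```python
def split_paragraphs(text):
    """Return (start, end) character offsets of paragraphs in plain text.

    Hypothetical helper: the whole text is treated as body (no header/body
    boundary), and one or more blank lines separate paragraphs. Lines that
    contain only whitespace are assumed to count as blank.
    """
    spans = []
    start = None  # offset where the current paragraph began, if any
    end = 0       # offset just past the last non-blank line seen
    offset = 0    # offset of the line currently being examined

    for line in text.splitlines(keepends=True):
        if line.strip():
            # Non-blank line: open a paragraph if none is open, extend its end.
            if start is None:
                start = offset
            end = offset + len(line.rstrip("\r\n"))
        elif start is not None:
            # Blank line: close the paragraph that was open.
            spans.append((start, end))
            start = None
        offset += len(line)

    # Close a paragraph that runs to the end of the text.
    if start is not None:
        spans.append((start, end))
    return spans
```

Each returned pair can be used to slice the original text, for example `text[start:end]`, so the paragraph boundaries stay expressed in terms of the unmodified input. Consecutive blank lines yield a single boundary rather than empty paragraphs.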