REVEAL:
The identification of anomalous segments in text on a large scale.

See anything anomalous?

Introduction

Developing technology to identify anomalous information in text will enable a wide range of applications. This project proposes an investigation into automatic techniques for identifying textual material that is in some way inconsistent with its surrounding context. Such material may range from messages unrelated to the normal topics (for example, non-work-related email on work email accounts, or spam on newsgroups), through nonsensical out-of-context messages (spam again, disguised text, translation errors, etc.), to variation in style or authorship. Our goal is to develop an NLP technique for detecting when language usage is out of the norm.

One application of this research may be identifying potentially threatening information about illegal activities in email messages or on a bulletin board. Current technology is largely limited to spotting words or phrases in the discussion that explicitly indicate a danger. As this technology is now reasonably sound, we may assume that people wanting to discuss subversive activities will avoid words that would trigger an automatic system to flag their message. In the case of drug transactions, one thinks of dealers NOT using "heroin" or "cocaine" and, in the current bad guy scenario, we assume they are similarly NOT going to use the explicit names of explosives or dangerous agents they have or may be seeking. Our proposal concerns how one might learn to identify such discussions at an early stage, when they are conducted using only apparently innocent words, in the way that "horse" and "grass" were once apparently innocent names for drugs but later became as indicative as the originals.

