Next: Previous work on ML Up: Can we make Information Previous: Introduction

Background: The Information Extraction Context

Extracting and managing information has always been important for intelligence agencies, but it is clear that, in the next decade, technologies for these functions will also be crucial to education, medicine, and commerce. It is estimated that 80% of our information is textual, and Information Extraction (IE) has emerged as a new technology as part of the search for better methods of finding, storing, accessing and mining such information.

IE itself is an automatic method for locating important facts in electronic documents (e.g. newspaper articles, news feeds, web pages, transcripts of broadcasts, etc.) and storing them in a data base for processing with techniques like data mining, or with off-the-shelf products like spreadsheets, summarisers and report generators. The historic application scenario for Information Extraction is a company that wants, say, all ship sinkings recorded in public news wires in any language world-wide to be extracted and put into a single data base showing ship name, tonnage, date and place of loss, etc. Lloyd's of London had performed this particular task with human readers of the world's newspapers for a hundred years.

The key notion in IE is that of a "template": a linguistic pattern, usually a set of attribute-value pairs with text strings as the values, created by experts to capture the structure of the facts sought in a given domain. IE systems apply templates to text corpora with the aid of extraction rules that seek the fillers in the corpus, subject to a set of syntactic, semantic and pragmatic constraints.
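As a minimal sketch of the template idea, the ship-sinking scenario can be written as a set of attribute-value pairs filled by a single extraction rule. Everything named here is hypothetical and chosen only for illustration: the `ShipSinkingTemplate` class, the `SINKING_RULE` pattern and the `fill_template` function. A single regular expression stands in for the syntactic, semantic and pragmatic constraints that real extraction rules would apply.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShipSinkingTemplate:
    """Hypothetical template: attribute-value pairs whose values are text strings."""
    ship_name: Optional[str] = None
    tonnage: Optional[str] = None
    date: Optional[str] = None
    place: Optional[str] = None


# Hypothetical extraction rule: one surface pattern for one way of reporting
# a sinking. A real system would use many rules plus linguistic analysis.
SINKING_RULE = re.compile(
    r"The (?P<tonnage>[\d,]+)-ton(?:ne)? (?:ship|freighter|tanker|ferry) "
    r"(?P<ship_name>[A-Z][\w ]*?) sank "
    r"(?:off|near|in) (?P<place>[A-Z][\w ]*?) "
    r"on (?P<date>\d{1,2} \w+ \d{4})"
)


def fill_template(text: str) -> Optional[ShipSinkingTemplate]:
    """Return a filled template if the rule matches the text, else None."""
    match = SINKING_RULE.search(text)
    if match is None:
        return None
    # The named groups of the rule correspond one-to-one to the template slots.
    return ShipSinkingTemplate(**match.groupdict())


if __name__ == "__main__":
    sentence = "The 12,000-ton freighter Ocean Star sank off Cape Town on 3 March 1998."
    print(fill_template(sentence))
```

The filled template is a flat record, which is what makes it straightforward to store in a data base or pass to a spreadsheet or report generator.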

IE as a modern language processing technology was developed largely in the US, but with strong development centres elsewhere [18], [19], [30], [34], [27]. Over 25 systems worldwide have participated in the recent MUC competitions, most of which have a generic structure [34], and the previously unreliable tasks of automatically identifying names, dates, organizations, countries, and currencies (often referred to as TE, or Template Element, tasks) have become extremely accurate (over 95% accuracy for the best systems). In interpreting MUC figures, it should also be borne in mind that the overall recall and precision of human-provided IE information as a whole is estimated to be about 20% worse [16], [14], [15] than the best human performance; this was measured by comparing how well intelligence analysts perform the task manually against a "gold star" experienced intelligence analyst.
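To make the recall and precision figures concrete, the sketch below scores one system-filled template against an answer key. It is a simplified illustration under stated assumptions and not the official MUC scoring procedure: a slot counts as correct only on an exact string match, and the function name `precision_recall` and the slot names are hypothetical.

```python
from typing import Dict, Tuple


def precision_recall(system: Dict[str, str], key: Dict[str, str]) -> Tuple[float, float]:
    """Score system slot fillers against an answer key (exact match only).

    precision = correct fillers / fillers the system produced
    recall    = correct fillers / fillers in the answer key
    """
    correct = sum(1 for slot, value in system.items() if key.get(slot) == value)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(key) if key else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical answer key and system output for one ship-sinking report.
    key = {
        "ship_name": "Ocean Star",
        "tonnage": "12,000",
        "date": "3 March 1998",
        "place": "Cape Town",
    }
    system = {
        "ship_name": "Ocean Star",
        "tonnage": "12,000",
        "place": "Durban",  # wrong filler; the date slot is missing entirely
    }
    p, r = precision_recall(system, key)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this example the wrong filler lowers precision, while the missing slot lowers only recall, which is why the two figures are reported separately.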

Adaptivity in the MUC development context has meant the one-month period in which competing centres adapt their systems to new training data sets provided by DARPA; this period therefore provides a benchmark for human-only adaptivity of IE systems.

This paper addresses two problems. The first is the adaptivity of IE systems to new domains and genres, which is the central obstacle to the extension and acceptability of IE. The second is how to increase the principled multi-linguality of IE systems, which we take to mean extending their ability to extract information in one language and present it to a user in another.


Gillian Callaghan 2000-03-29