Given an IE system that performs an extraction task against texts in one language, it is natural to consider how to modify the system to perform the same task against texts in another. More generally, there may be a requirement to do the extraction task against texts in an arbitrary number of languages and to present results to a user who has no knowledge of the source language from which the information has been extracted. To minimise the language-specific alterations that need to be made in extending an IE system to a new language, it is important to separate the task-specific conceptual knowledge the system uses, which may be assumed to be language independent, from the language-dependent lexical knowledge the system requires, which unavoidably must be extended for each new language.
At Sheffield, we have adapted the architecture of the LaSIE system [26], an IE system originally designed to do monolingual extraction from English texts, to support a clean separation between conceptual and lexical information. This separation allows hard-to-acquire, domain-specific conceptual knowledge to be represented only once, and hence to be reused in extracting information from texts in multiple languages, while standard lexical resources can be used to extend language coverage. Preliminary experiments in extending the system to French and Spanish have produced substantial results, using an approach quite different from attaching a classic (monolingual) IE system to a machine translation (MT) system.
The M-LaSIE (multilingual) system relies on a robust domain model that constitutes the central exchange through which all multilingual information circulates. Adding a new language to the IE system consists mainly of mapping a new monolingual lexicon to the domain model and adding a new syntactic/semantic analysis front-end, with no interaction at all with the other languages in the system.
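The separation described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the class names (`DomainModel`, `Lexicon`), the `extract` function, the toy concept hierarchy and the word lists are all hypothetical, and the syntactic/semantic analysis front-end is reduced to a plain list of tokens. The point is only that the concept hierarchy is stated once, while each language contributes its own independent word-to-concept mapping.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DomainModel:
    """Language-independent concept hierarchy (hypothetical, for illustration)."""
    parents: Dict[str, Optional[str]] = field(default_factory=dict)

    def add_concept(self, concept: str, parent: Optional[str] = None) -> None:
        if parent is not None and parent not in self.parents:
            raise KeyError(f"unknown parent concept: {parent}")
        self.parents[concept] = parent

    def is_a(self, concept: str, ancestor: str) -> bool:
        # Walk up the hierarchy until the ancestor or the root is reached.
        current: Optional[str] = concept
        while current is not None:
            if current == ancestor:
                return True
            current = self.parents[current]
        return False


@dataclass
class Lexicon:
    """Language-dependent mapping from words to domain-model concepts."""
    language: str
    entries: Dict[str, str] = field(default_factory=dict)

    def map_word(self, word: str, concept: str, model: DomainModel) -> None:
        # A word may only be attached to a concept the domain model knows.
        if concept not in model.parents:
            raise KeyError(f"unknown concept: {concept}")
        self.entries[word.lower()] = concept


def extract(tokens: List[str], lexicon: Lexicon,
            model: DomainModel, target: str) -> List[str]:
    """Return the concepts of all tokens that fall under `target`."""
    hits = []
    for token in tokens:
        concept = lexicon.entries.get(token.lower())
        if concept is not None and model.is_a(concept, target):
            hits.append(concept)
    return hits


# Conceptual knowledge: defined once, shared by every language.
model = DomainModel()
model.add_concept("organisation")
model.add_concept("company", parent="organisation")
model.add_concept("person")

# Lexical knowledge: one lexicon per language, each mapped only to the model.
english = Lexicon("en")
english.map_word("firm", "company", model)
english.map_word("director", "person", model)

french = Lexicon("fr")
french.map_word("société", "company", model)
french.map_word("directeur", "person", model)

en_hits = extract("The firm appointed a director".split(), english, model, "organisation")
fr_hits = extract("La société a nommé un directeur".split(), french, model, "organisation")
```

In this sketch both `en_hits` and `fr_hits` contain the single concept `"company"`, so downstream processing never needs to know which language the text was in. Adding Spanish would mean creating one more `Lexicon` and leaving `model`, `english` and `french` untouched.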
The language-independent domain model can be compared to the use of an interlingua representation in MT (see, e.g., [35]). An IE system, however, does not require full generation capabilities from the intermediate representation, and the task is well specified by a limited `domain model' rather than a full, unrestricted `world model'. This makes an interlingua representation feasible for IE: it does not require solving all the problems of such a representation, only those directly relevant to the current IE task.
A French-Spanish-English prototype of this architecture has been implemented and successfully tested on a limited amount of data. The architecture has been further developed in the AVENTINUS project [21].