Fabio Ciravegna
Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
email: F.Ciravegna@dcs.shef.ac.uk
www: www.dcs.shef.ac.uk/~fabio/

Knowledge management is a key source of competitive advantage. The success or failure of a company can depend on its ability to find the right information at the right time and to integrate new information correctly with existing structured knowledge, in order to facilitate communication and knowledge sharing and to support knowledge-based organisations. The vast majority of information is textual; tools for structuring textual data on the basis of its content therefore represent one of the fundamental steps in managing information successfully. The Web explosion (and the increasing use of Internet/intranet technologies as a core channel for communication) shifts this need towards Web-based documents and texts.

Amilcare is a system for Information Extraction from Web documents for Knowledge Management that provides both accuracy and easy user customisation. It implements the (LP)2 algorithm [Ciravegna 2000, Ciravegna 2001a, Ciravegna 2001b]. It retains most of the characteristics of previous (LP)2 implementations, such as Learning Pinocchio, in terms of ease of use, but it includes a number of new features that:

Reduce the application development time (by reducing both the learning time and the number of training sessions needed for application tuning)
Support users in the whole application development process, from the initial task definition to the application delivery and use.

Supporting Users

Amilcare supports the user throughout the whole application development cycle, from design to delivery and even during post-marketing assistance, via its set of tools. Human-computer interaction experts and information extraction experts have worked together on the design of the tools for user support.

Application development is divided into the following steps:

  1. Application design: the goal of this step is to define a template, i.e. a kind of form that the system must fill with the extracted information. Amilcare provides a set of tools that help the user identify the correct application settings: a graphical interface for highlighting information in text examples, coupled with a set of methods for the semi-automatic organisation of information into templates and (in future releases) unsupervised methods that help identify the information present in the relevant documents. Because choosing a representative set of texts may be difficult, a number of statistical tools are provided for checking the representativeness of the corpus selected by the user, so as to avoid the (not infrequent) problem of poorly chosen examples.
  2. System training: in this phase the system learns how to extract information for a particular application by analysing a number of user-defined examples (i.e. a set of documents annotated with the information to be extracted). A simple graphical interface allows information to be highlighted with the mouse. Because providing examples can be tedious, Amilcare provides facilities for reducing the quantity of texts to be tagged via active learning, a strategy that may reduce the need for training examples by up to 80%.
  3. Result validation: a fundamental step in application development is the tuning of results according to the specific application needs. Given that a 100% accurate information extraction process is beyond the reach of current technology, it is necessary to balance the ability to find information (recall) against the precision of information identification, so as to identify the correct mix of precision and recall. Amilcare provides a set of tools for result monitoring, both from a qualitative point of view (inspecting the system results on a set of test texts with error highlighting) and from a statistical point of view (accuracy, precision, recall).
    Amilcare’s tuning interface is designed to bridge the user’s qualitative vision (“you are not capturing enough information”) with the numerical concepts the system is able to manipulate (e.g. moving error thresholds in order to obtain higher recall). CPU time needed for retuning is 1/10 of the initial learning time.
  4. Application delivery: once the system performance has been tuned to the application needs, the information extraction engine can be delivered as a black-box module to be integrated into the user environment. An API allows text feeding and result extraction.
  5. Post-marketing monitoring: Amilcare provides tools that are fundamental once the application has been delivered to the final user. They make it possible to statistically compare the corpus used at training/testing time, and the results obtained on it, with the corpus received for analysis and the results obtained on that corpus. This is fundamental because the kind of texts received can change over time (e.g. initially only very short texts are received, but then long texts start to appear), and the user must be sure that such a change (which may not be noticed by the system administrator) does not affect system performance. Moreover, Amilcare is able to statistically monitor its accuracy on new texts by measuring the statistical distribution of identified information across texts, and to issue a warning if that distribution differs radically from the one observed on the training corpus.
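
Two of the steps above can be illustrated with a small sketch: the precision/recall trade-off behind result validation (step 3) and the distribution check behind post-marketing monitoring (step 5). This is not Amilcare's code or API. The function names, the confidence scores, the example data, the use of slot-type frequencies as "the distribution of identified information", the total variation distance and the 0.3 warning limit are all assumptions made for illustration.

```python
from collections import Counter


def precision_recall(predicted, gold):
    """Precision and recall of a set of extracted items against a gold set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def extract(candidates, threshold):
    """Keep the candidates whose confidence score is at least `threshold`."""
    return {item for item, score in candidates if score >= threshold}


def slot_distribution(extractions):
    """Relative frequency of each slot type among the extracted items."""
    counts = Counter(slot for _, slot, _ in extractions)
    total = sum(counts.values())
    return {slot: n / total for slot, n in counts.items()} if total else {}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def drift_warning(training, incoming, limit=0.3):
    """True if the incoming slot distribution differs strongly from training.

    The limit of 0.3 is an arbitrary illustrative value.
    """
    distance = total_variation(slot_distribution(training),
                               slot_distribution(incoming))
    return distance > limit


# Hypothetical scored candidates: ((document, slot, filler), confidence).
candidates = [
    (("doc1", "speaker", "J. Smith"), 0.95),
    (("doc1", "location", "Room 12"), 0.80),
    (("doc2", "speaker", "A. Jones"), 0.60),
    (("doc2", "location", "Monday"), 0.40),  # an incorrect extraction
]
# Hypothetical manually annotated (gold) items for the same documents.
gold = {
    ("doc1", "speaker", "J. Smith"),
    ("doc1", "location", "Room 12"),
    ("doc2", "speaker", "A. Jones"),
    ("doc2", "location", "Hall B"),
}

# A strict threshold favours precision; a lenient one favours recall.
strict = precision_recall(extract(candidates, 0.7), gold)
lenient = precision_recall(extract(candidates, 0.3), gold)

# Hypothetical post-delivery output in which only speakers are found.
incoming = {("doc%d" % i, "speaker", "name %d" % i) for i in range(4)}
warn = drift_warning(gold, incoming)
```

With this toy data, the strict threshold keeps only correct items but misses half of the gold items, while the lenient threshold recovers more of them at the cost of one wrong extraction. That is the kind of trade-off a user request such as "you are not capturing enough information" translates into. The drift check fires because the incoming extractions contain no location slots at all, unlike the training data.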

The application development cycle is shown in the next figure.



Amilcare’s development is supported under the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), which is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing official policies or endorsements, either express or implied, of the EPSRC or any other member of the AKT IRC.

Amilcare is based on Gate, a tool for architectures for language engineering developed at the University of Sheffield. Gate is used for preprocessing texts, i.e. for tokenization, sentence identification, part of speech tagging and gazetteer lookup.
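
To make the preprocessing steps concrete, the following is a minimal plain-Python sketch of sentence identification, tokenization and gazetteer lookup. It does not use Gate and does not reproduce its API. The regular expressions, the function names and the gazetteer entries are assumptions chosen for illustration, and part-of-speech tagging is left out because it requires a trained tagger.

```python
import re

# Hypothetical gazetteer: lower-cased surface form -> semantic class.
GAZETTEER = {"sheffield": "location", "monday": "day"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def annotate(text):
    """Return, per sentence, a list of (token, gazetteer class or None)."""
    return [
        [(token, GAZETTEER.get(token.lower())) for token in tokenize(sentence)]
        for sentence in split_sentences(text)
    ]


annotated = annotate("The seminar is in Sheffield. It starts Monday!")
```

Each token ends up paired with the class found in the gazetteer, or with `None` when there is no entry. Annotations of this kind are what a learning algorithm can then use as features when inducing extraction rules.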
