This paper is concerned with systematically exploring and evaluating a range of possible boolean
retrieval strategies for use within a Question Answering (QA) system. We firstly set out two
evaluation metrics - coverage and recall - which are specifically designed for use in evaluating
retrieval performance in a QA context, and apply these measures in quantifying the performance of
some standard ranked retrieval systems for this purpose. We then consider a series of possible
boolean retrieval strategies for use in QA, which concern the way that boolean queries are
generated from questions to retrieve passages relevant to finding the question's answer, and
evaluate their performance. This line of research should ultimately lead to an increased understanding
of how best to formulate retrieval strategies for QA and of which component methods can usefully
contribute to such strategies.
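As a rough illustration of the coverage metric named above (a minimal sketch, not the paper's evaluation code): coverage at rank n can be read as the fraction of questions for which at least one of the top n retrieved passages contains a known answer string. The function names, naive substring test and toy data below are assumptions made purely for the example.

```python
def contains_answer(passage, answers):
    """Naive answer-bearing test: case-insensitive substring match."""
    text = passage.lower()
    return any(a.lower() in text for a in answers)

def coverage_at_n(ranked_passages, gold_answers, n):
    """Fraction of questions for which at least one of the top-n ranked passages
    contains a known answer string."""
    hits = sum(1 for qid, passages in ranked_passages.items()
               if any(contains_answer(p, gold_answers[qid]) for p in passages[:n]))
    return hits / len(ranked_passages)

# Toy usage: one question answered within the top 2 passages, one not.
passages = {"q1": ["Mozart was born in 1756.", "He lived in Vienna."],
            "q2": ["The Danube flows through ten countries."]}
answers = {"q1": {"1756"}, "q2": {"Amazon"}}
print(coverage_at_n(passages, answers, 2))   # -> 0.5
```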
A Pattern Based Approach to Answering Factoid, List and Definition Questions
Mark A. Greenwood and Horacio Saggion. A Pattern Based Approach to Answering Factoid, List and Definition Questions. In Proceedings of the 7th RIAO
Conference (RIAO 2004), Avignon, France, 26-28 April, 2004.
Finding textual answers to open-domain questions in large text collections is a difficult problem.
In this paper we concentrate on three types of questions: factoid, list, and definition questions
and present two systems that make use of regular expression patterns and other devices in order to
locate possible answers. While the factoid and list system acquires patterns in an off-line phase
using the Web, the definition system uses a library of patterns identified by corpus analysis to
acquire knowledge in an on-line stage for each new question. Results are reported over the
question sets from the question answering track of the 2002 and 2003 Text REtrieval Conference (TREC).
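To make the surface pattern idea concrete, here is a small sketch (not the systems described in the paper) that instantiates a library of regular expression templates with the question term and collects scored matches. The templates, confidence scores and function names are invented for illustration only.

```python
import re

# Invented templates for "When was X born?" style questions; {Q} is filled with the
# question term before matching.
BIRTH_PATTERN_TEMPLATES = [
    (r"{Q}\s*\(\s*(\d{{4}})\s*-", 0.9),            # "Mozart (1756 - 1791)"
    (r"{Q} was born (?:in|on) ([^.,]+)", 0.7),     # "Mozart was born in 1756"
]

def find_answers(question_term, passages):
    """Instantiate each template with the question term and collect scored matches."""
    scored = []
    for template, score in BIRTH_PATTERN_TEMPLATES:
        pattern = re.compile(template.format(Q=re.escape(question_term)))
        for passage in passages:
            for match in pattern.finditer(passage):
                scored.append((match.group(1).strip(), score))
    return sorted(scored, key=lambda item: -item[1])

print(find_answers("Mozart", ["Wolfgang Amadeus Mozart (1756 - 1791) was a composer."]))
# -> [('1756', 0.9)]
```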
AnswerFinder: Question Answering from your Desktop
Mark A. Greenwood. AnswerFinder: Question Answering from your Desktop. In Proceedings of the 7th Annual Colloquium for the
UK Special Interest Group for Computational Linguistics (CLUK '04), University of Birmingham, UK, 6-7 January, 2004.
For many years Internet search engines have made a valiant attempt at quickly finding documents
relevant to a user's query. As the size of the Internet continues to grow, however, these search
engines are returning more and more documents for a single query, leaving the user to wade
through a vast amount of text. What is required is a new range of systems that are not only easy
to use but are capable of returning just the answer to the user's question.
The AnswerFinder application outlined in this paper attempts to meet both of these requirements.
The University of Sheffield's TREC 2003 Q&A Experiments
Robert Gaizauskas, Mark A. Greenwood, Mark Hepple, Ian Roberts, Horacio Saggion and Matthew Sargaison. The University of Sheffield's TREC 2003 Q&A Experiments.
In Proceedings of the 12th Text REtrieval Conference, 2003.
The systems entered by the University of Sheffield in the question answering track of previous TRECs have
been developments of the system first entered in TREC 8 (Humphreys et al., 1999). Although a range of improvements
have been made to the system over the last four years (Scott and Gaizauskas, 2000; Greenwood
et al., 2002), none has resulted in a significant performance increase. For this reason it was decided to approach
the TREC 2003 evaluation more as a learning experience than as a forum in which to promote a particular
approach to QA. We view this as the beginning of a process that will lead to much fuller appreciation
of how to build more effective QA systems.
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering
Mark A. Greenwood and Robert Gaizauskas. Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering.
In Proceedings of the Workshop on Natural Language Processing for Question Answering (EACL03), pages 29-34, Budapest,
Hungary, April 14, 2003.
This paper explores one particular limitation common to question answering systems which
operate by using induced surface matching text patterns - namely the problems caused by question-specific
words appearing within the induced answer extraction patterns.
We suggest a solution to this problem by generalising the learned answer extraction patterns
to include named entity classes.
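The generalisation step can be illustrated with a toy sketch (not the paper's code) in which literal tokens recognised by a named entity tagger are replaced by their entity class, so a learned pattern no longer depends on the words of the training question. The hard-coded lookup below stands in for a real tagger and is an assumption made for the example.

```python
# Toy "tagger": in reality a named entity recogniser would supply these classes.
TOY_NE_TAGS = {
    "Mozart": "PERSON",
    "1756": "DATE",
    "Salzburg": "LOCATION",
}

def generalise(pattern_tokens):
    """Replace question- and answer-specific tokens with NE class placeholders."""
    return [TOY_NE_TAGS.get(tok, tok) for tok in pattern_tokens]

# A pattern learned from "Mozart was born in 1756" becomes applicable to any
# PERSON/DATE pair rather than only to Mozart.
print(generalise(["Mozart", "was", "born", "in", "1756"]))
# -> ['PERSON', 'was', 'born', 'in', 'DATE']
```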
The slides given above were presented at the workshop on the 14th of April; an alternative version
of the presentation was given to the Dialogue and Question Answering reading group on the 10th of April as a practice presentation.
The University of Sheffield TREC 2002 Q&A System
Mark A. Greenwood, Ian Roberts and Robert Gaizauskas. The University of Sheffield TREC 2002 Q&A System.
In Proceedings of the 11th Text REtrieval Conference, 2002.
The system entered by the University of Sheffield in the question answering track of TREC 2002
represents a significant development over the Sheffield system entered into TREC-8
and TREC-9, although the underlying architecture remains the same. The essence
of the approach is to pass the question to an information retrieval (IR) system which uses it
as a query to do passage retrieval against the text collection. The top ranked passages output
from the IR system are then passed to a modified information extraction (IE) system. Syntactic
and semantic analysis of these passages, along with the question, is carried out to identify the
"sought entity" from the question and to score potential matches for this sought
entity in each of the retrieved passages. The potential matches are then combined or discarded
based on a number of criteria. The highest scoring match is then proposed as the answer to the
question.
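A toy, self-contained sketch of this kind of pipeline is given below. It only mirrors the control flow described in the abstract (passage retrieval, analysis of the passages, scoring of candidate entities); every heuristic, entity list and scoring rule in it is invented for the example and is far simpler than the actual Sheffield system.

```python
# Invented stand-ins for the IR and IE components.
TOY_ENTITIES = {"Mozart": "PERSON", "1756": "DATE", "Salzburg": "LOCATION"}
QUESTION_WORD_TO_TYPE = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}

def retrieve(question, collection, top_n=2):
    """Rank passages by crude word overlap with the question (the IR step)."""
    q_words = set(question.lower().split())
    return sorted(collection,
                  key=lambda p: len(q_words & set(p.lower().split())),
                  reverse=True)[:top_n]

def answer(question, collection):
    """Identify the sought entity type, then score candidates found in the passages."""
    sought = QUESTION_WORD_TO_TYPE.get(question.lower().split()[0])
    scores = {}
    for passage in retrieve(question, collection):
        for token in passage.replace(".", " ").replace("?", " ").split():
            if TOY_ENTITIES.get(token) == sought:
                scores[token] = scores.get(token, 0) + 1   # score = simple frequency
    return max(scores, key=scores.get) if scores else None

docs = ["Mozart was born in Salzburg in 1756.", "Beethoven was born in Bonn."]
print(answer("When was Mozart born", docs))   # -> 1756
```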
Question Answering - First Year Report
Mark A. Greenwood. Question Answering. First Year PhD Progress Report, Department of Computer Science,
The University of Sheffield, UK, October 2002.
This document reports the progress made during the first year of my studies into question answering, mainly
concerned with work undertaken to allow us to take part in TREC 2002. The document also includes a comprehensive
review of the history of question answering, covering early work in areas such as natural language interfaces to
databases and reading comprehension systems. The report concludes with an outline of a proposal for the next two
years of my studies, which aim to show the uses and benefits of natural language techniques to the field of
question answering by examining them against the backdrop of a simple pattern matching system, similar in idea
to those which have recently been shown to be highly successful.
This report was accepted as fulfilment of my responsibilities
as a first year PhD student at a transfer viva meeting on the 23rd of October 2002.
Information Extraction
Dependency Pattern Models for Information Extraction
Mark Stevenson and Mark A. Greenwood. Dependency Pattern Models for Information Extraction. Research on Language and Computation, 7(1):13-39, 2009.
Several techniques for the automatic acquisition of Information Extraction (IE) systems have used dependency trees to form the basis
of an extraction pattern representation. These approaches have used a variety of pattern models (schemes for representing IE patterns
based on particular parts of the dependency analysis). An appropriate pattern model should be expressive enough to represent
the information which is to be extracted from text without being overly complex. Previous investigations into the appropriateness of
the currently proposed models have been limited. This paper compares a variety of pattern models, including ones which have been
previously reported and variations of them. Each model is evaluated using existing data consisting of IE scenarios from two very
different domains (newswire stories and biomedical journal articles). The models are analysed in terms of their ability to
represent relevant information, number of patterns generated and performance on an IE scenario. It was found that the best
performance was observed from two models which use the majority of relevant portions of the dependency tree without including
irrelevant sections.
Saxon: An Extensible Multimedia Annotator
Mark A. Greenwood, José Iria and Fabio Ciravegna. Saxon: An Extensible Multimedia Annotator. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco, 2008.
This paper introduces Saxon, a rule based document annotator that is capable of processing and annotating several document formats and media,
both within and across documents. Furthermore, Saxon is readily extensible to support other input formats due to both its flexible rule
formalism and the modular plugin architecture of the Runes framework upon which it is built. In this paper we introduce the Saxon rule formalism
through examples aimed at highlighting its power and flexibility.
Doris: Managing Document-Based Knowledge in Large Organisations via Semantic Web Technologies
Ravish Bhagdev, Jonathan Butters, Sam Chapman, Aba-Sah Dadzie, Mark A. Greenwood, José Iria and Fabio Ciravegna. Doris: Managing Document-Based Knowledge in Large Organisations via Semantic Web Technologies. In Proceedings of the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference, 2007.
The acquisition, sharing and reuse of knowledge is a prime challenge in large organisations. Doris is a framework for defining Knowledge
Management applications based on Semantic Web technologies that enables flexible means of capturing knowledge and of searching and exploring the
knowledge and the documents in which it is contained. Applications of Doris are employed in the aerospace domain at Rolls-Royce plc and in the domain of
exploring and searching archives about 18th-century London.
A Task-Based Comparison of Information Extraction Pattern Models
Mark A. Greenwood and Mark Stevenson. A Task-Based Comparison of Information Extraction Pattern Models.
In Proceedings of the ACL2007 Workshop on Deep Linguistic Processing, Prague, Czech Republic, 2007.
Several recent approaches to Information Extraction (IE) have used dependency trees as the basis for an extraction pattern representation. These approaches have used a variety
of pattern models (schemes which define the parts of the dependency tree which can be used to form extraction patterns). Previous comparisons of these pattern models
are limited by the fact that they have used indirect tasks to evaluate each model. This limitation is addressed here in an experiment which compares four pattern models using
an unsupervised learning algorithm and a standard IE scenario. A wide variation is found between the models' performance, suggesting that one model is the
most useful for IE.
A Semi-Supervised Approach To Learning Relevant Protein-Protein Interaction Articles
Mark A. Greenwood and Mark Stevenson. A Semi-Supervised Approach To Learning Relevant Protein-Protein Interaction Articles.
In Proceedings of the Second BioCreAtIvE Challenge Workshop, Madrid, Spain, 2007.
This paper describes an Information Extraction system that can be used
to identify articles containing protein-protein interactions. The
approach relies on the automatic acquisition of dependency tree based
patterns which can be used to identify these interactions and
consequently select relevant documents. Evaluation shows an F-Score
performance of approximately 64%.
Improving Semi-Supervised Acquisition of Relation Extraction Patterns
Mark A. Greenwood and Mark Stevenson. Improving Semi-Supervised Acquisition of Relation Extraction Patterns.
In Proceedings of the Information Extraction Beyond The Document Workshop (COLING/ACL 2006), Sydney, Australia, 2006.
This paper presents a novel approach to the semi-supervised learning of Information Extraction patterns. The method
makes use of more complex patterns than previous approaches and determines their similarity using a measure inspired by recent
work using kernel methods (Culotta and Sorensen, 2004). Experiments show that the proposed similarity measure outperforms
a previously reported measure based on cosine similarity when used to perform binary relation extraction.
Comparing Information Extraction Pattern Models
Mark Stevenson and Mark A. Greenwood. Comparing Information Extraction Pattern Models.
In Proceedings of the Information Extraction Beyond The Document Workshop (COLING/ACL 2006), Sydney, Australia, 2006.
Several recently reported techniques for the automatic acquisition of Information Extraction (IE) systems have used dependency
trees as the basis of their extraction pattern representation. These approaches have used a variety of pattern
models (schemes for representing IE patterns based on particular parts of the dependency analysis). An appropriate model
should be expressive enough to represent the information which is to be extracted from text without being overly complicated.
Four previously reported pattern models are evaluated using existing IE evaluation corpora and three dependency
parsers. It was found that one model, linked chains, could represent around 95% of the information of interest without generating
an unwieldy number of possible patterns.
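For illustration, the sketch below (not taken from the paper) builds single-edge chains from a toy dependency tree given as (head, relation, dependent) triples and pairs chains that share the same verb root to form linked chains. Real chains can be arbitrarily long paths, and the textual encoding of the patterns here is an assumption made for the example.

```python
from itertools import combinations

# Toy dependency analysis of "COMPANY hired PERSON as POST".
EDGES = [
    ("hired", "subj", "COMPANY"),
    ("hired", "obj", "PERSON"),
    ("hired", "as", "POST"),
]

def chains(edges, root):
    """A chain runs from the root verb to one dependent (one edge deep in this toy)."""
    return [f"{root}-{rel}->{dep}" for head, rel, dep in edges if head == root]

def linked_chains(edges, root):
    """A linked chain joins two chains that share the same root verb."""
    return [f"{a} + {b}" for a, b in combinations(chains(edges, root), 2)]

print(linked_chains(EDGES, "hired"))
# ['hired-subj->COMPANY + hired-obj->PERSON',
#  'hired-subj->COMPANY + hired-as->POST',
#  'hired-obj->PERSON + hired-as->POST']
```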
Learning Information Extraction Patterns using WordNet
Mark Stevenson and Mark A. Greenwood. Learning Information Extraction Patterns using WordNet. In Proceedings of the
3rd International Conference of the Global WordNet Association (GWA'06), Jeju Island, Republic of Korea, 2006.
Information Extraction (IE) systems often use patterns to identify
relevant information in text but these are difficult and
time-consuming to generate manually. This paper presents a new
approach to the automatic learning of IE patterns which uses WordNet
to judge the similarity between patterns. The algorithm starts with a
small set of sample extraction patterns and uses a similarity metric,
based on a version of the vector space model augmented with
information from WordNet, to learn similar
patterns. This approach is found to perform better than a previously
reported method which relied on information about the distribution of
patterns in a corpus and did not make use of WordNet.
A Semantic Approach to IE Pattern Induction
Mark Stevenson and Mark A. Greenwood. A Semantic Approach to IE Pattern Induction. In Proceedings
of the 43rd Annual Conference of the Association of Computational Linguistics, 2005.
This paper presents a novel algorithm for the acquisition of Information Extraction patterns.
The approach makes the assumption that useful patterns will have similar meanings to those
already identified as relevant. Patterns are compared using a variation of the standard vector
space model in which information from an ontology is used to capture semantic similarity.
Evaluation shows this algorithm performs well when compared with a previously reported
document-centric approach.
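The flavour of the measure can be sketched as follows (an illustration, not the paper's implementation): a cosine-style similarity a^T W b / (|a||b|) in which a matrix W of pairwise semantic similarities between pattern elements replaces the identity matrix of the standard vector space model. The pattern elements, vectors and W values below are invented for the example.

```python
import numpy as np

ELEMENTS = ["fire_V", "dismiss_V", "company_N"]       # pattern element "vocabulary"
W = np.array([[1.0, 0.9, 0.1],                        # invented pairwise similarities
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

def semantic_similarity(a, b):
    """sim(a, b) = a^T W b / (|a| |b|); with W = I this is ordinary cosine similarity."""
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 1.0])   # pattern using fire_V and company_N
b = np.array([0.0, 1.0, 1.0])   # pattern using dismiss_V and company_N
print(semantic_similarity(a, b))  # 1.05, versus a plain cosine of 0.5
```

Note that, unlike plain cosine similarity, the value is not bounded above by 1 when off-diagonal entries of W are non-zero.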
Bioinformatics
Using Prior Information Attained From The Literature To Improve Ranking In Genome-Wide Association Studies
Mattias Johansson, Yaoyong Li, John Wakefield, Mark A. Greenwood, Thomas Heitz, Ian Roberts, Hamish Cunningham, Paul Brennan, Angus Roberts, James Mckay. Using Prior Information Attained From The Literature
To Improve Ranking In Genome-Wide Association Studies. Presented at the 59th Annual Meeting of The American Society of Human Genetics,
Honolulu, Hawaii, October 2009.
Advances in high-throughput genotyping have made it technically possible to analyze hundreds of thousands of
single nucleotide polymorphisms (SNPs) across the whole genome. Using this technology it is now feasible to
conduct genome-wide association studies (GWAS) aiming to investigate the majority of common genetic variation
and relate it to phenotypic differences, often to the risk of some disease. Whilst the price of GWAS assays
is decreasing rapidly, conducting a GWAS is still a very expensive exercise, typically requiring the genotyping of
several thousand subjects at several hundred euros per sample in order to gain sufficient statistical
power to distinguish the true association signals from the background noise. Recognizing that a large
proportion of GWAS findings reside near potential candidate genes for many of the investigated phenotypes,
we here explore means to incorporate prior information attained from the literature to improve ranking in GWAS.
We use this information to assign a crude prior probability of association for each SNP. The prior probabilities
are thereafter integrated with the association result from the GWAS and the SNPs are re-ranked according to
Bayesian false-discovery probability (BFDP). We show that this methodology improves the ranking substantially
for many known susceptibility loci, with examples from studies on lung cancer and cancer of the upper aerodigestive
tract (UADT). We have implemented this methodology in a web application where a user can specify a list of keywords
and receive priors for all SNPs of interest. These priors can thereafter be used to rank the SNPs according to the BFDP.
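The re-ranking idea can be caricatured with the deliberately simplified sketch below: a literature-derived prior probability of association is combined by Bayes' rule with a Bayes factor summarising the GWAS evidence, and SNPs are re-ranked by the resulting posterior. This is not the BFDP calculation used in the study, and all numbers in it are invented.

```python
def posterior(prior, bayes_factor):
    """P(association | data) from a prior and a Bayes factor BF = P(data|H1)/P(data|H0)."""
    odds = (prior / (1.0 - prior)) * bayes_factor
    return odds / (1.0 + odds)

snps = {                     # snp_id: (literature-derived prior, Bayes factor from GWAS)
    "rs0001": (0.05, 12.0),  # near a literature candidate gene, moderate signal
    "rs0002": (0.001, 30.0), # no literature support, stronger raw signal
}
reranked = sorted(snps, key=lambda s: posterior(*snps[s]), reverse=True)
print([(s, round(posterior(*snps[s]), 3)) for s in reranked])
# -> [('rs0001', 0.387), ('rs0002', 0.029)]
```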
Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System
Mark A. Greenwood, Mark Stevenson, Yikun Guo, Henk Harkema, and Angus Roberts. Automatically Acquiring a Linguistically
Motivated Genic Interaction Extraction System. In Proceedings of the 4th Learning Language in Logic Workshop (LLL05),
Bonn, Germany, 2005.
This paper describes an Information Extraction (IE) system to identify genic
interactions in text. The approach relies on the automatic acquisition of patterns
which can be used to identify these interactions. Performance is evaluated on
the Learning Language in Logic (LLL-05) workshop challenge task.
Miscellaneous
SUPPLE: A Practical Parser for Natural Language Engineering Applications
Robert Gaizauskas, Mark Hepple, Horacio Saggion, Mark A. Greenwood and Kevin Humphreys.
SUPPLE: A Practical Parser for Natural Language Engineering Applications. In Proceedings of the 9th
International Workshop on Parsing Technologies (IWPT '05), Vancouver, Canada, 2005.
We describe SUPPLE, a freely-available, open source natural language parsing system, implemented in Prolog, and designed
for practical use in language engineering (LE) applications. SUPPLE can be run as a stand-alone application, or as a component
within the GATE General Architecture for Text Engineering. SUPPLE is distributed with an example grammar that has
been developed over a number of years across several LE projects. This paper describes the key characteristics of the parser
and the distributed grammar.
SUPPLE: A Practical Parser for Natural Language Engineering Applications
Robert Gaizauskas, Mark Hepple, Horacio Saggion, Mark A. Greenwood and Kevin Humphreys.
SUPPLE: A Practical Parser for Natural Language Engineering Applications.
Technical report CS-05-08, Department of Computer Science, University of Sheffield, 2005.
We describe SUPPLE, a freely-available, open source natural language parsing system,
which is implemented in Prolog, and which has been designed for practical use in language
engineering applications. SUPPLE can be run as a stand-alone application, but is also available
as a component within the GATE General Architecture for Text Engineering. The description
covers the SUPPLE parsing approach and the grammar formalism used. SUPPLE is distributed
with a particular example grammar, which has been developed over a number of years through
use in numerous language engineering projects, and the key characteristics of this grammar
are also described.
Implementing A Vector Space Document Retrieval System
Mark A. Greenwood. Implementing a Vector Space Document Retrieval System.
Department of Computer Science, The University of Sheffield, UK, December 2001.
This paper describes the implementation of a vector space document retrieval system and the evaluation
of this system using the CACM document collection. The system is evaluated using both traditional techniques, i.e.
precision-recall curves, and more modern techniques which are founded in the way people use Internet search engines.
Both evaluation methods clearly show that retrieval systems which make use of stemming and stop word removal
are superior to those that do not, and that the TF.IDF document ranking algorithm is more successful than simply
using Term Frequency. A flaw in the original implementation of the system, which leads to some query words being
counted multiple times, is also evaluated with promising results. This shows that the vector space model in its
traditional form is not necessarily the best document retrieval technique currently available.
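A compact sketch of the kind of system described is given below: stop word removal plus TF.IDF weighting in a vector space, with stemming omitted to keep the example short. It is an illustration rather than the original implementation, and the toy stop word list and documents are invented.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "a", "to", "and", "in", "is"}

def tokens(text):
    """Lower-case, whitespace-tokenise and remove stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def retrieve(query, documents):
    """Score each document against the query using TF.IDF weights."""
    doc_tokens = [tokens(d) for d in documents]
    n = len(documents)
    idf = {t: math.log(n / sum(1 for other in doc_tokens if t in other))
           for doc in doc_tokens for t in doc}
    scores = []
    for i, doc in enumerate(doc_tokens):
        tf = Counter(doc)
        score = sum(tf[t] * idf[t] for t in set(tokens(query)) if t in tf)
        scores.append((score, documents[i]))
    return sorted(scores, reverse=True)

docs = ["the vector space model of retrieval",
        "stemming and stop word removal in retrieval",
        "a paper about question answering"]
print(retrieve("vector space retrieval", docs)[0])
# -> the first document, with the highest TF.IDF score
```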