This paper is concerned with systematically exploring and evaluating a range of possible boolean
retrieval strategies for use within a Question Answering (QA) system. We firstly set out two
evaluation metrics - coverage and recall - which are specifically designed for use in evaluating
retrieval performance in a QA context, and apply these measures in quantifying the performance of
some standard ranked retrieval systems for this purpose. We then consider a series of possible
boolean retrieval strategies for use in QA, which concern the way that boolean queries are
generated from questions to retrieve passages relevant to finding the question's answer, and
evaluate their performance. This line of research should ultimately lead to an increased understanding
of how best to formulate retrieval strategies for QA and of which component methods can usefully
contribute to such strategies.
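As a rough illustration of the coverage metric named above (a minimal sketch, not the paper's evaluation code): coverage at rank n can be read as the fraction of questions for which at least one of the top n retrieved passages contains a known answer string. The function names, naive substring test and toy data below are assumptions made purely for the example.

```python
def contains_answer(passage, answers):
    """Naive answer-bearing test: case-insensitive substring match."""
    text = passage.lower()
    return any(a.lower() in text for a in answers)

def coverage_at_n(ranked_passages, gold_answers, n):
    """Fraction of questions for which at least one of the top-n ranked passages
    contains a known answer string."""
    hits = sum(1 for qid, passages in ranked_passages.items()
               if any(contains_answer(p, gold_answers[qid]) for p in passages[:n]))
    return hits / len(ranked_passages)

# Toy usage: one question answered within the top 2 passages, one not.
passages = {"q1": ["Mozart was born in 1756.", "He lived in Vienna."],
            "q2": ["The Danube flows through ten countries."]}
answers = {"q1": {"1756"}, "q2": {"Amazon"}}
print(coverage_at_n(passages, answers, 2))   # -> 0.5
```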
A Pattern Based Approach to Answering Factoid, List and Definition Questions
Mark A. Greenwood and Horacio Saggion. A Pattern Based Approach to Answering Factoid, List and Definition Questions. In Proceedings of the 7th RIAO
Conference (RIAO 2004), Avignon, France, 26-28 April, 2004.
Finding textual answers to open-domain questions in large text collections is a difficult problem.
In this paper we concentrate on three types of questions: factoid, list, and definition questions
and present two systems that make use of regular expression patterns and other devices in order to
locate possible answers. While the factoid and list system acquires patterns in an off-line phase
using the Web, the definition system uses a library of patterns identified by corpus analysis to
acquire knowledge in an on-line stage for each new question. Results are reported over the
question sets from the question answering track of the 2002 and 2003 Text REtrieval Conference (TREC).
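To make the surface pattern idea concrete, here is a small sketch (not the systems described in the paper) that instantiates a library of regular expression templates with the question term and collects scored matches. The templates, confidence scores and function names are invented for illustration only.

```python
import re

# Invented templates for "When was X born?" style questions; {Q} is filled with the
# question term before matching.
BIRTH_PATTERN_TEMPLATES = [
    (r"{Q}\s*\(\s*(\d{{4}})\s*-", 0.9),            # "Mozart (1756 - 1791)"
    (r"{Q} was born (?:in|on) ([^.,]+)", 0.7),     # "Mozart was born in 1756"
]

def find_answers(question_term, passages):
    """Instantiate each template with the question term and collect scored matches."""
    scored = []
    for template, score in BIRTH_PATTERN_TEMPLATES:
        pattern = re.compile(template.format(Q=re.escape(question_term)))
        for passage in passages:
            for match in pattern.finditer(passage):
                scored.append((match.group(1).strip(), score))
    return sorted(scored, key=lambda item: -item[1])

print(find_answers("Mozart", ["Wolfgang Amadeus Mozart (1756 - 1791) was a composer."]))
# -> [('1756', 0.9)]
```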
AnswerFinder: Question Answering from your Desktop
Mark A. Greenwood. AnswerFinder: Question Answering from your Desktop. In Proceedings of the 7th Annual Colloquium for the
UK Special Interest Group for Computational Linguistics (CLUK '04), University of Birmingham, UK, 6-7 January, 2004.
For many years Internet search engines have made a valiant attempt at quickly finding documents
relevant to a user's query. As the size of the Internet continues to grow, however, these search
engines are returning more and more documents for a single query, leaving the user to wade
through a vast amount of text. What is required is a new range of systems that are not only easy
to use but are capable of returning just the answer to the user's question.
The AnswerFinder application outlined in this paper attempts to meet both of these requirements.
The University of Sheffield's TREC 2003 Q&A Experiments
Robert Gaizauskas, Mark A. Greenwood, Mark Hepple, Ian Roberts, Horacio Saggion and Matthew Sargaison. The University of Sheffield's TREC 2003 Q&A Experiments.
In Proceedings of the 12th Text REtrieval Conference, 2003.
The systems entered by the University of Sheffield in the question answering track of previous TRECs have
been developments of the system first entered in TREC 8 (Humphreys et al., 1999). Although a range of improvements
have been made to the system over the last four years (Scott and Gaizauskas, 2000; Greenwood
et al., 2002), none has resulted in a significant performance increase. For this reason it was decided to approach
the TREC 2003 evaluation more as a learning experience than as a forum in which to promote a particular
approach to QA. We view this as the beginning of a process that will lead to much fuller appreciation
of how to build more effective QA systems.
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering
Mark A. Greenwood and Robert Gaizauskas. Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering.
In Proceedings of the Workshop on Natural Language Processing for Question Answering (EACL03), pages 29-34, Budapest,
Hungary, April 14, 2003.
This paper explores one particular limitation common to question answering systems which
operate by using induced surface matching text patterns - namely the problems caused by question-specific
words appearing within the induced answer extraction patterns.
We suggest a solution to this problem by generalising the learned answer extraction patterns
to include named entity classes.
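The generalisation step can be illustrated with a toy sketch (not the paper's code) in which literal tokens recognised by a named entity tagger are replaced by their entity class, so a learned pattern no longer depends on the words of the training question. The hard-coded lookup below stands in for a real tagger and is an assumption made for the example.

```python
# Toy "tagger": in reality a named entity recogniser would supply these classes.
TOY_NE_TAGS = {
    "Mozart": "PERSON",
    "1756": "DATE",
    "Salzburg": "LOCATION",
}

def generalise(pattern_tokens):
    """Replace question- and answer-specific tokens with NE class placeholders."""
    return [TOY_NE_TAGS.get(tok, tok) for tok in pattern_tokens]

# A pattern learned from "Mozart was born in 1756" becomes applicable to any
# PERSON/DATE pair rather than only to Mozart.
print(generalise(["Mozart", "was", "born", "in", "1756"]))
# -> ['PERSON', 'was', 'born', 'in', 'DATE']
```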
The slides given above were presented at the workshop on the 14th of April; an alternative version
of the presentation was given to the Dialogue and Question Answering reading group on the 10th of April as a practice presentation.
The University of Sheffield TREC 2002 Q&A System
Mark A. Greenwood, Ian Roberts and Robert Gaizauskas. The University of Sheffield TREC 2002 Q&A System.
In Proceedings of the 11th Text REtrieval Conference, 2002.
The system entered by the University of Sheffield in the question answering track of TREC 2002
represents a significant development over the Sheffield system entered into TREC-8
and TREC-9, although the underlying architecture remains the same. The essence
of the approach is to pass the question to an information retrieval (IR) system which uses it
as a query to do passage retrieval against the text collection. The top ranked passages output
from the IR system are then passed to a modified information extraction (IE) system. Syntactic
and semantic analysis of these passages, along with the question, is carried out to identify the
"sought entity" from the question and to score potential matches for this sought
entity in each of the retrieved passages. The potential matches are then combined or discarded
based on a number of criteria. The highest scoring match is then proposed as the answer to the
question.
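A toy, self-contained sketch of this kind of pipeline is given below. It only mirrors the control flow described in the abstract (passage retrieval, analysis of the passages, scoring of candidate entities); every heuristic, entity list and scoring rule in it is invented for the example and is far simpler than the actual Sheffield system.

```python
# Invented stand-ins for the IR and IE components.
TOY_ENTITIES = {"Mozart": "PERSON", "1756": "DATE", "Salzburg": "LOCATION"}
QUESTION_WORD_TO_TYPE = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}

def retrieve(question, collection, top_n=2):
    """Rank passages by crude word overlap with the question (the IR step)."""
    q_words = set(question.lower().split())
    return sorted(collection,
                  key=lambda p: len(q_words & set(p.lower().split())),
                  reverse=True)[:top_n]

def answer(question, collection):
    """Identify the sought entity type, then score candidates found in the passages."""
    sought = QUESTION_WORD_TO_TYPE.get(question.lower().split()[0])
    scores = {}
    for passage in retrieve(question, collection):
        for token in passage.replace(".", " ").replace("?", " ").split():
            if TOY_ENTITIES.get(token) == sought:
                scores[token] = scores.get(token, 0) + 1   # score = simple frequency
    return max(scores, key=scores.get) if scores else None

docs = ["Mozart was born in Salzburg in 1756.", "Beethoven was born in Bonn."]
print(answer("When was Mozart born", docs))   # -> 1756
```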
Question Answering - First Year Report
Mark A. Greenwood. Question Answering. First Year PhD Progress Report, Department of Computer Science,
The University of Sheffield, UK, October 2002.
This document reports the progress made during the first year of my studies into question answering, mainly
concerned with work undertaken to allow us to take part in TREC 2002. The document also includes a comprehensive
review of the history of question answering, covering early work in areas such as natural language interfaces to
databases and reading comprehension systems. The report concludes with an outline of a proposal for the next two
years of my studies, which aim to show the uses and benefits of natural language techniques to the field of
question answering by examining them against the backdrop of a simple pattern matching system, similar in idea
to those which have recently been shown to be highly successful.
This report was accepted as fulfilment of my responsibilities
as a first year PhD student at a transfer viva meeting on the 23rd of October 2002.
Information Extraction
Dependency Pattern Models for Information Extraction
Mark Stevenson and Mark A. Greenwood. Dependency Pattern Models for Information Extraction. Research on Language and Computation, 7(1):13-39, 2009.
Several techniques for the automatic acquisition of Information Extraction (IE) systems have used dependency trees to form the basis
of an extraction pattern representation. These approaches have used a variety of pattern models (schemes for representing IE patterns
based on particular parts of the dependency analysis). An appropriate pattern model should be expressive enough to represent
the information which is to be extracted from text without being overly complex. Previous investigations into the appropriateness of
the currently proposed models have been limited. This paper compares a variety of pattern models, including ones which have been
previously reported and variations of them. Each model is evaluated using existing data consisting of IE scenarios from two very
different domains (newswire stories and biomedical journal articles). The models are analysed in terms of their ability to
represent relevant information, number of patterns generated and performance on an IE scenario. It was found that the best
performance was observed from two models which use the majority of relevant portions of the dependency tree without including
irrelevant sections.
Saxon: An Extensible Multimedia Annotator
Mark A. Greenwood, José Iria and Fabio Ciravegna. Saxon: An Extensible Multimedia Annotator. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco, 2008.
This paper introduces Saxon, a rule based document annotator that is capable of processing and annotating several document formats and media,
both within and across documents. Furthermore, Saxon is readily extensible to support other input formats due to both its flexible rule
formalism and the modular plugin architecture of the Runes framework upon which it is built. In this paper we introduce the Saxon rule formalism
through examples aimed at highlighting its power and flexibility.
Doris: Managing Document-Based Knowledge in Large Organisations via Semantic Web Technologies
Ravish Bhagdev, Jonathan Butters, Sam Chapman, Aba-Sah Dadzie, Mark A. Greenwood, José Iria and Fabio Ciravegna. Doris: Managing Document-Based Knowledge in Large Organisations via Semantic Web Technologies. In Proceedings of the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference, 2007.
The acquisition, sharing and reuse of knowledge is a prime challenge in large organisations. Doris is a framework for defining Knowledge
Management applications based on Semantic Web technologies that enables flexible means of capturing knowledge and of searching and exploring the
knowledge and the documents in which it is contained. Applications of Doris are employed in the aerospace domain at Rolls-Royce plc and in the domain of
exploring and searching archives about 18th-century London.
A Task-Based Comparison of Information Extraction Pattern Models
Mark A. Greenwood and Mark Stevenson. A Task-Based Comparison of Information Extraction Pattern Models.
In Proceedings of the ACL2007 Workshop on Deep Linguistic Processing, Prague, Czech Republic, 2007.
Several recent approaches to Information Extraction (IE) have used dependency trees as the basis for an extraction pattern representation. These approaches have used a variety
of pattern models (schemes which define the parts of the dependency tree which can be used to form extraction patterns). Previous comparisons of these pattern models
are limited by the fact that they have used indirect tasks to evaluate each model. This limitation is addressed here in an experiment which compares four pattern models using
an unsupervised learning algorithm and a standard IE scenario. A wide variation is found between the models' performance, suggesting that one model is the
most useful for IE.
A Semi-Supervised Approach To Learning Relevant Protein-Protein Interaction Articles
Mark A. Greenwood and Mark Stevenson. A Semi-Supervised Approach To Learning Relevant Protein-Protein Interaction Articles.
In Proceedings of the Second BioCreAtIvE Challenge Workshop, Madrid, Spain, 2007.
This paper describes an Information Extraction system that can be used
to identify articles containing protein-protein interactions. The
approach relies on the automatic acquisition of dependency tree based
patterns which can be used to identify these interactions and
consequently select relevant documents. Evaluation shows an F-Score
performance of approximately 64%.
Improving Semi-Supervised Acquisition of Relation Extraction Patterns
Mark A. Greenwood and Mark Stevenson. Improving Semi-Supervised Acquisition of Relation Extraction Patterns.
In Proceedings of the Information Extraction Beyond The Document Workshop (COLING/ACL 2006), Sydney, Australia, 2006.
This paper presents a novel approach to the semi-supervised learning of Information Extraction patterns. The method
makes use of more complex patterns than previous approaches and determines their similarity using a measure inspired by recent
work using kernel methods (Culotta and Sorensen, 2004). Experiments show that the proposed similarity measure outperforms
a previously reported measure based on cosine similarity when used to perform binary relation extraction.
Comparing Information Extraction Pattern Models
Mark Stevenson and Mark A. Greenwood. Comparing Information Extraction Pattern Models.
In Proceedings of the Information Extraction Beyond The Document Workshop (COLING/ACL 2006), Sydney, Australia, 2006.
Several recently reported techniques for the automatic acquisition of Information Extraction (IE) systems have used dependency
trees as the basis of their extraction pattern representation. These approaches have used a variety of pattern
models (schemes for representing IE patterns based on particular parts of the dependency analysis). An appropriate model
should be expressive enough to represent the information which is to be extracted from text without being overly complicated.
Four previously reported pattern models are evaluated using existing IE evaluation corpora and three dependency
parsers. It was found that one model, linked chains, could represent around 95% of the information of interest without generating
an unwieldy number of possible patterns.
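For illustration, the sketch below (not taken from the paper) builds single-edge chains from a toy dependency tree given as (head, relation, dependent) triples and pairs chains that share the same verb root to form linked chains. Real chains can be arbitrarily long paths, and the textual encoding of the patterns here is an assumption made for the example.

```python
from itertools import combinations

# Toy dependency analysis of "COMPANY hired PERSON as POST".
EDGES = [
    ("hired", "subj", "COMPANY"),
    ("hired", "obj", "PERSON"),
    ("hired", "as", "POST"),
]

def chains(edges, root):
    """A chain runs from the root verb to one dependent (one edge deep in this toy)."""
    return [f"{root}-{rel}->{dep}" for head, rel, dep in edges if head == root]

def linked_chains(edges, root):
    """A linked chain joins two chains that share the same root verb."""
    return [f"{a} + {b}" for a, b in combinations(chains(edges, root), 2)]

print(linked_chains(EDGES, "hired"))
# ['hired-subj->COMPANY + hired-obj->PERSON',
#  'hired-subj->COMPANY + hired-as->POST',
#  'hired-obj->PERSON + hired-as->POST']
```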
Learning Information Extraction Patterns using WordNet
Mark Stevenson and Mark A. Greenwood. Learning Information Extraction Patterns using WordNet. In Proceedings of the
3rd International Conference of the Global WordNet Association (GWA'06), Jeju Island, Republic of Korea, 2006.
Information Extraction (IE) systems often use patterns to identify
relevant information in text but these are difficult and
time-consuming to generate manually. This paper presents a new
approach to the automatic learning of IE patterns which uses WordNet
to judge the similarity between patterns. The algorithm starts with a
small set of sample extraction patterns and uses a similarity metric,
based on a version of the vector space model augmented with
information from WordNet, to learn similar
patterns. This approach is found to perform better than a previously
reported method which relied on information about the distribution of
patterns in a corpus and did not make use of WordNet.
A Semantic Approach to IE Pattern Induction
Mark Stevenson and Mark A. Greenwood. A Semantic Approach to IE Pattern Induction. In Proceedings
of the 43rd Annual Conference of the Association of Computational Linguistics, 2005.
This paper presents a novel algorithm for the acquisition of Information Extraction patterns.
The approach makes the assumption that useful patterns will have similar meanings to those
already identified as relevant. Patterns are compared using a variation of the standard vector
space model in which information from an ontology is used to capture semantic similarity.
Evaluation shows this algorithm performs well when compared with a previously reported
document-centric approach.
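The flavour of the measure can be sketched as follows (an illustration, not the paper's implementation): a cosine-style similarity a^T W b / (|a||b|) in which a matrix W of pairwise semantic similarities between pattern elements replaces the identity matrix of the standard vector space model. The pattern elements, vectors and W values below are invented for the example.

```python
import numpy as np

ELEMENTS = ["fire_V", "dismiss_V", "company_N"]       # pattern element "vocabulary"
W = np.array([[1.0, 0.9, 0.1],                        # invented pairwise similarities
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

def semantic_similarity(a, b):
    """sim(a, b) = a^T W b / (|a| |b|); with W = I this is ordinary cosine similarity."""
    return float(a @ W @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 1.0])   # pattern using fire_V and company_N
b = np.array([0.0, 1.0, 1.0])   # pattern using dismiss_V and company_N
print(semantic_similarity(a, b))  # 1.05, versus a plain cosine of 0.5
```

Note that, unlike plain cosine similarity, the value is not bounded above by 1 when off-diagonal entries of W are non-zero.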
Bioinformatics
Using Prior Information Attained From The Literature To Improve Ranking In Genome-Wide Association Studies
Mattias Johansson, Yaoyong Li, John Wakefield, Mark A. Greenwood, Thomas Heitz, Ian Roberts, Hamish Cunningham, Paul Brennan, Angus Roberts, James Mckay. Using Prior Information Attained From The Literature
To Improve Ranking In Genome-Wide Association Studies. Presented at the 59th Annual Meeting of The American Society of Human Genetics,
Honolulu, Hawaii, October 2009.
Advances in high-throughput genotyping have made it technically possible to analyze hundreds of thousands of
single nucleotide polymorphisms (SNPs) across the whole genome. Using this technology it is now feasible to
conduct genome-wide association studies (GWAS) aiming to investigate the majority of common genetic variation
and relate it to phenotypic differences, often to the risk of some disease. Whilst the price of GWAS assays
is decreasing rapidly, conducting a GWAS is still a very expensive exercise, typically requiring the genotyping of
several thousand subjects at several hundred euros per sample in order to gain sufficient statistical
power to distinguish the true association signals from the background noise. Recognizing that a large
proportion of GWAS findings reside near potential candidate genes for many of the investigated phenotypes,
we here explore means to incorporate prior information attained from the literature to improve ranking in GWAS.
We use this information to assign a crude prior probability of association for each SNP. The prior probabilities
are thereafter integrated with the association result from the GWAS and the SNPs are re-ranked according to
Bayesian false-discovery probability (BFDP). We show that this methodology improves the ranking substantially
for many known susceptibility loci, with examples from studies on lung cancer and cancer of the upper aerodigestive
tract (UADT). We have implemented this methodology in a web application where a user can specify a list of keywords
and receive priors for all SNPs of interest. These priors can thereafter be used to rank the SNPs according to the BFDP.
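The re-ranking idea can be caricatured with the deliberately simplified sketch below: a literature-derived prior probability of association is combined by Bayes' rule with a Bayes factor summarising the GWAS evidence, and SNPs are re-ranked by the resulting posterior. This is not the BFDP calculation used in the study, and all numbers in it are invented.

```python
def posterior(prior, bayes_factor):
    """P(association | data) from a prior and a Bayes factor BF = P(data|H1)/P(data|H0)."""
    odds = (prior / (1.0 - prior)) * bayes_factor
    return odds / (1.0 + odds)

snps = {                     # snp_id: (literature-derived prior, Bayes factor from GWAS)
    "rs0001": (0.05, 12.0),  # near a literature candidate gene, moderate signal
    "rs0002": (0.001, 30.0), # no literature support, stronger raw signal
}
reranked = sorted(snps, key=lambda s: posterior(*snps[s]), reverse=True)
print([(s, round(posterior(*snps[s]), 3)) for s in reranked])
# -> [('rs0001', 0.387), ('rs0002', 0.029)]
```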
Automatically Acquiring a Linguistically Motivated Genic Interaction Extraction System
Mark A. Greenwood, Mark Stevenson, Yikun Guo, Henk Harkema, and Angus Roberts. Automatically Acquiring a Linguistically
Motivated Genic Interaction Extraction System. In Proceedings of the 4th Learning Language in Logic Workshop (LLL05),
Bonn, Germany, 2005.
This paper describes an Information Extraction (IE) system to identify genic
interactions in text. The approach relies on the automatic acquisition of patterns
which can be used to identify these interactions. Performance is evaluated on
the Learning Language in Logic (LLL-05) workshop challenge task.
Miscellaneous
SUPPLE: A Practical Parser for Natural Language Engineering Applications
Robert Gaizauskas, Mark Hepple, Horacio Saggion, Mark A. Greenwood and Kevin Humphreys.
SUPPLE: A Practical Parser for Natural Language Engineering Applications. In Proceedings of the 9th
International Workshop on Parsing Technologies (IWPT '05), Vancouver, Canada, 2005.
We describe SUPPLE, a freely-available, open source natural language parsing system, implemented in Prolog, and designed
for practical use in language engineering (LE) applications. SUPPLE can be run as a stand-alone application, or as a component
within the GATE General Architecture for Text Engineering. SUPPLE is distributed with an example grammar that has
been developed over a number of years across several LE projects. This paper describes the key characteristics of the parser
and the distributed grammar.
SUPPLE: A Practical Parser for Natural Language Engineering Applications
Robert Gaizauskas, Mark Hepple, Horacio Saggion, Mark A. Greenwood and Kevin Humphreys.
SUPPLE: A Practical Parser for Natural Language Engineering Applications.
Technical report CS-05-08, Department of Computer Science, University of Sheffield, 2005.
We describe SUPPLE, a freely-available, open source natural language parsing system,
which is implemented in Prolog, and which has been designed for practical use in language
engineering applications. SUPPLE can be run as a stand-alone application, but is also available
as a component within the GATE General Architecture for Text Engineering. The description
covers the SUPPLE parsing approach and the grammar formalism used. SUPPLE is distributed
with a particular example grammar, which has been developed over a number of years through
use in numerous language engineering projects, and the key characteristics of this grammar
are also described.
Implementing A Vector Space Document Retrieval System
Mark A. Greenwood. Implementing a Vector Space Document Retrieval System.
Department of Computer Science, The University of Sheffield, UK, December 2001.
This paper describes the implementation of a vector space document retrieval system and the evaluation
of this system using the CACM document collection. The system is evaluated using both traditional techniques, i.e.
precision-recall curves, and more modern techniques which are founded in the way people use Internet search engines.
Both evaluation methods clearly show that retrieval systems which make use of stemming and stop word removal
are superior to those that do not, and that the TF.IDF document ranking algorithm is more successful than simply
using Term Frequency. A flaw in the original implementation of the system, which leads to some query words being
counted multiple times, is also evaluated with promising results. This shows that the vector space model in its
traditional form is not necessarily the best document retrieval technique currently available.
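A compact sketch of the kind of system described is given below: stop word removal plus TF.IDF weighting in a vector space, with stemming omitted to keep the example short. It is an illustration rather than the original implementation, and the toy stop word list and documents are invented.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "a", "to", "and", "in", "is"}

def tokens(text):
    """Lower-case, whitespace-tokenise and remove stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def retrieve(query, documents):
    """Score each document against the query using TF.IDF weights."""
    doc_tokens = [tokens(d) for d in documents]
    n = len(documents)
    idf = {t: math.log(n / sum(1 for other in doc_tokens if t in other))
           for doc in doc_tokens for t in doc}
    scores = []
    for i, doc in enumerate(doc_tokens):
        tf = Counter(doc)
        score = sum(tf[t] * idf[t] for t in set(tokens(query)) if t in tf)
        scores.append((score, documents[i]))
    return sorted(scores, reverse=True)

docs = ["the vector space model of retrieval",
        "stemming and stop word removal in retrieval",
        "a paper about question answering"]
print(retrieve("vector space retrieval", docs)[0])
# -> the first document, with the highest TF.IDF score
```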