Each project below indicates its possible supervisors, identified by their initials. One of these supervisors will be assigned to you if you are allocated that project. The supervisors and their initials are as follows:
| RJG | : | Rob Gaizauskas |
| MRH | : | Mark Hepple |
| MS | : | Mark Stevenson |
| YW | : | Yorick Wilks |
| POOL-1: | Cross-lingual Information Retrieval |   | [ MS YW] |
| POOL-2: | Computerised Determination of Disputed Authorship |   | [ MS YW] |
| POOL-3: | Evaluation functions for Word Sense Disambiguation |   | [ MS] |
| POOL-4: | Automatic Detection of Plagiarism |   | [ MS YW] |
| POOL-5: | Disambiguation of Gene and Protein Names in Biological Texts |   | [ MS] |
| POOL-6: | Adaptive Information Extraction by Machine Learning |   | [ MS RJG] |
| POOL-7: | Semantic Similarity Metrics |   | [ MS YW] |
| POOL-8: | An Automatic Text Summariser |   | [ MS RJG YW] |
| POOL-9: | Question Answering against Large Text Collections |   | [ RJG] |
| POOL-10: | Building a Citation Graph from Biomedical Research Papers |   | [ RJG YW] |
| POOL-11: | A Web Client-Server Architecture to Support Advanced Text Search |   | [ RJG] |
| POOL-12: | Producing Biographical Summaries from Web Pages |   | [ RJG YW] |
| POOL-13: | A JAVA Tool for Transformation-based Learning |   | [ MRH] |
| POOL-14: | Discovery of New Meanings in Text |   | [ MS] |
| POOL-15: | Machine Learning Techniques for Dialogue Act Recognition |   | [ MRH YW] |
| POOL-16: | Handling Unknown Words in Part-of-Speech Tagging |   | [ MRH YW] |
| POOL-17: | Machine Learning Techniques for Email Filtering and Text Categorisation |   | [ YW] |
| POOL-18: | Evaluation of on-line MT systems |   | [ YW] |
| POOL-19: | Acquisition of Verb Selectional Preferences from Text |   | [ YW] |
| POOL-20: | Machine Learning Techniques for Word Sense Disambiguation |   | [ YW] |
"Did Homer really write the Iliad and the Odyssey or were they written by another blind Greek of the same name?". Ever since antiquity, issues of establishing genuine authorship have exercised scholars, with certain cases, such as those of Shakespeare and Bacon or of the Federalist Papers in the U.S., becoming widely known. The issue is also important in an area known as "forensic linguistics", where proof of authorship may have important legal or criminal consequences (e.g. was the suicide note really written by the alleged suicide?). With the advent of computer text processing and the wide availability of electronic texts, it has become possible to carry out computerised analysis of aspects of lexical and stylistic usage to attempt to determine whether it is more likely that a disputed text is the product of one author or another.
This project will review the literature on computerised techniques for establishing disputed authorship, select one of these techniques, implement it and evaluate it against a standard set of disputed texts, such as the Federalist Papers.
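As an illustration of the kind of lexical analysis involved, the following minimal sketch compares function-word frequencies between a disputed text and samples of known authorship. It is not a method prescribed by the project: the word list `FUNCTION_WORDS`, the use of Euclidean distance and all function names are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny list of function words (an assumption for
# this sketch); published studies use much larger, carefully chosen lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "upon", "while", "whilst", "by"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def distance(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def likelier_author(disputed, samples):
    """Return the candidate whose known writing is closest to the disputed text.

    `samples` maps an author name to a sample of that author's known writing.
    """
    target = profile(disputed)
    return min(samples, key=lambda name: distance(target, profile(samples[name])))
```

A real evaluation would use long texts, proper tokenisation and a statistically motivated comparison; the sketch only shows the overall shape of the computation.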
Prerequisites: No previous knowledge of text processing or NLP is necessary, though an interest in language is clearly an advantage.
Software: Java or Perl. Hardware: Suns or PCs.
With ever more electronic text being created by word processors and ever wider access to electronic text via the Internet, wider incidence of plagiarism was inevitable and is now occurring. Higher education institutions charged on the one hand with embracing new technology and widening access through increased participation and use of distance learning, and on the other hand with maintaining quality and standards, need tools to help combat this form of fraud. Computerised techniques that analyse lexical and phrasal features of texts can help to identify likely incidents of plagiarism and draw tutors' attention to texts that should be more closely examined to determine whether or not plagiarism has occurred.
This project will review the literature on computerised techniques for detecting plagiarism, select one of these techniques, implement it and evaluate it against a range of texts known to be related in various ways.
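One simple phrasal feature is the overlap of word n-grams between a suspect text and a possible source. The sketch below is only an illustration under assumptions made here: the containment measure, the default of word trigrams and the function names are choices for the example, not techniques named by the project.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) occurring in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspect, source, n=3):
    """Fraction of the suspect's n-grams that also occur in the source.

    Values near 1.0 suggest heavy reuse; values near 0.0 suggest little
    shared phrasing. Returns 0.0 if the suspect is too short to form n-grams.
    """
    suspect_ngrams = word_ngrams(suspect, n)
    if not suspect_ngrams:
        return 0.0
    shared = suspect_ngrams & word_ngrams(source, n)
    return len(shared) / len(suspect_ngrams)
```

A high score would only flag a pair of texts for closer examination by a tutor; it does not by itself establish that plagiarism has occurred.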
Prerequisites: No previous knowledge of text processing or of NLP is necessary, though an interest in language is clearly an advantage.
Software: Java or Perl and Tcl/Tk. Hardware: Suns or PCs.
One approach to this problem has been to apply machine learning techniques to decide whether a particular name is that of a gene or of a protein. This project would reimplement these techniques and apply them to an existing database containing text from biology journals.
See supervisor for further reading.
Prerequisites: Interest in language and/or text processing.
Information Extraction (IE) is an automatic method for locating important facts in electronic documents to meet the information needs of specific users. For example, large corporations often employ staff to monitor newspapers, etc, for reports of commercially significant events, e.g. a pharmaceutical company might be interested in announcements of new drugs by its competitors. In such a case, an IE system could be developed to automatically identify and record such facts from an electronic source, such as a newswire. Most current IE systems, however, are based on complex Natural Language Processing (NLP) technologies, and the porting of such systems from one IE domain to another, e.g. from drug announcements to company merger announcements, is time-consuming and requires an IE expert.
In recent work, systems have been developed to perform relatively simple IE tasks that avoid these problems. Such so-called adaptive IE systems require only a collection of example documents in which the desired information has been identified and annotated, e.g. using XML mark-up, and apply machine learning techniques to this training data to learn how to identify corresponding information in new, unseen, documents. An example application for this kind of IE system might be identifying the speaker, title, time and location of seminars from a collection of seminar announcement messages.
This project will review the literature on adaptive IE, and will implement and evaluate adaptive IE systems based on one or more machine learning algorithms, most probably using available software implementing these machine learning methods.
Software: Java, C/C++ or Perl. Hardware: Suns or PCs.
Use a Web search tool such as Google and you will find that when you are presented with a set of candidate documents satisfying your search, each has associated with it a very brief summary. These summaries are frequently crucial to your decision as to whether to download and read the full document. In some cases the summaries are remarkably good, in other cases abysmally bad. Not only is there a need for good single-document summarisation; increasingly there is a need for multi-document summarisation. The web has made publishing so easy that searches frequently return multiple, similar documents. Rather than reading each of them for what is largely similar information, a fused summary which eliminates redundant information but preserves differences would be highly useful.
This project will implement a single or multi-document summariser. Techniques for automatically generating summaries have been studied for some years, so the project will begin by reviewing existing approaches and settling upon one.
Prerequisites: Interest in language and/or text processing.
Resources: Software: Java or Perl. Hardware: Suns or PCs.
Prerequisites: Interest in language and/or text processing.
While search engines return documents in response to a user query, the new technology of open domain question answering (QA) attempts to return precise answers to specific questions. For example, given the question "How tall is the Eiffel Tower?" a search engine will return a set of pages which the user must read to determine an answer, while a QA system will return a precise answer, e.g. "324 metres".
In 1999 the US National Institute of Standards and Technology (NIST) introduced, as part of the Text REtrieval Conference (TREC) an open international evaluation exercise for question answering systems which has run annually since then. The Sheffield NLP group has taken part most years and is likely to again in 2005.
Most QA systems, including those developed at Sheffield, involve the use of a conventional search engine to retrieve a set of texts deemed likely to contain an answer to a question and then use a second component, an answer extraction component, to identify which segments of the returned texts actually are the answer.
This project will review the literature on existing QA approaches, and will then undertake some research task that falls within the QA area, as agreed with the supervisor. This task might involve implementing a QA system based on fairly simple techniques, or could alternatively be based around the existing Sheffield QA system. Two possible avenues of investigation are:
Software: By agreement with the supervisor.
A more traditional form of link between documents is the citation, as found in scholarly or academic writing. In academic papers authors cite other authors' work on which they have relied or from which they want to distinguish themselves. This form of linking can also be used to build a directed network of interconnections and to rank papers based on the number of votes (inward links) they have.
The aim of the project will be to build a directed graph representing citation linkages in a limited domain: biomedical research. The resulting graph will be used either to present the user with a visual structure to assist in navigating the literature, or to support a literature search tool with ranking based on citation links. The papers containing the citations will need to be retrieved by crawling web sites which contain downloadable academic papers, downloading the papers, converting them to text format if they are in PDF, and then parsing out the references from the papers.
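The graph-building and ranking step can be sketched as follows, assuming the references have already been parsed out of the papers. The input format (a dictionary from a paper identifier to the identifiers it cites) and the function names are assumptions for this example; ranking is by raw count of inward links only.

```python
from collections import defaultdict


def build_citation_graph(references):
    """Build a map from each paper to the set of papers that cite it.

    `references` maps a paper id to the list of paper ids it cites.
    """
    cited_by = defaultdict(set)
    for paper, cited in references.items():
        cited_by[paper]  # make sure every citing paper is a node, even if uncited
        for target in cited:
            if target != paper:  # ignore self-citations
                cited_by[target].add(paper)
    return cited_by


def rank_by_citations(cited_by):
    """Order papers by number of inward links, breaking ties by paper id."""
    return sorted(cited_by, key=lambda paper: (-len(cited_by[paper]), paper))
```

For example, with `references = {"A": ["B", "C"], "B": ["C"], "D": ["C", "B"]}`, paper C has three inward links and B has two, so they are ranked first and second.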
Suitable for ACS/ASE students
The Departments of Computer Science and Journalism together with the UK Press Association are currently collaborating on a research project, called the Electronic Cub Reporter, whose aim is to build an advanced search system to support journalists in the task of gathering and writing background for breaking news stories. The project is integrating conventional search engine technology with novel summarisation and natural language analysis tools and needs to embed these technologies in an interface which will allow journalists to carry out complex searches over a large text archive (~20GB of newswire stories constituting the entire Press Association E-Archive from 1993 till the present).
In this project the student will work with Cub Reporter researchers who are building the language analysis tools and studying journalists' information seeking behaviour. The aim of the project is to design and build a web-based client-server system to support searching over the PA archive, as indexed and analysed by the language analysis tools. This will involve, for example:
Software: Probably Java Servlet technology, JSP, Tomcat and JDBC; possibly PHP.
Such information is frequently distributed across multiple documents, many of which will repeat facts found in others. It would be useful to be able to produce a single coherent, accurate and non-redundant summary of key details about a person, with links to source documents.
Such a system has been built by researchers at Columbia University and the MITRE Corporation and this project will begin by studying their work, as well as related work in information extraction and multi-document summarisation. It will then proceed to re-implement the Columbia/MITRE approach, with any appropriate modifications.
Prerequisites: Interest in language and/or text processing.
Resources: Software: Java or Perl. Hardware: Suns or PCs.
Existing tools in the NLP group will be available for use in the project.
Transformation-based Learning (TBL) is an intuitively simple rule-based machine learning approach which has been applied to a wide range of tasks in Natural Language Processing (NLP). The central process of this learning approach acquires a sequence of transformation rules (TRs), which are context dependent `correction' rules, that apply in turn to modify an initial guess at the correct annotation of some text, so that it better approximates truth.
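To make the idea of applying transformation rules in turn concrete, here is a minimal sketch for part-of-speech tags. It covers only the application of an already learned rule sequence, not the learning process itself. The rule format (change one tag to another when the preceding tag matches) and the names `Rule` and `apply_rules` are assumptions for this example.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """A context-dependent correction: from_tag becomes to_tag after prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def apply_rules(initial_tags, rules):
    """Apply each rule in turn, left to right, to a copy of the initial guess."""
    tags = list(initial_tags)
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags
```

For the sentence "I want to race", an initial guess of `["PRP", "VBP", "TO", "NN"]` is corrected to `["PRP", "VBP", "TO", "VB"]` by the single rule `Rule("NN", "VB", "TO")`. In this sketch a change takes effect immediately, so it can influence later positions within the same pass; other application strategies are possible.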
NLP tasks to which TBL has been applied include part-of-speech tagging, robust grammatical analysis, word sense disambiguation and dialogue act recognition. Commonly, researchers investigating the use of TBL for some task have coded their own implementation of the approach tailored to the given purpose, and these implementations often cannot readily be reused for other tasks. Clearly, there are benefits to be gained from the availability of generic implementations of TBL, that can flexibly be applied to different learning problems. Currently, systems that seek to address this need have been developed in C++ (fnTBL) and in Prolog (mu-TBL).
The aim of this project is to develop a generic tool for TBL in the JAVA programming language. An implementation in JAVA is sought with a view both to the portability of the resulting system and to its future development beyond the lifetime of the immediate project. The project will start by looking at uses of TBL for various tasks, and by examining the existing generic TBL implementations, with a view to elaborating the requirements that the system should fulfill to produce an optimally generic/reusable tool. The subsequent JAVA implementation work will be aimed at producing a system which not only realises a reasonable subset of these requirements, but which is also well-designed so as to allow for future development. Ultimately, the aim is to produce an effective OpenSource generic tool for TBL which can genuinely facilitate the work of researchers in NLP and other fields.
Prerequisites: This project would suit an Advanced Masters student, who has strong JAVA programming skills, and a good understanding of software design principles.
Software: Java. Hardware: Suns or PCs.
Words can be used in many different ways. For example, "drink" can mean either "take in liquid" (The children like to drink cola) or "consume alcohol" (We were drinking all night). The possible meanings of words are listed in dictionaries and it is assumed that all potential meanings are included. However, words can also be used in new, often unexpected, ways, as in "My car drinks gasoline." Such uses are known as novel senses and may not appear in the dictionary.
This project would use known Natural Language Processing techniques to identify when a word is being used in a novel sense. An existing system for analysing the meanings of words (such as the ones suggested by Yarowsky (1992) or McCarthy et al. (2004)) would be implemented and its output analysed to identify when words are being used in new ways.
Any suitable programming language (e.g. Java, C/C++ or Perl) running on PCs or UNIX.
Yarowsky, D. (1992) "Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora" In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92)
McCarthy, D., Koeling, R., Weeds, J. and Carroll, J. (2004) "Finding predominant senses in untagged text" In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.
For additional reading see supervisor.
Detailed study of dialogue, for example in conversational exchanges, has led to the proposal of a fixed set of dialogue acts, which concisely characterise a speaker's intention in producing a particular utterance or statement. Examples of dialogue acts are SUGGEST, INFORM, ACCEPT, REJECT, and so on. Recognising dialogue acts is crucial to effective automatic analysis of dialogue. Dialogue act recognition is a challenging task, however, because often the dialogue act cannot be directly inferred from a literal interpretation of an utterance. (For example, I may say "It's cold in here", apparently a simple statement of fact, with the intent of causing you to close the window, so that my utterance is an indirect form of request.)
A number of dialogue corpora are available in which utterances have been manually annotated for their dialogue act. These resources have been used as training materials with a number of machine learning approaches to produce dialogue act recognition systems that automatically determine the dialogue act of an utterance using cues such as the words and phrases appearing within the utterance, as well as properties of the preceding utterances.
This project will review the literature on automatic dialogue act recognition, and will implement an approach for performing this task, train it on available data, and evaluate its effectiveness.
Prerequisites: No previous knowledge of NLP is necessary, though an interest in language is clearly an advantage.
Software: Java, C/C++ or Perl. Hardware: Suns or PCs.
Part-of-speech taggers process natural language texts and assign to each word a word class or part-of-speech tag, such as NOUN or VERB. This is an important first step for subsequent analysis in many natural language processing applications. The task is difficult as many words have more than one part-of-speech and the correct one to assign depends on the context of use (e.g. 'study' in "I study/VERB French" vs. "He is in the study/NOUN"). Part-of-speech taggers are commonly produced by applying machine learning methods to a corpus of pretagged training data.
A key obstacle to successful part-of-speech tagging is the quite frequent presence in texts of words that were not seen in a tagger's training data, and which may not even occur in a common dictionary. Consequently, practical part-of-speech taggers require a component for handling unknown words, which might assign a single most-probable tag to an unknown word, or perhaps a ranked list of several most-probable tag alternatives (allowing the tagger to choose amongst them). Various clues may be used in making this assignment, most particularly the presence of various "affixes", i.e. word prefixes and suffixes. For example, words ending in -ed are likely to be past tense or past participle verb forms. Another possible clue is capitalisation, since an uppercase-initial word is commonly a proper name in English text.
The project will start by reviewing existing approaches to handling unknown words in part-of-speech tagging. One or more approaches will then be implemented and evaluated, using one of the available corpora of pretagged text, such as the British National Corpus, as a basis for training and evaluation.
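The suffix and capitalisation clues can be combined in a very simple hand-written guesser, sketched below. The suffix table, the coarse tag names and the default of NOUN are assumptions for illustration; an approach of the kind the project envisages would estimate such associations from a pretagged corpus rather than list them by hand.

```python
# Hypothetical suffix-to-tag table (an assumption for this sketch).
SUFFIX_TAGS = [
    ("ing", "VERB"),
    ("ed", "VERB"),
    ("ly", "ADVERB"),
    ("ness", "NOUN"),
    ("tion", "NOUN"),
]


def guess_tag(word, sentence_initial=False):
    """Guess a single most-probable tag for a word unseen in training data."""
    # Capitalisation is only informative when the word does not start a sentence.
    if word[:1].isupper() and not sentence_initial:
        return "PROPER_NOUN"
    lower = word.lower()
    for suffix, tag in SUFFIX_TAGS:
        # Require some stem before the suffix so that short words are not matched.
        if lower.endswith(suffix) and len(lower) > len(suffix) + 1:
            return tag
    return "NOUN"  # fall back to a default open-class tag
```

A ranked list of alternatives, as mentioned above, could be produced by returning every matching tag with a score instead of stopping at the first match.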
Prerequisites: No previous knowledge of NLP is necessary, though an interest in language is clearly an advantage.
Software: Java, C/C++ or Perl. Hardware: Suns or PCs.
Increasingly email is being used by businesses to advertise their wares and by other organisations to promote their causes. While some of this may be of interest to a user, most of it is not. To many already burdened with high levels of essential work-related email, `junk mail' is endangering the utility of the medium and sorely trying their patience. This scenario illustrates the need for effective automatic text categorisation, by which texts are automatically assigned to one or more `categories'. In this case, the categories would be `junk mail' and `non-junk mail', and a system that could automatically identify `junk mail' messages would allow them to be filtered and disposed of, without the user ever needing to see them. Further possible uses of automatic text categorisation are easily provided. For example, a businessman might want financial news reports distinguished from other news, so that he could focus on the former. A parent might want a system to prevent their child accessing material on the internet that had violent or sexually explicit content.
This project will review approaches used for automatic text categorisation, and will implement and evaluate one or more methods in some application context.
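As one possible starting point, the sketch below implements a multinomial Naive Bayes classifier with add-one smoothing. The choice of Naive Bayes is an assumption made here for illustration; the project does not prescribe a particular method. Tokenisation by whitespace and the function names `train` and `classify` are likewise assumptions.

```python
import math
from collections import Counter, defaultdict


def train(labelled_texts):
    """Collect per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_texts:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # category prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                count = word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Trained on a handful of messages labelled `junk` and `non-junk`, `classify` assigns a new message to whichever category makes its words more probable.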
Prerequisites: No previous knowledge of NLP is necessary, though an interest in language is clearly an advantage.
Software: Java, C/C++ or Perl. Hardware: Suns or PCs.
Most verbs impose some form of `selectional restrictions' on the phrases which can occur around them. For instance, the subject of `sleep' is likely to be animate, unless the usage is metaphorical. So, babies sleep, cats sleep, but ideas do not sleep. However, cities sleep -- at least metaphorically -- which means that rarely do we have absolute restrictions -- just preferences. Once these selectional preferences have been acquired they can be used to help in parsing or in speech recognition: if we are uncertain as to which of two words is the subject of a particular verb, we choose the one whose semantic class is most preferred; if we are uncertain as to which of two words we have `heard' we pick the one which is most preferred.
There has been considerable recent interest in acquiring verb selectional restrictions or preferences automatically from corpora. Typically this involves performing a crude syntactic analysis of the text around all instances of a verb in a corpus and then using a semantic classification hierarchy such as WordNet to build a model of which semantic classes occur as subject, object, etc. of the verb.
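The counting step can be sketched as follows, assuming verb-subject pairs have already been extracted by some syntactic analysis. The toy dictionary `SEMANTIC_CLASS` is a hypothetical stand-in for a hierarchy such as WordNet, and relative frequency is used here as the simplest possible preference model; published approaches use more refined measures.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon standing in for a semantic hierarchy (an assumption).
SEMANTIC_CLASS = {
    "baby": "animate",
    "cat": "animate",
    "dog": "animate",
    "city": "location",
    "idea": "abstraction",
}


def subject_preferences(verb_subject_pairs):
    """Estimate, per verb, the relative frequency of each subject class."""
    counts = defaultdict(Counter)
    for verb, subject in verb_subject_pairs:
        semantic_class = SEMANTIC_CLASS.get(subject)
        if semantic_class is not None:  # skip words missing from the lexicon
            counts[verb][semantic_class] += 1
    return {
        verb: {cls: n / sum(class_counts.values()) for cls, n in class_counts.items()}
        for verb, class_counts in counts.items()
    }
```

Given three animate subjects and one occurrence of "city" as subject of "sleep", the model prefers animate subjects (0.75) without ruling out the metaphorical use (0.25), reflecting the point that these are preferences rather than absolute restrictions.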
This project will look at re-implementing one of the previously investigated approaches in the area.
Prerequisites: No previous knowledge of NLP is necessary, though an interest in language is clearly an advantage.
Software: Java, C/C++ or Perl. Hardware: Suns or PCs.
Many natural language words have multiple senses: `bridge' for example can refer to a structure spanning a river, to a card game, to a piece of dental work, or to a part of a ship. Automatically assigning the correct sense to a word in a given context is a challenging task, known as `disambiguation', and one of significance for practical natural language processing systems.
The various approaches that have been tried for this task have in common that they rely on the words that appear in the surrounding context of sense-ambiguous words. For example, context words such as `play', `deal', `win', `lose', etc, provide evidence that an instance of `bridge' belongs to the word's card game sense. Aside from this common feature, however, the various approaches tried differ in many regards, including the specific machine learning techniques used, whether learning is `supervised' or `unsupervised' (i.e. whether or not it relies on manually disambiguated training data), and whether additional knowledge sources are exploited (e.g. WordNet, a semantic network encoding information about the relations between word meanings).
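The reliance on context words can be illustrated with a minimal sketch that picks the sense whose cue words overlap most with the context. The hand-written cue sets and the function name are assumptions for the example; in the approaches described above, such evidence would be learned from data or drawn from a resource such as WordNet.

```python
# Hypothetical cue words for two senses of `bridge' (an assumption).
SENSE_CUES = {
    "card game": {"play", "deal", "win", "lose"},
    "structure": {"river", "span", "cross", "traffic"},
}


def disambiguate(context_words, sense_cues):
    """Return the sense sharing the most cue words with the context.

    Ties are resolved in favour of the sense listed first in `sense_cues`.
    """
    context = {word.lower() for word in context_words}
    return max(sense_cues, key=lambda sense: len(context & sense_cues[sense]))
```

For instance, the context "we play bridge and I usually lose" shares two cue words with the card game sense and none with the structure sense.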
This project will review the literature on approaches to word sense disambiguation, and will then implement and evaluate one or more algorithms for performing this task.
Prerequisites: Interest in language and/or text processing.
Software: Java, C/C++ or Perl. Hardware: Suns or PCs.