Studying Natural Language Processing

I am keen to hear from motivated students who are interested in studying for a PhD in Natural Language Processing. I am able to supervise projects on a wide range of topics in the field. Suggested topics are listed below, and I am also interested in hearing potential students' own ideas for projects.

Applying for a PhD

You can apply by following the instructions in this guide and putting my name down as preferred supervisor. Please contact me beforehand to discuss your application to ensure that I am a suitable supervisor and that your application is as strong as possible.

Choosing a topic

You will need to submit a research proposal with your application. A strong proposal is important for a good application. You can use one of the topics at the bottom of this page or suggest your own. If you want to suggest your own, please discuss it with me before submitting your application to make sure it is suitable. If you are accepted, it is possible to change the research topic after the start of the PhD.

Funding a PhD

Details about PhD fees and possible funding schemes can be found here. Obtaining PhD funding for students from outside the EU is particularly competitive. Such students have normally obtained funding for their studies from their own governments or other institutions.

Not ready for a PhD yet?

If you are interested in Natural Language Processing but don't want to do a PhD, you might be interested in these MSc courses: Computer Science with Speech and Language Processing, and Data Analytics.

Suggested PhD topics

I have the following research topics available and would also like to hear from students who have their own ideas for research areas.

What kind of person would write that?

Individuals differ in many ways: some are introverts while others are extroverts. Researchers have found that some personality types can be predicted from simple linguistic cues. For example, extraversion correlates with higher usage of words with positive emotional connotations and a greater number of social references (Pennebaker and King, 1999). More recently, work has been carried out on applying these ideas to automatically classify an individual's personality type based solely on analysis of documents they have authored, including emails and blogs (Gill et al., 2006; Iacobelli et al., 2011). The aim of this project would be to build upon this work to develop techniques for profiling an individual's personality from documents they write, particularly on social media.
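As a rough illustration of what a "simple linguistic cue" might look like, the sketch below counts how often a text uses positive emotion words and social references. The two word lists and the function name cue_rates are invented for this example; published studies rely on validated lexicons, and this is not the method used in the cited papers.

```python
import re

# Toy word lists, assumed for illustration only; real studies use
# validated lexicons rather than hand-picked words like these.
POSITIVE_EMOTION = {"happy", "great", "love", "fun", "excited"}
SOCIAL_REFERENCES = {"friend", "friends", "we", "us", "party", "together"}


def cue_rates(text):
    """Return the share of tokens that fall in each toy cue category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        "positive_emotion": sum(t in POSITIVE_EMOTION for t in tokens) / total,
        "social": sum(t in SOCIAL_REFERENCES for t in tokens) / total,
    }


rates = cue_rates("We had a great party with friends and I love it")
print(rates)
```

Rates like these could serve as input features for a classifier trained on texts whose authors have completed a personality questionnaire.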

Gill, A., Oberlander, J. and Austin, E. (2006) Rating e-mail personality at zero acquaintance. Personality and Individual Differences, 40, 497--507.
Iacobelli, F., Gill, A.J., Nowson, S. and Oberlander, J. (2011) Large scale personality classification of bloggers. In Affective Computing and Intelligent Interaction 2011: Lecture Notes in Computer Science 6975, 568--577. Memphis, TN, Oct 9-12 2011.
Pennebaker, J.W. & King, L.A. (1999). Linguistic styles: Language use as an individual difference. Journal of Personality and Social Psychology, 77, 1296-1312.

Exploring Large Document Collections

Information Retrieval has traditionally focussed on the problem of information lookup by developing systems that identify the potential answers to a user's information need as efficiently and accurately as possible (e.g. internet search engines). However, more recently the importance of other types of interaction has been identified, including exploratory search (Marchionini, 2006). This is useful in situations where the user is unfamiliar with the content of a document collection, since they may be unsure about what information is available and may not be familiar enough with the vocabulary to form search queries.

Topic models have recently become a popular method for analysing the content of document collections by modelling them using latent variables known as "topics". Topic models have also proved to be useful for exploratory search but their effectiveness is limited by the fact that topics can be difficult to interpret. The aim of this project would be to build on techniques for representing topics that have already been developed within the Sheffield NLP group (Aletras and Stevenson, 2013) and use them to create exploratory search interfaces.

N. Aletras and M. Stevenson (2013) Representing Topics Using Images. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 158--167, Atlanta, Georgia
Marchionini, G. (2006) Exploratory Search: From Finding to Understanding. Communications of the ACM 49(4):41-49.
Newman, Baldwin, Cavedon, Karimi, Martinez, Zobel (2010). Visualizing document collections and search results using topic mapping. Journal of Web Semantics.

Plagiarism Detection

With ever more electronic text being created by word processors and ever wider access to electronic text via the Internet, wider incidence of plagiarism was inevitable and is now occurring. Higher education institutions, charged on the one hand with embracing new technology and widening access through increased participation and use of distance learning, and on the other hand with maintaining quality and standards, need tools to help combat this form of fraud. Computerised techniques that analyse lexical and phrasal features of texts can help to identify likely incidents of plagiarism and draw tutors' attention to texts that should be more closely examined to determine whether plagiarism has or has not occurred.
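One common way to compare lexical and phrasal features is to measure how many word n-grams of a suspect text also appear in a candidate source. The sketch below shows one such containment measure. It is a minimal example under simple assumptions (whitespace-style tokenisation, word trigrams, invented sample sentences) and is not a description of any particular detection system.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams in a text, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(suspect, source, n=3):
    """Fraction of the suspect text's n-grams that also occur in the source."""
    suspect_grams = word_ngrams(suspect, n)
    if not suspect_grams:
        return 0.0
    return len(suspect_grams & word_ngrams(source, n)) / len(suspect_grams)


source = "Topic models represent documents using latent variables known as topics."
copied = "Topic models represent documents using hidden variables."
unrelated = "The children like to drink cola at the weekend."

print(containment(copied, source))     # 3 of 5 trigrams shared: 0.6
print(containment(unrelated, source))  # no trigrams shared: 0.0
```

A high score only flags a text for closer inspection. Translated or heavily rewritten text would share few n-grams with its source, so a measure like this one would probably miss it.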

This project will develop new techniques for automatically identifying plagiarism. Several different types of plagiarism can occur, for example when the original document has been translated from another language or rewritten to avoid detection. The project will focus on a subset of these types.

P. Clough and M. Stevenson (2010) Developing a Corpus of Plagiarised Short Answers. Language Resources and Evaluation.

Word Sense Disambiguation

Words can be used in many different ways, for example "drink" can mean either "take in liquid" (The children like to drink cola) or "consume alcohol" (We were drinking all night). Automatically assigning the correct sense to a word in a given context is a challenging task, known as Word Sense Disambiguation (WSD), and one of significance for practical natural language processing systems.

Although WSD has been widely studied there are still a variety of open problems including:

  • Making use of topic information. Topic information is extremely valuable for WSD. A commonly used example of an ambiguous word is "bank", which can mean river bank or money bank. If it is known that a document discusses finance then it is much more likely that any occurrences of "bank" it contains will mean money bank than edge of river.
  • New sense detection. Words can be used in new, often unexpected, ways. For example, "My car drinks gasoline." These word occurrences are known as novel senses and may not appear in the dictionary. Identifying them automatically would be useful for improving WSD performance and automatically creating dictionaries.
  • WSD in the biomedical domain. Documents in a particular domain may contain words and other terms with several possible meanings. For example, in biomedical documents the word "cold" can mean (at least) "low temperature" and "virus". These documents also contain other forms of ambiguity such as abbreviations with several possible expansions.

This project will choose one of these areas to explore and develop new techniques for WSD.
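To make the role of topic information concrete, the sketch below picks a sense of "bank" by counting how many words a document shares with a small signature for each sense. The signatures and the function name disambiguate are hypothetical and hand-written for this example; it is a simplified overlap heuristic, not the approach taken in the papers listed below.

```python
import re

# Hypothetical sense signatures, hand-written for this illustration.
SENSE_SIGNATURES = {
    "money bank": {"finance", "loan", "account", "interest", "money", "deposit"},
    "river bank": {"river", "water", "shore", "fishing", "boat", "flood"},
}


def disambiguate(document, signatures=SENSE_SIGNATURES):
    """Pick the sense whose signature shares the most words with the document."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    scores = {sense: len(words & sig) for sense, sig in signatures.items()}
    # On a tie, max() returns the first sense listed in the dictionary.
    return max(scores, key=scores.get), scores


sense, scores = disambiguate(
    "The report on finance says the bank raised interest on every loan."
)
print(sense, scores)
```

In practice the topic evidence would more likely come from a trained model over the whole document than from a fixed word list.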

M. Stevenson, Y. Guo, R. Gaizauskas, and D. Martinez. (2008) "Disambiguation of biomedical text using diverse sources of information" BMC Bioinformatics, 9(Suppl 11):S7.

M. Stevenson and Y. Wilks. (2001) "The Interaction of Knowledge Sources in Word Sense Disambiguation" Computational Linguistics 27(3):321-349.

Information Extraction

Information Extraction (IE) is an automatic method for locating important facts in electronic documents to meet the information needs of specific users. For example, large corporations often employ staff to monitor newspapers and similar sources for reports of commercially significant events, e.g. a pharmaceutical company might be interested in announcements of new drugs by its competitors. In such a case, an IE system could be developed to automatically identify and record such facts from an electronic source, such as a newswire. Most current IE systems, however, are based on complex Natural Language Processing (NLP) technologies, and the porting of such systems from one IE domain to another, e.g. from drug announcements to company merger announcements, is time-consuming and requires an IE expert.

In recent work, systems have been developed to avoid these problems. Such so-called adaptive IE systems require only a collection of example documents in which the desired information has been identified and apply machine learning techniques to this training data to learn how to identify corresponding information in new, unseen, documents. This project would extend these approaches by, for example, applying them to new IE problems or using novel machine learning techniques.
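The idea of learning from annotated examples can be illustrated with a deliberately crude sketch: it records the word that precedes each annotated answer in the training sentences, then extracts whatever follows those words in new text. The company and drug names are invented, the function names are hypothetical, and real adaptive IE systems use far richer features and proper machine learning models.

```python
def learn_left_contexts(examples):
    """Collect the word preceding each annotated answer in the training data."""
    contexts = set()
    for sentence, answer in examples:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == answer and i > 0:
                contexts.add(tokens[i - 1].lower())
    return contexts


def extract(sentence, contexts):
    """Return every token that directly follows one of the learned context words."""
    tokens = sentence.split()
    return [tokens[i + 1] for i in range(len(tokens) - 1)
            if tokens[i].lower() in contexts]


# Invented training sentences, each paired with the annotated drug name.
training = [
    ("AcmePharm launched Zentrol yesterday", "Zentrol"),
    ("Rival firm Medico unveiled Corvax today", "Corvax"),
]
contexts = learn_left_contexts(training)
print(extract("Later BioCo launched Nuvia in Europe", contexts))
```

Moving to a new domain, such as company mergers, would then only require a new set of annotated examples rather than hand-written rules.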

M. Stevenson and M. Greenwood (2005) A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Meeting of the Association for Computational Linguistics (ACL-05), Ann Arbor, Michigan.