Rob Gaizauskas : UG Projects 2024-25

Email: Rob Gaizauskas

Project Titles:

RJG-UG-1: Exploring Deep Learning for Named Entity Recognition
RJG-UG-2: Gamifying the Collection of Argument Tree Data
RJG-UG-3: Using Deep Learning/LLMs for Post-OCR Text Correction of Scanned 19th Century British Newspaper Text
RJG-UG-4: Developing a Neural Dialogue State Tracking System for Task-Based Dialogues
RJG-UG-5: Exploring the Use of Static and Contextual Word Embeddings for Dense Information Retrieval
RJG-UG-6: Using LLMs to Generate Patient-Friendly Medical Imaging Reports



RJG-UG-1:   Exploring Deep Learning for Named Entity Recognition

Background

Named entity recognition (NER) is the identification and classification of sequences of tokens in a natural language text that denote specific instances of classes of entities that are given names. Examples of such entity classes are people (e.g. "Boris Johnson", "Vladimir Putin"), locations ("London", "Kyiv") and organisations ("Google", "International Business Machines Corporation"). Conventionally, other designators, such as those for times ("14:27, 12/02/19 GMT", "Tuesday, October 5th") and for monetary amounts ("£532.49", "$2.3 billion"), are also counted as named entities. Application-specific classes of names may also need to be recognised: for example, gene and protein names, ship and vessel names, movie and book names, and so on.

NER is a long-standing and central problem in natural language processing, and many techniques have been developed over the years to tackle it. These range from manually authored rule-based systems developed in the 1990s, through feature-engineered statistical learning systems, which dominated the field from the mid-1990s until the mid-2010s, to the various deep learning/neural network approaches that currently achieve state-of-the-art performance. Many benchmark datasets exist, such as the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition dataset.

Since NER is such a classic and well-understood problem, it is a great task to use whilst exploring deep learning techniques for sequence labelling. The aim of this project is exactly that -- to develop an understanding of deep learning techniques for sequence labelling by designing, implementing and evaluating several state-of-the-art deep learning approaches to NER, such as recurrent neural networks, long short-term memory (LSTM) networks and transformers.
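As a taste of what is possible out of the box, the sketch below applies a pretrained transformer to NER using the Hugging Face transformers library (a minimal sketch: the named checkpoint, a BERT model fine-tuned on the CoNLL-2003 data mentioned above, is an assumption, and any NER-tuned model could be substituted):

    # Minimal sketch of transformer-based NER with Hugging Face transformers.
    # "dslim/bert-base-NER" is an assumed publicly available checkpoint.
    from transformers import pipeline

    ner = pipeline("token-classification",
                   model="dslim/bert-base-NER",
                   aggregation_strategy="simple")  # merge word pieces into whole entities

    text = "Boris Johnson met executives from Google in London on Tuesday."
    for entity in ner(text):
        # Each result carries the entity type (PER, ORG, LOC, ...),
        # the matched span and a confidence score.
        print(entity["entity_group"], entity["word"], round(entity["score"], 3))

A project would go well beyond this, training and evaluating its own sequence labelling models rather than relying on off-the-shelf pipelines.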

Project Description

This project will proceed by:

Prerequisites

An interest in natural language processing and machine learning, together with Python programming skills, are the only prerequisites for the project.

Initial Reading and Useful Links



RJG-UG-2:   Gamifying the Collection of Argument Tree Data

Background

People love to argue. Not in the sense of having rows or quarrelling, but in the sense of debating a proposition such as "Marijuana should be legalised" or "Ukraine should become a member of NATO". The internet is filled with discussion of these sorts of issues, much of it unconstrained in social media conversations, but in some cases more structured, for example on sites like idebate, kialo.com or Reddit's Change my View, where at least arguments pro or con a particular proposition are distinguished.

Argument mining is a relatively new subarea of Natural Language Processing (NLP) that investigates techniques for automatically identifying and extracting argumentative structures from natural language text. Like most areas of modern NLP, argument mining is dominated by machine learning approaches. Such approaches require labelled data -- target examples -- to learn from. In the case of argument mining this amounts to arguments labelled with the sort of structural information we would ideally like to extract. Unfortunately, obtaining such data is very hard. Essentially it involves specifying an agreed scheme for annotating argumentative discourses, such as exchanges in social media, and then getting humans to annotate a set of such discourses manually. But this is time-consuming, error-prone and expensive if annotators need to be paid.

So, how can we get data? There are some sites, like kialo.com, that go some way towards organising online argumentative discourse so that the argument structure is evident. But such sites are reluctant to give researchers access to their data, as such uses do not form part of the terms and conditions their users have signed up to. Furthermore, what they provide is only part of what is really required.

One approach to gathering data for machine learning, based on the influential paper Games with a Purpose by Luis von Ahn, is to design an online game that captures the training data we want as a side-effect. Following this approach, some NLP researchers have already developed games designed to capture data for training systems to do specific tasks. For example, Poesio et al. (2013) describe a game called Phrase Detectives which captures data about which expressions in a text co-refer with earlier expressions in the same text, e.g. in the text "Johnson was hammered in the local election results. He did not seem dismayed." the word "he" corefers with (i.e. refers to the same thing as) the word "Johnson". In the game, users get points for correctly identifying coreferring expressions and compete with each other to get high scores.

The goal of the current project will be to explore whether or not a "game with a purpose" (GWAP) can be designed to capture the structure of an argument as developed by two or more participants.

Project Description

The aim of this project is to design, implement and evaluate a GWAP that will capture argument structure as two or more people engage in an argument. For example, participants will need to specify which propositions are conclusions and which are premises, and which propositions support or attack other propositions or the inferences between propositions. Participants must be motivated to take part, so some sort of competitive element must be introduced and a record kept (e.g. a leaderboard) of how well participants are doing in relation to each other.

The intention is that the argument structure should be captured as a graphical structure in which the conclusion is the root node, leaf nodes are premises, and intermediate nodes are propositions that are claimed to follow from other propositions closer to the leaves. Branches between nodes will be of two types, one indicating support and one indicating attack. Participants will be able to add new nodes supporting existing nodes, add nodes attacking existing nodes, or attack the inferences between nodes.
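The sketch below illustrates one way such a structure might be represented (a sketch only: all class and field names are illustrative assumptions, not a prescribed design):

    # Illustrative sketch of an argument tree: the root is the conclusion,
    # leaves are premises, and each edge is labelled "support" or "attack".
    from dataclasses import dataclass, field

    @dataclass
    class ArgumentNode:
        text: str     # the proposition contributed by a participant
        author: str   # who contributed it (needed for scoring/leaderboards)
        children: list = field(default_factory=list)  # (relation, ArgumentNode) pairs

        def add(self, relation: str, node: "ArgumentNode") -> None:
            assert relation in ("support", "attack")
            self.children.append((relation, node))

    # A tiny example argument.
    root = ArgumentNode("Ukraine should become a member of NATO", "alice")
    root.add("support", ArgumentNode("Membership deters further aggression", "bob"))
    root.add("attack", ArgumentNode("Membership risks escalating the conflict", "carol"))

Note that attacks on inferences (as opposed to attacks on propositions) would require edges to be first-class objects in the design, so a simple child-list representation like the one above would need extending.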

The project will involve:

Prerequisites

No mandatory requirements, but an interest in web design and programming, human computer interaction, machine learning, and natural language processing would be useful.

Initial Reading and Useful Links



RJG-UG-3:   Using Deep Learning/LLMs for Post-OCR Text Correction of Scanned 19th Century British Newspaper Text

Background

This project will take place within the context of a collaboration between the Departments of Computer Science and History at the University of Sheffield. The Department of History has in previous projects created a database, called the Digital Panopticon, of "life archives" of convicts in Britain and Australia from the late 18th to the early 20th century. They now wish to supplement the information in the Digital Panopticon by automatically extracting information about crimes and police court trials from English newspapers of the period and linking it to the relevant records in the Digital Panopticon.

This information extraction/record linkage project poses numerous interesting technical challenges. However, one initial problem threatens to undermine the whole enterprise: the quality of the text derived via optical character recognition (OCR) from images of historic newspapers is often so poor that applying higher level analysis, such as determining whether a case reported in a newspaper story is about the same person as one logged in the Digital Panopticon, is virtually impossible.

It turns out that this problem of poor quality OCR'ed historic documents is very widespread, i.e. not just limited to 19th-century British newspapers, but affecting many areas of research involving historic documents. The initial rush to digitise historic documents to support humanities research took place when OCR systems were much less accurate than they are today (though they are still far from perfect when faced with the challenges that historic documents present, such as fading, damaged paper, and outdated fonts and language). While re-doing the whole OCR pipeline is in theory possible, in practice it may not be economically viable, or access to the original manuscripts, or even to images of the manuscripts, may not be possible. In such cases it is interesting to investigate how well modern large language models (LLMs) can correct existing noisy OCR'ed texts.

Because of the wide interest in the humanities in solving this problem, we have been successful in bidding for support from the Centre for Machine Intelligence at Sheffield, with the result that one of the Centre's AI Research Engineering (AIRE) team is also working on this project. Thus, in addition to input from the project supervisor, any student working on this project will benefit from input from a member of the AIRE team.

Project Description

The aim of this project is to explore whether and how the quality of optical character recognition over 19th century British newspapers can be improved using deep learning/LLM techniques. As input we have a large volume of "noisy" 19th century British newspaper text which has been OCR'ed from images of the original paper documents. The goal is to see whether the quality of this OCR output can be substantially improved, as at present it is a significant obstacle for the overall project (described above), which relies on the accuracy of the OCR process.
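One natural way to frame the task is as text-to-text "translation" from noisy OCR output to corrected text. The sketch below illustrates this framing with a byte-level sequence-to-sequence model from the Hugging Face transformers library (a sketch under clear assumptions: the base checkpoint named here is untrained for this task and would first need fine-tuning on aligned pairs of noisy and hand-corrected newspaper text before producing useful corrections):

    # Sketch: post-OCR correction framed as sequence-to-sequence generation.
    # ByT5 operates on raw bytes, which suits character-level OCR noise;
    # the untuned base model is a placeholder and must be fine-tuned on
    # aligned (noisy OCR, corrected text) pairs before it is useful.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")

    noisy = "Tne pris0ner was cornmitted for triaI at tbe asslzes."
    inputs = tokenizer(noisy, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))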

The project will proceed by:

Prerequisites

Interest in neural networks and language processing. No mandatory module prerequisites, but any or all of the Machine Learning, Text Processing and NLP modules would be useful.

Initial Reading and Useful Links



RJG-UG-4:   Developing a Neural Dialogue State Tracking System for Task-Based Dialogues

Background

Task-based dialogues are dialogues in which a human user interacts with an automated system using natural language in order to carry out a task such as booking a ticket (e.g. train, airline, theatre), finding a restaurant, cooking a dish, getting customer support and so on. Such systems stand in contrast to chatbots, which are designed to carry on an intelligent conversation, but not necessarily to help a user complete a fixed task. With the rapid improvement in speech recognition technology, task-based spoken language dialogue systems are becoming increasingly common in very many application areas.

Dialogue state tracking (DST) is the task of accurately identifying the sequence of information constraints or requests a user's utterances convey within a task-based dialogue. At any point in a dialogue, the system relies on its representation of the dialogue state to determine what it should do next, e.g. precisely what information to request from the user or what information to supply to them. For example, consider the following exchange (an illustrative restaurant-booking dialogue in the style of the examples in [1]):
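    User:   I'm looking for a cheap Italian restaurant in the centre of town.
    Dialogue state:  food=italian, pricerange=cheap, area=centre
    System: Zizzi Cambridge is a cheap Italian restaurant in the centre. Shall I book you a table?
    User:   Yes, for two people at 19:00 on Friday.
    Dialogue state:  food=italian, pricerange=cheap, area=centre, people=2, time=19:00, day=friday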

Here the lines starting "Dialogue state" show what the dialogue state tracker should produce given the dialogue so far.

Dialogue state tracking has been the focus of a series of shared task challenges, which ran from 2013 to 2020. Each challenge precisely specified a DST task, provided training and test data, and defined evaluation metrics so that participating systems' performance on the task could be measured and compared. Other datasets designed to facilitate DST research have also been created, including the MultiWOZ dataset, which contains over 10,000 task-based dialogues spanning seven domains.

Many approaches to DST have been investigated, ranging from rule-based to neural. Following broader trends in NLP, neural approaches are currently attracting the most attention and appear the most promising.

Project Description

The aim of this project is to design, build and evaluate a neural dialogue state tracking system, most likely using the MultiWOZ [2] dataset for training and testing.
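One currently popular neural formulation treats DST as text-to-text generation: the dialogue history is fed to a sequence-to-sequence model, which outputs the state as a flat string of slot-value pairs. A minimal sketch of this framing is below (the checkpoint, the task prefix and the state-string format are all illustrative assumptions; the model would need fine-tuning on MultiWOZ before it produced sensible states):

    # Sketch: dialogue state tracking as text-to-text generation.
    # "t5-small" is an untuned placeholder; in the project it would be
    # fine-tuned on (dialogue history, state string) pairs from MultiWOZ.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    history = ("user: I'm looking for a cheap italian restaurant in the centre. "
               "system: Zizzi Cambridge matches. user: book it for two at 19:00 friday.")
    inputs = tokenizer("track dialogue state: " + history, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    # After fine-tuning, the decoded output would look something like:
    #   food=italian, pricerange=cheap, area=centre, people=2, time=19:00, day=friday
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))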

The project will proceed by:

Prerequisites

Interest in neural networks and language processing. No mandatory module prerequisites, but modules on Machine Learning and Text Processing are useful.

Initial Reading and Useful Links

  1. Balaraman, Vevake, Seyedmostafa Sheikhalishahi and Bernardo Magnini. Recent Neural Methods on Dialogue State Tracking for Task-Oriented Dialogue Systems: A Survey. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, 239--251, 2021. [pdf]
  2. Budzianowski, Pawel, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan and Milica Gasic. MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 5016--5026, 2018. [pdf]
  3. Williams, Jason, Antoine Raux and Matthew Henderson. The Dialog State Tracking Challenge Series: A Review. Dialogue & Discourse, 2016. [pdf]
  4. Jurafsky, Daniel and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. The 2nd ed. is available as an e-book from the University Library. The 3rd edition in draft form (much more up to date) is available at: Speech and Language Processing (3rd ed. draft). See especially, in the 3rd edition, Chapter 15 on Chatbots and Dialogue Systems and Chapters 5-12 for background on neural methods in NLP.


RJG-UG-5:   Exploring the Use of Static and Contextual Word Embeddings for Dense Information Retrieval

Background

Information retrieval (IR), more accurately called document retrieval, is the task of retrieving documents relevant to a user query from a potentially very large collection of documents. It is the technology underlying search engines such as Google, Bing and DuckDuckGo.

For decades the dominant model in information retrieval has been the vector space model, in which both documents and queries are represented as vectors, and relevance between queries and documents is typically measured using cosine similarity, which treats the cosine of the angle between query and document vectors as a measure of their similarity.

In this standard model, the vectors representing the documents and queries are high-dimensional term vectors, where each position i in the vector for document d corresponds to a term t_i in the vocabulary of the document collection, and the value assigned to position i is the term weight assigned to t_i in d. This weighting varies according to the specific scheme chosen, but a very common scheme is the so-called tf.idf scheme, where the weight is the product of the term frequency (tf) and inverse document frequency (idf) of t_i in d. More precisely, tf is a measure of how often the term t_i occurs in d, and idf is the reciprocal (in practice, usually the log of the reciprocal) of the document frequency of t_i, which is the count of how many documents in the collection t_i occurs in. So idf is a measure of the dispersion of t_i across the collection, capturing the intuition that terms that occur everywhere should be weighted less than those that occur in only a narrow subset of documents.
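As a concrete illustration, the sketch below builds tf.idf vectors for a toy collection and ranks documents against a query by cosine similarity, using scikit-learn (a minimal sketch: scikit-learn's default weighting differs in detail from the textbook tf.idf product, but the principle is the same):

    # Sketch: the classic vector space model with tf.idf weighting and
    # cosine similarity, using scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the crane lifted the steel girder onto the site",
            "cranes are tall wading birds found near water",
            "the take-over of the firm was completed in May"]
    query = ["construction crane lifting equipment"]

    vectorizer = TfidfVectorizer()
    doc_vecs = vectorizer.fit_transform(docs)   # one tf.idf vector per document
    query_vec = vectorizer.transform(query)     # query mapped into the same term space

    # Rank documents by the cosine of the angle between query and document vectors.
    print(cosine_similarity(query_vec, doc_vecs))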

This approach suffers from one major defect: it captures query-document similarity solely in terms of term overlap between the document and query, not in terms of similarity of meaning. Documents that express the same meaning but use different words to express it will not be deemed similar (so, e.g., there is no similarity between the terms "take-over" and "acquisition", as these are distinct words). Conversely, documents that use the same words, but in different senses, will be deemed similar even though they are not (so documents about cranes, the wading birds, may be confused with documents about cranes, the construction machinery).

Starting with the introduction of word embeddings as a means of representing word meaning [1,4], and following on with the introduction of contextual embeddings, which model the meaning of words in context [1,3], novel representations of words as dense vectors derived from neural networks have become available. These representations underlie recent advances in NLP such as ChatGPT, and their use in generative AI has been impressive and highly publicised. However, while such representations hold out the promise of addressing the major defect of the conventional vector space model identified above, their application to this problem remains the subject of significant research, as it is not obvious how they may best be utilised for the IR task.

This project will explore how static and contextual word embeddings, such as those derived from transformers, may be used for information retrieval and determine whether they can significantly improve retrieval performance as compared to the conventional vector space model.

Project Description

The aim of this project is to design, build and evaluate several IR systems that use static and contextual word embeddings, and to compare them with each other and with a conventional vector space model system, using a test collection of queries and documents for which gold-standard relevance judgements are available, such as the MS MARCO dataset. It will likely use the Pyserini toolkit [3], an extensible platform for research on both conventional and dense retrieval methods containing existing implementations of both, as a starting point.
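By way of contrast with the tf.idf sketch above, the snippet below shows the dense alternative using the sentence-transformers library (again a sketch: the named checkpoint is an assumption, and in the project itself Pyserini's existing sparse and dense implementations would be the actual starting point):

    # Sketch: dense retrieval with contextual embeddings. Documents and
    # query are encoded into the same dense vector space and ranked by
    # cosine similarity, so "take-over" and "acquisition" can match
    # even with no term overlap.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed available checkpoint

    docs = ["The take-over of the company was announced on Monday.",
            "The firm's acquisition closed after regulatory approval.",
            "Whooping cranes migrate across North America."]
    doc_embs = model.encode(docs, convert_to_tensor=True)
    query_emb = model.encode("corporate acquisition news", convert_to_tensor=True)

    print(util.cos_sim(query_emb, doc_embs))  # similarity in embedding space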

The project will proceed by:

Prerequisites

Interest in neural networks and language processing. No mandatory module prerequisites, but any or all of the Machine Learning, Text Processing and NLP modules would be useful.

Initial Reading and Useful Links

  1. Jurafsky, Daniel and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. The 2nd ed. is available as an e-book from the University Library. The 3rd edition in draft form (much more up to date) is available at: Speech and Language Processing (3rd ed. draft). See especially, in the 3rd edition, Chapter 14 on Question Answering and Information Retrieval and Chapters 5-12 for background on neural methods in NLP.
  2. Le, Quoc and Tomas Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, II-1188--II-1196, 2014. [pdf]
  3. Lin, Jimmy, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep and Rodrigo Nogueira. Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), 2356--2362, 2021. [pdf]
  4. Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, 3111--3119, 2013. [pdf]
  5. Onal, Kezban Dilek, Ye Zhang, Ismail Sengor Altingovde, Md Mustafizur Rahman, Pinar Karagoz, Alex Braylan, Brandon Dang, Heng-Lu Chang, Henna Kim, Quinten McNamara, Aaron Angert, Edward Banner, Vivek Khetan, Tyler McDonnell, An Thanh Nguyen, Dan Xu, Byron C. Wallace, Maarten de Rijke and Matthew Lease. Neural information retrieval: at the end of the early years. Information Retrieval Journal 21(2), 111--182, 2018. [pdf]
  6. Yang, Wei, Haotian Zhang and Jimmy Lin. Simple Applications of BERT for Ad Hoc Document Retrieval. arXiv:1903.10972, 2019. [pdf]
  7. Zhao, Wayne Xin, Jing Liu, Ruiyang Ren and Ji-Rong Wen. Dense Text Retrieval based on Pretrained Language Models: A Survey. arXiv:2211.14876, 2022. [pdf]


RJG-UG-6:   Using LLMs to Generate Patient-Friendly Medical Imaging Reports

Background

This project will take place within the context of a collaboration between the School of Computer Science and the Clinical Medicine Division of the School of Medicine and Population Health at the University of Sheffield.

Medical imaging experts, e.g. radiologists, take scans of patients, such as x-rays or MRI scans. Once the scan has been carried out, the radiologist examines the scan and writes a report detailing what they have seen in the scan and offering an interpretation of these details. This report is then sent to the patient's GP, who explains and further interprets the report for the patient, in language the patient can understand. The radiologist's report is written on the assumption that it will be read by a GP or another clinical expert, and hence uses a great deal of medical language and assumes a level of medical knowledge one would expect of a clinical professional.

Recent legislation pertaining to a patient's right to access their medical record is now giving patients direct access to radiological reports of any scans they have had. However, while it may be a good thing that patients can access their records, for the most part they cannot fully understand them or place them in context. Many turn to the internet to attempt to understand their reports and this can lead to significant anxiety, as they may not find the correct information on the internet or be able to understand it or relate it to what is going on in their specific case.

Thus, there is a strong case for generating patient-friendly summaries of medical imaging reports that explain the reports without medical jargon and with appropriate contextual information. While ideally these would be written by the radiologist in addition to their report for the GP, this is simply not possible given the shortage of radiologists and the extreme workload pressure they are already under.

Large language models (LLMs) have shown impressive abilities to summarise complex documents and to generate text in various styles. An obvious idea therefore suggests itself: can we use LLMs to automatically generate accurate, simplified, patient-friendly summaries of medical imaging reports? This project will be part of an effort to answer this question.
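By way of illustration, the sketch below prompts an instruction-tuned LLM to rewrite a short, entirely fictional report (everything here is an assumption: the checkpoint, the prompt wording and the example report are illustrative only, and any real system would need careful evaluation for factual accuracy and safety before being shown to patients):

    # Sketch: prompting an instruction-tuned model to produce a
    # patient-friendly rewrite of a fictional, illustrative report.
    # A smaller instruction-tuned model could be substituted.
    from transformers import pipeline

    generator = pipeline("text-generation",
                         model="HuggingFaceH4/zephyr-7b-beta")  # assumed checkpoint

    report = ("Chest X-ray: No focal consolidation. Mild cardiomegaly. "
              "No pleural effusion or pneumothorax.")
    prompt = ("Rewrite the following radiology report for a patient with no "
              "medical training, avoiding jargon and explaining any findings:\n"
              + report)

    result = generator(prompt, max_new_tokens=200, do_sample=False)
    print(result[0]["generated_text"])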

Project Description

The project will proceed by:

Prerequisites

Interest in neural networks and language processing. No mandatory module pre- or co-requisites, though the Reinforcement Learning and Text Processing modules could be useful.

Initial Reading and Useful Links

  1. Jurafsky, Daniel and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. The 2nd ed. is available as an e-book from the University Library. The 3rd edition in draft form (much more up to date) is available at: Speech and Language Processing (3rd ed. draft). See especially, in the 3rd edition, Chapter 14 on Question Answering and Information Retrieval and Chapters 5-12 for background on neural methods in NLP.
  2. Tunstall, Lewis, Leandro von Werra and Thomas Wolf. Natural Language Processing with Transformers, Revised Edition. O'Reilly, 2022.
  3. Hou, S.L., Huang, X.K., Fei, C.Q. et al. A Survey of Text Summarization Approaches Based on Deep Learning. Journal of Computer Science and Technology 36, 633--663, 2021. [pdf]
  4. Gao, Yunfan, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang and Haofen Wang. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997v5, 2024. [pdf]
  5. Zhao, Penghao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, Bin Cui. Retrieval-Augmented Generation for AI-Generated Content: A Survey. arXiv:2402.19473, 2024. [pdf]

Last modified April 23 2024 by Rob Gaizauskas