Rob Gaizauskas : UG Projects 2025-26
Email: Rob Gaizauskas
Project Titles:
RJG-UG-1: Aspect-based Sentiment Analysis Using LLMs
RJG-UG-2: Developing a Neural Dialogue State Tracking System for Task-Based Dialogues
RJG-UG-3: Exploring the Use of Static and Contextual Word Embeddings for Dense Information Retrieval
RJG-UG-4: Developing a Voice-Controlled Robotic Arm
RJG-UG-5: Build and Train a Mini-Transformer from Scratch
RJG-UG-6: Vision Language Models for Describing Human Actions in Images
RJG-UG-7: Using LLMs to Generate Patient-Friendly Medical Imaging Reports
RJG-UG-1:   Aspect-based Sentiment Analysis using LLMs
TLDR: In this project you will learn how to design, build and evaluate a sentiment analysis system that uses Large Language Models to automatically read on-line customer product and service reviews and generate actionable intelligence for the companies delivering these products and services.
Background
As commerce has moved to the Web there has been an explosion of on-line customer-provided content reviewing the products (e.g. cameras, phones, televisions) and services (e.g. restaurants, hotels) that are on offer. There are far too many of these reviews for anyone to read, and hence there is tremendous potential for software that can automatically identify and summarise customers' opinions on various aspects of these products and services. The output of such automatic analysis would be of great benefit both to consumers trying to decide which product or service to purchase and to product and service providers trying to improve their offerings or understand the strengths and weaknesses of their competitors.
By aspects of products and services we mean the typical features or characteristics of a product or service that matter to a customer or are likely to be commented on by them. For example, for restaurants we typically see diners commenting on the quality of the food, the friendliness or speed of service, the price, the ambience or atmosphere of the restaurant, and so on. The automatic identification of aspects of products or services in customer reviews and the determination of the customer sentiment with respect to them are tasks that natural language processing researchers have been studying for over a decade now.
As is common in the field, shared task challenges -- community-wide efforts to define a task, develop data resources and metrics for training and testing, and run a friendly competition to help identify the most promising approaches -- have appeared addressing the tasks of aspect identification and sentiment determination. SemEval, an annual forum for the evaluation of language processing technologies across a range of tasks involving understanding some aspect of natural language, ran challenges on Aspect-Based Sentiment Analysis (ABSA) in 2014, 2015 and 2016. Since then, a wide range of further challenges has been run looking at variants of the task and data setting, including additional languages, social media data, complex multi-aspect sentential data, and so on.
Initially, most computational approaches to the task used statistical, feature-engineering-based machine learning technology. But with the advent of deep learning, and subsequently LLMs, state-of-the-art approaches now tend to use LLMs, and these will form the focus of the current project.
Project Description
This project will begin by reviewing:
- the fundamentals of deep learning for natural language processing and how to use pretrained LLMs for NLP tasks
- existing frameworks for analysing sentiment and opinion in text
- existing datasets for training and evaluating ABSA systems
- existing algorithms for aspect-based sentiment analysis
- existing natural language processing and machine learning models, tools and toolkits, e.g. Hugging Face, TensorFlow, PyTorch, Keras and scikit-learn
Following this review, one or more approaches to ABSA will be chosen and implemented, building on existing ML and NLP resources. The resulting algorithm(s) will be evaluated using SemEval and possibly more recent datasets and compared against existing state of the art results. Refinements will be made to the algorithms, as far as time permits. Another line of possible work is to consider how best to summarise and present the results of an ABSA system to end users.
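To give a concrete flavour of the kind of system involved, the following minimal sketch shows one way a pretrained model from the Hugging Face Hub might be used for aspect-level sentiment classification; the model choice, the fixed aspect list and the hypothesis template are illustrative assumptions rather than part of the project specification.

    # A minimal aspect-sentiment sketch: given a review and a fixed list of aspects,
    # score positive/negative/neutral sentiment for each aspect with a zero-shot
    # classifier from the Hugging Face Hub (the model choice is illustrative).
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")  # assumed model choice

    review = "The food was superb but the service was painfully slow."
    aspects = ["food", "service", "price", "ambience"]
    labels = ["positive", "negative", "neutral"]

    for aspect in aspects:
        result = classifier(
            review,
            candidate_labels=labels,
            hypothesis_template=f"The sentiment towards the {aspect} is {{}}.",
        )
        # The highest-scoring label is the predicted sentiment for this aspect.
        print(aspect, result["labels"][0], round(result["scores"][0], 3))

A full ABSA system would also need to extract the aspect terms themselves rather than assume a fixed list, which is where the SemEval task definitions and datasets come in.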
Prerequisites
Some machine learning and Python programming skills (such as those acquired in the Data Driven Computing module), together with an interest in natural language processing and machine learning, are the only prerequisites for the project.
Initial Reading and Useful Links
- Liu, Bing. Sentiment Analysis and Opinion Mining. Morgan and Claypool Publishers, 2012. [pdf]
- Pang, Bo and Lillian Lee. Opinion Mining and Sentiment Analysis. Found. Trends Inf. Retr. 2, 1--135, 2008. [pdf]
- Wikipedia article on SemEval
- Aspect-Based Sentiment Analysis (ABSA) in SemEval2016
- Jurafsky, Daniel and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. The 3rd edition, which you should use, is available in draft form at: Speech and Language Processing (3rd ed. draft). See Chapters 5-12 for background on neural methods in NLP; Chapters 4 and 22 are relevant for the task of sentiment analysis.
Contact supervisor for further references.
RJG-UG-2:   Developing a Neural Dialogue State Tracking System for Task-Based Dialogues
TLDR: In this project you will learn how to design, build and evaluate a system for tracking the information state in a task-based dialogue, a dialogue in which a conversational agent interacts with a human user in natural language to assist the user in completing a fixed task, such as booking a restaurant or renewing their car insurance.
Background
Task-based dialogues are dialogues in which a human user interacts with an automated system using natural language in order to carry out a well-defined task such as booking a ticket (e.g. train, airline, theatre), finding a restaurant, cooking a dish, getting customer support, and so on. Such systems stand in contrast to chatbots, which are designed to carry on an intelligent conversation, but not necessarily to help a user complete a fixed task. With the rapid improvement in speech recognition technology, task-based spoken language dialogue systems are becoming increasingly common in very many application areas.
Dialogue state tracking (DST) is the task of accurately identifying the sequence of information constraints or requests a user's utterances convey within a task-based dialogue. At any point in a dialogue, the system relies on its representation of the dialogue state to determine what it should do next, e.g. precisely what information to request from the user or what information to supply to the user. In the worked example given in [1], the lines starting "dialogue state" show what the dialogue state tracker should produce given the dialogue so far; a toy illustration of such a state is sketched below.
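To make the notion of a dialogue state concrete, here is a toy sketch in which the state is simply a dictionary of slot-value pairs updated after each user turn; the slot names loosely follow the MultiWOZ convention and the keyword rules are only a stand-in for the neural tracker the project would actually build.

    # A toy illustration of a dialogue state as slot-value pairs, updated turn by turn.
    # Slot names loosely follow the MultiWOZ convention; the keyword rules below are
    # only a stand-in for the neural tracker the project would actually build.
    def update_state(state: dict, user_utterance: str) -> dict:
        """Very crude keyword-based state update, for illustration only."""
        text = user_utterance.lower()
        if "cheap" in text:
            state["restaurant-pricerange"] = "cheap"
        if "centre" in text or "center" in text:
            state["restaurant-area"] = "centre"
        if "italian" in text:
            state["restaurant-food"] = "italian"
        return state

    state = {}
    for turn in ["I'm looking for a cheap restaurant.",
                 "Somewhere in the centre, serving Italian food."]:
        state = update_state(state, turn)
        print("user:", turn)
        print("dialogue state:", state)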
Dialogue state tracking became the focus of a series of shared task challenges, which ran from 2013 to 2020. These challenges precisely specified a DST task, provided training and test data, and defined evaluation metrics to allow participating systems' performance on the task to be measured and compared. Other datasets designed to facilitate DST research have also been created, including the MultiWOZ dataset, which contains over 10,000 task-based dialogues from seven domains.
Many approaches to DST have been investigated, ranging from rule-based to neural approaches. Following broader trends in NLP, neural approaches are currently attracting the most attention and appear the most promising.
Project Description
The aim of this project is to design, build and evaluate a neural dialogue state tracking system, most likely using the MultiWoz [2] dataset for training and testing.
The project will proceed by:
- reviewing existing approaches and tools for dialogue state tracking, both conventional and deep learning approaches [1,3,5];
- reviewing existing toolkits and libraries for deep learning, such as TensorFlow, PyTorch, Keras and Hugging Face;
- reviewing and selecting a dataset for training and testing (a sketch of how one candidate dataset might be loaded follows this list);
- designing, implementing and testing one or more approaches to neural dialogue state tracking using the dataset selected for the project;
- evaluating the system(s) developed in the preceding step.
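As an example of the dataset-selection step above, the following sketch shows how the MultiWOZ 2.2 release might be inspected using the Hugging Face datasets library; the dataset identifier and field names are assumptions that would need to be verified against the current release.

    # A minimal sketch of inspecting a candidate DST dataset with the Hugging Face
    # datasets library. The identifier "multi_woz_v22" refers to the MultiWOZ 2.2
    # release on the Hub; the id and the field names used here are assumptions to
    # verify against the current release before building on them.
    from datasets import load_dataset

    multiwoz = load_dataset("multi_woz_v22", split="train")
    dialogue = multiwoz[0]
    print(dialogue["dialogue_id"])
    # Print the first few utterances of the first dialogue.
    for utterance in dialogue["turns"]["utterance"][:4]:
        print(utterance)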
Prerequisites
Interest in neural networks and language processing. No mandatory module prerequisites, but modules on Machine Learning and Text Processing are useful.
Initial Reading and Useful Links
- Balaraman, Vevake, Seyedmostafa Sheikhalishahi and Bernardo Magnini. Recent Neural Methods on Dialogue State Tracking for Task-Oriented Dialogue Systems: A Survey. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue , 239--251, 2021. [pdf]
- Budzianowski, Pawel, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan and Milica Gasic. MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 5016--5026, 2018. [pdf]
- Williams, Jason, Antoine Raux and Matthew Henderson. The Dialog State Tracking Challenge Series: A Review. Dialogue & Discourse , 2016. [pdf]
- Jurafsky, Daniel and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. The 3rd edition, which you should use, is available in draft form at: Speech and Language Processing (3rd ed. draft). See especially, in the 3rd edition, Chapter 15 on Chatbots and Dialogue Systems and Chapters 5-12 for background on neural methods in NLP.
- Feng, Yujie, Zexin Lu, Bo Liu, Liming Zhan and Xiao-Ming Wu. Towards LLM-driven Dialogue State Tracking. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 739–755, Singapore, 2023. [pdf]
Contact supervisor for further references.
RJG-UG-3:   Exploring the Use of Static and Contextual Word Embeddings for Dense Information Retrieval
TLDR: In this project you will learn how to design, build and evaluate search engines using the latest neural document representations and large scale dense vector clustering and matching algorithms, the core technologies underlying RAG: Retrieval-Augmented Generation.
Background
Information retrieval (IR), more accurately called document retrieval, is the task of retrieving documents relevant to a user query from a potentially very large collection of documents. It is the key computational technology underlying search engines such as Google, Bing and DuckDuckGo.
For decades, the dominant model in information retrieval has been the vector space model in which both documents and queries are represented as vectors and relevance between queries and documents is typically measured using the cosine similarity measure, which treats the cosine of the angle between query and document vectors as a measure of their similarity.
In this standard model, the vectors representing the documents and queries are high-dimensional term vectors, where each position i in the vector for document d corresponds to a term t_i in the vocabulary of the document collection and the value assigned to position i is the term weight assigned to t_i in d. This weight varies according to the specific scheme chosen, but a very common choice is the so-called tf-idf scheme, where the weight is the product of the term frequency (tf) and inverse document frequency (idf) of t_i in d. More precisely, tf is a measure of how often the term t_i occurs in d, and idf is inversely related to the document frequency of t_i, i.e. the count of how many documents in the collection t_i occurs in. Thus idf captures the intuition that terms that occur everywhere should be weighted less than those that occur only in a narrow subset of documents.
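As a concrete illustration of the weighting and similarity computation just described, here is a minimal sketch using scikit-learn's stock implementations on an invented toy collection (the documents and query are made up for illustration).

    # Term-vector retrieval in miniature: tf-idf weighting plus cosine similarity,
    # using scikit-learn's stock implementations on an invented toy collection.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the firm announced a take-over of its main rival",
        "the acquisition of the rival firm was completed in May",
        "cranes are large wading birds found in wetlands",
    ]
    query = ["news about the acquisition of a rival company"]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)   # one tf-idf vector per document
    query_vector = vectorizer.transform(query)     # query mapped into the same space

    # Cosine of the angle between query and document vectors as the relevance score.
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
        print(f"{score:.3f}  {doc}")

Note that the "take-over" document matches the "acquisition" query only through incidentally shared words, which leads directly to the defect discussed next.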
This approach suffers from one major defect: it captures query-document similarity solely in terms of term overlap between the document and query and not in terms of meaning similarity. Documents that express the same meaning but use different words to express it will not be deemed similar. So, for example, there is no similarity between the terms "take-over" and "acquisition", as these are distinct word forms, and thus queries using one of these words will not match documents containing the other, when using the vector space approach. Furthermore, documents that use the same words, but where the words have different senses in the different documents, will be deemed similar even though they are not (so documents about cranes, the wading birds, may be confused with those about cranes, the construction machinery).
Starting with the 2013 introduction of word embeddings as a means for representing word meaning [1,4] and following on with the introduction of contextual embeddings, which model the meaning of words in context [1,3], novel representations of words as dense vectors derived from neural nets have become available. These representations underlie all the recent advances in NLP, such as LLMs. Their usage in Generative AI has been impressive and highly publicised. However, while such representations hold out the promise of addressing the major defect of the conventional vector space model for IR identified above, their application in IR has taken longer to mature and remains the subject of significant research. Nonetheless, new approaches to, and uses of, document retrieval that exploit these word and sentence embeddings are emerging, most notably in the area now called RAG: Retrieval-Augmented Generation. RAG has become a key technology for organisations who want to use Generative AI, but need systems that have access to information that is more up to date, or more specialised, than that available in general pretrained models, such as GPT or ChatGPT.
This project will adopt a scientific approach to exploring how static and contextual word and sentence embeddings, such as those derived from transformers, may be used for information retrieval and determine whether they can significantly improve retrieval performance as compared to the conventional vector space model.
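By way of contrast with the sparse tf-idf sketch above, the following minimal sketch ranks the same toy documents using dense sentence embeddings; the sentence-transformers package and the particular model name are assumed choices, not project requirements.

    # The same toy collection ranked with dense sentence embeddings instead of
    # sparse term vectors. The sentence-transformers package and the model name
    # are assumed choices for illustration.
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed small embedding model

    docs = [
        "the firm announced a take-over of its main rival",
        "the acquisition of the rival firm was completed in May",
        "cranes are large wading birds found in wetlands",
    ]
    query = "news about the acquisition of a rival company"

    doc_embeddings = model.encode(docs)
    query_embedding = model.encode([query])

    scores = cosine_similarity(query_embedding, doc_embeddings)[0]
    for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
        print(f"{score:.3f}  {doc}")

Unlike the tf-idf version, the embedding model should rank the "take-over" document highly even though it never uses the word "acquisition".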
Project Description
The aim of this project is to design, build and evaluate several IR systems that use static and contextual word embeddings and to compare them with each other and with a conventional vector space model system, using a test collection of queries and documents for which gold standard relevance judgements are available, such as the MS MARCO dataset. It will likely use the Pyserini toolkit [3], an extensible platform for research on both conventional and dense retrieval methods containing existing implementations of both, as a starting point.
The project will proceed by:
- reviewing existing approaches for IR, both conventional models such as the vector space model and more recent deep learning approaches which exploit static and contextual word embeddings;
- reviewing existing models for obtaining static and contextual word embeddings, such as Word2vec and BERT-based models, and toolkits/libraries which may be used to get these embeddings, such as Gensim and Hugging Face;
- identifying one or more relevant IR test collections, such as the MS MARCO passage corpus
- designing, implementing and testing a conventional vector space IR model, an approach based on static word embeddings and an approach based on contextual embeddings, using existing tools and toolkits.
- evaluating and comparing the system(s) developed in the preceding step.
Prerequisites
Interest in neural networks and language processing. No mandatory module prerequisites, but any of the Machine Learning, Text Processing and NLP modules would be useful.
Initial Reading and Useful Links
- Jurafsky, Daniel and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. The 3rd edition, which you should use, is available in draft form at: Speech and Language Processing (3rd ed. draft). See especially, in the 3rd edition, Chapter 14 on Question Answering and Information Retrieval and Chapters 5-12 for background on neural methods in NLP.
- Le, Quoc and Tomas Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 , II-1188-II-1196, 2014. [pdf]
- Lin, Jimmy, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep and Rodrigo Nogueira. Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021) , 2356--2362, 2021. [pdf]
- Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States. , 3111--3119, 2013. [pdf]
- Onal, Kezban Dilek, Ye Zhang, Ismail Sengor Altingovde, Md Mustafizur Rahman, Pinar Karagoz, Alex Braylan, Brandon Dang, Heng-Lu Chang, Henna Kim, Quinten McNamara, Aaron Angert, Edward Banner, Vivek Khetan, Tyler McDonnell, An Thanh Nguyen, Dan Xu, Byron C. Wallace, Maarten de Rijke and Matthew Lease. Neural information retrieval: at the end of the early years. Information Retrieval Journal 21(2), 111--182, 2018. [pdf]
- Yang, Wei, Haotian Zhang and Jimmy Lin. Simple Applications of BERT for Ad Hoc Document Retrieval. arXiv:1903.10972, 2019. [pdf]
- Zhao, Wayne Xin, Jing Liu, Ruiyang Ren and Ji-Rong Wen. Dense Text Retrieval based on Pretrained Language Models: A Survey. arXiv:2211.14876, 2022. [pdf]
Contact supervisor for further references.
RJG-UG-4:   Developing a Voice-Controlled Robotic Arm
TLDR: Working with one of the Engineering Faculty's robotic arms, in this project you will learn how to integrate speech, language, vision and robotics technologies to build and evaluate a voice-driven control system to allow a human to instruct a robot arm to pick and place objects in its environment.
Background
As cobots -- collaborative robots -- become safer and cheaper and hence more common, the need arises for a control interface that allows untrained, everyday users to interact with and instruct these devices to carry out various tasks. For example, a chemist might want to instruct a robot to mix together two potentially hazardous chemicals inside a safe enclosure; a disabled person might want to instruct a robot to place something within their reach; a surgeon might want to instruct a robot nurse to pass them a scalpel.
The most natural way for humans to communicate is via spoken language. Given advances in spoken language technologies, can we now build voice-controlled robots to begin to address the vision outlined above of humans collaborating with robots in real-world environments to carry out simple tasks?
Project Description
The aim of this project is to explore the challenges of integrating speech, language, vision and robotics technologies to build and evaluate a voice-driven control system to allow a human to instruct a robot arm to pick and place objects in its environment.
The project will centre on the use of one of the robot arms available in the Engineering Faculty, which include the KUKA youBot and the UR10 CB3. These robots are compatible with ROS and have models in Gazebo, a simulation environment. The project will use an open source speech recognition model, such as Whisper. The approach to vision is flexible, ranging from relying on barcodes or RFID tags to identify objects in the environment to sophisticated vision-language models that can produce richer natural language descriptions of visual scenes. The core control challenge is translating natural language instructions, as derived from the recognised speech, into an executable plan for the robot. Recent advances in using Large Language Models (LLMs) to translate natural language instructions into executable robot plans [2] suggest this is a promising way to address this problem, and this approach will be investigated in the project.
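As a rough illustration of the instruction-to-plan step only, the sketch below sends a recognised utterance to an LLM with a prompt that constrains the output to a small JSON action schema; call_llm is a hypothetical stand-in (stubbed here with a canned response) for whichever LLM API the project adopts, the action schema is invented, and real execution would of course go through ROS rather than a print statement.

    # A sketch of the instruction-to-plan step only: a recognised utterance is sent
    # to an LLM with a prompt that constrains the output to a small JSON action
    # schema. call_llm is a hypothetical stand-in for the chosen LLM API (stubbed
    # here with a canned response) and the action schema is invented; execution of
    # the resulting plan would go through ROS, not a print statement.
    import json

    ACTION_SCHEMA = (
        'Respond only with JSON of the form: '
        '{"action": "pick_and_place", "object": "<object>", "destination": "<location>"}'
    )

    def call_llm(prompt: str) -> str:
        """Hypothetical LLM call; returns a canned response in this sketch."""
        return ('{"action": "pick_and_place", "object": "red block", '
                '"destination": "blue tray"}')

    def utterance_to_plan(utterance: str) -> dict:
        prompt = f"{ACTION_SCHEMA}\n\nInstruction: {utterance}"
        return json.loads(call_llm(prompt))   # parse the constrained JSON plan

    plan = utterance_to_plan("put the red block in the blue tray")
    print(plan)   # this plan would then be handed to the robot controller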
The project will proceed by:
- reviewing existing literature on voice-controlled robotic systems, open source speech recognition models, vision-language models and LLMs for robot control
- determining which robot arm will be used and setting up the software environment to control the arm and to simulate the environment
- designing the task environment, e.g. selecting the objects and obstacles, if any, to include in the environment, specifying the task, e.g. moving certain objects from one place to another, and verifying the objects are perceivable by the robot's sensors and graspable by its actuators
- selecting which models and software libraries will be used for vision, speech recognition and robot control
- integrating and adapting the models and libraries chosen
- evaluating the resulting system across a range of similar tasks and, ideally, users.
Prerequisites
Interest in AI and confidence/willingness to address the low-level software engineering challenges of working with many different packages and frameworks to integrate the many components necessary here into a working system.
Initial Reading and Useful Links
- Jiaqi Wang, Enze Shi, Huawen Hu, Chong Ma, Yiheng Liu, Xuhui Wang, Yincheng Yao, Xuan Liu, Bao Ge and Shu Zhang. Large language models for robotics: Opportunities, challenges, and perspectives. Journal of Automation and Intelligence, Volume 4, Issue 1, Pages 52-64, 2025. ISSN 2949-8554. [Link]
- Jurafsky, Daniel and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. The 3rd edition, which you should use, is available in draft form at: Speech and Language Processing (3rd ed. draft). See especially, in the 3rd edition, Chapter 14 on Question Answering and Information Retrieval and Chapters 5-12 for background on neural methods in NLP.
Contact supervisor for further references.
RJG-UG-5:   Build and Train a Mini-Transformer from Scratch
TLDR: In this project you will learn how to build, train and evaluate a small-scale transformer model from scratch.
Background
As Jurafsky and Martin say in their landmark textbook [1], "Transformer-based large language models have completely changed the field of speech and language processing." This is no understatement: virtually every major development in speech and language processing, not to mention in many other areas of AI such as vision language models and robot planning, has been informed by the transformer architecture, since its introduction by Vaswani et al. in 2017 [2].
Nowadays, access to transformer-based models has been made easy through libraries like Hugging Face. These allow developers to build applications around transformer-based models without having to learn how they work or, in many cases, having to train them at all, or only minimally fine-tune them. However, as computer scientists, many of us naturally want to understand these things, and not simply take them on faith. Understanding them allows us to better appreciate their strengths and weaknesses, lets us deploy them more competently and creatively, and puts us in a position where we can think creatively about what the next big step beyond transformers might be. As our University's motto says: Rerum Cognoscere Causas. And there is no better way to understand an algorithm or set of complementary algorithms than to reimplement them oneself.
While most pre-trained LLMs are huge (billions of parameters) and trained on colossal datasets, there is no reason why a small-scale transformer should not be built, which will serve to educate its developer in all the underlying principles and techniques. Furthermore, given the notoriously high energy/computing costs of training and performing inference on very large open source and commercial LLMs, there is a growing field of study focussing on smaller-scale and energy-efficient LLMs, to promote greener, sustainable AI (see, e.g. [3]).
Project Description
The aim of this project is to build, train and evaluate a small-scale transformer model from scratch. Starting from neural net building blocks, such as those found in the PyTorch library, the goal will be to build and train a transformer-based language model, ideally supporting both causal language modelling (like the GPT family) and bidirectional language modelling (like the BERT family of encoders). These models should be small enough that they can be trained and perform inference on consumer-grade hardware, such as the laptops most students are likely to have, using either the CPU or a GPU if available.
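To make the intended scale concrete, here is a minimal sketch of a single pre-norm transformer block assembled from stock PyTorch modules; the hyperparameters are arbitrary, and token embeddings, positional encodings, causal masking and the training loop are all omitted.

    # A single pre-norm transformer block assembled from stock PyTorch modules.
    # Hyperparameters are arbitrary; embeddings, positional encodings, causal
    # masking and the training loop are omitted from this sketch.
    import torch
    import torch.nn as nn

    class MiniTransformerBlock(nn.Module):
        def __init__(self, d_model: int = 128, n_heads: int = 4, d_ff: int = 512):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Self-attention sub-layer with a residual connection.
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
            # Position-wise feed-forward sub-layer with a residual connection.
            return x + self.ff(self.norm2(x))

    x = torch.randn(2, 16, 128)               # (batch, sequence length, model dim)
    print(MiniTransformerBlock()(x).shape)    # torch.Size([2, 16, 128])

A full model would stack several such blocks beneath an embedding layer and a language-modelling head, with a causal attention mask for the GPT-style variant.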
The project will proceed by:
- studying the mathematical foundations of transformers, until the key principles are clearly understood, e.g. Chapters 5-12 in [1]
- reviewing available deep learning libraries/frameworks that support neural net development projects, such as TensorFlow, PyTorch and Keras
- reviewing existing on-line guides to "build your own transformer" to get ideas about how to get started on building the transformer
- based on the two preceding steps, selecting the development framework and general approach and installing all relevant software on the development machine
- selecting and acquiring one or more training datasets
- developing and testing the transformer, ideally both a causal and a bidirectional variant, to ensure it is working properly -- both the training and inference components
- identifying one or more benchmark tasks, test sets and existing transformer-based models (e.g. from Hugging Face) to compare with
- evaluating the developed transformer(s) using the chosen task(s) and test(s) and comparing the results with existing models
Prerequisites
Keenness to understand transformers from the ground up and the mathematical skills to do so. Knowledge of Python programming and machine learning basics helpful.
Initial Reading and Useful Links
- Jurafsky, Daniel and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. The 3rd edition, which you should use, is available in draft form at: Speech and Language Processing (3rd ed. draft). See Chapters 5-12 for background on neural methods in NLP.
- Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems , 6000-6010, 2017.
- Rehman, Tohida et al. How Green are Neural Language Models? Analyzing Energy Consumption in Text Summarization Fine-tuning. https://arxiv.org/abs/2501.15398v2, 2025.
- Sayed, Ebad. Building a Transformer from Scratch: A Step-by-Step Guide. 2024.
- Anjilakshetri. Build Your Own Transformer : A Complete Step-by-Step Implementation Guide. 2025.
Contact supervisor for further references.
RJG-UG-6:   Vision Language Models for Describing Human Actions in Images
TLDR: In this project you will learn about the latest generation of vision language models (VLMs) through evaluating and fine-tuning them for the task of describing human actions in images.
Background
While object detection algorithms have become quite good at identifying and localising instances of object types in images, the task of classifying human action types in images is much harder, and computer vision has traditionally struggled with this challenge. This capability is an important aspect of artificial intelligence in general and has specific application in, e.g. image retrieval and classification.
In the past few years vision language models (VLMs) have risen to prominence for their remarkable abilities to generate descriptions of images, answer questions about images and, in some cases, generate images from textual descriptions [1]. A natural question to ask is: how good are these new VLMs at classifying actions in images and at identifying the role players participating in these actions?
Over the years, the computer vision community has created various datasets that focus on different aspects of this challenge, including Imsitu, HICO, the HL dataset and the Visual Genome. These datasets support both the training and evaluation of VLMs, though one issue here is the extent to which they can reliably be used for the evaluation of open source or commercial VLMs, since they may already have been scraped and used in training by the creators of pretrained VLMs.
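As a flavour of the kind of baseline experiment involved, the sketch below asks a captioning-style VLM to describe an image; the model choice is an assumption and "cycling.jpg" is a placeholder path, to be replaced by images from the benchmark datasets above.

    # A minimal baseline query to a captioning-style VLM. The model choice is an
    # assumption and "cycling.jpg" is a placeholder path to be replaced by images
    # from the chosen benchmark dataset.
    from transformers import pipeline

    captioner = pipeline("image-to-text",
                         model="Salesforce/blip-image-captioning-base")  # assumed model

    result = captioner("cycling.jpg")
    print(result[0]["generated_text"])   # e.g. a caption like "a man riding a bike ..."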
Project Description
The aims of this project are:
- to investigate how well current VLMs can recognise human actions in images as well as identify the key role players in these images (e.g. recognise an action not just as riding, but as, e.g. man riding bicycle through forest) by evaluating them against various benchmark datasets and addressing various evaluation tasks (e.g. question answering vs classification)
- to explore whether fine-tuning these models can improve their performance
- as a stretch goal, to investigate how well current VLMs can generate images of human actions with appropriate role players
The project will proceed by:
- reviewing existing literature on VLMs (how they work and how they can be fine-tuned) and on existing evaluation benchmarks for assessing them
- selecting which VLMs and datasets will be investigated in the project, downloading these or arranging for them to be used locally (this may involve registering for the University's HPC facilities); one outcome could be the decision to create a new evaluation dataset
- conducting baseline experiments with the models and evaluation datasets "out of the box"
- designing and implementing an approach to fine-tuning selected models on selected datasets
- evaluating the resulting fine-tuned models and comparing them to baseline results
- (stretch) exploring VLMs' capabilities to generate images and evaluating the results [3]
Prerequisites
Initial Reading and Useful Links
- Florian Bordes, et al. (2024). An Introduction to Vision-Language Modeling. https://arxiv.org/abs/2405.17247.
- Mark Yatskar, Luke Zettlemoyer and Ali Farhadi (2016). Situation Recognition: Visual Semantic Role Labeling for Image Understanding. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.5534-5542.
- Shin, J., Hassan, N., Miah, A. S. M., & Nishimura, S. (2025). A Comprehensive Methodological Survey of Human Activity Recognition Across Diverse Data Modalities. Sensors, 25(13), 4028. https://doi.org/10.3390/s25134028.
- Pratyusha Sharma, et al. (2025). A Vision Check-up for Language Models. https://arxiv.org/abs/2401.01862.
Contact supervisor for further references.
RJG-UG-7:   Using LLMs to Generate Patient-Friendly Medical Imaging Reports
TLDR: In this project you will learn how to design, build and evaluate a text simplification and summarisation system that uses Large Language Models to read technical medical imaging reports written by clinical radiologists for a patient's GP and re-expresses the key findings in them in simplified language understandable by the patient themselves.
Background
This project will take place within the context of a collaboration between the School of Computer Science and the Clinical Medicine Division of the School of Medicine and Population Health at the University of Sheffield.
Medical imaging experts, e.g. radiologists, take scans of patients, such as x-rays or MRI scans. Once the scan has been carried out, the radiologist examines the scan and writes a report detailing what they have seen in the scan and offering an interpretation of these details. This report is then sent to the patient's GP, who explains and further interprets the report for the patient, in language the patient can understand. The radiologist's report is written assuming it is to be read by a GP or another clinical expert, and hence uses a great deal of medical terminology and assumes a level of medical knowledge one would expect of a clinical professional.
Recent legislation pertaining to a patient's right to access their medical record is now giving patients direct access to radiological reports of any scans they have had. However, while it may be a good thing that patients can access their records, for the most part they cannot fully understand them or place them in context. Many turn to the internet to attempt to understand their reports and this can lead to significant anxiety, as they may not find the correct information on the internet or be able to understand it or relate it to what is going on in their specific case.
Thus, there is a strong case for generating patient-friendly summaries of medical imaging reports that explain the reports without medical jargon and with appropriate contextual information. While ideally these would be written by the radiologist in addition to their report for the GP, this is simply not possible due to the shortage of radiologists and the extreme workload pressure they are already under.
Large language models (LLMs) have shown impressive abilities to summarise complex documents and to generate text in various styles. An obvious idea therefore suggests itself: can we use LLMs to automatically generate accurate, simplified, patient-friendly summaries of medical imaging reports?
This project will be part of an effort to answer this question.
Project Description
The project will proceed by:
- gaining essential background understanding of how LLMs work and of the toolkits and libraries available for using them [1,2]
- reviewing existing approaches to document summarisation using LLMs, e.g. [2, Chap. 8; 3]
- gathering a collection of medical imaging reports and associated patient-friendly summaries (colleagues in Clinical Medicine have agreed to help with this)
- designing, implementing and testing one or more transformer-based summarisation systems to generate patient-friendly summaries of medical imaging reports (a minimal sketch follows this list)
- evaluating and comparing the system(s) developed in the preceding step.
- optionally, given time, exploring how the systems developed might be improved by using Retrieval-Augmented Generation techniques [4,5], where a static external knowledge source relevant to the domain (e.g. Wikipedia, NHS web pages or other reliable digital medical knowledge sources designed to clearly explain medical topics) could be leveraged to improve the accuracy and comprehensibility of the generated summaries.
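The sketch referred to in the list above illustrates simplification framed as instruction-driven text-to-text generation with an open model; the model name is an assumption and the report text is invented, so no real patient data is involved.

    # Simplification framed as instruction-driven text-to-text generation. The
    # model name is an assumption and the report text is invented; no real patient
    # data is used.
    from transformers import pipeline

    generator = pipeline("text2text-generation", model="google/flan-t5-base")  # assumed

    report = ("Chest radiograph demonstrates a small left-sided pleural effusion "
              "with adjacent basal atelectasis. No focal consolidation.")
    prompt = ("Rewrite this radiology report in plain language a patient can "
              "understand, avoiding medical jargon: " + report)

    summary = generator(prompt, max_new_tokens=80)[0]["generated_text"]
    print(summary)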
Prerequisites
Interest in neural networks and language processing. No mandatory module pre- or co-requisites, though the Reinforcement Learning and Text Processing modules could be useful.
Initial Reading and Useful Links
- Jurafsky, Daniel and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. The 2nd edition is available as an e-book from the University Library. The 3rd edition in draft form (much more up to date) is available at: Speech and Language Processing (3rd ed. draft). See especially, in the 3rd edition, Chapter 14 on Question Answering and Information Retrieval and Chapters 5-12 for background on neural methods in NLP.
- Tunstall, Lewis, Leandro von Werra and Thomas Wolf. Natural Language Processing with Transformers, Revised Edition. O'Reilly, 2022.
- Hou, SL., Huang, XK., Fei, CQ. et al. A Survey of Text Summarization Approaches Based on Deep Learning. J. Comput. Sci. Technol. 36, 633–663 (2021). [pdf]
- Gao, Yunfan, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang and Haofen Wang. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997v5, 2024. [pdf]
- Zhao, Penghao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang and Bin Cui. Retrieval-Augmented Generation for AI-Generated Content: A Survey. arXiv:2402.19473, 2024. [pdf]
Last modified September 3 2025 by Rob Gaizauskas