Retrieving space-time relations in video
-
This study presents a robust region tracking method, as an
alternative to the conventional space-time interest point feature
based techniques, demonstrating that region descriptors can be
obtained for the action classification task.
A state-of-the-art human detection method is applied to build a model
incorporating generic object foreground segments.
Non-human objects that interact with a human in the video scene are
also included, in order to capture the action semantically.
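As an illustration of the kind of relation such region tracks can
expose, the sketch below labels a qualitative spatio-temporal
relation between a tracked human region and an object region from
their bounding boxes. It is a minimal, hypothetical example; the box
format, function names and thresholds are illustrative assumptions,
not the method of the papers below.

    # Hypothetical sketch: label a qualitative spatio-temporal
    # relation between two tracked regions. Boxes are (x, y, w, h);
    # the overlap threshold is illustrative only.

    def iou(a, b):
        """Intersection-over-union of two (x, y, w, h) boxes."""
        ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
        bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
        iw = max(0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union else 0.0

    def centroid(box):
        return (box[0] + box[2] / 2.0, box[1] + box[3] / 2.0)

    def relation(human_track, object_track):
        """Clip-level relation from two aligned box sequences."""
        if any(iou(h, o) > 0.1 for h, o in zip(human_track, object_track)):
            return "touching"
        dist = [((centroid(h)[0] - centroid(o)[0]) ** 2 +
                 (centroid(h)[1] - centroid(o)[1]) ** 2) ** 0.5
                for h, o in zip(human_track, object_track)]
        return "approaching" if dist[-1] < dist[0] else "moving apart"

    # Example: a human moving towards a static object over five frames.
    human = [(10 + 8 * t, 50, 40, 90) for t in range(5)]
    obj = [(120, 60, 30, 30)] * 5
    print(relation(human, obj))  # -> "approaching"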
-
AlHarbi and Gotoh (2015).
A unified spatio-temporal human body region tracking approach to action recognition.
Neurocomputing. -
AlHarbi and Gotoh (2015).
Describing spatio-temporal relations between object volumes in video streams.
AAAI - TrBA workshop, Austin. -
AlHarbi and Gotoh (2013).
Spatio-temporal human body segmentation from video stream.
CAIP, York.
Video sequence analysis
-
This study involves two fundamental problems: video representation
and video similarity measurement.
Robust and accurate representation of video streams, invariant to
scale, location and orientation changes, is explored in the spatial
and the temporal domains.
It is then used for finding similarity among multiple video streams.
The planned contributions are the following:
- Development of a 3D-SIFT descriptor (2D space and time) that extracts highly distinctive features, robust against spatial and temporal changes in video (a sketch of the underlying idea follows this list);
- Investigation of the most suitable feature representation for a video stream on a manifold, using dimensionality reduction techniques;
- Development of alignment techniques for multiple video streams, serving as a baseline for further applications in video sequence analysis such as event detection, video similarity, content copy detection and repetitive content detection;
- A study of instance search techniques where a query is in the form of a short video clip.
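The following toy sketch illustrates the core idea behind a
space-time extension of SIFT: orientation histograms computed over
the three-dimensional (x, y, t) gradient field of a video cuboid.
It is a conceptual illustration under stated assumptions (array
layout, bin counts), not the descriptor developed in the work below.

    # Sketch of a space-time gradient descriptor over an (x, y, t)
    # cuboid. Array layout and bin counts are assumptions.

    import numpy as np

    def cuboid_descriptor(video, n_bins=8):
        """video: (T, H, W) grayscale volume; returns a histogram of
        spatial orientations plus a crude temporal signature."""
        gt, gy, gx = np.gradient(video.astype(float))  # d/dt, d/dy, d/dx
        mag = np.sqrt(gx ** 2 + gy ** 2 + gt ** 2)
        spatial_ori = np.arctan2(gy, gx)               # in [-pi, pi]
        # Magnitude-weighted histogram of spatial orientations.
        hist, _ = np.histogram(spatial_ori, bins=n_bins,
                               range=(-np.pi, np.pi), weights=mag)
        # Gradient energy with positive vs non-positive temporal change.
        temporal = np.array([mag[gt > 0].sum(), mag[gt <= 0].sum()])
        desc = np.concatenate([hist, temporal])
        return desc / (np.linalg.norm(desc) + 1e-8)    # normalise

    rng = np.random.default_rng(0)
    clip = rng.random((16, 32, 32))                    # toy 16-frame cuboid
    print(cuboid_descriptor(clip).shape)               # (10,)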
-
AlGhamdi and Gotoh (2020).
Graph-based topic models for trajectory clustering in crowd videos.
Machine Vision and Applications, vol.31. -
AlGhamdi and Gotoh (2018).
Graph-based correlated topic model for trajectory clustering in crowded videos.
WACV, Lake Tahoe. -
AlGhamdi and Gotoh (2014).
Manifold matching with application to instance search based on video queries.
ICISP, Cherbourg.
Natural language description of video streams
-
Collections of digital images and videos have grown exponentially in
recent years, as more and more data becomes available in the form of
personal photo albums, handheld camera videos, feature films and
multi-lingual broadcast news videos, presenting visual data ranging
from unstructured to highly structured.
There is a need for qualitative filtering to find relevant
information according to user requirements.
Additionally, time constraints force one to be selective when
accessing the information needed.
Such a distillation process requires comprehensive information
processing including categorisation and summarisation of multimedia
resources.
One approach to addressing this issue is to convert such resources
into a more accessible form such as human language.
Most previous studies were related to semantic indexing of video
using keywords.
However, it is often difficult with keywords alone to represent
relations between the various entities and events in a video.
An interesting extension to a keyword-based scheme is natural
language textual description in a syntactically and semantically
correct formulation.
Such descriptions can clarify the context between keywords by
capturing their relations.
This work addresses the generation of natural language descriptions
for human actions, behaviour and their relations with other objects
observed in video streams.
The work starts with the implementation of conventional image
processing techniques to extract high level features (HLFs) from
individual video frames.
These may be `keywords', such as a particular object and its
position and moves, used for the semantic indexing task in video
retrieval.
The features are converted into natural language descriptions using
a context free grammar (CFG).
Although feature extraction processes are erroneous at various
levels, approaches are explored to put them together for producing
coherent descriptions.
The expected contributions are:
- Hand annotations of video clips consisting of key phrases, a short title, and a full description, and their comprehensive analysis to investigate human interests;
- Development of a template based approach to creating natural language descriptions from a set of identified HLFs using a CFG (a toy sketch follows this list);
- Scalability study, involving approaches to producing coherent descriptions for (1) potentially erroneous and missing HLFs and (2) video streams in different genres;
- Development of a scheme to generate coherent and compact descriptions for video streams;
- Identification and expression of spatial and temporal relations between humans and objects present in videos;
- Applications of generated natural language descriptions, such as video classification and summarisation.
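To make the template idea concrete, here is a minimal, hypothetical
sketch of CFG-based generation from a set of HLFs, including a
graceful fallback when an HLF is missing. The grammar, slot names
and fallback rule are illustrative assumptions only, not the
project's actual grammar.

    # Toy CFG: non-terminals expand to sequences of terminals and
    # non-terminals; symbols starting with '$' are HLF slots.

    import random

    GRAMMAR = {
        "S":  [["NP", "VP", "."]],
        "NP": [["$subject"]],
        "VP": [["is", "$action", "PP"], ["is", "$action"]],
        "PP": [["near", "the", "$object"]],
    }

    def required(symbol):
        """Smallest set of HLF slots some expansion of symbol needs."""
        if symbol.startswith("$"):
            return {symbol[1:]}
        if symbol not in GRAMMAR:
            return set()
        return min((set().union(*(required(s) for s in e))
                    for e in GRAMMAR[symbol]), key=len)

    def generate(symbol, hlf):
        """Expand symbol, filling '$' slots from the detected HLFs."""
        if symbol.startswith("$"):
            return [hlf[symbol[1:]]]
        if symbol not in GRAMMAR:
            return [symbol]
        # Use only expansions whose slots were all detected, so that
        # a missing or erroneous HLF degrades the output gracefully.
        usable = [e for e in GRAMMAR[symbol]
                  if set().union(*(required(s) for s in e)) <= set(hlf)]
        words = []
        for s in random.choice(usable):
            words += generate(s, hlf)
        return words

    print(" ".join(generate("S", {"subject": "a man", "action": "walking",
                                  "object": "car"})))
    print(" ".join(generate("S", {"subject": "a man", "action": "walking"})))
    # e.g. "a man is walking near the car ." / "a man is walking ."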
-
AlHarbi and Gotoh (2017).
Natural language descriptions for human activities in video streams.
INLG, Santiago de Compostela. -
Khan and Gotoh (2017).
Generating natural language tags for video information management.
Machine Vision and Applications, vol.28. -
Khan, AlHarbi and Gotoh (2015).
A framework for creating natural language descriptions of video streams.
Information Sciences, vol.303.
Statistical summarisation of spoken language
-
Automatic text summarisation is a difficult scientific problem.
It has been investigated for many years; there now exist systems that
can produce fluent summaries in specific domains.
However, their quality is still lower than that of human-authored
summaries.
Speech summarisation is an even more challenging task since it involves
additional issues such as the handling of automatic speech recognition
(ASR) errors and the need to segment the audio (or ASR transcription) by
speaker, topic, or sentence.
In comparison to the volume of text summarisation research, speech
summarisation is still a new, developing area with a relatively small
number of works.
The available linguistic resources are also sparse.
This study is concerned with the development of summarisation
technologies for news broadcasts.
It builds on our earlier projects relating to the structured
transcription of news broadcasts and spoken document retrieval.
The study focuses on research issues specific to handling speech,
rather than written texts.
The principal contributions are listed below:
- Development of maximum entropy (ME) modelling approaches to feature selection and broadcast speech segmentation at various levels (eg, story, sentence), and a study in cascading multiple systems (a toy ME sketch follows this list);
- Investigation of speaker independent prosodic features for information extraction from audio;
- A portability study of summarisation techniques from printed to broadcast news and the development of a speaker-role based news classification scheme, leading to the development of a multi-stage compaction approach to broadcast news summarisation;
- Data collection and annotation of over 40 hours of news broadcasts, including named entities (NEs), sentence and story boundaries, extractive and abstractive summaries;
- A study in human subjectivity, leading to the development of a cross comprehension test for the relative evaluation of machine generated summaries.
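As a toy illustration of the ME modelling mentioned in the first
item, the sketch below trains a logistic regression classifier (a
maximum entropy model) to estimate sentence boundary posteriors from
a few invented acoustic and lexical features. The feature set and
data are assumptions, not those used in the project.

    # ME-style sentence boundary detection on an ASR transcript,
    # using logistic regression as the maximum entropy classifier.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One row per inter-word gap: [pause_sec, next_word_is_filler,
    # speaker_changed]; label 1 marks a sentence boundary.
    X = np.array([[0.05, 0, 0], [0.70, 0, 0], [0.40, 1, 0],
                  [1.20, 0, 1], [0.10, 0, 0], [0.90, 0, 1]])
    y = np.array([0, 1, 0, 1, 0, 1])

    me = LogisticRegression().fit(X, y)
    print(me.predict_proba([[0.80, 0, 0]])[0, 1])  # P(boundary | features)

In a cascaded setting, such boundary posteriors would feed later
story segmentation and summarisation stages.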
-
Kolluru and Gotoh (2009).
On the subjectivity of human authored summaries.
Natural Language Engineering, vol.15, no.2. -
Christensen, Gotoh and Renals (2008).
A cascaded broadcast news highlighter.
IEEE Transactions on Audio, Speech and Language Processing, vol.16, issue 1. -
Kolluru and Gotoh (2007).
Speaker role based structural classification of broadcast news stories.
Interspeech, Antwerp.
Information access in speech
-
Simple statistical models underlie many successful applications of
speech and language processing.
The most accurate document retrieval systems are based on unigram
statistics.
The acoustic model of virtually all speech recognition systems is based
on stochastic finite state machines referred to as hidden Markov models
(HMMs).
The language (word sequence) model of state-of-the-art large
vocabulary speech recognition systems uses an n-gram model, that is,
an (n-1)th order Markov model.
Two important features of these simple models are their trainability and
scalability: in the case of language modelling, model parameters are
estimated from corpora.
These approaches have been extensively investigated and optimised for
speech recognition, in particular, resulting in systems that can perform
certain tasks with a high degree of accuracy.
More recently, similar statistical finite state models have been
developed for spoken language processing applications beyond direct
transcription to enable, for example, the production of structured
transcriptions which may include punctuation or content annotation.
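To illustrate the trainability of such models, the sketch below
estimates a bigram (first order Markov) language model by maximum
likelihood from a toy corpus. Real systems add smoothing for unseen
events, which is omitted here.

    # Bigram language model with maximum-likelihood estimates.

    from collections import Counter

    corpus = "the dog barked <s> the dog slept <s> the cat slept".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def p(word, prev):
        """ML estimate P(word | prev) = c(prev, word) / c(prev)."""
        return bigrams[(prev, word)] / unigrams[prev]

    print(p("dog", "the"))    # 2/3: "the" precedes "dog" twice, "cat" once
    print(p("slept", "dog"))  # 1/2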
-
Gotoh and Renals (2000).
Information extraction from broadcast news.
Philosophical Transactions of the Royal Society of London, series A, vol.358, issue 1769. -
Gotoh and Renals (2000).
Variable word rate n-grams.
ICASSP, Istanbul. -
Gotoh and Renals (1999).
Topic-based mixture language modelling.
Natural Language Engineering, vol.5, no.4.
Speech processing
-
Typically, parameter estimation for a hidden Markov model (HMM) is
performed using an expectation-maximization (EM) algorithm with the
maximum-likelihood (ML) criterion.
The EM algorithm is an iterative scheme which is well-defined and
numerically stable, but convergence may require a large number of
iterations.
For speech recognition systems utilising large amounts of training
material, this results in long training times.
This work presents an incremental estimation approach to speed up
the training of HMMs without any loss of recognition performance.
The algorithm selects a subset of data from the training set, updates
the model parameters based on the subset, and then iterates the process
until convergence of the parameters.
The advantage of this approach is a substantial increase in the
number of iterations of the EM algorithm per training token, which
leads to faster training.
In order to achieve reliable estimation from a small fraction of the
complete data set at each iteration, two training criteria are
studied: ML and maximum a posteriori (MAP) estimation.
Experimental results show that the training of the incremental
algorithms is substantially faster than the conventional (batch) method
and suffers no loss of recognition performance.
Furthermore, the incremental MAP based training algorithm improves
performance over the batch version.
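A rough sketch of the incremental scheme is given below, using
hmmlearn's GaussianHMM as a stand-in acoustic model: each EM
iteration re-estimates the parameters from a randomly chosen subset
of the training utterances, warm-starting from the current model.
The library choice and batch sizes are assumptions, and the MAP
(prior-smoothed) variant is not shown.

    # Incremental EM for an HMM: one EM update per random subset,
    # warm-starting from the current parameter estimates.

    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)
    # Toy training set: 40 "utterances" of 2-D feature vectors.
    utterances = [rng.normal(size=(rng.integers(20, 40), 2))
                  for _ in range(40)]

    model = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                            n_iter=1)
    # Initialise on a small subset (one EM iteration).
    model.fit(np.vstack(utterances[:5]), [len(u) for u in utterances[:5]])

    model.init_params = ""       # keep current parameters between batches
    for it in range(10):         # each pass = one EM update on a subset
        picks = rng.choice(len(utterances), 8, replace=False)
        batch = [utterances[i] for i in picks]
        model.fit(np.vstack(batch), [len(u) for u in batch])

    print(model.score(np.vstack(utterances)))  # log-likelihood, all data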
-
Gotoh, Hochberg and Silverman (1998).
Efficient training algorithms for HMM's using incremental estimation.
IEEE Transactions on Speech and Audio Processing, vol.6, issue 6. -
Gotoh and Silverman (1996).
Incremental ML estimation of HMM parameters for efficient training.
ICASSP, Atlanta. -
Adcock, Gotoh, Mashao and Silverman (1996).
Microphone-array speech recognition via incremental MAP training.
ICASSP, Atlanta.