Retrieving space-time relations in video
-
This study presents a robust region tracking method, as an
alternative to the conventional space-time interest point feature
based techniques, demonstrating that region descriptors can be
obtained for the action classification task.
A state-of-the-art human detection method is applied to build a model
incorporating generic object foreground segments.
Non-human objects that interact with a human in the video scene are
also included, in order to capture the action semantically.
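As an illustration of the kind of relation such region tracks can
expose, the sketch below labels a qualitative spatio-temporal
relation between a tracked human region and an object region from
their bounding boxes. It is a minimal, hypothetical example; the box
format, function names and thresholds are illustrative assumptions,
not the method of the papers below.

    # Hypothetical sketch: label a qualitative spatio-temporal
    # relation between two tracked regions. Boxes are (x, y, w, h);
    # the overlap threshold is illustrative only.

    def iou(a, b):
        """Intersection-over-union of two (x, y, w, h) boxes."""
        ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
        bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
        iw = max(0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union else 0.0

    def centroid(box):
        return (box[0] + box[2] / 2.0, box[1] + box[3] / 2.0)

    def relation(human_track, object_track):
        """Clip-level relation from two aligned box sequences."""
        if any(iou(h, o) > 0.1 for h, o in zip(human_track, object_track)):
            return "touching"
        dist = [((centroid(h)[0] - centroid(o)[0]) ** 2 +
                 (centroid(h)[1] - centroid(o)[1]) ** 2) ** 0.5
                for h, o in zip(human_track, object_track)]
        return "approaching" if dist[-1] < dist[0] else "moving apart"

    # Example: a human moving towards a static object over five frames.
    human = [(10 + 8 * t, 50, 40, 90) for t in range(5)]
    obj = [(120, 60, 30, 30)] * 5
    print(relation(human, obj))  # -> "approaching"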
-
AlHarbi and Gotoh (2015).
A unified spatio-temporal human body region tracking approach to action recognition.
Neurocomputing. -
AlHarbi and Gotoh (2015).
Describing spatio-temporal relations between object volumes in video streams.
AAAI - TrBA workshop, Austin. -
AlHarbi and Gotoh (2013).
Spatio-temporal human body segmentation from video stream.
CAIP, York.
Video sequence analysis
-
This study involves two fundamental problems: video representation
and video similarity measurement.
Robust and accurate representation of video streams, invariant to
scale, location and orientation changes, is explored in the spatial
and the temporal domains.
It is then used for finding similarity among multiple video streams.
The planned contributions are the following:
- Development of a 3D-SIFT descriptor (2D space and time) that extracts highly distinctive features, robust against spatial and temporal changes in video (a sketch of the underlying idea follows this list);
- Investigation of the most suitable feature representation for a video stream on a manifold, using dimensionality reduction techniques;
- Development of alignment techniques for multiple video streams, serving as a baseline for further applications in video sequence analysis such as event detection, video similarity, content copy detection and repetitive content detection;
- A study of instance search techniques where a query is in the form of a short video clip.
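The following toy sketch illustrates the core idea behind a
space-time extension of SIFT: orientation histograms computed over
the three-dimensional (x, y, t) gradient field of a video cuboid.
It is a conceptual illustration under stated assumptions (array
layout, bin counts), not the descriptor developed in the work below.

    # Sketch of a space-time gradient descriptor over an (x, y, t)
    # cuboid. Array layout and bin counts are assumptions.

    import numpy as np

    def cuboid_descriptor(video, n_bins=8):
        """video: (T, H, W) grayscale volume; returns a histogram of
        spatial orientations plus a crude temporal signature."""
        gt, gy, gx = np.gradient(video.astype(float))  # d/dt, d/dy, d/dx
        mag = np.sqrt(gx ** 2 + gy ** 2 + gt ** 2)
        spatial_ori = np.arctan2(gy, gx)               # in [-pi, pi]
        # Magnitude-weighted histogram of spatial orientations.
        hist, _ = np.histogram(spatial_ori, bins=n_bins,
                               range=(-np.pi, np.pi), weights=mag)
        # Gradient energy with positive vs non-positive temporal change.
        temporal = np.array([mag[gt > 0].sum(), mag[gt <= 0].sum()])
        desc = np.concatenate([hist, temporal])
        return desc / (np.linalg.norm(desc) + 1e-8)    # normalise

    rng = np.random.default_rng(0)
    clip = rng.random((16, 32, 32))                    # toy 16-frame cuboid
    print(cuboid_descriptor(clip).shape)               # (10,)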
-
AlGhamdi and Gotoh (2020).
Graph-based topic models for trajectory clustering in crowd videos.
Machine Vision and Applications, vol.31. -
AlGhamdi and Gotoh (2018).
Graph-based correlated topic model for trajectory clustering in crowded videos.
WACV, Lake Tahoe. -
AlGhamdi and Gotoh (2014).
Manifold matching with application to instance search based on video queries.
ICISP, Cherbourg.
Natural language description of video streams
-
Collections of digital images and videos have grown exponentially in
recent years, as more and more data becomes available in the form of
personal photo albums, handheld camera videos, feature films and
multi-lingual broadcast news videos, presenting visual data ranging
from unstructured to highly structured.
There is a need for qualitative filtering to find relevant
information according to user requirements.
Additionally, time constraints force one to be selective when
accessing the information needed.
Such a distillation process requires comprehensive information
processing including categorisation and summarisation of multimedia
resources.
One approach to addressing this issue is to convert such resources
into a more accessible form such as human language.
Most previous studies were related to semantic indexing of video
using keywords.
However, it is often difficult with keywords alone to represent
relations between the various entities and events in a video.
An interesting extension to a keyword-based scheme is natural
language textual description in a syntactically and semantically
correct formulation.
Such descriptions can clarify the context between keywords by
capturing their relations.
This work addresses the generation of natural language descriptions
for human actions, behaviour and their relations with other objects
observed in video streams.
The work starts with the implementation of conventional image
processing techniques to extract high level features (HLFs) from
individual video frames.
These may be `keywords', such as a particular object and its
position and moves, used for the semantic indexing task in video
retrieval.
The features are converted into natural language descriptions using
a context free grammar (CFG).
Although feature extraction processes are erroneous at various
levels, approaches are explored to put them together for producing
coherent descriptions.
The expected contributions are:
- Hand annotations of video clips consisting of key phrases, a short title, and a full description, and their comprehensive analysis to investigate human interests;
- Development of a template based approach to creating natural language descriptions from a set of identified HLFs using a CFG (a toy sketch follows this list);
- Scalability study, involving approaches to producing coherent descriptions for (1) potentially erroneous and missing HLFs and (2) video streams in different genres;
- Development of a scheme to generate coherent and compact descriptions for video streams;
- Identification and expression of spatial and temporal relations between humans and objects present in videos;
- Applications of generated natural language descriptions, such as video classification and summarisation.
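To make the template idea concrete, here is a minimal, hypothetical
sketch of CFG-based generation from a set of HLFs, including a
graceful fallback when an HLF is missing. The grammar, slot names
and fallback rule are illustrative assumptions only, not the
project's actual grammar.

    # Toy CFG: non-terminals expand to sequences of terminals and
    # non-terminals; symbols starting with '$' are HLF slots.

    import random

    GRAMMAR = {
        "S":  [["NP", "VP", "."]],
        "NP": [["$subject"]],
        "VP": [["is", "$action", "PP"], ["is", "$action"]],
        "PP": [["near", "the", "$object"]],
    }

    def required(symbol):
        """Smallest set of HLF slots some expansion of symbol needs."""
        if symbol.startswith("$"):
            return {symbol[1:]}
        if symbol not in GRAMMAR:
            return set()
        return min((set().union(*(required(s) for s in e))
                    for e in GRAMMAR[symbol]), key=len)

    def generate(symbol, hlf):
        """Expand symbol, filling '$' slots from the detected HLFs."""
        if symbol.startswith("$"):
            return [hlf[symbol[1:]]]
        if symbol not in GRAMMAR:
            return [symbol]
        # Use only expansions whose slots were all detected, so that
        # a missing or erroneous HLF degrades the output gracefully.
        usable = [e for e in GRAMMAR[symbol]
                  if set().union(*(required(s) for s in e)) <= set(hlf)]
        words = []
        for s in random.choice(usable):
            words += generate(s, hlf)
        return words

    print(" ".join(generate("S", {"subject": "a man", "action": "walking",
                                  "object": "car"})))
    print(" ".join(generate("S", {"subject": "a man", "action": "walking"})))
    # e.g. "a man is walking near the car ." / "a man is walking ."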
-
AlHarbi and Gotoh (2017).
Natural language descriptions for human activities in video streams.
INLG, Santiago de Compostela. -
Khan and Gotoh (2017).
Generating natural language tags for video information management.
Machine Vision and Applications, vol.28. -
Khan, AlHarbi and Gotoh (2015).
A framework for creating natural language descriptions of video streams.
Information Sciences, vol.303.
Statistical summarisation of spoken language
-
Automatic text summarisation is a difficult scientific problem.
It has been investigated for many years; there now exist systems that
can produce fluent summaries in specific domains.
However, their quality is still lower than that of human-authored
summaries.
Speech summarisation is an even more challenging task since it involves
additional issues such as the handling of automatic speech recognition
(ASR) errors and the need to segment the audio (or ASR transcription) by
speaker, topic, or sentence.
In comparison to the volume of text summarisation research, speech
summarisation is still a new, developing area with a relatively small
number of works.
The available linguistic resources are also sparse.
This study is concerned with the development of summarisation
technologies for news broadcasts.
It builds on our earlier projects relating to the structured
transcription of news broadcasts and spoken document retrieval.
The study focuses on research issues specific to handling speech,
rather than written texts.
The principal contributions are listed below:
- Development of maximum entropy (ME) modelling approaches to feature selection and broadcast speech segmentation at various levels (eg, story, sentence), and a study in cascading multiple systems (a toy ME sketch follows this list);
- Investigation of speaker independent prosodic features for information extraction from audio;
- A portability study of summarisation techniques from printed to broadcast news and the development of a speaker-role based news classification scheme, leading to the development of a multi-stage compaction approach to broadcast news summarisation;
- Data collection and annotation of over 40 hours of news broadcasts, including named entities (NEs), sentence and story boundaries, extractive and abstractive summaries;
- A study in human subjectivity, leading to the development of a cross comprehension test for the relative evaluation of machine generated summaries.
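As a toy illustration of the ME modelling mentioned in the first
item, the sketch below trains a logistic regression classifier (a
maximum entropy model) to estimate sentence boundary posteriors from
a few invented acoustic and lexical features. The feature set and
data are assumptions, not those used in the project.

    # ME-style sentence boundary detection on an ASR transcript,
    # using logistic regression as the maximum entropy classifier.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One row per inter-word gap: [pause_sec, next_word_is_filler,
    # speaker_changed]; label 1 marks a sentence boundary.
    X = np.array([[0.05, 0, 0], [0.70, 0, 0], [0.40, 1, 0],
                  [1.20, 0, 1], [0.10, 0, 0], [0.90, 0, 1]])
    y = np.array([0, 1, 0, 1, 0, 1])

    me = LogisticRegression().fit(X, y)
    print(me.predict_proba([[0.80, 0, 0]])[0, 1])  # P(boundary | features)

In a cascaded setting, such boundary posteriors would feed later
story segmentation and summarisation stages.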
-
Kolluru and Gotoh (2009).
On the subjectivity of human authored summaries.
Natural Language Engineering, vol.15, no.2. -
Christensen, Gotoh and Renals (2008).
A cascaded broadcast news highlighter.
IEEE Transactions on Audio, Speech and Language Processing, vol.16, issue 1. -
Kolluru and Gotoh (2007).
Speaker role based structural classification of broadcast news stories.
Interspeech, Antwerp.
Information access in speech
-
Simple statistical models underlie many successful applications of
speech and language processing.
The most accurate document retrieval systems are based on unigram
statistics.
The acoustic model of virtually all speech recognition systems is based
on stochastic finite state machines referred to as hidden Markov models
(HMMs).
The language (word sequence) model of state-of-the-art large
vocabulary speech recognition systems uses an n-gram model, that is,
an (n-1)th order Markov model.
Two important features of these simple models are their trainability and
scalability: in the case of language modelling, model parameters are
estimated from corpora.
These approaches have been extensively investigated and optimised for
speech recognition, in particular, resulting in systems that can perform
certain tasks with a high degree of accuracy.
More recently, similar statistical finite state models have been
developed for spoken language processing applications beyond direct
transcription to enable, for example, the production of structured
transcriptions which may include punctuation or content annotation.
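To illustrate the trainability of such models, the sketch below
estimates a bigram (first order Markov) language model by maximum
likelihood from a toy corpus. Real systems add smoothing for unseen
events, which is omitted here.

    # Bigram language model with maximum-likelihood estimates.

    from collections import Counter

    corpus = "the dog barked <s> the dog slept <s> the cat slept".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def p(word, prev):
        """ML estimate P(word | prev) = c(prev, word) / c(prev)."""
        return bigrams[(prev, word)] / unigrams[prev]

    print(p("dog", "the"))    # 2/3: "the" precedes "dog" twice, "cat" once
    print(p("slept", "dog"))  # 1/2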
-
Gotoh and Renals (2000).
Information extraction from broadcast news.
Philosophical Transactions of the Royal Society of London, series A, vol.358, issue 1769. -
Gotoh and Renals (2000).
Variable word rate n-grams.
ICASSP, Istanbul. -
Gotoh and Renals (1999).
Topic-based mixture language modelling.
Natural Language Engineering, vol.5, no.4.
Speech processing
-
Typically, parameter estimation for a hidden Markov model (HMM) is
performed using an expectation-maximization (EM) algorithm with the
maximum-likelihood (ML) criterion.
The EM algorithm is an iterative scheme which is well-defined and
numerically stable, but convergence may require a large number of
iterations.
For speech recognition systems utilising large amounts of training
material, this results in long training times.
This work presents an incremental estimation approach to speed up
the training of HMMs without any loss of recognition performance.
The algorithm selects a subset of data from the training set, updates
the model parameters based on the subset, and then iterates the process
until convergence of the parameters.
The advantage of this approach is a substantial increase in the
number of iterations of the EM algorithm per training token, which
leads to faster training.
In order to achieve reliable estimation from a small fraction of the
complete data set at each iteration, two training criteria are
studied: ML and maximum a posteriori (MAP) estimation.
Experimental results show that the training of the incremental
algorithms is substantially faster than the conventional (batch) method
and suffers no loss of recognition performance.
Furthermore, the incremental MAP based training algorithm improves
performance over the batch version.
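A rough sketch of the incremental scheme is given below, using
hmmlearn's GaussianHMM as a stand-in acoustic model: each EM
iteration re-estimates the parameters from a randomly chosen subset
of the training utterances, warm-starting from the current model.
The library choice and batch sizes are assumptions, and the MAP
(prior-smoothed) variant is not shown.

    # Incremental EM for an HMM: one EM update per random subset,
    # warm-starting from the current parameter estimates.

    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)
    # Toy training set: 40 "utterances" of 2-D feature vectors.
    utterances = [rng.normal(size=(rng.integers(20, 40), 2))
                  for _ in range(40)]

    model = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                            n_iter=1)
    # Initialise on a small subset (one EM iteration).
    model.fit(np.vstack(utterances[:5]), [len(u) for u in utterances[:5]])

    model.init_params = ""       # keep current parameters between batches
    for it in range(10):         # each pass = one EM update on a subset
        picks = rng.choice(len(utterances), 8, replace=False)
        batch = [utterances[i] for i in picks]
        model.fit(np.vstack(batch), [len(u) for u in batch])

    print(model.score(np.vstack(utterances)))  # log-likelihood, all data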
-
Gotoh, Hochberg and Silverman (1998).
Efficient training algorithms for HMM's using incremental estimation.
IEEE Transactions on Speech and Audio Processing, vol.6, issue 6. -
Gotoh and Silverman (1996).
Incremental ML estimation of HMM parameters for efficient training.
ICASSP, Atlanta. -
Adcock, Gotoh, Mashao and Silverman (1996).
Microphone-array speech recognition via incremental MAP training.
ICASSP, Atlanta.