Yoshi Gotoh


research

Retrieving space-time relations in video

Video sequence analysis

    This study involves two fundamental problems: video representation and video similarity measurement. Robust and accurate representation of video streams, invariant to changes in scale, location and orientation, is explored in the spatial and temporal domains. It is then used for finding similarity among multiple video streams. The planned contributions are the following:
    • Development of a 3D-SIFT descriptor (2D space plus time) that is able to extract highly distinctive features, robust against spatial and temporal changes in video (a sketch follows this list);
    • Investigation of the most suitable feature representation for a video stream on a manifold, using dimensionality reduction techniques;
    • Development of alignment techniques for multiple video streams, serving as a baseline for further applications in video sequence analysis such as event detection, video similarity, content copy detection and repetitive content detection;
    • A study of instance search techniques where the query is in the form of a short video clip.
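
    Below is a minimal Python sketch of the gradient orientation histogram at the core of a 3D-SIFT style descriptor. It is illustrative only: keypoint detection, scale selection and the sub-volume descriptor grid are omitted, and all names are ours rather than drawn from the published method.

        # Sketch of a 3D (space plus time) gradient orientation histogram,
        # the basic building block of a 3D-SIFT style descriptor.
        import numpy as np

        def spatio_temporal_descriptor(volume, n_bins=8):
            """volume: (T, H, W) array of grey-level frames around a point."""
            # Gradients along time (t), vertical (y) and horizontal (x) axes.
            gt, gy, gx = np.gradient(volume.astype(float))
            magnitude = np.sqrt(gx**2 + gy**2 + gt**2)
            # Spatial orientation and temporal elevation angles, treating the
            # space-time gradient in spherical coordinates.
            theta = np.arctan2(gy, gx)                    # in [-pi, pi]
            phi = np.arctan2(gt, np.sqrt(gx**2 + gy**2))  # in [-pi/2, pi/2]
            # Quantise both angles and accumulate magnitude-weighted votes.
            tb = np.floor((theta + np.pi) / (2 * np.pi) * n_bins).clip(0, n_bins - 1)
            pb = np.floor((phi + np.pi / 2) / np.pi * n_bins).clip(0, n_bins - 1)
            hist = np.zeros((n_bins, n_bins))
            np.add.at(hist, (tb.astype(int), pb.astype(int)), magnitude)
            return (hist / (np.linalg.norm(hist) + 1e-9)).ravel()  # L2-normalised

        # Example: descriptor for a random 16x16x16 space-time patch.
        patch = np.random.rand(16, 16, 16)
        print(spatio_temporal_descriptor(patch).shape)    # (64,) = 8 x 8 bins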

  • AlGhamdi and Gotoh (2020).
    Graph-based topic models for trajectory clustering in crowd videos.
    Machine Vision and Applications, vol.31.
  • AlGhamdi and Gotoh (2018).
    Graph-based correlated topic model for trajectory clustering in crowded videos.
    WACV, Lake Tahoe.
  • AlGhamdi and Gotoh (2014).
    Manifold matching with application to instance search based on video queries.
    ICISP, Cherbourg.

Natural language description of video streams

    Collections of digital images and videos have grown rapidly in recent years as more and more data becomes available in the form of personal photo albums, handheld camera videos, feature films and multi-lingual broadcast news, presenting visual data ranging from unstructured to highly structured. There is a need for qualitative filtering to find relevant information according to user requirements. Additionally, time constraints force one to be selective when accessing the information needed. Such a distillation process requires comprehensive information processing, including categorisation and summarisation of multimedia resources. One approach to addressing this issue is to convert visual content into a more accessible form such as human language. Most previous studies concerned semantic indexing of video using keywords. However, it is often difficult with keywords alone to represent relations between the various entities and events in video. An interesting extension to a keyword-based scheme is natural language textual description in a syntactically and semantically correct formulation; such descriptions can clarify the context between keywords by capturing their relations.

    This work addresses generation of natural language descriptions for human actions, behaviour and their relations with other objects observed in video streams. It starts with the implementation of conventional image processing techniques to extract high level features (HLFs) from individual video frames. These may be `keywords', such as a particular object and its position or movement, of the kind used for the semantic indexing task in video retrieval. The features are then converted into natural language descriptions using a context free grammar (CFG). Although feature extraction processes are error-prone at various levels, approaches are explored to combine their outputs into coherent descriptions. The expected contributions are:

    • Hand annotation of video clips with key phrases, a short title and a full description, and their comprehensive analysis to investigate what humans find interesting;
    • Development of a template based approach to creating natural language descriptions from a set of identified HLFs using a CFG (a sketch follows this list);
    • Scalability study, involving approaches to producing coherent descriptions for (1) potentially erroneous and missing HLFs and (2) video streams in different genres;
    • Development of a scheme to generate coherent and compact descriptions for video streams;
    • Identification and expression of spatial and temporal relations between humans and objects present in videos;
    • Applications of generated natural language descriptions, such as video classification and summarisation.
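
    As a simple illustration of the template based approach, the Python sketch below fills sentence templates from whichever HLFs are available, degrading gracefully when some are missing or unreliable. The HLF names and templates are invented for this example; the actual system derives its sentence structures from a CFG.

        # Sketch of template based description generation from a set of
        # high level features (HLFs); all names here are illustrative.
        TEMPLATES = {
            ('human', 'action'):           '{human} is {action}.',
            ('human', 'action', 'object'): '{human} is {action} a {object}.',
            ('human', 'action', 'object', 'location'):
                '{human} is {action} a {object} in the {location}.',
        }

        def describe(hlfs):
            """hlfs: dict of HLF name -> value extracted from a video frame."""
            # Choose the most specific template whose slots are all available,
            # so missing or unreliably extracted HLFs degrade gracefully.
            for slots in sorted(TEMPLATES, key=len, reverse=True):
                if all(s in hlfs for s in slots):
                    return TEMPLATES[slots].format(**hlfs)
            return 'Something is happening.'  # fallback when no HLF is usable

        print(describe({'human': 'A man', 'action': 'holding', 'object': 'phone'}))
        # -> A man is holding a phone.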

  • AlHarbi and Gotoh (2017).
    Natural language descriptions for human activities in video streams.
    INLG, Santiago de Compostela.
  • Khan and Gotoh (2017).
    Generating natural language tags for video information management.
    Machine Vision and Applications, vol.28.
  • Khan, AlHarbi and Gotoh (2015).
    A framework for creating natural language descriptions of video streams.
    Information Sciences, vol.303.

Statistical summarisation of spoken language

    Automatic text summarisation is a difficult scientific problem. It has been investigated for many years, and there now exist systems that can produce fluent summaries in specific domains. However, their quality is still lower than that of human authored summaries. Speech summarisation is an even more challenging task since it involves additional issues such as the handling of automatic speech recognition (ASR) errors and the need to segment the audio (or the ASR transcription) by speaker, topic or sentence. In comparison to the volume of text summarisation research, speech summarisation is still a new, developing area with a relatively small body of work. The available linguistic resources are also sparse.

    This study is concerned with the development of summarisation technologies for broadcast news. It builds on our earlier projects on the structured transcription of news broadcasts and on spoken document retrieval. The study focuses on research issues specific to handling speech, rather than written text. The principal contributions are listed below:

    • Development of maximum entropy (ME) modelling approaches to feature selection and broadcast speech segmentation at various levels (eg, story, sentence), and a study of cascading multiple systems (a sketch follows this list);
    • Investigation of speaker independent prosodic features for information extraction from audio;
    • A portability study in summarisation techniques from printed to broadcast news and the development of a speaker-role based news classification scheme, leading to the development of a multi-stage compaction approach to broadcast news summarisation;
    • Data collection and annotation of over 40 hours of news broadcasts, including named entities (NEs), sentence and story boundaries, and extractive and abstractive summaries;
    • A study in human subjectivity, leading to the development of a cross comprehension test for the relative evaluation of machine generated summaries.
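
    To illustrate the ME modelling approach, the Python sketch below casts sentence boundary detection over an ASR word stream as a maximum entropy classifier, which for the binary features used here is equivalent to logistic regression, via scikit-learn. The features and the toy data are invented for this example; the actual systems combine many more lexical and prosodic cues.

        # Sketch of ME (logistic regression) sentence boundary detection.
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        def boundary_features(words, pauses, i):
            """Features describing the gap after word i (illustrative)."""
            return {
                'word': words[i],
                'next_word': words[i + 1] if i + 1 < len(words) else '</s>',
                'long_pause': pauses[i] > 0.3,  # pause (seconds) after word i
            }

        # Toy training data: label 1 marks a sentence boundary after the word.
        words  = ['we', 'start', 'now', 'thank', 'you', 'next', 'item']
        pauses = [0.05, 0.04, 0.55, 0.06, 0.60, 0.05, 0.70]
        labels = [0, 0, 1, 0, 1, 0, 1]

        X = [boundary_features(words, pauses, i) for i in range(len(words))]
        model = make_pipeline(DictVectorizer(), LogisticRegression())
        model.fit(X, labels)
        print(model.predict(X))  # boundary decision for each word gap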

  • Kolluru and Gotoh (2009).
    On the subjectivity of human authored summaries.
    Natural Language Engineering, vol.15, no.2.
  • Christensen, Gotoh and Renals (2008).
    A cascaded broadcast news highlighter.
    IEEE Transactions on Audio, Speech and Language Processing, vol.16, issue 1.
  • Kolluru and Gotoh (2007).
    Speaker role based structural classification of broadcast news stories.
    Interspeech, Antwerp.

Information access in speech

    Simple statistical models underlie many successful applications of speech and language processing. The most accurate document retrieval systems are based on unigram statistics. The acoustic model of virtually all speech recognition systems is based on stochastic finite state machines referred to as hidden Markov models (HMMs). The language (word sequence) model of state-of-the-art large vocabulary speech recognition systems uses an n-gram model (an [n-1]th order Markov model). Two important features of these simple models are their trainability and scalability: in the case of language modelling, model parameters are estimated directly from corpora. These approaches have been extensively investigated and optimised, for speech recognition in particular, resulting in systems that can perform certain tasks with a high degree of accuracy. More recently, similar statistical finite state models have been developed for spoken language processing applications beyond direct transcription, enabling, for example, the production of structured transcriptions which may include punctuation or content annotation.
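
    As a small illustration of the n-gram approach, the Python sketch below estimates a bigram (n = 2) model by relative frequency with add-one smoothing. Deployed systems use higher orders, large vocabularies and more refined smoothing schemes (eg, Katz back-off or Kneser-Ney); the toy corpus is invented for this example.

        # Sketch of a bigram language model with add-one (Laplace) smoothing.
        from collections import Counter

        corpus = 'the news at nine the news tonight the weather tonight'.split()
        vocab = set(corpus)

        unigrams = Counter(corpus)
        bigrams = Counter(zip(corpus, corpus[1:]))

        def p_bigram(w, prev):
            """P(w | prev) estimated by smoothed relative frequency."""
            return (bigrams[(prev, w)] + 1) / (unigrams[prev] + len(vocab))

        print(p_bigram('news', 'the'))     # frequent pair, higher probability
        print(p_bigram('weather', 'the'))  # rarer pair, lower probability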

  • Gotoh and Renals (2000).
    Information extraction from broadcast news.
    Philosophical Transactions of the Royal Society of London, series A, vol.358, issue 1769.
  • Gotoh and Renals (2000).
    Variable word rate n-grams.
    ICASSP, Istanbul.
  • Gotoh and Renals (1999).
    Topic-based mixture language modelling.
    Natural Language Engineering, vol.5, no.4.

Speech processing

    Typically, parameter estimation for a hidden Markov model (HMM) is performed using an expectation-maximisation (EM) algorithm with the maximum likelihood (ML) criterion. The EM algorithm is an iterative scheme which is well-defined and numerically stable, but convergence may require a large number of iterations. For speech recognition systems utilising large amounts of training material, this results in long training times. This work presents an incremental estimation approach to speed up the training of HMMs without any loss of recognition performance. The algorithm selects a subset of data from the training set, updates the model parameters based on that subset, and then iterates the process until the parameters converge. The advantage of this approach is a substantial increase in the number of EM iterations per training token, which leads to faster training. In order to achieve reliable estimation from a small fraction of the complete data set at each iteration, two training criteria are studied: ML and maximum a posteriori (MAP) estimation. Experimental results show that training with the incremental algorithms is substantially faster than with the conventional (batch) method, with no loss of recognition performance. Furthermore, the incremental MAP based training algorithm improves performance over the batch version.
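
    The Python sketch below conveys the incremental estimation idea in a deliberately simplified setting: a one-dimensional two-component Gaussian mixture stands in for the HMM, and sufficient statistics accumulated from small random subsets drive an immediate parameter update after each subset. It illustrates only the subset-update principle, not the published algorithm.

        # Sketch of incremental ML estimation on a toy Gaussian mixture:
        # each E-step uses a small random subset, and parameters are
        # updated immediately from accumulated sufficient statistics.
        import numpy as np

        rng = np.random.default_rng(0)
        data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

        mu = np.array([-1.0, 1.0])           # initial component means
        w = np.array([0.5, 0.5])             # initial mixture weights
        s_x, s_w = np.zeros(2), np.zeros(2)  # running sufficient statistics

        for step in range(50):
            # E-step on a small random subset instead of the full data set.
            batch = rng.choice(data, size=50, replace=False)
            resp = w * np.exp(-0.5 * (batch[:, None] - mu) ** 2)  # unit variance
            resp /= resp.sum(axis=1, keepdims=True)
            # Accumulate statistics and update parameters straight away,
            # giving many more updates per pass over the training data.
            s_x += resp.T @ batch
            s_w += resp.sum(axis=0)
            mu = s_x / s_w
            w = s_w / s_w.sum()

        print(mu)  # should approach the true means, roughly [-2, 3]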

  • Gotoh, Hochberg and Silverman (1998).
    Efficient training algorithms for HMM's using incremental estimation.
    IEEE Transactions on Speech and Audio Processing, vol.6, issue 6.
  • Gotoh and Silverman (1996).
    Incremental ML estimation of HMM parameters for efficient training.
    ICASSP, Atlanta.
  • Adcock, Gotoh, Mashao and Silverman (1996).
    Microphone-array speech recognition via incremental MAP training.
    ICASSP, Atlanta.