Machine Learning Methods for Personalised, Abstractive Summarisation of Consumer-Generated Media
Public Abstract
The success of Web 2.0 and consumer-generated media (CGM) is based on tapping into the social nature of human interactions, by making it possible for people to voice their opinions, become part of a virtual community and collaborate remotely. If we take micro-blogging as an example, the growth in Twitter visits between 2008 and 2009 was over 1,000%, and it is projected that by 2010 around 10% of all internet users will be on Twitter. This unprecedented rise in the volume and importance of online content has resulted in companies and individuals spending ever-increasing amounts of time trying to keep up with relevant CGM. It is estimated that 700 person-hours per year is the absolute minimum that companies and public services need to spend on CGM monitoring, online user engagement, and discovery of new information. This fellowship is about helping people to cope with the resulting information overload, through automatic methods that can adapt to an individual's information-seeking goals and briefly summarise the relevant media, thus supporting information interpretation and decision making.
Automatic text summarisation is key to our goal and consists of compressing the meaning of text documents while preserving the relevant information contained within them. While there has been a lot of research on well-authored texts such as news, summarisation of social media is still in its infancy, with research focused on product reviews. A key experimental finding has been that due to the characteristics of social media (product reviews in particular) it is better first to abstract the relevant information from the different documents and sites and then to use natural language generation to create a fluent text based on this information.
In this fellowship I will investigate and evaluate new machine learning methods for personalised, abstractive multi-document summarisation across different social media, for example diachronic summaries that combine Twitter posts, blog articles, and Facebook wall messages on a given topic. In contrast to previous work, we will pursue an inter-disciplinary approach, which will help us address the first research challenge: studying the social dimension of CGM summarisation and establishing actual user needs. The second research challenge is that the algorithms need to be robust in the face of this noisy, jargon-laden and dynamic content, and need models capable of representing the contradictory and strongly temporal nature of CGM. The third challenge, and a key novel contribution of our work, is personalising the summaries, based on a model of user interests, goals, and social context. Issues such as trustworthiness, privacy, and online communities (with their hubs and authorities) will also play an important role. The fourth research challenge is to generate personalised abstractive summaries that can help users with sensemaking and content interpretation.
An exciting element of my research will be studying the different kinds of summaries that are useful for a variety of real users (companies, journalists, and the general public) through multi-disciplinary collaborations with the Press Association, British Telecom, the Oxford Internet Institute, and Sheffield's Department of Journalism. A key project deliverable will be a publicly available browser plugin that provides easy access to the automatically generated summaries. This will allow me to evaluate the project results with real users, on a large scale. It will also provide a new evaluation challenge for the Natural Language Generation community, as researchers will be able to compare their summarisers against our open-source algorithms.
Last but not least, the fellowship not only covers foundational multi-disciplinary research but also tests the results in several Digital Economy pilot experiments involving commercial partners (The Press Association, British Telecom, Fizzback).
Publications
- K. Bontcheva, H. Cunningham, I. Roberts, A. Roberts, V. Tablan, N. Aswani, G. Gorrell. GATE Teamware: a web-based, collaborative text annotation framework. Language Resources and Evaluation. Springer Netherlands. In Press.
- Cunningham H, Tablan V, Roberts A, Bontcheva K (2013) Getting More Out of Biomedical Documents with GATE's Full Lifecycle Open Source Text Analytics. PLoS Comput Biol 9(2): e1002854. doi:10.1371/journal.pcbi.1002854, http://tinyurl.com/gate-life-sci/
- V. Tablan, I. Roberts, H. Cunningham, K. Bontcheva. GATECloud.net: a Platform for Large-Scale, Open-Source Text Processing on the Cloud. Philosophical Transactions of the Royal Society A, 371(1983), 2013. doi:10.1098/rsta.2012.0071.
- K. Bontcheva, D. Rout. Making Sense of Social Media through Semantics: A Survey. Semantic Web - Interoperability, Usability, Applicability. IOS Press. In Press.
- D. Damljanovic, M. Agatonovic, H. Cunningham, K. Bontcheva. Improving habitability of natural language interfaces for querying ontologies with feedback and clarification dialogues. Web Semantics: Science, Services and Agents on the World Wide Web. Volume 19, pages 1-21. 2013.
- V. Tablan, K. Bontcheva, I. Roberts, H. Cunningham, M. Dimitrov. AnnoMarket: An Open Cloud Platform for NLP. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL'2013).
- K. Bontcheva, L. Derczynski, A. Funk, M.A. Greenwood, D. Maynard, N. Aswani. TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2013).
- M. Sabou, K. Bontcheva, A. Scharl, M. Föls. Games with a Purpose or Mechanised Labour?: A Comparative Study. Proceedings of the 13th International Conference on Knowledge Management and Knowledge Technologies. ACM. September 2013.
- M. Brucato, L. Derczynski, H. Llorens, K. Bontcheva, C.S. Jensen. Recognising and Interpreting Named Temporal Expressions. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2013).
- L. Derczynski, A. Ritter, S. Clark, K. Bontcheva. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2013).
- D. Rout, K. Bontcheva, M. Hepple. Reliably Evaluating Summaries of Twitter Timelines. Analyzing Microtext. AAAI 2013 Spring Symposium. March 25-27, Stanford, California, 2013.
- D. Rout, D. Preotiuc-Pietro, K. Bontcheva, T. Cohn. Where’s @wally? A Classification Approach to Geolocating Users Based on their Social Ties. 24th ACM Conference on Hypertext and Social Media, 1–3 May, Paris, France. 2013.
- L. Derczynski, D. Maynard, N. Aswani, K. Bontcheva. Microblog-Genre Noise and Impact on Semantic Annotation Accuracy. 24th ACM Conference on Hypertext and Social Media, 1–3 May, Paris, France. 2013.
- M. Sabou, K. Bontcheva, A. Scharl. Crowdsourcing research opportunities: Lessons from Natural Language Processing. I-KNOW, September 2012.
- D. Damljanovic and K. Bontcheva. Named Entity Disambiguation using Linked Data. Proceedings of the 9th Extended Semantic Web Conference (ESWC 2012), Heraklion, Greece, May 2012. Poster session.
- D. Maynard, K. Bontcheva, D. Rout. Challenges in developing opinion mining tools for social media. In Proceedings of @NLP can u tag #user_generated_content?! Workshop at LREC 2012, May 2012, Istanbul, Turkey.
- M. A. Greenwood, N. Aswani, K. Bontcheva. Reputation Profiling with GATE. CLEF (Online Working Notes/Labs/Workshop). 2012.
- K. Bontcheva, H. Cunningham. Semantic Annotations and Retrieval: Manual, Semi-automatic and Automatic Generation. Domingue, John; Fensel, Dieter; Hendler, James A. (Eds.). Handbook of Semantic Web Technologies. 2011.