This directory contains data and software for constructing corpora that
contain examples of two types of ambiguity from the biomedical domain:
abbreviations and gene names.

########################### Quick start #############################

* The script must be run on a computer that is connected to the internet.

* To create the Abbrev.100, Abbrev.200 and Abbrev.300 data sets, run the
  following commands:

    % perl recreateCorpus.pl abbrcorpus Abbrev.100_data.txt
    % perl recreateCorpus.pl abbrcorpus Abbrev.200_data.txt
    % perl recreateCorpus.pl abbrcorpus Abbrev.300_data.txt

  This should create the directories "Abbrev.100/", "Abbrev.200/" and
  "Abbrev.300/".

* To recreate the non-reduced corpus, containing all examples of ambiguous
  abbreviations downloaded from Medline, run this command:

    % perl recreateCorpus.pl abbrcorpus Abbrev_data.txt

  which should create a directory called "Abbrev/". Note, however, that
  this corpus takes a long time to create and the command should only be
  run at appropriate times (see details below).

* The Gene.100, Gene.200 and Gene.300 data sets can be created by running
  these commands:

    % perl recreateCorpus.pl genecorpus Gene.100_data.txt
    % perl recreateCorpus.pl genecorpus Gene.200_data.txt
    % perl recreateCorpus.pl genecorpus Gene.300_data.txt

Note that we have observed occasions where documents that were previously
available via Medline have become unavailable. These occasions seem to be
very rare and have only been observed for abstracts in the non-reduced
corpora (defined in the Corpus_data file). However, this means that it may
not be possible to reproduce a corpus exactly, so the number of abstracts
for a particular abbreviation may be (slightly) lower than that quoted in
the paper.

############################ Details ###############################

The contents of this directory consist of (1) data files and (2) software.
## Data files

The data files define the PubMed abstracts of the documents that form each
corpus, together with the information that allows them to be adapted. Each
line in a data file defines one document in the corpus. The format of each
line is as follows:

    PubMed ID . AmbiguousTerm . Start offset . End offset . Sense

For example, in the file Abbrev.100_data.txt the line:

    10155299.ACE.568.603.M2

states that the abstract with PubMed ID 10155299 is an example of the
ambiguous abbreviation ACE. The expansion of this abbreviation lies between
the offsets 568 and 603. Consequently, the abstract can be converted into
the format used in the corpus by replacing the text between those offsets
with the abbreviation (ACE in this case). The sense of this abbreviation is
"M2".

The data files included are: Abbrev.100_data.txt, Abbrev.200_data.txt,
Abbrev.300_data.txt, Abbrev_data.txt, Gene.100_data.txt, Gene.200_data.txt
and Gene.300_data.txt.

## Software

* recreateCorpus.pl

The Perl script recreateCorpus.pl reads through one of the data files and
creates a corpus of ambiguous genes or abbreviations based on the
information it contains. Each abstract is downloaded from PubMed and
processed. In the case of abbreviations, this processing involves
identifying the relevant expansion and replacing it with the abbreviation.
For genes, this may involve replacing the gene name with a synonym. The
text is formatted in a similar way to the NLM-WSD corpus (Weeber et al.,
2001).

The script must be provided with a data file and a flag indicating whether
the file contains gene names or abbreviations. There are two valid values
for this flag: abbrcorpus and genecorpus. The script generates a corpus
from the data file in a directory whose name is derived from the name of
the data file.

Example usage:

    perl recreateCorpus.pl abbrcorpus Abbrev.100_data.txt

This will create a directory called "Abbrev.100/" containing the corpus.

This script has two prerequisites: LWP::Simple and XML::Simple.
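As a rough illustration of the abbreviation processing described above, the
following Python sketch parses one data-file line and substitutes the
abbreviation for the text between the two offsets. It is not the code used
by recreateCorpus.pl. The names CorpusEntry, parse_entry and
replace_expansion are hypothetical, the toy sentence and its offsets are
invented for the example, and the offset convention (0-based, end offset
exclusive) is an assumption that should be checked against a real abstract.

```python
from typing import NamedTuple


class CorpusEntry(NamedTuple):
    """One line of a data file (hypothetical container, not from the scripts)."""
    pmid: str
    term: str
    start: int
    end: int
    sense: str


def parse_entry(line: str) -> CorpusEntry:
    """Parse a 'PubMedID.Term.Start.End.Sense' line into its five fields."""
    pmid, rest = line.strip().split(".", 1)
    # Split the last three fields from the right so that a term containing
    # a dot would still parse (an assumption about the data, not a known fact).
    term, start, end, sense = rest.rsplit(".", 3)
    return CorpusEntry(pmid, term, int(start), int(end), sense)


def replace_expansion(text: str, entry: CorpusEntry) -> str:
    """Replace text[start:end] with the ambiguous term.

    Assumes 0-based offsets with an exclusive end offset.
    """
    return text[:entry.start] + entry.term + text[entry.end:]


# The example line quoted in the "Data files" section.
entry = parse_entry("10155299.ACE.568.603.M2")
print(entry)

# Toy text with invented offsets, standing in for a downloaded abstract.
toy_text = "Inhibitors of angiotensin converting enzyme lower blood pressure."
toy_entry = CorpusEntry("0", "ACE", 14, 43, "M2")
print(replace_expansion(toy_text, toy_entry))
```

Splitting the trailing fields from the right is a defensive choice only. If
terms never contain a dot, a plain split on "." would work equally well.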
recreateCorpus.pl makes use of the Entrez Programming Utilities provided by
the National Library of Medicine (for further details see
http://www.ncbi.nlm.nih.gov/sites/entrez). Note that users of these
utilities, and therefore of this script, are asked to comply with various
user requirements, including the following:

  * Run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time
    on weekdays for any series of more than 100 requests.
  * Make no more than one request every 3 seconds.
  * NLM does not claim the copyright on the abstracts in PubMed; however,
    journal publishers or authors may. NLM provides no legal advice
    concerning distribution of copyrighted materials; consult your legal
    counsel.

Note that creating the three reduced abbreviation corpora (using the
Abbrev.100_data.txt, Abbrev.200_data.txt and Abbrev.300_data.txt files) or
the gene name corpora is unlikely to cause problems for the NLM's server.

* expansionsFromUMLS.pl and downloadCorpus.pl

These scripts can be used to create a corpus for any set of abbreviations
that are defined in the UMLS Metathesaurus. The expansionsFromUMLS.pl
script reads in a list of abbreviations and generates the possible
expansions contained in the UMLS Metathesaurus by consulting the LRABR
table. The output of this script can be passed to the downloadCorpus.pl
script, which will retrieve examples of each expansion from Medline (via
Entrez).

The expansionsFromUMLS.pl script is given the path to the LRABR table in
the UMLS Metathesaurus (via a command line switch) and a file containing a
list of abbreviations. For example:

    % perl expansionsFromUMLS.pl -u /home/nlp/data/UMLS/UMLS_2008AB/2008AB/LEX/LRABR ExpansionsExample.txt > MyAbbrev_data.txt

(N.B. If the UMLS Metathesaurus is not available on your system, it can be
obtained from http://www.nlm.nih.gov/research/umls/)

The resulting file (MyAbbrev_data.txt in the example above) can then be
passed to downloadCorpus.pl.
For example:

    % perl ./downloadCorpus.pl -d MyAbbrev/ MyAbbrev_data.txt

This should create a corpus for the abbreviations listed in
MyAbbrev_data.txt and save it in the directory MyAbbrev/.

The downloadCorpus.pl script has a number of dependencies: LWP::Simple,
Algorithm::ChooseSubsets and XML::Simple.

# References

Weeber, M., Mork, J. and Aronson, A. (2001) "Developing a Test Collection
for Biomedical Word Sense Disambiguation". Proceedings of the AMIA
Symposium, pages 746-50.