University of Sheffield
Department of Computer Science

COM 334: Language Engineering

Notes, Project and Tutorial Exercises
Professor Yorick Wilks
yorick@dcs.shef.ac.uk

My office:

Room 135, Regent Court

See me by appointment with secretary (Christine Parkes) at 2221811.
Pick up readings from Gillian Callaghan's office - Room 152 (opposite 135).

Lectures: Thursday 2.10-3, SG-LT09 - St George's Complex
Friday 9.00-10.50, PC-SRB57C - Portobello Centre:
Normally, Thursday will be the class and Friday will be the lecture.
Tutorial Assistants: Tomas By (t.by@dcs.shef.ac.uk)
Course web page: http://www.dcs.shef.ac.uk/~yorick/com334.html
Assessment: Project 40%, Exams 60% Project can be handed in at any time until the end of week 10. Reading weeks 11 and 12.

The Course in general terms

The purpose of this NLP course is to describe recent developments in the field of natural language processing and, in particular, to present the current contrast and opposition between continued use of the symbolic methods (described last term) and the revived use of statistical and other data-orientated methods. Machine translation and Information Extraction will be used as the two "base" technologies within which to describe a range of separable language analysis modules. Some things you learned in NLPI if you took it will prove useful--such as a grammar rule formalisms for some of this term's week 6, but basically the two courses (NLPI and II) are quite independent in terms of content, so do not worry if you did NOT take the first course.

In the notes for each week, I will star one or two references (that also contain other references of course, for further reading), and you should try at least to read those. Simply reading these notes and attending the lectures will NOT be enough to pass the exams--you should be familiar with most of what is in the reading pack--that is, most of the starred references.

There will be a project (that can be completed by programming or an essay), and there will be questions based on the week's lectures for discussion at the tutorial.

The course will contain at least the following topics:

1. Introduction: machine translation (MT).

The role of MT in the history of NLP: MT as the original problem and test bed for NLP and computational linguistics, both traditional-symbolic and, more recently, statistical-empirical. Why MT is a problem: a range of traditional symbolic approaches to the problem.

2. The revival of statistical methods in NLP: the case of MT

The return of methods originally derived for speech recognition to a range of text problems including MT: the degree of success obtained and the prospects for the future.

3. Some statistical methods

Discussion of individual NLP techniques including sentence alignment algorithms, n-gram statistics, part of speech taggers, the derivation of bilingual lexicons, preference structures, etc.

4. The empirical induction of syntactic rules and syntactic structures

This key method is an old one but has had a recent revival with the possibility of very large-scale computation over text corpora. New methods include the so-called "inside-outside" algorithm for comparing partial structures and Sampson's use of simulated annealing for parsing without a traditional grammar.

5. Lexical extraction

Processes for the quasi-automatic derivation of large-scale lexicons for NLP from a range of sources including corpora and machine-readable dictionaries. Building hierarchies; the issue of inference in lexical structures.

6. The use of large-scale lexicons

Recent work on large-scale sense resolution in texts using such lexicons: subject- dependent codings and simulated annealing. Connectionist methods with lexicons.

7. Information extraction

Recent methods for extending traditional information retrieval so as to extract content directly from texts.

Material may be added later in the course and these lecture notes handed out again in a revised form. It is hoped that a pack of the listed readings will be handed out within the first two weeks.

Books that contain a lot of the background are:

E. Charniak, Statistical Language Learning, MIT Press, 1993 (Chapters on probability theory are included in the readings).

W. Hutchins and H. Somers, An Introduction to Machine Translation, Academic Press, 1992. [This book IS in print again, no matter what Blackwells say].

J. Allen. Natural Language Understanding. New Edition. (Chapter 7 on probability theory is included in the readings).

Y. Wilks, L. Guthrie and B. Slator, Electric Words: dictionaries, computers and meanings, MIT Press 1996.

If you can afford it you should get hold of a copy of the first as it contains much useful background on the statistical background to some of t he course and its application.

***********************

Week 1
Introduction: NLP, AI & MT

This term we shall concentrate on "Language Engineering Systems", which are a development of the artificial intelligence approach to NLP but also include strong data-driven, empirical, elements.

The historical role of NLP in artificial intelligence (AI) in general, sometimes nearer, sometimes further from the centre: in Japan in it very close to the centre, here further away. The old historical origins of AI are theorem proving and machine translation (MT).

The notion of an interlingual codings for MT and their origins in AI.

The central role of MT: the original and still the test-bed for all NLP theories. Reminders of why MT has proved so hard: some of the phenomena, including word-senses, reference and prepositional phrase attachment.. The role of knowledge representation and processing in it : contrast extreme positions in the history of NLP (Colby and Winograd).

Bar Hillel's "proof" of the impossibility of MT; the closeness of his position to that of classic AI--the argument is the same (that knowledge is needed for MT) but he differs from AI workers on the possibility of doing it.

MT systems are now widespread in the world, some work better than others. The classic system is the US system SYSTRAN, which works with massive lexicons and little syntax.

The recent return of an old speculation: that it might be possible to do MT by only statistical means: the CANDIDE system at IBM (NY).

CANDIDE is not in fact a purely statistical system and has not yet beaten SYSTRAN in open competition. Some speculation about how far it might develop.

*Y. Wilks. (1991) Machine Translation. Entry in the Encyclopedia of Artificial Intelligence (ed.) Shapiro. New York: Wiley.

You will find extensive material in the Hut chins & Somers book. Description of interlingual coding systems can be found in the Parsing English chapters of:

E. Charniak and Y. Wilks, 1978, Computational Semantics, North Holland.

Week 2
Machine translation: symbolic and statistical

This week we shall move on to give some basic MT terminology and then to contrast standard symbolic approaches to MT (exemplified by the oldest surviving commercial system, the US SYSTRAN system) with the more recent attempts to do machine translation by pure statistics.

Basic MT terminology is needed: systems are DIRECT, TRANSFER or INTERLINGUAL (or PIVOT). We discussed last week the ways in which the last type related directly, in the history of NLP, to the kinds of semantic/conceptual system you discussed last terms (e.g. the early work of Schank, Wilks and others). We look directly at the "interlingual hypothesis" and ask how easy it is to refute. We contrast the notion of machine-aided translation with what Bar Hillel was trying to prove impossible.

MT systems are now widespread in the world, some work better than others. The classic system is the US system SYSTRAN, which works with massive lexicons and little syntax. This system has been developed and evaluated or over thirty years, and is now in a state where it cannot be edited or debugged any more, but is still being extended to new languages!

We then turn to the recent return of an old speculation: that it might be possible to do MT by only statistical means: the CANDIDE system at IBM (NY). This notion has been discussed since the earliest times of MT but the computational power, and the statistical algorithms, were not then available to test it efficiently. SYSTRAN is a most peculiar piece of program, and has more AI features than its detractors admit. It is in regular use by the CEC and many major companies, as well as on Minitel in France.

CANDIDE is not in fact a purely statistical system and has not yet beaten SYSTRAN in open competition. Some speculation about how far it might develop. SYSTRAN, however much people affect to despise it, remains the MT system that all other systems must beat to be taken seriously.

*Y. Wilks. (1992) MT Developments in the US. (a critique of the IBM approach to MT).

Week 3
Statistical machine translation

This week we look in some detail at the MT system developed over the last four years at IBM, New York. It is called CANDIDE and claimed to be a purely statistical system---in that it was initially said by its developers----Jelinek, Brown and Mercer--- to need no linguistics, no AI, not even any foreign language speakers. From their point of view, knowledge of language is simply knowledge of the probabilities of symbol transitions; there is nothing else and the whole of AI/linguistics is, as they put it, "snake oil"!

CANDIDE is not in fact a purely statistical system and has not yet beaten SYSTRAN in open competition--and SYSTRAN, for all its faults, is the one you have to beat to be doing serious MT, as we saw last week Some speculation about how far it might develop----and in particular the ways in which IBM have been adding types of symbolic knowledge to their system so that it is in fact now a statistical-symbolic hybrid.

There are distinct types of algorithm involved in the original IBM process --- now supplemented by other more familiar ones, as we shall see --- each of which has variants and developments elsewhere:

1) the tagging of text automatically with part-of-speech type (work begun at Lancaster in the 1970s and bought by IBM!);

2) the alignment of sentences across large amounts of parallel text;

3) the trigram model of the source text (which you were simulating in the tutorial last week);

4) the correspondence model between the parts of the aligned sentences across the two languages (corresponding in a rough way to the transfer component of a "classic" MT system);

5) the trigram-based generation model of the target language.

We shall look at variations of modules (1) and (2) later in the term: IBMs revival of empirical, or corpus based, linguistics has had an enormous influence on computational linguistics, way outside MT proper. Other, symbolic systems, have begun to add statistical modules to their system so that they too are now hybrids, coming from the other side, as it were.

*P. Brown, S & V Della Pietra, J. Lafferty and R. Mercer, (1992) Analysis, Statistical Transfer and Synthesis in Machine Translation. Proc. Fourth Conference on Theoretical Issues in MT, Montreal.

*P. Brown, J. Cocke, S & V Della Pietra, F. Jelinek, R. Mercer and P. Roosin. (1990) A statistical approach to Machine Translation. Computational Linguistics.

The concepts in Charniak (1993) chapters 2, 3, and possibly 4 will be useful here, particularly for those with no background at all in statistics and their application to language processing. I have also put in the readings a chapter on basic probability theory from James Allen's Natural Language Understanding--some may find this easier than Charniak.

Week 4
Alignment algorithms

Alignment algorithms, one of the sub-modules of IBM and now many NLP systems that use parallel corpora in a number of languages, are essential to make those corpora yield useful information---alignment simply means putting sentences of one corpus (language A) into relationship with the corresponding sentences of the corpus for language B---do not imagine for a moment this can be done by aligning sentences in simple order A1-B1, A2-B2 etc.!!

IBM pioneered a method that used only calculations of sentence length in terms of words, and then ordering sentences by that measure within a hard-boundary segment, i.e. one you could trust, like a paragraph (on the assumptions that sentences in different paragraphs would NOT align).

Church and colleagues at AT&T experimented with a variant on this technique that used character lengths of sentences for the ordering. This method was combined in a separate pass by locating corresponding word pairs (between the languages) over "trustworthy" regions such as paragraphs. If word associations between the languages were found that were very strong over the whole text (taken by paragraphs) then those links were used to align individual sentences within paragraphs.

A variant on this due to Simard used not the links found by association but simple character overlap between cognates that the two languages shared e.g. "government" and (In French) "government". This was just as good as the association method but is of course only possible between very close languages.

An additional technique suggested by Kay and implemented by Catizone et al. was to prune a general search for association links--first across firm boundaries like paragraphs and then applying it to the boundaries in question like sentences- ---- by using a bilingual dictionary between the languages in electronic form and to base an initial alignment of sentences on word pair s that were unique equivalents between the languages (if a word in language A has many equivalents in Language B it may still be of some use).

*W. Gale & K. Church, (1989) A program for aligning sentences in bilingual corpora.

*R. Catizone, G. Russell, and S. Warwick. Deriving Translation data from Bilingual Texts. Proc. European Conf. on Computational Linguistics. 1989.

Week 5
Taggers

A concrete example of a rule based tagger in action: one of the oldest and best (Cherry's PARTS programs at AT&T). The details are all in the paper handed out this week but the basic idea is very simple: a stop list is given of the so called "syncategorematic" words of English (i.e. those that aren't parts of infinite classes, like nouns). It is assumed (reasonably) that these are strong clues to the parts of speech of words in their neighborhood in sentences. Special rules are then added to locate general parts of speech (e.g. nouns) in the regions of these special words. letter fix-up rules are added to deal with problem words like "that". the results are surprisingly good, about 95% without a lot of effort. This class of approach should be contrasted with the statistical approach pioneered at Lancaster of finding significant sequences of parts of speech in hand-tagged text and training a system to apply the sequences to new texts. This also needs some "stop" words to get going but has the advantage that it can be tuned to new text types.

*L. Cherry. PART: a system for assigning word classes to English text. AT&T memorandum. 1978

*K. Church. (1988) A stochastic parts program and noun phrase parser for unrestricted text. Second Conference on Applied NLP, Austin.

Week 6
Empirical grammar induction

We now turn to an example of work to obtain phrase structure automatically from texts/corpora. The aim is traditional (back to the pre-Chomskyan days of American empirical linguistics--Zellig Harris) but above all to overcome the fact the a priori grammars of recent decades do not fit the texts when used for parsing.

Brill and Marcus is a classic approach: they start with the tagged Brown corpus as input. The system acquires a CFG (context free grammar) with each rule being given a score. The grammar, when acquired, can be used (they argue) to parse and score (new) strings of POS tags. Non-terminal nodes are NOT LABELED in that parsing--you may have a noun phrase but you will not be able to call it that in the rule you have acquired which, as we shall see, is simply a relationship between POS tags. This method can be contrasted with using (a) using mutual information to acquire phrase boundaries (b) the inside-outside algorithm for phrase structure matching (c) the MDL (Minimum Description Length) algorithm for acquiring a phrase lexicon or proto-rules.

The rules to be acquired are binary branching. The basic idea: if an ordered tag pair has the same distribution as a single tag in the corpus, then the single tag substituted by the pair is a plausible binary CFG rule and the two adjacent tags form a constituent. e.g. DETERMINER +NOUN is a plausible pair (they occur frequently together) and in the same tag context as NOUN and PRONOUN:

hence

NOUN --> DET NOUN

is a plausible candidate rule.

Note that this idea differs from one which starts by finding common tag ngrams --- in that NOUN VERB is probably a common ngram but has no obvious co-substituent in context. (note that this isn't obvious either: NOUN VERB may well turn out to have significant substitutability relations with VERB, as in

"The big male (dog walks)" and "The big male (smiles)".

They concede that PRONOUN VERB and VERB distribute similarly which could be a problem (e.g. (eat) fish vs. (we eat) fish).

For every possible tag rule proposed as above,

Tx --> Ty Tz

the system computes a measure of similarity of their environments called divergence, which indicates they are substitutable if the measure is suitably low. It has been part of much recent statistical linguistics (e.g. the IBM MT system as well) that very local measures of environment are sufficient for this kind of similarity; i.e. you don't have to put the Tx and TyTz into very long strings to check the divergence measure but only a frame of two words which they compute as follows:

1. word1 Tx word2 number

and

2. word1 Ty Tz word2 number

where the two values of "number" are the number of times Tx (or Ty Tz as a string) occur between word1 and word2, measured over all the words in the corpus (NB NOT other tags). Rules are scored by considering both directions of the CFG rule being created:

probability of (left rule item given right)

and

probability of (right rule item given left)

eg.results are  given (Right) DET SING NOUN, then PROPERNOUN is the likeliest Left but given (Left) PROPERNOUN, then PROPERNOUN PROPERNOUN is the likeliest Right.

The result of this is a scored CFG, with probabilities attached to lists of leftside-given -rightsides and vice versa. The parse of a sentence is then achieved by taking the structure resulting from an application of a rule set whose product totals are lowest (adding each rule application to a total): this defines the "best set" of rules for a sentence, presumably applying only the given-Right take likeliest-Left substitution, otherwise we could never know the tag string would shorten.

Problems already suggested are:

AST VERB +IN has similar behavior to just PASTVERB and so on, though this may be a true fact about the distribution of phrasal verbs in languages like English and not a problem at all.

However, they have a complex correction function that adjusts the rule scores (derived as above) in an attempt to raise the scores of tag strings deemed TRUE CONSTITUENTS----roughly this means computing the ngram boundary between Tx and Ty and then after the pair at the junction Tx Ty and Tz which follows. The argument is that, if Tx Ty is a "true constituent" then Ty will be much more probable after Tx than Tz is after Tx Ty; they then add a correcting element to the scores for the binary rules based on this.

It seems to me that this doesn't solve the problem if scores for rules are taken only over tags and, say, "up" and "in" are always in the same tag set, because this method will not distinguish between "washed up" and "washed in" when appearing as tag strings even though the first is a natural constituent an the second not--it isn't clear from the examples whether they do distinguish prepositions and postverbs as tags. If not, then no adjustment to the probability of the rule

VERB ---> VERB PREP

will help much.

They concede that even these adjustments yield "incorrect" joinings of subjects to verbs and so have to insert an explicit correction by hand that forces verb+obj joins before subjects--which rather removes the point of the method!

This system when run acquired 41,000 rules given the Brown corpus. They pruned this et and then parsed sentences of between 5 and 14 word s with no commas or quotations etc. the results were about 62% of the total sample parsed correctly (and 72% if one adjustment is allowed); remember these are BINARY parsing not intuitive ones.

Non-terminal labels can be assigned (remember these rules give only new TERMINALS as candidates for the non-terminal labels) if intuitive collapsing rules are made up to create non terminals.

Tree-Bank Grammars

Deriving a rule set directly from a tree-banked corpus derived by hand: the Penn Tree Bank (PTB)

Techniques

* Penn Treebank rule extraction

* Compaction of the resulting rule set

* Bottom-up linguistic component extraction using the MDL principle

* Merging top-down and bottom-up approaches

Sample Text from Penn Treebank

( (S 
(NP Liberty/NP National/NP)
(VP exchanged/VBD
(NP (PP about/IN)
78.64/CD shares/NNS
(PP of/IN
(NP its/PP$ common/JJ stock/NN)))

.....

Rule Compaction

The idea is to remove all redundant}rules from the grammar. With some restrictions the order of removing does not make any difference.

Out of

NP -> NP CC NP

NP -> DT NN

NP -> DT NN CC DT NN

the last rule can be removed, because it can be substituted by the other two

Minimum Description Length Principle Devised by Jorma Rissanen in 1978, though mathematical foundations were laid out in the 60's (Kolmogorov):

The idea is to minimise:

H(D) + L(M)

e.g. entropy (length) of the data plus the length of the model

For example, a rule should not be added if the reduction in the data length is less than the length of the rule itself.

Results obtained with the MDL principle:

Lexical Learner

[ n_``_interim_''_coalition_government] -->

[_in_a_coalition_government_] --> curDL: 8250

[_the_state_department_] --> curDL: 8198

[_coalition_government] --> curDL: 8142

Grammar Learner

HC0: [ $ CD CD ] --> curDL: 153114

HC1: [ TO VB ] --> curDL: 151047

HC2: [ MD VB ] --> curDL: 149244

HC3: [ IN DT NN ] --> curDL: 14759 8

HC4: [ DT JJ NN ] --> curDL: 146197

HC5: [ DT NN ] --> curDL: 145170

HC6: [ MD RB VB ] --> curDL: 144369

Charniak and Krotov have both achieved verb large compactions (Krotov 18K rules to 2K) and good rates of success (around 80%, higher than Brill and Marcus) with a subsequent parser.

* Brill, E. & Marcus, M. (1992) Automatically acquiring phrase structure using distributional analysis. Proc. DARPA Speech and Language Workshop.

* Gruenewald, P. 1995. A minimum description length approach to grammar inference.

Charniak (1993) chapter 5 should be read (and 6 and 7 if you feel strong).

Week 7
Deriving verb patterns empirically

Levin's central hypothesis about verbs:

Levin's work is guided by the assumption that the behaviour of a verb, particularly with regards to the expression and realisation of its arguments, is to a large extent determined by its meaning. In other words, there is an interaction of its meaning and general principles of grammar. A methodological consequence of this assumption is that verb behaviour can be used effectually to probe for linguistically relevant pertinent aspects of meaning. These aspects can be used to minimize the amount of information necessary for any given verb by factoring predictable information out of lexical entries.

Native speakers can make extremely subtle judgements concerning the occurrence of various syntactic expressions. For instance, they know what diathesis alternations verbs may participate in: e.g. 'spray', 'load' and the locative alternation:

The farmer loaded apples into the cart

The farmer loaded the cart with apples.

versus verbs that allow only one diathesis:

'fill', 'cover'

*Gina filled lemonade into the glass

Gina filled the glass with lemonade

These examples derive from problems in Fillmore's case grammar, yet it is very difficult to give any non-intuitive notion of what exactly constitutes a well-formed Levin diathetic pair: it cannot be identified with sameness of meaning, structure, cases or even major participants in the action described.

Various aspects of the syntactic behaviour of verbs are tied to their meaning. This is exemplified by the verbs 'touch', 'hit', 'break', and 'cut' They are all transitive.

Relevant diathesis alternations: touch hit cut break

Conative (e.g. 'cut at the bread') N Y Y N

Body-Part Possessor Ascension Y Y Y N

(e.g. 'Sue touched Sam on the shoulder')

Middle N N Y Y

(e.g. 'Vases break easily')

These verbs are instances of classes of verbs that show this behaviour, and their members share certain aspects of meaning.

Conative: contact and motion

Body-Part Possessor Ascension: contact

Middle: causing a change of state.

'touch': a pure verb of contact

'hit': a verb of contact by motion

'cut': causing a change of state by moving something into contact

with the entity that changes state

'break': a pure verb of change of state

The hypothesis that a word's behaviour is fully semantically determined is not uncontroversial, but it is fair to state that the meaning of a verb has considerable predictive ability.

Verb classes arise because a set of verb with one or more shared meaning components show similar behaviour. The important theoretical construct is the notion of meaning component. The classification should further the isolation of meaning components, but does not take into account every property of every verb, because this would most likely result in verb classes with only one member. Choosing the right meaning components responsible for syntactic behaviour is the challenge which has to be faced.

Methodology:

a) Levin isolates verbs with similar behaviour wrt diathesis alternation on the assumption that they share at least some aspect of meaning.

b) and isolates meaning components that these verbs have in common.

Dorr and Jones' experiments on verb patterns

Levin proposed that there is a 1-1 mapping between the syntactic patterns and semantics of verbs. They created a database of Levin's verb classes and example sentences from each class and wrote a parser to extract basic syntactic patterns from the sentences in Levin's book. They then characterized each semantic class by a set of syntactic patterns, which they call a 'syntactic signature' and used the resulting database as the basis of 2 experiments, designed to discover whether the 'syntactic signatures' would tell them anything about the meaning of the verbs.

Two experiments were carried out:

1) This method takes word-sense distinctions into account by considering each occurrence of a verb individually and assigning it a single syntactic signature according to Levin class membership.

Method:

i. Automatically extracts syntactic information from the example sentences to yield the syntactic signature for the class: (this is the set of syntactic frames which the verb may take:

eg. [np, v, np], [np, v, pp(to)].

ii. Discover which Levin semantic classes have uniquely - identifying signatures. This information was taken from Levin's example sentences for each verb class and included information about NEGATIVE EXAMPLES.

They parsed the 1668 example sentences from Levin's book and found that these sentences reduce to 282 unique patterns. The 191 classes of sentences listed with each of the 191 semantic classes reduces to 189 unique syntactic signatures. 187 of them uniquely identify a semantic class, meaning that 97.9% of the classes have uniquely identifying syntactic signatures. Only 2 semantic classes do not have enough syntactic information to distinguish them uniquely. Using only positive evidence gives 88% of the semantic classes uniquely identifying syntactic signatures.

But this means only that the sets of verb frames are not identical for almost all of the verb classes. Is this surprising? Also, the syntactic frames were derived from Levin's own examples: would it be difficult to set up example sentences so that each class had a unique signature, and hence these results would be far more believable if the examples were taken from a corpus.

Another problem is Levin's negative examples: these range from the syntactically incorrect 'Tony broke at the window' to the unlikely 'Tony broke the wall with the cup'. It is clear, then, that these negative examples contain more than just syntactic information. Also, from a pragmatic point of view, there is no way of deriving negative examples from a corpus. When the negative examples are ignored the verb classes with unique syntactic signatures reduces to 88%, a more reliable indicator of the semantic information which can be derived from these syntactic signatures.

It is also very difficult to see how syntactic signatures could be employed in practice: the syntactic frame associated with a particular occurrence of a verb will, in general, not provide enough information to identify which semantic class it belongs to (since the semantic classes are identified by sets of syntactic frames) but we cannot look at the syntactic frames associated with each occurrence of a verb in a text (which would provide a set of syntactic frames) since the verb may occur in different senses in the text---this absence of any notion of wordsense in these experiments that could apply to a corpus of verb occurrences is crucial.

The syntactic frames for verbs seem very close so even if a verb occurred mainly in a single sense in a text and a few times in another sense, this other sense may introduce enough spurious data to make it impossible to classify the verb. The only way to do it would be to sense disambiguate all occurrences of the verb first but the Levin schema is meant to do this!

2) This experiment takes each verb and derives the syntactic signature associated with it. The semantic class of the verb is ignored and if a verb may occur in several semantic classes then these are lumped together. This method ignores word-sense distinctions altogether by assigning one syntactic signature to each verb, regardless of whether it occurred in multiple classes.

The experiment aims to check how close the semantic classification of Levin is to the classification made by the verb's syntactic signature. They discovered that only 6.3% of Levin's semantic classes map completely onto a syntactic signature. This is said to demonstrate that a 20-fold improvement can be achieved in deriving semantic information from syntactic cues if the syntactic cues are first divided into distinct groupings that correlate with different word senses.

The issue of "senses" of verbs is not central to Dorr's work but does need some further discussion since the work needs to be related to and distinguished from work like that of Yarowsky on one-sense- per-collocation applied to verbs. One some interpretation of what Yarowsky is claiming then this work of Dorr's is no more than a special case of his---she needs to show this is not so, so that readers may grasp the state of the field better.

The answer is important in answering the question as to whether the syntactic information is doing the verb "disambiguation" as Dorr claims, or whether WordNet is (and hence the answer to question (1)). Raskin and Nirenburg have argued (1996) that Dorr's claim is the reverse of Levin's and that she is claiming willy-nilly that syntax determines semantics for the English verb and not vice versa, as Levin certainly seems to intend.

Beth Levin, 1993, English Verb Classes and Alternations, A Preliminary Investigation. University of Chicago Press, Chicago, London.

* Bonnie Dorr & B Jones, 1996, The Role of Word Sense Disambiguation in Lexical Acquisition: Predicting Semantics from Syntactic Clues, In COLING '96.

Week 8
Lexical Extraction Programs

We are going to investigate a quite different method for getting reasonable sized NLP systems: different, that is, from the statistical analysis of corpora. Human dictionaries are normally supposed to contain a great deal of information about words: their pronunciations, syllable composition, histories, grammatical function, and definitions of meaning. can any of this be extracted from dictionaries automatically so as to produce substantial lexicons for NLP projects?

In the last fifteen years there have been a substantial number of such projects, most of them working with the publishers tapes of learners dictionaries for English such as LDOCE (Longman's dictionary of contemporary English). A n enormous amount of effort was required to get these tapes into a usable form. Then efforts were made to EXTRACT the information a machine-readable dictionary (MRD) contained: to put it in a form of what we will call a machine-TRACTABLE dictionary, one that could be used for a practical NLP application. Doing this required assumptions that we will discuss:

1) Extricability--that the information could be computationally extracted from an MRD;

2) Sufficiency: that the knowledge there was of the right kind and depth for language understanding by machine;

3) Bootstrapping: that the amount of human intervention that would be required to achieve this (by the hand coding, say, of semantic primitives without adequate definitions) would be finite in practical terms.

During the lectures we will look at some background but concentrate on the CRL lexical program with which I was associated; you should however broaden your reading with other systems.

We shall now look at other techniques for extracting information from MRD's:

1) Genus location in the definition--a general method with some special cases.

2) Genus disambiguation: the use of LDOCE semantic codes

3) Building the noun hierarchy, dealing with cycles at the top.

4) Disambiguating the whole definition: using neighborhoods--statistical basis for these (or Lesk's original method)

5) Refining the neighborhoods with subject code filtered neighborhoods--cued by the codes of other words in the definition or sentence.

NOTE that these methods of disambiguation can now be more widely applied than just to the dictionary definitions but can be general sense-resolution-in-text mechanisms.

We then describe the use of a very general statistical optimization method called simulated annealing.

I shall also discuss word-sense disambiguation (sense tagging of text) and you should read Charniak (1993) chapters 9 and 10.

* Guthrie L., Pustejovsky, J., Wilks, Y. & Slator, B. (1996) The role of lexicons in natural language processing. Comm ACM.

Electric Words is a book devoted to this topic.

*Bruce, R., Guthrie, L. and Wilks, Y. (1994) Automatic Lexical Extraction--theories and applications.

*Guthrie, L., Guthrie, J., Wilks, Y., Cowie, J., Farwell, D., Slator, B., and Bruce, R. (in press) A Research Program on Machine-Tractable Dictionaries and their Application to Text Analysis. In Zampolli (Ed.) Lexical Resources.

Some other papers in the general literature that will be important if you choose the essay option for the project:

*Boguraev, B., and Briscoe, T. (1987) Large lexicons for natural language processing: exploring the grammar coding system of LDOCE. Computational Linguistics. See also the book of papers on the lexicon edited by them (I can lend this for Xeroxing individual papers).

*Resnik, P. (to appear) WordNet and distributional analysis. In C. Weir, et al. (Eds) Statistically based NLP techniques.

Lexical papers in Wilks, Y. (Ed.) Theoretical Issues in Natural Language Processing (1991) Erlbaum.

Week 9
Sense tagging

Sense tagging of text on a large scale: why do it?

The history of computational linguistics/NLP and the history of machine translation at its core---word sense ambiguity has always been believed to be one of the fundamental intractable problems---hence sense tagging (if possible as a separable process) must help.

Immediate problem: sense tagging (like POS tagging) seems circular against the set of senses you start with, but worse since sense taxonomies may be INCOMPARABLE, as opposed to being of a different grain.

Kilgarriff's argument (1993):

Straw man --the "bank model of meaning"--held by those in sense-disambiguation in AI and elsewhere.

K argues that senses develop from core senses (cf. Givon 1997 and Briscoe/Copestake work on "grinding").

Not the same as his real argument which is about human recognition of text senses not fitting given list (as the BM model might assume they did).

His key claim: 87% of non-monosemous words in sample have at least one text occurrence that cannot be associated with a unique LDOCE sense. Hence BM refuted (yet again)!

BUT, that claim is consistent with 99% of text usage being associated by subjects with one and only one LDOCE sense!

BM long known a straw man and that the issue of novel sense was crucial.

K mainly refuted by the fact that sense-tagging procedures work at reasonable levels (albeit often for single text words, small subsets, or word-by-word procedures only).

Sanity check figure: taking the first LDOCE sense gives 62% correct sense assignment.

Historical efforts at sense-tagging:

Single words:

Lesk (overlap algorithm) 1967

McDonald-Plate (Pathfinder networks) 1988

Brown, Mercer, Jellinek (concurrence statistics) 1989

Gale, Church, Yarowsky (1992)

Yarowsky (1995)

All sentence words:

Wilks (preference over codings) 1967

Small (decision trees) 1980

Cowie and Guthrie (simulated annealing) 1992

Dagon and Itai (statistics and bilingual corpora) 1994

Ide and Veronis (Neural nets) 1994

Homographs revisited

Unclear how they differ from senses-in LDOCE about 93% of homographs differ by part o speech (not bank of course) so the notion is tightly connected to part of speech.

Wierzbicka's views (cf. Antal) of sense and the notion of the compact lexicon--reducing sense to what you can disambiguate.

Hanks: lumpers and splitters.

What does NLP need--maybe what simulated annealing can separate.

Yarowsky (and colleagues):

One sense per discourse: the claim that 94% of word senses are discourse unique. If true, this would remove, at a stroke, MTs historical excuses, and provide an additional strong heuristic on WS tagging---tag them all in a text by any method and use the fact that they must in general coincide. OR tag the first occurrence of a word and assume the rest the same with 94% confidence. Yarowsky methods seek to avoid handtagging so need some independent criterion of right sense tagging.

Yarowsky Claim 1: words appear with fewer senses in single texts than normally assumed.

Evidence: taking 97 sense types in Grolier's encyclopedia --- 7569 tokens (of those types) were monosemous in the corpus, and 6725 with between two and six senses.

"most words (both by token and type) have only one sense".

Yarowsky claim 2: in a single discourse words tend to appear with one and only one sense.

Evidence: from English-French parallel text (Canadian Hansard) tested on six (!) word pairs that do not share senses across Eng-Fr (e.g. duty --> devoir V droit), and occurred at least 1 50 times.

Training method over 100-word contexts for 60 instances of each pair member. Result vary between 82 and 100% for this method of discrimination against the bilingual self-tagging (depending on the pair). Unclear whether this confirms one-sense-per-discourse.

Yarowsky's (1991) monolingual discrimination method (without the bilingual problems):

Use of a single Roget category as a marker of sense (cf. Masterman 1966).

Grolier words (e.g. crane as bird vs. machinery). Again 100 word training contexts over Grolier and the Roget heads of all those words.

Result: over 12 (!) selected words 93% correct Roget heads assigned, checked against hand-tagged data for those words.

Again not clear if this confirms one-sense-per-discourse or not.

Yarowsky et al. hand check on Grolier:

Three authors and two colleagues review the OALD senses for nine test words and answer questions from Grolier article contexts based on a same-notsame choice. Of 54 pairs of contexts for a single Grolier article 51 shared the same sense and three did not.

No Kilgarriff problems can arise since the questions are not against the OALD set. Encyclopedia articles no surprise to anyone for single-senseness! No universal hypotheses has been confirmed.

Yarowsky (1993, 1995) On sense-per-collocation:

it is highly unlikely that the following two sentences for "plants" can both be attested in any corpus:

Plastic plants can fool you if really well made.

Plastic plants can contaminate whole regions.

Y claims this broadly true for any sensible definition of collocation (and sense). (Cf. Levin and Dorr on verbs). Cf. "Un golpe bajo".

No particular discovery procedure for the "key words" or POS needed for disambiguation. Recent 1995 work bootstrapping between his and the one-sense-per-discourse claim.

Result: between 93 and 97% on a handful of bisemous word-pairs.

Problems of evaluation:

* the tiny samples in corpus work (vs. lexicon based work --- simulated annealing);

* the fluctuating notion of word-sense (lexicon, thesaurus, bilingual correspondence, made up,...)

* Bad news from the front: if you POStag a text (97%) level and then ask of LDOCE what senses and homographs have distributions not uniquely distinguished by a POStag, the rough answer is 6% for homographs and 12% for senses. This is a mass (not single word) method with no training required (and could be improved by combination with annealing, which also requires no training).

For "intermediate representations"--cf. parsings--as opposed to full throughput systems like MT. Two contradictory beliefs: engendered by K and Y respectively:

* sense tagging cannot be done (K)

* sense tagging has been solved (Y)

But it may turn out even simpler than either guessed!

*K. Church & P. Hanks, Word Association Norms, Mutual Information and Lexicography, ACL 1989.

A. Kilgarriff (1993) Dictionary word sense distinctions, an enquiry into their nature. Computers and the Humanities.

*Y. Wilks (to appear) Senses and texts, Computers and the Humanities.

*D. Yarowsky (1992) Word sense disambiguation using statistical models of Roget's categories trained on large corpora. COLING 1992.

*D. Yarowsky, (1995) Unsupervised word sense disambiguation rivalling supervised methods, ACL 1995.

Week 10
Information extraction (IE)

Information Extraction (IE) technology is now coming on to the market and is of great interest to information end-user industries of all kinds, especially finance companies, banks, publishers as well as governments. For instance, finance companies want to know what company take-overs happened in a given time span, and want widely scattered text information reduced to a simple data base. Lloyds of London need to know of sinkings throughout the world and pay large numbers of people to locate them in daily newspapers in a wide range of languages. All these are prima facie cases for IE.

Computational linguistic techniques and theories are playing a strong role in the emerging technology of information extraction (IE), not to be confused with the more mature technology of Information Retrieval (IR), which selects a relevant subset of documents from a larger set. IE extracts information from the actual text of documents, by computer and at high speed, normally from available electronic sources such as news wires, and does so much faster than humans can. This technology is normally preceded by an IR phase, which selects a set of documents relevant to some query--normally a string of features or terms that appear in the documents. So, IE is interested in the structure of the texts, whereas one could say that, from an IR point of view, texts are just bags of words.

You can see these two ways of thinking about text information and its usefulness by thinking about finding, from the World Wide Web, what TV programmes you might want to watch in the next week--there is already a web site with text descriptions of programmes on 25 or more British TV channels; more text than most people can survey easily. On this web site you can input the channels or genre (e.g. musicals, news etc.) you are most interested in and the periods you are free to watch. You can specify up to twelve words that can help locate programmes for you, e.g. stars' or film directors' names. The web site has a conventional IR engine behind it, what we could call a boolean function of the words and genre/channel names you use. The results are already useful--and currently free---and treat the programme descriptions as bags of words.

Now suppose you also wanted to know what programmes your favourite TV critic liked: and suppose the web site also had access to the texts of recent newspapers. Now you see that an IR system cannot answer that question because it involved searching review texts for films and seeing which one are, in the structure of the texts, described in favourable terms. It is that that requires IE and a notion of text structure. In fact, such a search for programme evaluations is not a best case for IE, and I mention it only because it is an example of the kind of leisure and entertainment application that will be so important in future informatics developments ----think of the contrast between the designed uses and the actual uses of Minitel!

IE work goes back to early work by de Jong at Yale University on scripts: searching texts with a computer to fill predetermined slots in structures called scripts, or more usually templates. Film evaluations are not very script like, but the earlier examples of scenarios of ships sinking (which Lloyds of London needs to know) or the patterns of company take-overs: these are much more template/scenario like and suitable for IE techniques.

IE shares a key feature with an older, more mature, discipline, machine translation: both are associated with a strong evaluation methodology. This means that one can see, by objective standards, how well a system is doing at a chosen task and whether it is getting better with time. Our research group has developed a basic IE system (Wakao et al., 1996) which has competed within the US ARPA TIPSTER and MUC competitions and emerged in 1995 as the best IE system outside the US at such tasks as locating proper names and descriptions in texts, and determining coreferences between them. The algorithms to do this are dependent on a range of forms of linguistic information-- morphological, syntactic and semantic-- in addition to a world model.

IE itself and the current standards of evaluation and success are set out in (Lehnert & Cowie 1996), but broadly one can say that the field grew very rapidly when ARPA in the US funded competing groups to pursue IE, and based it initially round scenarios like terrorism, where the funders wanted to replace government analysts who read publicly available newspapers for terrorist events and then filled templates---when, where, a terrorist event took place, how many casualties etc. --- and it was this activity that IE was designed to automate.

Early successful systems like JASPER which was built for Reuters depended on very complex crafted templates, made up by analysts. However, the IE movement has grown by exploiting, and joining, the recent trend towards a more empirical and text based computational linguistics, that is to say by putting less emphasis on linguistic theory and trying to derive from the large volumes of text data that machines can now manipulate various levels of linguistic generalisation. A conspicuous success has been part-of-speech taggers, systems that assign one and only one part-of-speech symbol (like Proper noun, or Auxiliary verb) to a word in a running text and do so on the basis (usually) of statistical generalisations across very large bodies of text. Recent research has shown that a number of quite independent modules of analysis of this kind can be built up independently from data, usually very large electronic texts, rather than coming from either intuition or some dependence on other parts of a linguistic theory.

These independent modules, each with reasonably high levels of performance in blind tests include part-of-speech tagging, aligning texts sentence-by-sentence in different languages syntax analysis, attaching word sense tags to words in texts to disambiguate them in context and so on. That these tasks can be done relatively independently is very surprising to some who believed them all dependent sub tasks within a larger theory. These modules have been combined in various ways to perform tasks like IE as well as more traditional ones like machine translation (MT). The modules can each be evaluated separately --but they are not in the end real human tasks that people actually do, as MT and IE are. One can call the former "intermediate" tasks and the latter real or final tasks---and it is really only the latter that can be firmly evaluated against human needs ------by people who know what a translation, say, is and what it is for. The intermediate tasks are evaluated to improve performance but are only, in the end, stages on the way to some larger goal.

The empirical movement, basic texts linguistics in texts data, has another stream, which was the use in language processing of the large language dictionaries (of single languages and bilingual forms) that became available about ten years ago in electronic forms from publishers tapes. These are not textual data in the sense just used, but intuitions about meaning set out by teams of lexicographers or dictionary makers----sometimes they are wrong, but that have proved a useful resource for language processing by computer, and lexicons derived from them have played a role in actual working MT and IE systems.

What such lexicons lack is a dynamic view of a language--they are fossilised intuitions. To use a well known example: dictionaries of English first tell you "television" means a technology or a TV set, but it is nearly always used now to mean the medium itself. Modern texts are out of step with dictionaries--even modern ones. It is this kind of evidence, that for tasks like IE lexicons must be made to adapt to the texts being analysed that has led to a new, more creative wave, in IE research: the need not just to use large textual and lexical resources, but to adapt them as automatically as possible, to enable them to learn or adapt or "be tuned" to new domains and corpora, which may mean dealing with obsolescence or just dealing with the specialised vocabulary of a domain not encountered before.

One important fact to note is that what may seem to be separate information technologies are really not so---MT and IE for example are just two ways of producing information to meet peoples needs and can be combined in differing ways: for example, one could translate a document and then information extract from the result or vice-versa--which was chosen might depend on the relative strengths of the translation systems available-a simpler one might do for only translating the contents of templates and so on. This last observation emphasises that the product of IE can be seen either as a compressed, or summarised text itself, or as a form of data base (with the fillers of the template slots corresponding to conventional database fields). One can then imagine new, learning, techniques like data mining being done as a subsequent stage on the results of IE itself.

If we think along these lines we see that the underlying distinction here, between traditional IR and the newer IE , is not totally clear everywhere but can become a question of degree. Suppose parsing systems that produce syntactic and logical representations were so good, as some now believe, that they could themselves process huge corpora. One can then think of the traditional task of computer question answering in two quite different ways. The old way was to translate the question into a formalised language like SQL and use it to do IR by retrieving information from a database- as in "tell me all the IBM executives over 40 earning under %50K a year". But with the parser of large corpora one could now imagine applying the query to form an IE template and searching the WHOLE TEXT (not a data base) for all examples of such employees---both methods should produce exactly the same result starting from different information sources ( a text versus a formalised database). One can see that what we have called an IE template can be seen as a kind of frozen query that one can reuse many times on a corpus and is therefore only important when one wants stereotypical, repetitive, information back rather than one off questions one wants answered.

Tell me the height of Everest, addressed to a formalised text corpus is neither a IR nor IE but a perfectly reasonable request for information. Tell me about fungi, addressed to a text corpus with an IR system, will produce a set of relevant documents but no particular answer. Tell me what films my favourite movie critics likes, addressed to the right text corpus , is undoubtedly IE as we saw. The needs and the resources available determine the techniques that are relevant.

At Sheffield we are working on two applications of IE systems funded among European LRE projects: one, AVENTINUS is in the classic IE tradition, seeking information on individuals concerning security, drugs and crime, and using classic templates. the other ECRAN, a more research orientated project, searches movie and financial databases as it is exploiting the notion we mentioned of tuning a lexicon so as to have the right contents, senses and so on to deal with new domains and relations unseen before.

In all this, and with the advent of speech research products and the multimedia associated with the Web it is still important to keep in mind how much of our cultural, political and business patrimony is still bound up with texts, from manuals for machines, to entertainment news to newspapers themselves. The text world is vast and growing exponentially--never be seduced by multi-media into thinking text and how to deal with it, and how to extract its content, is going to go away.

Basili, R., Pazienza, M., & P. Velardi, (1993) Acquisition of selectional patterns in sub- languages. Machine Translation, 8.

Cunningham, H., Gaizauskas, R. & Y. Wilks, (1995) GATE: a general architecture for text extraction, University of Sheffield, Computer Science Dept. Technical memorandum.

Dorr, B. & D. Jones (1996) The role of word-sense disambiguation in lexical acquisition: predicting semantics from syntactic cues, Proc. COLING96.

Granger, R. (1977) FOULUP: a program that figures out meanings of words from context. Proc. Fifth Joint Internat. Conf. on AI.

Kilgarriff, A. (1993) Dictionary word-sense distinctions: an enquiry into their nature. Computers and the Humanities, 26.

Cowie, J., Dunning, T., Guthrie L., Wilks, Y. (1994) Text processing using multilingual resources at the CRL. Literary and Linguistic Computing.

*Cowie, J. & W. Lehnert (1996) Information Extraction, in (Y. Wilks, ed.) Special NLP Issue of the Comm. ACM.

Levin, B. (1993) English verb classes and alternations, Chicago, IL.

Procter, P. et al. (1994) The Cambridge Language Survey Semantic Tagger. Technical Report, Cambridge University Press.

Pustejovsky, J. & Anick, P. (1988) On the semantic interpretation of nominals. Proc. COLING88.

Riloff, E. & J. Shoen (1995) Automatically acquiring conceptual patterns without an annotated corpus, Proc. Third Workshop on Very Large Corpora.

*Wakao, T., Gaizauskas, R. & Wilks, Y.. (1996), Evaluation of an algorithm for the recognition and classification of proper names. Proc. COLING96.

Wilks, Y. (1978) Making preferences more active, Artificial Intelligence, 11.

***********************************

Course project

The project can be either programming or an essay, OR I will consider original individual suggestions for projects at any time provided they concern some aspect of the course that interest you and you are prepared to work seriously on. Otherwise, here are three standard proposals:

i) An essay on the nature and limitations of lexical extraction programs: the large over the last ten years to extract syntactic and meaning structure automatically from machine readable dictionaries (MRDs). I expect more than a competent survey: I expect serious discussion of inherent problems in such an enterprise (i.e. are the structures extracted all circular, upon the particular MRDs from which they were extracted etc..) and of the limitations involved if such work is separated form information extracted from (non-dictionary) text corpora.

ii) Writing configuration files for the SMALLTAG tagger. The tagger is rule-based and reads in the tag set, the lexicon, and the rules from files at start-up. The rules are written in a special formalism which is documented in the manual (check the course site on the cover of this document). You will be given a tag set and a lexicon, but you have to write the rules yourself. You can change the tag set and lexicon if you like as well.

You can either write and adapt the rules by hand, or you can try to apply some machine-learning technique such as Eric Brill's transformation-based error-driven learning.

The performance of your rules is assessed by calculating the number of words that are correctly tagged (in %). There are tools available for this. Minimum requirements are that your rules tags at least 70% of the words correctly.

The following files are available, in the course directory

the program

manual in various formats

example data files (tag set, lexicon, and dummy rule file)

trial texts

evaluation scripts

some word lists (proper names etc)

When writing the rules, you should keep the following in mind:

1) If you want to say that a specific word should always get a specific tag, then add it to the lexicon; don't make a rule for it.

2) You can use (some) morphological information in the rules, for example to say that all words ending with "ly" are adverbs.

3) There is no theoretical limit to the lexicon, so you can add as much as you like, but since many words will have more than one possible tag (i.e. occur more than once in the lexicon), you need rules to disambiguate them. (You could in principle get hold of the LDB 75K word lexicon in the Department and add that to the tagger lexicon).

What you need to hand in are: a complete set of data files (tag set, lexicon, and rules) plus a document explaining the rationale behind your rules. If you decide to try the more ambitious machine-learning approach then you will need to hand in any code you write and a document explaining it rather than the rules themselves.

A reasonable write up of what you have done (as well as a result) is also essential; there will not be good marks for just rule sets and results. You should explain what you did and why, what resources you used, if additional to the ones in the basic set; and, above all, your methodology for improving the result to whatever maximum you reached. A proper project description (usually 5-10 sides of print) is essential.

iii) write CPSL rule files for the SPQR pattern matching information extraction system

The Tipster Common Pattern Specification Language (CPSL) was developed as a configuration language for IE systems, such as SPQR, developed at DCS, and Doug Appelt's Textpro ( http://www.ai.sri.com/~appelt/TextPro/). It is built on a foundation of regular expressions and operates on Tipster annotations, which are roughly functionally equivalent to HTML/SGML markup. On the course web page there are links to the SPQR software and manuals as well as relevant off-site information.

In this project you will decide on some source of natural language information. This could be a web site such as a news paper for example. Within some reasonably limited domain (e.g. traffic accidents) you will then create a formal representation that will be your desired output, and write CPSL rule files to compute it. You can expect some help in finding natural language lexicons and lists of proper names.

There will not be any well-defined criteria for success in this project, so great importance will be placed on the quality of the write-up. You must however submit your (CPSL) source files so that your results can be verified.

Here follows some suggestions for data sources. You do not have to worry about reading web pages from the web server. Simply use a browser to save them to a local file and use that file as input to SPQR.

1. Online News Most newspapers are available on the web. Specific examples include: news.bbs.co.uk, www.cnn.com, www.abcnews.com. You can decide yourself what information you will attempt to extract, but you should aim to include some complex structures such as events. If you select traffic accidents, for example, it will not be enough to recognise only the names of the people involved or the makes of the cars. The output would need to link that information to other data like the time and location of the accident.

An interesting project would be to create NewsForms (see also http://www.signiform.com/erik/pubs/newsund.htm) from free text news articles. This has the advantage that you do not need to worry about designing the output format since it is already given. You can limit yourself to one or a few of the 17 annotation types.

2. Encyclopaedia

Another suggestion is to use an encyclopaedia, either from the web (www.britannica.com, www.encyclopaedia.com) or CDROM (Encarta?). You could, for example, concentrate on biographical information and attempt to extract personal data (date and location of birth and death, occupations etc) from articles about individuals.

3. CIA factbook

A nice source of geographical (textual) information is the CIA factbook. It is already structured, but presumably not structured enough to be considered machine-readable. One possible project would be to use SPQR to extract this information into a well-defined format. Since this, at least from a NLP perspective, is easier that processing free text you would need to achieve significant coverage --- extracting all economical information about all countries, for example.

Since there is no way to evaluate your results automatically, you will have to do that by hand and include it in your write-up. It is enough if you evaluate performance for the top-level structures. In the traffic accidents example, you would need to count the number of accidents mentioned in your input, and the number that is recognised by your code. But if you, as an intermediate step, also recognise proper names which are included in the `accident' representations, you do not have to give figures for the names.

For all projects, you should hand in the write-up (on paper, not electronically) to Yorick and all code (including CPSL) to Tomas (electronically, preferably via email).

**************************

Week 2 Tutorial
Semantic/conceptual representations and interlinguas

1) Write the semantic/conceptual coding for some verb or noun in what you can remember of some system such as Schank's or Wilks' notation. It doesn't have to be exactly that notation, anything sensible will do if you can add some notes showing the meanings or primitives, cases etc., and any structural rule of interpretation, but do not add in notes what the coding means as a whole.

2) Exchange with a partner and see if they can work out what it means. What does this exercise tell you about conceptual representations, their interpretability; it and the function of an interlingua as, say, the pivot of a machine translation system?

3) Try to find someone with another native language for each group, such as Chinese, and see if they can teach sufficient (romanised) Chinese primitives to the group so that the exercise of (1+2) could be done and understood with primitives from another language entirely.

4) A Chinese speaker might indicate on the board the structure of the Chinese verbs that cover the notion of wearing something----in Chinese and Japanese the verb used depends on what s worn, hat, coat, shoes, spectacles etc. Discuss what influence that fact has on the interlinguality of notions like WEAR which is a single concept in English and most western languages.

Does this refute the "interlingual translation" hypothesis or is there a way around it?

Is such a system robust enough to translate, whether coded with a Chinese - based or an English-based concept (or concepts) for the wearing domain?

Week 3 Tutorial
n-gram models of language

If you were asked to predict the next word in English after

The man in the ---------

you might well say "moon", "blue suit" etc. and checking a large corpus of text would tell you whether what you guessed was the most probable answer. Such probabilities are calculated in terms of a chain of prior words: the longer the chain, the better the guess but the more computation is required. The simplest is a chain of two or bi-gram: for each word in a text (each word TYPE, meaning the same letter string wherever it occurs in the text) one can calculate the ranked likelihood of its successors in a corpus. So, after "the" could come almost any noun, so the bi-gram beginning "the X" in English is not very interesting, but after "rancid" can come only very few words like "fat" "oil" "margarine" etc.

1) Think about how you would calculate the bi-grams for all words in a corpus of text: you need not write a program but begin to specify one as well as you can. It need not be very efficient, but it should be a good specification. make an numerical assumptions you need (e.g. there are n words in the corpus) and assume any sub-routine modules you need without specifying them in detail.

If we go from bi-grams to tri-grams, we ask what words, in order of likelihood, can follow any given PAIR of words. That gives a better predictive model and is the usual computational limit for text statistics.

2) Begin to think about specifying a tri-gram algorithm. Would it use the algorithm of (1) as a sub-algorithm or not. Or could you use, sub-modules of (1)?

Week 4 Tutorial
Part-of-speech tagging

1) As a full group discuss what would be a reasonable set of part-of-speech tags (not more than 10 and don't forget proper nouns) for English. Do this by recalling the sorts of syntactic theory you discussed last term, since the idea of tagging text automatically is to pre-prepare it for syntactic analysis.

2) Now in small groups, hand tag a few sentences of your own choosing (or composition) using the agreed tag set.

3) See if you can spot by eye, and before any computation, any recurring tag patterns. What do those pattern strings constitute? Can you see any recurring patterns (that would be found by any program designed to find recurring tag strings) that do not correspond to any conventional syntactic constituent (as defined by, say, a phrase structure grammar)?

Week 5 Tutorial
Text alignment and bilingual dictionaries

1) Write a sentence in English and one in a foreign language (Chinese or other foreign students might help here---maybe one in each group). Write the foreign sentence in Latin letters if possible and draw alignment lines between words across the two languages corresponding to translations. How are situations explained where this seems not to be possible?

2) Now imagine you have a large parallel corpus of English an another languages divided by sentences 1-n and paragraphs 1-m. Write pseudo code for how you would now find word correspondences, by paragraph, between the languages over the whole text and rank them in order of frequency.

3) Given those results as ordered frequency lists attached to English words (using real or imaginary foreign words in examples) write pseudocode for how you would now align the sentences within the paragraphs on the basis of those frequencies.

Week 6 Tutorial
Rule driven tagging algorithms

1) Turn to the list of "stop" words at the end of the Cherry paper. The task today is to try to construct he kinds of rules that make up a Cherry tagger from scratch. First decide on a simple list of tags, something smaller than Cherry's. A minimum might be:

CN common noun; PPN propernoun; ON other noun; PN pronoun; ADV adverb; DET determiner; AUX auxiliary verb; PART participle (like "giving" or ""gone"); ADJ adjective; PVB past verb ; VB other verb; UNK unknown; and let the stop list members be represented by their own names.

2) Each group should now quickly hand tag a few sentences---maybe the first three sentences of number 1 above.

3) One group should now be assigned to tag a control text from a book, say ten sentences, before starting on the main task.

4) The rest should now set about writing tag assignment rules that will assign tags to the non-stop-list tags in the three test sentences--they should be written in production rule form, can contain disjunctions (e.g. if tag X or Y), and should assume that all non-stop-list tags in a sentence to be analyzed are initially UNK. Write the rules in some form like this:

A1...A2 X/Y B1....B3------> Z

where the As are the left context of the site to be changed, Bs the right context,

X/Y is in this case the tag X OR the tag Y (but it needn't be a disjunction of course) and Z is the tag to replace the left tag by.

5) After 15 minutes of this, apply the rules in each group to the words of the control text (which should be written on a slide or board for everyone by the control group) and see who has the best percentage correct by their rules.

6) Discuss the adequacy of the simple rule format above--are there cases it couldn't resolve (i.e. which stayed us UNK or couldn't be got right by this method)?

Week 7 Tutorial
Looking at machine readable dictionaries

1) With this sheet you should get two pages from LDOCE (Longman's Dictionary --- the one that has been most subjected to computational examination). Each group should select twenty noun entries from the pages and five verbs (if you can find them).

2) Locate the main noun in the definition for each of the twenty. Are there any cases where the headword is not a good candidate for the genus. Is this true also of the verbs you chose?

3) Do the candidates for the verb genus seem at all like the sorts of verb primitive you discussed last term in Schank's and Wilks' systems?

4) Take two of the noun definition entries for which the main noun is not a good genus candidate and draw syntactic trees for them.

5) can you see any simple algorithms that would give you the "appropriate" noun as the genus in those cases? That is, can you see a syntactic rule that would provide the word you feel IS appropriate ads the genus?

Week 8 Tutorial
More on machine readable dictionaries

You should have some reasonable English dictionary in each group for this week's work.

1) Find a noun in a sentence in some piece of writing you have--try to make it a noun with at east three senses in the dictionary you have in the group, and try to make sure the sentence has at least ten words in it.

2) Take the dictionary entry for that word and, for EACH sense definition in it assign a page: now look up all the CONTENT words in that definition and write down all the CONTENT words in their definitions (i.e. not articles, prepositions etc.). Do this once more and make a set of all the content words you find.

3) Now do this for each sense of the word, one page for each sense. On each page (one for each sense) you should now have a large set of words. make sure you can tell which page corresponds to which sense.

4) Now go back to the original text sentence with the word in and see which page has most of the words in the sentence on it. Is that page the one corresponding to the right sense of the word in that sentence?

You have now simulated Lesk's original method for resolving sense ambiguity in text with a MRD. What do you think of it?