MACHINE_TYPE in the Makefile.
Unarchive T3 into a directory and edit the Makefile. Set the
SRILM variable to point to your srilm install directory, and ensure
that SRILM_LIB is set appropriately for your architecture. Also
set the include and library location of boost, if you've installed it in
a non-standard location. You can also change the compilers, but I can't
guarantee that the code will compile except with gcc/g++. I've compiled
successfully with g++ 3.4.6, 4.0.1 and 4.1.2 for 32-bit and 64-bit platforms
(intel, amd and PPC). To build a 64-bit version, you'll have to add the flag
-m64 to CXXFLAGS, CFLAGS and LDFLAGS in
the Makefile files in the source directory and in
svm_light and svm_struct.
Now run make, which will build four binaries. These perform training,
classification and grammar extraction (in two different ways). Note that there
are a number of python scripts in the scripts directory. To run
these you'll need a moderately modern version of Python (I use 2.5, but I
think 2.3 would be sufficient).
sample directory.
There's both a training set and a testing set of ten tree-pairs each. These
are stored in penn-treebank format with one tree per line and source and
target trees on alternate lines. The trees are predicted parses from
Bikel's implcodeentation of the Collins parser. To illustrate, the final pair
of lines in demo.training are:
(S (NP (DT this)) (VP (VBZ has) (VP (VBN involved) (NP (VBG thinning) (NNS
trees)) (, ,) (S (VP (VP (VBG opening) (PRT (RP up)) (NP (NP (NNS glades)) (PP
(IN around) (NP (JJ mature) (NNS scots) (NNS pines))))) (CC and) (VP (VBG
planting) (NP (NN juniper) (CC and) (NNS blackberries))))))) (. .))
(S (S (NP (DT this)) (VP (VBD involved) (S (VP (VBG thinning) (NP (NNS
trees)))))) (, ,) (NP (NN opening)) (VP (VBZ glades) (PP (IN around) (NP (NNPS
scots) (NNS pines) (CC and) (NN planting) (NN juniper) (CC and) (NNS
blackberries)))) (. .))
In addition the file demo.align contains the word alignments between
the string yield of the source and target trees in the training set. These are
pairs of word indices (source-target) separated by spaces, counting from zero.
The alignments were produced from a larger training set using the Berkeley
aligner. Equivalently, GIZA++, or any other aligner, could be used.
For the last training pair, shown above, we have:
0-0 2-1 3-2 4-3 5-4 6-5 8-6 9-7 11-8 12-9 13-10 14-11 15-12 16-13 17-14 18-15
To launch the demo run the script runme.sh which peforms grammar
extraction, feature instantiation, model training and decoding. This produces
the output files
rules.harvest,
rules.copy,
rules.delete,
rules.epsilon,
grammar,
features,
model,
model.grammar,
model.svm,
predict. The rules.* files are the output of the grammar
extraction process. The grammar, features files are the result of
feature extraction on each of the rules. The model* files encode
the model settings and parameters after training, and predict
contains the model's predictions on the testing set.
Comparing predict to
demo.training.source, we can see that the model has learnt
to peform various modifications. Given the tiny training set, it's
unsuprising that the output is none too stellar.
harvest2harvest2 tool, which extracts the maximally general (smallest)
pairs of tree fragments which obey the word-alignment. There are various
options on how to interpret the word alignment and the types and sizes of
rules to emit.
The full set of options are as follows:
$ harvest2 --help
Allowed options:
--help produce help message
-a [ --align ] arg the alignments file in Pharaoh format
-p [ --parse ] arg the parse tree file - each pair of
lines contain source, target pairs
-s [ --source ] arg source parse tree file name
-t [ --target ] arg target parse tree file name
--raw emit raw rules before generalisation
-i [ --identity ] specify only parts of the rule which
differ in source-target yield
-u [ --uniform ] constrain rules to matching consequent
categories
--tight arg (=0) the minimum number of non-null edges
required [0,4]
--null with tight=0, allow entirely null
alignment blocks
--delete allow for source subtrees to be deleted
-r [ --specification-limit ] arg (=0) the maximum extent of specification of
aligned phrases
-v [ --max-variables ] arg (=99) the maximum number of variable nodes in
a rule
-z [ --max-size ] arg (=99) the maximum size of a harvested rule
tree (lhs or rhs tree nodes)
-e [ --epsilon ] extract epsilon rules; i.e. rules which
modify only one of the source and
target trees.
--spmt extract rules in a similar manner to
Marcu et al.'s SPMT
--max-phrase arg (=7) maximum phrase size in SPMT heuristic
(just the lexical part)
Running this on the demo training data produces the following output:
$ ../harvest2 -a demo.align -p demo.training --delete --epsilon | head (S (NP #0) (VP #1) (. #2)) (S (NP #0) (VP #1) (. #2)) (TOP (S #0)) (TOP (S #0)) (NP (NNS #0)) (NP (NNS #0)) (NNS conservationists) (NNS conservationists) (VP (VBP #0) (VP #1)) (VP (VBP #0) (VP #1)) (VBP are) (VBP are) (VP (VBN #0) (PP #1)) (VP (VBN #0) (PP #1)) (VBN concerned) (VBN concerned) (PP (IN #0) (NP #1)) (PP (IN #0) (NP #1)) (IN about) (IN about)Each line consists of a source and target tree fragment with a tab separator, and the #x items denote a variable and its alignment (a.k.a. frontier non-terminal). The options
--delete --epsilon means that source fragments that are
unaligned become deletion rules, e.g.,
(IN in) NULL (NP (DT #0) (JJ #1) (NN #2)) NULL (DT the) NULL (JJ scottish) NULL (NN population) NULLinstead of the following
(PP (IN in) (NP #0)) (PP #0) (NP (NP (DT the) (JJ scottish) (NN population)) (PP #0)) (PP #0)When there are variables in the source that do not appear in the target, this licences the deletion of a subtree. The training algorithm has a flag which controls whether subtrees can simply be deleted by such a rule, or if they also require epsilon rules (those with NULL target fragments) to cover the deleted subtree.
The option -r N allows up to N adjacent rules to be combined into
a single rule. The number of rules is exponential in N, so small values are
advised (≤ 3).
Another useful option is --spmt which uses a
heuristic for rule extraction modelled after the technique in
Marcu et al., EMNLP 2006. This heuristic uses a sliding window over the source
string. For each string it finds the translation string in the target using
the alignment, and then extracts the smallest pair of tree fragments which
cover the two strings. This results in extremely lexicalised transuction
rules. This can be seen in the output,
$ ../harvest2 -a demo.align -p demo.training --delete --epsilon --spmt | head -3 (NP (NNS conservationists)) (NP (NNS conservationists)) (S (NP (NNS conservationists)) (VP (VBP are) (VP #0)) (. #1)) (S (NP (NNS conservationists)) (VP (VBP are) (VP #0)) (. #1)) (S (NP (NNS conservationists)) (VP (VBP are) (VP (VBN concerned) (PP #0))) (. #1)) (S (NP (NNS conservationists)) (VP (VBP are) (VP (VBN concerned) (PP #0))) (. #1))
Alternatively, for tree-to-string data the binary harvest_ghkm implements the Galley et al.'s 2004 grammar extraction heuristic. This supports
the following options, which are self-explanatory:
../harvest_ghkm Allowed options: --help produce help message -a [ --align ] arg the alignments file in Pharaoh format -s [ --source ] arg the source tree file, one per line in PTB format -t [ --target ] arg the target string file, one per line, tokenised -e [ --epsilon ] allow epsilon (deletion) rules -i [ --interior ] output interior alignment between terminals in each rule
In addition to the grammar extracted as above, in our experiments we've added
rules to clone or delete parts of the source tree. These rules are extracted
from the training and testing source trees using the python scripts in the
scripts directory. See sample/runme.sh for An
example of their usage.
rule_features.pyscripts/rule_features.py,
which is used as follows:
$ python scripts/rule_features.py --help
Options:
-h, --help show this help message and exit
--input=INPUT input file of grammar rules; can specify multiple inputs
--pcfg=PCFG expansion frequencies for CFG rewrites; used to implement
PCFG feature
--output=OUTPUT output file for the resultant grammar
--fout=FOUT output file for the feature set
--verbose display to stdout each rule and its features
--type --root --identity --unlex --rcount --wcount
--dvars --yields --length --epsmin
As input files, we use the output from grammar extraction and the copy
and delete rules. The output is the grammar consisting of rules and a sparse
vector of features for each rule and a feature file, which provides a
human-readable defintion of each feature.
Here's a snippet from the grammar for the demo data:
(S (NP #0) (VP #1) (. .)) (S (NP #0) (VP #1) (. .)) 5155 1 4450 1 5162 1 5484 1 5485 1 5486 1 7 1 5487 1 5488 1 5489 1 11 1 12 1 58 1 59 1 5475 1 127 1 4131 1 61 1Where the feature vector is represented as many index, value pairs. The features are defined as:
5155 ('root: lhs', 'S')
4450 ('root: rhs', 'S')
5162 ('root: lhs,rhs', ('S', 'S'))
5484 ('identity: lhs', '(S (NP) (VP) (. .))')
5485 ('identity: rhs', '(S (NP) (VP) (. .))')
5486 ('identity: lhs,rhs', (S (NP #0) (VP #1) (. .)), (S (NP #0) (VP #1) (. .)))
7 identity: lhs and rhs equal
5487 ('unlex: lhs', '(S (NP) (VP) (. .))')
5488 ('unlex: rhs', '(S (NP) (VP) (. .))')
5489 ('unlex: lhs,rhs', '(S (NP #0) (VP #1) (. .))', '(S (NP #0) (VP #1) (. .))')
11 wcount: target
12 wcount: source
58 ('yield: lhs,rhs', ('.',), ('.',))
59 ('yield: tok in lhs and rhs', '.')
5475 ('yield: preterms lhs,rhs', ('NP', 'VP', '.'), ('NP', 'VP', '.'))
127 ('yield: preterm in lhs and rhs', 'NP')
4131 ('yield: preterm in lhs and rhs', 'VP')
61 ('yield: preterm in lhs and rhs', '.')
which can be found by running with the --verbose flag or else
inspecting the output features file (--fout option).
The script supports a number of types of features using the flags --type
--root --identity --unlex --rcount --wcount --dvars --yields --length
--epsmin. Please see the python code and journal article for further
details on how these work.
The model also supports a PCFG probability features (actually the
log-probability of the target). This requires PCFG expansion codes to be
provided as input using the --pcfg flag. The file should contain
lines of the form:
I ADJP JJ JJR 13 17787 P VBZ yells 1 26436where
I denotes an internal node and P a
pre-terminal, and the following strings are the parent and child nodes.
The two numbers are the counts/frequency of the rewrite and the
parent category, respectively.
transducer_learntransducer_learn with no arguments gives the
command line help, pruned here to just the core options:
usage: transducer_learn [options] example_file model_file
Arguments:
example_file-> file with training data
model_file -> file to store learned decision rule in
Learning options:
-c float -> C: trade-off between training error
and margin (default 0.01)
-o [1,2] -> Slack rescaling method to use for loss.
1: slack rescaling
2: margin rescaling
(default 1)
-l [0..] -> Loss function to use.
0: zero/one loss
(default 0)
--g string grammar file name
--m string SRILM file name
--o int LM order
--v string source-conditioned features file name
The grammar must be supplied, while the LM file (and order) is optional. You
can also supply a file specifying source-conditioned features. This is a
fairly complicated format and can really slow down inference, so I wouldn't
suggest using unless absolutely necessary. The -l options
controls choice the loss function, which must be one of:
1: unordered token errors
2: unordered token errors with length penalty
3: unordered token precision
4: (3) x brevity penalty
5: (3) x two-sided brevity penalty
6: unordered ngram errors (uniformly weighted)
7: unordered ngram precision (smoothed geometric mean)
8: (7) x brevity penalty (i.e., BLEU)
9: unordered CFG production errors
10: (9) with length penalty
11: unordered CFG production precision
12: clipped unordered token errors
13: clipped unordered token precision
14: clipped unordered token recall
15: clipped unordered token F1
16: unordered edit distance
17: asymmetric hamming distance, sim. to (2)
The last (-l 17) is a pretty good default for most tasks.
Futher options control the style of tree-transduction. For instance, the model can perform tree-to-tree or tree-to-string transduction. Additionally, it can perform exact or loose rule matching. In loose matching the rule internals are ignored; instead the frontier nodes are matched to any complete sequence of descendent nodes in the target tree.
--j [0,1] index rule sources with a 0) forest [def] 1) trie
--k [0,1] index rule targets with a 0) forest [def] 1) trie
--q [0..2] select gold derivation based on mapping source to
0) target tree, 1) target string, 2) a surrogate tree;
default = 0
The model requires a feature representation for the gold-standard of each
training instance. Because we model derivations, not the resultant tree or
string, we need a gold-standard derivation. But this is a latent variable,
so consequently, we adopt a few different heuristic to find a derivation.
These are controlled using --q, above, and the following:
--s [1..5] heuristic for selecting gold derivation:
1: maximum number of rules (default)
2: minimum number of rules
3: at random
4: maximum score (may want to use --w)
5: minimum score (ditto)
--f [0,1] recompute gold (1) every iteration
only useful with --s [3,4,5]; default no (0)
The surrogate option --q 2 is experimental; this allows
derivations which do not exactly yield the reference tree/string, but instead
chooses a surrogate with low loss. This search is approximate, and is
prone to search errors for some loss functions.
transducer_classifytransducer_classify. This
takes the testing instances (tree pairs, or just input trees), the model
and writes the predictions to an output file:
usage: svm_struct_classify [options] example_file model_file output_fileMost of the same options as training can be supplied, which override the value used in training. (Many of these will have no effect on decoding.)