Parsing Engine

danbikel.parser
Interface Training

All Known Implementing Classes:
AbstractTraining

public interface Training

Specifies methods for language-specific preprocessing of training parse trees. The primary method to be invoked from an implementation of this interface is preProcess(Sexp). Additionally, as implementations are likely to contain or have access to appropriate preprocessing data and methods, this interface also specifies a crucial post-processing method that, after decoding, "undoes" what was done during preprocessing. This post-processing method is postProcess(Sexp), and is invoked by default by the Decoder.

A language package must include an implementation of this interface.

See Also:
preProcess(Sexp), postProcess(Sexp), Decoder

Method Summary
 Sexp addBaseNPs(Sexp tree)
          Adds and/or relabels base NPs in the specified tree.
 Sexp addGapInformation(Sexp tree)
          Augments nonterminals to include gap information for WHNP's that have moved and leave traces (gaps), as in the GPSG framework.
 Set argNonterminals()
          Returns a static set of possible argument nonterminals.
 Symbol defaultArgAugmentation()
          The symbol that is used to mark argument (required) nonterminals by identifyArguments(Sexp).
 Symbol gapAugmentation()
          The symbol that will be used to identify nonterminals whose subtrees contain a gap (a trace).
 Symbol getCanonicalArg(Symbol argLabel)
          Returns the canonical version of the specified argument nonterminal, crucially including its argument augmentation.
 Set getPrunedPreterms()
          Returns the set of pruned preterminals (Sexp objects).
 Set getPrunedPunctuation()
          Returns the set of punctuation preterminals (Sexp objects) that were "raised away" because they were at either the beginning or the end of a sentence.
 boolean hasGap(Symbol label)
          Returns true if and only if label has a gap augmentation as added by addGapInformation(Sexp).
 Sexp identifyArguments(Sexp tree)
          Augments labels of nonterminals that are arguments.
 boolean isArgument(Symbol label)
          Returns true if and only if label has an argument augmentation as added by identifyArguments(Sexp).
 boolean isArgumentFast(Symbol label)
          Returns true if and only if the specified nonterminal label has an argument augmentation preceded by the canonical augmentation delimiter.
 boolean isValidTree(Sexp tree)
          Returns whether the specified tree is valid.
 void postProcess(Sexp tree)
          Post-processes a parse tree after decoding, essentially undoing the steps performed in preprocessing.
 Sexp preProcess(Sexp tree)
          The method to call before counting events in a training parse tree.
 SexpList preProcessTest(SexpList sentence, SexpList originalWords, SexpList tags)
          Preprocesses the specified test sentence and its coordinated list of tags.
 Sexp prune(Sexp tree)
          Prunes away subtrees that have a root that is an element of nodesToPrune.
 Sexp raisePunctuation(Sexp tree)
          Raises punctuation to the highest possible point in a parse tree, resulting in a tree where no punctuation is the first or last child of a non-leaf node.
 Sexp relabelSubjectlessSentences(Sexp tree)
          Relabels sentences that have no subjects with the nonterminal label returned by Treebank.subjectlessSentenceLabel().
 Symbol removeArgAugmentation(Symbol label)
          Removes any argument augmentations from the specified nonterminal label.
 Sexp removeGapAugmentation(Sexp sexp)
          If the specified S-expression is a list, this method modifies the list to contain only symbols without gap augmentations; otherwise, this method removes the gap augmentation (if one exists) in the specified symbol and returns that new symbol.
 Sexp removeNullElements(Sexp tree)
          Removes all null elements, that is, those nodes of tree for which Treebank.isNullElementPreterminal(Sexp) returns true.
 boolean removeWord(Symbol word, Symbol tag, int idx, SexpList sentence, SexpList tags, SexpList originalTags, Set prunedPretermsPosSet, Map prunedPretermsPosMap)
          Invoked by the decoder as the first step in preprocessing (prior to the invocation of preProcessTest(danbikel.lisp.SexpList, danbikel.lisp.SexpList, danbikel.lisp.SexpList)).
 Sexp repairBaseNPs(Sexp tree)
          Changes the specified tree so that when the last child of an NPB is an S, the S gets raised to be a sibling immediately following the NPB.
 void setUpFastArgMap(CountsTable nonterminals)
          Instructs the implementation to set up a static map for quickly mapping argument nonterminals to their non-argument variants (that is, for quickly stripping away their argument augmentations).
 String skip(Sexp tree)
          Returns whether the specified tree is to be skipped when training.
 Symbol startSym()
          Returns the symbol to indicate hidden nonterminals that precede the first in a sequence of modifier nonterminals.
 Word startWord()
          Returns the Word object that represents the hidden "head word" of the start symbol.
 Symbol stopSym()
          Returns the symbol to indicate a hidden nonterminal that follows the last in a sequence of modifier nonterminals.
 Word stopWord()
          Returns the Word object that represents the hidden "head word" of the stop symbol.
 Sexp stripAugmentations(Sexp tree)
          Strips any augmentations off all of the nonterminal labels of tree.
 Symbol topSym()
          Returns the symbol to indicate the hidden root of all parse trees.
 Word topWord()
          Returns the Word object that represents the hidden "head word" of the hidden root of all parse trees.
 Symbol traceTag()
          The symbol that gets reassigned as the part of speech for null preterminals that represent traces that have undergone WH-movement, as relabeled by the default implementation of addGapInformation(Sexp).
 

Method Detail

setUpFastArgMap

public void setUpFastArgMap(CountsTable nonterminals)
Instructs the implementation to set up a static map for quickly mapping argument nonterminals to their non-argument variants (that is, for quickly stripping away their argument augmentations).

N.B.: This method is necessarily thread-safe, as it is expected to be invoked by every Decoder as it starts up, and since there can be multiple Decoder instances within a given VM. However, note that it is inappropriate to invoke this method if the set of nonterminals in the specified counts table is incomplete (see the documentation for the SubcatBag class for an instance where this will be the case).
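
The idea of a shared, precomputed map can be sketched as follows. This is only an illustration under stated assumptions, not the danbikel.parser implementation: labels are modeled as plain strings, the argument augmentation is assumed to be the suffix "-A", the class name FastArgMapSketch is hypothetical, and building the map only once is a design choice of this sketch.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not part of danbikel.parser.
final class FastArgMapSketch {
    // Assumption: arguments are marked with the suffix "-A".
    private static final String ARG_SUFFIX = "-A";

    // Published once; volatile so that every decoder thread sees the finished map.
    private static volatile Map<String, String> fastArgMap;

    /** Builds the map once; synchronized so that concurrent start-ups are safe. */
    static synchronized void setUpFastArgMap(Collection<String> nonterminals) {
        if (fastArgMap != null) {
            return; // already built by another caller
        }
        Map<String, String> map = new HashMap<>();
        for (String nt : nonterminals) {
            if (nt.endsWith(ARG_SUFFIX)) {
                map.put(nt, nt.substring(0, nt.length() - ARG_SUFFIX.length()));
            }
        }
        fastArgMap = Collections.unmodifiableMap(map);
    }

    /** A map lookup only; the label itself is never parsed. */
    static boolean isArgumentFast(String label) {
        Map<String, String> map = fastArgMap;
        return map != null && map.containsKey(label);
    }

    /** Returns the non-argument variant, or the label itself if it is not in the map. */
    static String removeArgAugmentation(String label) {
        Map<String, String> map = fastArgMap;
        return map == null ? label : map.getOrDefault(label, label);
    }

    public static void main(String[] args) {
        setUpFastArgMap(Arrays.asList("NP", "NP-A", "S-A", "VP"));
        System.out.println(isArgumentFast("NP-A"));
        System.out.println(removeArgAugmentation("S-A"));
    }
}
```

The sketch also shows why an incomplete set of nonterminals is a problem: a label missing from the map is silently treated as a non-argument.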

Parameters:
nonterminals - a counts table whose keys form a complete set of all possible nonterminal labels, as is obtained from DecoderServerRemote.nonterminals() (the counts to which the nonterminals are mapped are not used by this method)

preProcess

public Sexp preProcess(Sexp tree)
The method to call before counting events in a training parse tree.

Parameters:
tree - the parse tree to pre-process
Returns:
tree having been pre-processed

removeWord

public boolean removeWord(Symbol word,
                          Symbol tag,
                          int idx,
                          SexpList sentence,
                          SexpList tags,
                          SexpList originalTags,
                          Set prunedPretermsPosSet,
                          Map prunedPretermsPosMap)
Invoked by the decoder as the first step in preprocessing (prior to the invocation of preProcessTest(danbikel.lisp.SexpList, danbikel.lisp.SexpList, danbikel.lisp.SexpList)). Returns whether the specified word should be removed from the sentence before parsing.

Parameters:
word - a word in the sentence about to be parsed
tag - the supplied part-of-speech tag of the specified word, or null if tags were not supplied
idx - the index of the specified word in the specified sentence
sentence - a list of Symbol objects that represent the words of the sentence to be parsed
tags - coordinated list of supplied part-of-speech tag lists for each of the words in the specified sentence, or null if no tags were supplied
originalTags - the cached copy of the specified tags list, used when Settings.restorePrunedWords is true
prunedPretermsPosSet - the set of part-of-speech tags that were pruned during training
prunedPretermsPosMap - a map of words pruned during training to their part-of-speech tags when they were pruned
Returns:
whether the specified word should be removed from the sentence before parsing

preProcessTest

public SexpList preProcessTest(SexpList sentence,
                               SexpList originalWords,
                               SexpList tags)
Preprocesses the specified test sentence and its coordinated list of tags.

Parameters:
sentence - the list of words, where a known word is a symbol and an unknown word is represented by a 3-element list (see DecoderServerRemote.convertUnknownWords(danbikel.lisp.SexpList))
originalWords - the list of unprocessed words (all symbols)
tags - the list of tag lists, where the list at index i is the list of possible parts of speech for the word at that index
Returns:
a two-element list, containing two lists, the first of which is a processed version of sentence and the second of which is a processed version of tags; if tags is null, then the returned list will contain only one element (since SexpList objects are not designed to handle null elements)
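
The shape of the returned list can be sketched as below. This is a hedged illustration only: java.util.List stands in for SexpList, words and tags are plain strings, unknown-word triples are omitted, and the names PreProcessTestResultSketch, pack and tagsOf are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the return contract; not part of danbikel.parser.
final class PreProcessTestResultSketch {
    /** Packs the processed sentence and, only when tags were supplied, the processed tags. */
    static List<List<?>> pack(List<String> sentence, List<List<String>> tags) {
        List<List<?>> result = new ArrayList<>();
        result.add(sentence);
        if (tags != null) {
            result.add(tags); // the second element is present only in this case
        }
        return result;
    }

    /** Caller side: check the size instead of expecting a null second element. */
    static List<?> tagsOf(List<List<?>> result) {
        return result.size() > 1 ? result.get(1) : null;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "dog");
        List<List<String>> tags = Arrays.asList(Arrays.asList("DT"), Arrays.asList("NN", "VB"));
        System.out.println(pack(words, tags).size());
        System.out.println(pack(words, null).size());
    }
}
```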

isValidTree

public boolean isValidTree(Sexp tree)
Returns whether the specified tree is valid. The particular notion of validity can be language package-dependent.

Parameters:
tree - the parse tree to check for validity

skip

public String skip(Sexp tree)
Returns a non-null string if the specified tree is to be skipped when training, or null otherwise.

Parameters:
tree - an annotated training tree
Returns:
a string if the specified tree is to be skipped when training, null otherwise
See Also:
Trainer.train(SexpTokenizer,boolean,boolean)

getPrunedPreterms

public Set getPrunedPreterms()
Returns the set of pruned preterminals (Sexp objects).

See Also:
prune(Sexp)

prune

public Sexp prune(Sexp tree)
Prunes away subtrees that have a root that is an element of nodesToPrune.

Side effect: An internal set of pruned preterminals will be updated. This set may be accessed via getPrunedPreterms().

Bugs: Cannot prune away entire tree if the root label of the specified tree is in nodesToPrune.

Parameters:
tree - the parse tree to prune
Returns:
tree having been pruned

identifyArguments

public Sexp identifyArguments(Sexp tree)
Augments labels of nonterminals that are arguments. This method is optional, and may be overridden to simply return tree untouched if argument identification is not desired for a particular language package.

Parameters:
tree - the parse tree to modify
Returns:
a reference to the modified tree object
See Also:
Treebank.canonicalAugDelimiter()

defaultArgAugmentation

public Symbol defaultArgAugmentation()
The symbol that is used to mark argument (required) nonterminals by identifyArguments(Sexp).


isArgument

public boolean isArgument(Symbol label)
Returns true if and only if label has an argument augmentation as added by identifyArguments(Sexp).


isArgumentFast

public boolean isArgumentFast(Symbol label)
Returns true if and only if the specified nonterminal label has an argument augmentation preceded by the canonical augmentation delimiter. Unlike isArgument(Symbol), this method is thread-safe. Also, it is more efficient than isArgument(Symbol), as it does not actually parse the specified nonterminal label.


getCanonicalArg

public Symbol getCanonicalArg(Symbol argLabel)
Returns the canonical version of the specified argument nonterminal, crucially including its argument augmentation. For example, in the English Penn Treebank, the canonical version of NP-CLR-A would typically be NP-A, where A is the argument augmentation.
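
A minimal sketch of the NP-CLR-A example, assuming labels are plain strings, "-" is the augmentation delimiter and "A" is the argument augmentation. The class and method names are hypothetical, and an actual language package may canonicalize differently.

```java
// Hypothetical sketch; not the danbikel.parser implementation.
final class CanonicalArgSketch {
    // Assumptions taken from the NP-CLR-A example.
    static final String DELIM = "-";
    static final String ARG_AUG = "A";

    /** Keeps the base category plus the argument augmentation and drops other augmentations. */
    static String canonicalArg(String argLabel) {
        String[] parts = argLabel.split(DELIM);
        boolean hasArgAug = false;
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].equals(ARG_AUG)) {
                hasArgAug = true;
            }
        }
        // Labels without an argument augmentation are returned unchanged.
        return hasArgAug ? parts[0] + DELIM + ARG_AUG : argLabel;
    }

    public static void main(String[] args) {
        System.out.println(canonicalArg("NP-CLR-A")); // prints NP-A
    }
}
```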

Parameters:
argLabel - the argument nonterminal to be canonicalized
Returns:
the canonical version of the specified argument nonterminal

addGapInformation

public Sexp addGapInformation(Sexp tree)
Augments nonterminals to include gap information for WHNP's that have moved and leave traces (gaps), as in the GPSG framework. This method is optional, and may simply return tree untouched if gap information is not desired for a particular language package.

Parameters:
tree - the parse tree to which to add gapping
Returns:
the same tree that was passed in, with certain nodes modified to include gap information

hasGap

public boolean hasGap(Symbol label)
Returns true if and only if label has a gap augmentation as added by addGapInformation(Sexp).


gapAugmentation

public Symbol gapAugmentation()
The symbol that will be used to identify nonterminals whose subtrees contain a gap (a trace). This method is used by stripAugmentations(Sexp), so that gap augmentations that are added by addGapInformation(Sexp) do not get removed.


traceTag

public Symbol traceTag()
The symbol that gets reassigned as the part of speech for null preterminals that represent traces that have undergone WH-movement, as relabeled by the default implementation of addGapInformation(Sexp).


relabelSubjectlessSentences

public Sexp relabelSubjectlessSentences(Sexp tree)
Relabels sentences that have no subjects with the nonterminal label returned by Treebank.subjectlessSentenceLabel(). This method is optional, and may be overridden to simply return tree untouched if subjectless sentence relabeling is not desired for a particular language package.

Parameters:
tree - the parse tree in which to relabel subjectless sentences
Returns:
the same tree that was passed in, with subjectless sentence nodes relabeled
See Also:
Treebank.isSentence(Symbol), Treebank.subjectAugmentation(), Treebank.isNullElementPreterminal(Sexp), Treebank.subjectlessSentenceLabel()

stripAugmentations

public Sexp stripAugmentations(Sexp tree)
Strips any augmentations off all of the nonterminal labels of tree. The set of nonterminal labels does not include preterminals, which are typically parts of speech. If a particular language's Treebank augments preterminals, this method should be overridden in a language package's subclass. The only augmentations that will not be removed are those that are added by identifyArguments(Sexp), so as to preserve the transformations of that method. This method should only be called subsequent to the invocations of methods that require augmentations, such as relabelSubjectlessSentences(Sexp).

Parameters:
tree - the tree whose nonterminals are all to be stripped of every augmentation except those added by identifyArguments
Returns:
a reference to tree

raisePunctuation

public Sexp raisePunctuation(Sexp tree)
Raises punctuation to the highest possible point in a parse tree, resulting in a tree where no punctuation is the first or last child of a non-leaf node. One consequence is that all punctuation is removed from the beginning and end of the sentence. The punctuation affected is defined by the implementation of the method Treebank.isPuncToRaise(Sexp).

Side effect: All preterminals removed from the beginning and end of the sentence are stored in an internal set, which can be accessed via getPrunedPunctuation().

Example of punctuation raising:

 (S (NP
      (NPB Pierre Vinken)
      (, ,)
      (ADJP 61 years old)
      (, ,))
    (VP joined (NP (NPB the board))) (. .))
 
becomes
 (S (NP
      (NPB Pierre Vinken)
      (, ,)
      (ADJP 61 years old))
    (, ,)
    (VP joined (NP (NPB the board))))
 
This method appropriately deals with the case of having multiple punctuation elements to be raised on the left or right side of the list of children for a nonterminal. For example, in English, if this method were passed the tree
 (S
   (NP (DT The) (NN dog) (, ,) (NNP Barky) (. .) (. .) (. .))
   (VP (VB was) (ADJP (JJ stupid)))
   (. .) (. .) (. .))
 
the result would be
 (S
   (NP (DT The) (NN dog) (, ,) (NNP Barky))
   (. .) (. .) (. .)
   (VP (VB was) (ADJP (JJ stupid))))
 

Bugs: In the pathological case where all the children of a node are punctuation to raise, this method simply emits a warning to System.err and does not attempt to raise them (which would cause an interior node to become a leaf).
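
The transformation illustrated above can be sketched as a post-order traversal. This is an illustration under stated assumptions rather than the actual implementation: Node is a hypothetical stand-in for Sexp, the punctuation tags in PUNC are an assumed stand-in for Treebank.isPuncToRaise(Sexp), removed preterminals are collected in a caller-supplied list, and the pathological all-punctuation case is skipped silently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the danbikel.parser implementation.
final class RaisePunctuationSketch {
    /** Minimal tree node; a node with no children is a word. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
        @Override public String toString() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    // Assumed set of punctuation tags to raise.
    static final Set<String> PUNC = new HashSet<>(Arrays.asList(",", ".", ":"));

    static boolean isPreterminal(Node n) {
        return n.children.size() == 1 && n.children.get(0).children.isEmpty();
    }

    static boolean isPunc(Node n) {
        return isPreterminal(n) && PUNC.contains(n.label);
    }

    static boolean allPunc(Node n) {
        for (Node c : n.children) if (!isPunc(c)) return false;
        return true;
    }

    /** Moves punctuation at the edges of each child of node up to node itself. */
    static void raiseWithin(Node node) {
        for (int i = 0; i < node.children.size(); i++) {
            Node child = node.children.get(i);
            if (child.children.isEmpty() || isPreterminal(child)) continue;
            raiseWithin(child);           // post-order: deeper levels first
            if (allPunc(child)) continue; // pathological case: leave untouched
            List<Node> kids = child.children;
            int lead = 0;
            while (isPunc(kids.get(0))) {
                node.children.add(i + lead, kids.remove(0)); // insert before child
                lead++;
            }
            i += lead; // child has shifted right by the number of raised elements
            while (isPunc(kids.get(kids.size() - 1))) {
                node.children.add(i + 1, kids.remove(kids.size() - 1)); // insert after child
            }
        }
    }

    /** Raises punctuation, then removes what ends up at either edge of the root. */
    static Node raisePunctuation(Node root, List<Node> pruned) {
        raiseWithin(root);
        if (!allPunc(root)) {
            List<Node> kids = root.children;
            while (isPunc(kids.get(0))) pruned.add(kids.remove(0));
            while (isPunc(kids.get(kids.size() - 1))) pruned.add(kids.remove(kids.size() - 1));
        }
        return root;
    }
}
```

Because each child is processed before its own edges are inspected, punctuation raised from a deeper level can itself be raised again, which is how an element ends up at "the highest possible point".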

Parameters:
tree - the parse tree to destructively modify by raising punctuation
Returns:
a reference to the modified tree object

getPrunedPunctuation

public Set getPrunedPunctuation()
Returns the set of punctuation preterminals (Sexp objects) that were "raised away" because they were at either the beginning or the end of a sentence.

See Also:
raisePunctuation(Sexp)

addBaseNPs

public Sexp addBaseNPs(Sexp tree)
Adds and/or relabels base NPs in the specified tree.

Parameters:
tree - the parse tree in which to add and/or relabel base NPs
Returns:
a reference to the modified version of tree
See Also:
Treebank.isNP(Symbol), Treebank.baseNPLabel(), Treebank.NPLabel()

repairBaseNPs

public Sexp repairBaseNPs(Sexp tree)
Changes the specified tree so that when the last child of an NPB is an S, the S gets raised to be a sibling immediately following the NPB. That is, situations such as
 (NP
   (NPB
     (DT an)
     (NN effort)
     (S ...)))
 
get transformed to
 (NP
   (NPB
     (DT an)
     (NN effort))
   (S ...))
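
A minimal sketch of this repair, assuming a hypothetical Node class in place of Sexp and the literal labels NPB and S from the example. It leaves an NPB alone when the S is its only child, which is a choice made for this sketch and not something the interface specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the danbikel.parser implementation.
final class RepairBaseNPsSketch {
    /** Minimal tree node; a node with no children is a word. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
        @Override public String toString() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    /** Raises a trailing S out of each NPB so that it immediately follows the NPB. */
    static Node repairBaseNPs(Node tree) {
        for (int i = 0; i < tree.children.size(); i++) {
            Node child = tree.children.get(i);
            repairBaseNPs(child); // repair deeper levels first
            if (child.label.equals("NPB") && child.children.size() > 1) {
                Node last = child.children.get(child.children.size() - 1);
                // Only a nonterminal S qualifies, not a word that happens to be "S".
                if (last.label.equals("S") && !last.children.isEmpty()) {
                    child.children.remove(child.children.size() - 1);
                    tree.children.add(i + 1, last);
                    i++; // skip the S just inserted; it was already repaired
                }
            }
        }
        return tree;
    }
}
```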
 


removeNullElements

public Sexp removeNullElements(Sexp tree)
Removes all null elements, that is, those nodes of tree for which Treebank.isNullElementPreterminal(Sexp) returns true. Additionally, if the removal of a null element leaves an interior node that is childless, then this interior node is removed as well. For example, if we have the following sentence in English
 (S (NP-SBJ (-NONE- *T*)) (VP ...)) 
it will be transformed to be
 (S (VP ...)) 
N.B.: This method should only be invoked after preprocessing with relabelSubjectlessSentences(Sexp) and addGapInformation(Sexp), as these methods (and possibly others, if overridden) rely on the presence of null elements.
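
The removal, including the cascade that deletes interior nodes left childless, might look like the following sketch. It assumes a hypothetical Node class in place of Sexp and treats the Penn Treebank tag -NONE- as the only null-element preterminal; a real language package decides this through Treebank.isNullElementPreterminal(Sexp).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; not the danbikel.parser implementation.
final class RemoveNullElementsSketch {
    /** Minimal tree node; a node with no children is a word. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
        @Override public String toString() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    // Assumption: null elements carry the tag -NONE-, as in the example above.
    static final String NULL_TAG = "-NONE-";

    static boolean isNullElementPreterminal(Node n) {
        return n.label.equals(NULL_TAG)
            && n.children.size() == 1
            && n.children.get(0).children.isEmpty();
    }

    /** Removes null-element preterminals and any interior node left without children. */
    static Node removeNullElements(Node tree) {
        Iterator<Node> it = tree.children.iterator();
        while (it.hasNext()) {
            Node child = it.next();
            if (child.children.isEmpty()) {
                continue; // a word: nothing to do
            }
            if (isNullElementPreterminal(child)) {
                it.remove();
                continue;
            }
            removeNullElements(child);
            if (child.children.isEmpty()) {
                it.remove(); // interior node became childless
            }
        }
        return tree;
    }
}
```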

See Also:
Treebank.isNullElementPreterminal(Sexp)

startSym

public Symbol startSym()
Returns the symbol to indicate hidden nonterminals that precede the first in a sequence of modifier nonterminals.

See Also:
Trainer

startWord

public Word startWord()
Returns the Word object that represents the hidden "head word" of the start symbol.

See Also:
startSym(), Trainer

stopSym

public Symbol stopSym()
Returns the symbol to indicate a hidden nonterminal that follows the last in a sequence of modifier nonterminals.

This symbol may also be used as a special value that is guaranteed not to conflict with any nonterminal in a given language's treebank.

See Also:
Trainer

stopWord

public Word stopWord()
Returns the Word object that represents the hidden "head word" of the stop symbol.

See Also:
stopSym(), Trainer

topSym

public Symbol topSym()
Returns the symbol to indicate the hidden root of all parse trees.

See Also:
Trainer

topWord

public Word topWord()
Returns the Word object that represents the hidden "head word" of the hidden root of all parse trees.


argNonterminals

public Set argNonterminals()
Returns a static set of possible argument nonterminals.

Returns:
a static set of possible argument nonterminals

removeArgAugmentation

public Symbol removeArgAugmentation(Symbol label)
Removes any argument augmentations from the specified nonterminal label.

Parameters:
label - the label whose argument augmentations are to be removed
Returns:
a new label with no argument augmentations

removeGapAugmentation

public Sexp removeGapAugmentation(Sexp sexp)
If the specified S-expression is a list, this method modifies the list to contain only symbols without gap augmentations; otherwise, this method removes the gap augmentation (if one exists) in the specified symbol and returns that new symbol.
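
The two behaviors (in-place modification of a list, a new value for a single symbol) can be sketched as follows. The sketch assumes that symbols are plain strings, that a java.util.List stands in for SexpList, and that the gap augmentation is the suffix "-g"; all of these, along with the class name, are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the danbikel.parser implementation.
final class GapAugmentationSketch {
    // Assumption: a gap is marked with the suffix "-g".
    static final String GAP_SUFFIX = "-g";

    /** Lists are modified in place and returned; a single label yields a new string. */
    static Object removeGapAugmentation(Object sexp) {
        if (sexp instanceof List) {
            @SuppressWarnings("unchecked")
            List<Object> list = (List<Object>) sexp;
            list.replaceAll(GapAugmentationSketch::removeGapAugmentation);
            return list;
        }
        String label = (String) sexp;
        return label.endsWith(GAP_SUFFIX)
            ? label.substring(0, label.length() - GAP_SUFFIX.length())
            : label;
    }

    public static void main(String[] args) {
        List<Object> labels = new ArrayList<>(Arrays.asList("S-g", "NP"));
        System.out.println(removeGapAugmentation(labels));
        System.out.println(removeGapAugmentation("SBAR-g"));
    }
}
```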

Parameters:
sexp - a symbol or list of symbols from which to remove any gap augmentations
Returns:
a symbol or list of symbols with no gap augmentations

postProcess

public void postProcess(Sexp tree)
Post-processes a parse tree after decoding, essentially undoing the steps performed in preprocessing.

Parameters:
tree - the tree to be post-processed

Author: Dan Bikel.