Parsing Engine

danbikel.parser
Class Trainer

java.lang.Object
  extended by danbikel.parser.Trainer
All Implemented Interfaces:
Serializable

public class Trainer
extends Object
implements Serializable

Derives all counts necessary to compute the probabilities for this parser, including the top-level counts and all derived counts. The two additional facilities of this class are (1) the loading and storing of a text file containing top-level event counts and (2) the loading and storing of a Java object file containing all derived event counts.

All top-level events or mappings are recorded as S-expressions with the format

(name event count)
for events and
(name key value)
for mappings.
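
As a rough illustration of the three-element record shape described above, the following sketch splits one record into its top-level elements. It does not use the parser's own S-expression classes; the class name TopLevelRecordSketch, the method topLevelElements, and the sample record are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class TopLevelRecordSketch {
    /** Splits a list such as "(a (b c) 3.0)" into its top-level elements. */
    public static List<String> topLevelElements(String sexp) {
        String s = sexp.trim();
        if (!s.startsWith("(") || !s.endsWith(")")) {
            throw new IllegalArgumentException("not a list: " + sexp);
        }
        List<String> elements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        // Walk the characters between the outermost parentheses.
        for (int i = 1; i < s.length() - 1; i++) {
            char c = s.charAt(i);
            if (c == '(') depth++;
            if (c == ')') depth--;
            if (Character.isWhitespace(c) && depth == 0) {
                // Whitespace outside any nested list ends the current element.
                if (current.length() > 0) {
                    elements.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) elements.add(current.toString());
        return elements;
    }

    public static void main(String[] args) {
        // Hypothetical event record; the inner event syntax is invented for illustration.
        List<String> parts = topLevelElements("(head (S VP) 12.0)");
        String name = parts.get(0);
        String event = parts.get(1);
        double count = Double.parseDouble(parts.get(2));
        System.out.println(name + " | " + event + " | " + count);
    }
}
```

A mapping record of the form (name key value) splits the same way; only the interpretation of the third element changes.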

All derived counts are stored by the internal data structures of several Model objects, which are in turn all contained within a single ModelCollection object. This class provides methods to load and store a Java object file containing this ModelCollection, as well as some initial objects containing information about the ModelCollection object (see the -scan flag in the usage of the main method of this class).

The various model objects capture the generation submodels of the different output elements of the parser. The smoothing levels of each submodel are represented by a ProbabilityStructure object, which is passed as a parameter to the corresponding Model object at construction time. This architecture provides a kind of "plug-n-play" smoothing scheme for the various submodels of this parser.
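
The idea of handing a smoothing structure to a model at construction time can be sketched with stand-in types. Nothing below is the real Model or ProbabilityStructure API: BackoffScheme, DropTrailingTokens and SubModel are hypothetical names, and "drop one trailing token per level" is an invented back-off rule chosen only to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PluggableSmoothingSketch {
    /** Hypothetical stand-in for an object that defines back-off levels. */
    public interface BackoffScheme {
        int numLevels();
        /** Returns the conditioning context used at the given back-off level. */
        String history(String fullHistory, int level);
    }

    /** Invented scheme: each level drops one more trailing token of the history. */
    public static class DropTrailingTokens implements BackoffScheme {
        private final int levels;
        public DropTrailingTokens(int levels) { this.levels = levels; }
        @Override public int numLevels() { return levels; }
        @Override public String history(String fullHistory, int level) {
            String[] tokens = fullHistory.split(" ");
            int keep = Math.max(0, tokens.length - level);
            return String.join(" ", Arrays.copyOfRange(tokens, 0, keep));
        }
    }

    /** Hypothetical submodel that receives its back-off scheme when constructed. */
    public static class SubModel {
        private final BackoffScheme scheme;
        public SubModel(BackoffScheme scheme) { this.scheme = scheme; }
        /** Lists the history used at every back-off level, most specific first. */
        public List<String> histories(String fullHistory) {
            List<String> result = new ArrayList<>();
            for (int level = 0; level < scheme.numLevels(); level++) {
                result.add(scheme.history(fullHistory, level));
            }
            return result;
        }
    }

    public static void main(String[] args) {
        // Swapping the scheme changes the smoothing without touching SubModel.
        SubModel model = new SubModel(new DropTrailingTokens(3));
        System.out.println(model.histories("S VP saw"));
    }
}
```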

See Also:
main(String[]), Model, ModelCollection, ProbabilityStructure, Serialized Form

Nested Class Summary
static class Trainer.EventEntry
          Class to represent a MapToPrimitive.Entry object for use by the getEventIterator(danbikel.lisp.SexpTokenizer, danbikel.lisp.Symbol) method.
 
Field Summary
protected  Filter allPass
          An instance of AllPass.
protected  Map canonicalSubcatMap
          A reflexive map for storing canonical versions of Subcat objects.
protected  double countThreshold
          The value of the Settings.countThreshold setting.
protected  double derivedCountThreshold
          The value of the Settings.derivedCountThreshold setting.
protected  boolean downcaseWords
          The value of the Settings.downcaseWords setting.
protected  Subcat emptySubcat
          The value returned by Subcats.get().
protected  Symbol gapAugmentation
          The value of Training.gapAugmentation().
protected  CountsTable gapEvents
          A table for storing counts of gap-generation events.
static Symbol gapEventSym
          The label for gap events.
protected  Model gapModel
          The gap-generation model.
protected  CountsTable headEvents
          A table for storing counts of head-generation events.
static Symbol headEventSym
          The label for head nonterminal generation events.
protected  Model headModel
          The head-generation model.
protected  Map headToParentMap
          A map of head child nonterminals to their observed parent nonterminals.
protected  boolean keepAllWords
          The value of the Settings.keepAllWords setting.
protected  boolean keepLowFreqTags
          The value of the Settings.keepLowFreqTags setting.
protected  Map leftSubcatMap
          A map of events from the last back-off level of the left subcat–generation submodel to the set of possible left subcats.
protected  Model leftSubcatModel
          The model for generating subcats that fall on the left side of head children.
protected  Model lexPriorModel
          The model for marginal probabilities of lexical elements (for the estimation of the joint event that is a fully lexicalized nonterminal).
protected  ModelCollection modelCollection
          The set of Model objects and other resources that describe an entire parsing model.
static Symbol modEventSym
          The label for modifier nonterminal generation events.
protected  CountsTable modifierEvents
          A table for storing counts of modifier-generation events.
protected  Map modNonterminalMap
          A map of events from the last back-off level of the modifier nonterminal–generation submodel to the set of possible futures (typically, a future is a modifier label and its head word's part-of-speech tag).
protected  Model modNonterminalModel
          The modifying nonterminal–generation model.
protected  Model modWordModel
          The model that generates head words of modifying nonterminals.
protected  Filter nonPreterm
          A filter that only allows TrainerEvent instances that do not represent preterminals (where the parent is identical to the part-of-speech tag of the head word).
protected  Filter nonStop
          A filter that disallows ModifierEvent instances where the modifier is Training.stopSym(), but allows all other objects.
protected  Filter nonStopAndNonTop
          A filter that disallows ModifierEvent instances where the modifier is either Training.stopSym() or Training.topSym(), but allows all other objects.
static Symbol nonterminalEventSym
          The label for nonterminal generation events.
protected  Model nonterminalPriorModel
          The model for conditional probabilities of nonterminals given the lexical components (for the estimation of the joint event that is a fully lexicalized nonterminal).
protected  CountsTable nonterminals
          A table for storing counts of (unlexicalized) nonterminals.
protected  Filter nonTop
          A filter that only allows TrainerEvent instances where the parent nonterminal is not Training.topSym().
protected  Filter nonTopNonPreterm
          A filter that is functionally equivalent to piping objects through both nonTop and nonPreterm.
protected  int numPrevMods
          The value of the Settings.numPrevMods setting.
protected  int numPrevWords
          The value of the Settings.numPrevWords setting.
protected  Map posMap
          A map of words to lists of their observed part-of-speech tags.
static Symbol posMapSym
          The label for word to part-of-speech mappings.
protected  CountsTable priorEvents
          A table for storing counts of lexicalized nonterminal prior events.
protected  Set prunedPreterms
          A set of Sexp objects representing preterminals that were pruned during training.
static Symbol prunedPretermSym
          The label for the set of pruned preterminals.
static Symbol prunedPuncSym
          The label for the set of pruned punctuation preterminals.
protected  Set prunedPunctuation
          The set of preterminals (Sexp objects) that were punctuation elements “raised away” because they were at either the beginning or the end of a sentence.
protected  int reportingInterval
          The value of the Settings.trainerReportingInterval setting.
protected  Map rightSubcatMap
          A map of events from the last back-off level of the right subcat–generation submodel to the set of possible right subcats.
protected  Model rightSubcatModel
          The model for generating subcats that fall on the right side of head children.
protected  Map simpleModNonterminalMap
          A map from unlexicalized parent-head-side triples to all possible partially-lexicalized modifying nonterminals.
protected  Symbol startSym
          The value of Training.startSym().
protected  Word startWord
          The value of Training.startWord().
protected  Symbol stopSym
          The value of Training.stopSym().
protected  Word stopWord
          The value of Training.stopWord().
protected  Model topLexModel
          The head-word generation model for heads of entire sentences.
protected  Model topNonterminalModel
          The head-generation model for heads whose parents are Training.topSym().
protected  Filter topOnly
          A filter that only allows TrainerEvent instances where the parent is Training.topSym().
protected  Symbol topSym
          The value of Training.topSym().
protected  Symbol traceTag
          The value of Training.traceTag().
protected static Class trainerClass
          The class from which an instance will be constructed in main(String[]).
protected  int unknownWordThreshold
          The value of the Settings.unknownWordThreshold setting.
protected static String[] usageMsg
          The usage for the main method of this class.
protected  CountsTable vocabCounter
          A table for storing counts of vocabulary items.
static Symbol vocabSym
          The label for vocabulary counts.
protected  CountsTable wordFeatureCounter
          A table for storing counts of word feature–vectors.
protected  WordFeatures wordFeatures
          A handle onto the static WordFeatures object contained inside Language.
static Symbol wordFeatureSym
          The label for word feature (unknown vocabulary) counts.
 
Constructor Summary
Trainer()
          Constructs a new training object, which uses values from Settings for its settings.
 
Method Summary
protected  void addGapEvent(GapEvent event)
          This method is a synonym for addGapEvent(event, 1.0).
protected  void addGapEvent(GapEvent event, double count)
          Adds the specified GapEvent to gapEvents with the specified count.
protected  void addHeadEvent(HeadEvent event)
          This method is a synonym for addHeadEvent(event, 1.0).
protected  void addHeadEvent(HeadEvent event, double count)
          Adds the specified HeadEvent to headEvents with the specified count.
protected  void addModifierEvent(ModifierEvent event)
          This method is a synonym for addModifierEvent(event, 1.0).
protected  void addModifierEvent(ModifierEvent event, double count)
          Adds the specified ModifierEvent to modifierEvents with the specified count.
protected  void addToPosMap(Symbol word, Symbol tag)
          Called by addToPosMap(Word).
protected  void addToPosMap(Word word)
          Called by collectStats(danbikel.lisp.Sexp, danbikel.parser.HeadTreeNode, boolean) and alterLowFrequencyWords(HeadTreeNode).
static void addToValueCounts(Map map, Object key, Object value)
          Adds value to the set of values to which key is mapped (if value is not already in that set) and increments the count of that value by 1.
static void addToValueCounts(Map map, Object key, Object value, int count)
          Adds value to the set of values to which key is mapped (if value is not already in that set) and increments the count of that value by count.
protected  void alterLowFrequencyWords(HeadTreeNode tree)
          For every Word in the specified tree, if it occurred fewer than unknownWordThreshold times, then it is modified.
protected  void clearEventCounters()
          Clears the priorEvents, headEvents, modifierEvents and gapEvents counts tables.
protected  void collectModifierStats(HeadTreeNode tree, Subcat subcat, int gapIdx, boolean side)
          Note the O(n) operation performed on the prevModList.
protected  void collectStats(Sexp orig, HeadTreeNode tree, boolean isRoot)
          Collects the statistics from the specified tree.
protected  void countVocab(HeadTreeNode tree)
          Counts number of occurrences of each word in the specified tree and adds the word with this count to vocabCounter.
protected  void createModelObjects()
          Creates all of the internal model objects used by this trainer when constructing its internal ModelCollection object.
 void createPosMap()
          Creates posMap from the headEvents, modifierEvents and gapEvents counts tables.
 void createPosMap(CountsTable events)
          Adds to posMap using information contained in the specified counts table.
 void deriveCounts()
          Derives event counts for all back-off levels of all sub-models for the current parsing model.
 void deriveCounts(boolean setModelCollection)
          Derives event counts for all back-off levels of all sub-models for the current parsing model.
 void deriveCounts(boolean setModelCollection, FlexibleMap canonical)
          Derives event counts for all back-off levels of all sub-models for the current parsing model.
protected  void deriveCounts(double derivedCountThreshold, FlexibleMap canonical)
          Derives all counts for creating a ModelCollection object.
protected  void deriveModelCounts(double derivedCountThreshold, FlexibleMap canonical)
          A helper method used by deriveCounts(double,FlexibleMap) to derive counts for all Model instances contained within a ModelCollection.
 void doneCollectingObservations()
          A hook that gets called by main(java.lang.String[]) after all observations are collected via any calls to readStats(File), readStats(SexpTokenizer) and train(SexpTokenizer,boolean,boolean).
static SexpList getCanonicalList(Map map, SexpList list)
          Returns a canonical version of the specified list from the specified reflexive map.
static Iterator getEventIterator(SexpTokenizer tokenizer, Symbol type)
          Returns an iterator over TrainerEvent objects that were written out in S-expression form.
static SexpTokenizer getStandardSexpStream(File file)
          Returns a new SexpTokenizer wrapped around the specified file using the encoding specified by Language.encoding() and a buffer size equal to Constants.defaultFileBufsize.
protected static void incrementallyTrain(Trainer trainer, String inputFilename)
          Incrementally updates derived model counts by reading chunks of TrainerEvent objects from the specified input file.
static ModelCollection loadModelCollection(ObjectInputStream ois)
          Loads the ModelCollection from the specified input stream.
static ModelCollection loadModelCollection(String objectInputFilename)
          Loads the ModelCollection from the specified file.
static void main(String[] args)
          Takes arguments according to the usage as specified in usageMsg.
protected  void modelCollectionSet(FlexibleMap canonical)
          Sets all the data members of the modelCollection member of this trainer with the internal resources constructed by this trainer (such as all the Model instances).
protected  void modelCollectionSetHook()
          A method called by deriveCounts() just after it calls ModelCollection.set(danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.CountsTable, danbikel.parser.CountsTable, danbikel.parser.CountsTable, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Set, java.util.Set, danbikel.util.FlexibleMap).
protected  ModelCollection newModelCollection()
          Returns a new instance of ModelCollection.
static SexpList newStartList()
          Creates and returns a new start list.
static WordList newStartWordList()
           
 void outputHeadToParentMap()
          Outputs the head map internal to this Trainer object to System.err.
static void outputMap(Map map, String mapName)
          Outputs the specified map to System.err.
static void outputMap(Map map, String mapName, Writer writer)
          Outputs the specified named map to the specified writer.
static void outputMaps(Map leftMap, String leftMapName, Map rightMap, String rightMapName)
          Outputs both the specified maps to System.err.
static void outputMaps(Map leftMap, String leftMapName, Map rightMap, String rightMapName, Writer writer)
          Outputs both the specified maps to the specified writer.
 void outputModNonterminalMap()
          Outputs the modifier map internal to this Trainer object to System.err.
 void outputSubcatMaps()
          Outputs the subcat maps internal to this Trainer object to System.err.
protected  void precomputeProbs()
          Precomputes all probabilities and smoothing parameters for all Model instances that are part of the ModelCollection of this trainer.
 void readStats(File file)
          Reads the statistics and observations from an output file in the format created by writeStats(Writer).
 void readStats(SexpTokenizer tok)
          Reads the observations and their counts contained in the specified S-expression tokenization stream.
 void readStats(SexpTokenizer tok, int maxEventsToRead)
          Reads at most the specified number of observations and their counts contained in the specified S-expression tokenization stream.
 void readStatsHook(SexpList event)
          A hook for subclasses to read an event of a newly-defined type (called by readStats(SexpTokenizer)).
static void scanModelCollectionObjectFile(ObjectInputStream ois, OutputStream os)
          Scans the object file and prints out the information contained in its header objects.
static void scanModelCollectionObjectFile(String scanObjectFilename, OutputStream os)
          Scans the object file and prints out the information contained in its header objects.
 void setModelCollection(ObjectInputStream ois)
          Sets the internal modelCollection member of this class to the instance loaded from the specified input stream.
 void setModelCollection(String objectInputFilename)
          Sets the internal modelCollection data member of this class to the object of that type loaded from the specified file.
 void train(SexpTokenizer tok, boolean auto, boolean stripOuterParens)
          Records observations from the training trees contained in the specified S-expression tokenizer.
 void writeModelCollection(ObjectOutputStream oos, String trainingInputFilename, String trainingOutputFilename)
          Writes the internal ModelCollection object to the specified output stream, writing a header containing the names of the training input file and training output file.
 void writeModelCollection(String objectOutputFilename, String trainingInputFilename, String trainingOutputFilename)
          Writes the internal ModelCollection object to the specified output file, writing a header containing the names of the training input file and training output file.
 void writeStats(File file)
          Writes the statistics and mappings collected by train(SexpTokenizer,boolean,boolean) to a human-readable text file, by constructing a Writer around a stream around the specified file and calling writeStats(Writer).
 void writeStats(Writer writer)
          Writes the statistics and mappings collected by train(SexpTokenizer,boolean,boolean) to a human-readable text file.
 void writeStatsHook(Writer writer)
          A hook for subclasses to write out any additional top-level events, or top-level events of a different, newly-defined type.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

trainerClass

protected static Class trainerClass
The class from which an instance will be constructed in main(String[]). This data member may be re-assigned in a subclass' main method before invocation of this class' main method, so that all trainer method invocations done by this class' main method will be on an instance of the subclass.
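
The re-assignment pattern described for trainerClass can be sketched with stand-in classes. BaseTrainer, CustomTrainer, run and runCustom are hypothetical names, and the reflective instantiation is an assumption about how such a class handle might be used, not a description of this class' actual main method.

```java
public class TrainerClassSketch {
    public static class BaseTrainer {
        // Assumed analogue of the trainerClass field.
        protected static Class<? extends BaseTrainer> trainerClass = BaseTrainer.class;

        public String describe() { return "base"; }

        /** Stand-in for the base main method: builds whatever trainerClass names. */
        public static String run(String[] args) {
            try {
                BaseTrainer trainer = trainerClass.getDeclaredConstructor().newInstance();
                return trainer.describe();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot instantiate " + trainerClass, e);
            }
        }
    }

    public static class CustomTrainer extends BaseTrainer {
        @Override public String describe() { return "custom"; }

        /** Stand-in for the subclass main method. */
        public static String runCustom(String[] args) {
            trainerClass = CustomTrainer.class; // re-assign before delegating
            return BaseTrainer.run(args);       // base logic now builds a CustomTrainer
        }
    }

    public static void main(String[] args) {
        System.out.println(CustomTrainer.runCustom(args));
    }
}
```

Because the field is static, the re-assignment affects every later instantiation in the same process, which fits the one-trainer-per-process usage described for this class.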


nonterminalEventSym

public static final Symbol nonterminalEventSym
The label for nonterminal generation events. This symbol has the print-name "nonterminal".


headEventSym

public static final Symbol headEventSym
The label for head nonterminal generation events. This symbol has the print-name "head".


modEventSym

public static final Symbol modEventSym
The label for modifier nonterminal generation events. This symbol has the print-name "mod".


gapEventSym

public static final Symbol gapEventSym
The label for gap events. This symbol has the print-name "gap".


posMapSym

public static final Symbol posMapSym
The label for word to part-of-speech mappings. This symbol has the print-name "pos".


vocabSym

public static final Symbol vocabSym
The label for vocabulary counts. This symbol has the print-name "vocab".


wordFeatureSym

public static final Symbol wordFeatureSym
The label for word feature (unknown vocabulary) counts. This symbol has the print-name "word-feature".


prunedPretermSym

public static final Symbol prunedPretermSym
The label for the set of pruned preterminals. This symbol has the print-name "pruned-preterm".

See Also:
Training.prune(Sexp)

prunedPuncSym

public static final Symbol prunedPuncSym
The label for the set of pruned punctuation preterminals. This symbol has the print-name "pruned-punc".

See Also:
Training.raisePunctuation(Sexp), Training.getPrunedPunctuation()

unknownWordThreshold

protected int unknownWordThreshold
The value of the Settings.unknownWordThreshold setting.


countThreshold

protected double countThreshold
The value of the Settings.countThreshold setting.


derivedCountThreshold

protected double derivedCountThreshold
The value of the Settings.derivedCountThreshold setting.


reportingInterval

protected int reportingInterval
The value of the Settings.trainerReportingInterval setting.


numPrevMods

protected int numPrevMods
The value of the Settings.numPrevMods setting.


numPrevWords

protected int numPrevWords
The value of the Settings.numPrevWords setting.


keepAllWords

protected boolean keepAllWords
The value of the Settings.keepAllWords setting.


keepLowFreqTags

protected boolean keepLowFreqTags
The value of the Settings.keepLowFreqTags setting.


downcaseWords

protected boolean downcaseWords
The value of the Settings.downcaseWords setting.


nonterminals

protected CountsTable nonterminals
A table for storing counts of (unlexicalized) nonterminals. The keys are instances of Symbol.


priorEvents

protected CountsTable priorEvents
A table for storing counts of lexicalized nonterminal prior events. The keys are instances of PriorEvent.


headEvents

protected CountsTable headEvents
A table for storing counts of head-generation events. The keys are instances of HeadEvent.


modifierEvents

protected CountsTable modifierEvents
A table for storing counts of modifier-generation events. The keys are instances of ModifierEvent.


gapEvents

protected CountsTable gapEvents
A table for storing counts of gap-generation events. The keys are instances of GapEvent.


vocabCounter

protected CountsTable vocabCounter
A table for storing counts of vocabulary items. The keys are instances of Symbol.


wordFeatureCounter

protected CountsTable wordFeatureCounter
A table for storing counts of word feature–vectors. The keys are instances of Symbol.


posMap

protected Map posMap
A map of words to lists of their observed part-of-speech tags. The keys in this map are instances of Symbol, and the values are SexpList instances that represent sets by containing lists of distinct Symbol objects.
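
A list that stands in for a set, as described for the values of posMap, only needs a membership check before each append. The sketch below uses plain strings and java.util collections rather than Symbol and SexpList; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PosMapSketch {
    /** Records that word was observed with tag, keeping each tag list free of duplicates. */
    public static void addToPosMap(Map<String, List<String>> posMap, String word, String tag) {
        List<String> tags = posMap.computeIfAbsent(word, k -> new ArrayList<>());
        if (!tags.contains(tag)) {
            tags.add(tag); // the list behaves as a set of distinct tags
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> posMap = new HashMap<>();
        addToPosMap(posMap, "saw", "VBD");
        addToPosMap(posMap, "saw", "NN");
        addToPosMap(posMap, "saw", "VBD"); // duplicate, ignored
        System.out.println(posMap.get("saw"));
    }
}
```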


headToParentMap

protected Map headToParentMap
A map of head child nonterminals to their observed parent nonterminals. The keys are instances of Symbol, and the values are Set instances containing Symbol objects.


leftSubcatMap

protected Map leftSubcatMap
A map of events from the last back-off level of the left subcat–generation submodel to the set of possible left subcats. The keys are instances of Event, and the values are Set instances containing Subcat objects.


rightSubcatMap

protected Map rightSubcatMap
A map of events from the last back-off level of the right subcat–generation submodel to the set of possible right subcats. The keys are instances of Event, and the values are Set instances containing Subcat objects.


modNonterminalMap

protected Map modNonterminalMap
A map of events from the last back-off level of the modifier nonterminal–generation submodel to the set of possible futures (typically, a future is a modifier label and its head word's part-of-speech tag). The keys are instances of Event, and the values are Set instances containing Event objects.


simpleModNonterminalMap

protected Map simpleModNonterminalMap
A map from unlexicalized parent-head-side triples to all possible partially-lexicalized modifying nonterminals. This map provides a simpler mechanism for determining whether a given modifier is possible in the current parent-head context than is provided by modNonterminalMap.

The keys are SexpList objects containing exactly three Symbol elements representing the following in a production:

  1. an unlexicalized parent nonterminal
  2. an unlexicalized head nonterminal
  3. the direction of modification, either Constants.LEFT or Constants.RIGHT.

The values consist of Set objects containing SexpList objects that contain exactly two Symbol elements representing a partially-lexicalized modifying nonterminal:

  1. the unlexicalized modifying nonterminal
  2. the part-of-speech tag of the modifying nonterminal's head word.

An example of a partially-lexicalized nonterminal in the Penn Treebank is NP(NNP), which is a noun phrase headed by a singular proper noun.

See Also:
Settings.useSimpleModNonterminalMap
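
The key and value shapes described for simpleModNonterminalMap can be mimicked with ordinary Java lists, which compare by content. This is only a sketch: strings replace Symbol objects, the "left" marker is a placeholder for Constants.LEFT (whose actual value is not given here), and isPossible is a hypothetical helper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SimpleModMapSketch {
    public static final String LEFT = "left"; // placeholder for Constants.LEFT

    /** Returns true if (mod, tag) was observed for the (parent, head, side) triple. */
    public static boolean isPossible(Map<List<String>, Set<List<String>>> map,
                                     String parent, String head, String side,
                                     String mod, String tag) {
        Set<List<String>> mods = map.get(Arrays.asList(parent, head, side));
        return mods != null && mods.contains(Arrays.asList(mod, tag));
    }

    public static void main(String[] args) {
        Map<List<String>, Set<List<String>>> map = new HashMap<>();
        // Key: parent, head, direction. Value: set of (modifier, head-word tag) pairs.
        map.computeIfAbsent(Arrays.asList("S", "VP", LEFT), k -> new HashSet<>())
           .add(Arrays.asList("NP", "NNP"));

        System.out.println(isPossible(map, "S", "VP", LEFT, "NP", "NNP"));
        System.out.println(isPossible(map, "S", "VP", LEFT, "PP", "IN"));
    }
}
```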

prunedPreterms

protected Set prunedPreterms
A set of Sexp objects representing preterminals that were pruned during training.

See Also:
Training.prune(Sexp), Treebank.isPreterminal(Sexp)

prunedPunctuation

protected Set prunedPunctuation
The set of preterminals (Sexp objects) that were punctuation elements “raised away” because they were at either the beginning or the end of a sentence.

See Also:
Training.raisePunctuation(Sexp), Treebank.isPuncToRaise(Sexp)

canonicalSubcatMap

protected transient Map canonicalSubcatMap
A reflexive map for storing canonical versions of Subcat objects.


emptySubcat

protected transient Subcat emptySubcat
The value returned by Subcats.get().


modelCollection

protected ModelCollection modelCollection
The set of Model objects and other resources that describe an entire parsing model.


lexPriorModel

protected Model lexPriorModel
The model for marginal probabilities of lexical elements (for the estimation of the joint event that is a fully lexicalized nonterminal).


nonterminalPriorModel

protected Model nonterminalPriorModel
The model for conditional probabilities of nonterminals given the lexical components (for the estimation of the joint event that is a fully lexicalized nonterminal).


topNonterminalModel

protected Model topNonterminalModel
The head-generation model for heads whose parents are Training.topSym().


topLexModel

protected Model topLexModel
The head-word generation model for heads of entire sentences.


headModel

protected Model headModel
The head-generation model.


gapModel

protected Model gapModel
The gap-generation model.


leftSubcatModel

protected Model leftSubcatModel
The model for generating subcats that fall on the left side of head children.


rightSubcatModel

protected Model rightSubcatModel
The model for generating subcats that fall on the right side of head children.


modNonterminalModel

protected Model modNonterminalModel
The modifying nonterminal–generation model.


modWordModel

protected Model modWordModel
The model that generates head words of modifying nonterminals.


wordFeatures

protected transient WordFeatures wordFeatures
A handle onto the static WordFeatures object contained inside Language.


startSym

protected Symbol startSym
The value of Training.startSym().


stopSym

protected Symbol stopSym
The value of Training.stopSym().


topSym

protected Symbol topSym
The value of Training.topSym().


startWord

protected Word startWord
The value of Training.startWord().


stopWord

protected Word stopWord
The value of Training.stopWord().


gapAugmentation

protected Symbol gapAugmentation
The value of Training.gapAugmentation().


traceTag

protected Symbol traceTag
The value of Training.traceTag().


allPass

protected Filter allPass
An instance of AllPass.


nonTop

protected Filter nonTop
A filter that only allows TrainerEvent instances where the parent nonterminal is not Training.topSym().


nonPreterm

protected Filter nonPreterm
A filter that only allows TrainerEvent instances that do not represent preterminals (where the parent is identical to the part-of-speech tag of the head word).


nonTopNonPreterm

protected Filter nonTopNonPreterm
A filter that is functionally equivalent to piping objects through both nonTop and nonPreterm.
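
Composing two filters into one equivalent filter can be sketched with java.util.function.Predicate. The Event class, the "+TOP+" placeholder for Training.topSym(), and the field names below are assumptions for illustration; the parser's own Filter type is not used.

```java
import java.util.function.Predicate;

public class FilterCompositionSketch {
    public static final String TOP = "+TOP+"; // placeholder for Training.topSym()

    /** Hypothetical event with just the two fields the filters inspect. */
    public static class Event {
        public final String parent;
        public final String headTag;
        public Event(String parent, String headTag) {
            this.parent = parent;
            this.headTag = headTag;
        }
    }

    // Passes events whose parent is not the top symbol.
    public static final Predicate<Event> NON_TOP = e -> !e.parent.equals(TOP);
    // Passes events that are not preterminals (parent differs from the head word's tag).
    public static final Predicate<Event> NON_PRETERM = e -> !e.parent.equals(e.headTag);
    // Equivalent to piping an event through both filters in turn.
    public static final Predicate<Event> NON_TOP_NON_PRETERM = NON_TOP.and(NON_PRETERM);

    public static void main(String[] args) {
        System.out.println(NON_TOP_NON_PRETERM.test(new Event("VP", "VBD")));
        System.out.println(NON_TOP_NON_PRETERM.test(new Event(TOP, "VBD")));
        System.out.println(NON_TOP_NON_PRETERM.test(new Event("NN", "NN")));
    }
}
```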


topOnly

protected Filter topOnly
A filter that only allows TrainerEvent instances where the parent is Training.topSym().


nonStop

protected Filter nonStop
A filter that disallows ModifierEvent instances where the modifier is Training.stopSym(), but allows all other objects.


nonStopAndNonTop

protected Filter nonStopAndNonTop
A filter that disallows ModifierEvent instances where the modifier is either Training.stopSym() or Training.topSym(), but allows all other objects.


usageMsg

protected static final String[] usageMsg
The usage for the main method of this class. Please run java danbikel.parser.Trainer -help to display the complete usage of this class.

Constructor Detail

Trainer

public Trainer()
Constructs a new training object, which uses values from Settings for its settings. This class is not thread-safe, and there will typically be one instance of a Trainer object per process, constructed via the main(java.lang.String[]) method of this class.

See Also:
Settings.unknownWordThreshold, Settings.countThreshold, Settings.derivedCountThreshold, Settings.trainerReportingInterval, Settings.numPrevMods
Method Detail

newModelCollection

protected ModelCollection newModelCollection()
Returns a new instance of ModelCollection. Subclasses may override this method to return different sub-types of ModelCollection.

Returns:
a new instance of ModelCollection
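
Overriding a factory method to substitute a sub-type, as suggested for newModelCollection, might look like the following sketch. ModelBundle, ExtendedBundle, BaseTrainer and ExtendedTrainer are hypothetical stand-ins; no real ModelCollection behavior is implied.

```java
public class FactoryMethodSketch {
    /** Hypothetical stand-in for the collection type. */
    public static class ModelBundle {
        public String kind() { return "default"; }
    }

    public static class ExtendedBundle extends ModelBundle {
        @Override public String kind() { return "extended"; }
    }

    public static class BaseTrainer {
        /** Factory method; subclasses may return a different sub-type. */
        protected ModelBundle newBundle() { return new ModelBundle(); }

        /** Base logic that relies only on the factory method. */
        public String build() { return newBundle().kind(); }
    }

    public static class ExtendedTrainer extends BaseTrainer {
        @Override protected ModelBundle newBundle() { return new ExtendedBundle(); }
    }

    public static void main(String[] args) {
        System.out.println(new BaseTrainer().build());
        System.out.println(new ExtendedTrainer().build());
    }
}
```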

train

public void train(SexpTokenizer tok,
                  boolean auto,
                  boolean stripOuterParens)
           throws IOException
Records observations from the training trees contained in the specified S-expression tokenizer. The observations are either mappings stored in Map objects or items to be counted, stored in CountsTable objects. All the trees obtained from tok are first preprocessed using Training.preProcess(Sexp).

Parameters:
tok - the S-expression tokenizer from which to obtain training parse trees
auto - indicates whether to automatically determine whether to strip off outer parens of training parse trees before preprocessing; if the value of this argument is false, then the value of stripOuterParens is used
stripOuterParens - indicates whether an outer layer of parentheses should be stripped off of trees obtained from tok before preprocessing and training (only used if the auto argument is false)
Throws:
IOException
See Also:
CountsTable, Training.preProcess(Sexp)

countVocab

protected void countVocab(HeadTreeNode tree)
Counts the number of occurrences of each word in the specified tree and adds the word with this count to vocabCounter. Specifically, if the tree with which this recursive method is called represents a preterminal that is not a trace and that is not already a key in wordFeatureCounter, then the word field of the tree's headWord is added (with a count of 1) to vocabCounter.

Parameters:
tree - the tree whose words are to be counted
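
The recursion described for countVocab can be sketched over a simplified tree. The Node class, the "-NONE-" trace tag, and the plain string set standing in for the keys of wordFeatureCounter are assumptions for this example, not the parser's HeadTreeNode or CountsTable types.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CountVocabSketch {
    /** Hypothetical tree node: a preterminal carries a word, other nodes carry children. */
    public static class Node {
        public final String label;
        public final String word;
        public final List<Node> children;
        public Node(String label, String word, Node... children) {
            this.label = label;
            this.word = word;
            this.children = Arrays.asList(children);
        }
        public boolean isPreterminal() { return word != null; }
    }

    /** Adds 1 to the count of each non-trace word that is not a word-feature key. */
    public static void countVocab(Node tree, String traceTag,
                                  Set<String> featureKeys, Map<String, Integer> vocab) {
        if (tree.isPreterminal()) {
            if (!tree.label.equals(traceTag) && !featureKeys.contains(tree.word)) {
                vocab.merge(tree.word, 1, Integer::sum);
            }
            return;
        }
        for (Node child : tree.children) {
            countVocab(child, traceTag, featureKeys, vocab);
        }
    }

    public static void main(String[] args) {
        Node tree = new Node("S", null,
            new Node("NP", null, new Node("NN", "dogs")),
            new Node("VP", null, new Node("VBP", "chase"), new Node("NN", "dogs")),
            new Node("-NONE-", "*T*"));
        Map<String, Integer> vocab = new HashMap<>();
        countVocab(tree, "-NONE-", new HashSet<>(), vocab);
        System.out.println(vocab);
    }
}
```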

alterLowFrequencyWords

protected void alterLowFrequencyWords(HeadTreeNode tree)
For every Word in the specified tree, if it occurred fewer than unknownWordThreshold times, then it is modified. If keepAllWords is true, then the word's features field is set, using Word.setFeatures(Symbol); otherwise, the word's word field is set, using Word.setWord(Symbol).
This method also invokes addToPosMap(danbikel.parser.Word) with the head word as its argument if keepLowFreqTags is true.

Parameters:
tree - the tree in which to alter low-frequency words
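
The branching on keepAllWords can be sketched for a single word. This sketch makes several assumptions that the description does not state: the value written in both branches is taken to be a word-feature symbol, counts come from a plain map, and the Word class below is a hypothetical stand-in with public fields rather than the setFeatures and setWord methods.

```java
import java.util.HashMap;
import java.util.Map;

public class LowFrequencySketch {
    /** Hypothetical stand-in for a word with its tag and optional feature symbol. */
    public static class Word {
        public String word;
        public String tag;
        public String features;
        public Word(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /** Modifies w only if it was seen fewer than threshold times. */
    public static void alter(Word w, Map<String, Integer> vocabCounts, int threshold,
                             boolean keepAllWords, String featureSymbol) {
        int count = vocabCounts.getOrDefault(w.word, 0);
        if (count < threshold) {
            if (keepAllWords) {
                w.features = featureSymbol; // keep the word, attach its features
            } else {
                w.word = featureSymbol;     // assumed: replace the word itself
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 500);
        counts.put("zyzzyva", 1);

        Word rare = new Word("zyzzyva", "NN");
        alter(rare, counts, 5, false, "+lowercase+"); // feature symbol is made up
        System.out.println(rare.word);

        Word common = new Word("the", "DT");
        alter(common, counts, 5, false, "+lowercase+");
        System.out.println(common.word);
    }
}
```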

addHeadEvent

protected void addHeadEvent(HeadEvent event)
This method is a synonym for addHeadEvent(event, 1.0).

Parameters:
event - the event to be added with a count of 1.0
See Also:
addHeadEvent(HeadEvent,double)

addModifierEvent

protected void addModifierEvent(ModifierEvent event)
This method is a synonym for addModifierEvent(event, 1.0).

Parameters:
event - the event to be added with a count of 1.0
See Also:
addModifierEvent(ModifierEvent,double)

addGapEvent

protected void addGapEvent(GapEvent event)
This method is a synonym for addGapEvent(event, 1.0).

Parameters:
event - the event to be added with a count of 1.0
See Also:
addGapEvent(GapEvent,double)

addHeadEvent

protected void addHeadEvent(HeadEvent event,
                            double count)
Adds the specified HeadEvent to headEvents with the specified count. This is a helper method used by the collectStats and readStats methods. The purpose of using this protected method is to provide a hook for subclasses.

Parameters:
event - the HeadEvent to be added
count - the count of the event to be added
See Also:
MapToPrimitive.add(Object,double)
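
The hook arrangement described for the add*Event methods can be sketched as follows. The classes below are hypothetical stand-ins, not the real Trainer: events are plain strings and the counts table is a HashMap. The point is only that the one-argument method delegates to the two-argument one, so a subclass that overrides the two-argument method sees every count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a trainer that keeps a table of event counts.
class CountingTrainerSketch {
    protected final Map<String, Double> headEvents = new HashMap<>();

    // Synonym for addHeadEvent(event, 1.0).
    protected void addHeadEvent(String event) {
        addHeadEvent(event, 1.0);
    }

    // Every count passes through this method, which makes it a hook for subclasses.
    protected void addHeadEvent(String event, double count) {
        headEvents.merge(event, count, Double::sum);
    }
}

// A subclass that uses the hook to scale every incoming count (an invented use case).
class ScaledTrainerSketch extends CountingTrainerSketch {
    private final double scale;

    ScaledTrainerSketch(double scale) {
        this.scale = scale;
    }

    @Override
    protected void addHeadEvent(String event, double count) {
        super.addHeadEvent(event, count * scale);
    }

    public static void main(String[] args) {
        ScaledTrainerSketch trainer = new ScaledTrainerSketch(0.5);
        trainer.addHeadEvent("e");       // routed through the overridden method
        trainer.addHeadEvent("e", 3.0);
        System.out.println(trainer.headEvents);
    }
}
```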

addModifierEvent

protected void addModifierEvent(ModifierEvent event,
                                double count)
Adds the specified ModifierEvent to modifierEvents with the specified count. This is a helper method used by the collectStats, collectModifierStats and readStats methods. The purpose of using this protected method is to provide a hook for subclasses.

Parameters:
event - the ModifierEvent to be added
count - the count of the event to be added
See Also:
MapToPrimitive.add(Object,double)

addGapEvent

protected void addGapEvent(GapEvent event,
                           double count)
Adds the specified GapEvent to gapEvents with the specified count. This is a helper method used by the collectStats, collectModifierStats and readStats methods. The purpose of using this protected method is to provide a hook for subclasses.

Parameters:
event - the GapEvent to be added
count - the count of the event to be added
See Also:
MapToPrimitive.add(Object,double)

collectStats

protected void collectStats(Sexp orig,
                            HeadTreeNode tree,
                            boolean isRoot)
Collects the statistics from the specified tree. Some "statistics" are actually mappings, such as part-of-speech-to-word mappings.

Parameters:
orig - the original (preprocessed) tree, used for debugging purposes
tree - the tree from which to collect statistics and mappings
isRoot - indicates whether tree is the observed root of a tree (the observed root is the child of the hidden root, represented by the symbol Training.topSym())

newStartList

public static SexpList newStartList()
Creates and returns a new start list. A start list is a list of length equal to the value of Settings.get(Settings.numPrevMods), where every element is the symbol Language.training.startSym(). This is the appropriate initial list of previously "generated" modifiers when beginning the Markov process of generating modifiers.

Returns:
a new list of start symbols
See Also:
Training.startSym()
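
As a rough illustration of what a start list looks like, the sketch below builds a list of a given length filled with one start symbol. It uses plain strings instead of Symbol objects, takes the length and symbol as parameters rather than reading them from Settings and Language, and adds a hypothetical advance method whose "most recent modifier first" ordering is an assumption, not something this documentation states.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in; the real method returns a SexpList of Symbol objects.
class StartListSketch {
    // A list of numPrevMods copies of the start symbol.
    static List<String> newStartList(int numPrevMods, String startSym) {
        return new ArrayList<>(Collections.nCopies(numPrevMods, startSym));
    }

    // Assumed history update: the new modifier goes to the front and the
    // oldest entry is dropped, so the list length never changes.
    static List<String> advance(List<String> prevMods, String newMod) {
        if (prevMods.isEmpty()) {
            return new ArrayList<>();
        }
        List<String> next = new ArrayList<>(prevMods.size());
        next.add(newMod);
        next.addAll(prevMods.subList(0, prevMods.size() - 1));
        return next;
    }

    public static void main(String[] args) {
        List<String> history = newStartList(2, "+START+");
        System.out.println(history);
        System.out.println(advance(history, "NP"));
    }
}
```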

newStartWordList

public static WordList newStartWordList()

collectModifierStats

protected void collectModifierStats(HeadTreeNode tree,
                                    Subcat subcat,
                                    int gapIdx,
                                    boolean side)
Note the O(n) operation performed on the prevModList.


createPosMap

public void createPosMap()
Creates posMap from the headEvents, modifierEvents and gapEvents counts tables.


createPosMap

public void createPosMap(CountsTable events)
Adds to posMap using information contained in the specified counts table.

Parameters:
events - the counts table of TrainerEvent instances from which to derive a mapping of words to their observed parts of speech

addToPosMap

protected final void addToPosMap(Word word)
Called by collectStats(danbikel.lisp.Sexp, danbikel.parser.HeadTreeNode, boolean) and alterLowFrequencyWords(HeadTreeNode).

Parameters:
word - the Word object containing a word (and possibly a word-feature vector) and a tag with which that word (and possibly feature vector) has been observed

addToPosMap

protected final void addToPosMap(Symbol word,
                                 Symbol tag)
Called by addToPosMap(Word).

Parameters:
word - the word with which to associate a part of speech
tag - the part-of-speech tag associated with the specified word

createModelObjects

protected void createModelObjects()
Creates all of the internal model objects used by this trainer when constructing its internal ModelCollection object. Each model is created by first creating its ProbabilityStructure object, and then calling that object's ProbabilityStructure.newModel() method to wrap itself in a Model instance. There are ten Model members of this class: In order to determine the fully-qualified class name for the associated ProbabilityStructure for each of the above models, the following algorithm is used: Please read the documentation for the Settings.globalModelStructureNumber setting for more details on all the model structure–specific settings that control which concrete subclasses of ProbabilityStructure are instantiated.


deriveCounts

public void deriveCounts()
Derives event counts for all back-off levels of all sub-models for the current parsing model. After deriving counts, the modelCollectionSet(FlexibleMap) method will be invoked.

See Also:
Model.deriveCounts(CountsTable,Filter, double,FlexibleMap)

deriveCounts

public void deriveCounts(boolean setModelCollection)
Derives event counts for all back-off levels of all sub-models for the current parsing model.

Parameters:
setModelCollection - indicates whether to invoke modelCollectionSet(FlexibleMap) after deriving counts
See Also:
Model.deriveCounts(CountsTable,Filter, double,FlexibleMap)

deriveCounts

public void deriveCounts(boolean setModelCollection,
                         FlexibleMap canonical)
Derives event counts for all back-off levels of all sub-models for the current parsing model.

Parameters:
setModelCollection - indicates whether to invoke modelCollectionSet(FlexibleMap) after deriving counts
canonical - the FlexibleMap instance to use for creating a reflexive map of canonical versions of event objects created when deriving counts
See Also:
Model.deriveCounts(CountsTable,Filter, double,FlexibleMap)

clearEventCounters

protected void clearEventCounters()
Clears the priorEvents, headEvents, modifierEvents and gapEvents counts tables.


deriveCounts

protected void deriveCounts(double derivedCountThreshold,
                            FlexibleMap canonical)
Derives all counts for creating a ModelCollection object.

Parameters:
derivedCountThreshold - the count threshold below which to throw away derived events
canonical - a reflexive map of canonical versions of derived Event and Transition objects, shared among all Model instances of this trainer
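
The effect of a derived-count threshold can be pictured with a small sketch. It is an assumption-laden illustration, not the Model implementation: events are strings, the table is an ordinary map, and "below" is read as strictly less than the threshold.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for discarding derived events with small counts.
class DerivedCountThresholdSketch {
    // Removes every entry whose count is strictly below threshold and
    // returns the number of entries removed.
    static <K> int pruneBelow(Map<K, Double> counts, double threshold) {
        int before = counts.size();
        counts.values().removeIf(count -> count < threshold);
        return before - counts.size();
    }

    public static void main(String[] args) {
        Map<String, Double> counts = new HashMap<>();
        counts.put("a", 1.0);
        counts.put("b", 2.0);
        counts.put("c", 0.5);
        int removed = pruneBelow(counts, 1.0);
        System.out.println(removed + " removed, remaining: " + counts.keySet());
    }
}
```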

deriveModelCounts

protected void deriveModelCounts(double derivedCountThreshold,
                                 FlexibleMap canonical)
A helper method used by deriveCounts(double,FlexibleMap) to derive counts for all Model instances contained within a ModelCollection.

Parameters:
derivedCountThreshold - the count threshold below which to throw away derived events
canonical - a reflexive map of canonical versions of derived Event and Transition objects, shared among all Model instances

precomputeProbs

protected void precomputeProbs()
Precomputes all probabilities and smoothing parameters for all Model instances that are part of the ModelCollection of this trainer.

See Also:
Model.precomputeProbs()

modelCollectionSet

protected void modelCollectionSet(FlexibleMap canonical)
Sets all the data members of the modelCollection member of this trainer with the internal resources constructed by this trainer (such as all the Model instances).

Parameters:
canonical - a reflexive map of canonical versions of derived Event and Transition objects, shared among all Model instances

modelCollectionSetHook

protected void modelCollectionSetHook()
A method called by deriveCounts() just after it calls ModelCollection.set(danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.Model, danbikel.parser.CountsTable, danbikel.parser.CountsTable, danbikel.parser.CountsTable, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Map, java.util.Set, java.util.Set, danbikel.util.FlexibleMap).


getCanonicalList

public static final SexpList getCanonicalList(Map map,
                                              SexpList list)
Returns a canonical version of the specified list from the specified reflexive map.

Parameters:
map - a reflexive map of SexpList objects
list - the list to canonicalize
Returns:
a canonical version of the specified list from the specified reflexive map
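
A reflexive map is one in which each key maps to itself, so that looking up an object returns the single shared instance that is equal to it. The sketch below shows that idea with a generic helper; the class and method names are hypothetical, and whether the real getCanonicalList copies the list before storing it is not stated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for canonicalization through a reflexive map.
class CanonicalSketch {
    // Returns the stored instance equal to item, storing item itself
    // as the canonical instance if no equal object has been seen yet.
    static <T> T getCanonical(Map<T, T> map, T item) {
        T canonical = map.get(item);
        if (canonical == null) {
            map.put(item, item);
            canonical = item;
        }
        return canonical;
    }

    public static void main(String[] args) {
        Map<List<String>, List<String>> map = new HashMap<>();
        List<String> first = new ArrayList<>(Arrays.asList("NP", "VP"));
        List<String> second = new ArrayList<>(Arrays.asList("NP", "VP"));
        // Both lookups yield the very same object, so duplicates can be discarded.
        System.out.println(getCanonical(map, first) == getCanonical(map, second));
    }
}
```

Objects used as keys in such a map should not be mutated afterwards, since a changed hash code would make the canonical instance unreachable.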

outputHeadToParentMap

public void outputHeadToParentMap()
Outputs the head map internal to this Trainer object to System.err.


outputSubcatMaps

public void outputSubcatMaps()
Outputs the subcat maps internal to this Trainer object to System.err.


outputModNonterminalMap

public void outputModNonterminalMap()
Outputs the modifier map internal to this Trainer object to System.err.


outputMap

public static void outputMap(Map map,
                             String mapName)
Outputs the specified map to System.err.


outputMaps

public static void outputMaps(Map leftMap,
                              String leftMapName,
                              Map rightMap,
                              String rightMapName)
Outputs both the specified maps to System.err.


outputMap

public static void outputMap(Map map,
                             String mapName,
                             Writer writer)
                      throws IOException
Outputs the specified named map to the specified writer.

Throws:
IOException

outputMaps

public static void outputMaps(Map leftMap,
                              String leftMapName,
                              Map rightMap,
                              String rightMapName,
                              Writer writer)
                       throws IOException
Outputs both the specified maps to the specified writer.

Throws:
IOException

addToValueCounts

public static final void addToValueCounts(Map map,
                                          Object key,
                                          Object value)
Adds value to the set of values to which key is mapped (if value is not already in that set) and increments the count of that value by 1.

Parameters:
map - the map of keys to sets of values, where each value has its own count (map is actually a map of keys to maps of values to counts)
key - the key in map to associate with a set of values with counts
value - the value to add to the set of key's values, whose count is to be incremented by 1

addToValueCounts

public static final void addToValueCounts(Map map,
                                          Object key,
                                          Object value,
                                          int count)
Adds value to the set of values to which key is mapped (if value is not already in that set) and increments the count of that value by count.

Parameters:
map - the map of keys to sets of values, where each value has its own count (map is actually a map of keys to maps of values to counts)
key - the key in map to associate with a set of values with counts
value - the value to add to the set of key's values, whose count is to be incremented by count
count - the amount by which to increment value's count
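
The data structure described above, a map of keys to maps of values to counts, can be sketched with standard collections. This is an illustration under assumptions (generic type parameters and Integer counts are choices made here; the real method takes raw Map and Object arguments), not the class's own implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a map of keys to (value -> count) maps.
class ValueCountsSketch {
    // Adds value to key's set of values if absent and increments its count by count.
    static <K, V> void addToValueCounts(Map<K, Map<V, Integer>> map,
                                        K key, V value, int count) {
        map.computeIfAbsent(key, k -> new HashMap<>())
           .merge(value, count, Integer::sum);
    }

    // Same as above with an increment of 1.
    static <K, V> void addToValueCounts(Map<K, Map<V, Integer>> map, K key, V value) {
        addToValueCounts(map, key, value, 1);
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> posMap = new HashMap<>();
        addToValueCounts(posMap, "run", "VB");
        addToValueCounts(posMap, "run", "NN", 2);
        addToValueCounts(posMap, "run", "VB");
        System.out.println(posMap.get("run").get("VB") + " " + posMap.get("run").get("NN"));
    }
}
```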

writeStats

public void writeStats(File file)
                throws IOException
Writes the statistics and mappings collected by train(SexpTokenizer,boolean,boolean) to a human-readable text file, by constructing a Writer around a stream around the specified file and calling writeStats(Writer).

Throws:
IOException
See Also:
train(SexpTokenizer,boolean,boolean), writeStats(Writer)

writeStatsHook

public void writeStatsHook(Writer writer)
                    throws IOException
A hook for subclasses to write out any additional top-level events, or top-level events of a different, newly-defined type. This default implementation does nothing.

Parameters:
writer - the writer to which to write any additional top-level events
Throws:
IOException

writeStats

public void writeStats(Writer writer)
                throws IOException
Writes the statistics and mappings collected by train(SexpTokenizer,boolean,boolean) to a human-readable text file.
This method calls writeStatsHook(Writer) just before terminating.

Throws:
IOException
See Also:
train(SexpTokenizer,boolean,boolean), SymbolicCollectionWriter.writeMap(Map,Symbol,Writer), CountsTable.output(String,Writer)

readStats

public void readStats(File file)
               throws FileNotFoundException,
                      UnsupportedEncodingException,
                      IOException
Reads the statistics and observations from an output file in the format created by writeStats(Writer). Observations are one of several types, all recorded as S-expressions where the first element is one of the following symbols:

Parameters:
file - the file containing the S-expressions representing top-level observations and their counts
Throws:
FileNotFoundException
UnsupportedEncodingException
IOException

getStandardSexpStream

public static SexpTokenizer getStandardSexpStream(File file)
                                           throws FileNotFoundException,
                                                  UnsupportedEncodingException,
                                                  IOException
Returns a new SexpTokenizer wrapped around the specified file using the encoding specified by Language.encoding() and a buffer size equal to Constants.defaultFileBufsize.

Parameters:
file - the file around which to construct a SexpTokenizer
Returns:
a new SexpTokenizer wrapped around the specified file using the encoding specified by Language.encoding() and a buffer size equal to Constants.defaultFileBufsize
Throws:
FileNotFoundException - if the specified file cannot be found
UnsupportedEncodingException - if the encoding specified by Language.encoding() is unsupported
IOException - if there is a problem opening a stream for the specified file

readStatsHook

public void readStatsHook(SexpList event)
A hook for subclasses to read an event of a newly-defined type (called by readStats(SexpTokenizer)). This method is responsible for printing out any error messages if the specified event is improperly formatted or is not recognized. New event types must still have the same general S-expression format requirements as the existing event types of this class: they must be lists of length 2 or 3.
The default implementation here simply prints an error message to System.err indicating that the specified event is an unrecognized event type.

Parameters:
event - the event to be read

getEventIterator

public static Iterator getEventIterator(SexpTokenizer tokenizer,
                                        Symbol type)
Returns an iterator over TrainerEvent objects that were written out in S-expression form.

Parameters:
tokenizer - the S-expression reader from which to read TrainerEvent objects that were serialized as S-expression strings
type - the type of TrainerEvent objects to retrieve; the value of this argument may be one of
Returns:
an iterator over TrainerEvent objects that were written out in S-expression form

readStats

public void readStats(SexpTokenizer tok)
               throws IOException
Reads the observations and their counts contained in the specified S-expression tokenization stream. The S-expressions contained in the specified stream are expected to be in the format output by writeStats(Writer). Observations are one of several types, all recorded as S-expressions where the first element is one of the following symbols:

Parameters:
tok - the S-expression tokenization stream from which to read top-level counts
Throws:
IOException - if the underlying stream throws an IOException

readStats

public void readStats(SexpTokenizer tok,
                      int maxEventsToRead)
               throws IOException
Reads at most the specified number of observations and their counts contained in the specified S-expression tokenization stream. The S-expressions contained in the specified stream are expected to be in the format output by writeStats(Writer). Observations are one of several types, all recorded as S-expressions where the first element is one of the following symbols:

Parameters:
tok - the S-expression tokenization stream from which to read top-level counts
maxEventsToRead - the maximum number of events to read from the specified stream; if the value of this parameter is less than 1, then all observations are read from the underlying stream, and the behavior of this method is identical to readStats(SexpTokenizer)
Throws:
IOException - if the underlying stream throws an IOException

doneCollectingObservations

public void doneCollectingObservations()
A hook that gets called by main(java.lang.String[]) after all observations are collected via any calls to readStats(File), readStats(SexpTokenizer) and train(SexpTokenizer,boolean,boolean). The default implementation does nothing.


writeModelCollection

public void writeModelCollection(String objectOutputFilename,
                                 String trainingInputFilename,
                                 String trainingOutputFilename)
                          throws FileNotFoundException,
                                 IOException
Writes the internal ModelCollection object to the specified output file, writing a header containing the names of the training input file and training output file.

Parameters:
objectOutputFilename - the output file to which to write the internal ModelCollection object constructed by this trainer
trainingInputFilename - the name of the input file of training parse trees from which events and counts were collected
trainingOutputFilename - the name of the training output file of top-level (maximal context) events
Throws:
FileNotFoundException - if the specified output filename cannot be created
IOException - if there is a problem writing to the stream of the specified output file

writeModelCollection

public void writeModelCollection(ObjectOutputStream oos,
                                 String trainingInputFilename,
                                 String trainingOutputFilename)
                          throws IOException
Writes the internal ModelCollection object to the specified output stream, writing a header containing the names of the training input file and training output file.

Parameters:
oos - the output stream to which to write the internal ModelCollection object constructed by this trainer
trainingInputFilename - the name of the input file of training parse trees from which events and counts were collected
trainingOutputFilename - the name of the training output file of top-level (maximal context) events
Throws:
IOException - if there is a problem writing to the stream of the specified output file

setModelCollection

public void setModelCollection(String objectInputFilename)
                        throws ClassNotFoundException,
                               IOException,
                               OptionalDataException
Sets the internal modelCollection data member of this class to the object of that type loaded from the specified file.

Parameters:
objectInputFilename - the name of the Java serialized object file from which to load a ModelCollection
Throws:
ClassNotFoundException - if the concrete type of ModelCollection read from the specified file cannot be found
IOException - if there is a problem reading from the stream of the specified file
OptionalDataException - if there is a problem reading primitive data associated with the ModelCollection object read from the specified file

loadModelCollection

public static ModelCollection loadModelCollection(String objectInputFilename)
                                           throws ClassNotFoundException,
                                                  IOException,
                                                  OptionalDataException
Loads the ModelCollection from the specified file.

Parameters:
objectInputFilename - the name of the Java serialized object file from which to load a ModelCollection instance; the file must contain a series of header objects as produced by writeModelCollection(String,String,String)
Returns:
the ModelCollection object contained in the specified file
Throws:
ClassNotFoundException - if the concrete type of the ModelCollection or any of the header objects in the specified file cannot be found
IOException - if there is a problem reading from the specified file
OptionalDataException - if there is a problem reading primitive data associated with an object from the object input stream created from the specified file

setModelCollection

public void setModelCollection(ObjectInputStream ois)
                        throws ClassNotFoundException,
                               IOException,
                               OptionalDataException
Sets the internal modelCollection member of this class to the instance loaded from the specified input stream.

Parameters:
ois - an object input stream containing a series of header objects and ultimately a ModelCollection instance
Throws:
ClassNotFoundException - if the concrete type of the ModelCollection or any of the header objects in the specified input stream cannot be found
IOException - if there is a problem reading from the specified input stream
OptionalDataException - if there is a problem reading primitive data associated with an object from the specified object input stream

loadModelCollection

public static ModelCollection loadModelCollection(ObjectInputStream ois)
                                           throws ClassNotFoundException,
                                                  IOException,
                                                  OptionalDataException
Loads the ModelCollection from the specified object input stream.

Parameters:
ois - the object input stream from which to load a ModelCollection instance; the stream must contain a series of header objects, as produced by writeModelCollection(String,String,String)
Returns:
the ModelCollection object contained in the specified input stream
Throws:
ClassNotFoundException - if the concrete type of the ModelCollection or any of the header objects in the specified input stream cannot be found
IOException - if there is a problem reading from the specified input stream
OptionalDataException - if there is a problem reading primitive data associated with an object from the specified object input stream

scanModelCollectionObjectFile

public static void scanModelCollectionObjectFile(String scanObjectFilename,
                                                 OutputStream os)
                                          throws ClassNotFoundException,
                                                 IOException,
                                                 OptionalDataException
Scans the object file and prints out the information contained in its header objects. The specified object file must contain serialized objects of the type and in the order produced by writeModelCollection(String,String,String).

Parameters:
scanObjectFilename - the name of the object file whose header objects are to be scanned
os - the output stream to which to print information
Throws:
ClassNotFoundException - if any of the concrete types of the header objects in the specified file cannot be found
IOException - if there is a problem reading from the stream created from the specified file
OptionalDataException - if there is a problem of extra primitive data when deserializing an object from the object input stream created from the specified file

scanModelCollectionObjectFile

public static void scanModelCollectionObjectFile(ObjectInputStream ois,
                                                 OutputStream os)
                                          throws ClassNotFoundException,
                                                 IOException,
                                                 OptionalDataException
Scans the object input stream and prints out the information contained in its header objects. The specified stream must contain serialized objects of the type and in the order produced by writeModelCollection(String,String,String).

Parameters:
ois - the object input stream whose header objects are to be scanned
os - the output stream to which to print information
Throws:
ClassNotFoundException - if any of the concrete types of the header objects in the specified stream cannot be found
IOException - if there is a problem reading from the specified stream
OptionalDataException - if there is a problem of extra primitive data when deserializing an object from the specified object input stream
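
The write, load and scan methods above all rely on one layout: a series of header objects followed by the main object, all written with standard Java serialization. The sketch below illustrates that layout with an assumed header of exactly two strings (the training input and output filenames) and an arbitrary Serializable payload in place of a ModelCollection. The real header may well contain other or additional objects; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a "header objects, then payload" object file.
class HeaderedObjectFileSketch {
    // Writes two header strings followed by the payload.
    static byte[] write(String trainingInput, String trainingOutput,
                        Serializable payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(trainingInput);
            oos.writeObject(trainingOutput);
            oos.writeObject(payload);
        }
        return bytes.toByteArray();
    }

    // Reads only the header objects, leaving the payload undeserialized.
    static String[] scanHeader(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return new String[] {(String) ois.readObject(), (String) ois.readObject()};
        }
    }

    // Skips past the header objects and returns the payload.
    static Object load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            ois.readObject();
            ois.readObject();
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = write("train.mrg", "train.observed", "payload");
        String[] header = scanHeader(data);
        System.out.println(header[0] + " " + header[1] + " " + load(data));
    }
}
```

Placing small descriptive objects first is what lets a scan report on a file without deserializing the much larger object that follows.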

incrementallyTrain

protected static void incrementallyTrain(Trainer trainer,
                                         String inputFilename)
                                  throws FileNotFoundException,
                                         UnsupportedEncodingException,
                                         IOException
Incrementally updates derived model counts by reading chunks of TrainerEvent objects from the specified input file. The number of TrainerEvent objects read at a time (the chunk size) is determined by the value of the Settings.maxEventChunkSize setting.

Parameters:
trainer - the Trainer instance for which incremental training is to be performed
inputFilename - the file containing observations to be read by the readStats(SexpTokenizer,int) method
Throws:
FileNotFoundException - if the specified file cannot be found
UnsupportedEncodingException - if the encoding used to read characters from the specified file is not supported
IOException - if there is a problem reading from the specified file
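
The chunked reading that underlies incremental training can be sketched with an ordinary iterator. The names below are hypothetical, and what the trainer actually does with each chunk is not spelled out here, so the per-chunk work is left to a caller-supplied callback. The convention that a limit below 1 means "read everything" follows the description of readStats(SexpTokenizer,int).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for reading observations in bounded chunks.
class ChunkedReadSketch {
    // Reads at most maxEvents items; a value below 1 reads all remaining items.
    static <T> List<T> readChunk(Iterator<T> source, int maxEvents) {
        List<T> chunk = new ArrayList<>();
        while (source.hasNext() && (maxEvents < 1 || chunk.size() < maxEvents)) {
            chunk.add(source.next());
        }
        return chunk;
    }

    // Repeatedly reads a chunk and hands it to update until the source is
    // exhausted; returns the number of chunks processed.
    static <T> int processInChunks(Iterator<T> source, int chunkSize,
                                   Consumer<List<T>> update) {
        int chunks = 0;
        while (source.hasNext()) {
            update.accept(readChunk(source, chunkSize));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        Iterator<String> events = Arrays.asList("a", "b", "c", "d", "e").iterator();
        int chunks = processInChunks(events, 2, chunk -> System.out.println(chunk));
        System.out.println(chunks + " chunks");
    }
}
```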

main

public static void main(String[] args)
Takes arguments according to the usage as specified in usageMsg. Please run java danbikel.parser.Trainer -help to display the complete usage of this class.


Author: Dan Bikel.