Parsing Engine

danbikel.parser
Class EMDecoder

java.lang.Object
  extended bydanbikel.parser.Decoder
      extended bydanbikel.parser.EMDecoder
All Implemented Interfaces:
Serializable

public class EMDecoder
extends Decoder

Provides the methods necessary to perform constrained CKY parsing on input sentences so as to perform the E-step of the Inside-Outside EM algorithm.

See Also:
Serialized Form

Nested Class Summary
 
Nested classes inherited from class danbikel.parser.Decoder
Decoder.TimeoutException
 
Field Summary
protected  EMChart chart
          The parsing chart.
protected  double cummulativeInsideLogProb
          The value of all sentences' inside probabilities in log-space.
protected  CountsTable eventCounts
          The map of events to their expected counts (cleared after every sentence).
protected static int MAX_UNARY_PRODUCTIONS
          A hack to limit the number of unary productions, instead of doing The Right Thing and computing infinite sums for looping derivations, as described by Stolcke (1995) and Goodman (1999).
protected static double probCertain
          The value of Constants.probCertain.
protected static double probImpossible
          The value of Constants.probImpossible.
protected  Set topProbItemsToAdd
          A temporary storage area used by addTopUnaries(int) for storing items to be added to the chart when iterating over a cell in the chart.
 
Fields inherited from class danbikel.parser.Decoder
canonicalPrevModLists, canonicalWords, cellLimit, commaForPruning, conjForPruning, constraints, currItemsAdded, downcaseWords, emptySubcat, err, findAtLeastOneSatisfyingConstraint, hardConstraints, headToParentMap, id, isomorphicTreeConstraints, kBest, keepAllWords, LEFT, leftSubcatMap, leftSubcatPS, leftSubcatPSLastLevel, logOfZero, logProbCertain, lookupHeadEvent, lookupLeftStopEvent, lookupModEvent, lookupPriorEvent, lookupRightStopEvent, lookupSubcat, lookupWord, maxParseTime, maxPruneFact, maxSentLen, modNonterminalMap, modNonterminalPS, modNonterminalPSLastLevel, nonterminals, numPrevMods, numPrevWords, originalSentence, originalTags, originalWords, parentHeadSideLookupList, partiallyLexedModLookupList, posMap, posSet, posToExampleWordMap, prevItemsAdded, prevModLookupList, prevModWordLeftLookupList, prevModWordRightLookupList, prunedPretermsPosMap, prunedPretermsPosSet, prunedPunctuationPosMap, pruneFact, pruneFactIncrement, relaxConstraints, restorePrunedWords, RIGHT, rightSubcatMap, rightSubcatPS, rightSubcatPSLastLevel, sentence, sentenceIdx, sentLen, server, simpleModNonterminalMap, startList, startSym, startWord, startWordList, stopProbItemsToAdd, stopSym, stopWord, substituteWordsForClosedClassTags, time, tmpChildrenList, topSym, unaryItemsToAdd, useCommaConstraint, useHeadToParentMap, useLowFreqTags, useOnlySuppliedTags, useSimpleModNonterminalMap, wordSet, zeroSubcatArr
 
Constructor Summary
EMDecoder(int id, DecoderServerRemote server)
          Constructs a new decoder that will use the specified DecoderServer to get all information and probabilities required for decoding (parsing).
 
Method Summary
protected  void addPretermHeadEvent(EMItem item, double expectedCount, CountsTable counts)
          Whenever a preterminal is generated, either as a head child or a modifier of some other item, a trivial head-generation event is added, generating the word from the lexicalized preterminal, which by design always generates its head word with probability 1.
protected  List addStopProbs(EMItem item, List itemsAdded, int level)
           
protected  void addSynthesizedTopModEvent(TrainerEvent event, double expectedCount, CountsTable counts)
          Adds an event as though a tree's non-hidden root is a modifier of +TOP+ (in addition to being a head child).
protected  void addTopUnaries(int end)
          Adds hiden root nonterminal probabilities.
protected  List addUnaries(EMItem item, List itemsAdded, int level)
           
protected  void addUnariesAndStopProbs(int start, int end)
          Finds all possible parent-head (or unary) productions using the root node of each existing chart item within the specified span as the head, creates new items based on these existing items, multiplying in the parent-head probability; then, using these new items, this method also creates additional new items in which stop probabilities have been multiplied; all new items are added to the chart.
protected  void complete(int start, int end)
          Constructs all possible items spanning the specified indices and adds them to the chart.
protected  CountsTable computeEventCounts()
          Returns a counts table with the expected couunt of all top-level events produced when constrain-parsing the current sentence.
protected  void computeEventCounts(int start, int end, double sentenceProbInverse, CountsTable counts)
          Computes expected counts for top-level (maximal context) events produced for the specified span when decoding the current sentence; stores these events and their expected counts in the specified CountsTable object.
protected  void computeOutsideProbs()
          Computes outside probabilities for the entire chart.
protected  void computeOutsideProbs(int start, int end)
          Computes outside probabilities for all derivations in the specified span.
protected  void joinItems(EMItem modificand, EMItem modifier, boolean side)
          Joins two chart items, one representing the modificand that has not yet received its stop probabilities, the other representing the modifier that has received its stop probabilities.
protected  CountsTable parseAndCollectEventCounts(SexpList sentence)
          Constrain-parses the specified sentence and computes expected top-level (maximal context) event counts.
protected  CountsTable parseAndCollectEventCounts(SexpList sentence, SexpList tags)
          Constrain-parses the specified sentence and computes expected top-level (maximal context) event counts.
protected  CountsTable parseAndCollectEventCounts(SexpList sentence, SexpList tags, ConstraintSet constraints)
          Constrain-parses the specified sentence and computes expected top-level (maximal context) event counts.
protected  void seedChart(Symbol word, int wordIdx, Symbol features, boolean neverObserved, SexpList tagSet, boolean wordIsUnknown, Symbol origWord, ConstraintSet constraints)
          Adds a chart item for every possible part of speech for the specified word at the specified index in the current sentence.
 
Methods inherited from class danbikel.parser.Decoder
addStopProbs, addUnaries, commaConstraintViolation, convertHeadToParentMap, convertSubcatMap, convertSubcatMaps, derivationOrderOK, getCanonicalWord, getExampleWordForTag, getPossibleSubcats, getPrevMods, getPrevModWords, getTagSet, initialize, initialize, isPuncRaiseWord, joinItems, parse, parse, parse, postProcess, preProcess, removeWord, restoreOriginalWords, restorePrunedWords, restorePrunedWordsRecursive, setCommaConstraintData, setUnion, wrapCachingServer
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAX_UNARY_PRODUCTIONS

protected static final int MAX_UNARY_PRODUCTIONS
A hack to limit the number of unary productions, instead of doing The Right Thing and computing infinite sums for looping derivations, as described by Stolcke (1995) and Goodman (1999).

See Also:
Constant Field Values

probCertain

protected static final double probCertain
The value of Constants.probCertain.

See Also:
Constant Field Values

probImpossible

protected static final double probImpossible
The value of Constants.probImpossible.

See Also:
Constant Field Values

topProbItemsToAdd

protected Set topProbItemsToAdd
A temporary storage area used by addTopUnaries(int) for storing items to be added to the chart when iterating over a cell in the chart.

Bugs: It is a design error to have created this Set member with the same name as the ArrayList member in the superclass. The designer of this class should be appropriately flogged.


cummulativeInsideLogProb

protected double cummulativeInsideLogProb
The value of all sentences' inside probabilities in log-space. Used to gather training data log-likelihood at the end of each EM iteration.


eventCounts

protected CountsTable eventCounts
The map of events to their expected counts (cleared after every sentence).


chart

protected EMChart chart
The parsing chart.

Constructor Detail

EMDecoder

public EMDecoder(int id,
                 DecoderServerRemote server)
Constructs a new decoder that will use the specified DecoderServer to get all information and probabilities required for decoding (parsing).

Parameters:
id - the id of this parsing client
server - the DecoderServerRemote implementor (either local or remote) that provides this decoder object with information and probabilities required for decoding (parsing)
Method Detail

seedChart

protected void seedChart(Symbol word,
                         int wordIdx,
                         Symbol features,
                         boolean neverObserved,
                         SexpList tagSet,
                         boolean wordIsUnknown,
                         Symbol origWord,
                         ConstraintSet constraints)
                  throws RemoteException
Description copied from class: Decoder
Adds a chart item for every possible part of speech for the specified word at the specified index in the current sentence.

Overrides:
seedChart in class Decoder
Parameters:
word - the current word
wordIdx - the index of the current word in the current sentence
features - the word-feature vector for the current word
neverObserved - indicates whether the current word was never observed during training (a truly unknown word)
tagSet - a list containing all possible part of speech tags for the current word
constraints - the constraint set for this sentence
Throws:
RemoteException - if any calls to the underlying DecoderServerRemote object throw a RemoteException
See Also:
Chart.add(int,int,Item)

parseAndCollectEventCounts

protected CountsTable parseAndCollectEventCounts(SexpList sentence)
                                          throws RemoteException
Constrain-parses the specified sentence and computes expected top-level (maximal context) event counts. This method performs the E-step of the Inside-Outside EM algorithm (with a little bit of M thrown in, in that events are aggregated over the entire sentence).

Parameters:
sentence - a list of symbols representing the words of a sentence
Returns:
a counts table of the top-level (maximal context) events generated while constrain-parsing this sentence
Throws:
RemoteException

parseAndCollectEventCounts

protected CountsTable parseAndCollectEventCounts(SexpList sentence,
                                                 SexpList tags)
                                          throws RemoteException
Constrain-parses the specified sentence and computes expected top-level (maximal context) event counts. This method performs the E-step of the Inside-Outside EM algorithm (with a little bit of M thrown in, in that events are aggregated over the entire sentence).

Parameters:
sentence - a list of symbols representing the words of a sentence
tags - a list of symbols that represent the part-of-speech tags of the words of the specified sentence (coordinated with the specified list of words)
Returns:
a counts table of the top-level (maximal context) events generated while constrain-parsing this sentence
Throws:
RemoteException

parseAndCollectEventCounts

protected CountsTable parseAndCollectEventCounts(SexpList sentence,
                                                 SexpList tags,
                                                 ConstraintSet constraints)
                                          throws RemoteException
Constrain-parses the specified sentence and computes expected top-level (maximal context) event counts. This method performs the E-step of the Inside-Outside EM algorithm (with a little bit of M thrown in, in that events are aggregated over the entire sentence).

Parameters:
sentence - a list of symbols representing the words of a sentence
tags - a list of symbols that represent the part-of-speech tags of the words of the specified sentence (coordinated with the specified list of words)
constraints - a set of parsing constraints
Returns:
a counts table of the top-level (maximal context) events generated while constrain-parsing this sentence
Throws:
RemoteException

computeOutsideProbs

protected void computeOutsideProbs()
Computes outside probabilities for the entire chart. This step depends on having computed inside probabilities via constrain-parsing first.


computeOutsideProbs

protected void computeOutsideProbs(int start,
                                   int end)
Computes outside probabilities for all derivations in the specified span.

Parameters:
start - the index of the first word in the span whose chart items' outside probabilities are to be computed
end - the index of the last word in the span whose chart items' outside probabilities are to be computed

computeEventCounts

protected CountsTable computeEventCounts()
Returns a counts table with the expected couunt of all top-level events produced when constrain-parsing the current sentence.

Returns:
a counts table with the expected couunt of all top-level events produced when constrain-parsing the current sentence.

computeEventCounts

protected void computeEventCounts(int start,
                                  int end,
                                  double sentenceProbInverse,
                                  CountsTable counts)
Computes expected counts for top-level (maximal context) events produced for the specified span when decoding the current sentence; stores these events and their expected counts in the specified CountsTable object.

Parameters:
start - the index of the first word in the span whose expected event counts are to be computed
end - the index of the last word in the span whose expected event counts are to be computed
sentenceProbInverse - the inverse of the total inside probability of the current sentence under the current model
counts - the table in which to store expected event counts

addSynthesizedTopModEvent

protected void addSynthesizedTopModEvent(TrainerEvent event,
                                         double expectedCount,
                                         CountsTable counts)
Adds an event as though a tree's non-hidden root is a modifier of +TOP+ (in addition to being a head child). Without these “fake” events, the last level of the modifier-word model (such as ModWordModelStructure2) would not contain counts for words that are the head of the entire sentence (since they are not generated as modifiers of anything). This enables the (now deprecated) count-sharing scheme to work, whereby the last back-off level of TopLexModelStructure1 would use the p(w|t) counts from the last level of ModWordModelStructure2.

Parameters:
event - the HeadEvent instance for an observed tree root, from which a ModifierEvent is to be produced
expectedCount - the expected count of the specified head-generation event
See Also:
Settings.trainerShareCounts

addPretermHeadEvent

protected void addPretermHeadEvent(EMItem item,
                                   double expectedCount,
                                   CountsTable counts)
Whenever a preterminal is generated, either as a head child or a modifier of some other item, a trivial head-generation event is added, generating the word from the lexicalized preterminal, which by design always generates its head word with probability 1. These extra events allow the trainer to gather counts for all lexicalized nonterminals using only the head-generation event counts table, thereby eliminating the somewhat risky scheme of “count sharing”.

See Also:
Settings.trainerShareCounts

addTopUnaries

protected void addTopUnaries(int end)
                      throws RemoteException
Description copied from class: Decoder
Adds hiden root nonterminal probabilities. That is, for each derivation spanning the entire sentence from index 0 to the specified end index, this method produces new chart items in which the probability of producing that derivation given Training.topSym() has been multiplied to the existing item's score.

Overrides:
addTopUnaries in class Decoder
Parameters:
end - the index of the last word of the sentence being parsed
Throws:
RemoteException

complete

protected void complete(int start,
                        int end)
                 throws RemoteException,
                        Decoder.TimeoutException
Description copied from class: Decoder
Constructs all possible items spanning the specified indices and adds them to the chart. This involves joining modificands (items to be modified) with modifiers when the modificand has not yet received its stop probabilities and when the spans of both modificand and modifier cover the specified span.

Overrides:
complete in class Decoder
Parameters:
start - the index of the first word in the span for which all chart items are to be created and added to the chart
end - the index of the last word in the span for which all chart items are to be created and added to the chart
Throws:
Decoder.TimeoutException - if the boolean value of Settings.maxParseTime is greater than zero has been reached while parsing
RemoteException
See Also:
Decoder.joinItems(CKYItem,CKYItem,boolean)

joinItems

protected void joinItems(EMItem modificand,
                         EMItem modifier,
                         boolean side)
                  throws RemoteException
Joins two chart items, one representing the modificand that has not yet received its stop probabilities, the other representing the modifier that has received its stop probabilities.

Parameters:
modificand - the chart item representing a partially-completed subtree, to be modified on side by modifier
modifier - the chart item representing a completed subtree that will be added as a modifier on side of modificand's subtree
side - the side on which to attempt to add the specified modifier to the specified modificand
Throws:
RemoteException

addUnariesAndStopProbs

protected void addUnariesAndStopProbs(int start,
                                      int end)
                               throws RemoteException
Description copied from class: Decoder
Finds all possible parent-head (or unary) productions using the root node of each existing chart item within the specified span as the head, creates new items based on these existing items, multiplying in the parent-head probability; then, using these new items, this method also creates additional new items in which stop probabilities have been multiplied; all new items are added to the chart. Stop probabilities are the probabilities associated with generating Training.stopSym() as a modifier on either side of a production.

Overrides:
addUnariesAndStopProbs in class Decoder
Parameters:
start - the index of the first word in the span
end - the index of the last word in the span
Throws:
RemoteException
See Also:
Decoder.addUnaries(CKYItem, java.util.List), Decoder.addStopProbs(CKYItem, java.util.List)

addUnaries

protected List addUnaries(EMItem item,
                          List itemsAdded,
                          int level)
                   throws RemoteException
Throws:
RemoteException

addStopProbs

protected List addStopProbs(EMItem item,
                            List itemsAdded,
                            int level)
                     throws RemoteException
Throws:
RemoteException

Parsing Engine

Author: Dan Bikel.