Syntax: Algorithm

To analyze the syntax of a language, a "recursive learning" method is implemented using C++. Since the program would ideally require an absolute minimum of initial information, this method takes a small initial set of words and builds on this by alternately using known words to learn new sentence structures and using known sentence structures to learn new words as seen below.

{cat, man, has} → "The man has a cat" → "The x y a z" → "The man wore a hat" → {cat, man, has, wore, hat}

Representing Information

The information this program gathers can be split into two pieces, information on the words themselves and information on valid sentence structures. For words, the program keeps track of the word itself and the word's type. Only two specific types are defined in the program, "noun" and "verb", and the actual meaning of each is defined by context. Any word which does not fit these definitions is defined relative to these definitions. For sentences, the program keeps track of the number of words in a given sentence pattern and the type of the word in each position (in an array).

"word" datatype
  string word
  string type
"pattern" datatype
  int word_count
  string pattern[]

Sentence Structure Method

This first method analyzes new words based on the structure of the sentence, by selecting the most applicable of existing patterns based on correspondence with known information. The program keeps track of how many "matches" there are between the sentence and the known pattern, taking the one with the most matches (if greater than half the word count) and using it to set unknown word types. This method is particularly useful for learning new nouns and verbs, in situations where other words in the sentence are primarily known grammatical particles, and would be the default method if a sufficient match could be found.

int matches
for all known patterns
  matches = 0
  if sentence.word_count == pattern.word_count
  for all sentence positions i
    if word[i].type == pattern[i].type
  if matches > word_count/2
    for all sentence positions i
      if word[i] undefined
        word[i].type = pattern[i].type

Word Context Method

The other method uses the surrounding words to define the type of any unknown words. The program notes the types of the words before and after the unknown word, and the type of the unknown word is then stated as "a<type of previous word> b<type of next word>". For example, a word with type "anoun bverb" would be one which tends to come after nouns and before verbs. This type could also be analyzed and modified to give further insight into other types of words such as adjectives. The program can redefine known words if new information is found on their positioning, using the new definition if it is shorter, and thus more general, than the previous definition. This method is best used when most of the nouns and verbs in a sentence are known, but other words exist which are not known.

for all sentence positions i
  if word[i] undefined
    word[i].type = ""
    if word not first in sentence
      word[i].type += ("a" + word[i-1].type)
    if word not last in sentence
      word[i].type += ("b" + word[i+1].type)
  else if word[i].type is not noun or verb
    temp_type = ""
    if word not first in sentence
      temp_type += ("a" + temp_type)
    if word not last in sentence
      temp_type += ("b" + temp_type)
    if temp_type.length < word[i].type.length
      word[i].type = temp_type