Important Properties of Language
Certain properties of language are commonly used both in natural language processing and in the general analysis of language. Two of them, bigrams and recursion, play a significant role in this project.
An n-gram is a sequence of n consecutive items, typically letters or words. The most commonly used n-grams are bigrams (n = 2) and trigrams (n = 3). N-grams are often used in cryptanalysis: a common method of breaking a message encrypted with a keyword, as in the Vigenère cipher, is to measure the distance in letters between repeated occurrences of the same bigram. Because the keyword length is likely to divide these distances, they help determine the length of the keyword, and then the keyword itself.
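The distance-based attack described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete cryptanalysis tool. The function names are hypothetical, and the ciphertext is an illustrative assumption: the plaintext "cryptoisshortforcryptography" encrypted with the keyword "abcd". Taking the greatest common divisor of all distances is also a simplification. On real messages, chance repetitions occur, so analysts usually look for the most frequent common factor instead.

```python
from collections import defaultdict
from functools import reduce
from math import gcd


def repeated_bigram_distances(ciphertext):
    """Map each repeated letter bigram to the distances between its consecutive occurrences."""
    letters = [c for c in ciphertext.upper() if c.isalpha()]
    positions = defaultdict(list)
    for i in range(len(letters) - 1):
        positions[letters[i] + letters[i + 1]].append(i)
    return {
        bigram: [later - earlier for earlier, later in zip(idx, idx[1:])]
        for bigram, idx in positions.items()
        if len(idx) > 1
    }


def guess_key_length(ciphertext):
    """Simplified estimate: the GCD of every distance between repeated bigrams."""
    distances = [
        d for ds in repeated_bigram_distances(ciphertext).values() for d in ds
    ]
    return reduce(gcd, distances) if distances else None


# Illustrative ciphertext (assumed example, keyword length 4).
CIPHERTEXT = "CSASTPKVSIQUTGQUCSASTPIUAQJB"
print(guess_key_length(CIPHERTEXT))  # prints 4
```

Here the bigrams CS, SA, AS, ST and TP each repeat at a distance of 16 letters and QU repeats at a distance of 4, so the estimate of 4 matches the length of the assumed keyword.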
In natural language processing, letter bigrams are used for word discrimination, that is, the identification of an unknown word based on how closely its bigrams correspond to those of words in a reference list of known words. In addition, word bigrams are used in some lexical parsers, the Markov Model for bag generation, and several text categorization tools.
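As a rough illustration of word discrimination, the sketch below compares the letter bigrams of an unknown word with those of each word in a reference list and returns the best match. The choice of the Dice coefficient as the measure of correspondence is an assumption made for this example, as are the function names. The text above does not specify a particular measure.

```python
def letter_bigrams(word):
    """Return the set of adjacent letter pairs in a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice_similarity(a, b):
    """Dice coefficient on bigram sets: 2 * |shared| / (|A| + |B|)."""
    x, y = letter_bigrams(a), letter_bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def closest_known_word(unknown, known_words):
    """Pick the reference word whose bigrams best correspond to the unknown word."""
    return max(known_words, key=lambda w: dice_similarity(unknown, w))


reference = ["battle", "beetle", "paddle", "puddle"]
print(closest_known_word("batle", reference))  # prints battle
```

The misspelled "batle" shares four of its bigrams (ba, at, tl, le) with "battle", but at most two with any other reference word, so "battle" is selected.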
The principle of recursion is an essential aspect of human language, and is considered one of the primary ways in which children learn a language. Recursion can show that, in a sentence with a particular pattern, a word with a specific part of speech can be exchanged for another word with the same part of speech:
"The boy wears a hat." → "The dog wears a collar."
(A B_1 C A B_2) → (A B_3 C A B_4)
In addition, the words of a sentence can remain unchanged, while the pattern changes, thereby introducing a new part of speech:
"The boy wears a hat." → "The boy wears a big hat."
(A B C A B) → (A B C A D B)
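The two substitution examples above can be mimicked with a toy lexicon. The sketch below is only an illustration under stated assumptions: the labels A, B, C and D follow the patterns shown above, while the lexicon contents and the function names are hypothetical. When a new sentence lines up word for word with a known sentence, each unknown word is assigned the label of the word it replaces.

```python
def tokenize(sentence):
    """Lowercase a sentence and split it into words, dropping the final period."""
    return sentence.lower().rstrip(".").split()


def sentence_pattern(sentence, lexicon):
    """Return the label of each word, or '?' for a word not yet in the lexicon."""
    return [lexicon.get(word, "?") for word in tokenize(sentence)]


def learn_by_substitution(template, new_sentence, lexicon):
    """Infer labels for unknown words that occupy the slot of a known word."""
    old_words, new_words = tokenize(template), tokenize(new_sentence)
    if len(old_words) != len(new_words):
        return {}  # patterns differ in length; this simple alignment cannot apply
    learned = {}
    for old, new in zip(old_words, new_words):
        if old not in lexicon:
            return {}  # the template itself is not fully understood
        if new in lexicon:
            if lexicon[new] != lexicon[old]:
                return {}  # a known word breaks the pattern
        else:
            learned[new] = lexicon[old]
    return learned


# Toy lexicon (assumed): A = article, B = noun, C = verb.
lexicon = {"the": "A", "a": "A", "boy": "B", "hat": "B", "wears": "C"}

print(sentence_pattern("The boy wears a hat.", lexicon))
print(learn_by_substitution("The boy wears a hat.", "The dog wears a collar.", lexicon))
print(sentence_pattern("The boy wears a big hat.", lexicon))
```

In the third call, "big" is reported as "?": the words around it are unchanged, but the pattern has gained a slot, which signals a part of speech not yet in the lexicon (D in the pattern above).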
Finally, recursion can be used to indicate the grammatical structures of a language. This occurs in the following poem:
When tweetle beetles fight,
it's called a tweetle beetle battle.
And when they
battle in a puddle,
it's a tweetle
beetle puddle battle.
AND when tweetle beetles
battle with paddles in a puddle,
they call it a tweetle
beetle puddle paddle battle.
Excerpt from Dr. Seuss' Fox in Socks.
The author uses the recursive principle to indicate that "tweetle beetle" can act both as a noun and as an adjective, and then repeats this demonstration with "puddle" and "paddle". Further, the correct placement of a new noun acting as an adjective is shown to be between the existing string of nouns acting as adjectives and the object noun. Conversely, the poem illustrates that a noun acting as an adjective can be rewritten as a prepositional phrase, e.g., "a tweetle beetle puddle battle" can be rephrased as "a tweetle beetle battle in a puddle". Thus, the principle of recursion can allow a child (or a computer) to acquire new vocabulary, new parts of speech, and new forms of grammar.
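The structure the poem demonstrates can be written as a small recursive rule: a noun phrase is either a head noun, or a modifier followed by a noun phrase. The sketch below is a hypothetical illustration of that rule, with assumed function names. The preposition used in the rephrasing is supplied by the caller, since the poem itself uses both "in" and "with".

```python
def noun_phrase(modifiers, head):
    """Recursive rule: phrase = head, or first modifier + phrase over the rest."""
    if not modifiers:
        return head
    return modifiers[0] + " " + noun_phrase(modifiers[1:], head)


def add_modifier(modifiers, new_noun):
    """Place a new noun-as-adjective after the existing modifiers, before the head."""
    return modifiers + [new_noun]


def unpack_last_modifier(modifiers, head, preposition="in"):
    """Rephrase the last modifier as a prepositional phrase following the head."""
    if not modifiers:
        return head
    *rest, last = modifiers
    return f"{noun_phrase(rest, head)} {preposition} a {last}"


modifiers = ["tweetle", "beetle"]
print(noun_phrase(modifiers, "battle"))

modifiers = add_modifier(modifiers, "puddle")
print(noun_phrase(modifiers, "battle"))
print(unpack_last_modifier(modifiers, "battle"))

modifiers = add_modifier(modifiers, "paddle")
print(noun_phrase(modifiers, "battle"))
```

Each call to add_modifier corresponds to one stanza of the poem: the phrase grows by one noun, and the same rule applies again to the longer phrase.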