This project explores language acquisition in computers, guided by techniques similar to those children use. While existing natural language processing methods are limited in scope and understanding, our system aims to gain an understanding of language from first principles, and hence from minimal initial input.
The first portion of our system focuses on understanding the morphology, or word structure, of a language using bigrams, which are two-letter sequences within words. The program was developed first in C++ and then translated into Java to take advantage of its support for non-standard characters. We use the bigram frequency distributions of texts, and the differences between those distributions, to define and distinguish languages. English and French texts were analyzed to determine a difference threshold of 55, at or above which two texts are considered to be in different languages. The program was also tested with Spanish texts and found to work with the same threshold, and the frequency distributions for the individual languages were analyzed.
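As a rough illustration of the bigram comparison, the following Java sketch computes a bigram frequency distribution for each text and a difference between two distributions. The class and method names (`BigramProfile`, `frequencies`, `distance`, `sameLanguage`) are hypothetical, and the difference measure used here, the sum of absolute differences between percentage frequencies, is an assumption: this summary does not specify the measure, so the threshold of 55 may not correspond to this scale.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class BigramProfile {
    // Threshold quoted in the text; assumed (not confirmed) to apply to distance() below.
    static final double THRESHOLD = 55.0;

    /** Returns each bigram's share of all bigrams in the text, as a percentage. */
    static Map<String, Double> frequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        // Split on anything that is not a letter, so bigrams never span two words.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            for (int i = 0; i + 1 < word.length(); i++) {
                counts.merge(word.substring(i, i + 2), 1, Integer::sum);
                total++;
            }
        }
        Map<String, Double> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freq.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return freq;
    }

    /** Assumed measure: sum of absolute differences of percentage frequencies (0 to 200). */
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double sum = 0.0;
        for (String k : keys) {
            sum += Math.abs(a.getOrDefault(k, 0.0) - b.getOrDefault(k, 0.0));
        }
        return sum;
    }

    /** Two texts count as the same language when their difference stays below the threshold. */
    static boolean sameLanguage(String first, String second) {
        return distance(frequencies(first), frequencies(second)) < THRESHOLD;
    }

    public static void main(String[] args) {
        String english = "the weather is rather cold this morning";
        String french = "le temps est plutôt froid ce matin";
        System.out.println(distance(frequencies(english), frequencies(french)));
    }
}
```

Short samples like those in `main` give noisy distributions, so a sketch of this kind would only be meaningful on texts long enough for the bigram frequencies to stabilize.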
The second portion of our system focuses on understanding the syntax, or sentence structure, of a language using a recursive method. The program uses one of two methods to analyze a given sentence, based either on sentence patterns or on surrounding words. Both methods have been implemented in C++. Using a minimum of initial input, the program is able to understand the structure of simple sentences and learn new words.
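To make the idea of learning a new word from a sentence pattern concrete, here is a minimal Java sketch. It is not the program described above (which is written in C++ and is recursive); it is a simplified, non-recursive illustration under stated assumptions: words carry a single category label, a pattern is a fixed sequence of categories, and a sentence with exactly one unknown word that otherwise fits a pattern teaches the program that word's category. The class `PatternLearner`, its methods, and the category labels are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PatternLearner {
    // Known words and their (assumed single) category, e.g. "dog" -> "NOUN".
    private final Map<String, String> lexicon = new HashMap<>();
    // Known sentence patterns, each a sequence of categories.
    private final List<List<String>> patterns = new ArrayList<>();

    void addWord(String word, String category) {
        lexicon.put(word.toLowerCase(Locale.ROOT), category);
    }

    void addPattern(String... categories) {
        patterns.add(Arrays.asList(categories));
    }

    String categoryOf(String word) {
        return lexicon.get(word.toLowerCase(Locale.ROOT));
    }

    /**
     * Returns true if the sentence fits a known pattern. If it fits with exactly
     * one unknown word, that word is added to the lexicon with the category the
     * pattern expects at its position. The first fitting pattern wins.
     * Assumes the sentence contains no punctuation.
     */
    boolean analyze(String sentence) {
        String[] words = sentence.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (List<String> pattern : patterns) {
            if (pattern.size() != words.length) {
                continue;
            }
            int unknownIndex = -1;
            boolean fits = true;
            for (int i = 0; i < words.length && fits; i++) {
                String known = lexicon.get(words[i]);
                if (known == null) {
                    if (unknownIndex >= 0) {
                        fits = false; // more than one unknown word: too little evidence
                    } else {
                        unknownIndex = i;
                    }
                } else if (!known.equals(pattern.get(i))) {
                    fits = false; // a known word contradicts the pattern
                }
            }
            if (fits) {
                if (unknownIndex >= 0) {
                    lexicon.put(words[unknownIndex], pattern.get(unknownIndex));
                }
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PatternLearner learner = new PatternLearner();
        learner.addWord("the", "DET");
        learner.addWord("dog", "NOUN");
        learner.addWord("runs", "VERB");
        learner.addPattern("DET", "NOUN", "VERB");
        learner.analyze("the cat runs");
        System.out.println(learner.categoryOf("cat"));
    }
}
```

Restricting learning to sentences with a single unknown word keeps the initial input small while avoiding guesses the pattern cannot support; a fuller treatment would also need to handle words with more than one category.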
In addition, we offer suggestions for future work and potential extensions of the existing program. We describe how the program currently analyzes certain sentence constructs and how it could be modified to understand them better. We also suggest how a computationally intuitive understanding of semantics, or the meanings of words in a language, might be implemented.