# From Machine Learning to Machine Reasoning

## Introduction

Learning and reasoning are both essential abilities associated with intelligence, and both machine learning and machine reasoning have received considerable attention over the short history of computer science. The statistical nature of machine learning is now well understood, but the ideas behind machine reasoning are much more elusive. Converting ordinary data into a set of logical rules proves to be very challenging: searching the discrete space of symbolic formulas leads to combinatorial explosion <ref>Lighthill, J. "Artificial intelligence: a general survey." In Artificial intelligence: a paper symposium. Science Research Council.</ref>. Algorithms for probabilistic inference <ref>Pearl, J. "Causality: models, reasoning, and inference." Cambridge: Cambridge University Press.</ref> still suffer from unfavourable computational properties <ref>Roth, D. "On the hardness of approximate reasoning" Artificial Intelligence, 82, 273–302.</ref>. Tractable inference algorithms do exist, but they come at the price of reduced expressive capabilities in both logical and probabilistic inference.

Humans display neither of these limitations.

The ability to reason is not the same as the ability to make logical inferences. The way humans reason suggests the existence of a middle layer that is already a form of reasoning but is not yet formal or logical. Informal logic is attractive because we hope to avoid the computational complexity associated with combinatorial searches in the vast space of discrete logic propositions.

Deep learning and multi-task learning show that we can leverage auxiliary tasks to help solve a task of interest. This idea can be interpreted as a rudimentary form of reasoning.

## Auxiliary Tasks

To see why an auxiliary task can be relevant, consider the task of identifying a person from face images. It remains expensive to collect and label millions of images representing the face of each subject with a good variety of positions and contexts. However, it is easier to collect training data for the slightly different task of telling whether two faces in images represent the same person or not: two faces in the same picture are likely to belong to two different people; two faces in successive video frames are likely to belong to the same person. These two tasks have much in common: image analysis primitives, feature extractors, and part recognizers trained on the auxiliary task can help solve the original task.

The figure below illustrates a transfer learning strategy involving three trainable models. The preprocessor P computes a compact face representation of the image, the comparator D compares the representations of two images, and the classifier C labels the face. We first assemble two instances of the preprocessor P and one comparator D and train this model with abundant labels for the auxiliary task. Then we assemble another instance of P with the classifier C and train the resulting model using a limited number of labelled examples from the original task.
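The way the three modules are assembled can be sketched in a few lines of Python. This is only an illustration of the composition, not the architecture used in the source: the layer shapes, the `tanh` and sigmoid choices, and all names (`preprocessor`, `comparator`, `classifier`, `W_p`, `w_d`, `W_c`) are assumptions, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for the illustration.
IMAGE_DIM, REPR_DIM, NUM_PEOPLE = 64, 8, 5

# Parameters of the three trainable modules (randomly initialised, untrained).
W_p = rng.normal(scale=0.1, size=(REPR_DIM, IMAGE_DIM))   # preprocessor P
w_d = rng.normal(scale=0.1, size=REPR_DIM)                # comparator D
W_c = rng.normal(scale=0.1, size=(NUM_PEOPLE, REPR_DIM))  # classifier C


def preprocessor(image):
    """P: map a flattened image to a compact representation."""
    return np.tanh(W_p @ image)


def comparator(rep_a, rep_b):
    """D: score in (0, 1) for 'both representations show the same person'."""
    logit = w_d @ (rep_a * rep_b)
    return 1.0 / (1.0 + np.exp(-logit))


def classifier(rep):
    """C: probability distribution over person identities (softmax)."""
    logits = W_c @ rep
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def auxiliary_model(image_a, image_b):
    """Two instances of P (sharing W_p) feeding one comparator D."""
    return comparator(preprocessor(image_a), preprocessor(image_b))


def original_model(image):
    """The same P, reused, feeding the classifier C."""
    return classifier(preprocessor(image))


image_a, image_b = rng.normal(size=(2, IMAGE_DIM))
same_person_prob = auxiliary_model(image_a, image_b)
identity_probs = original_model(image_a)
```

Because both models call the same `preprocessor`, whatever `W_p` learns on the auxiliary task is available to the original task without any copying.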

## Reasoning Revisited

Little attention has been paid to the rules that describe how to assemble trainable models that perform specific tasks. However, these composition rules play an extremely important role, as they describe algebraic manipulations that let us combine previously acquired knowledge in order to create a model that addresses a new task.

We now draw a bold parallel: "algebraic manipulation of previously acquired knowledge in order to answer a new question" is a plausible definition of the word "reasoning".

Composition rules can be described with very different levels of sophistication. For instance, graph transformer networks (depicted in the figure below) <ref>Bottou, L., LeCun, Y., & Bengio, Y. "Global training of document processing systems using graph transformer networks." In Proc. of computer vision and pattern recognition (pp. 489–493). New York: IEEE Press.</ref> construct specific recognition and training models for each input image using graph transduction algorithms. The specification of the graph transducers should then be viewed as a description of the composition rules.

## Probabilistic Models

Graphical models describe the factorization of joint probability distributions into elementary conditional distributions with specific independence assumptions. The probabilistic rules then induce an algebraic structure on the space of conditional probability distributions, describing relations in an arbitrary set of random variables.
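As a small worked illustration (the three variables $A$, $B$, $C$ and the chain structure are hypothetical, chosen only for this example), a chain-structured graphical model factorizes the joint distribution as

$$
P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid B), \qquad C \perp A \mid B .
$$

The usual probabilistic rules then let us derive new conditional distributions from known ones, for instance by Bayes' rule:

$$
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{\sum_{a} P(B \mid A = a)\,P(A = a)} .
$$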

## Reasoning Systems

We are no longer fitting a simple statistical model to data. Instead, we are dealing with a more complex object consisting of (1) an algebraic space of models and (2) composition rules that establish a correspondence between the space of models and the space of questions of interest. We call such an object a "reasoning system".

Reasoning systems vary in expressive power, predictive abilities and computational requirements. A few examples include:

*First order logic reasoning*- Consider a space of models composed of functions that predict the truth value of a first order logic formula as a function of its free variables. This space is highly constrained by algebraic structure and hence, if we know some of these functions, we can apply logical inference to deduce or constrain other functions. First order logic is highly expressive because the bulk of mathematics can be formalized as first order logic statements <ref>Hilbert, D., & Ackermann, W."Grundzüge der theoretischen Logik." Berlin: Springer.</ref>. However, first order logic is not sufficient to express natural language: every first order logic formula can be expressed in natural language, but the converse is not true. Finally, first order logic usually leads to computationally expensive algorithms.

*Probabilistic reasoning*- Consider a space of models formed by all the conditional probability distributions associated with a set of predefined random variables. These conditional distributions are highly constrained by algebraic structure and hence, we can apply Bayesian inference to form deductions. Probability models are computationally less expensive, but this comes at the price of lower expressive power: probability theory can be described by first order logic, but the converse is not true.

*Causal reasoning*- The events "it is raining" and "people carry open umbrellas" are highly correlated and predictive: if people carry open umbrellas, then it is likely that it is raining. This does not, however, tell you the consequences of an intervention: banning umbrellas will not stop the rain.

*Newtonian Mechanics*- Classical mechanics is an example of the great predictive power of causal reasoning. Newton's three laws of motion make very accurate predictions about the motion of bodies in our universe.

*Spatial reasoning*- The way a visual scene changes when the viewpoint changes is also subject to algebraic constraints.

*Social reasoning*- Changes of viewpoints also play a very important role in social interactions.

*Non-falsifiable reasoning*- Examples of non-falsifiable reasoning include mythology and astrology. Just like non-falsifiable statistical models, non-falsifiable reasoning systems are unlikely to have useful predictive capabilities.
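Returning to the umbrella example from the causal reasoning item above, the difference between observing and intervening can be made concrete with a toy two-variable model. The probabilities below are invented for the illustration, and the function names are hypothetical; the sketch assumes the only causal arrow runs from rain to umbrellas.

```python
# Hypothetical numbers for a toy model in which rain causes open umbrellas.
P_RAIN = 0.3
P_UMBRELLA_GIVEN_RAIN = {True: 0.9, False: 0.05}  # P(umbrella | rain)


def p_rain_given_umbrella(umbrella: bool) -> float:
    """Observation: condition on what we see, using Bayes' rule."""
    def likelihood(rain: bool) -> float:
        p = P_UMBRELLA_GIVEN_RAIN[rain]
        return p if umbrella else 1.0 - p

    joint_rain = likelihood(True) * P_RAIN
    joint_dry = likelihood(False) * (1.0 - P_RAIN)
    return joint_rain / (joint_rain + joint_dry)


def p_rain_do_umbrella(umbrella: bool) -> float:
    """Intervention: forcing the umbrella variable cuts the arrow
    rain -> umbrella, so the distribution of rain is unchanged."""
    return P_RAIN


observed = p_rain_given_umbrella(True)   # seeing umbrellas makes rain likely
after_ban = p_rain_do_umbrella(False)    # banning umbrellas changes nothing
```

Seeing umbrellas raises the probability of rain well above its prior, while forcing umbrellas away leaves it at the prior: the correlation is predictive, but it says nothing about the effect of the intervention.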

It is desirable to map the universe of reasoning systems, but unfortunately we cannot expect such theoretical advances on schedule. We can, however, nourish our intuitions by empirically exploring the capabilities of algebraic structures designed for specific applicative domains.

The replication of essential human cognitive processes such as scene analysis, language understanding, and social interactions forms an important class of applications. These processes probably include a form of logical reasoning because we are able to explain our conclusions with logical arguments. However, the actual processes happen without conscious involvement, suggesting that the full complexity of logical reasoning is not required.

The following sections describe more specific ideas investigating reasoning systems suitable for natural language processing and vision tasks.

## Association and Dissociation

We consider again a collection of trainable modules. The word embedding module W computes a continuous representation for each word of the dictionary. The association module A is a trainable function that takes two vectors in the representation space and produces a single vector in the same space, which is supposed to represent the association of the two inputs. Given a sentence segment composed of *n* words, the figure below shows how *n-1* applications of the association module reduce the sentence segment to a single vector. We would like this vector to be a representation of the meaning of the sentence, and each intermediate result to represent the meaning of the corresponding sentence fragment.
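The reduction of a bracketed sentence to a single vector can be sketched as a short recursion. This is a minimal illustration under stated assumptions: the embedding table, the single-layer `tanh` form of the association module, the dimension `DIM`, and the names `embed`, `associate` and `reduce_tree` are all hypothetical, and the parameters are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # hypothetical size of the representation space
VOCAB = ["the", "cat", "sat", "on", "mat"]

# Module W: one vector per dictionary word (random, untrained).
embeddings = {word: rng.normal(size=DIM) for word in VOCAB}
# Module A: maps two vectors of size DIM to one vector of size DIM.
A_weights = rng.normal(scale=0.5, size=(DIM, 2 * DIM))


def embed(word):
    return embeddings[word]


def associate(left, right):
    return np.tanh(A_weights @ np.concatenate([left, right]))


def reduce_tree(tree, counter):
    """Reduce a bracketed sentence to one vector.

    `tree` is either a word or a pair (left, right);
    `counter[0]` counts the applications of the association module.
    """
    if isinstance(tree, str):
        return embed(tree)
    left, right = tree
    counter[0] += 1
    return associate(reduce_tree(left, counter), reduce_tree(right, counter))


sentence = (("the", "cat"), ("sat", ("on", ("the", "mat"))))
calls = [0]
meaning = reduce_tree(sentence, calls)
```

For this six-word sentence, `calls[0]` ends at five, matching the *n-1* applications described above, and every recursive call returns the vector of one sentence fragment.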

(figure 6 here)

The same sentence can be bracketed in many ways, and different bracketings can yield different meanings. The figure above, for example, corresponds to the bracketing "*((the cat) (sat (on (the mat))))*". To determine which bracketing splits the sentence into the most meaningful fragments, we introduce a scoring module R that takes the representation of a sentence fragment and measures how meaningful that fragment is.

(figure 7 here)

The idea is to apply the R module to every intermediate result and sum all of the scores to get a global score. The task, then, is to find a bracketing that maximizes this score. There is also the challenge of training these modules to achieve the desired function. The figure below illustrates a model inspired by Collobert et al.<ref>Collobert, R., & Weston, J. "Fast semantic extraction using a novel neural network architecture." In Proc. 45th annual meeting of the association of computational linguistics (ACL) (pp. 560–567).</ref><ref>Collobert, R. "Deep learning for efficient discriminative parsing." In Proc. artificial intelligence and statistics (AISTAT).</ref> Training uses stochastic gradient descent: during each iteration, a short sentence is randomly selected from a large corpus and bracketed as shown in the figure. An arbitrary word is then replaced by a random word from the vocabulary. The parameters of all the modules are then adjusted using a simple gradient descent step so that fragments of the genuine sentence score higher than fragments of the corrupted one.
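One simple way to search for a high-scoring bracketing is a greedy procedure that repeatedly merges the adjacent pair with the best score. This is a heuristic chosen for the sketch, not necessarily the search used in the source, and it does not guarantee the maximum global score. The module forms, the random parameters, and the names `associate`, `saliency`, `greedy_bracketing` and `leaves` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 4  # hypothetical size of the representation space
A_weights = rng.normal(scale=0.5, size=(DIM, 2 * DIM))  # association module A
r_weights = rng.normal(size=DIM)                        # scoring module R


def associate(left, right):
    return np.tanh(A_weights @ np.concatenate([left, right]))


def saliency(vector):
    """R: scalar score of one intermediate result."""
    return float(r_weights @ vector)


def greedy_bracketing(words, vectors):
    """Merge the best-scoring adjacent pair until one item remains.

    Returns the bracketing as nested pairs and the sum of the scores
    of all intermediate results.
    """
    items = list(zip(words, vectors))  # (tree, vector) pairs
    total = 0.0
    while len(items) > 1:
        candidates = []
        for i in range(len(items) - 1):
            merged = associate(items[i][1], items[i + 1][1])
            candidates.append((saliency(merged), i, merged))
        score, i, merged = max(candidates, key=lambda c: c[0])
        tree = (items[i][0], items[i + 1][0])
        items[i:i + 2] = [(tree, merged)]
        total += score
    return items[0][0], total


def leaves(tree):
    """Words of a bracketing, read left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


words = ["the", "cat", "sat", "on", "the", "mat"]
vectors = [rng.normal(size=DIM) for _ in words]
tree, score = greedy_bracketing(words, vectors)
```

With untrained random parameters the resulting bracketing is arbitrary; the point is only that the global score is the sum of the R scores of the *n-1* intermediate results, and that only adjacent fragments are ever merged.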

(figure 8 here)

In order to investigate how well the system maps words to the representation space, all two-word sequences of the 500 most common words were constructed and mapped into the representation space. The figure below shows the closest neighbors in the representation space of some of these sequences.

(figure 9 here)

The dissociation module D is the opposite of the association module, that is, a trainable function that computes two representation space vectors from a single vector. When its input is a meaningful output of the association module, its output should be the two inputs of the association module. Stacking one instance of the association module and one instance of the dissociation module is equivalent to an auto-encoder.
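The auto-encoder view suggests a natural training signal: the error made when two vectors are associated and then dissociated again. The following sketch computes that error for one pair of vectors. The linear form of the dissociation module, the squared-error measure, and all names are assumptions made for the illustration; the parameters are random and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 4  # hypothetical size of the representation space
A_weights = rng.normal(scale=0.5, size=(DIM, 2 * DIM))  # association module
D_weights = rng.normal(scale=0.5, size=(2 * DIM, DIM))  # dissociation module


def associate(left, right):
    """Two vectors in, one vector out."""
    return np.tanh(A_weights @ np.concatenate([left, right]))


def dissociate(vector):
    """One vector in, two vectors out."""
    out = D_weights @ vector
    return out[:DIM], out[DIM:]


def reconstruction_error(left, right):
    """Squared error of the association/dissociation auto-encoder."""
    rec_left, rec_right = dissociate(associate(left, right))
    return float(np.sum((rec_left - left) ** 2)
                 + np.sum((rec_right - right) ** 2))


left, right = rng.normal(size=(2, DIM))
error = reconstruction_error(left, right)
```

Minimising this error over meaningful pairs would push the dissociation module towards inverting the association module on exactly those inputs.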

(figure 10)

Association and dissociation modules are not limited to natural language processing tasks. A number of state-of-the-art systems for scene categorization and object recognition use a combination of strong local features, such as SIFT or HOG features, consolidated along a pyramidal structure. Similar pyramidal structures have been associated with the visual cortex. Pyramidal structures work poorly as image segmentation tools, however. For example, the figure below shows that a large convolutional neural network provides good object recognition accuracy but coarse segmentation.

(figure 11)

Finally, we envision modules that convert image representations into sentence representations and conversely. Given an image, we could parse the image and convert the final image representation into a sentence representation. Conversely, given a sentence, we could produce a sketch of the associated image by similar means.

## Universal Parser

The figure below shows a model of short-term memory (STM) capable of two possible actions: (1) inserting a new representation vector into the short-term memory and (2) applying the association module A to two representation vectors taken from the short-term memory and replacing them with the combined representation vector. Each application of the association module is scored using the saliency scoring module R. The algorithm terminates when the STM contains a single representation vector and there are no more representation vectors to insert.

(figure 12)

The algorithm design choices determine which data structure is most appropriate for implementing the STM. In English, nearly all syntactic constructs are formed by adjacent sentence segments, and it is therefore attractive to implement the STM as a stack and construct a shift/reduce parser.
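A stack-based version of this procedure can be sketched as follows. The decision rule (reduce when the score of the merged vector exceeds a threshold, otherwise shift) is an assumption made for the illustration, as are the module forms, the random parameters, and the names `shift_reduce_parse`, `associate`, `saliency` and `leaves`.

```python
import numpy as np

rng = np.random.default_rng(3)
DIM = 4  # hypothetical size of the representation space
A_weights = rng.normal(scale=0.5, size=(DIM, 2 * DIM))  # association module A
r_weights = rng.normal(size=DIM)                        # saliency module R


def associate(left, right):
    return np.tanh(A_weights @ np.concatenate([left, right]))


def saliency(vector):
    return float(r_weights @ vector)


def shift_reduce_parse(words, vectors, threshold=0.0):
    """Parse with a stack as short-term memory.

    Shift inserts the next vector into the STM; reduce replaces the top
    two entries by their association. Stops when the input is exhausted
    and the STM holds a single entry.
    """
    if not words:
        raise ValueError("need at least one word")
    stack = []                         # STM: (tree, vector) pairs
    queue = list(zip(words, vectors))  # vectors still to insert
    total_score = 0.0
    while queue or len(stack) > 1:
        can_reduce = len(stack) >= 2
        if can_reduce:
            (l_tree, l_vec), (r_tree, r_vec) = stack[-2], stack[-1]
            merged = associate(l_vec, r_vec)
            score = saliency(merged)
        if can_reduce and (not queue or score > threshold):
            stack[-2:] = [((l_tree, r_tree), merged)]  # reduce
            total_score += score
        else:
            stack.append(queue.pop(0))                 # shift
    return stack[0][0], total_score


def leaves(tree):
    """Words of a bracketing, read left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


words = ["the", "cat", "sat", "on", "the", "mat"]
vectors = [rng.normal(size=DIM) for _ in words]
tree, total = shift_reduce_parse(words, vectors)
```

Because a reduce only ever combines the two topmost stack entries, the parser can only join adjacent sentence segments, which is the property that makes a stack a natural fit here.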

## More Modules

The previous sections discussed the association and dissociation modules. Here, we discuss a few more modules that perform predefined transformations on natural language sentences; modules that implement specific visual reasoning primitives; and modules that bridge the representations of sentences and the representations of images.

- Operator grammars <ref>Harris, Z. S. "Mathematical structures of language." Volume 21 of Interscience tracts in pure and applied mathematics.</ref> provide a mathematical description of natural languages based on transformation operators.
- There is also a natural framework for such enhancements in the case of vision. Modules working on the representation vectors can model the consequences of various interventions.

## Representation Space

The models above use functions operating on a low dimensional vector space, but modules with similar algebraic properties could be defined on different representation spaces. Such choices have a considerable impact on the computational and practical aspects of the training algorithms.

- When the representation space is a low dimensional vector space, the trainable functions must often be designed with nonlinear parameterizations in order to provide sufficient capabilities. The training algorithms are simple extensions of the multilayer network training procedures, using back-propagation and stochastic gradient descent.
- Sparse vectors in much higher dimensional spaces are attractive because they provide the opportunity to rely more on trainable modules with linear parameterization.
- The representation space can also be a space of probability distributions defined on a vector of discrete random variables.

## Conclusions

The research directions outlined in this paper are intended to advance the practical and conceptual understanding of the relationship between machine learning and machine reasoning. Instead of trying to bridge the gap between machine learning and "all-purpose" inference mechanisms, we can algebraically enrich the set of manipulations applicable to a training system and build reasoning abilities from the ground up.

## Bibliography

<references />