Auditory recognition and classification based on changes in attentional state.
The input provided by the MIDI file is preprocessed to calculate Pitch, Volume, and Duration values, as well as a voice index. The voice index is then replaced by a time-elapsed value, which makes the use of voices general enough to allow for a dynamic reordering of voices. Maximum, minimum, and average values within a window are calculated for each of these in a pre-processing step.
Based on these values and their maximums, minimums, and averages, saliency values are calculated for each piece of data and its derivatives. This results in saliency maps which make different vectors stand out, depending on which feature one is paying attention to.
The pattern matching algorithm will be biased toward matching the values which are salient within the vectors which are salient. Patterns of varying lengths will be stored for salient vectors, and matched as new inputs come in.
When a pattern is partially matched with the current input, the system will predict future notes as the continuation of the stored pattern. Since more than one pattern will be matched (at least one for each pattern length, i.e. one per pattern table), a weighted average will be calculated, depending on certainty and pattern length.
Some research in Artificial Intelligence has been oriented mainly towards building a general computational theory of intelligence, of which human intelligence would be only one instance. However, some of the recent research in Artificial Intelligence has been directed primarily towards understanding human intelligence, taking the results of neuroscience and psychology into serious consideration.
Our research fits in the second category, as part of the effort to understand general intelligence through a study of animal and particularly human intelligence, the only form of intelligence we know. We use methods of Computer Science, and especially Computer Engineering, as tools to exploit the facts revealed by biology, psychology, and neuroscience. Most importantly, we believe that perception and interaction with the real world are at the heart of intelligence.
In this perspective, our research is aimed at expressing, with computational tools, a model of human perceptual information processing. In particular, this project is a proposal to explore an attention-based model for auditory processing, inspired by the model proposed by Rao and Ullman for visual spatial perception.
A lot of work has been done in fields related to visual information processing, for example computer vision and character recognition. However, less work has been done on auditory information processing in areas other than language processing. Furthermore, only a small fraction of the work on vision is directly aimed at understanding practically how the human brain does it. Similarly, little work has been done on understanding how we listen and react to sounds other than conversations, for instance music. How do we understand, interpret, and recognize music? Moreover, why does it affect our emotions, our mood?
For the human visual system, Ullman proposed a theory of cognitive routines to account for the flexibility and versatility of our visual system in performing various complex information extraction tasks. Rao proposed a model of how these routines can be performed and learned using a generic architecture built around attention.
Our task in this project was to try out a model similar to Rao's for musical perception. We chose music/audition instead of vision to see whether the generic model for spatial information processing is also a generic model for other senses, as Rao speculates. The concept of an attention-based model is an attractive one, for two related reasons:
Although we abstract away the issues of note harmonics and instrument type, we use a keyboard as input. This allows for more rapid creation of new tunes, as well as for future real-time input. The same mechanisms used to load the MIDI file output from the electric keyboard can also be used to load pre-made MIDI files.
Dedric playing puts us in the mood. ;o)
The input is limited to one instrument at a time, although multiple notes can be played simultaneously. Our utilities can load the notes from either the MIDI0 or the MIDI1 file format. After loading into our system, the notes are separated into different voices.
Matching notes can be tricky, because the MIDI format stores only note-ON/note-OFF times, in a code format which can be confusing. Dedric's program successfully decodes this information and returns a Pitch/Volume/Duration/Time-on encoding for each of the notes played. This encoding also cleans up slight note overlaps.
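A minimal sketch of this pairing step, assuming the raw MIDI events have already been decoded into (time, on/off, pitch, velocity) tuples; the function name and the overlap-trimming rule are illustrative assumptions, not Dedric's actual code.

```python
# Sketch: pair note-on/note-off events into (pitch, volume, duration, time_on)
# records, trimming slight overlaps between consecutive notes of the same pitch.
# Events are assumed to be (time, is_on, pitch, velocity) tuples already decoded
# from the MIDI0/MIDI1 file; all names here are illustrative.

def pair_events(events, overlap_epsilon=0.01):
    open_notes = {}   # pitch -> (time_on, velocity)
    notes = []        # (pitch, volume, duration, time_on)
    for time, is_on, pitch, velocity in sorted(events):
        if is_on and velocity > 0:
            if pitch in open_notes:            # slight overlap: close the old note first
                t_on, vel = open_notes.pop(pitch)
                notes.append((pitch, vel, max(time - t_on - overlap_epsilon, 0.0), t_on))
            open_notes[pitch] = (time, velocity)
        elif pitch in open_notes:              # note-off, or note-on with velocity 0
            t_on, vel = open_notes.pop(pitch)
            notes.append((pitch, vel, time - t_on, t_on))
    return sorted(notes, key=lambda n: n[3])
```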
Ovidiu's code, after updating the internal MOOD note store, takes the new note and tags it with a voice. The voices denote notes which overlap, or notes which follow a pattern distinct from that of a separate set of notes being played in the same tune. This is akin to two objects moving about in space: each has a trajectory separate from the other's. In vision, one of the first tasks is to separate potential objects based on their motion; the equivalent is being attempted here in the auditory realm.
As Manolis observes, this is yet another great MOOD success.
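One way such a voice tagger could work is sketched below; this greedy overlap-based heuristic is only an illustration of the idea, not necessarily the rule used in the MOOD code.

```python
# Sketch: greedy voice assignment. A note joins the first voice whose previous note
# has already ended; notes that overlap every existing voice open a new voice.

def assign_voices(notes):
    # notes: list of (pitch, volume, duration, time_on), sorted by time_on
    voice_end = []          # end time of the most recent note in each voice
    tagged = []             # (voice_index, note)
    for note in notes:
        pitch, volume, duration, time_on = note
        for v, end in enumerate(voice_end):
            if time_on >= end:                  # no overlap: reuse this voice
                voice_end[v] = time_on + duration
                tagged.append((v, note))
                break
        else:                                   # overlaps every voice: open a new one
            voice_end.append(time_on + duration)
            tagged.append((len(voice_end) - 1, note))
    return tagged
```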
The time-elapsed field is the difference between the current time and the time of the note. When a pattern of 5 consecutive notes is constructed, the time-elapsed fields will be [4, 3, 2, 1, 0], and they will exactly match any other pattern of consecutive notes. However, there is no constraint on which data may enter a pattern.
For example, a pattern table can have its first element come from voice X, the second from voice Y, and the third from voice Z. A pattern in that table could be "voice X falls 2 notes ago, voice Y rises 10 notes ago, and voice Z has maximum pitch now". In that case the time-elapsed fields will be [2, 10, 1] for voices [X, Y, Z]. The pattern will also match [2, 8, 1], where Y rose 8 notes ago instead. This allows the system to match a larger class of patterns, independent of the exact time of occurrence and of which voice the notes belong to.
If such a time-elapsed field were not included, one would have to train the system to recognize only cases where a pattern depends on exact time relationships. This varying field allows for more dynamic relationships between voices. Since pattern tables are created dynamically, depending on what data is salient, the relationships to look for need not be hard-coded. A different pattern table will be created for each different combination of notes, but that is another story, which belongs to the pattern matching domain.
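A small sketch of how this field could be filled in when a pattern is built; measuring it in "notes ago" is an assumption consistent with the [4, 3, 2, 1, 0] example above.

```python
# Sketch: compute the "notes ago" time-elapsed field for each element of a pattern.
# pattern_positions holds the absolute note index at which each element occurred;
# the current note has index `now`.

def time_elapsed_fields(pattern_positions, now):
    return [now - pos for pos in pattern_positions]

# A pattern built from 5 consecutive notes ending at the current one:
# time_elapsed_fields([8, 9, 10, 11, 12], now=12)  ->  [4, 3, 2, 1, 0]
```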
In the diagram to the left, the yellow boxes represent the data received directly from the MIDI file. The red boxes hold the transformed data, described above, that will be used in matching patterns. The particular voice is not matched within the pattern element, but is encoded in the pattern table: within a pattern table, the same element will always come from the same voice X.
The input vector is augmented by the derivatives of the three input values, represented in blue. Only the basic values and their derivatives are matched when comparing elements. The saliency values are simply there to determine what is important when making a match, but they are not matched themselves.
Over the attention window (magenta), maximum, minimum, and average values of each of the six fields are calculated and used to compute the saliency maps (cyan). The window size represents how far back into the past we should look for normalizing values when comparatively analyzing the current data. It is relative to those min, max, and avg values that values are matched within patterns, as will be described later on. They also serve as a normalization step, so that data given in different units can be compared in an unbiased way.
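A minimal sketch of these window statistics and the normalization they enable; the field layout and function names are assumptions.

```python
# Sketch: per-field max/min/avg over the attention window, plus the normalization
# that lets pitch, volume, duration, and their derivatives be compared on the same
# footing. window_vectors is the list of input vectors inside the current window.

def window_stats(window_vectors, field):
    values = [v[field] for v in window_vectors]
    return max(values), min(values), sum(values) / len(values)

def normalize(value, vmax, vmin):
    if vmax == vmin:
        return 0.5            # degenerate window: every value counts as "average"
    return (value - vmin) / (vmax - vmin)
```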
Along with the other saliencies, a periodicity saliency is calculated, which makes notes salient based on a periodically salient history (green).
Thus, from a 3-value input (pitch, volume, time) we have arrived at a 7-value vector (pitch, volume, duration, their derivatives, and time elapsed) to match upon. Adding periodicity, each vector also carries an 8-field saliency, which determines which values should be matched more carefully.
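Putting the pieces together, one augmented input vector might be represented as below; the field names are illustrative.

```python
# Sketch: one augmented input vector as described above. The seven matched values
# travel together with an eight-field saliency that only biases matching and is
# never matched itself.
from dataclasses import dataclass

@dataclass
class NoteVector:
    pitch: float
    volume: float
    duration: float
    d_pitch: float          # derivatives of the three basic values
    d_volume: float
    d_duration: float
    time_elapsed: float     # "notes ago"; replaces the voice index
    saliency: list          # eight saliency values, including periodicity saliency
```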
Saliency is a value between 0 and 1 that indicates how important a vector, or a field inside a vector, is. The most important ones have a saliency of 1, the least important a saliency of 0.
The saliency value for a vector is computed as the average of individual saliencies calculated for each of the components.
Individual saliencies for each value are computed relative to the window of attention. The highest value in the window will generally have a saliency of 1, as will the lowest one, while an average value will have a saliency of 0.
The highest volume in the current window yields a saliency of 1; the lowest has a saliency of 0.
The longest duration and the shortest one have saliencies of 1 (either long = salient or staccato = salient), and the average duration will have a saliency of 0.
Similarly for pitch - highest and lowest notes have saliency of 1, and average pitch has saliency of 0.
Derivatives follow the same saliency rules as the components they come from.
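A sketch of how these per-value saliencies, and the vector saliency built from them, could be computed from the window statistics; the linear ramp between the average and the extremes is an assumption, as is the per-field switch for whether the low extreme counts as salient.

```python
# Sketch: distance-from-average saliency. Values at the window extremes get
# saliency 1, values at the window average get 0, with a linear ramp in between.
# Whether the low extreme is salient is a per-field choice (above, the quietest
# note is not salient, while the shortest and lowest notes are).

def value_saliency(value, vmax, vmin, avg, low_is_salient=True):
    if value >= avg:
        return (value - avg) / (vmax - avg) if vmax > avg else 0.0
    if not low_is_salient:
        return 0.0
    return (avg - value) / (avg - vmin) if avg > vmin else 0.0

def vector_saliency(value_saliencies):
    # The saliency of a whole vector is the average of its component saliencies.
    return sum(value_saliencies) / len(value_saliencies)
```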
We have thus constructed many saliency maps. One can pay attention to notes salient in volume, in pitch, in duration, or in their derivatives. The pattern matcher also computes a periodicity saliency map, which gives a value of 1 to notes expected to be salient and 0 to the non-salient ones. Together with the previous saliency values, this final addition constitutes the saliency part of the code.
Pattern Matching
Pattern matching, along with learning the periodicity of the input, performs the basic computations the code is based upon. It is able to learn patterns of different lengths from the input, and to match partially completed patterns, either to predict future inputs or to turn the focus of attention to patterns interesting to a specific behavior.
The pattern matcher receives commands from the attention loop, which directs it on what to match, what to record (learn), and what to pay attention to.
Every time a pattern is matched, the individual elements get averaged and their certainties updated. The certainty of a match between two values is one minus the difference between the two values, divided by the difference between the maximum and minimum values in the current window. The certainty of a match between two pattern elements is the average of the certainties of their values. The certainty of a match between two patterns is the average of the certainties of their elements, weighted by the saliency of each element. Intuitively, within a sequence of notes one will try to match the important notes, and will care less about less salient notes which do not match.
When the certainty with which two patterns are matched is too low, then a new pattern is added to the pattern table.
When the certainty is high enough, then the matched pattern is extended to include the current pattern with which it matched.
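The certainty computation described above, written out as a sketch; the data layout is an assumption, while the certainty formulas follow the definitions in the text.

```python
# Sketch of the three levels of match certainty:
#   value certainty   = 1 - |a - b| / (window_max - window_min)
#   element certainty = average of its value certainties
#   pattern certainty = saliency-weighted average of its element certainties

def value_certainty(a, b, vmax, vmin):
    return 1.0 - abs(a - b) / (vmax - vmin) if vmax > vmin else 1.0

def element_certainty(elem_a, elem_b, stats):
    # stats[i] = (vmax, vmin) of field i over the current window
    certs = [value_certainty(a, b, *stats[i]) for i, (a, b) in enumerate(zip(elem_a, elem_b))]
    return sum(certs) / len(certs)

def pattern_certainty(pat_a, pat_b, elem_saliencies, stats):
    certs = [element_certainty(ea, eb, stats) for ea, eb in zip(pat_a, pat_b)]
    total = sum(elem_saliencies) or 1.0
    return sum(c * s for c, s in zip(certs, elem_saliencies)) / total
```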
Making a Match
In the figure, we would like the second pattern, enclosed in the box, to match the first one, recorded previously. Clearly, the absolute values do not match, nor do their derivatives, since everything seems to have doubled (for example in volume).
Therefore, when a match is made, what is compared is not absolute values, but instead values relative to the current maximum and minimum of the window-size considered.
Making a Prediction
Predictions are made accordingly, relative to the current window maximum and minimum. If the current pattern has length 4, then the system will match the first 4 elements of a stored pattern of length 5, and predict the next note as the final element of the matched stored pattern.
Patterns of different sizes are matched in making the predictions. Their predictions are weighted, then averaged, to obtain the final prediction. For each length, let best[length] be the pattern of that length that best matched the current pattern (of length - 1). Each best[length] has a certainty value associated with it, and with each of its elements. This certainty value, along with the pattern length, is used in weighting the different-size predictions.
The longer the pattern, the higher its weight should be. The reason for this bias toward trusting longer patterns is that they are less likely to match by accident: a two-note pattern will match almost anything with high certainty.
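A sketch of this weighted averaging across pattern lengths; weighting each prediction by certainty times length is an assumption consistent with the bias described above.

```python
# Sketch: combine the predictions made by the best-matching pattern of each length.
# best maps length -> (predicted_vector, certainty); longer patterns are trusted
# more, so each prediction is weighted by certainty * length (an assumption).

def combine_predictions(best):
    weighted = [(length * cert, pred) for length, (pred, cert) in best.items()]
    total = sum(w for w, _ in weighted)
    if total == 0:
        return None
    n_fields = len(next(iter(best.values()))[0])
    return [sum(w * pred[i] for w, pred in weighted) / total for i in range(n_fields)]
```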
Mood - Report
MOOD itself is basically a database of shared state; it provides no functionality by itself, but is simply a place to scribble on at every iteration of the attentional loop.
Our hope is that similar behaviors correspond to similar types of music, and thus our program would be able to recognize and categorize types of music simply by the way it reacts to each of them.
We reduced the input to a 3-dimensional vector of pitch, duration, and volume. This is an oversimplification of our perception. One direction for future work is to explore lower-level sound-wave perception at the right level of complexity. We would then understand how the many harmonics of an instrument enrich our perception of music.
Our project expressed a language of attention for audition, within the limits of music perception with a single instrument. With a more accurate model of lower-level auditory perception, we could extend this language to the whole auditory world. We could understand how we perceive several instruments coherently together, or how we can select one instrument, one conversation among others, or one pattern of sounds in the middle of nature.
Visual and Auditory Routines are nothing else but elements of our implicit memory. With this insight, we can explore how implicit memory relates to explicit memory and understand how it all works in the brain.
When we first learn how to play an instrument, we must pay very close attention to the movement of our fingers. But once the learning is done, we can play the piece like a routine, without even paying attention to what our fingers are doing, and focus on something else, for example a conversation with a friend. How does attention move horizontally between senses? How does attention move vertically between levels of processing, when it can follow every movement of the fingers or follow the global flow of a routine? Furthermore, how do we distribute attention, in other words pay attention to several things at the same time?
Having a unified model for auditory and visual perception can help us understand the functions of the neural structure of the human brain. How is attention implemented in our brain?
If we can put attention at the center of all our perception, will that explain the notion of consciousness and self-awareness?
What is beautiful about music? How can music affect our mood, make us sad, happy, angry, energetic, or even empowered or enraged? Music has the potential to create a strong emotional response in people, sometimes pushing it to an extreme. For example, some Muslim traditions give certain forms of music the same forbidden status as alcohol or gambling, because such music is believed to alienate people. If our model shows how we perceive music, then by studying which patterns of music perception affect our emotional state, we might gain insight into how emotions work.