Evaluation!
I tested the algorithm against a number of songs. Most tests used the original source audio as the target rather than a mimicked sample, for the simple reason that this let me verify that the behavior I observed was a result of the algorithm and not of whether I had made a good mimic. There were a number of things to investigate:
Performance for different numbers of algorithm iterations.
Performance for different numbers of target descriptor components M.
Performance for different numbers of backing descriptor components N.
The effects of introducing sparsity on different vectors to reduce the effect of weak components.
Different risk boundaries for the Bayes binary masking.
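To make the sparsity item above concrete, here is a minimal sketch of one common way to push a weight vector toward sparsity: raise each entry to a power greater than one and renormalize, so that weak components shrink relative to strong ones. This is an illustration only. The function name `sparsify` and the `exponent` parameter are hypothetical, and this is not necessarily the sparsity method used in these tests.

```python
def sparsify(weights, exponent=2.0):
    """Sharpen a non-negative weight vector and renormalize it to sum to 1.

    Assumption: sparsity is imposed by exponentiation and renormalization.
    The method actually used in the tests may differ.
    """
    if exponent < 1.0:
        raise ValueError("exponent must be >= 1 to increase sparsity")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    powered = [w ** exponent for w in weights]
    total = sum(powered)
    if total == 0:
        # Nothing to sharpen: return an unchanged copy.
        return list(weights)
    return [p / total for p in powered]


if __name__ == "__main__":
    # The weakest component loses relative weight, the strongest gains.
    print(sparsify([0.5, 0.3, 0.2], exponent=2.0))
```

A larger exponent suppresses weak components more aggressively, while an exponent of 1 leaves a normalized vector unchanged.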
Besides qualitative evaluation, quantitative evaluation was done using Mean Squared Error (MSE) and the Blind Source Separation Tool-Box from http://bass-db.gforge.inria.fr/bss_eval/, which is a standard speech separation evaluation set. While the absolute results from MSE were often not very helpful, their trend follows the most important metric, Source to Interference Ratio (SIR). MSE is calculated from the FFTs, which makes it less computationally expensive at intermediate stages, and it will likely become the driving metric in the automatic optimization to come later. The BSS tool-kit results compare the actual WAV outputs, returning Source to Distortion Ratio (SDR), Sources to Artifacts Ratio (SAR), and SIR. SIR measures the leakage between the two extracted tracks and is the best measure of success. SDR and SAR relate more closely to general audio quality.
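As a rough illustration of the MSE measure described above, the sketch below compares the magnitude spectra of a reference signal and an extracted signal frame by frame. It is a sketch under assumptions: the framing (non-overlapping, no window), the naive DFT, and the names `magnitude_spectrum`, `spectral_mse`, and `frame_len` are all hypothetical and chosen to keep the example self-contained. A real implementation would reuse the FFTs already computed by the algorithm.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins of one frame.

    Uses a naive O(n^2) DFT so the example needs only the standard library.
    """
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_mse(reference, estimate, frame_len=256):
    """Mean squared error between the magnitude spectra of two signals.

    Assumption: signals are split into non-overlapping, unwindowed frames,
    and any trailing partial frame is ignored.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    total, count = 0.0, 0
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref_mag = magnitude_spectrum(reference[start:start + frame_len])
        est_mag = magnitude_spectrum(estimate[start:start + frame_len])
        total += sum((r - e) ** 2 for r, e in zip(ref_mag, est_mag))
        count += len(ref_mag)
    if count == 0:
        raise ValueError("signals are shorter than one frame")
    return total / count
```

Because the absolute value depends on signal level and frame length, a number like this is mainly useful for comparing runs against each other, which matches the observation that the trend matters more than the absolute result.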
I have also done a comparison with one of the most basic means of extraction, the straight application of a Bayes Binary Mask.
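For reference, a binary mask of this kind can be sketched as follows. Each time-frequency bin of the mixture is assigned wholly to the target when the target estimate exceeds the backing estimate scaled by a risk factor. The decision rule, the names `binary_mask` and `apply_mask`, and the `risk` parameter are assumptions made for illustration. The exact rule used in the comparison may differ.

```python
def binary_mask(target_mag, backing_mag, risk=1.0):
    """Return a 0/1 mask over time-frequency bins.

    Assumption: a bin goes to the target when target > risk * backing.
    Raising `risk` makes the mask more conservative about claiming bins.
    """
    return [
        [1 if t > risk * b else 0 for t, b in zip(t_row, b_row)]
        for t_row, b_row in zip(target_mag, backing_mag)
    ]


def apply_mask(mixture_mag, mask):
    """Keep mixture bins where the mask is 1 and zero the rest."""
    return [
        [x * m for x, m in zip(x_row, m_row)]
        for x_row, m_row in zip(mixture_mag, mask)
    ]


if __name__ == "__main__":
    target = [[3.0, 1.0], [0.5, 2.0]]    # hypothetical target magnitude estimates
    backing = [[1.0, 2.0], [1.0, 1.0]]   # hypothetical backing magnitude estimates
    mixture = [[4.0, 3.0], [1.5, 3.0]]
    mask = binary_mask(target, backing, risk=1.0)
    print(mask)
    print(apply_mask(mixture, mask))
```

Moving the risk boundary trades leakage from the backing track against holes in the extracted target, which is why it is one of the parameters worth sweeping.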
I have tried to keep the targets and sources somewhat diverse. With one noted exception, all tests were run using source audio. Mimics are noted where they were done in addition:
Lead Vocal and bass-line from Oasis's Wonderwall (500,000 samples starting at .49 seconds)
Vocal extraction performed using both source audio and a mimic executed using voice
Lead Vocal from Radiohead's Nude (400,000 samples starting from 1.27 seconds)
Lead Vocal and Bass from George Cochran's Entertaining Circumstances (400,000 samples at 1.44 seconds - thanks George!)
Clarinet from Larry Linkin performing “One Note Samba” (400,000 samples starting at .16 seconds)
In this case I don't have the source audio. This is a music-minus-one example, so I have only the backing source. The extraction was completed using a MIDI keyboard controlling a clarinet-like sound.
Melodic lead from my own unnamed Ableton toy set.
Extracted using mimicry on violin.
Lead Violin from my own Remeji (500,000 samples starting from .46 seconds). This is the hardest test set offered. Violin has a notoriously complex spectrum and, worse, here it has been recorded with significant reverb. There are also other string parts in a similar frequency range.
Also performed extraction using mimic recorded on violin.