Concluding to the Future

Check it out: here are the results of using a mimic to extract the violin from Remeji, a song remake I've done. The mimic was played on the violin. The song is fairly harmonically dense and also includes other string sounds in similar frequency ranges.

Remeji Full Mix

Remeji Extracted Violin

Remeji Minus Solo Violin

Yes, it works. But there is still plenty to do. The examples so far were tested over a fairly wide range of possible parameters. Since every song is different, it would be unrealistic to expect one set of component counts or sparsity settings to always be best. The ideal is to narrow these down to a smaller set so that extraction can be carried out with good results at reasonable speed. And although a large number of statistical evaluations were presented, in the end the best extraction is subjective and in the ear of the listener.

The biggest optimization is selecting M and N correctly. With simple sounds, M can be determined from the actual part, but lacking that convenience it must be discovered. Previously it was suggested that a threshold drop in MSE during complete extraction could be used to determine the optimal M. If only extracting the melody, the choice of N is less important. If extracting the backing, a higher M definitely extracts more effectively than a low M, even if for melody purposes it lets too much interference in. Keeping M over 30 and N over 100 was generally sufficient for good backing extraction.
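The MSE-threshold idea above can be sketched as a simple elbow search: run the decomposition for increasing M, record the extraction MSE, and stop once the relative improvement falls below some cutoff. This is a minimal sketch; the function name and the 10% default cutoff are my own illustrative choices, not values from the experiments here.

```python
def pick_m_by_mse_drop(mse_by_m, threshold=0.10):
    """Pick the smallest M after which the relative MSE improvement
    falls below `threshold` (hypothetical 10% default)."""
    ms = sorted(mse_by_m)
    for prev, curr in zip(ms, ms[1:]):
        rel_drop = (mse_by_m[prev] - mse_by_m[curr]) / mse_by_m[prev]
        if rel_drop < threshold:
            return prev
    # MSE never flattened out: take the largest M tried
    return ms[-1]

# toy MSE curve: big drops up to M=3, then it flattens out
mse = {1: 1.00, 2: 0.55, 3: 0.30, 4: 0.28, 5: 0.27}
print(pick_m_by_mse_drop(mse))  # → 3
```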

The number of iterations also changes the results. Not surprisingly, more iterations of the full M+N decomposition lead to “more averaging,” yielding a less crisp but more stable result. That shorter runs do better for melody extraction is interesting, and computationally encouraging. An optimality test for iteration count is computationally cheap, so it is less of a concern.

Sparsity is very subjective. The big message is: don't use too much. Sometimes a statistic such as SIR will score well with a lot of sparsity, but the increase in artifacts causes an intolerable loss of sound quality. My personal preference was usually a small amount of sparsity in H (s < 0.2, typically around 0.05). Sparsity in W sometimes, but not always, improved the sound, and the same held for a combination of the two. Sparsity in Z almost never yielded tolerable results.
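To make the role of s concrete, here is a generic sketch of KL-divergence NMF with an L1 sparsity penalty s applied to the H update. This is an assumption-laden illustration: the actual decomposition here also involves Z, which this two-factor sketch omits, and the function name and update details are generic textbook NMF rather than the exact algorithm used in these experiments.

```python
import numpy as np

def nmf_kl_sparse_h(V, M, s=0.05, iters=200, seed=0):
    """Generic KL-divergence NMF with an L1 sparsity penalty `s` on H.
    V: nonnegative (freq x time) magnitude spectrogram, M: component count."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, M)) + 1e-9   # spectral templates
    H = rng.random((M, T)) + 1e-9   # activations
    for _ in range(iters):
        WH = W @ H + 1e-9
        # s enters only the H denominator, shrinking small activations
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + s)
        WH = W @ H + 1e-9
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T)
    return W, H
```

Larger s drives more entries of H toward zero, which is exactly where the artifact/quality trade-off above comes from.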

The need for sparsity is reduced when using the Bayes Decision Mask. Again, there is a careful balance between artifacts and audio quality, but it generally improved the cleanliness of the result. I find it works better for the melody than the backing. Many of the higher-M extractions remove the target successfully anyway, so further clean-up is unnecessary. With the melody, it helps eliminate some of the audio that otherwise creeps in when the target line isn't playing, something much more distracting than noise in the backing.
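A masking step of this flavor can be sketched as a per-bin decision: keep a time-frequency bin only where the target model explains more energy than the backing model. This assumes equal priors and uses hypothetical function names; the actual Bayes Decision Mask here may weight the posteriors differently.

```python
import numpy as np

def bayes_decision_mask(S_target, S_backing):
    """Binary mask: 1 where the target reconstruction has more energy
    than the backing reconstruction (equal priors assumed)."""
    return (S_target > S_backing).astype(float)

def apply_mask(X, mask):
    """Zero out the mix's STFT bins assigned to the backing."""
    return X * mask
```

In the equal-prior case this reduces to a simple energy comparison; this is why it cleans up bins where the target line isn't playing, at the cost of some masking artifacts.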

Future Work

A new, much more limited test suite needs to be developed to find these optimal parameters in a couple of minutes rather than an hour. With the reduced search range and better-informed means of discovering optimal components, this should be feasible. Since I can generally identify the extra components visually, a better way of finding M might be possible and would be handy. I did small studies of both KL divergence and simple variance to identify the least useful components for targeted removal, but neither seemed informative.
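For reference, the simple-variance heuristic mentioned above amounts to something like the following sketch: rank components by the variance of their activation rows in H and treat the flattest ones as removal candidates. The function name is hypothetical, and as noted, this heuristic did not prove informative in practice.

```python
import numpy as np

def rank_components_by_variance(H):
    """Rank NMF components by the variance of their activations in H.
    Returns component indices from least to most varied; low-variance
    rows are the candidates for targeted removal (a heuristic only)."""
    return np.argsort(H.var(axis=1))
```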

Other work still to do includes investigating how best to break up a song for extraction. Here I worked with clips, but the end goal is the full song. Additionally, finding a better way to pick starting priors would be fantastic. The quality of extraction can still vary widely if the random starting priors aren't very good and the algorithm ends up in a local optimum.

For more examples of extracted audio go to:

Extraction Examples