How To Separate Audio

First, an introduction to new terminology. M and N are numbers of components. PLCA decomposes the magnitude spectrogram (the FFT magnitudes over time) into three matrices: W, H, and Z. W is the matrix with M columns containing the frequency probabilities p(f|z); H is the matrix with M rows relating the timing info p(t|z); and Z is the M-element vector of the overall probability p(z) of each component. W*diag(Z)*H = V, where V is the approximate distribution of the spectrogram.
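The factorization above can be sketched in a few lines of numpy. The sizes here (F frequency bins, T time frames, M components) are illustrative, not from the original project:

```python
import numpy as np

# Illustrative sizes: F frequency bins, T time frames, M components.
F, T, M = 6, 8, 3
rng = np.random.default_rng(0)

# W: p(f|z) -- each column is a distribution over frequency, so columns sum to 1.
W = rng.random((F, M))
W /= W.sum(axis=0, keepdims=True)

# H: p(t|z) -- each row is a distribution over time, so rows sum to 1.
H = rng.random((M, T))
H /= H.sum(axis=1, keepdims=True)

# Z: p(z) -- the component weights sum to 1.
Z = rng.random(M)
Z /= Z.sum()

# V approximates the normalized magnitude spectrogram; since each factor is
# normalized, V is a joint distribution over (f, t) and sums to 1.
V = W @ np.diag(Z) @ H
```

Because W, H, and Z are all normalized distributions, V sums to exactly 1 regardless of the random values, which is what lets PLCA treat the spectrogram as a probability distribution.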

PLCA with Dirichlet Hyperparameters

  1. In order to separate audio, first obtain a wav of the original mix. Then record a mimic of the target line.

  2. Convert everything to the frequency domain. Do your FFTs!

  3. Using standard PLCA, train a model with M components on the recorded mimic. This gives matrices W, H, and Z for the target line that is to be extracted.

  4. Train a model using PLCA with Dirichlet hyperparameters with M + N components on the full mix.

    1. Providing W and H as fixed priors is a start towards extraction, but it does not keep the target line weighted strongly enough as training proceeds. Instead, use W and H as Dirichlet hyperparameters on the M trained components, with a weight that decreases over the iterations: at each iteration of the algorithm, add these prior matrices multiplied by a decreasing factor, keeping the whole normalized. For this project I decreased the weighting factor linearly from 1 to 0. Please reference the code or the algorithm description.

    2. Train the remaining N components as normal, with no Dirichlet hyperparameters. The first M components will contain the extracted source, while the last N components contain the backing.

  5. The end result provides, for each frequency/time bin, the probability of belonging to either the M-component extraction or the N-component backing. The resulting V is broken into Vm, the probabilities of the target extraction, and Vn, the probabilities of the backing. To recover the actual energy of the recording, the two masks derived from Vm and Vn are applied to the original FFT magnitudes. Now you have your extraction.
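The steps above can be sketched as a minimal PLCA EM loop in numpy. This is my own illustrative sketch, not the paper's implementation: the function name `plca`, the iteration count, the component counts, and the random spectrograms are all assumptions, and it assumes the mimic and mix spectrograms have the same number of frequency bins and time frames so the trained H can serve as a prior:

```python
import numpy as np

def plca(V, M, n_iter=50, W_prior=None, H_prior=None, rng=None):
    """Minimal PLCA via EM (a sketch, not the authors' exact implementation).

    V: nonnegative (F, T) spectrogram, normalized to sum to 1.
    If W_prior (F, Mp) and H_prior (Mp, T) are given, they are applied as
    Dirichlet hyperparameters to the first Mp components, with a weight
    that decreases linearly from 1 to 0 over the iterations.
    """
    rng = rng or np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, M)); W /= W.sum(axis=0, keepdims=True)
    H = rng.random((M, T)); H /= H.sum(axis=1, keepdims=True)
    Z = np.full(M, 1.0 / M)
    Mp = 0 if W_prior is None else W_prior.shape[1]

    for it in range(n_iter):
        alpha = 1.0 - it / max(n_iter - 1, 1)      # linearly decreasing weight
        # E-step: posterior p(z|f,t) for every component and bin.
        Vhat = (W * Z) @ H + 1e-12                  # model's (F, T) estimate
        R = (W[:, :, None] * Z[None, :, None] * H[None, :, :]) / Vhat[:, None, :]
        # M-step: expected counts, reweighted by the observed V.
        C = R * V[:, None, :]                       # (F, M, T)
        W = C.sum(axis=2)                           # unnormalized p(f|z)
        H = C.sum(axis=0)                           # unnormalized p(t|z)
        # Add the Dirichlet hyperparameters to the first Mp components.
        if Mp:
            W[:, :Mp] += alpha * W_prior
            H[:Mp, :] += alpha * H_prior
        Z = W.sum(axis=0); Z /= Z.sum()
        W /= W.sum(axis=0, keepdims=True)
        H /= H.sum(axis=1, keepdims=True)
    return W, H, Z

# Usage sketch with random stand-ins for real spectrograms.
F, T, M_target, N_back = 64, 100, 4, 8
rng = np.random.default_rng(1)
mimic = rng.random((F, T)); mimic /= mimic.sum()
Wm, Hm, _ = plca(mimic, M=M_target, rng=rng)        # step 3: model the mimic

mix = rng.random((F, T)); mix /= mix.sum()
W, H, Z = plca(mix, M=M_target + N_back,            # step 4: model the mix
               W_prior=Wm, H_prior=Hm, rng=rng)

# Step 5: split V into target (first M) and backing (last N) halves,
# then mask the original magnitudes proportionally.
Vm = (W[:, :M_target] * Z[:M_target]) @ H[:M_target, :]
Vn = (W[:, M_target:] * Z[M_target:]) @ H[M_target:, :]
target_mag = mix * Vm / (Vm + Vn)
```

The annealing of `alpha` from 1 to 0 is the key move: early iterations pin the first M components to the mimic's frequency and timing profiles, while later iterations let them adapt freely to the actual target line in the mix.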

Bayes Binary Masking

This is where PLCA ends. But there is one more thing we can do to improve the results: apply Bayesian decision making to make a binary choice about whether each frequency/time bin ends up in only one half of the extraction. Using the soft mask V from PLCA may distribute a bin's power between the target and the backing. Listening to the results, low levels of backing often seep through into the extraction, and vice versa. A Bayes-derived binary mask removes some of this bleed at the cost of additional distortion, which is often preferable to the ear.



Next: Data and Evaluation

This algorithm is described in more detail in “Separation by “humming”: User-guided sound extraction from monophonic mixtures” (Smaragdis & Mysore 2009).