NoRSE Algorithm

Abstract: 

NoRSE was developed to analyze high-frequency data sets collected from multi-state, dynamic experiments, such as molecular adsorption and desorption onto carbon nanotubes.  As technology improves sampling frequency, these stochastic data sets become increasingly large with faster dynamic events. More efficient algorithms are needed to accurately locate the unique states in each time trace. NoRSE adapts and optimizes a previously published noise reduction algorithm (Chung et al., 1991) and uses a custom peak flagging routine to rapidly identify unique event states.  The algorithm is explained using experimental data from our lab and its fitting accuracy and efficiency are then shown with a generalized model of stochastic data sets.  The algorithm is compared to another recently published state finding algorithm and is found to be 27 times faster and more accurate over 55% of the generalized experimental space.  NoRSE is written as an M-file for Matlab.

Link to paper: http://bioinformatics.oxfordjournals.org/content/early/2011/11/14/bioinformatics.btr632
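
Example usage (a minimal sketch, not part of the published package): NoRSE expects a file named Data.csv in the current directory, with one trace per column and one time step per row, and writes the fitted traces to NRData.csv. The snippet below builds a synthetic two-state trace (the synthetic-trace parameters are illustrative assumptions), runs NoRSE, and plots the raw trace against the fit.

% Build a synthetic two-state trace (illustrative values only)
Ns    = 1000;                                  % number of time steps
state = mod(cumsum(rand(Ns,1) < 0.01), 2);     % random switching between two states
trace = 1 + 0.5*state + 0.05*randn(Ns,1);      % two levels (1.0 and 1.5) plus noise
csvwrite('Data.csv', trace);                   % one column per trace

NoRSE;                                         % reads Data.csv, writes NRData.csv

fit = csvread('NRData.csv');
plot(1:Ns, trace, 1:Ns, fit(:,1));
legend('raw trace', 'NoRSE fit');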

Supplied code: NoRSE Algorithm for MATLAB

function NoRSE
% Originally Coded by Nigel F. Reuel 4.01.2011
% Massachusetts Institute of Technology
% Department of Chemical Engineering
%
% Last Updated 10.27.2011
%
% NoRSE was developed to analyze high-frequency data sets collected
% from multi-state, dynamic experiments, such as molecular adsorption
% and desorption onto carbon nanotubes. Full description available in the
% Bioinformatics publication entitled, "NoRSE: Noise Reduction and State
% Evaluator for High-Frequency Single Event Traces" by NF Reuel et al.
%
% The code is written to be as general as possible, but there are a few
% parameters that can be modified by the user to improve fitting to their
% specific data sets. See lines quoted below for further discussion of
% these parameters.
%
% ^^^ User-specified parameters in the noise reduction subroutine (NoiseReduc) ^^^
%
% 1. P - exponent that sets the sharpness of steps
% 2. M - weighting parameter length
% 3. N - sampling window lengths
%
% ^^^ User-specified parameters in the peak finder routine ^^^
%
% 1. nbins - number of histogram bins (set in Step 3 of the main routine)
% 2. SW - running-average window half-width (in PeakFinder)
% 3. MoleL - minimum significant peak height (in PeakFinder)
% 4. SigBinD - minimum significant bin spacing (in PeakFinder)
%
% Each of these is flagged in the code with a "^^^ NOTE ^^^" comment.
%
% ----- Step 1: Read in the Traces (can normalize here if desired) --------
%
% Data Columns = Trace number , Data Rows = Time steps
%
RTD = csvread('Data.csv');
%Determine the number of time steps (Ns) and number of transducers (Nt):
[Ns,Nt]= size(RTD);
RTD_n = RTD;
% Create empty matrix to receive the noise-reduced traces:
M = zeros(Ns,Nt);
%
% ---------------- Step 2: Denoise each of the Traces --------------------
%
%The denoising algorithm loses the first and last time steps, so set a new
%time window:
time = Ns - 2;
%Create an empty matrix to receive the denoised trace data:
T_denoise = zeros(time,Nt);
for i = 1:Nt
NT = RTD_n(:,i);
IC = feval(@NoiseReduc,NT);
T_denoise(:,i) = IC(:,1);
end
% csvwrite('T_denoise.csv',T_denoise); % optional: save the denoised traces
%
%---------------- Step 3: Construct APH for each trace ------------------
%
% Number of bins for the histogram construction. We found 200 works well
% for experimental traces of ~1000 steps in length. You may need to
% decrease this for shorter traces or increase for longer.
%
% ^^^ NOTE: User can change following parameter to improve fit ^^^
%
nbins = 200;
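% (Optional rule of thumb, an assumption rather than a published guideline:
% scale the bin count with trace length, e.g. nbins = max(50,round(Ns/5)),
% which reproduces 200 bins for the ~1000-step traces mentioned above.)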
%
%Create empty receiving matrices for the histogram counts (nM) and the bin
%centers (xM):
nM = zeros(Nt,nbins);
xM = zeros(Nt,nbins);
for i = 1:Nt
Y = T_denoise(:,i);
[n,xout] = hist(Y,nbins);
nM(i,:) = n(1,:);
xM(i,:) = xout(1,:);
end
%xlswrite('nM',nM);
%xlswrite('xM',xM);
%
%---------------- Step 4: Find peaks for each APH histogram -------------
%
%Use the PeakFinder function to find the Average Peak Locations (APL); these
%are returned as histogram bin numbers
APL = feval(@PeakFinder,nM,Ns);
[Nt,Nlev] = size(APL);
%
%Translate the APL bin numbers back to the normalized intensity level to
%determine the stochastic step levels (SLL)
SLL = zeros(Nt,Nlev);
for i = 1:Nt
% Very rarely, the peak finder will flag no bins as peaks; in that case,
% a single-Gaussian fit of the denoised trace is used to find a mean
% step value.
if sum(APL(i,:)) == 0
Ncurves = 1;
X = T_denoise(:,i); % column i = denoised trace i
[u,~,~,~] = feval(@fit_mix_gaussian,X,Ncurves);
SLL(i,1) = u;
else
for j = 1:Nlev
if APL(i,j) ~= 0
N_bin = APL(i,j);
SLL(i,j) = xM(i,N_bin);
end
end
end
end
%
%----------- Step 5 Determine the NoRSE Fit Trace ---------------
%
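% For each trace, assign every denoised point to the nearest state level
% (SLL) found in Step 4; the result is the piecewise-constant NoRSE fit.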
for ii = 1:Nt
RealT = T_denoise(:,ii);
count = 0;
for m = 1:Nlev
if SLL(ii,m) > 0
count = count + 1;
end
end
for i = 1:time
compare = zeros(count,1);
for j = 1:count
compare(j,1) = abs(RealT(i,1)-SLL(ii,j));
end
[C,I] = min(compare);
M(i+1,ii,1) = SLL(ii,I);
end
% Because the noise reduction program removes the first and last time
% points, estimate their values from the points immediately following
% and preceding them, respectively
M(1,ii,1) = M(2,ii,1);
M(time+2,ii,1) = M(time+1,ii,1);
end
% Record the noise-reduced traces in a csv format:
csvwrite('NRData.csv',M)
end
%
%
%
%
% <-------------- Functions embedded in the Trace Analyzer ------------>
%
function IC = NoiseReduc(NT)
%
%Coded by Nigel Reuel on 8.31.2010
%
%Adaptation of Chung and Kennedy (1991) forward-backward non-linear
%filtering technique to reduce the noise of our
%fluorescent signal in an attempt to resolve the single molecule events
%amidst the noise.
%
%Input: Takes supplied noisy trace (NT)
%Output: Returns the cleaned trace without the first and last data points
%
time = length(NT);
%
%Set algorithm parameters (discussion on this in Bioinformatics paper - Supplement A):
%
% ^^^ NOTE: User can change following parameter to improve fit ^^^
%
N = [4 8 16 32]; %Sampling window length
K = length(N); %Number of sampling windows
M = 10; %Weighting parameter length
P = 40; %Parameter that affects the sharpness of steps
%
% -------- The Forward-Backward non-linear Algorithm -----------
%
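% In brief, the filtered value computed below is a weighted combination of
% forward- and backward-looking window averages at each time step:
%   Iclean(i) = C(i) * sum_k [ f(i,k)*I_avg_f(i,k) + b(i,k)*I_avg_b(i,k) ]
% where I_avg_f/I_avg_b are means over the preceding/following N(k) points,
% the weights f and b are the summed squared prediction errors over the most
% recent M points raised to the power -P, and C(i) = 1/sum_k( f(i,k)+b(i,k) )
% normalizes the weights.
%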
%First - Run each time step and store the average forward and backward
%predictors for each of the sampling windows
%
I_avg_f = zeros(time,K);
I_avg_b = zeros(time,K);
for g = 1:time
%Run for each sampling window
for k = 1:K
%Solve for the average forward predictor (with some logic to handle
%the beginning of the trace):
window = N(k);
if g == 1
I_avg_f(g,k) = NT(1,1);
elseif g - window - 1 < 0
I_avg_f(g,k) = sum(NT(1:g-1,1))/(g-1); %average over the g-1 available points
else
epoint = g - window;
spoint = g - 1;
I_avg_f(g,k) = sum(NT(epoint:spoint,1))/window;
end
%Now do the same for the backward predictor
if g == time
I_avg_b(g,k) = NT(g,1);
elseif g + window > time
sw = time - g;
I_avg_b(g,k) = sum(NT(g+1:time,1))/sw;
else
epoint = g + window;
spoint = g + 1;
I_avg_b(g,k) = sum(NT(spoint:epoint,1))/window;
end
end
end
%
%Second - Solve for non-normalized forward and backward weights:
f = zeros(time,K);
b = zeros(time,K);
for i = 1:time
for k = 1:K
Mstore_f = zeros(M,1);
Mstore_b = zeros(M,1);
for j = 0:M-1
t_f = i - j;
t_b = i + j;
if t_f < 1
Mstore_f(j+1,1) = (NT(i,1) - I_avg_f(i,k))^2;
else
Mstore_f(j+1,1) = (NT(t_f,1) - I_avg_f(t_f,k))^2;
end
if t_b > time
Mstore_b(j+1,1) = (NT(i,1) - I_avg_b(i,k))^2;
else
Mstore_b(j+1,1) = (NT(t_b,1) - I_avg_b(t_b,k))^2;
end
end
f(i,k) = sum(Mstore_f)^(-P);
b(i,k) = sum(Mstore_b)^(-P);
end
end
% Third - Solve for vector of normalization factors for the weights
C = zeros(time,1);
for i = 1:time
Kstore = zeros(K,1);
for k = 1:K
Kstore(k,1) = f(i,k) + b(i,k);
end
C(i,1) = 1/sum(Kstore);
end
% Fourth and final step - Put all parameters together to solve for the
% intensities
Iclean = zeros(time,1);
for i = 1:time
TempSum = zeros(K,1);
for k = 1:K
TempSum(k,1) = f(i,k)*C(i,1)*I_avg_f(i,k) + b(i,k)*C(i,1)*I_avg_b(i,k);
end
Iclean(i,1) = sum(TempSum);
end
IC = Iclean(2:time-1,1);
return
end

function PeakLocAvg = PeakFinder(nM,Ns)
% Coded by Nigel Reuel on 9.15.2010 updated in March 2011 by NFR
% This function finds the peaks of a histogram using a modified running
% average algorithm + peak flagging routine
%
[nT,nB] = size(nM);
%
% Size of the window for the running average algorithm (increasing this
% gives a rougher approximation of the peaks). The user may need to vary
% this for their specific data set.
%
% ^^^ NOTE: User can change following parameter to improve fit ^^^
%
SW = 3;
%
% Centered running average algorithm to help find peaks:
%
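% For interior bins this averages nM(j, i-SW:i+SW) (a 2*SW+1 = 7 bin window
% for SW = 3); at the left edge the window is truncated to bins 1:i+SW, and
% at the right edge to bins i:nB.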
T_ra = zeros(nT,nB);
for j = 1:nT
for i = 1:nB
if i < (SW + 1)
T_ra(j,i) = sum(nM(j,1:i+SW))/(i+SW);
elseif i >= (SW+1) && i <= (nB-SW)
T_ra(j,i) = sum(nM(j,i-SW:i+SW))/(SW*2+1);
else
T_ra(j,i) = sum(nM(j,i:nB))/(nB-i+1);
end
end
end
%xlswrite('T_ra',T_ra);
%
%----------------------Peak finding algorithm---------------------------
%
% The logic follows the physical analogy of climbing up and down hills:
% drop flags on candidate peaks, eliminate "peaks" that are molehills, and
% combine peaks that are close neighbors.
Flags = zeros(nT,nB,6);
% Layer 1 = Peak Flag
% Layer 2 = Right Valley Bin#
% Layer 3 = Left Valley Bin#
% Layer 4 = Peak magnitude (Histogram Count)
% Layer 5 = Right Valley magnitude (Histogram Count)
% Layer 6 = Left Valley magnitude (Histogram Count)
% Molehill level: flagged peaks whose smoothed count falls below this threshold (a fraction of the average response) are not significant
%
%
% ^^^ NOTE: User can change following parameter to improve fit ^^^
%
MoleL = (Ns/nB/SW)*.5; % = #responses/#bins/SW ~= 1/2 of the Avg. Response - no significance.
% The multiplied factor ('0.5') can be decreased to allow smaller peak heights to be deemed 'significant'
% or increased to allow for only the highest peaks to be flagged.
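% (For example, with Ns = 1000 time steps, nbins = 200, and SW = 3, MoleL is
% roughly 1000/200/3*0.5 ~ 0.83 counts per smoothed bin.)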
%
PeakLocAvg = zeros(nT,2);
for i = 1:nT
for j = 1:nB
if T_ra(i,j) > MoleL
% Logic to find significant peaks at endpoints (and the R/L
% valleys)
if j == 1 && T_ra(i,j) > T_ra(i,j+1)
Flags(i,j,1) = 1;
Flags(i,j,4) = T_ra(i,j);
Pval = T_ra(i,j);
Pnext = T_ra(i,j+1);
index = j;
% Find right hand valley:
while Pval > Pnext
index = index + 1;
Pval = T_ra(i,index);
Pnext = T_ra(i,index+1);
end
Flags(i,j,2) = index;
Flags(i,j,5) = T_ra(i,index);
elseif j == 1 && T_ra(i,j) <= T_ra(i,j+1)
Flags(i,j,1) = 0;
elseif j == nB && T_ra(i,j) > T_ra(i,j-1)
Flags(i,j,1) = 1;
Flags(i,j,4) = T_ra(i,j);
Pval = T_ra(i,j);
Pprior = T_ra(i,j-1);
index = j;
% Find left hand valley:
while Pval > Pprior
index = index - 1;
Pval = T_ra(i,index);
Pprior = T_ra(i,index-1);
end
Flags(i,j,3) = index;
Flags(i,j,6) = T_ra(i,index);
elseif j == nB && T_ra(i,j) <= T_ra(i,j-1)
Flags(i,j,1) = 0;
% Logic to find peaks at all midpoints (and their R/L valleys)
elseif T_ra(i,j) >= T_ra(i,j-1) && T_ra(i,j) >= T_ra(i,j+1)
Flags(i,j,1) = 1;
Flags(i,j,4) = T_ra(i,j);
% Find left hand valley:
Pval = T_ra(i,j);
Pprior = T_ra(i,j-1);
index = j;
if j > 2
Switch = 1;
else
Switch = 0;
index = 1;
end
while Pval > Pprior && Switch == 1
index = index - 1;
Pval = T_ra(i,index);
Pprior = T_ra(i,index-1);
if index == 2
Switch = 0;
index = 1;
end
end
Flags(i,j,3) = index;
Flags(i,j,6) = T_ra(i,index);
% Find right hand valley:
Pval = T_ra(i,j);
Pnext = T_ra(i,j+1);
index = j;
if j < nB - 1
Switch = 1;
else
Switch = 0;
index = nB;
end
while Pval > Pnext && Switch == 1
index = index + 1;
Pval = T_ra(i,index);
Pnext = T_ra(i,index+1);
if index == nB - 1
Switch = 0;
index = nB;
end
end
Flags(i,j,2) = index;
Flags(i,j,5) = T_ra(i,index);
end
end
end
%{
X = (1:200)';
Y1 = Flags(i,:,1)';
Y2 = T_ra(i,:);
plot(X,Y1,X,Y2)
%}
%Make a vector containing the peak locations:
PeakVec = 0;
count = 1;
for j = 1:nB
if Flags(i,j,1) == 1
PeakVec(1,count) = j;
count = count + 1;
end
end
% Now analyze peak pairs:
% Specify significant peak valley distance and significant bin
% distance:
%
% ^^^ NOTE: User can change following parameter to improve fit. We found this parameter is analogous to MoleL, specified above ^^^
%
SigD = MoleL;
%
% ^^^ NOTE: User can change following parameter to improve fit ^^^
%
SigBinD = nB*(.025); % = 2.5% of the total bin #.
% Decreasing the multiplied factor (0.025) will allow for peaks closer
% together to remain unique, increasing the factor causes more of the
% flagged peaks to be unified into single 'significant' peaks.
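% (For example, with nbins = 200 the significant bin distance is
% 200*0.025 = 5 bins.)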
%
%
Npeaks = length(PeakVec);
% Determine the peak pairwise interactions...each will have a right
% hand value and left hand value (except for the endpoints):
% 1: Unique
% 2: Combine
% 3: Ignore
% Create a matrix to receive these calculations:
PeakCode = zeros(2,Npeaks); % Row1 = to right, Row2 = to left
for j = 1:Npeaks-1
% Determine the peak bin numbers:
PBN1 = PeakVec(1,j);
PBN2 = PeakVec(1,j+1);
% Calculate the valley distances (left to right)
VD1 = abs(Flags(i,PBN1,4) - Flags(i,PBN1,5));
VD2 = abs(Flags(i,PBN2,6) - Flags(i,PBN2,4));
%
if VD1 <= SigD && VD2 <= SigD
% Combine
PeakCode(1,j) = 2;
PeakCode(2,j+1) = 2;
elseif VD1 >= SigD && VD2 <= SigD
% Left Peak sig, right peak ignore
PeakCode(1,j) = 1;
PeakCode(2,j+1) = 3;
elseif VD1 <= SigD && VD2 >= SigD
% Left peak ignore, right peak significant
PeakCode(1,j) = 3;
PeakCode(2,j+1) = 1;
elseif VD1 >= SigD && VD2 >= SigD
% Both peaks are significant
PeakCode(1,j) = 1;
PeakCode(2,j+1) = 1;
end
end
% Logic for the peak endpoints
Left = abs(Flags(i,PeakVec(1,1),4) - Flags(i,PeakVec(1,1),6));
if Left >= SigD
PeakCode(2,1) = 1;
else
PeakCode(2,1) = 3;
end
Right = abs(Flags(i,PeakVec(1,Npeaks),4) - Flags(i,PeakVec(1,Npeaks),5));
if Right >= SigD
PeakCode(1,Npeaks) = 1;
else
PeakCode(1,Npeaks) = 3;
end
% Now determine what to do with the peaks. Look at combinations
% reading left to right. Put significant peaks into a vector.
SPV = 0;
j = 1;
count = 1;
while j <= Npeaks
C1 = PeakCode(1,j);
C2 = PeakCode(2,j);
if C1 == 1 && C2 == 1
% This is a bona fide unique peak. Put it in the vector.
SPV(1,count) = PeakVec(1,j);
count = count + 1;
j = j+1;
elseif C1 == 2
% This peak will be merged with other peak(s)
count2 = 2;
Merge = PeakVec(1,j);
while C1 == 2
j = j + 1;
C1 = PeakCode(1,j);
Merge(1,count2) = PeakVec(1,j);
count2 = count2 + 1;
end
% Logic to eliminate valleys and slow rises by checking what is
% happening at the significant endpoints:
End1 = Flags(i,Merge(1,1),4) - Flags(i,Merge(1,1),6);
End2 = Flags(i,Merge(1,end),4) - Flags(i,Merge(1,end),5);
Distance = Merge(1,end) - Merge(1,1);
if End1 >= SigD && End2 >= SigD
SPV(1,count) = round(mean(Merge));
count = count + 1;
j = j + 1;
elseif End1 >= SigD && End2 <= (SigD)*(-1) && Distance >= SigBinD
SPV(1,count) = round(mean(Merge));
count = count + 1;
j = j + 1;
elseif End1 <= SigD*(-1) && End2 >= (SigD) && Distance >= SigBinD
SPV(1,count) = round(mean(Merge));
count = count + 1;
j = j + 1;
end
else % Represents 1/3, 3/1, and 3/3 combos which are all ignored!!
j = j + 1;
end
end
Nsigpeaks = length(SPV);
for j = 1:Nsigpeaks
PeakLocAvg(i,j) = SPV(1,j);
end
%{
% Plot PeakLocAvg to see how the algorithm performs:
S = Nsigpeaks;
FF = zeros(1,nB);
for k = 1:S
Temp = PeakLocAvg(i,k);
FF(1,Temp) = 1;
end
X = (1:200)';
Y1 = FF(1,:)';
Y2 = T_ra(i,:);
plot(X,Y1,X,Y2)
stop = 1;
%}
end


return
end

function [u,sig,t,iter] = fit_mix_gaussian( X,M )
%
% Gaussian Fit Function by Ohad Gal (c) 2003
% Downloaded on 3.21.2011 from:
% http://www.mathworks.com/matlabcentral/fileexchange/4222-a-collection-of-fitting-functions
%
% fit_mix_gaussian - fit parameters for a mixed-gaussian distribution using EM algorithm
%
% format: [u,sig,t,iter] = fit_mix_gaussian( X,M )
%
% input: X - input samples, Nx1 vector
% M - number of gaussians which are assumed to compose the distribution
%
% output: u - fitted mean for each gaussian
% sig - fitted standard deviation for each gaussian
% t - probability of each gaussian in the complete distribution
% iter- number of iterations done by the function
%
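% Example call (as used in the NoRSE main function above, fitting a single
% Gaussian to one denoised trace):
%   [u,~,~,~] = fit_mix_gaussian( T_denoise(:,i), 1 );
%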
% initialize and initial guesses
N = length( X );
Z = ones(N,M) * 1/M; % indicators vector
P = zeros(N,M); % probabilities vector for each sample and each model
t = ones(1,M) * 1/M; % distribution of the gaussian models in the samples
u = linspace(min(X),max(X),M); % mean vector
sig2 = ones(1,M) * var(X) / sqrt(M); % variance vector
C = 1/sqrt(2*pi); % just a constant
Ic = ones(N,1); % - enable a row replication by the * operator
Ir = ones(1,M); % - enable a column replication by the * operator
Q = zeros(N,M); % user variable to determine when we have converged to a steady solution
thresh = 1e-3;
step = N;
last_step = inf;
iter = 0;
min_iter = 10;

% main convergence loop, assume gaussians are 1D
while ((( abs((step/last_step)-1) > thresh) & (step>(N*eps)) ) | (iter<min_iter) )

% E step
% ========
Q = Z;
P = C ./ (Ic*sqrt(sig2)) .* exp( -((X*Ir - Ic*u).^2)./(2*Ic*sig2) );
for m = 1:M
Z(:,m) = (P(:,m)*t(m))./(P*t(:));
end

% estimate convergence step size and update iteration number
prog_text = sprintf(repmat( '\b',1,(iter>0)*12+ceil(log10(iter+1)) ));
iter = iter + 1;
last_step = step * (1 + eps) + eps;
step = sum(sum(abs(Q-Z)));
fprintf( '%s%d iterations\n',prog_text,iter );

% M step
% ========
Zm = sum(Z); % sum each column
Zm(find(Zm==0)) = eps; % avoid division by zero
u = (X')*Z ./ Zm;
sig2 = sum(((X*Ir - Ic*u).^2).*Z) ./ Zm;
t = Zm/N;
end

sig = sqrt( sig2 );

return
end