Improvement to Binornd Function

Contents

Problem Description

Geck 179533: "A customer at the Zurich FMA focus group noted that binornd is too slow. I suspect that's because we create a large binary matrix and sum it, which is going to be slow for large N"

Binornd function is indeed much slower than it could potentially be:

tic, binornd(1e4,0.4,40);toc
Elapsed time is 1.426907 seconds.

because it is using a very inefficient high-school algorithm, namely

This algorithm takes O(N) time and memory. Yes, binornd crashes when N*numberOfRandomVariables >~ 1e8 (depending on the computer).

Some alternative algorithms are

Solution

In this report, we will choose the recursive method because it is much simpler to code and understand than the acceptance-rejection method. In addition, we believe that betarnd is fast enough, and O(loglog(N)) is almost as good as O(1) in practice, given that loglog of the largest representable number is about 6.57.

The recursive algorithm as described in pages 536-7 of Devroye(1986) -see also errata for a typo on p537- reduces the problem of generating a (N,p) binomial random variate to generating another binomial variate with smaller N*p. The number N may persistently remain large, but the product N*p is bound to get small quickly.

This algorithm needs to be supported by a random variate generator that is fast for small N*p (the matlab's current one is not suitable because it is O(N)). For this purpose we choose the first waiting algorithm described on page 525, which is indeed O(N*p).

Those are implemented in the file mybinornd.m.

Comparison and Testing

The new algorithm is much superior than the original one in terms of computational time:

tic, binornd(1e4,0.4,60);toc
tic, mybinornd(1e4,0.4,60);toc
Elapsed time is 3.598106 seconds.
Elapsed time is 0.027788 seconds.

It is also more memory efficient:

try
    binornd(2e4,0.3,100);
catch
    u=lasterror; display(u.message)
end
display('whereas ')
mybinornd(2e4,0.3,100);
display('generates no errors')
Error using ==> rand
Out of memory. Type HELP MEMORY for your options.
whereas 
generates no errors

But does it give the right answer? Here's one attempt to check that

N=100;
p=0.25;
mu=round(N*p); std=round(sqrt(N*p*(1-p)));  % Find mu and std for plotting purposes
M=10000; % Sample size
x=1:N; y=M*binopdf(x,N,p);  % The continous PDF curve

% Generate random variates
r1= mybinornd(N,p,M,1);
r0= binornd(N,p,M,1);

% Find out how many incidences of each case occured
yr1=zeros(1,N); yr0=yr1;
for k=1:M;
    yr1(r1(k))=yr1(r1(k))+1;
    yr0(r0(k))=yr0(r0(k))+1;
end

% Plot and compare with PDF
subplot(1,2,1), hold off
bar(x,yr1,'b'), hold on, plot(x,y,'m','LineWidth',2);
axis([mu-5*std mu+5*std 0 max([yr1(:);yr0(:)])+1])
xlabel('r'), ylabel('Number of incidents with X = r')
title('mybinornd')

subplot(1,2,2), hold off
bar(x,yr0),hold on, plot(x,y,'m','LineWidth',2);
axis([mu-5*std mu+5*std 0 max([yr1(:);yr0(:)])+1])
xlabel('r')
title('binornd')

The curious reader is urged to play with the parameters of the above code. We can also simulate with larger parameters, but this time we do not call binornd because it is too slow and it generates memory error.

clear, clf
N=1000;
p=0.45; mu=round(N*p); std=round(sqrt(N*p*(1-p)));
M=1e6;
x=1:N; y=M*binopdf(x,N,p); s=5;

r1= mybinornd(N,p,M,1);
yr1=zeros(1,N);
for k=1:numel(r1);
    yr1(r1(k))=yr1(r1(k))+1;
end
bar(x,yr1), hold on, plot(x,y,'m','LineWidth',2);
axis([mu-s*std mu+s*std 0 max(yr1(:)+1)])
xlabel('r'), ylabel('Number of incidents with X = r')
title('mybinornd')

What we see is a close agreement with the binopdf curve when the sample size is large.

Conclusion

In this report, we have implemented a recursive algorithm for binornd, which is

For large sample size, we have found that the generated random variates agree with the PDF function.

The only disadvantage for this algorithm could be its dependence on betarnd, which we hope uses an efficient algorithm.

rithm could be its dependence on % betarnd, which we hope uses an efficient algorithm. % % % ##### SOURCE END ##### --> tml>