Some Thoughts over the BINOCDF function

Contents

Problem Desription

The binocdf function currently does "cumsum" on the binopdf function. This results in unnecessarily slow performance and memory failures at large x. For example:

try
    binocdf(5e7,1e8+1,0.5)
catch
    error=lasterror;Error_from_binocdf=error.message
end

% The current binocdf can sum terms up to about 1e7 (depending on the
% computer) and it is hard to say that it is efficient.
Error_from_binocdf =

Error using ==> zeros
Out of memory. Type HELP MEMORY for your options.

Some useful Modifications: MBINOCDF

For large N, the binomial distribution approaches the normal distribution, and the region where this distribution is not negligible is order std = sqrt(N*p*(1-p)). For large N, the binopdf takes non-negligible values only around the mean mu = N*p. If p=q=.5, then the pdf dies to below eps(1) at about 10*std from the mean mu, and below eps(0) at about 40*std. If p ~=q, this may not be true, but the range where pdf is non-negligible is order sqrt(N), so there is no reason to call binopdf to calculate the zeros again and again.

For asypoisspdf, we have implemented the newtons method on the log function of the terms to find the range of non-zero terms. For variety, here we implement a much algorithm in mbinocdf, as follows:

Then, we "cumsum" the binopdf function over the non-negligible region.

The resulting routine is called mbinocdf, and it works for a larger range or parameters:

format long
p1=mbinocdf(5e7,1e8+1,0.5)
p1 =

   0.499999973346127

Here the answer needs to be 0.5, but is off a few digits because of the round-off errors.

% mbinocdf is usually more efficient than binocdf:
tic;binocdf(1e4,1.4e4,0.4);toc
tic;mbinocdf(1e4,1.4e4,0.4);toc
Elapsed time is 0.019459 seconds.
Elapsed time is 0.006737 seconds.

In mbinocdf, we also check for immediate zeros or ones and do not compute them:

p1=mbinocdf(1e10,1e20,0.4)
p2=mbinocdf(9e19,1e20,0.4)
p1 =

     0


p2 =

     1

The regular binocdf would fail on those occasions.

Replacement of BINOCDF with BETAINC at high Np:

The CDF for binomial distribution satisfies the identity binocdf(x,n,p) = betainc(1-p,n-x,x+1). The betainc function is computed by either continued fractions at small parameters, or by an asymptotic formulation of A&S (26.5.20) at large parameters.

There is also alternate asymptotic formulation of the betainc function by N.M. Temme, which is claimed to be accurate up to O(1/N) (ref: Special Functions: An introduction to the Classical Functions of Mathematical Physics, p.292-294)

Since the binocdf function generates some round-off errors, we are not going to use it as a point of reference. Instead, we will "diff" the cdf functions given by the two formulations of the betainc along with mbinocdf and compare the result to the binopdf, which we know is accurate.

The following loop does the job:

clear
Nlist=1+[1e4:4e4:9e4 1e5:4e5:9e5 1e6:4e6:9e6 1e7:4e7:9e7 ...
    1e8:4e8:9e8 1e9:4e9:9e9 1e10 1e11];

for k=1:numel(Nlist);

    % Set parameters
    N=Nlist(k);

    p=0.01;q=1-p; mu=round(N*p); std=round(sqrt(N*p*q));
    x=mu-40*std:mu-std;

    % Calculate cdf and pdf using three methods
    p0=mbinocdf(x,N,p);
    p1=m3binocdf(x,N,p); % Uses Temme's Asymptotic form
    p2=m2binocdf(x,N,p); % Uses A&S's asymptotic form
                         % (also in current version of matlab)

    y=binopdf(x,N,p); y=y(2:end);
    y0=diff(p0);y1=diff(p1);y2=diff(p2);

    % Error analysis: Make sure the cut-off is the same for all methods
    imp=find((y>1e-300 & y1>1e-300)| y>1e-290 | y1>1e-290 |y2> 1e-290);
    y1=y1(imp);y=y(imp); x=x(imp);p1=p1([imp max(imp)+1]);
    y0=y0(imp); y2=y2(imp);

    [er1(k) ier1]=max(abs(y-y1));
    er0(k)=norm(y-y0,'inf'); er2(k)=norm(y-y2,'inf');
    [rer0(k) irer0]=max(abs((y-y0)./y));
    [rer1(k) irer1]=max(abs((y-y1)./y));
    [rer2(k) irer2]=max(abs((y-y2)./y));
    xr=x(irer1); yr=y(irer1); y1r=y1(irer1);
    p1r=p1(irer1); p1s=p1(irer1+1);zr=(xr-mu)/std;

end

loglog(Nlist,rer0,'b--',Nlist,rer1,'k-.',Nlist,rer2,'r');
xlabel('N'), ylabel('Relative Error'), axis([Nlist(1) Nlist(end) 1e-16 1e4])
legend('cumsum','temme','abetainc','Location','West'), title(['p = ' num2str(p)])
open temme2.fig

It is important to restrict ones self to important indices "imp", otherwise we obtain %100 relative error because the cut-off values of the betainc function occur a little earlier than the binopdf.

Note that the error analysis remains similar for a large range of the parameter p (we have tried p=0.01 and p=0.99).

In addition, the absolute errors are remain small at all times at large parameters, this can be checked by plotting er1,er2 instead of rer1,rer2 in the code:

loglog(Nlist,er0,'b--',Nlist,er1,'k-.',Nlist,er2,'r');
xlabel('N'), ylabel('Absolute Error'), axis([Nlist(1) Nlist(end) 1e-20 1])
legend('cumsum','temme','abetainc'), title(['p = ' num2str(p)])

Unfortunately, we cannot go to higher values to investigate the behaviour of the error due to memory problems. Also the errors are relatively bigger as the parameter p approaches 0 or 1, although they still remain small.

As far as the available data shows, the cumsum method is the most accurate one among others, it is also the most computationally involving one. We think that the range that mbinocdf take, i.e. about up to 1e10, will almost always be more than enough for any meaningful applications.

Still, the ultimate cdf function mybinocdf can call betainc function for higher values.

The conclusions of this section are

The Gram-Charlier Expansion

The Edgeworth or Gram-Charlier expansion describes the pdf or cdf of a distribution in terms of its cumulants (ref: http://en.wikipedia.org/wiki/Edgeworth_series) For example

The Gram-Charlier expansions remain valid in the central region, and beyond,depending on their order. The second order GC expansion, theoretically, shall remain valid in a region of order N^0.75. If N is large, the binopdf is already zero (i.e. less than eps(0)) outside this region. This motivates us to study the Gram-Charlier expansion as a means to compute the binocdf at large N.

In the following code, we test the accuracy of the Gaussian and 2nd order Gram-Charlier approximations of the binomial distribution at large N:

clear
Nlist=[1e5:4e5:9e5 1e6:4e6:9e6 1e7:4e7:9e7 1e8:4e8:9e8 5e9 1e10];

for k=1:numel(Nlist);

% Set the parameters:
N=Nlist(k);
p=0.55; q=1-p; mu=round(N*p); std=ceil(sqrt(N*p*q));
x=mu-40*std:mu;

% Calculate the CDFs
p0=mbinocdf(x,N,p);
p1=m4binocdf(x,N,p);  % Calculate Gaussian Approximation
p2=m5binocdf(x,N,p);  % Calculate 2nd order GC Approximation

%semilogy(x,p0,'b',x,p1,'r');

% Error Analysis
imp=find(p0>1e-300 & p1>1e-300);
p0=p0(imp);p1=p1(imp);p2=p2(imp);x=x(imp);

[rer1(k) ierd]= max(abs((p1-p0)./p0));
[rer2(k) ierd]= max(abs((p2-p0)./p0));

end

loglog(Nlist,rer1,'r',Nlist,rer2,'b--'),
xlabel('N'), ylabel('Relative Error')
legend('Gaussian','2nd order GC'),  title(['p = ' num2str(p)])
open GC2.fig

The Approximation due to Peizer and Pratt

This approximation can be found on page 115 of Kemp, the equation 3.32. It is supposed to have less than 0.1 percent error. The delta_x is missing an absolute value operator.

The approximation is not doing well when delta_x gets small and the inversion 1/delta_x may have large errors. So we test it in a region away from the center by an std.

Nlist=[1e5:4e5:9e5 1e6:4e6:9e6 1e7:4e7:9e7 1e8:4e8:9e8 5e9 1e10 1e11];

for k=1:numel(Nlist);

    % Calculate the CDFs
    N=Nlist(k);
    p=0.5; q=1-p; mu=round(N*p); std=ceil(sqrt(N*p*q));
    x=max(1,mu-40*std):mu-std;

    p0=mbinocdf(x,N,p);
    p1=pbinocdf(x,N,p);

    % Error Analysis
    imp=find(((p0>1e-300 & p1>1e-300) | p1>1e-290) | p0> 1e-290);
    p0=p0(imp);p1=p1(imp);x=x(imp);

    [rer1(k) ierd]= max(abs((p1-p0)./p0));
    er1(k)= max(abs((p1-p0)));
end

loglog(Nlist,rer1,'r',Nlist,er1,'b'), xlabel('N')
legend('Relative Error','Absolute Error','Location','SouthEast')
title('Peizer and Pratts Approximation')

At this point, we have confidence in the accuracy of mbinocdf, therefore we have used it as a reference point in the error analysis. The Gram-Charlier expansion seems to be doing quite well, but only for p values away from 0 and 1. While for p=0.5, even the gaussian approximation (which corresponds with the 1st order GC, because skewness is zero) is pretty good, when p=0.1 or 0.9, it is way off. In those cases, the second order Gram-Charlier could still be somewhat acceptable, but it takes to go to higher values of N to observe the convergence.

The conclusion of this section is that the Gram-Charlier expansions are not in general a good idea to compute binocdf because of the asymmetries.

Summary and Conclusions

In this report, we have suggested some little modifications to the function binocdf to make it more efficient and applicable on a larger range. We also analyzed two possible betainc replacements for binocdf at larger values of N. Those replacements seem to be doing well with errors around 1e-3 until 1e10, however they are much less accurate than summing the pdf function. Finally, we have investigated the Gram-Charlier approximation to binocdf, which does well when p equals or close to 0.5, but it is not accurate at all when p gets close to zero or 1.

te than summing the pdf function. % Finally, we have investigated the Gram-Charlier approximation to binocdf, % which does well when p equals or close to 0.5, but it is not accurate at all % when p gets close to zero or 1. ##### SOURCE END ##### -->