Unset better formulae for POISSCDF, BINOCDF, NBINCDF:
Contents
Description of the problem: POISSCDF
The function poisscdf utilizes the cumsum function on poisspdf. This results in
- longer computational times (on the order of a few seconds for x = 1e7, no matter what lambda is)
- memory problems for x larger than about 1e8.
When the ratio x/lambda gets large, poisscdf approaches 1, and summing the poisspdf is largely a waste of time. However, for finite x/lambda the computation yields nontrivial values away from 1. Some examples are
clear; x=1e7; lam=1; tic, p1 = poisscdf(x,lam); toc, p1 x=1e7; lam=1e7; tic, p1 = poisscdf(x,lam); toc, p1
Elapsed time is 4.765935 seconds. p1 = 1.000000000000000 Elapsed time is 5.077998 seconds. p1 = 0.500084107548683
Suggested Solution (Peter Perkins 2003)
The function poisscdf(x,lambda), it turns out, is equal to 1-gamcdf(lambda,x+1,1) (Lawless, 2nd ed., pg43, ex. 1.6c). For example:
x=1e7; lam=1e7; tic, p1 = poisscdf(x,lam); toc, p1 tic, p2 = 1-gamcdf(lam,x+1,1); toc, p2
Elapsed time is 5.044705 seconds. p1 = 0.500084107548683 Elapsed time is 0.028917 seconds. p2 = 0.500084104393297
The two formulas agree up to 8 significant digits in this example. We will show in the next section that this is not always the case.
Accuracy of the Suggestion
We may assume that poisscdf is more accurate than 1-gamcdf, as the latter involves numerical integration of the gamma distribution function and therefore may more readily prone to errors. For an error check, we run the loop:
lamlist=10.^(1:7); % Number list for lam=lambda for k=1:numel(lamlist); lam=lamlist(k); std=round(sqrt(lam)); x=max(1,lam-40*std):lam+10*std; % Choose x around lambda, where the % distribution takes non-trivial values p1list = poisscdf(x,lam); p2list = 1-gamcdf(lam,x+1,1); imp=find((p1list>1e-300 & p2list>1e-300)|p1list>1e-290|p2list>1e-300); p1list=p1list(imp);p2list=p2list(imp);x=x(imp); relerror(k)=max(abs(p1list-p2list)./(abs(p1list)+(p1list==0))); abserror(k)=max(abs(p1list-p2list)); end loglog(lamlist,max(relerror,1e-16),'r',lamlist,abserror,'g'); xlabel('\lambda');axis([0 lamlist(end) 1e-16 1e4]) legend('Relative Error','Absolute Error','Location','NorthWest') title('Replacement of poisscdf by 1-gamcdf')
Although absolute errors are reasonable, the relative error is always %100 because 1-gamcdf can never be less than 1e-16. If the true answer is less than 1e-16, the user is going to get just 'zero'.
Alternative Suggestion for POISSCDF:
It turns out there is a simpler formulation for poisscdf, which satisfies poisscdf(x,lam)=gammainc(lam,x+1,'upper'), given in wikipedia (Did not try hard but I am sure it can be found elsewhere, too. In addition, it's easy to obtain it by successive integration by parts on the gammainc function)
clear lamlist=10.^(2:5); % Number list for lam=lambda for k=1:numel(lamlist); lam=lamlist(k); std=round(sqrt(lam)); x=max(1,lam-40*std):lam+10*std; % Choose x around lambda, where the % distribution takes non-trivial values p1list = poisscdf(x,lam); p2list = gammainc(lam,x+1,'upper'); imp=find((p1list>1e-300 & p2list>1e-300)|p1list>1e-290|p2list>1e-300); p1list=p1list(imp);p2list=p2list(imp);x=x(imp); relerror(k)=max(abs(p1list-p2list)./abs(p1list)); abserror(k)=max(abs(p1list-p2list)); end loglog(lamlist,max(relerror,1e-16),'r',lamlist,abserror,'g'); xlabel('\lambda'); legend('Relative Error','Absolute Error','Location','NorthWest') title('Replacement of poisscdf by gammainc')
Although this formulation does much better than the one suggested in the previous section, it still does poorly at large lambda.
Similar considerations for BINOCDF and NBINCDF:
It is suggested that for large values of x, the relation binocdf(x,n,p)=1-betainc(p,x+1,n-x) = betainc(1-p,n-x,x+1) from (26.5.2 A&S) be used. We prefer the latter formulation as it would be more accurate near zero. A repetition of the numerical experiments can easily be run as follows:
% LOOP FOR BINOCDF: p=0.5; nlist=10.^(1:1:7); for k=1:numel(nlist); n=nlist(k); mu=round(n*p);std=round(sqrt(n*p*(1-p))); % Choose x around mu, where the distribution takes non-trivial values x=max(1,mu-40*std):min(n-1,mu+10*std); p1list = binocdf(x,n,p); p2list = betainc(1-p,n-x,x+1); imp=find((p1list>1e-300 & p2list>1e-300)|p1list>1e-290|p2list>1e-300); p1list=p1list(imp);p2list=p2list(imp); relerror12(k)=norm((p1list-p2list)./p1list,'inf'); abserror12(k)=norm(p1list-p2list,'inf'); end loglog(nlist,relerror12,'r',nlist,abserror12,'g'); xlabel('n'),legend('Relative Error','Absolute Error','Location','NorthWest') title('Replacement of binocdf by betainc')
For nbincdf, the formula is nbincdf(x,r,p) = betainc(p,r,x+1).
LOOP FOR NBINCDF:
p=0.5; rlist=10.^(1:1:5); for k=1:numel(rlist); r=rlist(k); mu=r*(1-p)/p; std=sqrt(r*(1-p)/p^2); % Choose x around mu, where the distribution takes non-trivial values x=round(max(0,mu-40*std)):round(mu); p3list = nbincdf(x,r,p); p4list = betainc(p,r,x+1); imp=find((p3list>1e-300 & p4list>1e-300)|p3list>1e-290|p4list>1e-290); p3list=p3list(imp);p4list=p4list(imp); relerror34(k)=norm((p3list-p4list)./p3list,'inf'); abserror34(k)=norm(p3list-p4list,'inf'); end loglog(rlist,relerror34,'r',rlist,abserror34,'g'); xlabel('n'); legend('Relative Error','Absolute Error','Location','NorthWest') title('Replacement of nbincdf by betainc')
The situation is worse for nbincdf, where both the absolute and the relative errors quickly grow big.
Notes and Conclusions
- The idea of poisscdf=1-gamcdf is not good because when gamcdf is close to 1, the digits after the sixteenth decimal place are lost. Alternatively, we have suggested the use of gammainc('upper') function in place of poisscdf, which gives a better match, but still does poorly.
- Similarly, the suggestions for binocdf and nbincdf fail to be accurate at large values of x.
- Conclusion: Either the function gammainc and betainc are not calculated accurately, or there must be some round-off error in summing about 1e5 numbers. I personally think that the former is more likely to be the case.
- Side Observation: nbinpdf(3000,200,0.5) yields NaN, instead of 0. This can easily be fixed by replacing the last two command lines in nbinpdf by "y(k) = exp(gammaln(r(k) + x(k)) - gammaln(x(k) + 1) - gammaln(r(k)) +... r(k).*(log(p(k)))+x(k).*(log(1-p(k))));"
- Unlikely, but if any of the replacements for binocdf or nbincdf are to be used, the results should be validated for a large range of values for p (I've only tried 0.05 0.1 0.5).
- Trivia: Both betainc(1-p,n-x,x+1) and betainc(p,x+1,n-x) yield NaN when x=n. So, if any replacement were to be done, the extreme case binocdf(n,n,p) = 1 would have been handled separately (as it is already done now in the current version).
- In Gecko (5) Perkins also suggested that the formula 26.5.23 in A&S may be useful in evaluating the hygecdf, however, we were not able to find any useful relations for hygecdf except the fact that the hypergeometric distribution probabilities are the coefficients in the Taylor series for the hypergeometric function.