Possible Improvement of NCX2CDF
Contents
Problem Description:
ncx2cdf(1e-5,8,320)
ans =
0
Gecko: "A customer reports that this should be 8.48437e-93. Knuesel's program gives that answer."
"MATLAB is summing a series by computing one term ans using a recurrence formula to get the value of the next term. It gives up when its first term is zero and the recurrence formula would not improve things. If I (tlane) many terms directly and sum them, I get 1.3574e-096. I am not sure which one is correct."
The noncentralized chi square distribution CDF is given by the sum
The first term j = 0 may sometimes be very important, and negligince of them may yield the wrong answer:
x=1e-5;k=8;lam=320;j=1:100; sum(poisspdf(j,lam/2).*chi2cdf(x,k+2*j))
ans = 1.3574e-096
The correct answer is
j=0:100; sum(poisspdf(j,lam/2).*chi2cdf(x,k+2*j))
ans = 8.4844e-093
Behaviour of poisspdf(j,lam/2) and chi2cdf(x,k+2*j) as functions of j
POISSPDF(j,lam/2): As a function of j, POISSPDF(j,lam/2) takes on its sole maximum at around j = lam/2, and drops to zero as j tends infinity. Its value at j = 0 is exp(-lam/2), which may not be small for small and moderate lam. Note that all these can be seen by differentiating the function
with respect to j. To illustrate, let us plot a couple of cases:
lam=10; j=0:30; subplot('position',[.15 .5 .3 .4]) semilogy(j,poisspdf(j,lam/2),'k',[lam/2 lam/2],[poisspdf(j(end),lam/2) 1],'r') axis([0 j(end) poisspdf(j(end),lam/2) 1]), xlabel('j'),ylabel('poisspdf(j,\lambda/2)'), title(['\lambda = ' num2str(lam)]) lam=320; j=0:500; subplot('position',[.6 .5 .3 .4]) semilogy(j,poisspdf(j,lam/2),'k',[lam/2 lam/2],[poisspdf(j(end),lam/2) 1],'r') axis([0 j(end) poisspdf(j(end),lam/2) 1]), xlabel('j'),ylabel('poisspdf(j,\lambda/2)'),title(['\lambda = ' num2str(lam)])
CHI2CDF(x,k+2*j): This function always has a maximum at j = 0. If x is small, it is sharply decreasing to zero from then on. If x is large, it is decreasing slowly, and the for the limiting case x = Inf it is a flat function 1.
To see that CHI2CDF(x,k+2*j) = gammainc(x,k+2*j) is non-increasing in j, observe that
Hence (gammainc(x,a))^{-1} is non-decreasing in a, and therefore gammainc(x,a) is non-increasing in a.
To illustrate, let us make a couple of plots:
x=1e-2; k=1; j=0:50; subplot('position',[.15 .5 .3 .4]) semilogy(j,chi2cdf(x,k+2*j)) axis([0 j(end) chi2cdf(x,k+2*j(end)) 10]), xlabel('j'),ylabel('chi2cdf(x,k+2j)'), title(['x = ' num2str(x) ', k = ' num2str(k)]) x=100; k=1; j=0:200; subplot('position',[.6 .5 .3 .4]) semilogy(j,chi2cdf(x,k+2*j)) axis([0 j(end) chi2cdf(x,k+2*j(end)) 10]), xlabel('j'),ylabel('chi2cdf(x,k+2j)'), title(['x = ' num2str(x) ', k = ' num2str(k)])
Why ncx2cdf fails?
We ask matlab:
why
Damian wanted it that way.
The ncx2cdf function is summing the terms poisspdf(j,\lambda/2) chi2cdf(x,k+2j) in the following way:
- Start at the peak of poisspdf(j,lam/2)
- Proceed in both direction with the use of recurrence formula
- Stop when the next term calculated hits zero.
As matlab answers, the reason of failure is clear: This algorithm assumes that the main contribution to the series is coming from around the term j = lam/2. Indeed, this is true most of the time.
When is this not the case? When x is small, as we have illustrated, the chi2cdf function may have a sharp peak at zero. In this case only the terms around j = 0 will contribute.
This is the problem the customer has pointed to as well. His x is 1e-5. But the problem actually sets in well before then. (see next section)
A little addendum to ncx2cdf: sncx2cdf
In order to avoid this problem, we can detect when the recursion stops prematurely while computing the terms from lam/2 to zero in decreasing order, and start the same recursive algorithm from zero (in increasing order). The resulting function is sncx2cdf (s stands for smart, not Sabri, that would be a capital letter).
Let's try:
sncx2cdf(1e-5,8,320)
ans = 8.4844e-093
Bingo.
Issues with sncx2cdf and the function sancx2cdf
However, it is too early to conclude that we are done. In fact, the following plot is alarming:
clf x=0:0.01:10; y=ncx2cdf(x,8,320); ys=sncx2cdf(x,8,320); semilogy(x,y,'b','LineWidth',2), hold on semilogy(x,ys,'r'), xlabel('x'), ylabel('ncx2cdf(x,8,320)') legend('ncx2cdf','sncx2cdf')
Below a certain threshold, we see that ncx2cdf plot is absent, because it yields zero. The 'improved' version sncx2cdf, however, goes unstable for those values, while sometimes yielding the correct value.
To understand why this happens, let us examine the plot
clf x=1.1;k=8;lam=320; j=0:20; summand=poisspdf(j,lam/2).*chi2cdf(x,k+2*j); semilogy(j,summand), xlabel('j'),ylabel('j^{th} term in sum')
The function sncx2cdf starts at j = 0, and tries to work its way up to the bigger terms in the sum around j = 8. Thus, it is calculating the bigger terms using the recursive formula on the smaller terms. A small error made in the smaller terms will multiply itself in the bigger terms.
To avoid that, we will calculate the terms whenever the recursive formula calculates bigger contributions, and this happens whenever
P_new .* C > P .* C_old iff C .* hdelta./counter > C_old
The resulting algorithm is named sancx2cdf. (This modification can also be done for the terms near the peak of poisspdf, and the resulting algorithm can be found in sabrncx2cdf.m - it is not clear to me if this also an improvement or not)
To conclude this section, let us make the same plots with sancx2cdf:
clf x=0:0.01:10; y=ncx2cdf(x,8,320); ysa=sancx2cdf(x,8,320); semilogy(x,y,'b','LineWidth',2), hold on semilogy(x,ysa,'r'), xlabel('x'), ylabel('ncx2cdf(x,8,320)') legend('ncx2cdf','sancx2cdf')
Analysis of sancx2cdf and ncx2cdf
The following code takes a while to run, and it compares the function ncx2cdf, sancx2cdf and a loop that blindly sums "Nterms" number of terms. The errors are pretty small for "lamlist", but they get larger (still less than a percent) for the set lamlist2. (The figure not generated here, just rename "lamlist2" to "lamlist" and rerun the code)
hold off,clear Ntermslist=[50:10:90 100:100:800]; for N=1:numel(Ntermslist); Nterms=Ntermslist(N); [N numel(Ntermslist)]; k=1; lamlist=[0:1:9 10:10:90 100:100:200]; lamlist2=[300:100:1000]; j=0:1:Nterms; for n=1:numel(lamlist); lam=lamlist(n); x=[0:0.01:0.09 0.1:0.1:0.9 1:1:9 10:10:90 100:100:1000]; for i=1:numel(x); y(i)=sum(poisspdf(j,lam/2).*chi2cdf(x(i),k+2*j)); end ysa=sancx2cdf(x,k,lam); yold=ncx2cdf(x,k,lam); abser(n)=norm(ysa-y,'inf');abser2(n)=norm(yold-y,'inf'); er(n)=norm((ysa-y)./(y+(y==0)),'inf');er2(n)=norm((yold-y)./(y+(y==0)),'inf'); end abserr(N)=max(er); abserr2(N)=max(er2); [err(N) ierr(N)]=max(er); err2(N)=max(er2); end loglog(Ntermslist,err,Ntermslist,err2); legend('sancx2cdf','ncx2cdf') xlabel('Number of terms summed'), ylabel('Relative Error')
Efficiency vs Accuracy
There are some values for which the improved function sancx2cdf still fails, for example
x=1;k=1;lam=1500; psa=sancx2cdf(x,k,lam) % our function j=0:500; term=poisspdf(j,lam/2).*chi2cdf(x,k+2*j); p=sum(term) % the accurate value
psa =
0
p =
8.0550e-312
This happens because the contribution to the series is coming from the middle values betweeen 0 and lam, and the 0-th term and lam-th term are both 'zero':
semilogy(j,term), xlabel('j'),ylabel('j^{th} term in sum')
One way to handle those cases is to 'blindly' sum a lot of terms. There could be more efficient algorithms as well. The question is whether we want to sacrifice the efficiency of the current algorithm just to be able to compute a few values on the order of 1e-300 more accurately. Since those values are rarely important, for this report, I have decided to leave the function sancx2cdf as it is.
Summary
In this report, we analyzed the problem pointed out by a customer in the ncx2cdf function. This issue shows up at smaller values of x when the first few terms in the summation formula are important. We proposed two little modifications to the existing function ncx2cdf and the new function can be found in file sancx2cdf.m.