next up previous contents index
Next: 4.8.3 Comparison of two Up: 4.8 Rounded interval arithmetic Previous: 4.8.1 Double precision floating   Contents   Index

4.8.2 Extracting the exponent from the binary representation

To calculate $ ulp$ it is necessary to extract the integer value of the exponent from the binary representation. Recall that the value of the significand $ B$ of a double precision number $ X$ is:

$\displaystyle B = 1 + b_1 2^{-1} + b_2 2^{-2} + \cdots + b_{52} 2^{-52}\;,$     (4.56)

and that the double precision value $ X = (-1)^s 2^E B$ . The value of the least significant bit $ b_{52}$ is $ 2^{-52}$ . Thus, the value of $ ulp$ is $ 2^E 2^{-52} = 2^{E-52}$ .

The value of $ ulp$ can be computed using the standard C mathematical functions frexp() and ldexp() [195, pp. 250-51] as follows [246,4]:

Algorithm 4.1  
   

    #include <math.h>            /* standard C math library header */

    double ulp(double x)
    {
      double ulp;                /* ulp of x */
      int    exp;                /* exponent of x, where exp = E+1 */

      frexp(x, &exp);            /* extract exponent of x */
      ulp = ldexp(0.5, exp-52);  /* calculate ulp = 0.5^(exp-52) */

      return ulp                 /* return ulp */
    }

(Note that the function frexp() assumes that $ 0.5 \leq B < 1$ . Recall that the unbiased exponent $ E$ defined by IEEE-754 assumes that $ 1 \leq
B < 2$ , thus $ exp = E+1$ . In papers [57,58,178,179,180,181,246,254,255,425,427] the convention of assuming $ 0.5 \leq B < 1$ is followed.)

Because of the use of standard library functions, this implementation is slow. To avoid using the library functions and to construct the $ ulp$ directly, recall that the biased exponent $ e$ occupies bits 62 through 52. If we could manipulate the binary representation as a 64-bit integer, we could extract $ e$ by dividing by $ 2^{52}$ , which would right-shift the bit pattern by 52 bits, placing $ e$ in bits 10 through 0. The sign bit $ s$ , which would then occupy bit 11, could be removed by performing a bitwise logical AND with the 64-bit mask $ 0
\cdots 011111111111$ [4].

Most commercially available processors and programming languages, however, do not support 64-bit integers; generally, only 8, 16, and 32-bit integers are available. To overcome this, we can overlay the storage location of the 64-bit double precision number with an array of four 16-bit (short) integers:

\begin{figure*}\centerline{
\psfig{figure=fig/fp_rep.eps,height=0.8in}
}
\end{figure*}

In C or C++ this can be accomplished using the union data structure:

    typedef union {
      double         dp;     /* the 64-bit double precision value */
      unsigned short sh[4];  /* overlay an array of  */
    } Double;                /* 4 16-bit  integers */
After the assignment:
    double x;                /* the double precision value */
    Double D;                /* copy of x */

    D.dp = x;
the exponent of the variable x can be extracted from D.sh[0], whose 16 bits contain the sign bit $ s$ (bit 15), the 11-bit biased exponent $ e$ (bits 14 through 4), and the 4 most significant bits $ b_1 b_2 b_3 b_4$ of the mantissa $ m$ (bits 3 through 0):
15 14 $ \cdots$ 4 $ 3 \;\: 2 \;  1 \;  0$
$ s$ $ e$ $ b_1 b_2 b_3 b_4$

The biased exponent $ e$ can be extracted from D.sh[0] by performing a bitwise logical AND with the 16-bit mask $ 0111111111110000$ to zero-out the sign bit and the four most significant bits of the mantissa:

15 14 $ \cdots$ 4 $ 3 \; 2 \; 1\; 0$
0 $ e$ $ 0 \; 0 \; 0 \; 0$



and then right-shifting $ e$ by 4 bits:



15 $ \; \cdots \;$ 11 10 $ \;\; \cdots \;\;$ 0
0 $ \;\; \cdots \;\;$ 0 $ e$




Then the unbiased exponent $ E = e - 1023$ .

Some processor architectures order the four 16-bit elements of the array sh in the reverse order, in other words, sh[0] is the rightmost (least significant) 16-bit word, not the leftmost (most significant). To avoid the problem, we can define the constant MSW to indicate the proper index of the left-most 16 bit array element:

    #define MSW 0     /* 0 if the left-most 16-bit short is sh[0] */
                      /* 3 if the left-most 16-bit short is sh[3] */

When the $ ulp$ is a denormalized number a special case needs to be taken. Recall that $ ulp = 2^{E-52}$ and $ E$ must be greater than $ -1023$ . Thus, if $ E \leq -971$ (or equivalently, $ e
\leq 52$ , since $ e = E + 1023$ ) then $ 2^{E-52}$ can only be represented as a denormalized number with biased exponent $ e_{ulp} =
0$ and mantissa $ m_{ulp} = b_i$ , for $ 0 \leq i \leq 51$ , where:

$\displaystyle b_i = \left\{ \begin{array}{ll} 1 & \mbox{if $i = e-1$} \ 0 &
\mbox{otherwise}\;. \end{array} \right.$     (4.57)

If $ e > 52$ then $ ulp$ can be represented by the normalized number with $ e_{ulp} = e-52$ and $ m_{ulp} = 0 \cdots 0$ .

The following function directly constructs $ ulp$ as the appropriate normalized or denormalized number [4]:

Algorithm 4.2  
    
    static unsigned short mask[16] = { /* bit masks for bits 0 - 15 */
      0x0001, 0x0002, 0x0004, 0x0008, 0x0010, 0x0020, 0x0040, 0x0080,
      0x0100, 0x0200, 0x0400, 0x0800, 0x1000, 0x2000, 0x4000, 0x8000};

    double ulp(double x)
    {
      Double U,     /* ulp of x */
             X;     /* working copy of x */
      int    bit,   /* position of bit e-1 in 16-bit word */
             e1,    /* biased exponent - 1 */
             word;  /* index of 16-bit word containing bit e-1 */

      X.dp = x;
      X.sh[MSW] &= 0x7ff0;  /* isolate exponent in 16-bit word */

      /* X.sh[0] now holds the exponent in bits 14-4 */

      U.dp = 0.0;     /* initialize exponent and mantissa to 0 */

      if (X.sh[MSW] > 0x0340)      /* ulp is normalized number */
        U.sh[MSW] = X.sh[MSW]-0x0340; /*  set exponent to e-52 */

      /* the value 0x0340 is 52 left-shifted 4 bits, 
         i.e. 0x0340 = 832 = 52<<4 */

      else {                     /* ulp is denormalized number */
        e1 = (X.sh[MSW]>>4) - 1;       /*  biased exponent - 1 */
        word = e1>>4;   /* find 16-bit word containing bit e-1 */
        if (MSW == 0) word = 3 - word;   /* compensate for word 
                                            ordering */
        bit  = e1%16;   /*  find the bit position in this word */
        U.sh[word] |= mask[bit];          /*  set the bit to 1 */
      }

      return U.dp;          /* return ulp */
    }

(Note that the C right-shift operation $ n»m$ is equivalent to integer division, $ n/2^m$ . Similarly, the left-shift $ n«m$ is equivalent to integer multiplication, $ n \cdot 2^m$ .)

This implementation correctly and efficiently computes the $ ulp$ . For example, for the value $ X$ = +2.2250738585072014 $ \times $ $ 10^{-308}$ (the smallest positive normalized double precision number) the $ ulp$ = +4.9406564584124654 $ \times $ $ 10^{-324}$ , which has the denormalized binary representation:




0$ \cdot$ 00000000000$ \cdot$ 0000 0000000000000000 0000000000000000 0000000000000001



For the value $ X$ = -1.7976931348623157 $ \times $ $ 10^{+308}$ (the largest negative normalized number) the $ ulp$ = +1.9958403095347198 $ \times $ $ 10^{+292}$ , which has the normalized binary representation:




0$ \cdot$ 11111001010$ \cdot$ 0000 0000000000000000 0000000000000000 0000000000000000


next up previous contents index
Next: 4.8.3 Comparison of two Up: 4.8 Rounded interval arithmetic Previous: 4.8.1 Double precision floating   Contents   Index
December 2009