Next: 4.8.3 Comparison of two Up: 4.8 Rounded interval arithmetic Previous: 4.8.1 Double precision floating   Contents   Index

4.8.2 Extracting the exponent from the binary representation

To calculate it is necessary to extract the integer value of the exponent from the binary representation. Recall that the value of the significand of a double precision number is:

    (4.56)

and that the double precision value . The value of the least significant bit is . Thus, the value of is .

The value of can be computed using the standard C mathematical functions frexp() and ldexp() [195, pp. 250-51] as follows [246,4]:

Algorithm 4.1  
   

    #include <math.h>            /* standard C math library header */

    double ulp(double x)
    {
      double ulp;                /* ulp of x */
      int    exp;                /* exponent of x, where exp = E+1 */

      frexp(x, &exp);            /* extract exponent of x */
      ulp = ldexp(0.5, exp-52);  /* calculate ulp = 0.5^(exp-52) */

      return ulp                 /* return ulp */
    }

(Note that the function frexp() assumes that . Recall that the unbiased exponent defined by IEEE-754 assumes that , thus . In papers [57,58,178,179,180,181,246,254,255,425,427] the convention of assuming is followed.)

Because of the use of standard library functions, this implementation is slow. To avoid using the library functions and to construct the directly, recall that the biased exponent occupies bits 62 through 52. If we could manipulate the binary representation as a 64-bit integer, we could extract by dividing by , which would right-shift the bit pattern by 52 bits, placing in bits 10 through 0. The sign bit , which would then occupy bit 11, could be removed by performing a bitwise logical AND with the 64-bit mask [4].

Most commercially available processors and programming languages, however, do not support 64-bit integers; generally, only 8, 16, and 32-bit integers are available. To overcome this, we can overlay the storage location of the 64-bit double precision number with an array of four 16-bit (short) integers:

In C or C++ this can be accomplished using the union data structure:

    typedef union {
      double         dp;     /* the 64-bit double precision value */
      unsigned short sh[4];  /* overlay an array of  */
    } Double;                /* 4 16-bit  integers */
After the assignment:
    double x;                /* the double precision value */
    Double D;                /* copy of x */

    D.dp = x;
the exponent of the variable x can be extracted from D.sh[0], whose 16 bits contain the sign bit (bit 15), the 11-bit biased exponent (bits 14 through 4), and the 4 most significant bits of the mantissa (bits 3 through 0):
15 14 4

The biased exponent can be extracted from D.sh[0] by performing a bitwise logical AND with the 16-bit mask to zero-out the sign bit and the four most significant bits of the mantissa:

15 14 4
0



and then right-shifting by 4 bits:



15 11 10 0
0 0




Then the unbiased exponent .

Some processor architectures order the four 16-bit elements of the array sh in the reverse order, in other words, sh[0] is the rightmost (least significant) 16-bit word, not the leftmost (most significant). To avoid the problem, we can define the constant MSW to indicate the proper index of the left-most 16 bit array element:

    #define MSW 0     /* 0 if the left-most 16-bit short is sh[0] */
                      /* 3 if the left-most 16-bit short is sh[3] */

When the is a denormalized number a special case needs to be taken. Recall that and must be greater than . Thus, if (or equivalently, , since ) then can only be represented as a denormalized number with biased exponent and mantissa , for , where:

    (4.57)

If then can be represented by the normalized number with and .

The following function directly constructs as the appropriate normalized or denormalized number [4]:

Algorithm 4.2  
    
    static unsigned short mask[16] = { /* bit masks for bits 0 - 15 */
      0x0001, 0x0002, 0x0004, 0x0008, 0x0010, 0x0020, 0x0040, 0x0080,
      0x0100, 0x0200, 0x0400, 0x0800, 0x1000, 0x2000, 0x4000, 0x8000};

    double ulp(double x)
    {
      Double U,     /* ulp of x */
             X;     /* working copy of x */
      int    bit,   /* position of bit e-1 in 16-bit word */
             e1,    /* biased exponent - 1 */
             word;  /* index of 16-bit word containing bit e-1 */

      X.dp = x;
      X.sh[MSW] &= 0x7ff0;  /* isolate exponent in 16-bit word */

      /* X.sh[0] now holds the exponent in bits 14-4 */

      U.dp = 0.0;     /* initialize exponent and mantissa to 0 */

      if (X.sh[MSW] > 0x0340)      /* ulp is normalized number */
        U.sh[MSW] = X.sh[MSW]-0x0340; /*  set exponent to e-52 */

      /* the value 0x0340 is 52 left-shifted 4 bits, 
         i.e. 0x0340 = 832 = 52<<4 */

      else {                     /* ulp is denormalized number */
        e1 = (X.sh[MSW]>>4) - 1;       /*  biased exponent - 1 */
        word = e1>>4;   /* find 16-bit word containing bit e-1 */
        if (MSW == 0) word = 3 - word;   /* compensate for word 
                                            ordering */
        bit  = e1%16;   /*  find the bit position in this word */
        U.sh[word] |= mask[bit];          /*  set the bit to 1 */
      }

      return U.dp;          /* return ulp */
    }

(Note that the C right-shift operation is equivalent to integer division, . Similarly, the left-shift is equivalent to integer multiplication, .)

This implementation correctly and efficiently computes the . For example, for the value = +2.2250738585072014 (the smallest positive normalized double precision number) the = +4.9406564584124654 , which has the denormalized binary representation:




0 00000000000 0000 0000000000000000 0000000000000000 0000000000000001



For the value = -1.7976931348623157 (the largest negative normalized number) the = +1.9958403095347198 , which has the normalized binary representation:




0 11111001010 0000 0000000000000000 0000000000000000 0000000000000000



Next: 4.8.3 Comparison of two Up: 4.8 Rounded interval arithmetic Previous: 4.8.1 Double precision floating   Contents   Index
December 2009