To calculate
it is necessary to extract
the integer value of the exponent from the binary representation.
Recall that the value of the significand
of a double precision
number
is:
![]() |
(4.56) |
The value of
can be computed using the standard C mathematical
functions frexp() and ldexp() [195, pp. 250-51] as follows [246,4]:
#include <math.h> /* standard C math library header */ double ulp(double x) { double ulp; /* ulp of x */ int exp; /* exponent of x, where exp = E+1 */ frexp(x, &exp); /* extract exponent of x */ ulp = ldexp(0.5, exp-52); /* calculate ulp = 0.5^(exp-52) */ return ulp /* return ulp */ }
Because of the use of standard library functions, this implementation
is slow. To avoid using the library functions and to construct the
directly, recall that the biased exponent
occupies bits 62
through 52. If we could manipulate the binary representation as a
64-bit integer, we could extract
by dividing by
, which
would right-shift the bit pattern by 52 bits, placing
in bits 10
through 0. The sign bit
, which would then occupy bit 11, could be
removed by performing a bitwise logical AND with the 64-bit mask
[4].
Most commercially available processors and programming languages, however, do not support 64-bit integers; generally, only 8, 16, and 32-bit integers are available. To overcome this, we can overlay the storage location of the 64-bit double precision number with an array of four 16-bit (short) integers:
In C or C++ this can be accomplished using the union data structure:
typedef union { double dp; /* the 64-bit double precision value */ unsigned short sh[4]; /* overlay an array of */ } Double; /* 4 16-bit integers */After the assignment:
double x; /* the double precision value */ Double D; /* copy of x */ D.dp = x;the exponent of the variable x can be extracted from D.sh[0], whose 16 bits contain the sign bit
15 | 14 ![]() |
![]() |
![]() |
![]() |
![]() |
The biased exponent
can be extracted from D.sh[0] by
performing a bitwise logical AND with the 16-bit mask
to zero-out the sign bit and the four most
significant bits of the mantissa:
15 | 14 ![]() |
![]() |
0 | ![]() |
![]() |
15
![]() |
10
![]() |
0
![]() |
![]() |
Some processor architectures order the four 16-bit elements of the array sh in the reverse order, in other words, sh[0] is the rightmost (least significant) 16-bit word, not the leftmost (most significant). To avoid the problem, we can define the constant MSW to indicate the proper index of the left-most 16 bit array element:
#define MSW 0 /* 0 if the left-most 16-bit short is sh[0] */ /* 3 if the left-most 16-bit short is sh[3] */
When the
is a denormalized number a special case needs to be
taken. Recall that
and
must be greater than
. Thus, if
(or equivalently,
, since
) then
can only be
represented as a denormalized number with biased exponent
and mantissa
, for
, where:
![]() |
(4.57) |
The following function directly constructs
as the appropriate
normalized or denormalized number [4]:
static unsigned short mask[16] = { /* bit masks for bits 0 - 15 */ 0x0001, 0x0002, 0x0004, 0x0008, 0x0010, 0x0020, 0x0040, 0x0080, 0x0100, 0x0200, 0x0400, 0x0800, 0x1000, 0x2000, 0x4000, 0x8000}; double ulp(double x) { Double U, /* ulp of x */ X; /* working copy of x */ int bit, /* position of bit e-1 in 16-bit word */ e1, /* biased exponent - 1 */ word; /* index of 16-bit word containing bit e-1 */ X.dp = x; X.sh[MSW] &= 0x7ff0; /* isolate exponent in 16-bit word */ /* X.sh[0] now holds the exponent in bits 14-4 */ U.dp = 0.0; /* initialize exponent and mantissa to 0 */ if (X.sh[MSW] > 0x0340) /* ulp is normalized number */ U.sh[MSW] = X.sh[MSW]-0x0340; /* set exponent to e-52 */ /* the value 0x0340 is 52 left-shifted 4 bits, i.e. 0x0340 = 832 = 52<<4 */ else { /* ulp is denormalized number */ e1 = (X.sh[MSW]>>4) - 1; /* biased exponent - 1 */ word = e1>>4; /* find 16-bit word containing bit e-1 */ if (MSW == 0) word = 3 - word; /* compensate for word ordering */ bit = e1%16; /* find the bit position in this word */ U.sh[word] |= mask[bit]; /* set the bit to 1 */ } return U.dp; /* return ulp */ }
This implementation correctly and efficiently computes the
. For
example, for the value
= +2.2250738585072014
(the
smallest positive normalized double precision number) the
=
+4.9406564584124654
, which has the denormalized
binary representation:
For the value
= -1.7976931348623157
(the
largest negative normalized number) the
= +1.9958403095347198
, which has the normalized binary representation: