4.8.1 Double precision floating point arithmetic

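The IEEE 754 standard represents a floating point number $x$ as

$$x = (-1)^s \times m \times 2^e \tag{4.50}$$

where $s$ is the sign bit, $m$ is the significand, and $e$ is the exponent. The significand $m$ is a sequence of $p$ bits $b_0 b_1 \ldots b_{p-1}$, where $b_i = 0$ or $b_i = 1$, with an implied binary point (analogous to a decimal point) between bits $b_0$ and $b_1$. Thus, the value of $m$ is calculated as:

$$m = b_0 . b_1 b_2 \ldots b_{p-1} = \sum_{i=0}^{p-1} b_i \, 2^{-i} \tag{4.51}$$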

For double precision arithmetic, the standard defines $p = 53$, $e_{\min} = -1022$, and $e_{\max} = 1023$. The number $x$ is represented as a 64-bit quantity with a 1-bit sign $s$, an 11-bit biased exponent $E$, and a 52-bit fractional mantissa $f$ composed of the bit string $b_1 b_2 \ldots b_{52}$. Since the exponent can always be selected such that $b_0 = 1$ (and thus, $1 \le m < 2$), the value of $b_0$ is constant and it does not need to be stored in the binary representation.

| Bit 63 | Bits 62 to 52 | Bits 51 to 0 |
|---|---|---|
| sign $s$ | biased exponent $E$ | fractional mantissa $f$ |

The integer value of the 11-bit biased exponent $E$ is calculated as:

$$E = e + 1023 \tag{4.52}$$
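The field layout described above can be checked directly on a running machine. The following is a minimal sketch, assuming that Python's `float` is an IEEE 754 double (true on common platforms); the function name `decode_double` and the returned field names are illustrative choices, not part of the standard.

```python
import struct

def decode_double(x):
    """Split a float into (sign, biased_exponent, fraction) integer fields."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                        # bit 63
    biased_exponent = (bits >> 52) & 0x7FF   # bits 62..52 (11 bits)
    fraction = bits & ((1 << 52) - 1)        # bits 51..0 (52 bits)
    return sign, biased_exponent, fraction

print(decode_double(1.0))    # (0, 1023, 0)
print(decode_double(-2.5))   # (1, 1024, 1125899906842624)
```

For $1.0 = 1.0 \times 2^{0}$ the stored exponent field is 1023, which is the bias; for $-2.5 = -1.25 \times 2^{1}$ it is 1024 and the fraction field holds $0.25 \times 2^{52} = 2^{50}$.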

The standard divides the set of representable numbers into the following five categories:

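- If $E = 2047$ and $f \neq 0$, then the value of $x$ is the special flag *NaN* (not a number).
- If $E = 2047$ and $f = 0$, then the value of $x$ is $\pm\infty$ depending upon the sign bit: positive if $s = 0$ and negative if $s = 1$.
- If $0 < E < 2047$, then $x$ is called a *normalized* number, and

$$x = (-1)^s \times (1.f) \times 2^{E - 1023} \tag{4.53}$$

- If $E = 0$ and $f \neq 0$, then $x$ is called a *denormalized* number, and

$$x = (-1)^s \times (0.f) \times 2^{-1022} \tag{4.54}$$

- If $E = 0$ and $f = 0$, then the value of $x$ is $\pm 0$ depending upon the sign bit. Although they have unique binary representations, arithmetically $+0 = -0$.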
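The five categories can be sketched as a small classifier that also evaluates the normalized and denormalized value formulas exactly. This is an illustrative sketch only: it assumes Python's `float` is an IEEE 754 double, and the names `classify`, `s`, `E`, and `f` are chosen here for the sign bit, biased exponent field, and fraction field.

```python
import math
import struct
from fractions import Fraction

def classify(x):
    """Return (category, value) for a double, using exact rational arithmetic."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                  # sign bit
    E = (bits >> 52) & 0x7FF        # 11-bit biased exponent
    f = bits & ((1 << 52) - 1)      # 52-bit fraction
    sign = -1 if s else 1
    if E == 2047:
        if f != 0:
            return "NaN", None
        return "infinity", sign * math.inf
    if E > 0:
        # Normalized: implicit leading 1, exponent E - 1023.
        return "normalized", sign * (1 + Fraction(f, 2**52)) * Fraction(2) ** (E - 1023)
    if f != 0:
        # Denormalized: implicit leading 0, fixed exponent -1022.
        return "denormalized", sign * Fraction(f, 2**52) * Fraction(2) ** -1022
    return "zero", Fraction(0)

for x in (1.0, -2.5, 5e-324, -0.0, float("inf")):
    print(x, classify(x)[0])
```

Because `Fraction(x)` converts a finite float to its exact rational value, comparing it with the second element returned by `classify(x)` is a convenient way to confirm the two formulas for any finite input.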

Table 4.2 summarizes all of the representable double precision numbers. The binary representation is presented with spaces separating the four 16-bit subsets of the 64-bit value; within the first subset, additional spaces separate the sign bit, exponent bits, and mantissa bits. The numbers in the first column refer to the aforementioned five categories of representable numbers.

| Category | Value of $x$ | Binary representation |
|---|---|---|
| 1 | NaN | 1 11111111111 1111 1111111111111111 1111111111111111 1111111111111111 |
| 1 | NaN | 1 11111111111 0000 0000000000000000 0000000000000000 0000000000000001 |
| 2 | $-\infty$ | 1 11111111111 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $-(2 - 2^{-52}) \times 2^{1023}$ | 1 11111111110 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $-2^{1023}$ | 1 11111111110 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $-(2 - 2^{-52}) \times 2^{-1022}$ | 1 00000000001 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $-2^{-1022}$ | 1 00000000001 0000 0000000000000000 0000000000000000 0000000000000000 |
| 4 | $-(1 - 2^{-52}) \times 2^{-1022}$ | 1 00000000000 1111 1111111111111111 1111111111111111 1111111111111111 |
| 4 | $-2^{-1074}$ | 1 00000000000 0000 0000000000000000 0000000000000000 0000000000000001 |
| 5 | $-0$ | 1 00000000000 0000 0000000000000000 0000000000000000 0000000000000000 |
| 5 | $+0$ | 0 00000000000 0000 0000000000000000 0000000000000000 0000000000000000 |
| 4 | $2^{-1074}$ | 0 00000000000 0000 0000000000000000 0000000000000000 0000000000000001 |
| 4 | $(1 - 2^{-52}) \times 2^{-1022}$ | 0 00000000000 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $2^{-1022}$ | 0 00000000001 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $(2 - 2^{-52}) \times 2^{-1022}$ | 0 00000000001 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $2^{1023}$ | 0 11111111110 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $(2 - 2^{-52}) \times 2^{1023}$ | 0 11111111110 1111 1111111111111111 1111111111111111 1111111111111111 |
| 2 | $+\infty$ | 0 11111111111 0000 0000000000000000 0000000000000000 0000000000000000 |
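The boundary bit patterns in the table can be assembled from their three fields and compared with the limits that a language runtime reports. The sketch below assumes Python's `float` is an IEEE 754 double; `from_fields` and `ALL_ONES` are hypothetical helper names introduced for this illustration.

```python
import math
import struct
import sys

ALL_ONES = (1 << 52) - 1  # fraction field with all 52 bits set

def from_fields(s, E, f):
    """Build a double from a sign bit, 11-bit exponent field, and 52-bit fraction."""
    bits = (s << 63) | (E << 52) | f
    (x,) = struct.unpack(">d", struct.pack(">Q", bits))
    return x

# Largest normalized number: exponent field 2046, fraction all ones.
print(from_fields(0, 2046, ALL_ONES) == sys.float_info.max)
# Smallest positive normalized number: exponent field 1, fraction zero.
print(from_fields(0, 1, 0) == sys.float_info.min)
# Smallest positive denormalized number: exponent field 0, fraction 1.
print(from_fields(0, 0, 1) == math.ldexp(1.0, -1074))
# Exponent field 2047 with zero fraction is an infinity.
print(from_fields(1, 2047, 0) == -math.inf)
```

Each comparison is expected to print `True` on a platform with IEEE 754 doubles. Note that `sys.float_info.min` is the smallest positive normalized value, not the smallest denormalized one.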

It is possible that the result of an operation on two normalized numbers will not itself be representable as a normalized number. Consider the normalized numbers $x$ and $y$, chosen so that $0 < |x - y| < 2^{-1022}$. Clearly, $x \neq y$. However, $x - y = 0$ in finite precision normalized floating point arithmetic because the exact difference is smaller in magnitude than $2^{-1022}$, which is too small to be represented as a normalized number. It is therefore rounded to the value of 0 [128, pp. 23-24].

The use of denormalized numbers ensures that the relationship

$$x = y \iff x - y = 0 \tag{4.55}$$

always holds true for all normalized numbers. It will also hold true for denormalized numbers where $|x - y| \geq 2^{-1074}$, the smallest positive representable denormalized number.
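The effect of denormalized numbers on this relationship can be seen with a pair of values near the bottom of the normalized range. The values below are illustrative and are not taken from the cited reference. The sketch assumes IEEE 754 doubles with denormalized numbers enabled (the usual default, with no flush-to-zero mode active); `flush_to_zero` is a hypothetical helper that mimics arithmetic restricted to normalized numbers.

```python
import math
import sys

tiny = sys.float_info.min      # 2**-1022, smallest positive normalized number
x = 1.5 * tiny                 # normalized
y = tiny                       # normalized

d = x - y                      # exact difference is 2**-1023, a denormalized number
print(d == math.ldexp(1.0, -1023))   # the difference is represented exactly
print(d != 0.0)                      # so x != y implies x - y != 0

def flush_to_zero(v):
    """Mimic arithmetic without denormalized numbers: tiny results become 0."""
    return 0.0 if abs(v) < tiny else v

print(flush_to_zero(d) == 0.0)       # without denormals, x - y would be 0
```

With denormalized numbers available, the first two lines print `True`; the last line shows that the same difference would collapse to zero if only normalized numbers could be stored.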

The IEEE standard can represent $2 \times 2046 \times 2^{52} \approx 1.8 \times 10^{19}$ normalized numbers, but only $2 \times (2^{52} - 1) \approx 9.0 \times 10^{15}$ denormalized numbers. Denormalized numbers are generally not encountered in routine calculations. The ratio of denormalized to normalized numbers is approximately $1/2046 \approx 4.9 \times 10^{-4}$. Furthermore, the denormalized numbers are not uniformly distributed throughout the representable floating point space; rather, they occupy two contiguous groups on either side of 0. Certain operations, however, such as root finding, iteratively generate numbers that are increasingly close to 0. Therefore it is important to allow for the possibility of encountering denormalized numbers when creating robust arithmetic software.
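The counts follow from the field widths: a normalized number has one of 2046 exponent field values (1 through 2046) and any of $2^{52}$ fractions, while a denormalized number has exponent field 0 and any nonzero fraction, each with two signs. A short sketch of that arithmetic, with variable names chosen here for illustration:

```python
# Two signs, 2046 usable exponent field values, 2**52 fraction patterns.
normalized = 2 * 2046 * 2**52
# Two signs, exponent field 0, every fraction pattern except all zeros.
denormalized = 2 * (2**52 - 1)

ratio = denormalized / normalized

print(f"{normalized:.3e}")    # 1.843e+19
print(f"{denormalized:.3e}")  # 9.007e+15
print(f"{ratio:.3e}")         # 4.888e-04
```

The ratio is very close to $1/2046$, since the one excluded fraction pattern per sign is negligible next to $2^{52}$.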