4.8.1 Double precision floating point arithmetic

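The significand $m$ is a sequence of $p$ bits $d_0 d_1 d_2 \ldots d_{p-1}$, where $d_i = 0$ or $d_i = 1$, with an implied binary point (analogous to a decimal point) between bits $d_0$ and $d_1$. Thus, the value of $m$ is calculated as:

$$m = d_0 . d_1 d_2 \ldots d_{p-1} = \sum_{i=0}^{p-1} d_i \, 2^{-i}$$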

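For double precision arithmetic, the standard defines $p = 53$, $e_{\min} = -1022$, and $e_{\max} = 1023$. The number is represented as a 64-bit quantity with a 1-bit sign $s$, an 11-bit biased exponent $e + 1023$ (where $e$ is the true exponent), and a 52-bit fractional mantissa composed of the bit string $d_1 d_2 \ldots d_{52}$. Since the exponent of a normalized number can always be selected such that the leading significand bit $d_0 = 1$ (and thus the significand $m$ satisfies $1 \le m < 2$), the value of $d_0$ is constant and it does not need to be stored in the binary representation.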
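The field layout described above can be inspected directly. The following is a minimal sketch, assuming Python's `float` is an IEEE 754 double (true on mainstream platforms); the helper names `decode_double` and `normalized_value` are hypothetical and introduced only for illustration.

```python
import struct

def decode_double(x: float):
    """Split a double into (sign, biased exponent, fraction) bit fields."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                        # 1 bit
    biased_exponent = (bits >> 52) & 0x7FF   # 11 bits
    fraction = bits & ((1 << 52) - 1)        # 52 stored mantissa bits
    return sign, biased_exponent, fraction

def normalized_value(sign: int, biased_exponent: int, fraction: int) -> float:
    """Rebuild a normalized number, restoring the hidden leading 1 bit."""
    # Valid only for normalized numbers (biased exponent 1..2046).
    significand = 1 + fraction / 2**52
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 1023)

print(decode_double(-2.0))  # (1, 1024, 0)
```

Because the leading bit is implicit, `normalized_value` adds 1 to the scaled fraction before applying the unbiased exponent.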

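The standard divides the set of representable numbers into the following five categories:

1. Not-a-number (NaN): the biased exponent field is all ones and the fractional mantissa is nonzero.
2. Infinity ($\pm\infty$): the biased exponent field is all ones and the fractional mantissa is zero.
3. Normalized numbers: the biased exponent field is neither all zeros nor all ones.
4. Denormalized numbers: the biased exponent field is all zeros and the fractional mantissa is nonzero.
5. Zero ($\pm 0$): the biased exponent field and the fractional mantissa are both zero.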

Table 4.2 summarizes the boundary values of each class of representable double precision numbers. The binary representation is presented with spaces separating the four 16-bit subsets of the 64-bit value, and with additional spaces separating the sign bit, exponent bits, and mantissa bits within the first subset. The numbers in the first column refer to the aforementioned five categories of representable numbers.

**Table 4.2:** Representable double-precision numbers and special values (adapted from [4])



| Category | Value | Binary representation |
|---|---|---|
| 1 | NaN | 1 11111111111 1111 1111111111111111 1111111111111111 1111111111111111 |
| 1 | NaN | 1 11111111111 0000 0000000000000000 0000000000000000 0000000000000001 |
| 2 | $-\infty$ | 1 11111111111 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $-(2 - 2^{-52}) \times 2^{1023}$ | 1 11111111110 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $-2^{1023}$ | 1 11111111110 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $-(2 - 2^{-52}) \times 2^{-1022}$ | 1 00000000001 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $-2^{-1022}$ | 1 00000000001 0000 0000000000000000 0000000000000000 0000000000000000 |
| 4 | $-(1 - 2^{-52}) \times 2^{-1022}$ | 1 00000000000 1111 1111111111111111 1111111111111111 1111111111111111 |
| 4 | $-2^{-1074}$ | 1 00000000000 0000 0000000000000000 0000000000000000 0000000000000001 |
| 5 | $-0$ | 1 00000000000 0000 0000000000000000 0000000000000000 0000000000000000 |
| 5 | $+0$ | 0 00000000000 0000 0000000000000000 0000000000000000 0000000000000000 |
| 4 | $2^{-1074}$ | 0 00000000000 0000 0000000000000000 0000000000000000 0000000000000001 |
| 4 | $(1 - 2^{-52}) \times 2^{-1022}$ | 0 00000000000 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $2^{-1022}$ | 0 00000000001 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $(2 - 2^{-52}) \times 2^{-1022}$ | 0 00000000001 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $2^{1023}$ | 0 11111111110 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $(2 - 2^{-52}) \times 2^{1023}$ | 0 11111111110 1111 1111111111111111 1111111111111111 1111111111111111 |
| 2 | $+\infty$ | 0 11111111111 0000 0000000000000000 0000000000000000 0000000000000000 |
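The category numbers in the first column of Table 4.2 (1 = NaN, 2 = infinity, 3 = normalized, 4 = denormalized, 5 = zero) follow from the exponent and mantissa fields alone. The sketch below illustrates that test on bit strings written as in the table; the function names `bits_from_row`, `classify_double`, and `double_from_bits` are hypothetical, and Python's `float` is assumed to be an IEEE 754 double.

```python
import struct

def bits_from_row(row: str) -> int:
    """Parse a space-separated bit string, as printed in Table 4.2."""
    return int(row.replace(" ", ""), 2)

def classify_double(bits: int) -> int:
    """Return the category number (1..5) of a 64-bit pattern."""
    exponent = (bits >> 52) & 0x7FF       # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)     # 52-bit fractional mantissa
    if exponent == 0x7FF:                 # exponent field all ones
        return 1 if fraction != 0 else 2  # NaN or infinity
    if exponent == 0:                     # exponent field all zeros
        return 4 if fraction != 0 else 5  # denormalized or zero
    return 3                              # normalized

def double_from_bits(bits: int) -> float:
    """Reinterpret a 64-bit pattern as a double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

row = "0 00000000000 0000 0000000000000000 0000000000000000 0000000000000001"
print(classify_double(bits_from_row(row)))  # 4
```

The sign bit plays no role in the classification, which is why each category other than NaN appears twice in the table, once per sign.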

It is possible that the result of an operation on two normalized numbers will not itself be representable as a normalized number. Consider two distinct normalized numbers whose difference is smaller in magnitude than the smallest normalized number. Clearly, the two numbers are not equal. However, their difference evaluates to 0 in finite precision normalized floating point arithmetic, because the exact difference is too small to be represented as a normalized number and is therefore rounded to 0 [128, pp. 23-24].
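A small numerical illustration of this situation follows. The operands are chosen here for convenience and are not taken from the cited source; the helper `flush_to_zero` is hypothetical and merely imitates arithmetic restricted to normalized numbers. The sketch assumes Python's `float` is an IEEE 754 double with denormalized numbers enabled, which is the usual default.

```python
import sys

smallest_normal = sys.float_info.min   # 2**-1022, smallest positive normalized double

x = 1.5 * smallest_normal              # normalized
y = 1.25 * smallest_normal             # normalized
diff = x - y                           # exact result is 2**-1024, below the normalized range

def flush_to_zero(value: float) -> float:
    """Imitate normalized-only arithmetic: results below 2**-1022 become 0."""
    return value if abs(value) >= smallest_normal else 0.0

print(x != y)                      # True
print(flush_to_zero(diff) == 0.0)  # True: lost without denormalized numbers
print(diff == 2.0 ** -1024)        # True: preserved as a denormalized number
```

With denormalized numbers available, `x != y` and `x - y != 0` remain consistent with each other.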

The IEEE standard can represent $2 \times 2046 \times 2^{52} \approx 1.8 \times 10^{19}$ normalized double precision numbers, but only $2 \times (2^{52} - 1) \approx 9.0 \times 10^{15}$ denormalized numbers. Denormalized numbers are generally not encountered in routine calculations. The ratio of denormalized to normalized numbers is approximately $1/2046 \approx 4.9 \times 10^{-4}$. Furthermore, the denormalized numbers are not uniformly distributed throughout the representable floating point space; rather, they occupy two contiguous groups on either side of 0. Certain operations, however, such as root finding, iteratively generate numbers that are increasingly close to 0. Therefore, it is important to allow for the possibility of encountering denormalized numbers when creating robust arithmetic software.
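The population sizes follow directly from the field widths: two signs, 2046 usable biased exponents, and $2^{52}$ mantissa patterns for normalized numbers, against two signs and $2^{52} - 1$ nonzero mantissa patterns for denormalized numbers. A short sketch of the count (variable names are illustrative only):

```python
# Normalized: 2 signs, biased exponents 1..2046, any 52-bit mantissa.
normalized = 2 * 2046 * 2**52

# Denormalized: 2 signs, biased exponent 0, any nonzero 52-bit mantissa.
denormalized = 2 * (2**52 - 1)

ratio = denormalized / normalized

print(f"{normalized:.3e}")    # 1.843e+19
print(f"{denormalized:.3e}")  # 9.007e+15
print(f"{ratio:.3e}")         # 4.888e-04
```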