4.8.1 Double precision floating point arithmetic

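The significand is a sequence of bits $b_0 b_1 \cdots b_{p-1}$, where $b_i = 0$ or $b_i = 1$, with an implied binary point (analogous to a decimal point) between bits $b_0$ and $b_1$. Thus, the value of the significand $B$ is calculated as:

$$B = b_0 . b_1 b_2 \cdots b_{p-1} = \sum_{i=0}^{p-1} b_i \, 2^{-i}$$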

For double precision arithmetic, the standard defines $p = 53$, $E_{min} = -1022$, and $E_{max} = 1023$. A double precision number is represented as a 64-bit quantity with a 1-bit sign, an 11-bit biased exponent, and a 52-bit fractional mantissa composed of the bit string $b_1 b_2 \cdots b_{52}$. Since the exponent of a normalized number can always be selected such that $b_0 = 1$ (and thus, $1 \leq B < 2$), the value of $b_0$ is constant and it does not need to be stored in the binary representation.
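The field layout described above can be checked with a short Python sketch. It assumes that Python's `float` is an IEEE double (true on common platforms) and that the exponent bias is 1023; the function names `decompose` and `normalized_value` are illustrative only and do not come from the text.

```python
import struct

def decompose(x: float):
    """Split a double into (sign, biased exponent, 52-bit fraction) as integers."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # raw 64-bit pattern
    sign = bits >> 63                                    # 1-bit sign
    biased_exponent = (bits >> 52) & 0x7FF               # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)                    # bits b1 ... b52
    return sign, biased_exponent, fraction

def normalized_value(sign: int, biased_exponent: int, fraction: int) -> float:
    """Rebuild a normalized double (1 <= biased exponent <= 2046) from its fields."""
    significand = 1 + fraction / 2**52   # implied b0 = 1, so 1 <= B < 2
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 1023)

if __name__ == "__main__":
    fields = decompose(-2.5)
    print(fields, normalized_value(*fields))
```

The reconstruction only covers normalized numbers; the all-zero and all-one exponent patterns encode the other categories and need separate handling.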

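The standard divides the set of representable numbers into the following five categories:

1. NaN (not a number) values
2. $\pm\infty$
3. Normalized numbers
4. Denormalized numbers
5. $\pm 0$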

Table 4.2 summarizes all of the representable double precision numbers. The binary representation is presented with spaces separating the four 16-bit subsets of the 64-bit value, and the symbol $\cdot$ separating the sign bit, exponent bits, and mantissa bits. The numbers in the first column refer to the aforementioned five categories of representable numbers.

**Table 4.2:** Representable double-precision numbers and special values (adapted from [4])



| Category | Value | Binary representation |
|:--|:--|:--|
| 1 | NaN | 1 $\cdot$ 11111111111 $\cdot$ 1111 1111111111111111 1111111111111111 1111111111111111 |
| | $\cdots$ | |
| | NaN | 1 $\cdot$ 11111111111 $\cdot$ 0000 0000000000000000 0000000000000000 0000000000000001 |
| 2 | $-\infty$ | 1 $\cdot$ 11111111111 $\cdot$ 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $-1.7976931348623157 \times 10^{+308}$ | 1 $\cdot$ 11111111110 $\cdot$ 1111 1111111111111111 1111111111111111 1111111111111111 |
| | $\cdots$ | |
| | $-8.9884656743115795 \times 10^{+307}$ | 1 $\cdot$ 11111111110 $\cdot$ 0000 0000000000000000 0000000000000000 0000000000000000 |
| | $\cdots$ | |
| | $-4.4501477170144023 \times 10^{-308}$ | 1 $\cdot$ 00000000001 $\cdot$ 1111 1111111111111111 1111111111111111 1111111111111111 |
| | $\cdots$ | |
| | $-2.2250738585072014 \times 10^{-308}$ | 1 $\cdot$ 00000000001 $\cdot$ 0000 0000000000000000 0000000000000000 0000000000000000 |
| 4 | $-2.2250738585072009 \times 10^{-308}$ | 1 $\cdot$ 00000000000 $\cdot$ 1111 1111111111111111 1111111111111111 1111111111111111 |
| | $\cdots$ | |
| | $-4.9406564584124654 \times 10^{-324}$ | 1 $\cdot$ 00000000000 $\cdot$ 0000 0000000000000000 0000000000000000 0000000000000001 |
| 5 | $-0$ | 1 $\cdot$ 00000000000 $\cdot$ 0000 0000000000000000 0000000000000000 0000000000000000 |
| | $+0$ | 0 $\cdot$ 00000000000 $\cdot$ 0000 0000000000000000 0000000000000000 0000000000000000 |
| 4 | $+4.9406564584124654 \times 10^{-324}$ | 0 $\cdot$ 00000000000 $\cdot$ 0000 0000000000000000 0000000000000000 0000000000000001 |
| | $\cdots$ | |
| | $+2.2250738585072009 \times 10^{-308}$ | 0 $\cdot$ 00000000000 $\cdot$ 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $+2.2250738585072014 \times 10^{-308}$ | 0 $\cdot$ 00000000001 $\cdot$ 0000 0000000000000000 0000000000000000 0000000000000000 |
| | $\cdots$ | |
| | $+4.4501477170144023 \times 10^{-308}$ | 0 $\cdot$ 00000000001 $\cdot$ 1111 1111111111111111 1111111111111111 1111111111111111 |
| | $\cdots$ | |
| | $+8.9884656743115795 \times 10^{+307}$ | 0 $\cdot$ 11111111110 $\cdot$ 0000 0000000000000000 0000000000000000 0000000000000000 |
| | $\cdots$ | |
| | $+1.7976931348623157 \times 10^{+308}$ | 0 $\cdot$ 11111111110 $\cdot$ 1111 1111111111111111 1111111111111111 1111111111111111 |
| 2 | $+\infty$ | 0 $\cdot$ 11111111111 $\cdot$ 0000 0000000000000000 0000000000000000 0000000000000000 |
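The rows of Table 4.2 can be reproduced, and a value assigned to its category, by inspecting the exponent and mantissa fields directly. The following is a minimal sketch, assuming Python floats are IEEE doubles; the helper names `table_format` and `category` are hypothetical, and the category numbers follow the first column of the table.

```python
import struct

SEP = " \u00b7 "  # middle dot, mirroring the separator used in Table 4.2

def raw_bits(x: float) -> int:
    """Return the 64-bit pattern of a double as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def table_format(x: float) -> str:
    """Format a double as sign, exponent, and mantissa in 16-bit groups."""
    bits = format(raw_bits(x), "064b")
    sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]
    # The first 4 mantissa bits complete the leading 16-bit group.
    groups = [mantissa[:4]] + [mantissa[i:i + 16] for i in range(4, 52, 16)]
    return sign + SEP + exponent + SEP + " ".join(groups)

def category(x: float) -> int:
    """Return the Table 4.2 category number (1 to 5) of a double."""
    bits = raw_bits(x)
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0x7FF:             # all exponent bits set
        return 1 if fraction else 2   # NaN or infinity
    if exponent == 0:                 # all exponent bits clear
        return 4 if fraction else 5   # denormalized or zero
    return 3                          # normalized

if __name__ == "__main__":
    for v in (float("-inf"), -1.0, -5e-324, -0.0, 0.0, 5e-324, 1.0, float("inf")):
        print(category(v), table_format(v))
```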

It is possible that the result of an operation on two normalized numbers will not itself be representable as a normalized number. Consider the normalized numbers $x = 1.25 \times 10^{-306}$ and $y = 1.23 \times 10^{-306}$. Clearly, $x \neq y$. However, in finite precision normalized floating point arithmetic $x - y = 0$, because the exact difference $x - y = 0.02 \times 10^{-306} = 2.0 \times 10^{-308}$ is too small to be represented as a normalized number and is therefore rounded to 0 [128, pp. 23-24].
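The example can be tried numerically. The sketch below assumes a platform where Python floats are IEEE doubles with denormalized numbers enabled, so the subtraction itself yields a small nonzero result; `flush_to_zero` is a hypothetical helper that models arithmetic restricted to normalized numbers and is not part of any library.

```python
import sys

SMALLEST_NORMAL = sys.float_info.min  # 2**-1022, the smallest positive normalized double

def flush_to_zero(value: float) -> float:
    """Model normalized-only arithmetic: results below the normalized range become 0."""
    if abs(value) < SMALLEST_NORMAL:
        return 0.0
    return value

x = 1.25e-306
y = 1.23e-306

with_denormals = x - y                           # roughly 2.0e-308, below SMALLEST_NORMAL
normalized_only = flush_to_zero(with_denormals)  # what normalized-only arithmetic would give

if __name__ == "__main__":
    print("x != y:", x != y)
    print("x - y with denormalized numbers:", with_denormals)
    print("x - y with normalized numbers only:", normalized_only)
```

With only normalized numbers available, `x != y` holds while `x - y == 0`, which is the inconsistency that denormalized numbers avoid.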

The IEEE standard can represent $2046 \cdot 2^{52} \approx 9.2 \times 10^{18}$ normalized numbers of each sign, but only $2^{52}-1 \approx 4.5 \times 10^{15}$ denormalized numbers of each sign, so the ratio of denormalized to normalized numbers is approximately $1/2046 \approx 4.9 \times 10^{-4}$. Furthermore, the denormalized numbers are not uniformly distributed throughout the representable floating point space; rather, they occupy two contiguous groups on either side of 0. Denormalized numbers are therefore generally not encountered in routine calculations. Certain operations, however, such as root finding, iteratively generate numbers that are increasingly close to 0. It is therefore important to allow for the possibility of encountering denormalized numbers when creating robust arithmetic software.
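The counts quoted above follow directly from the field widths, and a simple iteration shows how quickly a computation can reach the denormalized range. This sketch assumes Python floats are IEEE doubles; the repeated halving in `halvings_until_denormal` (a hypothetical name) is only a stand-in for an iteration that converges toward 0, not a root-finding method taken from the text.

```python
import sys

# Per sign: 2046 usable exponent values times 2**52 mantissa patterns.
normalized_per_sign = 2046 * 2**52
# Per sign: exponent field of zero with any nonzero mantissa pattern.
denormalized_per_sign = 2**52 - 1
ratio = denormalized_per_sign / normalized_per_sign  # close to 1/2046

def halvings_until_denormal(start: float = 1.0):
    """Halve a positive value until it drops below the smallest normalized double."""
    value, steps = start, 0
    while value >= sys.float_info.min:
        value /= 2.0
        steps += 1
    return steps, value

if __name__ == "__main__":
    print(normalized_per_sign, denormalized_per_sign, ratio)
    print(halvings_until_denormal())
```

Starting from 1.0, the loop leaves the normalized range after 1023 halvings, so an iteration that shrinks its iterate by a constant factor can reach denormalized values within a modest number of steps.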

Bit positions of the 64-bit representation: 63 (sign) | 62 $\cdots$ 52 (biased exponent) | 51 $\cdots$ 0 (fractional mantissa)