
## 4.8.1 Double precision floating point arithmetic

Most commercial processors implement floating point arithmetic using the representation defined by ANSI/IEEE Std 754-1985, the IEEE Standard for Binary Floating-Point Arithmetic [10]. This standard defines the binary representation of a floating point number $x$ in terms of a sign bit $s$, an integer exponent $E$, for $E_{\min} \leq E \leq E_{\max}$, and a $p$-bit significand $m$, where

 $x = (-1)^{s} \, m \, 2^{E}$  (4.50)

The significand $m$ is a sequence of $p$ bits $b_0 b_1 \ldots b_{p-1}$, where $b_i = 0$ or $1$, with an implied binary point (analogous to a decimal point) between bits $b_0$ and $b_1$. Thus, the value of $m$ is calculated as:

 $m = \sum_{i=0}^{p-1} b_i \, 2^{-i}$  (4.51)

For double precision arithmetic, the standard defines $E_{\min} = -1022$, $E_{\max} = 1023$, and $p = 53$. The number $x$ is represented as a 64-bit quantity with a 1-bit sign $s$, an 11-bit biased exponent $e$, and a 52-bit fractional mantissa $f$ composed of the bit string $b_1 b_2 \ldots b_{52}$. Since the exponent can always be selected such that $b_0 = 1$ (and thus, $m = 1.f$), the value of $b_0$ is constant and it does not need to be stored in the binary representation.

     63 | 62         52 | 51                                                  0
      s |       e       |                           f

The integer value of the 11-bit biased exponent $e$ is calculated from the exponent bits $e_{10} e_{9} \ldots e_{0}$ as:

 $e = \sum_{i=0}^{10} e_i \, 2^{i}$  (4.52)

For normalized numbers the exponent is biased by 1023, so that $E = e - 1023$.
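This field layout can be verified with a short Python sketch (the helper name `fields` and the test value are our own, chosen for illustration):

```python
import struct

def fields(x):
    """Split a double into sign bit s, biased exponent e, and mantissa f."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
    s = bits >> 63                  # bit 63
    e = (bits >> 52) & 0x7FF        # bits 62..52 (11 bits)
    f = bits & ((1 << 52) - 1)      # bits 51..0  (52 bits)
    return s, e, f

# x = -1.5 = (-1)^1 * 2^0 * (1.1)_2, so s = 1, e = 0 + 1023, f = 100...0 = 2^51
print(fields(-1.5))   # (1, 1023, 2251799813685248)
```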

The standard divides the set of representable numbers into the following five categories:

1. If $e = 2047$ and $f \neq 0$, then the value of $x$ is the special flag NaN (not a number).
2. If $e = 2047$ and $f = 0$, then the value of $x$ is $\pm\infty$ depending upon the sign bit: positive if $s = 0$ and negative if $s = 1$.
3. If $0 < e < 2047$, then $x$ is called a normalized number, and

 $x = (-1)^{s} \, 2^{e-1023} \, (1.f)$  (4.53)

4. If $e = 0$ and $f \neq 0$, then $x$ is called a denormalized number, and

 $x = (-1)^{s} \, 2^{-1022} \, (0.f)$  (4.54)

5. If $e = 0$ and $f = 0$, then the value of $x$ is $\pm 0$ depending upon the sign bit. Although they have unique binary representations, arithmetically $+0 = -0$.
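These five cases can be decided directly from the bit fields. A minimal sketch in Python, using the standard `struct` module (the function name `classify` is ours, not from the standard):

```python
import struct

def classify(x):
    """Return the IEEE 754 double precision category of x (cases 1-5 above)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
    e = (bits >> 52) & 0x7FF          # 11-bit biased exponent
    f = bits & ((1 << 52) - 1)        # 52-bit fractional mantissa
    if e == 2047:
        return "NaN" if f != 0 else "infinity"       # categories 1 and 2
    if e > 0:
        return "normalized"                          # category 3
    return "denormalized" if f != 0 else "zero"      # categories 4 and 5

print(classify(float("nan")))   # NaN
print(classify(float("-inf")))  # infinity
print(classify(1.0))            # normalized
print(classify(5e-324))         # denormalized (2^-1074)
print(classify(-0.0))           # zero
```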

Table 4.2 summarizes all of the representable double precision numbers. The binary representation is presented with spaces separating the sign bit, the exponent bits, and the mantissa bits, the mantissa being split into groups of 4, 16, 16, and 16 bits. The numbers in the first column refer to the aforementioned five categories of representable numbers.

| Category | $x$ | Binary representation |
|---|---|---|
| 1 | NaN | 1 11111111111 1111 1111111111111111 1111111111111111 1111111111111111 |
| 1 | NaN | 1 11111111111 0000 0000000000000000 0000000000000000 0000000000000001 |
| 2 | $-\infty$ | 1 11111111111 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $-(2 - 2^{-52}) \times 2^{1023}$ | 1 11111111110 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $-2^{1023}$ | 1 11111111110 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $-(2 - 2^{-52}) \times 2^{-1022}$ | 1 00000000001 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $-2^{-1022}$ | 1 00000000001 0000 0000000000000000 0000000000000000 0000000000000000 |
| 4 | $-(1 - 2^{-52}) \times 2^{-1022}$ | 1 00000000000 1111 1111111111111111 1111111111111111 1111111111111111 |
| 4 | $-2^{-1074}$ | 1 00000000000 0000 0000000000000000 0000000000000000 0000000000000001 |
| 5 | $-0$ | 1 00000000000 0000 0000000000000000 0000000000000000 0000000000000000 |
| 5 | $+0$ | 0 00000000000 0000 0000000000000000 0000000000000000 0000000000000000 |
| 4 | $2^{-1074}$ | 0 00000000000 0000 0000000000000000 0000000000000000 0000000000000001 |
| 4 | $(1 - 2^{-52}) \times 2^{-1022}$ | 0 00000000000 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $2^{-1022}$ | 0 00000000001 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $(2 - 2^{-52}) \times 2^{-1022}$ | 0 00000000001 1111 1111111111111111 1111111111111111 1111111111111111 |
| 3 | $2^{1023}$ | 0 11111111110 0000 0000000000000000 0000000000000000 0000000000000000 |
| 3 | $(2 - 2^{-52}) \times 2^{1023}$ | 0 11111111110 1111 1111111111111111 1111111111111111 1111111111111111 |
| 2 | $+\infty$ | 0 11111111111 0000 0000000000000000 0000000000000000 0000000000000000 |
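Such entries can be cross-checked against a language runtime. For instance, reinterpreting the bit pattern of the largest finite entry in Python (a sketch; the hexadecimal constant is the bit string written out in hex):

```python
import math
import struct
import sys

# Bit pattern 0 11111111110 111...1: category 3, largest positive normalized number
bits = 0x7FEFFFFFFFFFFFFF
x = struct.unpack(">d", struct.pack(">Q", bits))[0]

print(x == sys.float_info.max)            # True
print(x == math.ldexp(2 - 2**-52, 1023))  # True: x = (2 - 2^-52) * 2^1023
```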

It is possible that the result of an operation on two normalized numbers will not itself be representable as a normalized number. Consider, for example, the normalized numbers $x = 1.5 \times 2^{-1022}$ and $y = 1.25 \times 2^{-1022}$. Clearly, $x \neq y$. However, in finite precision normalized floating point arithmetic $x \ominus y = 0$, because $x - y = 2^{-1024}$, which is too small to be represented as a normalized number. It is therefore rounded to the value of 0 [128, pp. 23-24].

The use of denormalized numbers ensures that the relationship

 $x = y \iff x - y = 0$  (4.55)

always holds true for all normalized numbers $x$ and $y$. It will also hold true for denormalized numbers, whose magnitudes extend down to $2^{-1074}$, the smallest positive representable denormalized number.
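This property can be checked numerically near the underflow threshold. A minimal sketch, with the two values chosen by us for illustration:

```python
import math

# Two distinct normalized numbers just above the smallest normalized 2^-1022
x = math.ldexp(1.5, -1022)    # 1.5 * 2^-1022
y = math.ldexp(1.25, -1022)   # 1.25 * 2^-1022

d = x - y                     # exactly 2^-1024: representable only as a denormal
print(d == math.ldexp(1.0, -1024))  # True: gradual underflow keeps d nonzero
print((x == y) == (d == 0.0))       # True: relation (4.55) holds
```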

The IEEE standard can represent $2 \times 2046 \times 2^{52} \approx 1.8 \times 10^{19}$ normalized numbers, but only $2 \times (2^{52} - 1) \approx 9.0 \times 10^{15}$ denormalized numbers. Denormalized numbers are generally not encountered in routine calculations: the ratio of denormalized to normalized numbers is roughly $1/2046 \approx 4.9 \times 10^{-4}$. Furthermore, the denormalized numbers are not uniformly distributed throughout the representable floating point space; rather, they occupy two contiguous groups on either side of 0. Certain operations, however, such as root finding, iteratively generate numbers that are increasingly close to 0. It is therefore important to allow for the possibility of encountering denormalized numbers when creating robust arithmetic software.
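The approach toward 0 can be observed directly by repeated halving, a pattern similar to what an iterative root finder produces. A Python sketch, assuming the default round-to-nearest mode:

```python
import sys

x = 1.0
denormal_steps = 0
while x > 0.0:
    if x < sys.float_info.min:   # below 2^-1022: x is now denormalized
        denormal_steps += 1
    x /= 2.0                     # eventually underflows gradually to exactly 0

print(denormal_steps)            # 52 halvings pass through the denormal range
print(sys.float_info.min)        # 2.2250738585072014e-308, i.e. 2^-1022
```

The loop terminates because $2^{-1074}/2$ rounds to 0; naive code that assumes halving a positive number always yields a positive number would loop on a system that flushed denormals differently, which is one reason robust software must anticipate this range.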

December 2009