BIS 524 Chapter 5

Floating Point Example - Storing a Non-Integer
What is the floating-point representation of -483.137₁₀? First we must convert this number to base-2.
Remember that any non-negative integer, N, in base-10 can be represented as the sum
{... + a₃×2³ + a₂×2² + a₁×2¹ + a₀×2⁰}, where each a_i is either 0 or 1. If dividing N by 2 leaves a remainder of 1, then a₀=1. Otherwise, a₀=0. Dividing the integer part of N/2 by 2 in the same way gives a₁, and so on until the quotient reaches 0. The base-2 representation of N is thus {... a₃a₂a₁a₀}. As shown below, 483₁₀ = 111100011₂.

483/2	=	241	Remainder:	1	a₀
241/2	=	120		1	a₁
120/2	=	60		0	a₂
60/2	=	30		0	a₃
30/2	=	15		0	a₄
15/2	=	7		1	a₅
7/2	=	3		1	a₆
3/2	=	1		1	a₇
1/2	=	0		1	a₈
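The repeated division described above can be sketched in Python. This is a minimal illustration rather than part of the original notes, and the function name int_to_binary is hypothetical.

```python
def int_to_binary(n):
    """Return the base-2 digits of a non-negative integer as a string."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 2)    # the quotient feeds the next step
        digits.append(str(remainder))  # remainders arrive as a0, a1, a2, ...
    return "".join(reversed(digits))   # written out as ... a2 a1 a0


print(int_to_binary(483))  # 111100011
```

The remainders are produced from the least significant digit upward, which is why the list is reversed before it is joined.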

A decimal fraction between 0 and 1, .F, in base-10 can be represented as the sum
{b₁×2⁻¹ + b₂×2⁻² + b₃×2⁻³ + ...}, where each b_i is either 0 or 1. If we multiply .F by 2, and the result is greater than or equal to one, then b₁=1. Otherwise, b₁=0. Multiplying the fractional part of .F×2 by 2, we can determine the value of b₂, and so on. The base-2 representation of .F is thus .{b₁b₂b₃...}. As shown below, 0.137₁₀ = .00100011₂, truncated to 8 places. In each row, the product is written as its fractional part plus its integer part, and the integer part is the next binary digit.

0.137×2	=	0.274	+	0	b₁
0.274×2	=	0.548	+	0	b₂
0.548×2	=	0.096	+	1	b₃
0.096×2	=	0.192	+	0	b₄
0.192×2	=	0.384	+	0	b₅
0.384×2	=	0.768	+	0	b₆
0.768×2	=	0.536	+	1	b₇
0.536×2	=	0.072	+	1	b₈
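The repeated multiplication can be sketched the same way. This is a minimal illustration rather than part of the original notes: the function name fraction_to_binary is hypothetical, and fractions.Fraction is used so that 0.137 is held exactly, as in the hand calculation, instead of as a binary float.

```python
from fractions import Fraction


def fraction_to_binary(f, places):
    """Return the first `places` base-2 digits of a fraction 0 <= f < 1."""
    bits = []
    for _ in range(places):
        f *= 2
        if f >= 1:          # integer part is 1: record it and keep the rest
            bits.append("1")
            f -= 1
        else:               # integer part is 0
            bits.append("0")
    return "".join(bits)    # digits b1 b2 b3 ... in order


print(fraction_to_binary(Fraction(137, 1000), 8))  # 00100011
```

Unlike the integer case, the digits come out in reading order, and the loop has to stop at a chosen number of places because the expansion of 0.137 does not terminate in base 2.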

Thus we have that 483.137₁₀ ≈ 111100011.00100011₂. In base-2 scientific notation, the approximate value is 1.1110001100100011×10¹⁰⁰⁰, where every digit is written in base 2 (that is, 1.1110001100100011₂ × 2⁸, since the binary point moves 8 places to the left). Adding the bias, 1023₁₀ = 01111111111₂, to the exponent yields 10000000111₂. Taking into account the negative sign and dropping the leading 1 from the mantissa, the 3-byte floating-point representation of -483.137₁₀ is

1 1 0 0 0 0 0 0	0 1 1 1 1 1 1 0	0 0 1 1 0 0 1 0
byte 1	byte 2	byte 3
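The assembly of the three bytes can be sketched in Python. This is a minimal illustration rather than part of the original notes: the function name pack_three_bytes is hypothetical, and the bias of 1023 and the 11-bit exponent field are assumptions inferred from the biased exponent 10000000111 in the worked example.

```python
def pack_three_bytes(sign, exponent, mantissa_bits, bias=1023):
    """Assemble sign, biased exponent, and mantissa, keeping only 24 bits.

    bias=1023 and the 11-bit exponent width are assumptions inferred from
    the worked example; mantissa_bits excludes the leading 1.
    """
    biased = format(exponent + bias, "011b")        # 8 + 1023 -> 10000000111
    bits = str(sign) + biased + mantissa_bits       # 1 sign bit + 11 + mantissa
    bits = bits.ljust(24, "0")[:24]                 # pad or truncate to 3 bytes
    return format(int(bits, 2), "06X")              # 24 bits -> 6 hex digits


print(pack_three_bytes(1, 8, "1110001100100011"))   # C07E32
```

Of the 24 available bits, 1 goes to the sign and 11 to the exponent, so only the first 12 mantissa bits survive the truncation.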

Note that not all of the mantissa can be stored. After the sign bit and the 11 exponent bits, only 12 of the 24 bits remain for the mantissa, and the remainder is truncated. If four or more bytes were available, more of the mantissa could be kept. In hexadecimal form, this number is C07E32₁₆. The actual base-10 number represented by this floating-point number is -483.125₁₀.
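The stored value can be checked in Python. This is a minimal sketch, not part of the original notes: it assumes that the 3-byte form is the leading three bytes of a big-endian 8-byte IEEE 754 double, which matches the sign, exponent, and bias used in this example. The function name decode_truncated_double is hypothetical.

```python
import struct


def decode_truncated_double(hex_string):
    """Decode a truncated big-endian double by padding it with zero bytes."""
    raw = bytes.fromhex(hex_string)
    padded = raw + bytes(8 - len(raw))       # missing mantissa bytes become 0
    return struct.unpack(">d", padded)[0]


print(decode_truncated_double("C07E32"))                # -483.125
print(struct.pack(">d", -483.137)[:3].hex().upper())    # C07E32
```

The difference between -483.137 and -483.125 is the part of the mantissa that was truncated.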
