floating point representation

Question

I came across a question that asked me to choose the correct number representation for a summation of float32 × int32 products:

$$ A = \sum_{n=0}^{127} b[n] \cdot c[n] $$

Given:

b is an array of 32-bit IEEE-754 floating-point values (float32).
c is an array of 32-bit signed integers (int32, using two's complement).
The values in both b and c are natural numbers (i.e. positive integers).

My Understanding So Far:

b[n] can range approximately between:

$$ (1.0 \times 2^{0}) \quad \text{to} \quad (1.111\dots1_2 \times 2^{127}) = 2^{128} - 2^{104} $$
c[n] can range from \$1\$ to \$2^{31} - 1\$.
Therefore, the maximum possible product \$b[n] \cdot c[n]\$ can be roughly:

$$ (2^{128} - 2^{104}) \cdot (2^{31} - 1) \approx 2^{159} $$
Summing 128 such terms could reach as high as:

$$ 128 \cdot 2^{159} = 2^{166} $$
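As a sanity check on these bounds, here is a minimal Python sketch using arbitrary-precision integers. The names `f32_max`, `b_max`, `c_max`, `product_max` and `sum_max` are only illustrative, and the bit pattern `7f7fffff` is the standard encoding of the largest finite float32.

```python
import struct

# Largest finite float32: sign 0, exponent field 0xFE, all 23 fraction bits set.
f32_max = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]
b_max = int(f32_max)             # exact conversion: the value is an integer
c_max = 2**31 - 1                # largest int32

assert b_max == 2**128 - 2**104

product_max = b_max * c_max      # Python ints never overflow
sum_max = 128 * product_max      # upper bound on the 128-term sum

print(product_max.bit_length())  # 159
print(sum_max.bit_length())      # 166
print(sum_max < 2**191)          # True: fits in a signed 192-bit integer
```

So the worst-case sum stays just below \$2^{166}\$, which agrees with the estimate above.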

Now, considering representations:

A 192-bit signed integer (int192) can represent values up to \$2^{191} - 1\$, which should cover the entire possible result exactly.
A 64-bit float (float64) has 1 sign bit, 11 exponent bits, and 52 mantissa bits.
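To make the float64 layout concrete, the following sketch splits a value into its three bit fields. It assumes Python's `float` is an IEEE-754 binary64 (true on all common platforms), and `float64_fields` is a hypothetical helper name chosen for illustration.

```python
import struct

def float64_fields(x: float):
    """Split a binary64 value into (sign, biased exponent, fraction) bit fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                      # 1 bit
    biased_exponent = (bits >> 52) & 0x7FF # 11 bits, bias 1023
    fraction = bits & ((1 << 52) - 1)      # 52 stored mantissa bits
    return sign, biased_exponent, fraction

sign, e, frac = float64_fields(2.0**166)
print(sign, e - 1023, frac)  # 0 166 0
```

A value of magnitude \$2^{166}\$ therefore needs an unbiased exponent of only 166, well inside what 11 exponent bits can encode.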

Question:

Choices

(A) int64 will always give the exact result
(B) int192 will always give the exact result
(C) float32 will give a correct result, but possibly with rounding
(D) float64 will give a correct result, but possibly with rounding
(E) Both B and D are correct
(F) None of the above allow a correct result, even with rounding

The correct answer was (E). I did not understand why the exponent bits are enough and why we only have a problem with the rounding, so I would appreciate an explanation regarding the floating point representation range.

IEEE float64... Build a struct with the same bit representation and delve into the numeric meaning of each of its parts. For example, do you understand why 0 is just as special as inf or nan in it?
– Abel, Commented Jul 7 at 12:00
I'm not sure I understand the question at the end, but I think I am starting to get it: the bias is 1023 and the exponent has 11 bits, so the largest normal exponent is \$2^{11} - 2 - 1023 = 1023\$ and the maximum positive value is about \$1.M \times 2^{1023} \approx 2^{1024}\$. So in terms of magnitude we can reach the values in the range we want.
– dareen, Commented Jul 7 at 12:12
Floating point has a fixed number of digits in the mantissa. If the result cannot be represented in that number of digits, the result is rounded. The further a value is from 0, the lower the absolute precision of floating point.
– Kartman, Commented Jul 7 at 12:17

Spehro 'speff' Pefhany · Accepted Answer · 2025-07-07 12:13:36Z


A float64 has 11 exponent bits, so the exponent range is adequate: the largest finite value is approximately \$2^{1024}\$, and you only need about \$2^{166}\$.
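A quick way to check the range argument, sketched in Python under the assumption that its `float` is a binary64. The variable `worst_case_sum` is an illustrative name for the upper bound derived in the question.

```python
import sys

# Exact upper bound on the sum: 128 terms of (max float32) * (max int32).
worst_case_sum = 128 * (2**128 - 2**104) * (2**31 - 1)

print(sys.float_info.max_exp)                       # 1024: finite doubles lie below 2**1024
print(float(worst_case_sum) < sys.float_info.max)   # True: no overflow in float64
print(worst_case_sum > 2**128)                      # True: beyond the float32 range
```

The last line is consistent with choice (C) being ruled out: the worst-case sum exceeds anything a float32 can hold, while float64 has plenty of headroom.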

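As for rounding: a float64 has 52 stored mantissa bits (53 significant bits counting the implicit leading 1), and you are generating a result with 32+23 = 55 significant bits (more carefully, 31 magnitude bits from a positive int32 plus 24 from a float32 significand with its implicit leading 1), so there are not enough bits to represent all possible results exactly.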
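The rounding can be seen with a single product. This is only an illustrative sketch: the operands `b` and `c` are picked by hand to use every available significant bit, and Python's `float` is assumed to be a binary64.

```python
b = 2**24 - 1   # 24 significant bits, exactly representable in float32
c = 2**31 - 1   # largest int32, 31 significant bits

exact = b * c                  # arbitrary-precision integer product
approx = float(b) * float(c)   # the same product computed in float64

print(exact.bit_length())      # 55
print(exact - int(approx))     # 1
```

The exact product is odd and needs 55 bits, but float64 values in this range are spaced 4 apart, so the float64 result lands on the nearest representable neighbour and is off by 1. The magnitude is right; only the low bits are lost.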


Yeah, I think the part that confuses me is determining whether we have enough significant bits. For example, I understand it would be larger than 32, yet why it would be exactly 32+23 is not clear to me.
– dareen, Commented Jul 7 at 12:16
If you multiply \$2^n \times 2^m\$, the result is \$2^{m+n}\$. This is a bit hand-wavy, because the FP representation has an implicit leading 1, the sign bit is separate for FP, and the biggest possible numbers are 1 smaller, but it works out that way.
– Spehro 'speff' Pefhany, Commented Jul 7 at 12:20
float32 has 23 bits of mantissa (plus the implicit leading 1), so that's where that came from, in case it's not obvious.
– Spehro 'speff' Pefhany, Commented Jul 7 at 12:25
Thank you a lot, this was really helpful.
– dareen, Commented Jul 7 at 12:28
I encourage you to 'dig deep'. Getting the right answers on a quiz and knowing the approximate answer are both valuable, but knowing the exact behaviour is also valuable.
– Spehro 'speff' Pefhany, Commented Jul 7 at 12:31
