I came across a question that asked me to choose the correct number representation for a summation of float32 × int32 products:
$$ A = \sum_{n=0}^{127} b[n] \cdot c[n] $$
Given:
- `b` is an array of 32-bit IEEE-754 floating-point values (`float32`).
- `c` is an array of 32-bit signed integers (`int32`, using two's complement).
- All elements of both `b` and `c` are natural numbers (i.e. positive integers).
My Understanding So Far:
`b[n]` can range approximately between:
$$ (1.0 \times 2^{0}) \quad \text{to} \quad (1.111\dots1_2 \times 2^{127}) \approx 2^{128} - 2^{104} $$
`c[n]` can range from \$1\$ to \$2^{31} - 1\$.
Therefore, the maximum possible product \$b[n] \cdot c[n]\$ can be roughly:
$$ (2^{128} - 2^{104}) \cdot (2^{31} - 1) \approx 2^{159} $$
Summing 128 such terms could reach as high as:
$$ 128 \cdot 2^{159} \approx 2^{166} $$
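As a sanity check on the bounds above, here is a minimal Python sketch using exact integer arithmetic. The constant names (`F32_MAX`, `I32_MAX`, `N_TERMS`) are my own, and it assumes the worst case is every term hitting the largest finite float32 times the largest int32:

```python
import struct

# Largest finite float32, decoded from its bit pattern 0x7F7FFFFF.
f32_max_from_bits = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

# The same value as an exact integer: (2 - 2**-23) * 2**127 = 2**128 - 2**104.
F32_MAX = 2**128 - 2**104
I32_MAX = 2**31 - 1
N_TERMS = 128  # n = 0 .. 127

# Python ints are arbitrary precision, so these bounds are exact.
max_product = F32_MAX * I32_MAX
max_sum = N_TERMS * max_product

print(int(f32_max_from_bits) == F32_MAX)  # True
print(max_product.bit_length())           # 159, i.e. max_product < 2**159
print(max_sum.bit_length())               # 166, i.e. max_sum < 2**166
```

So the worst-case sum needs 166 magnitude bits, which matches the \$2^{166}\$ estimate.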
Now, considering representations:
- A 192-bit signed integer (`int192`) can represent values up to \$2^{191} - 1\$, which should cover the entire possible result exactly.
- A 64-bit float (`float64`) has 1 sign bit, 11 exponent bits, and 52 mantissa bits.
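To compare the two candidate representations against the \$2^{166}\$ bound, here is a rough Python sketch. It assumes Python's `float` is an IEEE-754 binary64 (true on common platforms), and the names `MAX_SUM_BOUND` and `INT192_MAX` are only illustrative, since there is no built-in `int192` type:

```python
import sys

MAX_SUM_BOUND = 2**166     # upper bound on the sum
INT192_MAX = 2**191 - 1    # largest value of a hypothetical signed 192-bit integer

# Range check for the integer type.
fits_int192 = MAX_SUM_BOUND <= INT192_MAX

# Range check for float64: finite values are below 2**max_exp (max_exp is 1024).
fits_float64_range = 166 < sys.float_info.max_exp

# Precision check: float64 keeps only mant_dig (53) significant bits,
# so a large odd integer cannot be stored exactly and gets rounded.
x = 2**160 + 1
rounded = int(float(x))

print(fits_int192, fits_float64_range)  # True True
print(sys.float_info.mant_dig)          # 53
print(rounded == x, rounded == 2**160)  # False True
```

Under these assumptions the magnitude fits in both types, and the only loss in `float64` is the rounding of low-order bits.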
Question:
Choices
- (A) `int64` will always give the exact result
- (B) `int192` will always give the exact result
- (C) `float32` will give a correct result, but possibly with rounding
- (D) `float64` will give a correct result, but possibly with rounding
- (E) Both B and D are correct
- (F) None of the above allow a correct result, even with rounding
The correct answer was (E). I do not understand why the exponent bits are enough and why rounding is the only problem, so I would appreciate an explanation of the floating-point representation range.