\$\begingroup\$

I came across a question that asked for the correct number representation for a sum of float32 × int32 products:

$$ A = \sum_{n=0}^{127} b[n] \cdot c[n] $$

Given:

  • b is an array of 32-bit IEEE-754 floating-point values (float32).
  • c is an array of 32-bit signed integers (int32, using two's complement).
  • every element of b and c is a natural number (i.e. a positive integer).

My Understanding So Far:

  • b[n] can range approximately between:

    $$ (1.0 \times 2^{0}) \quad \text{to} \quad (1.111\dots1_2 \times 2^{127}) = 2^{128} - 2^{104} $$

  • c[n] can range from \$1\$ to \$2^{31} - 1\$.

  • Therefore, the maximum possible product \$b[n] \cdot c[n]\$ can be roughly:

    $$ (2^{128} - 2^{104}) \cdot (2^{31} - 1) \approx 2^{159} $$

  • Summing 128 such terms could reach as high as:

    $$ 128 \cdot 2^{159} = 2^{166} $$
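
The bounds above can be checked with exact integer arithmetic. This is a minimal sketch using Python's arbitrary-precision integers; the constant names are made up for illustration and are not part of the original question.

```python
# Exact bounds, computed with Python's arbitrary-precision integers.
# All names here are illustrative assumptions, not from the quiz.
FLOAT32_MAX = (2**24 - 1) * 2**104   # (2 - 2^-23) * 2^127 = 2^128 - 2^104
INT32_MAX = 2**31 - 1
N_TERMS = 128

max_product = FLOAT32_MAX * INT32_MAX
max_sum = N_TERMS * max_product

print(FLOAT32_MAX == 2**128 - 2**104)  # True
print(max_product.bit_length())        # 159, i.e. max_product < 2^159
print(max_sum.bit_length())            # 166, i.e. max_sum < 2^166
print(max_sum < 2**191)                # True: fits in a signed 192-bit integer
```

The bit lengths confirm that each product stays just below \$2^{159}\$ and the sum just below \$2^{166}\$.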

Now, considering representations:

  1. A 192-bit signed integer (int192) can represent values up to \$2^{191} - 1\$, which should cover the entire possible result exactly.
  2. A 64-bit float (float64) has 1 sign bit, 11 exponent bits, and 52 mantissa bits.
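
To see what that layout implies for the range, here is a small sketch that splits a float64 into its three fields. It assumes Python's `float` is IEEE-754 binary64 (true on all common platforms); the helper name `float64_fields` is hypothetical.

```python
import struct
import sys

def float64_fields(x: float):
    """Split an IEEE-754 binary64 value into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF   # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)        # 52 stored mantissa bits
    return sign, biased_exponent, fraction

# Largest finite float64: biased exponent 2046 (2047 is reserved for inf/NaN),
# so the true exponent is 2046 - 1023 = 1023 and the value is just under 2^1024.
sign, biased, frac = float64_fields(sys.float_info.max)
print(sign, biased - 1023, frac == 2**52 - 1)  # 0 1023 True

# The bound on the sum, 2^166, is far below that limit.
print(2.0**166 < sys.float_info.max)           # True
```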

Question:

Choices

  • (A) int64 will always give the exact result
  • (B) int192 will always give the exact result
  • (C) float32 will give a correct result, but possibly with rounding
  • (D) float64 will give a correct result, but possibly with rounding
  • (E) Both B and D are correct
  • (F) None of the above allow a correct result, even with rounding

The correct answer was (E). I don't understand why the exponent bits are sufficient and why rounding is the only problem, so I would appreciate an explanation of the range of the floating-point representation.

\$\endgroup\$
  • \$\begingroup\$ IEEE float64... Build a struct with the same bit representation and delve into the numeric meaning of each of its parts. For example, do you understand why 0 is just as special as inf or NaN in it? \$\endgroup\$ Commented Jul 7 at 12:00
  • \$\begingroup\$ I'm not sure I understand the question at the end, but I think I'm starting to get it: the exponent has 11 bits with a bias of 1023, and the all-ones exponent is reserved for inf/NaN, so the largest exponent is \$2^{11} - 2 - 1023 = 1023\$. A positive number can therefore be as large as \$1.M \times 2^{1023} < 2^{1024}\$, so in terms of magnitude we can reach the range we need. \$\endgroup\$ Commented Jul 7 at 12:12
  • \$\begingroup\$ Floating point has a fixed number of digits in the mantissa. If the result cannot be represented in that number of digits, it is rounded. The further a value is from 0, the lower the absolute precision of floating point. \$\endgroup\$ Commented Jul 7 at 12:17
  • \$\begingroup\$ @Kartman thank you, that cleared up what rounded means \$\endgroup\$ Commented Jul 7 at 12:19

1 Answer

\$\begingroup\$

float64 has 11 exponent bits, so the exponent range is adequate: the largest finite value is just under \$2^{1024}\$, and you only need about \$2^{166}\$.

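As for rounding: a float64 has 52 stored mantissa bits (53 significant bits with the implicit leading 1), while each product can have up to 32+23 = 55 significant bits (more precisely, 24 significant bits from the float32 times up to 31 magnitude bits from the positive int32, which gives the same total). Since 55 > 53, float64 cannot represent every possible result exactly.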
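
A concrete case may make the rounding visible. The sketch below picks one illustrative pair (my choice, not from the quiz): `b = 2**24 - 1`, which uses all 24 significant bits of a float32, and `c = 2**31 - 1`, the largest int32. It assumes Python's `float` is IEEE-754 binary64.

```python
import struct

b = 2**24 - 1   # exactly representable in float32 (24 significant bits)
c = 2**31 - 1   # largest positive int32

# Sanity check: b survives a round trip through float32 unchanged.
assert struct.unpack("<f", struct.pack("<f", float(b)))[0] == b

exact = b * c                        # exact integer product
as_float64 = float(b) * float(c)     # same product computed in float64

print(exact.bit_length())            # 55 significant bits needed
print(int(as_float64) == exact)      # False: float64 had to round
print(exact - int(as_float64))       # 1: rounded to the nearest multiple of 4
```

The float64 result is off by 1 out of roughly \$2^{55}\$, so it is correct in magnitude but not exact, which is what choice (D) describes.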

\$\endgroup\$
  • \$\begingroup\$ Yeah, I think the part that confuses me is determining whether we have enough significant bits. For example, I understand it would be larger than 32, yet why it would be exactly 32+23 is not clear to me. \$\endgroup\$ Commented Jul 7 at 12:16
  • \$\begingroup\$ If you multiply \$2^n \times 2^m\$ the result is \$2^{m+n}\$. This is a bit hand-wavy because the FP representation has an implicit leading 1, the sign bit is separate for FP, and the biggest possible numbers are 1 smaller, but it works out that way. \$\endgroup\$ Commented Jul 7 at 12:20
  • \$\begingroup\$ float32 has 23 bits of mantissa (plus the implicit leading 1), so that's where that came from, in case it's not obvious \$\endgroup\$ Commented Jul 7 at 12:25
  • \$\begingroup\$ Thank you a lot, this was really helpful. \$\endgroup\$ Commented Jul 7 at 12:28
  • \$\begingroup\$ I encourage you to 'dig deep'. Getting the right answers on a quiz and knowing the approximate answer are both valuable, but knowing the exact behaviour is also valuable. \$\endgroup\$ Commented Jul 7 at 12:31
