5
$\begingroup$

Given a new language, how best should we convert decimal to binary floating point? And please don't just say "use strtod(3)": The venerable C function supports a baroque variety of weird formats I might prefer to reject. Of course one option is to transform an input string into something C can deal with, but shouldn't there be a more direct way?

Let me credit David Gay with this monstrosity. It supports everything newer than Univac but contains many thousands of lines of of static data tables, conditional compilation, and otherwise-peculiar C code.

The associated paper begins by suggesting a straightforward procedure: Build an integer from up to the first 17 significant decimal digits, then multiply or divide by a suitable power of ten, which presumably the hardware can easily produce for you. And then the paper promptly gets much more difficult to understand, or to filter out the historical noise about hokey formats and ancient hardware.

Can anyone put a modern spin on this devilish detail of implementation? Is this really something no human should undertake? Or can this be done properly in drastically less code, assuming only a modern machine?

$\endgroup$
11
  • 4
    $\begingroup$ It’s not obvious to me why strtod accepting formats you prefer to reject presents a problem; you’re only going to feed valid tokens of the language into it, which by definition are in formats you do want to accept. $\endgroup$ Commented Nov 22, 2023 at 5:12
  • 1
    $\begingroup$ @MichaelHomer I suppose if strtod accepts a superset of what you want, it's not too crazy. Say you allow underscores between groups of digits: Now you've some string munging to do. It's entirely feasible but also indirect. It would be quite straightforward to work out a mantissa and exponent in one pass through the string. The question becomes under what circumstances the obvious code breaks down, and how to detect / correct it without truckloads of static magic data. $\endgroup$ Commented Nov 22, 2023 at 6:32
  • $\begingroup$ But that is the exact opposite of strtod accepting formats you prefer to reject, isn’t it? I don’t think that single-pass token construction is generally a concern for a typical language, in any case, so it would probably be good to outline the circumstances that will make it a necessity for this one. The real-world answer is almost certainly “use strtod” (e.g. Python here, which is still doing work around it), but there could still be situations where that’s available but unsuitable. $\endgroup$ Commented Nov 22, 2023 at 7:02
  • 2
    $\begingroup$ Do you also need to format floating points, or only to parse them? The former is extremely hairy -- at least if aiming for a round-trip. $\endgroup$ Commented Nov 22, 2023 at 16:25
  • $\begingroup$ @MichaelHomer I read that code and I find it rather depressing. All those gymnastics to transform a Python number into a strtod number (including locale concerns) by which time you almost might as well just finish the job. This suggests that strtod is the wrong abstraction: too "stringly-typed" in a sense for how e.g. Python wishes to represent numbers. $\endgroup$ Commented Nov 22, 2023 at 19:07

3 Answers 3

4
$\begingroup$

This is going to be a single-reference answer, but this classic paper is a "must read" for anyone writing a standard library:

William D. Clinger, How to read floating point numbers accurately, ACM SIGPLAN Notices 25:6, Jun 1990, pp 92–101.

There is some more recent work by Daniel Lemire et al that is worth checking out for a more modern approach, but the first one is still worth it.

$\endgroup$
2
  • 1
    $\begingroup$ Accepting because it actually answers the question even though it may not be the wisest plan for most people. $\endgroup$ Commented Nov 26, 2023 at 1:51
  • 1
    $\begingroup$ "Most people" perhaps, but part of the reason we write this stuff down is so that institutional knowledge isn't lost. It would warm my heart if someone in this community takes it upon themselves to implement a high quality floating point reader from scratch, if only as a learning exercise. $\endgroup$ Commented Nov 28, 2023 at 22:29
5
$\begingroup$

Is this really something no human should undertake?

Answering your question literally, it is something humans have undertaken and may well have to undertake again. But more importantly, should you do it? Or perhaps, when should you do it?

You've proposed wanting to reject formats that the existing implementation would accept. In practice this isn't an issue, because the rejection of tokens which are invalid according to your language's lexical syntax, is something that should happen before you call out to a third-party library to determine the meaning of those tokens. That is, your language's lexer should already be rejecting invalid number literals. Likewise, if you're parsing numbers dynamically at runtime (i.e. your standard library's parsing function, rather than your language's lexer), then you can check the string's format before calling out to the library. So this isn't a reason to avoid an existing implementation.

As you say, you may need to transform the string into a format accepted by the existing implementation, e.g. if your syntax allows underscores within number literals and the existing implementation doesn't accept them, then you'll need to strip them out first. But this is easy to do*, and not a reason to avoid the existing implementation.

One reason I can think of to implement your own conversion from decimal strings to binary fractional numbers, is if you need to support a different binary representation than standard floating-point. For example, your language might have a built-in arbitrary-precision number type, and allow literals of that type, in which case parsing as a fixed-width floating-point type isn't going to cut it. However, even in this case it would still make more sense to use an existing library, if there is an acceptable one for your host language.

On the other hand, there are two very good reasons to avoid "rolling your own" solution when an existing, tried and tested solution is available:

  • Your code could have a bug which gives incorrect parses in a rare case, and this would be hard to spot particularly if the error in the resulting value is small.
  • Your code could have a bug causing an infinite loop in a rare case, like this one in PHP. Because the same parsing function was being used at runtime, the fallout of that bug was that almost any web application written in PHP was vulnerable to a denial of service attack, simply by the user entering the string 2.2250738585072011e-308 anywhere a number was expected.

Given the stakes of getting this wrong, then, you should only attempt this both when no existing battle-tested implementation could meet your needs, and also when you have the capability to verify the correctness of your implementation to a very high degree of confidence (e.g. formal verification or fuzz-testing).


*In the comments it's been noted that this is not, in fact, easy for Python (source link), since a transformation is required to convert a string into the locale expected by strtod. But the same logic applies: prefer a tried-and-tested existing library that already does this (or doesn't need to), and only write your own as a last resort.

$\endgroup$
2
  • 1
    $\begingroup$ It’s not Python that handles locales there, it’s strtod that inherently uses the locale radix separator under POSIX (controlled by LC_NUMERIC), so Python has to replace source-text . with the locale-specific separator before passing the string in. Something like this would be required for any strtod-based implementation. (e.g. try tcc -run - <<<$'#include <stdio.h>\n#include <locale.h>\n#include <stdlib.h>\nint main() { setlocale(LC_ALL, "en_DK.utf8");\nprintf("%f\\n", strtod("1234.5", NULL));}' and see the wrong output). $\endgroup$ Commented Nov 23, 2023 at 5:20
  • $\begingroup$ @MichaelHomer Ah! I'll edit to fix. Surely there is a pre-existing float parser that doesn't depend on global state, though? $\endgroup$ Commented Nov 23, 2023 at 17:52
-2
$\begingroup$

It depends whether you consider programming to be a kind of engineering or a kind of politics.

  1. If you rely on strtod or anything else that someone else has written, then for any bug that appears in it, on any platform far into the future, all you can tell your users is “yes it’s a bug but it is not my fault”. To a potential customer that is unacceptable.

  2. You will have to accept all the decisions made by people you don’t know and have no control or influence over, some of which are unknowable until you discover them. For instance if they decide that numeric conversions ought to depend on (a) where the user is or (b) where the user says he is or (c) where the creators of strtod say the user ought to be - the decimal-point-or-comma is an example – then you and your customers are hostages to politics and their software will work or not work for political reasons. After all, the Unicode character set already has some characters that are not permitted to be displayed in China. Who knows if some country, some day, won’t decide that certain numbers carry a hidden message and are unacceptable?

The only responsible way out is to adopt the policy “all the bugs are my own” and keep to it from the very beginning.

Either write your own conversion routine (which you can test to death against a standard implementation) or take the source code of something like strtod and make it your own. So it won’t change from platform to platform, decade to decade, unless you want it to; and if anything goes wrong with it, you can correct it.

$\endgroup$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.