Why would external iteration require lots of code inlining relative to internal iteration?

Question

In Graydon Hoare's blogpost The Rust I Wanted Had No Future, they state that:

Iteration used to be by stack / non-escaping coroutines, which we also called "interior" iteration, as opposed to "exterior" iteration by pointer-like things that live in variables you advance. Such coroutines are now finally supported by LLVM (they weren't at the time) and are actually a fairly old and reliable mechanism for a linking-friendly, not-having-to-inline-tons-of-library-code abstraction for iteration.

Why would external iterators require lots of inlining to be performant relative to internal iteration?

This is likely a Rust question and not a programming languages question because it cannot be answered in general. It has a lot to do with how iteration on containers is done and there are examples that implementing iterator interfaces still results in good code generation. Hence, it should be moved to stack overflow and tagged as rust. — feldentm
– feldentm, Commented Aug 6, 2023 at 13:54
@feldentm Questions about the design or implementation of specific languages are on-topic here. We already have a rust tag for questions about the design and implementation of Rust; I'm not convinced this question is specific to Rust, but if it is then that is fine. If the answer is "this only matters in Rust, because ..." then that is fine too. — kaya3
– kaya3, Commented Aug 6, 2023 at 15:46

Matthieu M. · Accepted Answer · 2023-08-08 09:42:21Z

Context

First of all, Graydon clarified the implementation they were thinking of on reddit:

This is in fact what I was talking about -- not separate stacks, but a persistent portion of the shared stack carved out for the iteratee. actually implemented in rustboot, but abandoned along the way when we moved to rustc, in part because llvm didn't support anything like it at the time. It does now.

The idea is to avoid function call overhead without inlining by jumping back and forth between coroutine code and iterator code during iteration.

External woes: stack of calls

It is common, in Rust, to create pipelines of iterators. Imagine something like:

 collection
     .into_iter()    //  Returns Person elements
     .enumerate()    //  Returns a tuple (index, Person)
     .filter(|i| i.1.age >= 18)
     .filter(|i| i.1.address.zip % 100 == 63)
     .for_each(|i| send_letter(i.0, i.1)); // or a for loop, for external.

With external iteration, however, this will result in:

N calls to the age predicate.
N' calls to the zip predicate.
N'' calls to the send letter callback.
N'' calls to Filter's next (zip).
N' calls to Filter's next (age).
N calls to Enumerate next.
N calls to Iterator next.
(3 calls to iterator methods (1 each): enumerate, filter, filter, no for_each)

With internal iteration (naive), this will result in:

N calls to the age predicate.
N' calls to the zip predicate.
N'' calls to the send letter callback.
N' calls to the wrapper around the zip predicate + tail.
N calls to the wrapper around the age predicate + tail.
(4 calls to iterator methods (1 each): enumerate, filter, filter, for_each)

With internal iteration (Graydon style), this will result in:

N calls to the age predicate.
N' calls to the zip predicate.
N'' calls to the send letter callback.
(4 calls to iterator methods (1 each): enumerate, filter, filter, for_each)

So, just comparing the number of non-user function calls:

External: N + N + N' + N'' + 3.
Internal (naive): N + N' + 4.
Internal (Graydon): 4.

The number of function calls of external iteration is just much greater... especially when compared to Graydon-style internal iteration.

Since inlining is the optimization which allows eliminating function call overhead, it benefits external iteration disproportionally.

External woes: lost context

This is best illustrated, I find, using the Chain iterator. A Chain iterator is simple: it takes two iterators, and chains them together in a single one, first yielding all elements of the first then yielding all elements of the second.

With internal iteration, it's quite simple; in Python:

class Chain:
    ...

    def for_each(self, callback):
        self.first.for_each(callback)
        self.second.for_each(callback)

And that's it.

With external iteration, however, every time the next element is queried, one must check which iterator to yield from:

class Chain:
    ...

    def next(self):
        if self.use_second == False:
            n = self.first.next()

            if n:
                return n

            self.use_second = True

        return self.second.next()

The reason is that for any complex iteration protocol, the context is lost in between each query to next, and therefore (1) must be saved internally and (2) must be reconstructed with every call.

The only way to avoid saving + reconstructing at every call... is to eliminate the call by inlining it, and then have the optimizer optimize those out¹.

Once again, inlining benefits external iteration more than it does internal iteration.

¹ As of LLVM 16, LLVM still doesn't optimize chain iterators next calls, and instead keeps a single loop with a flag check at every iteration...

Appendix: why External?

One key advantage of External iteration manifests itself with non-trivial loops: since using external iteration the control-flow is in the hands of the caller, they can do a lot: continue/break (potentially to other loops), return, zip, etc...

External iteration offers more flexibility, and the greater implementation complexity cost does not necessarily result in a performance loss with contemporary optimizers -- at least on the simpler iterators. This makes it a good trade-off, but it is a trade-off.

Eldritch Conundrum · Accepted Answer · 2023-08-06 04:27:10Z

1

Internal iteration has good codegen. Consider this simple iteration:

    void InternalIterationOnArray(string[] array, Action<string> action)
    {
        for (int index = 0; index < length; index++)
        {
            var current = array[index];
            action(current);
        }
    }

There are only efficient operations here. In constrast, with external iteration, all the work is done in the functions that implement your iterator's interface. Consider the C# desugared equivalent of a foreach loop:

    void ExternalIterationOnArray(string[] array, Action<string> action)
    {
        List<int>.Enumerator e = array.GetEnumerator();
        string current;
        while (e.MoveNext())
        {
            current = e.Current;
            action(current);
        }
        // (try-finally and IDisposable omitted)
    }

IEnumerable.GetEnumerator(): Creates an iterator object (for example, ArrayEnumerator) that contains a reference to array and the declaration of index.

Iterator.MoveNext(): Does index++ and index < length.

Iterator.Current: Does array[i].

To get something as efficient as the internal iteration, ideally you'd want not only this scattered code to be inlined all in one place, but you'd also want the iterator object to be allocated on the stack and not on the heap, which could also avoid having a separate reference to array.

edited Aug 6, 2023 at 4:27

answered Aug 6, 2023 at 2:57

Eldritch Conundrum

2,8679 silver badges26 bronze badges

1

$\begingroup$ I think this answer needs to address the stack coroutine side of the question and not just point at procedural for loops directly. $\endgroup$

Michael Homer
– Michael Homer ♦

2023-08-06 03:16:54 +00:00
Commented Aug 6, 2023 at 3:16
$\begingroup$ I don't know about LLVM's support for stack coroutines and its impact on inlining. But I would love to read a more detailed answer about it :) $\endgroup$

Eldritch Conundrum
– Eldritch Conundrum

2023-08-06 03:38:17 +00:00
Commented Aug 6, 2023 at 3:38
$\begingroup$ I think it doesn't address the question, then, because I don't believe it's asking about that trivial case, and this doesn't address any sort of internal iteration regardless. $\endgroup$

Michael Homer
– Michael Homer ♦

2023-08-06 03:46:31 +00:00
Commented Aug 6, 2023 at 3:46
1

$\begingroup$ Surely it shows that interior iteration also requires inlining? Double-inlining even: a clone of the library code into the context from which it is called (not shown here), and then the contents of action into that clone. $\endgroup$

user1030
– user1030

2023-08-06 08:25:28 +00:00
Commented Aug 6, 2023 at 8:25
1

$\begingroup$ The "inner inline" is not even an option until after InternalIterationOnArray was inlined into its context (hypothetically it could be cloned out-of-line, but that clone would have only one caller so it makes little sense to do it that way), which is why I included that, not to save one call. $\endgroup$

user1030
– user1030

2023-08-06 19:57:23 +00:00
Commented Aug 6, 2023 at 19:57

| Show 6 more comments

Stack Exchange Network

Why would external iteration require lots of code inlining relative to internal iteration?

2 Answers 2

Context

External woes: stack of calls

External woes: lost context

Appendix: why External?

You must log in to answer this question.

Linked

Hot Network Questions