Expanding on my comment above:
For a while I pursued efficient drawing of filled polygon graphics on 8-bit systems through differencing. My intention was that I would exit the 3d calculations with what amounts to a span buffer: for each line an ordered list of only the x positions at which a change of colour occurs, and the colour that is changed to. The advantage of that is that it's fairly trivial to do an O(n) comparison between the new list for a line and the last one that was drawn there and draw only the differences. So you're in effect only repainting the boundary changes. Since polygons tend to be large, that's a lot less drawing.
Demos do that sort of thing all the time but are specialised to particular scenes, e.g. a single non-overlapping shape (because it's convex with hidden face removal, or originally 2d), or a scene that is entirely or overwhelmingly precomputed. I didn't want to produce a demo.
So what I wanted was essentially identical to what you're proposing: just an ordered list of transitions.
Option one is to fill the thing from back to front. Naively assuming that my intermediate working set is the same as my final output set, to add a new span x1 to x2 I need to figure out where x1 should go in the existing list and insert it, and then I need to walk the list until x2 keeping track of colours and removing transitions, then insert one at x2 back to whatever I discovered the active colour to be there. I do that for every span, I'm done. It's a logarithmic search, then naively a linear one.
Smarter is to store left colour and right colour at each transition. Then that's two logarithmic searches, one over a smaller data set than the other, plus arbitrary removals and usually two insertions (unless you fall exactly on an existing transition, that is).
In practice, the amount of time you spend shuffling bytes back and forth in the list kills any advantage you might have hoped to get as soon as you have any number of polygons more than the amount you could just straight draw.
So I tried switching to drawing front to back. Therefore you may insert only that portion of a new span that would cover an area that is currently the background colour. That prima facie reduces inserts but makes walking the list obligatory. So you're back to logarithmic plus a linear walk. And, actually, it actually often increases inserts because you're often splitting the new span you're inserting. Though if you keep a proper tally you can spot when a line is already full and reject all further spans immediately, so looking directly at a wall suddenly becomes the optimal case rather than the worst.
Using a linked list rather than a linear makes the issues worse, not better, because it makes all searches linear rather than logarithmic.
So I abandoned the idea of creating the list in its output form and switched to front to back tree-based insertion, which is serialised only at the end. Each node is a complete span, and sticking with front-to-back means never splitting what's already in the tree, only splitting or discarding what's being inserted. An issue then is storage if you want any sort of speedy traversal. I was targetting a 256-pixel display window so I went with five bytes per node — start x, end x, link to left subtree, link to right subtree, colour, and allocated each output line its own 128-byte half-page aligned section of memory. So that's 24kb, ouch! And no more than 25 polygons on each line, though that's less troubling.
The inserts are then 'cheap' and the final serialisation is a simple stack-based in-order walk but that still very quickly became far and away the dominant processing cost.
You can resolve the fixed-24kb issue by using seven-byte records but then you're talking about walking 16-bit pointers which is more hassle for a Z80 and not really practical at all for a game on a 6502.
You can also resolve it by producing polygon spans on demand and having only a single-line buffer. That introduces a further potential optimisation: sort your span-producing objects from left to right and produce the sorted list 'directly'. The issue is that you've lost your front-to-back ordering. So actually what happens is that at each object edge you know which n objects are currently outputting pixels, you pick the frontmost, then proceed to the next transition point. Where you possibly make much the same decision, but after adding or subtracting from the active list.
That actually ended up being the most optimal thing, and was demonstrably smarter than brute forcing, but look at the steps I've had to perform:
- prepare a list of all objects that contribute to the scene up front;
- ideally, sort them by starting y position;
- step down the display one line at a time, keeping a note of which objects are active on that line without breaking the ordering;
- for that line, look at my indirect list of active objects and check that it is still properly sorted by x. If it isn't, adjust it;
- step through the transition points (i.e. start or end of object) of that that from left to right, at each making a decision about the new frontmost and writing a transition if it has changed.
My case is actually slightly simpler than sprites: my objects were horizontally convex. Each had only one start and end per line. For sprites you're going to need to have an RLE version of their boundaries.
I pursued it because it was interesting, but it was a major rabbit hole, the in-RAM requirements for doing anything in a feasible amount of time are large by the standards of the day, and it still cost quite a lot.
If you presented that to contemporaneous developers as an alternative for sprite output to the C64, NES, ColecoVision, etc school of just giving a list of sprites and accepting that they're limited to, often, 16x16 and four or eight per line, I don't think your platform would get much support. It's a lot of work, it burns a huge amount of your CPU advantage, and it's just too peculiar. Maybe you'd have found an audience for a brief while if you'd been able to leap in with that as the next development straight from the 2600, getting an audience of beam racers directly on board, but even then somebody like Coleco would have appeared and said 'we just made it simple and, look, we can port things reasonably well very quickly'.
So I like the idea, but it's actually much more complicated than it sounds, and I think that would have dissuaded development.
I'll also note that for sprite graphics you've also probably substantially increased the necessary memory bandwidth because you need to write a transition each time a sprite switches from transparent to opaque. So you need a decent precision on where the display processor should start fetching from. So you're probably talking a worst case of 16 bits per pixel for anything regular and with decent precision. If you don't bucket the transitions then it's at least 24 bits per transition as you need to be explicit about location. Something like the C64 has bandwidth for only 40 byte fetches in a line (bad lines aside) so that's 13 whole transitions. Even with just rectangles, each has a transition in and a transition out. So that's 6 whole objects. You've done worse than with the 8 hardware sprites that machine actually has.
(bonus chatter: yes, preparing the scene as above would in principle allow it to be commuted to something like a Copper or ANTIC list rather than actual pixels; no, I'm insufficiently familiar with those machines to know whether that's actually helpful; I was just looking for the platform-neutral partial redraw possibilities)