Empirically, what are the implementation-complexity and performance implications of "unboxed" primitives?

Question

I'm designing a Python-like language: bytecode-compiled, brace-free syntax, reference semantics for variables. Among many differences from Python, I want to support a limited form of static typing. I think my planned scheme is a simplified form of sound gradual typing, combined with type inference.

Without getting into too much detail, the plan is that any given variable will have one of a few set types, including one which represents a type that needs to be checked at runtime. The compiler will perform basic type checks, and generate opcodes specialized to the type of the operands involved in each step of the computation - e.g. a separate opcode to add integers (using hard-coded logic) vs. dynamically-typed "objects" (dynamically looking up a function based on some form of runtime type information, possibly involving multiple dispatch).

Because I have this typing information, I had the thought of taking a cue from Java and representing types that fit within a machine word - booleans, floating-point numbers, restricted-range integers, perhaps enumerations - by directly writing the value into the stack frame when these types are used for ordinary variables and auto-boxing them if necessary for storage purposes.

This seems like it would allow avoiding a lot of overhead for simple cases: I wouldn't have to allocate or initialize an object for a simple integer value, maintain reference counts for garbage collection purposes, follow a pointer to access the underlying data, create and rebind objects for each mathematical operation, worry about maintaining a cache of small integer values, etc. Since I would choose a different opcode statically, and I have to pay the cost of switching on the opcode anyway, it also wouldn't cost a branch misprediction (I think?) when accessing the value (and conditionally not dereferencing like usual).

However, I'm worried about the complexity this could introduce. It seems like Java is able to handle this sort of dispatch, and presumably it does see a net benefit; but Java doesn't have to worry about boxing primitives to store them as object attributes, since fields are statically typed. I'm also worried about leaking the "reference semantics" abstraction, since such types would naturally have value semantics. I think it should be fine as long as there are no routes to implicitly aliasing the "values", since there shouldn't be a way, from inside the program, of observing the difference between a value type and an unaliased, immutable reference type. But maybe I'm forgetting something?

Is there extant literature on implementing this sort of system? Is the optimization effective? Does Java see significant performance benefits from having "primitive" types that don't behave like objects? (Can/do modern JVMs represent e.g. int as java.lang.Integer and paper over the API differences?)

1) What do you mean with "the user is expected to box explicitly in order to use library collection types"? Virtually all (un)boxing in Java is compiler-generated (at least, since the generic collection types existed in Java 5). Are you envisaging a more automatic autoboxing in this model? 2) It sounds like what you're thinking of is essentially C#'s primitive ints that can behave as object types at the language level when used as such, but are machine values ordinarily. Is that right, or is there another axis to it? — Michael Homer
– Michael Homer ♦, Commented Jul 25, 2023 at 7:33
1) I'm probably mistaken - I haven't used Java in a long time, and my bad experiential memories probably warp my technical memories. I edited that part. 2) I... think that captures the essence of it, yes. However, I don't intend to implement (more realistically, specify... who knows if I'll ever get around to implementation) anything like C#'s distinction between class and struct types. — Karl Knechtel
– Karl Knechtel, Commented Jul 25, 2023 at 7:46

kaya3 · Accepted Answer · 2023-07-25 14:49:55Z

The main decision you have to make is whether your language will be sufficiently statically typed that you don't need type tags for unboxed primitives at runtime.

Java (or rather, JVM bytecode) is an example of such a language ─ every variable, including object fields and array members, is statically known to be either a reference type or a particular primitive type. For example, when the dadd opcode is executed, the compiler has already ensured that the top two values on the stack will be doubles, and there is no need to check at runtime. Hence there is no need for a type tag to enable a check at runtime.
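To make the contrast concrete, here is a minimal sketch of a stack machine with one statically typed add and one dynamically dispatched add. The opcode names (PUSH, ADD_F64, ADD_DYN) and the DYNAMIC_ADD table are hypothetical and only loosely modelled on dadd; this is not JVM bytecode. Python checks types on every + anyway, so the sketch only shows where a check would or would not appear in a real interpreter.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical opcodes for illustration only.
PUSH, ADD_F64, ADD_DYN = "PUSH", "ADD_F64", "ADD_DYN"

# Handlers for the dynamic add, looked up by the runtime types of both operands.
DYNAMIC_ADD: Dict[Tuple[type, type], Callable[[Any, Any], Any]] = {
    (int, int): lambda a, b: a + b,
    (float, float): lambda a, b: a + b,
    (int, float): lambda a, b: float(a) + b,
    (float, int): lambda a, b: a + float(b),
    (str, str): lambda a, b: a + b,
}

def run(program: List[Tuple[str, Any]]) -> Any:
    stack: List[Any] = []
    for op, arg in program:
        if op == PUSH:
            stack.append(arg)
        elif op == ADD_F64:
            # The compiler is assumed to have proved both operands are
            # doubles, so the interpreter performs no type check here.
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == ADD_DYN:
            # Nothing is known statically: branch on runtime types.
            b = stack.pop()
            a = stack.pop()
            handler = DYNAMIC_ADD.get((type(a), type(b)))
            if handler is None:
                raise TypeError(f"cannot add {type(a).__name__} and {type(b).__name__}")
            stack.append(handler(a, b))
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return stack.pop()
```

The switch on the opcode is paid in both cases; only ADD_DYN pays the additional lookup on operand types.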

On the other hand, Javascript is an example of the opposite. In Javascript, it is legal to have a variable whose value is not statically known to be either a primitive or an object, and its type might change at runtime. Therefore Javascript implementations need some way to distinguish the type of a value at runtime. In principle this could be done by boxing all primitives so that everything is an object (CPython does this), but for performance reasons Javascript implementations use type tags instead. This means that each value (not just on the stack, but also object fields and array members) has attached metadata which can be branched on when it is necessary to know the value's type.

Note that branching is a fundamental requirement for languages of the second kind ─ not a consequence of the choice to box or not box. If you don't statically know whether your + operands are integers, doubles or strings then you must branch at runtime, regardless of how they are represented. So the main benefit of using tagged unboxed primitives is that there are fewer objects in memory, and less pointer indirection (which can cause cache misses). The difference regarding branching is in whether you branch after or instead of following a reference to an object, not whether you branch at all.
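A rough sketch of that difference, assuming a made-up representation: Tagged stands for an unboxed value whose tag sits next to the payload, and a plain list named heap stands in for memory, so heap[ref] models following a pointer. None of these names come from a real VM.

```python
from dataclasses import dataclass

# Hypothetical type tags for illustration.
TAG_INT, TAG_F64, TAG_STR = 0, 1, 2

@dataclass(frozen=True)
class Tagged:
    """A value whose type tag is stored alongside its payload."""
    tag: int
    payload: object

def add_tagged(a: Tagged, b: Tagged) -> Tagged:
    # Unboxed case: branch on the tags *instead of* following a reference.
    if a.tag == TAG_INT and b.tag == TAG_INT:
        return Tagged(TAG_INT, a.payload + b.payload)
    if a.tag in (TAG_INT, TAG_F64) and b.tag in (TAG_INT, TAG_F64):
        return Tagged(TAG_F64, float(a.payload) + float(b.payload))
    if a.tag == TAG_STR and b.tag == TAG_STR:
        return Tagged(TAG_STR, a.payload + b.payload)
    raise TypeError("unsupported operand tags")

def add_boxed(heap: list, a_ref: int, b_ref: int) -> int:
    # Boxed case: follow both references first, then do the same branching,
    # then allocate a new box for the result and return a reference to it.
    a = heap[a_ref]
    b = heap[b_ref]
    heap.append(add_tagged(a, b))
    return len(heap) - 1
```

Both functions branch on the tags; add_boxed additionally pays for two dereferences and one allocation.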

In your question it sounds like you are trying to write a language of the first kind, but you also have objects whose fields may be references to boxed primitives, and you want those primitives to be unboxed when they're on the stack but not in other places. I am not sure how your language could statically know that x + y will use two primitives, when x and y might be loaded from object fields which aren't statically known to be primitives; but I'll skip past that and assume you have solved this problem somehow.

In this case your language is really of the second kind; you still need to branch on type tags, it's just that your type tags are stored in the boxed values (i.e. behind a layer of pointer indirection) and you branch when loading an object's field onto the stack instead of when performing an operation on values from the stack. But either way, you are checking the types of x and y at runtime before you perform a double-add on them.
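As a sketch of where that check ends up, assume objects are dictionaries of boxed fields and the compiler emits a typed load for fields it wants as doubles. Box, load_field_f64, store_field_f64 and add_fields are hypothetical names invented for this illustration.

```python
class Box:
    """A heap-allocated value carrying its own type tag."""
    def __init__(self, type_tag: str, value):
        self.type_tag = type_tag
        self.value = value

def load_field_f64(obj: dict, name: str) -> float:
    # The runtime type check happens here, when the field is loaded
    # onto the stack, rather than at the add itself.
    box = obj[name]
    if box.type_tag != "f64":
        raise TypeError(f"field {name!r} does not hold a double")
    return box.value  # unboxed from here on

def store_field_f64(obj: dict, name: str, raw: float) -> None:
    # Storing into an object re-boxes the raw value.
    obj[name] = Box("f64", raw)

def add_fields(obj: dict, x_name: str, y_name: str, out_name: str) -> None:
    x = load_field_f64(obj, x_name)
    y = load_field_f64(obj, y_name)
    # Between load and store the values are known doubles,
    # so this add needs no further check.
    store_field_f64(obj, out_name, x + y)
```

The check has moved, not disappeared: each load still inspects a tag behind a pointer before the unchecked add can run.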

"trying to write a language of the first kind, but you also have objects whose fields may be references to boxed primitives, and you want those primitives to be unboxed when they're on the stack but not in other places." Something like that, yes. It sounds as if the worst case is that I merely defer type-checks and object creations, and rarely get to avoid them. "I am not sure how your language could statically know that x + y will use two primitives" - from static analysis, basically where dynamic types "poison" expressions and any remaining "pure" variables have inferred static type. — Karl Knechtel
– Karl Knechtel, Commented Jul 25, 2023 at 19:11
I think this answer gives the right framework to think about the problem with my specific implementation details. It's also motivated me to think about ways to extend the static typing information a bit further - e.g. maybe I can have tuple expressions create some kind of nonce product type instead of moving directly into the world of dynamic objects. However, with the rest of my design it doesn't seem feasible that the dynamic types I have in mind could ever include statically-typed components. — Karl Knechtel
– Karl Knechtel, Commented Jul 25, 2023 at 19:17
(My original draft of the question had a lot more detail of my existing design - but it's too long already, and I wanted it to be useful to others as well.) — Karl Knechtel
– Karl Knechtel, Commented Jul 25, 2023 at 19:19
The "how to know if things are boxed" problem isn't that tough; you duplicate the code path after the decision point (which you can hoist early) and then specialize the branches based on knowing whether the type is boxed or not. It's fiddly and generates a whole lot of object code, but otherwise is not too awful and gives you great performance without explicit type annotations (you would typically mark the boxed path(s) as being colder code). — Donal Fellows
– Donal Fellows, Commented Jul 28, 2023 at 13:29
Branching somewhere may be unavoidable, but in some cases a language might be able to hoist type checks out of a loop, so that one ends up with a loop that will be used if computations can be performed on 32-bit integers, one that will be used if the loop index can use 32-bit integers but other variables need to be 64-bit floats, one that will be used with everything as 64-bit float, and one that will be used otherwise; the first loop may need to branch into the second, but otherwise examination of objects before the start of a loop may only need to be done once. — supercat
– supercat, Commented Aug 13, 2024 at 15:10
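The code duplication and hoisting described in the last two comments might look roughly like the sketch below. The function names are hypothetical, and the overflow fallback from the integer loop into the float loop is omitted because Python integers do not overflow.

```python
def dynamic_add(a, b):
    # Generic path: the operand types are re-examined on every call.
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a + b
    if isinstance(a, str) and isinstance(b, str):
        return a + b
    raise TypeError("unsupported operand types")

def _accumulate_int(start: int, step: int, count: int) -> int:
    # Specialized copy of the loop: operands are already known to be ints.
    total = start
    for _ in range(count):
        total += step
    return total

def _accumulate_f64(start: float, step: float, count: int) -> float:
    # Specialized copy of the loop: operands are already known to be floats.
    total = start
    for _ in range(count):
        total += step
    return total

def _accumulate_dynamic(start, step, count: int):
    # Cold path: one type check per iteration inside dynamic_add.
    total = start
    for _ in range(count):
        total = dynamic_add(total, step)
    return total

def accumulate(start, step, count: int):
    # The decision point is hoisted: types are examined once, before the loop.
    if type(start) is int and type(step) is int:
        return _accumulate_int(start, step, count)
    if type(start) in (int, float) and type(step) in (int, float):
        return _accumulate_f64(float(start), float(step), count)
    return _accumulate_dynamic(start, step, count)
```

The cost is three copies of the loop body; the benefit is that the two specialized copies contain no type checks at all.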

coredump · Accepted Answer · 2023-07-25 15:24:26Z

This is how SBCL (Lisp) works if I understand correctly: in general values are represented with a type tag, so for example the first bit would allow you to know if you are manipulating an immediate value or a boxed one (implicitly there is a pointer). Boxed values are typically instances of classes, bignums, strings, etc.

Inside a function, however, if you statically declare types (or if type inference gathers enough information), the function can use all the bits in the representation. For example, let's implement a sigmoid function for single-floats; given a parameter m (the curve's inflection), fsigmoid returns a closure that computes the associated sigmoid function for x:

(defun fsigmoid (m)
  (check-type m single-float)
  (lambda (x)
    (declare (type single-float x m) 
             (optimize (speed 3) (debug 0) (safety 1)))
    (/ 1f0 (+ 1f0 (exp (* (- m) x m))))))

Outside of fsigmoid and the generated closure, values are tagged. The call to check-type is a dynamic type check. Inside the lambda, since safety is set to 1 and not 0, the function does not trust the caller: there is an implicit dynamic type check for x (and maybe m too, looking at the generated code, not sure why), but the code that follows assumes the type is as declared. All the intermediate expressions are single floats. Note also that if you define an equivalent function but for double-floats, the compiler will warn you about having to box the return value (not an immediate value):

; note: doing float to pointer coercion (cost 13) to "<return value>"

Inside the function, all types are known and any intermediate value can be used with raw representations. Here I'm calling the disassembler on the closure produced by fsigmoid:

(disassemble (fsigmoid 0.5f0))

On x86-64 the result is as follows; it relies on SSE SIMD instructions:

; disassembly for (LAMBDA (X) :IN FSIGMOID)
; Size: 167 bytes. Origin: #x54B57218                         ; (LAMBDA (X) :IN FSIGMOID)
; 18:       488B41F9         MOV RAX, [RCX-7]
; 1C:       3C19             CMP AL, 25
; 1E:       757C             JNE L1
; 20:       66480F6ED0       MOVQ XMM2, RAX
; 25:       0FC6D2FD         SHUFPS XMM2, XMM2, #4r3331
; 29:       0F571590FEFFFF   XORPS XMM2, [RIP-368]            ; [#x54B570C0]
; 30:       F30F59CA         MULSS XMM1, XMM2
; 34:       488B41F9         MOV RAX, [RCX-7]
; 38:       3C19             CMP AL, 25
; 3A:       755D             JNE L0
; 3C:       66480F6ED0       MOVQ XMM2, RAX
; 41:       0FC6D2FD         SHUFPS XMM2, XMM2, #4r3331
; 45:       F30F59D1         MULSS XMM2, XMM1
; 49:       F30F5AD2         CVTSS2SD XMM2, XMM2
; 4D:       488BDC           MOV RBX, RSP
; 50:       4883E4F0         AND RSP, -16
; 54:       660F28C2         MOVAPD XMM0, XMM2
; 58:       B801000000       MOV EAX, 1
; 5D:       E8DE92DAFD       CALL #x52900540                  ; exp
; 62:       488BE3           MOV RSP, RBX
; 65:       660F28C8         MOVAPD XMM1, XMM0
; 69:       F20F5AC9         CVTSD2SS XMM1, XMM1
; 6D:       0FC6C9FC         SHUFPS XMM1, XMM1, #4r3330
; 71:       F30F580D57FEFFFF ADDSS XMM1, [RIP-425]            ; [#x54B570D0]
; 79:       F30F10054FFEFFFF MOVSS XMM0, [RIP-433]            ; [#x54B570D0]
; 81:       F30F5EC1         DIVSS XMM0, XMM1
; 85:       0F28C8           MOVAPS XMM1, XMM0
; 88:       660F7ECA         MOVD EDX, XMM1
; 8C:       48C1E220         SHL RDX, 32
; 90:       80CA19           OR DL, 25
; 93:       488BE5           MOV RSP, RBP
; 96:       F8               CLC
; 97:       5D               POP RBP
; 98:       C3               RET
; 99: L0:   CC4F             INT3 79                          ; OBJECT-NOT-SINGLE-FLOAT-ERROR
; 9B:       00               BYTE #X00                        ; RAX(d)
; 9C: L1:   CC4F             INT3 79                          ; OBJECT-NOT-SINGLE-FLOAT-ERROR
; 9E:       00               BYTE #X00                        ; RAX(d)
; 9F:       CC10             INT3 16                          ; Invalid argument count trap
; A1:       6A10             PUSH 16
; A3:       E85891EAFD       CALL #x52A00400                  ; SB-VM::ALLOC-TRAMP
; A8:       58               POP RAX
; A9:       0C0F             OR AL, 15
; AB:       E96EFEFFFF       JMP #x54B5711E
; B0:       6A20             PUSH 32
; B2:       E84991EAFD       CALL #x52A00400                  ; SB-VM::ALLOC-TRAMP
; B7:       59               POP RCX
; B8:       E9EAFEFFFF       JMP #x54B571A7
; BD:       CC10             INT3 16                          ; Invalid argument count trap

So as long as the typed values are contained inside a function, you can avoid tagging them, but you need to check if any of them is visible from the outside, at which point you have to convert them back to a safe representation.
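The conversion at the boundary can be read off the listing above: the entry checks compare the low byte against 25 (CMP AL, 25), and the exit sequence shifts the float bits left by 32 and ORs in 25 (SHL RDX, 32; OR DL, 25). The sketch below mimics that layout in Python. It is an illustration based only on this disassembly; the function names are made up, and the layout may differ on other SBCL versions or platforms.

```python
import struct

# Tag value taken from the listing (CMP AL, 25 / OR DL, 25).
SINGLE_FLOAT_TAG = 25

def tag_single_float(x: float) -> int:
    """Pack a single-float into a tagged 64-bit word: bits in the high half, tag in the low byte."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # raw IEEE-754 single bits
    return (bits << 32) | SINGLE_FLOAT_TAG

def untag_single_float(word: int) -> float:
    """Check the tag byte, then recover the raw float from the high 32 bits."""
    if word & 0xFF != SINGLE_FLOAT_TAG:
        raise TypeError("object is not a single-float")
    return struct.unpack("<f", struct.pack("<I", word >> 32))[0]
```

Inside the function only the raw 32-bit float is manipulated; the tag is checked once on the way in and reattached once on the way out.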

Is there extant literature on implementing this sort of system?

The entry point for SBCL is The Python compiler for CMU Common Lisp, where Python is the internal name of the compiler, not the language, and CMUCL is the ancestor of SBCL.

Is the optimization effective?

Yes

This is quite encouraging. Where the compiler warning says "(cost 13)", what exactly does this mean, and how is the 13 computed? (I don't think I want my language to emit those kinds of detailed performance warnings, but it would be interesting to understand the analysis/theory in more detail.) I somehow didn't actually consider the option of having the user explicitly invoke dynamic type checking; I could have an assignment syntax that lets a static type be inferred while assigning it from a dynamically-typed parameter - which plays nicely with my plans for variadic parameter support. — Karl Knechtel
– Karl Knechtel, Commented Jul 25, 2023 at 19:21
AFAIK, cost 13 is something written by the compiler developers as a heuristic; maybe it's the number of additional operations required, I'm not sure. I'll have a look later at the source code to be sure. — coredump
– coredump, Commented Jul 25, 2023 at 19:27
