How can I access attached data section in custom script language?

Question

Sorry if the title is not clear; suggestions for a better title are welcome.

For the purpose of [self-]education I am writing a toy scripting language that would compile to bytecode and be executed on a toy VM.

This is not going to be a Turing-complete language: it will only contain simple flow-control structures such as if...then...else, and will otherwise be just a straight sequence of instructions.

I already have pretty much everything working except for one part: I would like my bytecode to have a read-only data section (much like .rodata in native binaries). However, I am stuck on how to reference it from opcodes. I can give the address of the beginning of a data block, but how do I provide the length of the data?

For example, I can have an opcode 0x01 that compares an immediate value 0x0005 with data in the data section at address 0xf002 (ignore endianness for now):

0000 0100050002
...
f002 0005000000

One possible solution I can think of is to prepend each value with the length of its data block (so that 0005000000 becomes 0200050000). But the width of that length field is a problem either way. If it is small (1 byte, as in this example), a data block is limited to 255 bytes, which some may say is enough for everyone. If it is wide enough for any block (e.g. 8 bytes), the length field will in some cases be bigger than the actual data, which is not desirable.

What would be a better approach?

Why does the data size need to be known? Do you need this size to load your bytecode format from a file into the VM, or are you trying to indicate how much data the VM opcode 0x01 should load from memory? Neither seems to be necessary in general.
– amon, Commented Jan 22, 2018 at 15:42
@amon Since the data is already loaded as part of the bytecode, it is the second case. A second example would be printing a constant string (which, being constant, would probably be better stored in .rodata than in .text). In that example I could of course go with C-style strings and print everything up to the NUL character, but there may be other applications for this, I guess. Or am I looking at this from the wrong angle completely? Any more detailed advice beyond "doesn't seem necessary" would be highly appreciated.
– Alexey Kamenskiy, Commented Jan 22, 2018 at 15:48
@amon Or I could do a byte-by-byte comparison (so that the opcode accepts only one byte per execution), but how would I deal with overflow in that case?
– Alexey Kamenskiy, Commented Jan 22, 2018 at 15:56

amon · Accepted Answer · 2018-01-22 16:34:20Z

Plain bytes don't carry any types. Instead, the operations on these bytes impose an interpretation of this data. Most instruction sets have different instructions for performing the same operation but with different-sized types. Alternatively, these instructions always load one “word”, and require the program to mask off any unwanted bits.

For example, you might have this pseudo-assembly:

LOAD WORD [0xf002]
...
LOAD BYTE [0xf002]
...
; at address 0xf002:
ca ff ee 01 23 ...

Then, assuming 32-bit words, the first LOAD instruction would load ca ff ee 01, whereas the second LOAD would get ca only (ignoring endianness). These two load instructions would usually be encoded with different opcodes.
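To make the difference concrete, here is a minimal Python sketch of how a VM might implement the two loads. The function names `load_byte` and `load_word`, the 64 KiB flat memory, and the big-endian byte order are all assumptions made for illustration; the pseudo-assembly above does not prescribe any of them.

```python
# Hypothetical flat memory image; the bytes at 0xf002 match the example above.
memory = bytearray(0x10000)
memory[0xF002:0xF007] = bytes([0xCA, 0xFF, 0xEE, 0x01, 0x23])


def load_byte(mem, addr):
    """LOAD BYTE: the opcode itself says 'read exactly one byte'."""
    return mem[addr]


def load_word(mem, addr):
    """LOAD WORD: read four bytes as one 32-bit value (big-endian assumed)."""
    return int.from_bytes(mem[addr:addr + 4], "big")


print(hex(load_byte(memory, 0xF002)))  # 0xca
print(hex(load_word(memory, 0xF002)))  # 0xcaffee01
```

The same bytes yield different values purely because of which operation reads them; nothing in the data section records a size.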

Instruction sets for hardware don't typically have instructions that can process variable-size data. But this restriction doesn't necessarily exist in VMs. In particular, a VM might offer instructions for operations that would be performed as syscalls or library calls on real hardware. The most flexible approach for dealing with variable-size data is to provide the data and the size of this data separately. That is, the instruction set doesn't need to have a concept of a “string type”.

The size would be provided to string-processing instructions as a separate argument. How to do that depends on your calling conventions. For example, in a stack-based VM:

; print 3 bytes from string at address 0xf002
CONST 0xf002
CONST 3
PRINT
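
As a rough illustration, a stack-based interpreter for just these two opcodes could look like the Python sketch below. The tuple-based instruction encoding, the `run` function, and the ASCII decoding are hypothetical choices for the example, not part of any particular VM.

```python
def run(program, memory):
    """Execute (opcode, operand) tuples and return the list of printed strings."""
    stack = []
    output = []
    for op, arg in program:
        if op == "CONST":
            stack.append(arg)
        elif op == "PRINT":
            length = stack.pop()  # pushed last, so popped first
            addr = stack.pop()
            output.append(bytes(memory[addr:addr + length]).decode("ascii"))
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return output


# Hypothetical data section: the string "hello" stored at 0xf002.
memory = bytearray(0x10000)
memory[0xF002:0xF007] = b"hello"

# print 3 bytes from string at address 0xf002
program = [("CONST", 0xF002), ("CONST", 3), ("PRINT", None)]
print(run(program, memory))  # ['hel']
```

PRINT never inspects the data to find out where it ends; it simply trusts the two values on the stack.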

It is then the responsibility of the program generating these opcodes to know the correct length. This is possible because the compiler generates both the instructions and the data. Instead of placing the length as an immediate value, it could also be loaded from the data. E.g.:

CONST 0xf003      ; string start
LOAD BYTE 0xf002  ; load string length
PRINT

If the string is too long for a one-byte length, the compiler knows this statically and can emit a wider length field, e.g.:

CONST 0xf006
LOAD WORD 0xf002
PRINT
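
On the compiler side, choosing the prefix width can be sketched as follows in Python. The helper `emit_string`, the data section base address 0xf002, the tuple instruction format, and the big-endian length encoding are assumptions for illustration only.

```python
def emit_string(data, text, base=0xF002):
    """Append a length-prefixed string to `data` and return code that prints it.

    `data` is the data section under construction (a bytearray) and `base`
    is the hypothetical address at which it will be loaded.
    """
    payload = text.encode("utf-8")
    prefix_addr = base + len(data)
    # The compiler knows the length now, so it picks the prefix width statically.
    if len(payload) < 256:
        width, kind = 1, "BYTE"
    else:
        width, kind = 4, "WORD"
    data.extend(len(payload).to_bytes(width, "big"))
    data.extend(payload)
    return [
        ("CONST", prefix_addr + width),  # string start, just after the prefix
        ("LOAD", kind, prefix_addr),     # load string length
        ("PRINT",),
    ]


data = bytearray()
short_code = emit_string(data, "hi")
long_code = emit_string(data, "x" * 300)
for instruction in short_code + long_code:
    print(instruction)
```

Short strings pay one byte of overhead and long ones pay four, and the VM needs no logic to tell them apart because the emitted load instruction already encodes the width.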

Raw memory access like this is tedious to work with and is not required in a VM. Especially for toy VMs, operating on an object model is far more convenient.

A notable example of a real-world VM with this approach is the JVM. The .class files are a binary but structured format that describes a class (fields and methods). Upon loading, a class data structure is generated in the VM. This data structure contains a constant pool table from which the code can load values by index.
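
A toy version of the idea might look like the Python sketch below. It is only loosely inspired by the constant pool concept and does not reflect the actual .class file layout or JVM opcodes; `Module`, `LOADCONST`, and `run_pool` are made-up names.

```python
class Module:
    """Hypothetical loaded unit: a constant pool plus code that indexes into it."""

    def __init__(self, constants, code):
        self.constants = list(constants)
        self.code = list(code)


def run_pool(module):
    """Execute (opcode, operand) tuples and return the list of printed strings."""
    stack = []
    output = []
    for op, arg in module.code:
        if op == "LOADCONST":
            # The operand is an index into the pool, not a memory address.
            stack.append(module.constants[arg])
        elif op == "PRINT":
            output.append(str(stack.pop()))
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return output


module = Module(
    constants=["hello", 5],
    code=[("LOADCONST", 0), ("PRINT", None), ("LOADCONST", 1), ("PRINT", None)],
)
print(run_pool(module))  # ['hello', '5']
```

Each constant is a complete object that already knows its own size, so the instructions carry neither addresses nor lengths.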

A lot of data layout and loading problems don't have to be solved within the instruction set. Instead, the assembler and the loader can be extended to make sure that offsets and sizes are correct.

All good points. While writing the VM I completely forgot that these points should be handled by the compiler. Looking at this more, my opcode can actually take more arguments. For example, it can take a pointer to the beginning of the data and its length, and the compiler must ensure that it puts the correct length there.
– Alexey Kamenskiy, Commented Jan 23, 2018 at 4:02
