Introduction
------------

CVM unrolling is a mechanism for speeding up the Portable.NET runtime
engine using some simple JIT techniques.  This document describes what
you need to do to write a CVM unroller for a new CPU architecture.

The process of writing an unroller has been simplified compared to
earlier versions of the runtime engine.  Most of the hard work of
instruction decoding, stack management, register allocation, etc, has
already been done for you, and you just need to supply the CPU specifics.
In particular, you need to provide the following:

    - CPU-specific modifications to the CVM configuration.
    - Lists of rules for allocating registers, using the FPU, etc.
    - Code generation macros for the CPU in question.

If you need help, then send an e-mail message to the "pnet-developers"
mailing list, or contact Rhys Weatherley directly.  To subscribe to the
mailing list, visit "http://www.dotgnu.org".

Modifying the CVM configuration
-------------------------------

The first thing to do is to modify the CVM configuration so that it knows
that you will be using the unroller.  Edit "pnet/engine/cvm_config.h" and
add some detection logic at the top of the file to detect your
architecture.  There is already logic there for x86, ARM, etc.

For example, the detection logic for a 32-bit architecture called "foo"
with little-endian words and word-aligned longs can be defined as follows:

    #if defined(__foo) || defined(__foo__)
        #define CVM_FOO
        #define CVM_LITTLE_ENDIAN
        #define CVM_LONGS_ALIGNED_WORD
        #define CVM_WORDS_AND_PTRS_SAME_SIZE
    #endif

The "CVM_FOO" macro will be used elsewhere to detect the CPU type.

Now, at the bottom of "pnet/engine/cvm_config.h", you need to add some
additional logic which defines the "IL_CVM_DIRECT_UNROLLED" macro.
For example:

    #if defined(IL_CVM_DIRECT) && defined(CVM_FOO) && \
        defined(__GNUC__) && !defined(IL_NO_ASM) && \
        !defined(IL_CVM_PROFILE_CVM_METHODS) && \
        !defined(IL_CVM_PROFILE_CVM_VAR_USAGE) && \
        defined(IL_CONFIG_UNROLL)
    #define IL_CVM_DIRECT_UNROLLED
    #endif

Finally, we need to add some logic to the top of "pnet/engine/cvm.c" to
perform manual register assignment.  It will look something like this:

    #elif defined(CVM_FOO) && defined(__GNUC__) && !defined(IL_NO_ASM)

        #define REGISTER_ASM_PC(x)      register x asm ("r1")
        #define REGISTER_ASM_STACK(x)   register x asm ("r2")
        #define REGISTER_ASM_FRAME(x)   register x asm ("r3")

The values "r1", "r2", and "r3" will probably be different for your CPU.
Look up your system's documentation to find three registers that are
normally used for local variables and which are saved across function
calls.  These three manually-assigned registers will hold the important
state variables "pc", "stacktop", and "frame".

If you don't know which registers to choose, then ask on the
pnet-developers mailing list.

If your compiler cannot assign registers manually, then there are other
ways for the unroller to get the information, but they are trickier to
set up.  Contact pnet-developers for assistance.

You should now be able to recompile the runtime engine.  The compiler
will give you an error if the registers you chose are unsuitable.  The
error might be strange, talking about "register spills".  If you get such
an error, go back and try different registers.

At this point, the engine is set up for unrolling but it isn't actually
doing any unrolling yet.  Re-test the engine - you will probably already
see a small performance improvement due to the manual register assignment.

Writing the CPU-specific rules
------------------------------

The next step is to make a file called "pnet/engine/md_foo.h".  This will
contain rules that tell the unroller how to assign registers and generate
code for your architecture.
If you need some extra helper macros, then put them into the file
"pnet/engine/md_foo_macros.h".  If some of your macros are complicated,
you may want to convert them into functions.  Put these functions into
"pnet/engine/md_foo.c" and update the "Makefile.am" file to include it.

We recommend starting with the "md_arm.h" file as a template, since ARM
is the simplest platform out of those that are currently supported.

If you want to make things easier on yourself, don't worry about
floating-point on the first pass - just get the integer operations
working.  ARM is a good choice here because its unroller doesn't do
floating-point.

The rest of this section describes the rule definitions in "md_foo.h":

MD_REG_

    These macros define the word registers that are used for temporarily
    storing values during integer computations.

    You can use up to 16 registers for temporary work values.  Even if
    your CPU has more than 16 registers, it is highly unlikely that the
    unroller will use more than 6 or 7 registers at any one time.  You
    can experiment with greater numbers of registers later if you like.

    The registers you choose must not be used for any other purpose in
    the system.  For example, you probably cannot use the CPU's stack
    pointer register as a temporary register.

    The order of MD_REG_ registers determines the order in which the
    unroller will allocate them to temporary values.  Usually the order
    will be unimportant.  The x86 CPU is an exception - more efficient
    code can be obtained for division and shift operations if the order
    starts with EAX, ECX, and then EDX.

MD_FREG_

    These macros define the floating-point registers that are used
    during floating-point computations.  If your architecture doesn't
    have floating-point operations, or you don't wish to do
    floating-point at this time, then set all of them to -1.

MD_FP_STACK_SIZE

    Some CPUs (e.g. x86) organise their floating-point registers into a
    stack.  If this applies to you, then set this macro to the maximum
    height of the floating-point stack.
    Otherwise set this macro to zero.

MD_REG_PC
MD_REG_STACK
MD_REG_FRAME

    The special registers that contain the CVM interpreter's "pc",
    "stacktop", and "frame" values.  These must be the same as the
    registers you chose when configuring the engine earlier.

    Of these three registers, MD_REG_STACK and MD_REG_FRAME have a fixed
    meaning throughout the unrolled code, but MD_REG_PC can be reused as
    a temporary work register (i.e. one of the MD_REG_ values).

MD_STATE_ALREADY_IN_REGS

    This will normally be set to 1 unless you have the misfortune of
    using a compiler without the ability to manually assign registers.
    Contact pnet-developers in this case for assistance.

MD_REGS_TO_BE_SAVED

    This macro is a bitmask, with each bit corresponding to one of the
    registers in the MD_REG_ list.  Use this if your architecture assigns
    special meaning to certain registers, but you wish to make use of
    them for temporary values anyway.

MD_SPECIAL_REGS_TO_BE_SAVED

    This is only useful if MD_STATE_ALREADY_IN_REGS is zero.  It should
    normally be set to zero.

MD_HAS_INT_DIVISION

    Set this to 1 if your CPU has integer division operations.  Some
    CPUs (e.g. ARM) don't have a simple division operator, and so the
    unroller should ignore integer division in this case.

    Note: you don't need to do anything special to handle division by
    zero or arithmetic overflow (MININT / -1).  The unroller will check
    for these cases before performing the division.

md_inst_ptr

    This is a typedef that defines the type of the instruction word.
    On CPUs with byte-aligned instructions, this will be "unsigned char".
    On word-aligned CPUs, this will typically be "unsigned int", or
    perhaps "unsigned long" on 64-bit architectures.

Writing the code generation macros
----------------------------------

The rest of the "md_foo.h" file consists of macros for generating code
for the various instructions used by the unroller.

md_push_reg(inst, reg)
md_pop_reg(inst, reg)

    Push or pop registers from the system stack.
    The system stack is used to save registers before they are reused for
    other purposes.

md_discard_freg(inst, reg)

    Discard the contents of a floating-point register.  If the FPU is
    organised as a stack (MD_FP_STACK_SIZE != 0), then this will normally
    pop the top-most item from the stack.

md_load_const_32(inst, reg, value)

    Load a 32-bit constant into a register, sign-extending if the
    register is 64 bits in size.

md_load_const_native(inst, reg, value)

    Load a native (32-bit or 64-bit) constant into a register.  This will
    be the same as "md_load_const_32" on 32-bit platforms.

md_load_const_float_32(inst, reg, value)
md_load_const_float_64(inst, reg, value)

    Load floating point constants from memory at the address "value" into
    the floating point register "reg".  If the system does not use
    floating point registers, then "reg" should be ignored.

md_load_zero_32(inst, reg)
md_load_zero_native(inst, reg)

    Load the 32-bit or native value zero into a register.  This is
    usually more efficient than using "md_load_const_32(inst, reg, 0)"
    or "md_load_const_native(inst, reg, 0)".

md_load_membase_word_32(inst, reg, basereg, offset)

    Load the contents of the 32-bit memory location "basereg + offset"
    into the register "reg".  On 64-bit systems, this will sign-extend.

    Note: "offset" could be anything.  It isn't limited to any particular
    range.  Some CPUs cannot do a direct load with an arbitrary offset
    in one instruction, and need to load the offset into a scratch
    register first.

md_load_membase_word_native(inst, reg, basereg, offset)

    Load the contents of the native-sized memory location
    "basereg + offset" into the register "reg".  On 32-bit systems, this
    will be identical to "md_load_membase_word_32".

md_load_membase_byte(inst, reg, basereg, offset)
md_load_membase_sbyte(inst, reg, basereg, offset)
md_load_membase_short(inst, reg, basereg, offset)
md_load_membase_ushort(inst, reg, basereg, offset)

    Load 8-bit or 16-bit values from "basereg + offset".
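
    The note about arbitrary offsets in the "md_load_membase_" macros
    can be sketched as follows.  This is only an illustration for the
    fictional "foo" CPU, not code from the engine: the opcodes, the
    16-bit signed offset field, the scratch register FOO_SCRATCH, and the
    "foo_" function names are all assumptions invented for the example.
    Real macros update "inst" in place; a function that returns the new
    pointer is used here to keep the sketch short.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t *md_inst_ptr;  /* word-aligned instructions (assumed) */

/* Hypothetical opcodes and scratch register for the fictional "foo" CPU */
#define FOO_OP_LOAD   0x10u     /* reg = *(basereg + imm16)          */
#define FOO_OP_LOADX  0x11u     /* reg = *(basereg + indexreg)       */
#define FOO_OP_SETLO  0x20u     /* reg = imm16, upper half cleared   */
#define FOO_OP_SETHI  0x21u     /* reg |= imm16 << 16                */
#define FOO_SCRATCH   15        /* never handed out as an MD_REG_    */

/* Pack an invented 8/4/4/16-bit instruction layout into one word */
static uint32_t foo_encode(uint32_t op, int r1, int r2, uint32_t imm)
{
    return (op << 24) | ((uint32_t)r1 << 20) |
           ((uint32_t)r2 << 16) | (imm & 0xFFFFu);
}

/* Load the 32-bit word at "basereg + offset" into "reg" */
md_inst_ptr foo_load_membase_word_32(md_inst_ptr inst, int reg,
                                     int basereg, int32_t offset)
{
    /* The scratch register must not alias either operand */
    assert(reg != FOO_SCRATCH && basereg != FOO_SCRATCH);

    if (offset >= -32768 && offset <= 32767) {
        /* Short form: the offset fits directly in the instruction */
        *inst++ = foo_encode(FOO_OP_LOAD, reg, basereg, (uint32_t)offset);
    } else {
        /* Long form: build the offset in the scratch register first */
        uint32_t value = (uint32_t)offset;
        *inst++ = foo_encode(FOO_OP_SETLO, FOO_SCRATCH, 0, value);
        *inst++ = foo_encode(FOO_OP_SETHI, FOO_SCRATCH, 0, value >> 16);
        *inst++ = foo_encode(FOO_OP_LOADX, reg, basereg, FOO_SCRATCH);
    }
    return inst;
}
```

    Whichever form is emitted, the caller sees the same result, so the
    unroller never needs to know about the offset limit.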
md_load_membase_float_32(inst, reg, basereg, offset)
md_load_membase_float_64(inst, reg, basereg, offset)
md_load_membase_float_native(inst, reg, basereg, offset)

    Load floating-point values into a floating-point register.  The
    values are always extended to the "native" floating-point size.
    If the FPU is organised as a stack, this will load the value onto
    the top of the stack and "reg" is ignored.

md_store_membase_word_32(inst, reg, basereg, offset)

    Store the contents of "reg" to the address "basereg + offset" as a
    32-bit value.  On 64-bit platforms, the most significant bits are
    discarded.

md_store_membase_word_native(inst, reg, basereg, offset)

    Store the contents of "reg" to the address "basereg + offset" as a
    native-sized word value.

md_store_membase_byte(inst, reg, basereg, offset)
md_store_membase_sbyte(inst, reg, basereg, offset)
md_store_membase_short(inst, reg, basereg, offset)
md_store_membase_ushort(inst, reg, basereg, offset)

    Store 8-bit or 16-bit values from "reg" to "basereg + offset".

    It is OK if the value in "reg" is destroyed during the store because
    it will be immediately forgotten by the unroller afterwards.  (ARM
    destroys 16-bit values in the process of storing them).

md_store_membase_float_32(inst, reg, basereg, offset)
md_store_membase_float_64(inst, reg, basereg, offset)
md_store_membase_float_native(inst, reg, basereg, offset)

    Store floating-point values from "reg" to "basereg + offset".
    If the FPU is stack based, then this will always store the top-most
    value on the stack, and ignore "reg".

md_add_reg_imm(inst, reg, imm)
md_sub_reg_imm(inst, reg, imm)

    Add or subtract an immediate value to or from a word register.
    The immediate value could be anything - it is not limited to any
    particular range of values.  The register is assumed to be
    native-sized (i.e. it contains a pointer).
md_add_reg_reg_word_32(inst, reg1, reg2)
md_sub_reg_reg_word_32(inst, reg1, reg2)
md_mul_reg_reg_word_32(inst, reg1, reg2)
md_div_reg_reg_word_32(inst, reg1, reg2)
md_udiv_reg_reg_word_32(inst, reg1, reg2)
md_rem_reg_reg_word_32(inst, reg1, reg2)
md_urem_reg_reg_word_32(inst, reg1, reg2)
md_neg_reg_word_32(inst, reg)
md_and_reg_reg_word_32(inst, reg1, reg2)
md_or_reg_reg_word_32(inst, reg1, reg2)
md_xor_reg_reg_word_32(inst, reg1, reg2)
md_not_reg_word_32(inst, reg)
md_shl_reg_reg_word_32(inst, reg1, reg2)
md_shr_reg_reg_word_32(inst, reg1, reg2)
md_ushr_reg_reg_word_32(inst, reg1, reg2)

    Perform arithmetic operations on 32-bit integer values.  If the CPU
    is 64-bit, then most of these can be performed as 64-bit operations.
    Some (e.g. division and right shifts) require the operands to be
    truncated to 32 bits first.

    It is expected that the code generator will be able to handle any
    combination of registers.  If an invalid combination is provided,
    then the code generator must save registers on the system stack to
    make room, perform the operation, and then restore everything to
    its original state.

md_add_reg_reg_word_native(inst, reg1, reg2)
md_sub_reg_reg_word_native(inst, reg1, reg2)
md_mul_reg_reg_word_native(inst, reg1, reg2)
md_div_reg_reg_word_native(inst, reg1, reg2)
md_udiv_reg_reg_word_native(inst, reg1, reg2)
md_rem_reg_reg_word_native(inst, reg1, reg2)
md_urem_reg_reg_word_native(inst, reg1, reg2)
md_neg_reg_word_native(inst, reg)
md_and_reg_reg_word_native(inst, reg1, reg2)
md_or_reg_reg_word_native(inst, reg1, reg2)
md_xor_reg_reg_word_native(inst, reg1, reg2)
md_not_reg_word_native(inst, reg)
md_shl_reg_reg_word_native(inst, reg1, reg2)
md_shr_reg_reg_word_native(inst, reg1, reg2)
md_ushr_reg_reg_word_native(inst, reg1, reg2)

    Similar to above, except that these macros work on native-sized
    values.  On 32-bit platforms, they will be identical to the above
    macros.
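
    The remark about truncating operands for the "_word_32" macros on
    64-bit CPUs can be checked with a small model in plain C.  The
    "model_" function names are hypothetical and exist only for this
    illustration.  A 64-bit register is modelled as a uint64_t whose
    upper half may still hold stale bits from an earlier operation.

```c
#include <assert.h>
#include <stdint.h>

/* The 32-bit result is whatever ends up in the low half of the register */
static uint32_t low32(uint64_t reg)
{
    return (uint32_t)reg;
}

/* Addition: stale upper bits never reach the low 32 bits, so a plain
   64-bit add gives the correct 32-bit answer */
uint32_t model_add_32(uint64_t reg1, uint64_t reg2)
{
    return low32(reg1 + reg2);
}

/* Unsigned right shift done as a 64-bit operation WITHOUT truncating
   first: stale upper bits are shifted down into the result */
uint32_t model_ushr_32_wrong(uint64_t reg1, unsigned count)
{
    assert(count < 32);     /* the model only covers in-range counts */
    return low32(reg1 >> count);
}

/* Unsigned right shift with the operand truncated to 32 bits first */
uint32_t model_ushr_32(uint64_t reg1, unsigned count)
{
    assert(count < 32);
    return low32((uint64_t)low32(reg1) >> count);
}
```

    With a register holding 0xDEADBEEF00000010 (low half 16), the add
    and the truncating shift behave as 32-bit code expects, while the
    untruncated shift by 4 yields 0xF0000001 instead of 1.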
md_add_reg_reg_float(inst, reg1, reg2)
md_sub_reg_reg_float(inst, reg1, reg2)
md_mul_reg_reg_float(inst, reg1, reg2)
md_div_reg_reg_float(inst, reg1, reg2)
md_rem_reg_reg_float(inst, reg1, reg2, used)
md_neg_reg_float(inst, reg)

    Perform arithmetic operations on floating-point values.  If the FPU
    is organised as a stack, then the register arguments are ignored and
    the values at the top of the stack are used.

    The "used" parameter on remainder is used on x86 platforms only.
    It can normally be ignored on other platforms.

md_freg_swap(inst)

    Swap the two top-most values on the floating-point register stack.
    Not used if the FPU is not stack-based.

md_cmp_reg_reg_float(inst, dreg, sreg1, sreg2, lessop)

    Compare the two floating-point values in "sreg1" and "sreg2".
    Set the word register "dreg" to -1, 0, or 1 based on the result.
    If "lessop" is non-zero, then use the NaN rules for "fcmpl".
    Otherwise use the NaN rules for "fcmpg".

md_reg_to_byte(inst, reg)

    Truncate the contents of a register to 8 bits and zero-extend.

md_reg_to_sbyte(inst, reg)

    Truncate the contents of a register to 8 bits and sign-extend.

md_reg_to_short(inst, reg)

    Truncate the contents of a register to 16 bits and sign-extend.

md_reg_to_ushort(inst, reg)

    Truncate the contents of a register to 16 bits and zero-extend.

md_reg_to_word_32(inst, reg)

    Convert a register from a native word into a 32-bit word.
    This will not do anything on 32-bit platforms.

md_reg_to_word_native(inst, reg)

    Convert a register from a 32-bit word into a native word by
    sign-extending.  This will not do anything on 32-bit platforms.

md_reg_to_word_native_un(inst, reg)

    Convert a register from a 32-bit word into a native word by
    zero-extending.  This will not do anything on 32-bit platforms.

md_reg_to_float_32(inst, reg)

    Truncate the contents of floating point register "reg" to a 32-bit
    floating point value and then re-extend to the native size.
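
    As a cross-check while implementing the "md_reg_to_" conversion
    macros, their intended results can be written down in plain C.  This
    is a sketch under stated assumptions: the "model_" names are
    hypothetical, the signedness follows the usual meaning of the type
    names (byte and ushort unsigned, sbyte and short signed), a 32-bit
    word register is modelled as int32_t, and the native floating-point
    size is assumed to be "double".

```c
#include <stdint.h>

/* Truncate to 8 bits and zero-extend */
int32_t model_reg_to_byte(int32_t reg)
{
    return reg & 0xFF;
}

/* Truncate to 8 bits and sign-extend (flip the sign bit, then
   subtract it back, which avoids implementation-defined casts) */
int32_t model_reg_to_sbyte(int32_t reg)
{
    return ((reg & 0xFF) ^ 0x80) - 0x80;
}

/* Truncate to 16 bits and sign-extend */
int32_t model_reg_to_short(int32_t reg)
{
    return ((reg & 0xFFFF) ^ 0x8000) - 0x8000;
}

/* Truncate to 16 bits and zero-extend */
int32_t model_reg_to_ushort(int32_t reg)
{
    return reg & 0xFFFF;
}

/* 32-bit word to 64-bit native word, sign-extending */
int64_t model_reg_to_word_native(int32_t reg)
{
    return (int64_t)reg;
}

/* 32-bit word to 64-bit native word, zero-extending */
int64_t model_reg_to_word_native_un(int32_t reg)
{
    return (int64_t)(uint32_t)reg;
}

/* Round to 32-bit float precision, then widen back to native size */
double model_reg_to_float_32(double reg)
{
    return (double)(float)reg;
}
```

    Running generated code for a few edge values (0x1FF, 0x18000, -1)
    and comparing against such a model is a cheap way to catch a
    swapped sign-extension or zero-extension.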
md_reg_to_float_64(inst, reg)

    Truncate the contents of floating point register "reg" to a 64-bit
    floating point value and then re-extend to the native size.

md_jump_to_cvm(inst, pc, label)

    This macro is used to jump back into the CVM interpreter at the end
    of a block.  The macro should perform the following steps:

        - Set MD_REG_PC to the value "pc".
        - If "label" is NULL, then jump to "*pc".
        - Otherwise, jump to "label".

md_switch(inst, reg, table)

    Perform a switch.  The value in "reg" is used to index into "table"
    to find the next program counter value to jump to.  The code can
    assume that the value in "reg" is within range.

md_clear_membase_start(inst)
md_clear_membase(inst, reg, offset)

    Used to clear a portion of the stack when allocating local variables
    (i.e. the "mk_local_" instructions).  "md_clear_membase" clears a
    CVM stack position at "reg + offset".  Some CPUs (e.g. ARM) need to
    initialize temporary registers before performing the clear.  This
    can be done in "md_clear_membase_start".

md_lea_membase(inst, reg, basereg, offset)

    Load the effective address "basereg + offset" into "reg".

md_mov_reg_reg(inst, dreg, sreg)

    Move the value in "sreg" into "dreg".

md_seteq_reg(inst, reg)
md_setne_reg(inst, reg)
md_setlt_reg(inst, reg)
md_setle_reg(inst, reg)
md_setgt_reg(inst, reg)
md_setge_reg(inst, reg)

    Check the condition codes and set a register to 0 or 1 based on them.

md_cmp_reg_reg_word_32(inst, reg1, reg2)
md_ucmp_reg_reg_word_32(inst, reg1, reg2)

    Compare two 32-bit registers and set "reg1" to -1, 0, or 1 depending
    upon the comparison.

md_cmp_cc_reg_reg_word_32(inst, cond, reg1, reg2)
md_cmp_cc_reg_reg_word_native(inst, cond, reg1, reg2)

    Compare two registers and set the condition codes based on the
    result.  The "cond" argument hints to the code generator as to what
    kind of condition will be tested for in a subsequent branch
    instruction.  This is needed on some CPUs (e.g. PPC and ia64) to
    modify the manner in which the condition codes are set.
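
    The three-way results expected from "md_cmp_reg_reg_word_32" and
    "md_ucmp_reg_reg_word_32" can be modelled in plain C as shown here.
    The "model_" names are hypothetical, and the sketch assumes that the
    "u" variant differs only in treating both operands as unsigned.

```c
#include <stdint.h>

/* Signed compare: -1 if reg1 < reg2, 0 if equal, 1 if reg1 > reg2 */
int32_t model_cmp_word_32(int32_t reg1, int32_t reg2)
{
    return (reg1 > reg2) - (reg1 < reg2);
}

/* Unsigned compare: same result convention, but the same bit patterns
   are interpreted as unsigned values */
int32_t model_ucmp_word_32(int32_t reg1, int32_t reg2)
{
    uint32_t a = (uint32_t)reg1;
    uint32_t b = (uint32_t)reg2;
    return (a > b) - (a < b);
}
```

    The operand pair (-1, 1) is a useful test case because the two
    variants must disagree on it: -1 for the signed compare and 1 for
    the unsigned one.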
md_reg_is_null(inst, reg)
md_reg_is_zero(inst, reg)

    Set the condition codes based on comparing "reg" against NULL or
    zero.

md_cmp_reg_imm_word_32(inst, cond, reg, imm)

    Set the condition codes based on comparing "reg" against "imm".
    The "cond" argument hints to the code generator as to what kind of
    condition will be tested for in a subsequent branch instruction.
    This is needed on some CPUs (e.g. PPC and ia64) to modify the manner
    in which the condition codes are set.

md_branch_eq(inst)
md_branch_ne(inst)
md_branch_lt(inst)
md_branch_le(inst)
md_branch_gt(inst)
md_branch_ge(inst)
md_branch_lt_un(inst)
md_branch_le_un(inst)
md_branch_gt_un(inst)
md_branch_ge_un(inst)

    Output a place-holder for a branch instruction to branch to a
    destination based on a particular condition code.  The actual
    destination will be inserted with a subsequent "md_patch" call.

md_branch_cc(inst, cond)

    Output a place-holder for a branch instruction, using a numerically
    defined condition code.  The constants MD_CC_EQ, MD_CC_NE, etc
    define the condition code values.

md_patch(patch, inst)

    Patch a branch place-holder at "patch" to jump to "inst".

md_bounds_check(inst, reg1, reg2)

    Check an array bounds value for a single-dimensional array.
    The first register, "reg1", points to the array header.  The second
    register, "reg2", contains the index value to check.  The condition
    codes are set based on whether the index is less than the array's
    length.

    On some platforms, this macro might modify "reg1" to skip over the
    array header so that "reg1" is pointing at the start of the array
    contents for subsequent operations.
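
    The place-holder and patch protocol used by the "md_branch_" macros
    and "md_patch" can be sketched as follows.  The encoding is invented
    for the fictional "foo" CPU (an 8-bit opcode plus a 24-bit signed
    displacement, counted in instruction words from the branch itself),
    and the "foo_" function names are hypothetical.  Only the overall
    protocol is taken from the descriptions above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t *md_inst_ptr;  /* word-aligned instructions (assumed) */

#define FOO_OP_BRANCH_EQ  0x30u /* hypothetical "branch if equal" opcode */

/* Emit a branch whose displacement field is still zero.  The caller
   remembers where the place-holder is so it can be patched later. */
md_inst_ptr foo_branch_eq(md_inst_ptr inst)
{
    *inst++ = FOO_OP_BRANCH_EQ << 24;
    return inst;
}

/* Patch the place-holder at "patch" so that it jumps to "inst" */
void foo_patch(md_inst_ptr patch, md_inst_ptr inst)
{
    ptrdiff_t disp = inst - patch;

    /* The displacement must fit in the 24-bit signed field */
    assert(disp >= -0x800000 && disp <= 0x7FFFFF);

    /* Keep the opcode, fill in the displacement */
    *patch = (*patch & 0xFF000000u) | ((uint32_t)disp & 0x00FFFFFFu);
}
```

    A forward branch is then generated by saving the current value of
    "inst", emitting the place-holder, generating the code to be skipped,
    and finally patching the saved position to point at the new value of
    "inst".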
md_load_memindex_word_32(inst, reg, basereg, indexreg, disp)
md_load_memindex_word_native(inst, reg, basereg, indexreg, disp)
md_load_memindex_word_byte(inst, reg, basereg, indexreg, disp)
md_load_memindex_word_sbyte(inst, reg, basereg, indexreg, disp)
md_load_memindex_word_short(inst, reg, basereg, indexreg, disp)
md_load_memindex_word_ushort(inst, reg, basereg, indexreg, disp)

    Load a value from an indexed array into "reg".  "basereg" points to
    the start of the array, and "indexreg" is the index value.  "disp"
    is the size of the array header, so that it can be skipped if
    "md_bounds_check" didn't already do so.

    It is the responsibility of these macros to multiply "indexreg" by
    the correct amount to get the element's address.

md_store_memindex_word_32(inst, reg, basereg, indexreg, disp)
md_store_memindex_word_native(inst, reg, basereg, indexreg, disp)
md_store_memindex_word_byte(inst, reg, basereg, indexreg, disp)
md_store_memindex_word_sbyte(inst, reg, basereg, indexreg, disp)
md_store_memindex_word_short(inst, reg, basereg, indexreg, disp)
md_store_memindex_word_ushort(inst, reg, basereg, indexreg, disp)

    Store a value in "reg" to an indexed array.  "basereg" points to the
    start of the array, and "indexreg" is the index value.  "disp" is
    the size of the array header, so that it can be skipped if
    "md_bounds_check" didn't already do so.

    It is the responsibility of these macros to multiply "indexreg" by
    the correct amount to get the element's address.

Debugging
---------

Because debugging the unroller can be difficult, you may want to attack
the problem in stages.  The nice thing about the unroller is that the
interpreter will automatically handle anything that you haven't handled.

As described earlier, don't bother with floating-point on the first pass.
You can also temporarily remove entire instruction categories by
commenting out the #include directives for "unroll_xxx.c" at the bottom
of "unroll.c".
For testing, we recommend running "make check" in pnetlib regularly, and
also running the PNetMark benchmark.  If either of these causes the
engine to crash, or to fail a test that works in the regular engine,
then you have probably done something wrong.

You can return to the regular engine at any time by commenting out
"IL_CVM_DIRECT_UNROLLED" in the "pnet/engine/cvm_config.h" file.

Isolating what went wrong can be difficult.  Try commenting out sections
of "unroll_xxx.c" until the problem disappears.  Whatever you commented
out last might have something to do with the problem.

While you can breakpoint the unroller while it is converting code, it
isn't easy to put breakpoints in the code that it outputs.  A suggested
strategy is to emit the native trap/break instruction and use gdb to
catch it.  For instance, see the following example from the PowerPC
unroller (LOR instruction):

    Program received signal SIGTRAP, Trace/breakpoint trap.
    0x3003b158 in ?? ()
    (gdb) x/4i $pc
    0x3003b158:     trap
    0x3003b15c:     or      r9,r9,r10
    0x3003b160:     or      r11,r11,r8
    0x3003b164:     stw     r9,-32(r19)
    (gdb) set $pc = $pc + 4
    (gdb) ni

Gdb provides an excellent inline disassembler which is invaluable when
writing an unroller.

For a bigger picture of the unrolled code, you can also uncomment
"UNROLL_DEBUG" in "unroll.c".  This will cause the unroller to
disassemble the unrolled code before it executes methods.  By staring at
this output, you should hopefully be able to figure out which
instructions are being unrolled incorrectly.

Another problem is the CPU cache.  Most CPUs need to flush the data
cache prior to executing the unrolled code.  Check the
"pnet/support/clflush.c" file to ensure that cache flushing on your
architecture is supported.

If still in doubt, don't hesitate to ask for help on "pnet-developers".