Anna Voronina

Posted on Jun 2

Creating Sega Genesis emulator in C++

#gamedev #cpp #emulator #programming

This article covers the development of a Sega Genesis 16-bit console emulator in C++. A lot of exciting stuff awaits you ahead: emulating the Motorola 68000 CPU, reverse engineering games, OpenGL graphics, shaders, and much more—all using modern C++. The article is packed with images, so even just browsing through them should be fun.

The design of Sega Genesis

The architecture of Sega Genesis (source)

Here's a description of each component in the diagram, listed in random order:

ROM is the cartridge data; its maximum size is 4MB.
VDP (Video Display Processor) is a video controller chip, an ASIC developed by Sega. It has 64KB of VRAM.
The YM2612 is a six-channel FM synthesizer from Yamaha.
PSG Sound is an ASIC from Texas Instruments (SN76489) that synthesizes sound with three square-wave channels. It is required for compatibility with the 8-bit Sega Master System.
The Motorola 68000 processor is a CPU that handles most of the work. It has 64KB of RAM.
Zilog Z80 is an audio co-processor. Its job is to write commands to the YM2612 registers at the right time. It has 8KB of RAM.
Input/Output are controllers. First, there was a "three-button gamepad", then a "six-button" one was added, followed by a dozen rarer devices.

The core component is the Motorola 68000 (m68k), which has 24-bit addressing (0x000000–0xFFFFFF). This processor handles any memory access via a bus (labeled 68000 BUS in the diagram) that routes the address to different devices. You can see the address mapping here.

This article covers the emulation of all components except Z80 and the sound component.

Motorola 68000 emulation

Some facts about m68k

The m68k processor used to be popular: for decades, Macintosh, Amiga, and Atari computers leveraged it, as well as the Sega Genesis console and other devices.

The processor architecture is already partly 32-bit, but with limitations.

There are 16 32-bit registers in total (and one 16-bit register). Although the address registers (A0–A7) are 32-bit, only the 24 low-order bits are used for the address. In other words, 16 megabytes of memory space can be addressed.
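To illustrate the 24-bit limit, here is a minimal sketch. The helper name to_bus_address and the constant kAddressMask are my own illustration, not taken from the emulator: they simply drop the upper byte of a 32-bit address before it would reach the bus.

```cpp
#include <cassert>
#include <cstdint>

using Long = std::uint32_t;

// Hypothetical constant: only the 24 low-order bits reach the bus
constexpr Long kAddressMask = 0x00FFFFFF;

// Hypothetical helper: converts a 32-bit register value to a bus address
constexpr Long to_bus_address(Long address) {
  return address & kAddressMask;
}

// The upper byte is ignored, so these two values name the same location
static_assert(to_bus_address(0xFF000000) == to_bus_address(0x00000000));
static_assert(to_bus_address(0x12FF0008) == 0x00FF0008);
```

With such a mask, any two register values that differ only in the top eight bits address the same byte.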

The processor supports basic virtualization features for multitasking systems. The access to the A7 register is actually an access to either the user stack pointer (USP) or the supervisor stack pointer (SSP), depending on the status register flag.

Unlike (almost all) modern architectures, m68k adheres to the big-endian byte order. The address and size of an instruction are always divisible by two. With a few exceptions, we can read memory only from an address that is divisible by two as well. Floating-point arithmetic is not supported.
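As a rough illustration of these two rules, the sketch below decodes big-endian values byte by byte and checks address alignment. The function names (read_be16, read_be32, is_aligned) are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstdint>

// Reads a 16-bit big-endian value: the most significant byte comes first
constexpr std::uint16_t read_be16(const std::uint8_t* p) {
  return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Reads a 32-bit big-endian value
constexpr std::uint32_t read_be32(const std::uint8_t* p) {
  return (static_cast<std::uint32_t>(p[0]) << 24) |
         (static_cast<std::uint32_t>(p[1]) << 16) |
         (static_cast<std::uint32_t>(p[2]) << 8) |
         static_cast<std::uint32_t>(p[3]);
}

// Word and long accesses must start at an even address
constexpr bool is_aligned(std::uint32_t address) {
  return (address & 1u) == 0;
}
```

Decoding byte by byte works the same way on little-endian and big-endian hosts, at the cost of a few more instructions than a byte swap.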

The m68k instruction table (source)

Registers of m68k

Let's create basic types:

using Byte = uint8_t;
using Word = uint16_t;
using Long = uint32_t;
using LongLong = uint64_t;

using AddressType = Long;

A class for working with big-endian:

The BigEndian class

Since the m68k architecture adheres to big-endian order, changing the byte order is often necessary (assuming our computer uses x86_64 or ARM, which use little-endian by default). To do this, let's create a type:

template<typename T>
class BigEndian {
public:
  T get() const {
    return std::byteswap(value_);
  }

private:
  T value_;
};

Then, if we need to retrieve a Word value from an array, we can do this:

const auto* array_ptr =
  reinterpret_cast<const BigEndian<Word>*>(data_ptr);
// ...
x -= array_ptr[index].get();

Since the processor is constantly writing something to or reading from memory, appropriate entities are required (i.e., what to write and where to write it).

The DataView and MutableDataView classes

The best way to do this is to use std::span, which is a pointer to data and its size. For the immutable version, it's still a good idea to create a helper that calls .as<Word>() and so on:

using MutableDataView = std::span<Byte>;

class DataView : public std::span<const Byte> {
public:
  using Base = std::span<const Byte>;
  using Base::Base;

  template <std::integral T>
  T as() const {
    return std::byteswap(*reinterpret_cast<const T*>(data()));
  }
};

Let's create a type for the m68k registers. An object of this type fully describes the state of the CPU, independent of memory:

The structure of Registers

struct Registers {
  /**
   * Data registers D0 - D7
   */
  std::array<Long, 8> d;

  /**
   * Address registers A0 - A6
   */
  std::array<Long, 7> a;

  /**
   * User stack pointer
   */
  Long usp;

  /**
   * Supervisor stack pointer
   */
  Long ssp;

  /**
   * Program counter
   */
  Long pc;

  /**
   * Status register
   */
  struct {
    // lower byte
    bool carry : 1;
    bool overflow : 1;
    bool zero : 1;
    bool negative : 1;
    bool extend : 1;
    bool : 3;

    // upper byte
    uint8_t interrupt_mask : 3;
    bool : 1;
    bool master_switch : 1;
    bool supervisor : 1;
    uint8_t trace : 2;

    decltype(auto) operator=(const Word& word) {
      *reinterpret_cast<Word*>(this) = word;
      return *this;
    }

    operator Word() const {
      return *reinterpret_cast<const Word*>(this);
    }
  } sr;
  static_assert(sizeof(sr) == sizeof(Word));

  /**
   * The stack pointer register depends on the supervisor flag
   */
  Long& stack_ptr() {
    return sr.supervisor ? ssp : usp;
  }
};
static_assert(sizeof(Registers) == 76);

This 76-byte structure fully describes the CPU state.
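Note that the bit-field layout used for the status register is implementation-defined, so the reinterpret_cast conversion relies on how the compiler packs the fields. As a hedged alternative, here is a sketch that packs and unpacks the 68000 status word with explicit masks. StatusFlags, pack_sr, and unpack_sr are hypothetical names for this example, and only the flags of the original 68000 are modeled.

```cpp
#include <cassert>
#include <cstdint>

using Word = std::uint16_t;

// Hypothetical plain-struct version of the status register
struct StatusFlags {
  bool carry = false;               // bit 0
  bool overflow = false;            // bit 1
  bool zero = false;                // bit 2
  bool negative = false;            // bit 3
  bool extend = false;              // bit 4
  std::uint8_t interrupt_mask = 0;  // bits 8-10
  bool supervisor = false;          // bit 13
  bool trace = false;               // bit 15
};

// Builds the 16-bit status word from the individual flags
constexpr Word pack_sr(const StatusFlags& f) {
  return static_cast<Word>(
      (f.carry ? 1u << 0 : 0u) | (f.overflow ? 1u << 1 : 0u) |
      (f.zero ? 1u << 2 : 0u) | (f.negative ? 1u << 3 : 0u) |
      (f.extend ? 1u << 4 : 0u) | ((f.interrupt_mask & 7u) << 8) |
      (f.supervisor ? 1u << 13 : 0u) | (f.trace ? 1u << 15 : 0u));
}

// Splits the 16-bit status word into the individual flags
inline StatusFlags unpack_sr(Word sr) {
  StatusFlags f;
  f.carry = (sr & (1u << 0)) != 0;
  f.overflow = (sr & (1u << 1)) != 0;
  f.zero = (sr & (1u << 2)) != 0;
  f.negative = (sr & (1u << 3)) != 0;
  f.extend = (sr & (1u << 4)) != 0;
  f.interrupt_mask = static_cast<std::uint8_t>((sr >> 8) & 7u);
  f.supervisor = (sr & (1u << 13)) != 0;
  f.trace = (sr & (1u << 15)) != 0;
  return f;
}
```

The trade-off is more code in exchange for a layout that doesn't depend on the compiler.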

Error handling

Various errors may happen: an unaligned (non-divisible by 2) program counter address, read address, or write address; an unknown instruction; or an attempt to write to a protected address space.

I decided to handle errors without using exceptions (try/throw/catch). Usually, I don't mind standard exceptions, but this approach makes debugging a bit more convenient.

So, let's create a class for errors:

The Error class

class Error {
public:
  enum Kind {
    // no error
    Ok,

    UnalignedMemoryRead,
    UnalignedMemoryWrite,
    UnalignedProgramCounter,
    UnknownAddressingMode,
    UnknownOpcode,

    // permission error
    ProtectedRead,
    ProtectedWrite,

    // bus error
    UnmappedRead,
    UnmappedWrite,

    // invalid action
    InvalidRead,
    InvalidWrite,
  };

  Error() = default;
  Error(Kind kind, std::string what)
    : kind_{kind}
    , what_{std::move(what)}
  {}

  Kind kind() const {
    return kind_;
  }
  const std::string& what() const {
    return what_;
  }

private:
  Kind kind_{Ok};
  std::string what_;
};

A member function that may fail must now have a return type of std::optional<Error>.

If the member function can either fail or return an object of type T, its return type must be std::expected<T, Error>. This class template was added in C++23 and suits this approach well.
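To show what this looks like at a call site, here is a small sketch of the std::optional variant. SimpleError is a simplified stand-in for the Error class, and check_word_read and read_two_words are hypothetical functions invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Simplified stand-in for the Error class above
struct SimpleError {
  enum Kind { UnalignedMemoryRead, UnmappedRead };
  Kind kind;
  std::string what;
};

// Returns std::nullopt on success and an error otherwise
[[nodiscard]] inline std::optional<SimpleError> check_word_read(
    std::uint32_t addr) {
  if (addr % 2 != 0) {
    return SimpleError{SimpleError::UnalignedMemoryRead,
                       "unaligned read at " + std::to_string(addr)};
  }
  return std::nullopt;
}

// The caller passes the error upwards without try/catch
[[nodiscard]] inline std::optional<SimpleError> read_two_words(
    std::uint32_t addr) {
  if (auto err = check_word_read(addr)) {
    return err;
  }
  return check_word_read(addr + 2);
}
```

Every failure is an ordinary return value, so a breakpoint on the return statement shows exactly where an error originated.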

Memory read/write interface

As mentioned in the section on Sega Genesis architecture, the semantics of reading from or writing to addresses can differ depending on the address. To abstract the behavior in terms of m68k, we'll create the Device class:

class Device {
public:
  // reads `data.size()` bytes from address `addr`
  [[nodiscard]] virtual std::optional<Error> read(AddressType addr,
                                                  MutableDataView data) = 0;

  // writes `data.size()` bytes to address `addr`
  [[nodiscard]] virtual std::optional<Error> write(AddressType addr,
                                                   DataView data) = 0;

  // ....
};

The expected behavior is clear from the comments. We'll add the Byte, Word, and Long read/write helpers to this class.

  template<std::integral T>
  std::expected<T, Error> read(AddressType addr) {
    T data;
    auto err = read(addr,
                    MutableDataView { reinterpret_cast<Byte*>(&data),
                                      sizeof(T) });
    if (err)
    {
      return std::unexpected{std::move(*err)};
    }
    // swap bytes after reading to make it little-endian
    return std::byteswap(data);
  }

  template<std::integral T>
  [[nodiscard]] std::optional<Error> write(AddressType addr, T value) {
    // swap bytes before writing to make it big-endian
    const auto swapped = std::byteswap(value);
    return write(addr,
                 DataView { reinterpret_cast<const Byte*>(&swapped),
                            sizeof(T) });
  }

The m68k execution context

The execution context of m68k is registers plus memory:

struct Context {
  Registers& registers;
  Device& device;
};

The m68k operand representation

Each instruction has 0 to 2 operands, aka targets. There are a lot of ways they can point to an address in memory or a register. The operand class has variables like these:

Kind kind_;      // one of 12 addressing types (the addressing mode)
uint8_t index_;  // the "index" value for index addressing types
Word ext_word0_; // the first extension word
Word ext_word1_; // the second extension word
Long address_;   // the "address" value for addressable
                 // addressing types

There are also two or three more variables. The whole class fits in 24 bytes.

This class has read/write member functions:

[[nodiscard]] std::optional<Error> read(Context ctx, MutableDataView data);
[[nodiscard]] std::optional<Error> write(Context ctx, DataView data);

You can see the implementation here: lib/m68k/target/target.h.

The most complex addressing modes are Address with Index and Program Counter with Index. This is how their address is evaluated:

Target::indexed_address

Long Target::indexed_address(Context ctx, Long baseAddress) const {
  const uint8_t xregNum = bits_range(ext_word0_, 12, 3);
  const Long xreg = bit_at(ext_word0_, 15) ? a_reg(ctx.registers, xregNum)
                                           : ctx.registers.d[xregNum];
  const Long size = bit_at(ext_word0_, 11) ? /*Long*/ 4 : /*Word*/ 2;
  const Long scale = scale_value(bits_range(ext_word0_, 9, 2));
  const SignedByte disp = static_cast<SignedByte>(
                            bits_range(ext_word0_, 0, 8)
                          );

  SignedLong clarifiedXreg = static_cast<SignedLong>(xreg);
  if (size == 2) {
    clarifiedXreg = static_cast<SignedWord>(clarifiedXreg);
  }

  return baseAddress + disp + clarifiedXreg * scale;
}
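To make the bit layout of the extension word easier to follow, here is a standalone sketch that decodes the same fields. The bit_at and bits_range helpers are assumed implementations that match how they are called above (their real definitions aren't shown in the article), and BriefExtension is a hypothetical struct for this example.

```cpp
#include <cassert>
#include <cstdint>

using Word = std::uint16_t;

// Assumed helper: returns bit number `pos` of `value`
constexpr bool bit_at(Word value, unsigned pos) {
  return ((value >> pos) & 1u) != 0;
}

// Assumed helper: returns `len` bits of `value`, starting at bit `begin`
constexpr Word bits_range(Word value, unsigned begin, unsigned len) {
  return static_cast<Word>((value >> begin) & ((1u << len) - 1u));
}

// Hypothetical struct holding the decoded fields
struct BriefExtension {
  bool index_is_address_reg;  // bit 15: 0 = Dn, 1 = An
  std::uint8_t index_reg;     // bits 14-12: register number
  bool index_is_long;         // bit 11: 0 = sign-extended word, 1 = long
  std::int8_t displacement;   // bits 7-0: signed 8-bit displacement
};

constexpr BriefExtension decode_brief_extension(Word ext) {
  return BriefExtension{
      bit_at(ext, 15),
      static_cast<std::uint8_t>(bits_range(ext, 12, 3)),
      bit_at(ext, 11),
      static_cast<std::int8_t>(bits_range(ext, 0, 8)),
  };
}
```

For example, the extension word 0xA8FE selects A2 as a long index register with a displacement of -2.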

The m68k instruction representation

The instruction class includes the following variables:

Kind kind_;      // one of 82 opcodes
Size size_;      // Byte, Word, or Long
Condition cond_; // one of 16 conditions for branch instructions
Target src_;     // a source operand
Target dst_;     // a destination operand

There are also two or three more variables. I got the whole class down to 64 bytes.

Parsing the m68k instructions

The instruction class has a static member function that parses the current instruction.

static std::expected<Instruction, Error> decode(Context ctx);

You can see its implementation here: lib/m68k/instruction/decode.cpp

To avoid copy-pasting a bunch of "error" checks, I used the following macros:

#define READ_WORD_SAFE                    \
  const auto word = read_word();          \
  if (!word) {                            \
    return std::unexpected{word.error()}; \
  }

I also pattern-checked the opcode in an easy-to-use format:

The HAS_PATTERN macro

Functions for calculating a mask:

consteval Word calculate_mask(std::string_view pattern) {
  Word mask{};
  for (const char c : pattern) {
    if (c != ' ') {
      mask = (mask << 1) | ((c == '0' || c == '1') ? 1 : 0);
    }
  }
  return mask;
}

consteval Word calculate_value(std::string_view pattern) {
  Word value{};
  for (const char c : pattern) {
    if (c != ' ') {
      value = (value << 1) | ((c == '1') ? 1 : 0);
    }
  }
  return value;
}

The HAS_PATTERN macro:

#define HAS_PATTERN(pattern) \
  ((*word & calculate_mask(pattern)) == calculate_value(pattern))

And then we have this, for example:

if (HAS_PATTERN("0000 ...1 ..00 1...")) {
  // this is MOVEP
  // ...
}

The code above checks whether the bits in the opcode satisfy the pattern. In other words, it checks whether the corresponding bits (ones without a dot) are 0 or 1. In our case, this is the pattern for the MOVEP opcode.

This works as quickly as typing the code manually: consteval ensures that the call is executed at compile time.
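As a worked example, the MOVEP pattern yields the mask 0xF138 and the value 0x0108. The sketch below merges the two functions into one hypothetical helper called matches, and uses constexpr instead of consteval so that the snippet also builds as C++17.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

using Word = std::uint16_t;

// Same logic as calculate_mask and calculate_value, merged into one helper
constexpr bool matches(Word opcode, std::string_view pattern) {
  Word mask = 0;
  Word value = 0;
  for (const char c : pattern) {
    if (c == ' ') {
      continue;
    }
    mask = static_cast<Word>((mask << 1) | ((c == '0' || c == '1') ? 1 : 0));
    value = static_cast<Word>((value << 1) | ((c == '1') ? 1 : 0));
  }
  return (opcode & mask) == value;
}

// 0x0108 encodes MOVEP.W (d16,A0),D0
static_assert(matches(0x0108, "0000 ...1 ..00 1..."));
// 0x4E71 encodes NOP
static_assert(!matches(0x4E71, "0000 ...1 ..00 1..."));
```

Both checks are evaluated by the compiler, so a typo in a pattern shows up as a build error.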

Executing the m68k instructions

The instruction class has a member function that executes the instruction. It changes registers and may access memory:

[[nodiscard]] std::optional<Error> execute(Context ctx);

You can see its implementation here: lib/m68k/instruction/execute.cpp. This is the most complex code in the emulator.

You can find a description of the instructions in this markdown documentation. If that isn't enough, you can read the extensive description in this book.

Writing instruction emulation is an iterative process. Creating every instruction is difficult at first, but as more patterns and common code accumulate, it becomes easier.

There are some obnoxious instructions, such as MOVEP, and also BCD arithmetic instructions, such as ABCD. In BCD arithmetic, each hexadecimal digit is treated as a decimal digit. For example, BCD addition looks like this: 0x678 + 0x535 = 0x1213. I spent over four hours working on these BCD instructions because their logic is extremely complex and not explained properly anywhere.
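To make the BCD idea concrete, here is a simplified sketch of adding two packed BCD bytes with a carry. It is not the emulator's ABCD implementation: it assumes both inputs contain valid decimal digits and ignores the status flags and the corner cases that make the real instruction hard. BcdResult and bcd_add are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct BcdResult {
  std::uint8_t value;  // two decimal digits, one per nibble
  bool carry;          // set when the sum exceeds 99
};

// Adds two packed BCD bytes plus an incoming carry.
// Assumes each nibble of `a` and `b` holds a digit from 0 to 9.
inline BcdResult bcd_add(std::uint8_t a, std::uint8_t b, bool carry_in) {
  assert((a & 0x0F) <= 9 && (a >> 4) <= 9);
  assert((b & 0x0F) <= 9 && (b >> 4) <= 9);

  unsigned low = (a & 0x0Fu) + (b & 0x0Fu) + (carry_in ? 1u : 0u);
  unsigned high = (a >> 4) + (b >> 4);
  if (low > 9) {  // decimal carry from the low digit to the high digit
    low -= 10;
    high += 1;
  }
  bool carry = false;
  if (high > 9) {  // decimal carry out of the byte
    high -= 10;
    carry = true;
  }
  return {static_cast<std::uint8_t>((high << 4) | low), carry};
}
```

Applied to the example above, the low bytes give 0x78 + 0x35 = 0x13 with a carry, and the high bytes give 0x06 + 0x05 + 1 = 0x12, which together form 0x1213.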

Testing the m68k emulator

Testing is the most important part. Even a small error in a status flag can lead to disasters during emulation. Large applications are prone to unexpected breakdowns, so developers need to test all instructions.

The tests in this repository have been very helpful. There are over 8,000 tests for each instruction, covering every possible case. The total number of tests is just over a million.

They can detect even the slightest error. Often, approximately 20 out of 8,000 tests fail.

For example, the MOVE (A6)+ (A6)+ instruction (the A6 register is accessed with a post-increment) shouldn't work the way I implemented it. So, I created a workaround to make it work properly.

The emulator operates correctly most of the time now. No more than ten tests fail in isolated cases when there's a bug in the tests or another issue.

Emulating C++ programs

You can emulate your own programs. Let's write a simple program that reads two numbers and writes all the values within that range in a loop:

    void work() {
        int begin = *(int*)0xFF0000;
        int end = *(int*)0xFF0004;

        for (int i = begin; i <= end; ++i) {
            // without "volatile", the compiler
            // collapses the loop into a single write!
            *(volatile int*)0xFF0008 = i; 
        }
    }

Both GCC and Clang can cross-compile your code to the m68k architecture. Let's do it with Clang (the a.cpp file will become the a.o one):

clang++ a.cpp -c --target=m68k -O3

You can view the object file assembly code using the following command. Note that you will most likely need to install the binutils-m68k-linux-gnu package first:

m68k-linux-gnu-objdump -d a.o

This assembly code will be displayed.

This object file is in the ELF format, so we need to unpack it. Let's extract the machine code (the .text section) to the a.bin file:

m68k-linux-gnu-objcopy -O binary --only-section=.text a.o a.bin

The hd a.bin command lets us check that the correct bytes were extracted.

We can now emulate this assembly code. The emulator code is here, and the emulation logs are here. In this example, the numbers from 1307 to 1320 are written at the 0xFF0008 address.

More emulation: The Sieve of Eratosthenes

In the next program, I had to tinker with compilers. Using the sieve of Eratosthenes, I calculated prime numbers up to 1,000.

This required an array filled with zeros. With the regular bool notPrime[N+1] = {0} declaration, the compilers tried to call the memset function from the standard library. This should be avoided since no libraries are linked. As a result, the code looked like this:

    void work() {
        constexpr int N = 1000;

        // avoiding calling "memset" -_-
        volatile bool notPrime[N + 1];
        for (int i = 0; i <= N; ++i) {
            notPrime[i] = 0;
        }

        for (int i = 2; i <= N; ++i) {
            if (notPrime[i]) {
                continue;
            }
            *(volatile int*)0xFF0008 = i;
            for (int j = 2 * i; j <= N; j += i) {
                notPrime[j] = true;
            }
        }
    }

And it is built using GCC (with the g++-m68k-linux-gnu package):

m68k-linux-gnu-g++ a.cpp -c -O3

This is what the assembly code looks like, and this is what the emulator output looks like.

Non-trivial programs are difficult to emulate because the environment is too synthetic. For example, there are two issues with writing a string in a program like this:

    void work() {
        strcpy((char*)0xFF0008, "Der beste Seemann war doch ich");
    }

The first issue is that strcpy is a library function that isn't linked into the object file. The second issue is the string literal, whose location in memory isn't known until link time.

With enough effort, you can emulate Linux for m68k. QEMU can do it!

The ROM-file format

I use ImHex to analyze unknown formats and protocols so that I can better understand their content.

Imagine that you have downloaded the ROM file of your favorite childhood game. A Google search of the ROM format reveals that the first 256 bytes are occupied by the m68k vector table. It contains addresses for various cases, such as division by zero. The next 256 bytes contain the ROM header with information about the game.

Let's draft a hex pattern using the internal ImHex language for parsing binary files and look at the contents:

The sega.hexpat pattern

The be part before the type means big-endian:

struct AddressRange {
    be u32 begin;
    be u32 end;
};

struct VectorTable {
    be u32 initial_sp;
    be u32 initial_pc;
    be u32 bus_error;
    be u32 address_error;
    be u32 illegal_instruction;
    be u32 zero_divide;
    be u32 chk;
    be u32 trapv;
    be u32 privilege_violation;
    be u32 trace;
    be u32 line_1010_emulator;
    be u32 line_1111_emulator;
    be u32 hardware_breakpoint;
    be u32 coprocessor_violation;
    be u32 format_error;
    be u32 uninitialized_interrupt;
    be u32 reserved_16_23[8];
    be u32 spurious_interrupt;
    be u32 autovector_level_1;
    be u32 autovector_level_2;
    be u32 autovector_level_3;
    be u32 hblank;
    be u32 autovector_level_5;
    be u32 vblank;
    be u32 autovector_level_7;
    be u32 trap[16];
    be u32 reserved_48_63[16];
};

struct RomHeader {
    char system_type[16];
    char copyright[16];
    char title_domestic[48];
    char title_overseas[48];
    char serial_number[14];
    be u16 checksum;
    char device_support[16];
    AddressRange rom_address_range;
    AddressRange ram_address_range;
    char extra_memory[12];
    char modem_support[12];
    char reserved1[40];
    char region[3];
    char reserved2[13];
};

struct Rom {
    VectorTable vector_table;
    RomHeader rom_header;
};

Rom rom @ 0x00;

Picture N4 – ImHex "parsed" the beginning of the file

We can also disassemble any number of instructions starting with initial_pc (the entry point) to see what happens in the first instructions:

Picture N5 – The disassembler in ImHex

Once everything is clear, we can convert the structures from the hex pattern to C++. The example is here (I've removed unnecessary data members): lib/sega/rom_loader/rom_loader.h.
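As a rough sketch of such a conversion, the function below pulls a few fields out of a raw ROM image. The offsets follow from the hex pattern above (the vector table occupies 0x000-0x0FF and the header starts at 0x100), while the names RomSummary, be32, and summarize_rom are assumptions for this example and not the contents of rom_loader.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical summary of the fields we care about
struct RomSummary {
  std::uint32_t initial_sp;    // vector 0
  std::uint32_t initial_pc;    // vector 1, the entry point
  std::string title_domestic;  // 48 characters at 0x120
  std::uint16_t checksum;      // big-endian word at 0x18E
};

// Reads a big-endian 32-bit value at offset `off`
inline std::uint32_t be32(const std::vector<std::uint8_t>& rom,
                          std::size_t off) {
  assert(off + 4 <= rom.size());
  return (static_cast<std::uint32_t>(rom[off]) << 24) |
         (static_cast<std::uint32_t>(rom[off + 1]) << 16) |
         (static_cast<std::uint32_t>(rom[off + 2]) << 8) |
         static_cast<std::uint32_t>(rom[off + 3]);
}

// Returns std::nullopt if the image is too short to hold the header
inline std::optional<RomSummary> summarize_rom(
    const std::vector<std::uint8_t>& rom) {
  constexpr std::size_t kHeaderEnd = 0x200;  // vector table + ROM header
  if (rom.size() < kHeaderEnd) {
    return std::nullopt;
  }
  RomSummary s;
  s.initial_sp = be32(rom, 0x000);
  s.initial_pc = be32(rom, 0x004);
  s.title_domestic.assign(rom.begin() + 0x120, rom.begin() + 0x150);
  s.checksum = static_cast<std::uint16_t>((rom[0x18E] << 8) | rom[0x18F]);
  return s;
}
```

Reading fields by offset avoids casting the file contents to a packed struct, which would depend on the host's byte order and padding rules.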

Unlike many other formats where headers aren't an integral part of the content, the 512-byte header in ROM files is essential. This means that the ROM file needs to be loaded into memory as a whole. According to the address mapping, the 0x000000 - 0x3FFFFF area is assigned to it.

A bus device

To implement the address mapping, we can make BusDevice a class derived from Device that forwards read and write commands to the appropriate device:

class BusDevice : public Device {
public:
  struct Range {
    AddressType begin;
    AddressType end;
  };
  void add_device(Range range, Device* device);

  /* ... more `read` and `write` override methods */

private:
  struct MappedDevice {
    const Range range;
    Device* device;
  };
  std::vector<MappedDevice> mapped_devices_;
};

An object of this class is fed to the m68k emulator. The full implementation is here: lib/sega/memory/bus_device.h.
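The lookup itself can be sketched as a linear search over the mapped ranges. This is a simplified illustration and not the code from bus_device.h: the device is represented by an integer id instead of a Device pointer, the range end is assumed to be inclusive, and find_device is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using AddressType = std::uint32_t;

struct Range {
  AddressType begin;
  AddressType end;  // assumed inclusive, e.g. 0x000000 - 0x3FFFFF
};

struct MappedDevice {
  Range range;
  int device_id;  // stand-in for the Device* of the real class
};

// Returns the id of the device that owns `addr`, or -1 if it is unmapped
inline int find_device(const std::vector<MappedDevice>& devices,
                       AddressType addr) {
  for (const auto& mapped : devices) {
    if (mapped.range.begin <= addr && addr <= mapped.range.end) {
      return mapped.device_id;
    }
  }
  return -1;
}
```

An unmapped address is where a real bus device would report an error such as UnmappedRead or UnmappedWrite.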

GUI

Initially, the emulation output was displayed only in the terminal, and control was also performed through the terminal. However, this is inconvenient for the emulator, so moving everything to the GUI is necessary.

I used the mega cool ImGui library for the GUI. It's feature-rich, allowing developers to create any interface they want.

Picture N6 – The window example: the m68k emulator status

This makes it possible to display the whole state of the emulator in separate windows, which makes debugging much easier.

Working in Docker

To avoid issues with outdated operating systems (when all packages are obsolete, and even modern C++ can't compile) and to prevent cluttering a PC with third-party packages, it's better to develop under Docker.

First, we create a Dockerfile and then rebuild the image whenever the Dockerfile changes:

sudo docker build -t segacxx .

Then, we go to the container with the directory mount (-v) and other necessary parameters:

sudo docker run --privileged \
                -v /home/eshulgin:/usr/src \
                -v /home/eshulgin/.config/nvim:/root/.config/nvim \
                -v /home/eshulgin/.local/share/nvim:/root/.local/share/nvim \
                -v /tmp/.X11-unix:/tmp/.X11-unix \
                -e DISPLAY=unix${DISPLAY} \
                -it \
                segacxx

Pitfalls:

The GUI may not have access to the display by default. After some research, I added the -v mount for X11 and the -e DISPLAY parameter to the command.
Also, the GUI won't work unless the xhost + command is run on the host PC to disable access control.
To access controllers (see the section below for details), I added --privileged to the command.

Picture N7 – NeoVim running under the docker container

Reverse engineering games in Ghidra

Let's say we have set up m68k emulation from a ROM. We read some documentation and connected some basic devices to the bus, such as ROM, RAM, the trademark register, etc. Then, we emulated one instruction at a time while looking at the disassembler.

It's a painful endeavor, and we want to get a higher-level picture. We can reverse engineer a game to do this. I use Ghidra for that:

Picture N8 – Reverse engineering a game for Sega Genesis

A plugin created by @DrMefistO helps get started. It marks well-known addresses and creates segments.

As you can see, since games were originally written in assembly language, they have a specific look.

Code and data are mixed: there's a code snippet, then there are byte fragments, e.g. for color, then more code, and so on. It's all the von Neumann architecture.

In theory, a function creates a stack frame with the LINK and UNLK instructions of the m68k assembler. In reality, this is almost never the case: in most functions, arguments are passed via semi-random registers. Some functions return the result in a status register flag (e.g., the zero flag). Fortunately, in Ghidra, one can manually specify what the function does in such cases, enabling the decompiler to display more accurate output. There are also "switches" of functions: several functions share the same body, but their first few instructions differ. An example is in the screenshot:

Picture N9 – A "switch" of functions

To get a general idea of what's going on and create a more accurate Sega emulator, we don't need to reverse engineer the entire game—5-10% is enough. It's better to reverse engineer a game that you remember well from your childhood so that it's not a "black box."

This skill will come in handy in the future when it comes to quickly debugging emulation failures in other games.

Emulating interrupts

Let's say we have some basic functional emulation configured. We run the emulator, and, as expected, it goes into an endless loop. After reverse engineering the code fragment, we discover that a flag in RAM is zeroed, and then the loop spins for as long as the flag remains zero:

Picture N10 – The reverse-engineered WaitVBLANK function

We look at the other places where this flag is accessed and see that it is set in the VBLANK interrupt handler. Let's reverse engineer VBLANK:

Picture N11 – The reverse-engineered VBLANK function

Have you heard of the legendary VBLANK and its popular grandson, HBLANK?

Depending on whether it is NTSC or PAL/SECAM, a video controller renders a frame pixel by pixel on the old TV 60 or 50 times per second.

Frame rendering (source)

The HBLANK interrupt triggers when the current line has been drawn and the beam moves to the next line (the green lines in the picture above). On a real console, only 18 bytes can physically be sent to the video memory during this time, though I don't enforce such a limit in the emulator. Not all games use this interrupt.

The VBLANK interrupt triggers when the entire frame has been rendered and the beam returns to the beginning of the screen (the blue line). A maximum of 7 kilobytes of data can be sent to the video memory during this time.

Let's say we hardcoded the use of NTSC (60 FPS). To trigger the interrupt, we need to add a check to the instruction execution loop that verifies the following conditions:

The VBLANK interrupt is enabled by the video processor.
The Interrupt Mask value in the status register, which indicates the priority level of the interrupt currently being handled, is less than six (the VBLANK level).
At least 1/60 of a second has passed since the previous interrupt.

If so, we jump to the function. It looks like this:

std::optional<Error> InterruptHandler::call_vblank() {
  // push PC (4 bytes)
  auto& sp = registers_.stack_ptr();
  sp -= 4;
  if (auto err = bus_device_.write(sp, registers_.pc)) {
    return err;
  }

  // push SR (2 bytes)
  sp -= 2;
  if (auto err = bus_device_.write(sp, Word{registers_.sr})) {
    return err;
  }

  // make supervisor, set priority mask, jump to VBLANK
  registers_.sr.supervisor = 1;
  registers_.sr.interrupt_mask = VBLANK_INTERRUPT_LEVEL;
  registers_.pc = vblank_pc_;

  return std::nullopt;
}

The full code is here: lib/sega/executor/interrupt_handler.cpp.
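The three conditions that guard this call can be sketched as a single predicate. The names below (should_fire_vblank, kVblankLevel, kFrameTime) are assumptions for illustration, as is the use of std::chrono::steady_clock for timing. The actual check lives in the file linked above.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// VBLANK occupies the level 6 autovector slot in the vector table
constexpr std::uint8_t kVblankLevel = 6;
// One NTSC frame: 1/60 of a second
constexpr std::chrono::nanoseconds kFrameTime{1'000'000'000 / 60};

// Returns true when all three conditions for raising VBLANK hold
inline bool should_fire_vblank(bool vblank_enabled,
                               std::uint8_t interrupt_mask,
                               Clock::time_point last_vblank,
                               Clock::time_point now) {
  return vblank_enabled &&                  // enabled by the VDP
         interrupt_mask < kVblankLevel &&   // not masked by the CPU
         (now - last_vblank) >= kFrameTime; // a full frame has elapsed
}
```

Passing the current time as a parameter keeps the predicate deterministic, which makes it easy to test without waiting for real time to pass.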

The way games run revolves around this interrupt; it's a sort of game engine.

We also need to configure the GUI to re-render the screen when the VBLANK interrupt is received.

Video Display Processor

Video Display Processor (aka VDP) is the second most complex emulator component after m68k. To better understand how it works, I recommend checking out these websites:

Plutiedev is not just about VDP but about programming for Sega Genesis in general. There are many insights into how pseudo-float and other math are implemented in games.
Raster Scroll is an awesome description of VDP with lots of pictures. I suggest reading them just for the fun of it.

This processor has 24 registers responsible for various tasks and 64 kilobytes of VRAM for storing graphics information.

The m68k processor stores data in VRAM and can also change registers. This process mostly occurs during VBLANK. The VDP then renders an image on the TV based on the sent data. That's it—it doesn't do anything else.

VDP has a pretty complicated color system. Four palettes are active at any given time, each containing 16 colors. Each color occupies nine bits (three bits per R/G/B, for a total of 512 unique colors).

The first color in the palette is always transparent, so there are actually 15 colors available in the palette plus transparency.
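To show how a nine-bit color can be turned into something a modern GPU understands, here is a sketch that expands a color RAM word to 8-bit RGB. It assumes the usual CRAM word layout (0000 BBB0 GGG0 RRR0) and uses linear scaling, which is only an approximation of the console's real output levels. Rgb, expand3, and decode_color are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct Rgb {
  std::uint8_t r;
  std::uint8_t g;
  std::uint8_t b;
};

// Scales a 3-bit channel (0-7) to 8 bits (0-255), linearly
constexpr std::uint8_t expand3(unsigned v) {
  return static_cast<std::uint8_t>(v * 255 / 7);
}

// Assumed CRAM word layout: 0000 BBB0 GGG0 RRR0
constexpr Rgb decode_color(std::uint16_t cram_word) {
  return Rgb{
      expand3((cram_word >> 1) & 7u),
      expand3((cram_word >> 5) & 7u),
      expand3((cram_word >> 9) & 7u),
  };
}
```

With this layout, 0x0EEE decodes to white and 0x000E to pure red.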

In VDP, the basic unit is a tile, which is an 8x8 pixel square. The trick is that each pixel doesn't specify a color, but its number in the palette. So, it takes four bits per pixel (a value ranging from 0 to 15), for a total of 32 bytes per tile. You may ask, "Where's the palette number specified?" Well, it isn't specified in a tile, but in a higher-level plane (or a sprite) entity.
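The 32-byte tile layout can be illustrated with a small lookup function. The sketch assumes that each row takes four bytes and that the high nibble of a byte holds the left pixel of the pair. Tile and tile_pixel are hypothetical names for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 8 rows x 4 bytes per row, two pixels per byte
using Tile = std::array<std::uint8_t, 32>;

// Returns the palette index (0-15) of the pixel at column x, row y.
// Index 0 means "transparent".
inline std::uint8_t tile_pixel(const Tile& tile, std::size_t x,
                               std::size_t y) {
  assert(x < 8 && y < 8);
  const std::uint8_t byte = tile[y * 4 + x / 2];
  // Assumed order: high nibble is the left pixel, low nibble the right one
  return static_cast<std::uint8_t>((x % 2 == 0) ? (byte >> 4)
                                                : (byte & 0x0F));
}
```

The renderer would then combine this index with the palette number taken from the plane or sprite to find the final color.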

The screen can be 28 or 30 tiles high and 32 or 40 tiles wide.

VDP has two entities called Plane A and Plane B (there's also a Window Plane), which are the front and back backgrounds, sized no larger than 64x32 tiles.

The planes can scroll relative to the camera at different rates (e.g., +2 pixels for the foreground and +1 for the background) to create a 3D effect in the game.

For a plane, it's possible to set the scroll offset separately for each strip of eight lines or line by line to achieve different effects.

The plane defines a list of tiles and specifies a palette for each one. Overall, data for the plane can consume a significant amount of VRAM.

The VDP has the sprite entity, which is a composite chunk of tiles ranging in size from 1x1 to 4x4. For example, there can be 2x4 or 3x2 sprites. It has a position on the screen and a palette that determine how the tiles are rendered. We can mirror the sprite vertically and/or horizontally to avoid duplicating tiles. Many objects are rendered in multiple sprites if one sprite isn't enough.

The VDP can handle a maximum of 80 sprites. Each sprite has a link field, which holds the index of the next sprite to be rendered, so the sprites form a linked list. The VDP first renders sprite zero, then the sprite that its link points to, and so on until a link equals zero. This determines the sprite depth order.
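The traversal can be sketched as follows. This is a simplified model and not the renderer's code: Sprite keeps only the link field, sprite_order is a hypothetical function, and the limit of 80 visits doubles as a guard against a cyclic list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified sprite: only the link to the next sprite is modeled
struct Sprite {
  std::uint8_t link;  // index of the next sprite; 0 ends the list
};

// Returns sprite indices in the order in which they are visited
inline std::vector<std::size_t> sprite_order(
    const std::vector<Sprite>& table) {
  constexpr std::size_t kMaxSprites = 80;
  std::vector<std::size_t> order;
  std::size_t index = 0;  // the walk always starts at sprite zero
  while (index < table.size() && order.size() < kMaxSprites) {
    order.push_back(index);
    index = table[index].link;
    if (index == 0) {
      break;  // a zero link terminates the list
    }
  }
  return order;
}
```

For a table whose links are 2, 0, and 1, the visiting order is sprite 0, then sprite 2, then sprite 1.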

Depending on the circumstances, there's enough memory in the VRAM for 1,400–1,700 tiles. This seems like a decent number, but it's not that much. For example, filling the background with unique tiles would require about 1,100 tiles, leaving no space for anything else. So, the level designers had to reuse tiles heavily.

The VDP has many rules, including two levels of layer prioritization:

Picture N13 – The VDP graphics layer prioritization

It's better to implement VDP rendering iteratively. First, we can render the palettes and check that they change plausibly over time, meaning that the colors roughly match those of the splash screen or main menu:

Picture N14 – A window in GUI, color palettes

Then, we can render all the tiles:

Pictures N15 – All tiles in the zero palette and a fully rendered frame

The same tiles in other palettes

Picture N16 – First palette

Picture N17 – Second palette

Picture N18 – Third palette

We can then render planes in individual windows:

Picture N19 – Two separate planes (below) and a fully rendered frame (above)

There's also the window plane, which is rendered a little differently:

Picture N20 – The window plane (right) and a fully rendered frame (left)

Then it's the sprites' turn:

Picture N21 – The beginning of the sprite list (right) and a fully rendered frame (left)

The full implementation of the renderer is here: lib/sega/video/video.cpp.

A frame must be computed pixel by pixel. To make the pixels visible in ImGui, we need to create a 2D OpenGL texture and put every frame in there:

ImTextureID Video::draw() {
  glBindTexture(GL_TEXTURE_2D, texture_);
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA,
               width_ * kTileDimension, height_ * kTileDimension,
               0, GL_RGBA, GL_UNSIGNED_BYTE, canvas_.data());
  return texture_;
}

Testing the VDP renderer

Although we can run the game to see what's rendered, doing so can be inconvenient. It's better to start with interesting cases, collect many dumps, and create a test that uses a single command to generate pictures from the dumps. The git status command shows which images have changed. This is convenient because we can fix VDP bugs without having to run the emulator.

For this purpose, I added a Save Dump button to the GUI that saves the state of the video memory (VDP registers, VRAM, CRAM, and VSRAM). I saved these dumps in the bin/sega_video_test/dumps directory and wrote a README explaining how to regenerate them using a single command.

Of course, this works only if the data has been correctly transferred to the video memory (this isn't the case for a couple of the dumps at the link).

The stb_image library (more precisely, its stb_image_write companion header) is useful for saving images as PNG files.
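
The dump-to-image idea can be sketched without any dependencies. The example below deliberately swaps PNG for the much simpler binary PPM format so that it stays self-contained; write_ppm is a hypothetical helper, not part of the emulator. The important property is the same: the output is deterministic, so a renderer change shows up as a changed file in git status.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Writes an RGBA canvas as a binary PPM (P6) image, dropping the alpha channel.
// PPM is used here only as a dependency-free stand-in for PNG.
void write_ppm(std::ostream& out, std::size_t width, std::size_t height,
               const std::vector<std::uint8_t>& rgba) {
  assert(rgba.size() == width * height * 4);
  out << "P6\n" << width << ' ' << height << "\n255\n";
  for (std::size_t i = 0; i < rgba.size(); i += 4) {
    out.write(reinterpret_cast<const char*>(&rgba[i]), 3);  // R, G, B
  }
}
```

A test runner would load each dump, render it to a canvas, and call write_ppm with a std::ofstream opened in binary mode.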

Retro controller support

Since we aren't taking the easy route, we can support retro controllers that are identical to the Sega ones.

I googled what I could buy nearby and bought a controller for $25:

Picture N22 – The controller

The vendor claimed support for Windows but didn't mention Linux. ImGui, on the other hand, claimed support for Xbox, PlayStation, and Nintendo Switch controllers, so I was ready to reverse engineer the controller as well.

Fortunately, everything worked out. I managed to support the three-button Sega controller by pressing the buttons and seeing what code each one corresponded to:

Keyboard and retro controller mapping

void Gui::update_controller() {
  static constexpr std::array kMap = {
      // keyboard keys
      std::make_pair(ImGuiKey_Enter, ControllerDevice::Button::Start),

      std::make_pair(ImGuiKey_LeftArrow, ControllerDevice::Button::Left),
      std::make_pair(ImGuiKey_RightArrow, ControllerDevice::Button::Right),
      std::make_pair(ImGuiKey_UpArrow, ControllerDevice::Button::Up),
      std::make_pair(ImGuiKey_DownArrow, ControllerDevice::Button::Down),

      std::make_pair(ImGuiKey_A, ControllerDevice::Button::A),
      std::make_pair(ImGuiKey_S, ControllerDevice::Button::B),
      std::make_pair(ImGuiKey_D, ControllerDevice::Button::C),

      // Retroflag joystick buttons
      std::make_pair(ImGuiKey_GamepadStart, ControllerDevice::Button::Start),

      std::make_pair(ImGuiKey_GamepadDpadLeft, ControllerDevice::Button::Left),
      std::make_pair(ImGuiKey_GamepadDpadRight,
                     ControllerDevice::Button::Right),
      std::make_pair(ImGuiKey_GamepadDpadUp, ControllerDevice::Button::Up),
      std::make_pair(ImGuiKey_GamepadDpadDown, ControllerDevice::Button::Down),

      std::make_pair(ImGuiKey_GamepadFaceDown, ControllerDevice::Button::A),
      std::make_pair(ImGuiKey_GamepadFaceRight, ControllerDevice::Button::B),
      std::make_pair(ImGuiKey_GamepadR2, ControllerDevice::Button::C),
  };

  auto& controller = executor_.controller_device();
  for (const auto& [key, button] : kMap) {
    if (ImGui::IsKeyPressed(key, /*repeat=*/false)) {
      controller.set_button(button, true);
    } else if (ImGui::IsKeyReleased(key)) {
      controller.set_button(button, false);
    }
  }
}

A little side story about a case of bad luck

I have a HyperX Alloy Origins Core keyboard (this also isn't an ad). It allows for the customization of the RGB lighting with complex patterns, such as animations or click responses, and the addition of macros. However, the customization software is available only on Windows, and I'd like to change the lighting on Linux based on certain events as well.

Then, I took USB dumps in Wireshark and reverse engineered the behavior.

For example, we can assign a static red color to one button, get what is written, and see which bytes relate to that button, and so on.

Unless we reverse engineer the .exe file, there's nowhere to look—it seems like the protocol was invented in the AliExpressTech basement, so there's no documentation. There's an incomplete reverse for this keyboard in OpenRGB (it turns out there's a project for reverse engineering all sorts of colorful stuff).

Pixel shaders

We could create all kinds of pixel shaders to make it look cool.

This was a real pain: ImGui has poor support for shaders, and working around that requires an ugly hack. Additionally, I had to add the GLAD library to call the functions that compile a pixel shader. Also, the shader code must target GLSL version 130, and the only external variable is uniform sampler2D Texture; everything else is a constant.

The goal was to create a CRT shader that would simulate an old TV and to add some other shaders if possible.

Since I am a total noob at shaders, I used ChatGPT to create them, considering the limitations described above. The sources are here: lib/sega/shader/shader.cpp. I didn't even dig into the shader code, just read the comments.

The CRT shader features generated by the AI:

Barrel Distortion is a bulge effect;
Scanline Darkness makes every second line darker;
Chromatic Aberration is an RGB layer distortion;
Vignette darkens the color around the edges.
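
The real effects live in the GLSL shader, which isn't reproduced here. Purely to illustrate the math behind two of the items above, here is a CPU-side C++ sketch of scanline darkening and a vignette factor. The function names, the darkness parameter, and the quadratic falloff are assumptions for this example and are not taken from shader.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Darkens every second row of an RGBA image by `darkness` (0 = off, 1 = black).
void apply_scanlines(std::vector<std::uint8_t>& rgba, std::size_t width,
                     std::size_t height, float darkness) {
  assert(rgba.size() == width * height * 4);
  for (std::size_t y = 1; y < height; y += 2) {
    for (std::size_t x = 0; x < width; ++x) {
      for (std::size_t c = 0; c < 3; ++c) {  // leave alpha untouched
        std::uint8_t& value = rgba[(y * width + x) * 4 + c];
        value = static_cast<std::uint8_t>(value * (1.0f - darkness));
      }
    }
  }
}

// Brightness factor for normalized coordinates u, v in [0, 1]:
// 1 at the center, decreasing towards the edges.
float vignette_factor(float u, float v, float strength) {
  const float dx = u - 0.5f;
  const float dy = v - 0.5f;
  return std::max(0.0f, 1.0f - strength * (dx * dx + dy * dy));
}
```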

The shader result:

Picture N23 – The CRT shader result

Fred Flintstone before and after adding the shader (enhanced):

Picture N24 – Fred Flintstone

I asked ChatGPT to create other shaders, but they're not as interesting:

Shaders

Picture N25 – No shaders

Picture N26 – The Desaturate shader

Picture N27 – The Glitch shader

Picture N28 – The Night Vision shader

I mostly played the emulator without shaders, but sometimes I used CRT.

Optimizations for the release build

It may not seem obvious, but rendering a frame is quite resource-intensive if done suboptimally. Let's say the screen size is 320x240 pixels and we iterate pixel by pixel. There can be up to 80 sprites on the screen, plus three planes. Each has a priority, which means every layer must be traversed twice. First, we need to find the corresponding pixel in each sprite or plane and check whether it is within the bounding box. Then, we need to fetch the tile from the tileset and check whether the pixel is opaque. All of this must be computed 60 times per second, fast enough to leave time for ImGui and the m68k emulator.

So, the computations must contain no redundant code, memory allocations, and so on.
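
The "traversed twice" part can be sketched as follows. This is a simplified model, not the code from video.cpp: it assumes the candidate pixel of each layer has already been found, that layers are ordered front to back (sprites, plane A, plane B), and that color 0 means transparent. LayerPixel and resolve_pixel are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <initializer_list>

struct LayerPixel {
  std::uint8_t color;  // 0 means transparent
  bool high_priority;  // priority flag of the sprite or plane tile
};

// `layers` is ordered front to back, e.g. {sprites, plane A, plane B}.
// The first pass looks only at high-priority pixels, the second pass at
// low-priority ones; if nothing is opaque, the background color is used.
std::uint8_t resolve_pixel(const std::array<LayerPixel, 3>& layers,
                           std::uint8_t background) {
  for (const bool wanted_priority : {true, false}) {
    for (const LayerPixel& pixel : layers) {
      if (pixel.color != 0 && pixel.high_priority == wanted_priority) {
        return pixel.color;
      }
    }
  }
  return background;
}
```

Note that nothing here allocates memory, which matters when the function runs for every pixel of every frame.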

In reality, having the release build with the optimization settings enabled is enough.

set(CMAKE_BUILD_TYPE Release)

First, let's disable unused features and unnecessary warnings:

add_compile_options(-Wno-format)
add_compile_options(-Wno-nan-infinity-disabled)
add_compile_options(-fno-exceptions)
add_compile_options(-fno-rtti)

We'll switch to the Ofast optimization level and build the code for the native architecture, sacrificing binary portability, with link-time optimization, loop unrolling, and "fast" math.

set(CMAKE_CXX_FLAGS_RELEASE
    "${CMAKE_CXX_FLAGS_RELEASE} \
     -Ofast \
     -march=native \
     -flto \
     -funroll-loops \
     -ffast-math"
)

This is enough to achieve stable 60 FPS, and even 120 FPS if you play at double speed (when the interval for VBLANK interrupts is halved).

The only work that can be parallelized is the evaluation of pixels within a single line. Evaluating different lines at the same time is impossible because HBLANK fires between lines, and colors can be swapped there. Even within a line, I wouldn't recommend it: we'd need a lock-free algorithm to parallelize it with good resource utilization, and we don't want to do that unless it's absolutely necessary.

Testing the emulator with games

Almost every game introduced something new to the emulator: one game used a rare VDP feature that I implemented incorrectly, another one was doing something strange, and so on. In this section, I've described some of the quirks I've encountered while running a few dozen games.

Those that worked right away

I've basically built the emulator around the Cool Spot (1993) game: I reverse engineered it, debugged VDP gimmicks, and so on. The Cool Spot character is the 7 Up lemonade mascot (he's known only in the US, the mascot is different in other regions). It's a beautiful platformer that I played through many times as a kid.

Picture N29 – Cool Spot (1993)

Earthworm Jim (1994). The worm is scavenging through the dumpsters—wow, looks cool!

Picture N30 – Earthworm Jim (1994)

Aladdin (1993). I didn't really get into it: the graphics and gameplay weren't the best.

Picture N31 – Aladdin (1993)

Reading the VDP status register

Some games read the VDP status register: if we return an incorrect bit, the game either hangs or malfunctions.

This was the case in Battletoads (1992). The game was doing this:

    do {
      wVar2 = VDP_CTRL;
    } while ((wVar2 & 2) != 0);

Picture N32 – Battletoads (1992)
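
The loop above polls bit 1 of the status word, which commonly available VDP documentation describes as the DMA busy flag. Below is a minimal sketch of assembling the status word from individual flags. The struct and function names are hypothetical, and only a subset of the bits is modeled.

```cpp
#include <cassert>
#include <cstdint>

// A subset of the VDP status register bits.
enum StatusBit : std::uint16_t {
  kPalMode = 1 << 0,
  kDmaBusy = 1 << 1,
  kHBlank = 1 << 2,
  kVBlank = 1 << 3,
  kFifoEmpty = 1 << 9,
};

struct VdpFlags {
  bool pal;
  bool dma_busy;
  bool hblank;
  bool vblank;
  bool fifo_empty;
};

// Builds the 16-bit status word returned when the game reads the control port.
std::uint16_t make_status(const VdpFlags& flags) {
  std::uint16_t status = 0;
  if (flags.pal) status |= kPalMode;
  if (flags.dma_busy) status |= kDmaBusy;
  if (flags.hblank) status |= kHBlank;
  if (flags.vblank) status |= kVBlank;
  if (flags.fifo_empty) status |= kFifoEmpty;
  return status;
}
```

If the emulator performs DMA instantly, dma_busy stays false and the game's wait loop exits on the first read.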

The Window Plane looks different when its width is set to 32 tiles

One of the most poorly documented things is the window plane behavior. It appears that if the window plane is 32 tiles wide while all other planes are 64 tiles wide, a window plane tile must still be looked up using a row width of 32 tiles. I couldn't find this documented anywhere, so I left the workaround in place.

It appears, for example, in Goofy's Hysterical History Tour (1993). The gameplay of this game is pretty mediocre.

Picture N33 – Goofy's Hysterical History Tour (1993), the yellow line at the bottom came from Window Plane

Auto increment errors in DMA

The most annoying thing about VDP is DMA (Direct Memory Access), which is designed to move memory blocks from the m68k RAM to the VRAM. It has a few modes and settings, so it's easy to make a mistake. The most common error type is auto increment. There are non-obvious conditions regarding when a memory pointer should be incremented by this number.

In Tom and Jerry: Frantic Antics (1993), when a character moves across the map, new layers of tiles are added to the plane using a rare auto increment (128 instead of the usual 1). My code effectively assumed the increment was always 1, which went unnoticed for a while because the plane barely changed except for the top line. I debugged it by examining the plane window closely and working out that the new layer was meant to be written vertically.

Picture N34 – Tom and Jerry: Frantic Antics (1993)
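
Why would an auto increment of 128 produce a vertical strip? On a plane that is 64 tiles wide, one row of the name table takes 64 * 2 = 128 bytes, so stepping by 128 moves exactly one row down. The sketch below is a simplified, hypothetical model of a VRAM write with auto increment (write_words is not the emulator's function).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kVramSize = 64 * 1024;

// Writes big-endian words to VRAM, advancing the address by `auto_increment`
// bytes after every word. Simplified model for illustration only.
void write_words(std::vector<std::uint8_t>& vram, std::uint16_t address,
                 const std::vector<std::uint16_t>& words,
                 std::uint8_t auto_increment) {
  assert(vram.size() == kVramSize);
  for (const std::uint16_t word : words) {
    vram[address] = static_cast<std::uint8_t>(word >> 8);
    vram[(address + 1) % kVramSize] = static_cast<std::uint8_t>(word & 0xFF);
    // The 16-bit address wraps around at the end of the 64 KB VRAM.
    address = static_cast<std::uint16_t>(address + auto_increment);
  }
}
```

With an increment of 128, three words written from address 0 land at offsets 0, 128, and 256, that is, in the same column of three consecutive rows.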

Out of all the games I've run, this one is probably the worst. Seems like its developers didn't try at all, making it for an older generation of consoles.

Oversized write to the VSRAM memory

This isn't shown on the top-level scheme of the Sega Genesis architecture, but the VRAM (main video memory, 64 KB), CRAM (128 bytes, 4 color palettes), and VSRAM (80 bytes, vertical shift) are separate for some reason. These independent blocks of memory look even funnier when we consider that the horizontal shift lies entirely in VRAM, but that's not the point.

Tiny Toon Adventures (1993) uses the same algorithm to zero CRAM and VSRAM. So, 128 bytes are written to the 80-byte VSRAM... If we don't handle it somehow, the emulator will segfault. The console offers a great deal of freedom, and that's just the tip of the iceberg.

Picture N35 – Tiny Toon Adventures (1993)
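
One way to survive such a write is a bounds check that drops anything past the end of VSRAM. This is a minimal sketch under that assumption; the emulator may handle the case differently (for example, by wrapping the address), and write_vsram is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVsramSize = 80;

// Writes a big-endian word to VSRAM and reports whether it was stored.
// Out-of-range writes are silently dropped (an assumption for this sketch).
bool write_vsram(std::array<std::uint8_t, kVsramSize>& vsram,
                 std::size_t address, std::uint16_t word) {
  if (address + 1 >= kVsramSize) {
    return false;  // the word would not fit entirely inside VSRAM
  }
  vsram[address] = static_cast<std::uint8_t>(word >> 8);
  vsram[address + 1] = static_cast<std::uint8_t>(word & 0xFF);
  return true;
}
```

Zeroing 128 bytes through this function stores the first 40 words and ignores the remaining 24.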

The game has nice graphics, the gameplay is average, and it has a hardcore Sonic-esque feel to it.

Calling DMA when it is disabled

The Flintstones (1993) had some strange behavior: the plane moved up just as much as it moved to the right. In other words, there were strange entries in the VSRAM. The cause turned out to be simple: the game tried to start DMA write operations while DMA was disabled, and DMA only works when a certain bit is set in a VDP register. Once I took that bit into account, the issue was fixed. The authors somehow wrote the logic incorrectly.

Picture N36 – The Flintstones (1993)
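
The check itself is tiny. According to commonly available VDP documentation, the DMA enable flag is bit 4 of VDP register 1 (mode register 2); the sketch below assumes that layout and uses hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr std::uint8_t kDmaEnableMask = 0x10;  // bit 4 of VDP register 1

// Returns true if DMA requests should be executed. When this is false,
// a DMA command written by the game must be ignored.
bool dma_enabled(const std::array<std::uint8_t, 24>& registers) {
  return (registers[1] & kDmaEnableMask) != 0;
}
```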

Single-byte register reads

Most guides say that registers are read as two-byte words, but in Jurassic Park (1993), a VDP register is read as a single byte. I had to support that.

Picture N37 – Jurassic Park (1993)
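
Since the m68k is big-endian, a byte read from a 16-bit register can be served by picking the right half of the word. A minimal sketch, with a hypothetical function name:

```cpp
#include <cassert>
#include <cstdint>

// Returns one byte of a 16-bit big-endian register value.
// Even addresses map to the high byte, odd addresses to the low byte.
std::uint8_t read_register_byte(std::uint16_t value, std::uint32_t address) {
  return (address % 2 == 0) ? static_cast<std::uint8_t>(value >> 8)
                            : static_cast<std::uint8_t>(value & 0xFF);
}
```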

Attempting to write to read-only memory

If you decompile one fragment of Spot Goes to Hollywood (1995), this happens:

  if (psVar4 != (short *)0x0) {
    do {
      sVar1 = psVar4[1];
      *(short *)(sVar1 + 0x36) = *(short *)(sVar1 + 0x36) + -2;
      *psVar4 = sVar1;
      psVar4 = psVar4 + 1;
    } while (sVar1 != 0);
    DAT_fffff8a0._2_2_ = DAT_fffff8a0._2_2_ + -2;
  }

So, there's an off-by-one error here, and the write lands at address 0x000036. Sega simply ignores it, since there's no analog of a segfault. Wait, we could do that all along? As it turns out, we can. Such quirks happen quite often: instead of returning an Error, the emulator has to log the access and do nothing.

Picture N38 – Spot Goes to Hollywood (1995)
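
The "log and do nothing" behavior might look like the sketch below. It assumes the usual memory map (cartridge ROM at 0x000000..0x3FFFFF, 64 KB of RAM at 0xFF0000..0xFFFFFF) and leaves out every other region; write_word and WriteResult are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

constexpr std::uint32_t kRomEnd = 0x3FFFFF;
constexpr std::uint32_t kRamBegin = 0xFF0000;

enum class WriteResult { Written, Ignored };

// Writes a big-endian word. Writes to ROM are logged and ignored
// instead of being reported as an error.
WriteResult write_word(std::vector<std::uint8_t>& ram, std::uint32_t address,
                       std::uint16_t value) {
  assert(ram.size() == 0x10000);
  address &= 0xFFFFFF;  // the bus is 24 bits wide
  if (address <= kRomEnd) {
    std::clog << "ignored write to ROM at 0x" << std::hex << address
              << std::dec << '\n';
    return WriteResult::Ignored;
  }
  if (address >= kRamBegin) {
    ram[address - kRamBegin] = static_cast<std::uint8_t>(value >> 8);
    ram[(address + 1 - kRamBegin) & 0xFFFF] =
        static_cast<std::uint8_t>(value & 0xFF);
    return WriteResult::Written;
  }
  return WriteResult::Ignored;  // other regions are omitted in this sketch
}
```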

Changing endianness at DMA in the VRAM fill mode

In Contra: Hard Corps (1994), I saw broken plane shifts. I added logs and saw that the game uses a rare VRAM fill mode to fill the horizontal shift table. After taking several closer looks, I confirmed that the written bytes somehow change their endianness... I had to create a cringey workaround:

    // change endianness in this case (example game: "Contra Hard Corps")
    if (auto_increment_ > 1) {
      if (ram_address_ % 2 == 0) {
        ++ram_address_;
      } else {
        --ram_address_;
      }
    }

Picture N39 – Contra: Hard Corps (1994)

The Z80 RAM dependency and other dependencies

The emulator doesn't support Z80 yet, but some games require it. For example, Mickey Mania (1994) freezes after starting. Opening the decompiler reveals that it reads the 0xA01000 address indefinitely until a non-zero byte appears. This is a Z80 RAM zone, so the game creates an implicit dependency between m68k and Z80.

Let's implement a new cringey workaround and return a random byte if it's a Z80 RAM read.

Unfortunately, there's another issue: the game now reads VDP H/V Counter at the 0xC00008 address.

Well, we'll create another workaround. Now, the game shows the splash screen and crashes again when it reads another unmapped address. Let's put the game away for a while before we reach a critical number of workarounds.

Picture N40 – The Mickey Mania (1994) splash screen

Another example is the Sonic the Hedgehog (1991) game, where I get into some sort of a debug mode because there are weird numbers in the upper left corner.

Picture N41 – Sonic the Hedgehog (1991) with two planes

Fortunately, the first Sonic game has long been reverse-engineered (GitHub). So, if you want to have fun, there's a way to fully support it.

Supporting Z80

What does Z80 do?

As previously mentioned, Zilog Z80 is a coprocessor designed for music playback. It has its own 8KB RAM and is connected to the YM2612 sound synthesizer.

Z80 is an ordinary processor that was used in previous generations of consoles.

How was the music for Genesis games created? Sega distributed a tool called GEMS under MS-DOS among developers. With GEMS, devs could create all kinds of sounds and use the development board to check what they would sound like on Genesis (what you hear is what you get).

However, many developers didn't bother to compose their own music but used default samples. This resulted in many unrelated games having the same sounds.

The sound was translated into a program called Sound Driver in Z80 assembly language and packed into the ROM cartridge with other data. While the game was running, m68k would read the sound driver from the cartridge ROM and load it into the Z80 RAM. Then, the Z80 processor would start producing sound via the program, which ran independently of m68k. So much for concurrency... You can watch this video to learn more about the music in Genesis.

How to support Z80

First, one must study the 332-page manual, create a Z80 emulator similar to the m68k one, cover it with tests, and run some programs on it. Then, one must learn some sound theory, study the YM2612 registers, and write a sound generator for Linux.

In terms of scope, it encompasses everything that I've previously described (m68k + VDP), or at least half of it—that's a lot to do.

What else can we do?

The article describes a setup that can run many games. Besides sound, there are all sorts of smaller things left to do.

Supporting the two-player mode

Currently, only one player is supported, but adding a second gamepad is possible.

Supporting HBLANK

Currently, only the VBLANK interrupt is raised, but HBLANK should also be raised after each line. In practice, only a few games use it. The most common use case is changing the palette in the middle of the image.
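
A heavily simplified sketch of what per-line HBLANK handling could look like is shown below. Here a "line" is reduced to a single backdrop color, the on_hblank callback stands in for the game's interrupt handler, and all names are hypothetical. The point is only the ordering: render a line, run the handler, then render the next line with whatever the handler changed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Renders a frame line by line. After each line, `on_hblank` runs and may
// modify `backdrop`, which takes effect starting from the next line.
std::vector<std::uint16_t> render_frame(
    std::size_t lines, std::uint16_t backdrop,
    const std::function<void(std::size_t, std::uint16_t&)>& on_hblank) {
  std::vector<std::uint16_t> frame;
  frame.reserve(lines);
  for (std::size_t line = 0; line < lines; ++line) {
    frame.push_back(backdrop);  // "render" the line with the current color
    on_hblank(line, backdrop);  // the handler may swap colors between lines
  }
  return frame;
}
```

A handler that swaps the color after line 111 yields a frame whose upper half uses the original color and whose lower half uses the new one, which is the kind of mid-frame palette split described above.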

The Ristar (1994) example

The Ristar (1994) game leverages this feature. Note the waves on the water surface and the wobbly columns below.

Picture N42 – Ristar (1994) running on my emulator (no HBLANK)

And here's what it should look like, as shown in a YouTube walkthrough:

Picture N43 – Ristar (1994) running on a proper emulator

This is particularly evident when Ristar is submerged in water, and the palette is always aquatic there:

Picture N44 – On the left is almost under the water level, on the right is completely under the water level

Supporting other controllers

Currently, only the three-button gamepad is supported. However, a six-button controller can be supported, as well as the rarer ones like Sega Mouse, Sega Multitap, Saturn Keyboard, Ten Key Pad, and even a printer.

A cooler debugger

The built-in debugger could be improved to let users view memory, set read/write breakpoints, and unwind the call stack. This would ultimately allow for much faster debugging.
