Last month, I decided to build a key-value database from scratch. Not because the world needs another database, but because I wanted to understand what actually happens behind the scenes when you call db.get() or db.put().
After weeks of wrestling with file formats, serialization, and the surprisingly complex world of "simple" storage systems, I've learned some hard lessons that no textbook quite prepared me for. Here are the five biggest insights that changed how I think about databases.
1. Your File Format Design Choices Haunt You Forever
When I started, I thought file format design would be the easy part. "Just throw some bytes in a file, right?" Wrong. Every single decision you make in your file format becomes permanent baggage that you'll carry for the life of your database.
I initially designed a complex header with 15 different fields:
Creation timestamp, last modified time, record count,
page count, configuration flags, user metadata,
version numbers, checksums...
It felt thorough and professional. Then I tried to implement it.
The reality: Most of those fields were never used, and maintaining them added complexity everywhere. Worse, I realized I'd committed to this format forever—any change would break compatibility with existing files.
The lesson: Start with the absolute minimum. For my key-value database, I ended up with just 16 bytes:
Bytes 0-7: File signature ("KVDB2024")
Bytes 8-11: First freelist page pointer
Bytes 12-15: First data page pointer
That's it. Everything else can be added later if you actually need it. Your future self will thank you for keeping it simple.
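To make the layout concrete, here's a minimal sketch of how a 16-byte header like this could be serialized and parsed. The struct and field names (header, FreelistPage, DataPage) are illustrative, not the post's actual code, and it assumes little-endian encoding for the two pointers:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// header mirrors the 16-byte layout described above.
// Field names are illustrative, not from the original design.
type header struct {
	Signature    [8]byte
	FreelistPage uint32
	DataPage     uint32
}

func (h *header) serialize() []byte {
	buf := make([]byte, 16)
	copy(buf[0:8], h.Signature[:])
	binary.LittleEndian.PutUint32(buf[8:12], h.FreelistPage)
	binary.LittleEndian.PutUint32(buf[12:16], h.DataPage)
	return buf
}

func parseHeader(buf []byte) (*header, error) {
	if len(buf) < 16 {
		return nil, fmt.Errorf("header too short: %d bytes", len(buf))
	}
	var h header
	copy(h.Signature[:], buf[0:8])
	if !bytes.Equal(h.Signature[:], []byte("KVDB2024")) {
		return nil, fmt.Errorf("invalid file signature")
	}
	h.FreelistPage = binary.LittleEndian.Uint32(buf[8:12])
	h.DataPage = binary.LittleEndian.Uint32(buf[12:16])
	return &h, nil
}

func main() {
	h := header{FreelistPage: 1, DataPage: 2}
	copy(h.Signature[:], "KVDB2024")
	parsed, err := parseHeader(h.serialize())
	fmt.Println(parsed.FreelistPage, parsed.DataPage, err)
}
```

Sixteen bytes, two helper functions, and nothing that can drift out of sync with reality.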
2. Endianness Will Bite You When You Least Expect It
I'm embarrassed to admit how long it took me to figure out why my database worked perfectly on my laptop but produced garbage on my friend's ARM-based server.
The culprit? I was storing integers without specifying byte order:
// This breaks on different architectures
func writePageNumber(buf []byte, pageNum uint32) {
	*(*uint32)(unsafe.Pointer(&buf[0])) = pageNum
}
The number 305419896 (0x12345678) was being read as 2018915346 (0x78563412) on the ARM machine. Same bits, different interpretation.
The fix was simple but crucial:
// Explicit byte order works the same on every architecture
func writePageNumber(buf []byte, pageNum uint32) {
	binary.LittleEndian.PutUint32(buf, pageNum)
}
The lesson: Always, always, ALWAYS specify your byte order explicitly. Even if you only plan to run on one type of machine, you'll eventually want to share database files or deploy somewhere else. Make endianness a conscious choice from day one.
3. Error Handling Is More Important Than the Happy Path
My first implementation focused entirely on making things work correctly. Reading files, writing data, parsing headers—when everything went right, it was beautiful.
Then I started testing edge cases:
- What if the file gets truncated?
- What if someone tries to open a JPEG as a database?
- What if the disk runs out of space mid-write?
- What if the process crashes during a header update?
My database crashed, corrupted data, or silently accepted garbage input in every single scenario.
The turning point came when I realized that error handling isn't just about making your code robust—it's about making your database trustworthy. A database that sometimes loses data is worse than no database at all.
func openExistingDatabase(file *os.File) (*Database, error) {
	// Read the 16-byte header. ReadAt returns an error whenever it
	// reads fewer bytes than requested, so this also catches
	// truncated files.
	headerData := make([]byte, 16)
	if _, err := file.ReadAt(headerData, 0); err != nil {
		return nil, fmt.Errorf("failed to read header: %w", err)
	}

	// Check the signature so we reject files that aren't ours
	// (like that JPEG someone will inevitably try to open)
	expectedSig := [8]byte{'K', 'V', 'D', 'B', '2', '0', '2', '4'}
	if !bytes.Equal(headerData[0:8], expectedSig[:]) {
		return nil, fmt.Errorf("invalid file signature")
	}

	// More validation (page pointers in range, etc.)...
	return &Database{file: file}, nil
}
The lesson: Write your error handling first, then implement the happy path. If your database can't fail gracefully, it can't be trusted with real data.
4. File I/O Is Asynchronous (Even When It Looks Synchronous)
This one nearly gave me a heart attack during testing.
I was running a simple test: write some data, immediately cut power to the machine, then check if the data survived. It didn't. Even though my file.Write() calls returned successfully, the data never made it to disk.
The problem: Operating systems buffer writes for performance. When you call write(), the OS says "sure, I'll get to that" and immediately returns success. Your data might sit in a buffer for seconds before actually hitting the disk.
For most applications, this is fine. For databases, it's catastrophic.
The solution: Learn to love fsync():
func writeHeader(file *os.File, header *DatabaseHeader) error {
	headerData := header.Serialize()

	// Write to the OS buffer
	if _, err := file.WriteAt(headerData, 0); err != nil {
		return err
	}

	// Force the write to disk. This is the magic.
	return file.Sync()
}
The lesson: If you care about durability, you must explicitly force data to disk. Every critical operation should end with a sync. Yes, it's slower. No, you can't skip it if you want your database to survive power failures.
5. Simplicity Is a Feature, Not a Bug
Throughout this project, I constantly felt pressure to add features. "Real databases have indexing, so I need indexing." "Production systems need compression, so I need compression." "Enterprise databases support transactions, so I need transactions."
This feature creep nearly killed my project.
The breakthrough came when I stepped back and asked: "What's the simplest thing that could possibly work?"
For a key-value database, that turned out to be surprisingly minimal:
- A file header pointing to the start of data
- Fixed-size pages containing variable-length records
- A simple append-only storage model
No fancy indexing (yet). No compression (yet). No complex transactions (yet). Just a system that can reliably store and retrieve key-value pairs.
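To show how little machinery an append-only key-value store really needs, here's a toy in-memory version of the idea: length-prefixed records appended to a log, with the last record for a key winning. The record layout and function names are my illustration here, not necessarily the exact format the database uses:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendRecord encodes a pair as
//   [4-byte key length][4-byte value length][key][value]
// and appends it to the log.
func appendRecord(log []byte, key, value []byte) []byte {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(key)))
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(value)))
	log = append(log, hdr[:]...)
	log = append(log, key...)
	log = append(log, value...)
	return log
}

// get scans the log from the start. The last record for a key wins,
// which is how an append-only store expresses updates.
func get(log []byte, key []byte) (value []byte, found bool) {
	for off := 0; off+8 <= len(log); {
		klen := int(binary.LittleEndian.Uint32(log[off : off+4]))
		vlen := int(binary.LittleEndian.Uint32(log[off+4 : off+8]))
		off += 8
		k := log[off : off+klen]
		v := log[off+klen : off+klen+vlen]
		off += klen + vlen
		if string(k) == string(key) {
			value, found = v, true
		}
	}
	return value, found
}

func main() {
	var log []byte
	log = appendRecord(log, []byte("color"), []byte("red"))
	log = appendRecord(log, []byte("color"), []byte("blue")) // an update is just another append
	v, ok := get(log, []byte("color"))
	fmt.Println(string(v), ok) // blue true
}
```

The scan-everything read path is obviously slow, which is exactly where an index would eventually come in. But it works, it's easy to reason about, and it's trivial to debug.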
The surprise: This simple design was faster, more reliable, and easier to debug than any of my complex attempts. More importantly, it actually worked.
The lesson: Every feature you don't implement is a feature that can't break. Build the simplest thing first, then add complexity only when you actually need it. Your simple database that works is infinitely better than your complex database that doesn't.
What I'd Tell My Past Self
If I could go back and give myself advice before starting this project:
- Start with file format design, but keep it minimal
- Specify endianness explicitly from day one
- Write error handling before implementing features
- Always sync critical writes to disk
- Resist the urge to add features until the basics work perfectly
Building a database taught me that the hardest part isn't the algorithms or data structures—it's handling all the ways things can go wrong in the real world. Files get corrupted, processes crash, disks fill up, and users try to open the wrong files.
A good database isn't just a system that works when everything goes right. It's a system that fails gracefully when everything goes wrong.