Unpacking CompactBinaryData (CBD): The Lean, Mean Binary Serialization Machine

5 min read · May 25, 2025

Introduction to CBD

CompactBinaryData (CBD) is a lightweight binary serialization format designed as an alternative to JSON for scenarios where data size and processing overhead matter most. JSON prioritizes human readability at the cost of verbosity, and binary formats such as BSON and MessagePack can add complexity or fall short on compactness. CBD aims for a balance: it reports a 40–60% size reduction compared to JSON for typical datasets, keeps JSON-like data structures, and avoids the computational overhead of external compression algorithms like gzip or Brotli.

The project is an open-source initiative under the MIT License, hosted on GitHub. Its design goals (compactness, performance, JSON compatibility, extensibility, and simplicity) make it a compelling choice for applications ranging from IoT devices to high-performance APIs.

Core Features and Design Philosophy

CBD’s feature set is tailored to address the inefficiencies of existing serialization formats:

Compact Representation: By employing dictionary encoding for repetitive keys and a 3-bit type system, CBD significantly reduces data size. For instance, in a JSON array like [{"name":"John","age":30},{"name":"Jane","age":25}], the keys “name” and “age” are stored once in a dictionary and replaced with 1-byte IDs wherever they occur, so the saving grows with every repeated key.
Fast Serialization/Deserialization: The format’s bit-packed structures and variable-length encoding minimize CPU overhead, making it faster than decompressing gzipped JSON.
JSON Compatibility: CBD supports all JSON data types (objects, arrays, strings, numbers, booleans, null) with lossless round-tripping, ensuring seamless integration with existing systems.
Extensibility: Reserved type codes and header bits allow for future data types and custom extensions, ensuring longevity.
No External Compression: Unlike formats that rely on gzip, CBD achieves compactness natively, reducing processing time.
Debugging and Conversion Tools: A human-readable debugging mode and utilities for converting between CBD, JSON, MessagePack, and BSON enhance usability.

The design philosophy emphasizes simplicity and efficiency. By avoiding complex compression algorithms and focusing on native compactness, CBD is particularly suited for resource-constrained environments.

Technical Structure of the CBD Format

The CBD format, detailed in format_specification.md, is structured into three main sections: Header, Dictionary, and Data. This modular design ensures both compactness and ease of parsing.

Header (5 Bytes)

The header is a concise 5-byte structure:

Magic Number (2 bytes): 0xCBD1 identifies CBD files and aids in corruption detection.
Version (1 byte): Currently 0x01, allowing for future format evolution.
Dictionary Size (2 bytes): A big-endian unsigned integer indicating the number of dictionary entries (up to 65,535 unique keys).
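
As a rough illustration of the layout above, the header can be packed and validated with Python's struct module. This is a minimal sketch, not the project's code: the helper names pack_header and unpack_header are hypothetical, and it assumes the magic number is written big-endian like the dictionary size.

```python
import struct

MAGIC = 0xCBD1    # 2-byte magic number from the header description
VERSION = 0x01    # current format version

def pack_header(dict_size: int) -> bytes:
    """Pack magic, version and dictionary size into 5 bytes (big-endian, assumed)."""
    if not 0 <= dict_size <= 0xFFFF:
        raise ValueError("dictionary size must fit in 2 bytes")
    return struct.pack(">HBH", MAGIC, VERSION, dict_size)

def unpack_header(buf: bytes) -> int:
    """Validate the header and return the dictionary size."""
    if len(buf) < 5:
        raise ValueError("buffer too short for a CBD header")
    magic, version, dict_size = struct.unpack(">HBH", buf[:5])
    if magic != MAGIC:
        raise ValueError("bad magic number, not a CBD payload")
    if version != VERSION:
        raise ValueError(f"unsupported version {version}")
    return dict_size

header = pack_header(4)
print(header.hex())           # cbd1010004
print(unpack_header(header))  # 4
```

Checking the magic number first gives a cheap way to reject input that is not CBD before any further parsing.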

Dictionary

The dictionary compresses repetitive keys by storing them as UTF-8 encoded strings, each prefixed with a variable-length integer (1–2 bytes) for length. Keys are assigned 1-based numeric IDs, which are used in the data section to minimize redundancy. For example, in a dataset with multiple “name” and “age” fields, these keys appear only once in the dictionary, replaced by compact IDs elsewhere.
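
The dictionary step can be sketched in a few lines. The code below is an illustration under assumptions, not the reference implementation: build_dictionary, encode_dictionary and encode_varint are hypothetical names, IDs are assigned in first-seen order, and only top-level keys of a list of records are collected.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low group first."""
    if n < 0:
        raise ValueError("varints here are unsigned")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)

def build_dictionary(records):
    """Assign 1-based IDs to keys in first-seen order (assumed ordering)."""
    ids = {}
    for record in records:
        for key in record:
            if key not in ids:
                ids[key] = len(ids) + 1
    return ids

def encode_dictionary(ids):
    """Emit each key as a varint length prefix followed by its UTF-8 bytes."""
    out = bytearray()
    for key in ids:  # dicts keep insertion order, so position matches ID
        raw = key.encode("utf-8")
        out += encode_varint(len(raw)) + raw
    return bytes(out)

records = [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]
ids = build_dictionary(records)
print(ids)                     # {'name': 1, 'age': 2}
print(encode_dictionary(ids))  # b'\x04name\x03age'
```

Each key costs its own length plus a short prefix exactly once, no matter how many records reuse it.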

Data Section

The data section encodes the actual data structure using a 3-bit type system embedded in an 8-bit type byte:

Type Byte: Bits 7–5 encode the type (e.g., 000 for null, 001 for boolean, 101 for object). Bit 0 indicates if the type is a container (array/object) or scalar.
Number Encoding: Integers use variable-length encoding (varints); the format also defines 32-bit/64-bit integer and float encodings (see the limitations below for what version 0.1.0 supports).
String Encoding: Strings are UTF-8 encoded, prefixed with a varint for length.
Array/Object Encoding: Both are length-prefixed, with arrays containing sequential elements and objects containing key-value pairs (keys as dictionary indices).
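
The type byte layout can be illustrated with a little bit arithmetic. This sketch uses only the three type codes named above; the helper names are hypothetical, and it assumes that a set bit 0 means "container" and that bits 4–1 are left as zero.

```python
# Type codes taken from the examples in the text
TYPE_NULL = 0b000
TYPE_BOOLEAN = 0b001
TYPE_OBJECT = 0b101

def make_type_byte(type_code: int, is_container: bool = False) -> int:
    """Place the 3-bit type in bits 7-5 and the container flag in bit 0."""
    if not 0 <= type_code <= 0b111:
        raise ValueError("type code must fit in 3 bits")
    return (type_code << 5) | (1 if is_container else 0)

def split_type_byte(byte: int):
    """Return (type_code, is_container) for a type byte."""
    return (byte >> 5) & 0b111, bool(byte & 0b1)

obj = make_type_byte(TYPE_OBJECT, is_container=True)
print(hex(obj))              # 0xa1
print(split_type_byte(obj))  # (5, True)
```

Keeping the type in the top three bits means a parser can dispatch on a single shift and mask per value.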

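The use of varints is particularly noteworthy. Each byte carries 7 bits of the value, and its high bit signals whether another byte follows, so values below 128 fit in a single byte. For example, the number 300 (0x12C) is encoded as two bytes: 0xAC (128 | 44) and 0x02.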
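
The 300 example can be reproduced with a short encoder and decoder. This is a sketch of the common base-128 scheme (low 7-bit group first), which matches the bytes given above; the function names are hypothetical and not taken from the CBD library.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low group first."""
    if n < 0:
        raise ValueError("this sketch handles unsigned integers only")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # set the continuation bit
        else:
            out.append(low)
            return bytes(out)

def decode_varint(buf: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

print(encode_varint(300).hex())    # ac02
print(decode_varint(b"\xac\x02"))  # (300, 2)
```

Returning the next position from decode_varint lets a parser walk a buffer of consecutive values without slicing.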

Example Encoding

Consider the JSON object:

{
  "name": "John",
  "age": 30,
  "scores": [95, 87, 92],
  "active": true
}

CBD encodes this as:

Header: 5 bytes (0xCBD1, 0x01, 0x0004 for 4 keys).
Dictionary: 22 bytes for keys “name”, “age”, “scores”, “active”.
Data: 19 bytes, including type bytes, varints, and values.
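Total: 46 bytes, compared to 54 bytes for JSON, a saving of about 15%. The gain is modest here because a single small object pays the full dictionary cost without reusing any key; the savings grow as keys repeat.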

Implementation and Usage

A Python implementation is available, with installation via:

git clone https://github.com/makalin/CBD.git
cd CBD
pip install -r requirements.txt

Serialization and deserialization are straightforward:

from cbd import CBD
data = {"name": "John", "age": 30, "scores": [95, 87, 92], "active": True}
binary_data = CBD.serialize(data)
original_data = CBD.deserialize(binary_data)

Format conversion utilities support interoperability with JSON, MessagePack, and BSON, while a command-line tool simplifies file conversions. A benchmark suite measures serialization/deserialization times, data sizes, and memory usage, with results visualized in plots.

Performance and Benchmarks

Preliminary benchmarks highlight CBD’s strengths:

Size Reduction: 40–60% smaller than JSON for datasets with repetitive keys, due to dictionary compression.
Parsing Speed: 20–30% faster than gzipped JSON, as no decompression is needed.
Comparison to MessagePack: Comparable in size but simpler to implement.

These metrics make CBD ideal for applications where bandwidth and processing power are limited, such as IoT devices or real-time data streaming.
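
The JSON and gzipped-JSON baselines behind these comparisons can be measured with the standard library alone. The sketch below is a hypothetical harness, not the project's benchmark suite; the synthetic records and the run count are arbitrary choices, and timings will vary by machine.

```python
import gzip
import json
import timeit

# Synthetic dataset with repetitive keys (illustrative only)
records = [
    {"name": f"user{i}", "age": 20 + i % 50, "active": i % 2 == 0}
    for i in range(1000)
]

raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
zipped = gzip.compress(raw)

def parse_plain():
    return json.loads(raw)

def parse_gzipped():
    # gzipped JSON must be decompressed before it can be parsed
    return json.loads(gzip.decompress(zipped))

runs = 50
plain_s = timeit.timeit(parse_plain, number=runs)
gzip_s = timeit.timeit(parse_gzipped, number=runs)

print(f"JSON: {len(raw)} bytes, gzipped JSON: {len(zipped)} bytes")
print(f"{runs} parses: plain {plain_s:.4f}s, gzipped {gzip_s:.4f}s")
```

To put CBD on the same chart, one would presumably time CBD.deserialize on the output of CBD.serialize(records) and compare its length against len(raw) and len(zipped).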

Limitations and Future Directions

Version 0.1.0 has limitations, including support only for unsigned integers (floating-point numbers are planned) and potential overhead for small datasets due to the dictionary. The roadmap includes:

Libraries in JavaScript and Rust.
Support for custom data types (e.g., dates, binary blobs).
Streaming and schema support.
Enhanced benchmarks and conversion tools.

The reserved type codes (110, 111) and header bits ensure extensibility, positioning CBD for future enhancements like compact float encoding and streaming support.
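
The small-dataset overhead mentioned above can be estimated with a back-of-the-envelope model. This is a simplified sketch that counts only key-related bytes and the 5-byte header; it ignores type bytes and value encodings, and assumes 1-byte length prefixes and 1-byte key IDs. The function names are hypothetical.

```python
HEADER_BYTES = 5  # magic (2) + version (1) + dictionary size (2)

def json_key_bytes(keys, n_records):
    """Each key occurrence in JSON costs its length plus two quotes and a colon."""
    return n_records * sum(len(k.encode("utf-8")) + 3 for k in keys)

def cbd_key_bytes(keys, n_records):
    """Header, plus the dictionary stored once, plus a 1-byte ID per occurrence."""
    dictionary = sum(1 + len(k.encode("utf-8")) for k in keys)
    return HEADER_BYTES + dictionary + n_records * len(keys)

keys = ["name", "age"]
for n in (1, 2, 10, 100):
    print(n, json_key_bytes(keys, n), cbd_key_bytes(keys, n))
```

Under these assumptions a single record costs more in CBD (16 bytes against 13), while from the second record onward the dictionary pays for itself.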

Why CBD Matters

CBD fills a niche in the serialization landscape. JSON’s verbosity and external compression overhead make it suboptimal for constrained environments. Existing binary formats like MessagePack or BSON, while efficient, may introduce complexity or lack sufficient compactness for specific use cases. CBD’s dictionary-based compression, bit-packed types, and native compactness offer a compelling alternative, particularly for datasets with repetitive keys or resource-constrained systems.

Conclusion

CompactBinaryData (CBD) is a promising serialization format that balances compactness, performance, and simplicity. Its technical design — leveraging a structured header, dictionary compression, and a flexible type system — makes it a versatile choice for modern applications. As the project evolves with planned extensions and additional language support, CBD has the potential to become a go-to solution for efficient data serialization. For developers seeking to “serialize smarter, not harder,” CBD is worth exploring.

For more details, visit github.com/makalin/CBD or contact makalin@gmail.com.