The Sunlight CT log

Previously “A different kind of CT log” or “The $4k log”

Filippo Valsorda <sunlight@filippo.io>
Created: 6 November 2023 | Updated: 13 March 2024
https://filippo.io/a-different-CT-log

This is a design document for a radically cheaper and easier-to-operate Certificate Transparency log, backed by strongly consistent object storage, that can scale to 30x the current issuance rate for 2-10% of the costs, with no merge delay. It exposes the standard submission APIs and produces standard RFC 6962 SCTs and STHs, guaranteeing complete compatibility with CAs and TLS clients. However, leaves and proofs are made available to monitors and auditors not through the RFC 6962 read APIs but as tiles, which are easy to cache and compress.

Additional resources, including test logs, a complete implementation, and a formal specification of the monitoring API are available at sunlight.dev. If you have feedback on the design, please join the conversation on the ct-policy mailing list, or in the #sunlight channel of the transparency-dev Slack.

API differences

The log exposes RFC 6962-compatible add-chain, add-pre-chain, and get-roots endpoints on the write path.

On the read path, get-sth, get-sth-consistency, get-proof-by-hash, get-entries, and get-entry-and-proof endpoints are replaced by exposing highly cacheable log tiles, like in the Go Checksum Database, a checkpoint (for the STH), and a bundle of issuer certificates.

In short, log tiles are static files containing concatenations of the “bottom row” of a portion of the Merkle tree (or the corresponding leaves, at “level -1”). Clients can obtain inclusion and consistency proofs by fetching a few tiles in parallel (in this design, up to five even accounting for 30x the current issuance rate, totaling 32B × 256 × 5 = 40kB at most). [Diagram by Russ Cox.]

In particular, get-sth-consistency, get-entries, and get-entry-and-proof can all be replaced with a few parallel tile fetches. get-proof-by-hash would require a full copy of the tree, but instead this log takes advantage of the lack of merge delay to embed the leaf index in an SCT extension. Clients can then use that to perform the equivalent of a get-entry-and-proof instead. The extension costs just 8 bytes.
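As a sketch of the arithmetic involved (the function name is illustrative, not part of the Sunlight API), a client holding a leaf index from the SCT extension can compute which tile to fetch at each level:

```go
package main

import "fmt"

// tileHeight is the tile height used throughout this document:
// each tile holds 2^8 = 256 hashes.
const tileHeight = 8

// tilesForLeaf returns, for each tile level, the index of the tile
// containing the hashes needed for an inclusion proof of leaf n.
// An illustrative sketch of the addressing scheme.
func tilesForLeaf(n uint64, treeLevels int) []uint64 {
	tiles := make([]uint64, treeLevels)
	for l := range tiles {
		tiles[l] = n >> (tileHeight * uint(l+1))
	}
	return tiles
}

func main() {
	// A six-month shard at 30x the current rate is about 5 tile levels
	// high, so at most 5 fetches of 256 × 32 bytes = 8kB each.
	fmt.Println(tilesForLeaf(33_000_000_000, 5))
}
```

The fetches are independent, so they can all be issued in parallel.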

Tiles are far easier to serve than dynamic endpoints, as they require no database, are cacheable as static assets, and can be served compressed. Arguably, even without the SCT index extension, tiles are a better read model for most clients too, as the read endpoints are usually used by services that need a full view of the tree anyway (Cert Spotter, crt.sh, UAs doing SCT auditing).

Tiles are just a different “serialization format” for a Merkle tree: all the hashes in the tree and the signatures are the same as in RFC 6962. In fact, it’s possible to run one or more read replicas for full RFC 6962 compliance by maintaining a reliable leaf hash-to-index lookup table and performing tile fetches (and chain builds) in response to RFC 6962 requests. These read replicas would increase operational complexity, double local node storage requirements, and give up the caching and bandwidth advantages of tiles. We argue that not shifting all this complexity away from log operators would be a missed opportunity to improve the health and resiliency of the CT ecosystem.

Tile, STH, and SCT extension formats

The STH is serialized as a checkpoint with an RFC6962NoteSignature. Note that such a checkpoint can be converted to a get-sth response and vice versa.

example.com/TestLog
1
Rj6bscQal4ZK99UHjYbd8YW8hDgfEI32efLi5KaD/DQ=

— example.com/TestLog xxpbCAAAAYu6clr9BAMASDBGAiEAongGvaDmW6s9nUYvOob1+CD57FjNxGliZSjLHz8nNVYCIQCnnBvC1g4khQv5DYw5BFIZRD11lH14mDfeSMc0pUxB3Q==
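To illustrate the conversion, here is a minimal sketch (with abbreviated validation; the signature line is handled separately, as a signed note) that extracts the get-sth fields from the three-line checkpoint body shown above:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
)

// parseCheckpointBody extracts the origin, tree size, and root hash from
// a checkpoint body: origin line, decimal tree size, base64 root hash.
func parseCheckpointBody(body string) (origin string, size uint64, root []byte, err error) {
	lines := strings.SplitN(body, "\n", 4)
	if len(lines) < 3 {
		return "", 0, nil, fmt.Errorf("malformed checkpoint")
	}
	origin = lines[0]
	size, err = strconv.ParseUint(lines[1], 10, 64)
	if err != nil {
		return "", 0, nil, err
	}
	root, err = base64.StdEncoding.DecodeString(lines[2])
	return origin, size, root, err
}

func main() {
	origin, size, root, err := parseCheckpointBody(
		"example.com/TestLog\n1\nRj6bscQal4ZK99UHjYbd8YW8hDgfEI32efLi5KaD/DQ=\n")
	fmt.Println(origin, size, len(root), err)
}
```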

Level -1 (“data”) tiles are sequences of TileLeaf structures. The format is somewhat redundant, but the overhead compresses well, and embedding a TimestampedEntry reduces the mangling required to verify leaves.

struct {
        TimestampedEntry timestamped_entry;
        select(entry_type) {
                case x509_entry: Empty;
                case precert_entry: PreCertExtraData;
        } extra_data;
} TileLeaf;

struct {
        ASN.1Cert pre_certificate;
        opaque PrecertificateSigningCertificate<0..2^24-1>;
} PreCertExtraData;
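Both ASN.1Cert and the PrecertificateSigningCertificate field above use 24-bit length prefixes. A minimal helper to consume such a field might look like this (a sketch, not the Sunlight codebase):

```go
package main

import (
	"errors"
	"fmt"
)

// readOpaque24 consumes one opaque<0..2^24-1> field (a 3-byte big-endian
// length followed by that many bytes) and returns the field contents and
// the remainder of the input.
func readOpaque24(b []byte) (data, rest []byte, err error) {
	if len(b) < 3 {
		return nil, nil, errors.New("truncated length prefix")
	}
	n := int(b[0])<<16 | int(b[1])<<8 | int(b[2])
	if len(b) < 3+n {
		return nil, nil, errors.New("truncated field")
	}
	return b[3 : 3+n], b[3+n:], nil
}

func main() {
	data, rest, err := readOpaque24([]byte{0, 0, 2, 0xde, 0xad, 0xff})
	fmt.Println(data, rest, err)
}
```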

Data tiles are served with Content-Encoding gzip, saving storage and bandwidth.

The CTExtensions field (currently opaque) is redefined to be a list of Extensions, similarly to RFC 5246, but with a one-byte ExtensionType. We define the leaf_index extension. The extension_data field of this extension contains a LeafIndex value.

enum {
        leaf_index(0), (255)
} ExtensionType;


struct {
        ExtensionType extension_type;
        opaque extension_data<0..2^16-1>;
} Extension;

Extension CTExtensions<0..2^16-1>;

uint8 uint40[5];
uint40 LeafIndex;
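Putting the definitions above together, one leaf_index Extension serializes to exactly 8 bytes: a one-byte type, a two-byte length, and a 5-byte big-endian index. A sketch of the encoding:

```go
package main

import (
	"errors"
	"fmt"
)

// encodeLeafIndex serializes one leaf_index Extension: a one-byte
// ExtensionType (leaf_index = 0), a two-byte extension_data length,
// and a big-endian uint40 LeafIndex. Total: 8 bytes.
func encodeLeafIndex(index uint64) ([]byte, error) {
	if index >= 1<<40 {
		return nil, errors.New("leaf index overflows uint40")
	}
	return []byte{
		0,    // extension_type: leaf_index(0)
		0, 5, // extension_data length: 5
		byte(index >> 32), byte(index >> 24), byte(index >> 16),
		byte(index >> 8), byte(index),
	}, nil
}

// decodeLeafIndex parses the extension produced above.
func decodeLeafIndex(ext []byte) (uint64, error) {
	if len(ext) != 8 || ext[0] != 0 || ext[1] != 0 || ext[2] != 5 {
		return 0, errors.New("not a leaf_index extension")
	}
	return uint64(ext[3])<<32 | uint64(ext[4])<<24 | uint64(ext[5])<<16 |
		uint64(ext[6])<<8 | uint64(ext[7]), nil
}

func main() {
	ext, _ := encodeLeafIndex(123456789)
	fmt.Println(len(ext)) // 8
	idx, _ := decodeLeafIndex(ext)
	fmt.Println(idx) // 123456789
}
```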

Architecture

The log is served by a single service on a single node (see “On availability” below), backed by a strongly consistent object storage such as S3 or a high-availability filesystem, and a database with compare-and-swap semantics such as DynamoDB or a filesystem with file locking.

  • The object storage backend uses one bucket per log to store tiles, the latest STH, and a bundle of every issuer (intermediate and root CAs) observed in submitted chains. Serving the monitoring API directly from object storage is possible, although different deployments might make different CDN and proxying decisions.
  • The compare-and-swap database is a private global table that stores the latest STH of every log. See “On robustness” below for the purpose of this component.

The node maintains in memory the right edge of the Merkle tree, which is sufficient for appending entries and producing new tiles.

Incoming add-chain requests are pooled in memory, and held until their leaves are integrated into the log. Every second, the pool is drained, new tiles are produced, and they are uploaded (in parallel) to object storage. Then, a new STH is signed and uploaded first to the database with a compare-and-swap, and then to object storage. At that point, SCTs are signed and returned for the pending add-chain requests.

The node maintains a local per-log SQLite table mapping certificate hash to timestamp and leaf index (for the SCT index extension), to deduplicate add-chain requests. Deterministic ECDSA (RFC 6979) is used to sign SCTs, removing the need to store signatures. (Storing the signatures would double the size of the local cache, and ECDSA signatures are fast enough.) Importantly, this cache can be best-effort and lossy: if some entries were lost, the log can just accept a few duplicates (and even if the local cache were completely lost, that would only double the size of the log). This allows using a replication mechanism like Litestream or EBS snapshots with the potential for partial data loss.

On availability

Being served by a single node, this log naturally has a lower availability ceiling than multi-node, leader election-based logs. We believe this is an acceptable tradeoff in the CT ecosystem.

The CT log policy SLO is 99% availability. Targeting 99.5% for a healthy margin, that allows 50 minutes of downtime per week. Planned maintenance will take dramatically less than that, as a cold start involves 8 GET requests to object storage (see below). In case of a catastrophic availability failure (such as an availability zone outage), the log can be restarted quickly (in minutes, with a well-rehearsed playbook) from its configuration, object storage, and the deduplication cache replica.

At the ecosystem level, CAs already submit to more than two logs to optimize issuance time and tolerate log downtime. That’s what makes a 99% SLO tolerable: CT is effectively already a high-availability distributed system, with redundancy at the log level. Taking on complexity to provide high availability at the individual log level is in fact probably harming the ecosystem, by reducing the number of operators. We also point out that multiple current multi-node logs struggle to reach 99% availability today.

Note that single-node logs are not intrinsically more vulnerable to disqualifying bugs unrelated to the availability SLO, like bit flips. This means there will be no need for changes in the CT policy (such as number of SCTs, or compliance monitoring periods). Moreover, each of these logs can scale vertically to 30x the current overall WebPKI issuance rate (and probably significantly more), so there is no need for CAs to spread the load either.

On robustness

This design doesn’t actually require the compare-and-swap database, and could run exclusively on top of a strongly consistent object storage. However, the compare-and-swap backend protects against two potentially lethal operational issues: first, if two instances of the log are run against the same object storage location, which can only be detected with a compare-and-swap; second, if a log is mistakenly initialized with a key that was previously used. The latter is prevented because unlike the object storage location, which is designed to be per-log so that buckets can be discarded once a log shard is Retired, the compare-and-swap database is designed to be global and permanent: it is keyed by log ID (the hash of the public key), and it stores only the latest checkpoint, requiring a negligible amount of storage.

The compare-and-swap backend also helps relax the criticality of the object storage consistency guarantees. For example, Amazon recommends hedging requests to reduce tail latency: if a request is taking too long, fire off a second identical request in parallel, and cancel one when the other succeeds. This is fine if cancellation is effective, but what if it isn’t? Then the “losing” request might succeed later, in theory even after a subsequent checkpoint was uploaded, rolling it back. (Or, more mundanely, operational error might lead to recovering object storage from a backup.) If the node were to restart during this rolled back state, the tree would fork if it didn’t have the compare-and-swap backend to load the correct STH from.

Overall, as long as all log instances point to the same database and the database is not deleted or rolled back, it is impossible for operational error to lead to a tree split. Margin for error can be further reduced by provisioning only one suitable database in the relevant cloud account or system.

Implementation overview

On startup, the log service

  • downloads the current STH from the compare-and-swap database and verifies it
  • downloads the current STH from object storage and compares it with the one above
  • (If the STH in object storage is older, but for the same tree, loading can continue using the STH fetched from the compare-and-swap database.)
  • downloads the right edge of the tree from tiles in object storage
  • downloads the issuers bundle from object storage
  • (These are not the trusted roots, which are configured by the operator and can change over time, but an append-only bundle of all the roots and intermediates observed in verified submitted chains, to allow clients to chain-build the leaves in the tree.)
  • checks the right edge is consistent with the STH
  • (This should be sufficient to prevent an adversarial backend from permanently breaking the log or injecting leaves.)

When receiving an add-chain or add-pre-chain request, the log

  • checks for a duplicate in the local cache; if one is found, signs and returns an SCT with the original timestamp and leaf index extension
  • verifies the certificates
  • if an intermediate or root in the chain is not in the issuer bundle, takes a mutex and uploads a new version of the bundle to object storage (this is going to be infrequent)
  • (Clients who care are expected to be able to chain-build using these bundles, instead of relying on the unsigned, out-of-tree chain in the get-entries extra_data.)
  • checks if the current pool is full, and if so applies backpressure by returning a 503 Service Unavailable with a Retry-After header
  • adds the certificate to the current in-memory pool
  • takes the pool’s mutex (which can take up to 1s + one sequencing round)
  • returns an SCT with the leaf index provided through the pool
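The pooling and backpressure described above can be sketched as follows (types and names are illustrative; the complete implementation is at sunlight.dev):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type pool struct {
	pending    [][]byte      // serialized leaves awaiting sequencing
	firstIndex uint64        // set by the sequencer before closing done
	done       chan struct{} // closed once the new STH is published
}

type log struct {
	mu          sync.Mutex
	current     *pool
	maxPoolSize int
}

// The handler maps this to a 503 with a Retry-After header.
var errPoolFull = errors.New("pool full")

// addLeaf queues a leaf in the current pool, applies backpressure if the
// pool is full, and blocks until the sequencer assigns an index (up to
// 1s plus one sequencing round).
func (l *log) addLeaf(leaf []byte) (uint64, error) {
	l.mu.Lock()
	p := l.current
	if len(p.pending) >= l.maxPoolSize {
		l.mu.Unlock()
		return 0, errPoolFull
	}
	pos := uint64(len(p.pending))
	p.pending = append(p.pending, leaf)
	l.mu.Unlock()
	<-p.done
	return p.firstIndex + pos, nil
}

func main() {
	l := &log{current: &pool{done: make(chan struct{})}, maxPoolSize: 2}
	p := l.current
	go func() { // a stand-in for the 1s sequencing tick
		p.firstIndex = 42 // the tree size before this pool was integrated
		close(p.done)
	}()
	fmt.Println(l.addLeaf([]byte("leaf")))
}
```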

Every 1s (regardless of whether new certificates were added), a concurrent background thread

  • replaces the current pool with an empty one
  • serializes the certificates and produces new tiles
  • uploads the new tiles to object storage in parallel
  • (Note that if the node crashes at this point, clients might observe tiles that extend beyond the size of any STH, and that contain bogus data. This is not a problem because those hashes are not part of any STH.)
  • signs and uploads the new STH to the database with a compare-and-swap
  • (If the node crashes at this point, it can recover by ignoring the older checkpoint in object storage.)
  • uploads the new STH to object storage
  • updates the local deduplication cache
  • stores the new leaf indexes in the pool and unlocks its mutex
  • if more than, say, 500ms elapsed, raises an alert
  • (This alert means the sequencer is falling behind and the pool size should be reduced. If sequencing takes longer than 1s, the pool fills, naturally causing backpressure to be applied.)

The local deduplication cache is replicated to object storage with Litestream. (This can also be an SQLite-level backup, or an atomic filesystem- or block-level snapshot.)

A single service can serve multiple logs, each at its own API endpoint and each with its own object storage bucket and deduplication cache. The compare-and-swap database is unique and global.

Cost analysis

All the following figures are based on 2,100 add[-pre]-chain requests per second, 30x the current rate.

A six-month shard grows to about 33B entries. For a tile height of 8, that’s 5 tiles high.

The object storage size is dominated by the ASN.1 certificates, which assuming an average certificate size of 2,340 bytes (kind of made up), add up to 78TB. The log overhead (64 + 8 + 24 + 16 bits of TimestampedEntry + 8 bytes of SCT extension per leaf in level -1 tiles, 256 bits per leaf in level 0 tiles, negligible data in level 1+ tiles, negligible data in STH and issuer bundles) adds up to less than 2TB. Compression (HTTP-level gzip) experimentally reduces the size of data tiles by 60-70%.

The log produces 8 PUT requests per second (21M per month): up to three level -1 and 0 tiles, on average one level 1 tile, and the STH. The upload bandwidth is in the single digits MB/s. (If backed by disks, this translates to a few hundred write IOPS.)

The local cache takes 128 bits of key, 40 bits of index, and 64 bits of timestamp per entry, which adds up to about 1TB, plus database overhead. The write IOPS and Mbit/s are trivial (in the single digits).

Memory requirements are trivial. CPU load is dominated by chain verification and SCT signing. The test logs are running on a single shared-4x-cpu@1024MB Fly.io machine.

At the current WebPKI issuance rate, it should be possible to run the write side of this log for $4k/year.

Acknowledgements

This design is based on the original Certificate Transparency specification, on the Go Checksum Database developed with Russ Cox, and on the feedback of many individuals in the WebPKI community, and in particular of the Sigsum, Google TrustFabric, and ISRG teams.

Sunlight's development was sponsored by Let's Encrypt.

Changelog

  • Nov 7th: added “On availability” section
  • Nov 10th: added “Tile and STH formats” section
  • Nov 14th: removed get-sth from supported RFC 6962 APIs
  • Nov 14th: merged the intermediates and roots bundles
  • Nov 14th: formally specified the SCT extension
  • Nov 15th: added “Prototype” section
  • Nov 27th: switched tile height to 8
  • Dec 2nd: serve data tiles with Content-Encoding gzip
  • Dec 19th: deterministic ECDSA and Litestream for the cache
  • Dec 22nd: added compare-and-swap database
  • Feb 8th: incorporated editorial comments and updated title
  • Mar 3rd: switched rome.ct.filippo.io to Fly.io
  • Mar 7th: added c2sp.org/sunlight specification
  • Mar 12th: added architecture diagram
  • Mar 13th: edited for publication