The Sunlight CT log
Previously “A different kind of CT log” or “The $4k log”
Filippo Valsorda <sunlight@filippo.io>
Created: 6 November 2023 | Updated: 13 March 2024
https://filippo.io/a-different-CT-log
This is a design document for a radically cheaper and easier-to-operate Certificate Transparency log, backed by consistent object storage, that can scale to 30x the current issuance rate at 2-10% of the cost, with no merge delay. It exposes the standard submission APIs and produces standard RFC 6962 SCTs and STHs, guaranteeing complete compatibility with CAs and TLS clients. However, it makes leaves and proofs available to monitors and auditors not through the RFC 6962 read APIs but as tiles, which are easy to cache and compress.
Additional resources, including test logs, a complete implementation, and a formal specification of the monitoring API are available at sunlight.dev. If you have feedback on the design, please join the conversation on the ct-policy mailing list, or in the #sunlight channel of the transparency-dev Slack.
The log exposes RFC 6962-compatible add-chain, add-pre-chain, and get-roots endpoints on the write path.
On the read path, get-sth, get-sth-consistency, get-proof-by-hash, get-entries, and get-entry-and-proof endpoints are replaced by exposing highly cacheable log tiles, like in the Go Checksum Database, a checkpoint (for the STH), and a bundle of issuer certificates.
In short, log tiles are static files containing concatenations of the “bottom row” of a portion of the Merkle tree (or the corresponding leaves, at “level -1”). Clients can obtain inclusion and consistency proofs by fetching a few tiles in parallel (in this design, up to five tiles even accounting for 30x the current issuance rate, totaling 32B × 256 × 5 = 40kB at most). [Diagram by Russ Cox.]
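For illustration, here is a minimal sketch of the tile coordinate math in Go, assuming this design’s tile height of 8 (256 hashes per tile); the function names are hypothetical, not part of the Sunlight API.

package main

import "fmt"

const tileHeight = 8
const tileWidth = 1 << tileHeight // 256 hashes per full tile

// tileForLeaf returns the index of the level-0 tile containing the hash of
// the given leaf, and the hash's offset within that tile.
func tileForLeaf(leafIndex uint64) (tileIndex, offset uint64) {
	return leafIndex / tileWidth, leafIndex % tileWidth
}

// tileLevels returns how many tile levels a tree of size n spans: each level
// compresses the one below by a factor of 256, and an inclusion proof needs
// at most one tile per level.
func tileLevels(n uint64) int {
	levels := 0
	for n > 1 {
		n = (n + tileWidth - 1) / tileWidth
		levels++
	}
	return levels
}

func main() {
	fmt.Println(tileLevels(33_000_000_000)) // 5, for a six-month shard at 30x the current rate
	fmt.Println(tileForLeaf(12345678))      // 48225 78
}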
In particular, get-sth-consistency, get-entries, and get-entry-and-proof can all be replaced with a few parallel tile fetches. get-proof-by-hash would require a full copy of the tree; instead, this log takes advantage of the lack of merge delay to embed the leaf index in an SCT extension, which clients can then use to perform the equivalent of a get-entry-and-proof. The extension costs just 8 bytes.
Tiles are far easier to serve than dynamic endpoints, as they require no database, are cacheable as static assets, and can be served compressed. Arguably, even without the SCT index extension, tiles are a better read model for most clients too, as the read endpoints are usually used by services that need a full view of the tree anyway (Cert Spotter, crt.sh, UAs doing SCT auditing).
Tiles are just a different “serialization format” for a Merkle tree: all the hashes in the tree and the signatures are the same as in RFC 6962. In fact, it’s possible to run one or more read replicas for full RFC 6962 compliance by maintaining a reliable leaf hash to index lookup table and performing tile fetches (and chain builds) in response to RFC 6962 requests. These read replicas would increase operational complexity, double local node storage requirements, and give up the caching and bandwidth advantages of tiles. We argue that not shifting all this complexity away from log operators would be a missed opportunity to improve the health and resiliency of the CT ecosystem.
The STH is serialized as a checkpoint with an RFC6962NoteSignature. Note that such a checkpoint can be converted to a get-sth response and vice versa.
example.com/TestLog
1
Rj6bscQal4ZK99UHjYbd8YW8hDgfEI32efLi5KaD/DQ=
— example.com/TestLog xxpbCAAAAYu6clr9BAMASDBGAiEAongGvaDmW6s9nUYvOob1+CD57FjNxGliZSjLHz8nNVYCIQCnnBvC1g4khQv5DYw5BFIZRD11lH14mDfeSMc0pUxB3Q==
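To illustrate the conversion, here is a minimal Go sketch that maps the checkpoint body above onto the corresponding get-sth fields; signature handling is omitted, and parseCheckpoint is a hypothetical name, not a Sunlight API.

package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
)

// parseCheckpoint extracts the origin, tree size, and root hash from a
// checkpoint body: an origin line, a decimal tree size, and a base64-encoded
// SHA-256 root hash, followed by the signature lines.
func parseCheckpoint(body string) (origin string, treeSize uint64, rootHash []byte, err error) {
	lines := strings.SplitN(body, "\n", 4)
	if len(lines) < 4 {
		return "", 0, nil, fmt.Errorf("malformed checkpoint")
	}
	origin = lines[0]
	if treeSize, err = strconv.ParseUint(lines[1], 10, 64); err != nil {
		return "", 0, nil, err
	}
	rootHash, err = base64.StdEncoding.DecodeString(lines[2])
	return origin, treeSize, rootHash, err
}

func main() {
	origin, n, root, err := parseCheckpoint(
		"example.com/TestLog\n1\nRj6bscQal4ZK99UHjYbd8YW8hDgfEI32efLi5KaD/DQ=\n")
	fmt.Println(origin, n, len(root), err) // example.com/TestLog 1 32 <nil>
}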
Level -1 (“data”) tiles are sequences of TileLeaf structures. The format is somewhat redundant, but the overhead compresses well, and embedding a TimestampedEntry reduces the mangling required to verify leaves.
struct {
    TimestampedEntry timestamped_entry;
    select(entry_type) {
        case x509_entry: Empty;
        case precert_entry: PreCertExtraData;
    } extra_data;
} TileLeaf;

struct {
    ASN.1Cert pre_certificate;
    opaque PrecertificateSigningCertificate<0..2^24-1>;
} PreCertExtraData;
Data tiles are served with Content-Encoding gzip, saving storage and bandwidth.
The CTExtensions field (currently opaque) is redefined to be a list of Extensions, similarly to RFC 5246, but with a one-byte ExtensionType. We define the leaf_index extension. The extension_data field of this extension contains a LeafIndex value.
enum {
    leaf_index(0), (255)
} ExtensionType;

struct {
    ExtensionType extension_type;
    opaque extension_data<0..2^16-1>;
} Extension;

Extension CTExtensions<0..2^16-1>;

uint8 uint40[5];
uint40 LeafIndex;
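As a worked example, this minimal Go sketch encodes a single leaf_index extension, showing where the 8 bytes mentioned above come from: one byte of type, two of length, and five of index (big-endian, per the usual TLS presentation language convention). The function name is hypothetical.

package main

import "fmt"

const extensionTypeLeafIndex = 0 // leaf_index(0)

// marshalLeafIndexExtension encodes one Extension carrying a uint40 LeafIndex.
func marshalLeafIndexExtension(index uint64) []byte {
	return []byte{
		extensionTypeLeafIndex, // extension_type
		0, 5, // extension_data length: 5 bytes
		byte(index >> 32), byte(index >> 24), byte(index >> 16), // uint40
		byte(index >> 8), byte(index),
	}
}

func main() {
	fmt.Printf("%x\n", marshalLeafIndexExtension(1000)) // 00000500000003e8, 8 bytes
}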
The log is served by a single service on a single node (see “On availability” below), backed by strongly consistent object storage, such as S3 or a high-availability filesystem, and by a database with compare-and-swap semantics, such as DynamoDB or a filesystem with file locking.
The node maintains in memory the right edge of the Merkle tree, which is sufficient for appending entries and producing new tiles.
Incoming add-chain requests are pooled in memory, and held until their leaves are integrated into the log. Every second, the pool is drained, new tiles are produced, and they are uploaded (in parallel) to object storage. Then, a new STH is signed and uploaded first to the database with a compare-and-swap, and then to object storage. At that point, SCTs are signed and returned for the pending add-chain requests.
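The sequencing loop can be sketched as follows; tile production, signing, and uploads are stubbed out, and every name is illustrative rather than Sunlight's actual API.

package main

import (
	"fmt"
	"sync"
	"time"
)

type pendingEntry struct {
	leaf []byte
	done chan uint64 // delivers the leaf index once the entry is integrated
}

type sequencer struct {
	mu       sync.Mutex
	pool     []*pendingEntry
	treeSize uint64
}

// add pools an entry and blocks until the next sequencing round integrates it,
// mirroring how add-chain requests are held until a new STH is uploaded.
func (s *sequencer) add(leaf []byte) uint64 {
	e := &pendingEntry{leaf: leaf, done: make(chan uint64, 1)}
	s.mu.Lock()
	s.pool = append(s.pool, e)
	s.mu.Unlock()
	return <-e.done
}

func (s *sequencer) run() {
	for range time.Tick(time.Second) {
		s.mu.Lock()
		batch := s.pool
		s.pool = nil
		s.mu.Unlock()

		// Here the real log would produce and upload the new tiles in
		// parallel, sign a new STH, and upload it first to the
		// compare-and-swap database and then to object storage.
		for i, e := range batch {
			e.done <- s.treeSize + uint64(i) // only now can SCTs be signed
		}
		s.treeSize += uint64(len(batch))
	}
}

func main() {
	s := &sequencer{}
	go s.run()
	fmt.Println(s.add([]byte("example leaf"))) // 0, after roughly one second
}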
The node maintains a local per-log SQLite table mapping certificate hash to timestamp and leaf index (for the SCT index extension), to deduplicate add-chain requests. Deterministic ECDSA (RFC 6979) is used to sign SCTs, removing the need to store signatures. (Storing the signatures would double the size of the local cache, and ECDSA signatures are fast enough.) Importantly, this cache can be best-effort and lossy: if some entries were lost, the log can just accept a few duplicates (and even if the local cache were completely lost, that would only double the size of the log). This allows using a replication mechanism like Litestream or EBS snapshots with the potential for partial data loss.
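For concreteness, a possible shape for the deduplication table; this schema is an assumption, not Sunlight's, but it matches the roughly 29 bytes per entry of key, index, and timestamp estimated in the figures below.

package main

import "fmt"

// A hypothetical SQLite schema for the lossy deduplication cache.
const createDedupTable = `
CREATE TABLE IF NOT EXISTS dedup (
	leaf_hash BLOB PRIMARY KEY, -- 128-bit hash of the (pre)certificate
	timestamp INTEGER NOT NULL, -- SCT timestamp, reused to re-sign duplicates
	leaf_idx  INTEGER NOT NULL  -- leaf index, for the SCT extension
) WITHOUT ROWID;`

func main() { fmt.Print(createDedupTable) }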
On availability

Being served by a single node, this log naturally has a lower availability ceiling than multi-node, leader-election-based logs. We believe this is an acceptable tradeoff in the CT ecosystem.
The CT log policy SLO is 99% availability. Targeting 99.5% for a healthy margin, that allows 50 minutes of downtime per week. Planned maintenance is going to take dramatically less than that, as a cold start involves 8 GET requests to the object storage (see below). In case of catastrophic availability failure (such as an availability zone outage), the log can be quickly (in minutes, with a well-rehearsed playbook) restarted from its configuration, object storage, and the deduplication cache replica.
At the ecosystem level, CAs already submit to more than two logs to optimize issuance time and tolerate log downtime. That’s what makes a 99% SLO tolerable: CT is effectively already a high-availability distributed system with redundancy at the log level. Taking on complexity to provide high availability at the log level is in fact probably harming the ecosystem, by reducing the number of operators. We also point out that multiple current multi-node logs struggle to reach 99% availability today.
Note that single-node logs are not intrinsically more vulnerable to disqualifying bugs unrelated to the availability SLO, like bit flips. This means there will be no need for changes in the CT policy (such as number of SCTs, or compliance monitoring periods). Moreover, each of these logs can scale vertically to 30x the current overall WebPKI issuance rate (and probably significantly more), so there is no need for CAs to spread the load either.
This design doesn’t actually require the compare-and-swap database, and could run exclusively on top of strongly consistent object storage. However, the compare-and-swap backend protects against two potentially lethal operational issues: first, two instances of the log running against the same object storage location, which can only be detected with a compare-and-swap; second, a log mistakenly initialized with a key that was previously used. The latter is prevented because, unlike the object storage location, which is designed to be per-log so that buckets can be discarded once a log shard is Retired, the compare-and-swap database is designed to be global and permanent: it is keyed by log ID (the hash of the public key), and it stores only the latest checkpoint, requiring a negligible amount of storage.
The compare-and-swap backend also helps relax the criticality of the object storage consistency guarantees. For example, Amazon recommends hedging requests to reduce tail latency: if a request is taking too long, fire off a second identical request in parallel, and cancel one when the other succeeds. This is fine if cancellation is effective, but what if it isn’t? Then the “losing” request might succeed later, in theory even after a subsequent checkpoint was uploaded, rolling it back. (Or, more mundanely, operational error might lead to recovering object storage from a backup.) If the node were to restart during this rolled back state, the tree would fork if it didn’t have the compare-and-swap backend to load the correct STH from.
Overall, as long as all log instances point to the same database and the database is not deleted or rolled back, it is impossible for operational error to lead to a tree split. Margin for error can be further reduced by provisioning only one suitable database in the relevant cloud account or system.
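A minimal sketch of this contract, with an illustrative in-memory implementation standing in for DynamoDB or a locking filesystem:

package main

import (
	"bytes"
	"fmt"
	"sync"
)

// LockBackend is a global, permanent table keyed by log ID, holding only the
// latest checkpoint of each log.
type LockBackend struct {
	mu    sync.Mutex
	table map[[32]byte][]byte
}

// Replace atomically swaps old for updated, failing if the stored checkpoint
// is not byte-for-byte what the caller last saw. A failure means another
// writer touched the same key, and the log must halt rather than risk a fork.
func (b *LockBackend) Replace(logID [32]byte, old, updated []byte) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !bytes.Equal(b.table[logID], old) {
		return fmt.Errorf("compare-and-swap failed: concurrent writer?")
	}
	b.table[logID] = updated
	return nil
}

func main() {
	b := &LockBackend{table: make(map[[32]byte][]byte)}
	var id [32]byte // the hash of the log's public key
	fmt.Println(b.Replace(id, nil, []byte("checkpoint 1"))) // <nil>
	fmt.Println(b.Replace(id, nil, []byte("checkpoint 2"))) // error: stale old value
}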
The log service starts by loading the latest checkpoint from the compare-and-swap database and rebuilding the in-memory right edge of the Merkle tree from the corresponding tiles in object storage.
When receiving an add-chain or add-pre-chain request, the log verifies the chain against its accepted roots, consults the deduplication cache (re-signing and returning the original SCT on a hit, thanks to deterministic ECDSA), and otherwise adds the entry to the in-memory pool, holding the request until the entry is integrated.
Every 1s (regardless of whether new certificates were added), a concurrent background thread drains the pool, produces and uploads the new tiles in parallel, signs a new STH, uploads it first to the compare-and-swap database and then to object storage, and finally signs and returns SCTs for the newly integrated entries.
The local deduplication cache is replicated to object storage with Litestream. (This can also be an SQLite-level backup, or an atomic filesystem- or block-level snapshot.)
A single service can serve multiple logs, each at its own API endpoint and each with its own object storage bucket and deduplication cache. The compare-and-swap database is unique and global.
All the following figures are based on 2,100 add[-pre]-chain requests per second, 30x the current rate.
A six-month shard grows to about 33B entries. For a tile height of 8, that’s 5 tiles high.
The object storage size is dominated by the ASN.1 certificates, which assuming an average certificate size of 2,340 bytes (kind of made up), add up to 78TB. The log overhead (64 + 8 + 24 + 16 bits of TimestampedEntry + 8 bytes of SCT extension per leaf in level -1 tiles, 256 bits per leaf in level 0 tiles, negligible data in level 1+ tiles, negligible data in STH and issuer bundles) adds up to less than 2TB. Compression (HTTP-level gzip) experimentally reduces the size of data tiles by 60-70%.
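Worked through in Go under the document's own assumptions, the arithmetic checks out; the 14 bytes below are the 64 + 8 + 24 + 16 bits of TimestampedEntry framing, and the entry count follows from 2,100 entries per second over six months.

package main

import "fmt"

func main() {
	const entries = 2100 * 86400 * 183 // ≈ 33B leaves in a six-month shard
	const perLeaf = 14 + 8 + 32        // TimestampedEntry framing + SCT extension + level-0 hash
	fmt.Printf("%.0f TB\n", float64(entries)*2340/1e12)    // 78 TB of certificates
	fmt.Printf("%.1f TB\n", float64(entries)*perLeaf/1e12) // 1.8 TB of log overhead
}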
The log produces 8 PUT requests per second (21M per month): up to three level -1 and 0 tiles, on average one level 1 tile, and the STH. The upload bandwidth is in the single digits MB/s. (If backed by disks, this translates to a few hundred write IOPS.)
The local cache takes 128 bits of key, 40 bits of index, and 64 bits of timestamp per entry, which adds up to about 1TB, plus database overhead. The write IOPS and Mbit/s are trivial (in the single digits).
Memory requirements are trivial. CPU load is dominated by chain verification and SCT signing. The test logs are running on a single shared-4x-cpu@1024MB Fly.io machine.
At the current WebPKI issuance rate, it should be possible to run the write side of this log for $4k/year.
This design is based on the original Certificate Transparency specification, on the Go Checksum Database developed with Russ Cox, and on the feedback of many individuals in the WebPKI community, and in particular of the Sigsum, Google TrustFabric, and ISRG teams.
Sunlight's development was sponsored by Let's Encrypt.