rota

Cert renewal for self-hosters. One Rust binary, CLI plus dashboard, pluggable CAs, registrars, and install targets.

Why this exists

Public TLS certificate lifetimes are getting shorter on a fixed schedule. CA/Browser Forum Ballot SC-081, adopted in April 2025, drops the maximum validity of publicly trusted certs from 398 days down to 47 days over four years:

Effective     Max validity
2026-03-15    200 days
2027-03-15    100 days
2029-03-15    47 days

Apple championed the ballot. Their argument (Apple Platform Security) is that shorter validity reduces the damage window when a key gets compromised, encourages more automation, and reduces dependence on revocation infrastructure that has been historically unreliable.

Fair enough as far as it goes. Where it falls down for me:

  1. Revocation isn't broken so much as underfunded. OCSP stapling and CRLite are real and deployed.
  2. The historic CA failures (DigiNotar, Symantec, TrustCor) weren't validity failures. They were infrastructure compromises and policy failures. Shorter cert lifetimes wouldn't have helped.
  3. The cost falls hardest on small operators. Cloud customers absorb 47-day renewal automatically; air-gapped, embedded, IoT, and self-hosted setups pay the entire complexity tax. The rational response from a small operator is to hand DNS to a managed proxy and stop self-hosting. That's a power shift away from individuals running their own infrastructure.

I run my own stuff and I'd like to keep doing that. So I'm writing the tool I'd want.

What it does

  • Watches your CA-issued certs and knows when they're close to expiry.
  • Generates fresh CSRs against persistent private keys you control.
  • Submits reissue or renewal requests to the CA over that CA's API.
  • Completes domain-control validation by writing TXT records at your registrar (DNS-01) or dropping a token under .well-known/acme-challenge/ for an existing webserver to serve (HTTP-01).
  • Installs issued certs where they need to land: Synology DSM, plain filesystem, nginx reload, HAProxy runtime API hot-swap, Kubernetes Secret.
  • Logs every step.
  • Surfaces a real-time dashboard.
  • Alerts before failures, not after.

Optionally federates across multiple rotad instances: one node renews, peers pick up the cert from a shared SurrealDB and install locally.

Who this is for

Operators who run their own webservers, mail servers, dashboards, hobby boxes, homelabs. People who want renewal automation without surrendering DNS or HTTPS termination to a managed proxy. Anyone who'd rather drop a single Rust binary on a box than wrangle a Python toolchain just to keep certs fresh.

If you're already happy with certbot, this isn't a replacement. It's a different tradeoff for the operator who wants pluggable backends and built-in operational surface (audit log, dashboard, alerts, metrics, federation) in one process.

License

Apache 2.0.

Getting started

Stand up a single-node rota that renews one cert against Let's Encrypt via DNS-01, with the cert installed to the local filesystem.

Install

cargo install rota

That builds two binaries: rotad (the daemon) and rota (the CLI client that talks to the daemon over a UNIX socket).

Minimal config

Drop this at /etc/rota/rota.yaml:

daemon:
  database_path: /var/lib/rota/rota.db
  listen_addr: 127.0.0.1:7878
  socket_path: /var/run/rota.sock
  check_interval_seconds: 3600
  renew_threshold_days: 30

acme:
  directory_url: https://acme-v02.api.letsencrypt.org/directory
  contact_email: ops@example.com
  account_credentials_file: /etc/rota/secrets/acme-account.json

cloudflare:
  api_token_file: /etc/rota/secrets/cloudflare.token

certs:
  - id: example-public
    description: example.com marketing site
    domains: [example.com, www.example.com]
    key_path: /var/lib/rota/keys/example.com.key
    ca:
      kind: acme
    dcv:
      kind: cloudflare
    install:
      kind: filesystem
      directory: /etc/ssl/example

Three things to provision before starting rotad:

  1. Cloudflare API token at /etc/rota/secrets/cloudflare.token. Scope: Zone.DNS:Edit on every zone rota will publish DCV records in. rota only supports tokens, not the legacy Global API Key.
  2. ACME account credentials file. The first run creates this automatically; just make sure the parent directory is writable.
  3. Private key directory (/var/lib/rota/keys/) with 0700 mode. The first run also generates the per-cert key automatically. rota reuses the same key on every renewal so cert-pinning operators don't break.

First run

rotad --config /etc/rota/rota.yaml

The daemon opens the audit DB (SQLite by default), connects to the configured CAs, registrars / DCV solvers, and install backends, then sweeps every cert on the configured check_interval_seconds. The first sweep happens after one full interval, not immediately, so the daemon doesn't hammer the CA on startup. Any cert whose installed copy is closer to notAfter than renew_threshold_days gets renewed.

Talking to the daemon

rota status

Prints a one-line summary per cert: id, domains, days until expiry, last renewal status. The same data is on the dashboard at http://127.0.0.1:7878/.

rota renew example-public

Force a renewal regardless of expiry. Useful when you've just rotated DNS and want to confirm the pipeline end to end.

rota log example-public

Print the most recent renewal's audit trail (CSR generated, CA submitted, DCV published, cert issued, DCV removed, cert installed).

Architecture overview

rota is one Rust binary running as a daemon, with two thin clients sharing its state.

rota CLI ──── socket ────▶ rotad (daemon)
                           scheduler, audit, API
Dashboard ──── HTTP ─────▶ HTTP + WS, SQLite or SurrealDB
(htmx + SSR)   WS
                            │          │           │               │
                            ▼          ▼           ▼               ▼
                        CABackend  DcvBackend  InstallBackend  AlertBackend

                        Namecheap  Namecheap   DSM             Email (SMTP)
                        ACME       Cloudflare  Filesystem      Webhook
                                   Webroot     nginx
                                               HAProxy
                                               Kubernetes

Daemon, CLI, and dashboard all build from one Cargo workspace and ship as a single binary. No Node, no Deno, no npm.

Four trait surfaces

The renewal pipeline composes generically across vendors. Adding support for a new CA, DCV strategy, install target, or alert sink is one trait impl, not a fork of the renewal logic.

CABackend

Issues certificates from a Certificate Authority. Two methods:

  • submit(domains, csr_pem, preferred_kinds). Submits a CSR. Returns one or more DcvChallenge values the caller must satisfy via the DCV backend. preferred_kinds lets the caller hint at DNS-01 vs HTTP-01; CAs that offer a choice walk the list and pick.
  • await_issuance(domains). Polls until the cert is signed.

Today: NamecheapCa (traditional reissue, DNS-01 only) and AcmeCa (RFC 8555: Let's Encrypt, ZeroSSL with EAB, BuyPass, any directory that speaks the spec).

DcvBackend

Solves the CA's domain-control challenge.

  • supported_kinds(). Which ChallengeKinds the backend can satisfy: Dns01, Http01.
  • supports(challenge). Whether this specific challenge is satisfiable. Default impl matches against supported_kinds().
  • publish(challenge). Make the response visible to the CA. Idempotent.
  • remove(challenge). Clean up after issuance. Idempotent.

DcvChallenge is a tagged enum:

  • Dns01 { record_name, record_value, ttl }. TXT record at record_name. Solvers: NamecheapDcv, CloudflareDcv.
  • Http01 { domain, token, key_authorization }. key_authorization body served at http://<domain>/.well-known/acme-challenge/<token>. Solver: WebrootDcv (drops the file under a directory served by an existing webserver).
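A rough sketch of how the challenge enum and the DcvBackend surface described above could look. This is a simplified, synchronous rendition: the String error type, the kind() helper, and the in-memory MemoryDns solver are assumptions made for illustration, not rota's actual definitions.

```rust
/// Challenge kinds named in the text above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChallengeKind {
    Dns01,
    Http01,
}

/// Tagged enum carrying what a solver needs to publish a response.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DcvChallenge {
    Dns01 { record_name: String, record_value: String, ttl: u32 },
    Http01 { domain: String, token: String, key_authorization: String },
}

impl DcvChallenge {
    /// Hypothetical helper: map a challenge to its kind.
    pub fn kind(&self) -> ChallengeKind {
        match self {
            DcvChallenge::Dns01 { .. } => ChallengeKind::Dns01,
            DcvChallenge::Http01 { .. } => ChallengeKind::Http01,
        }
    }
}

/// Simplified stand-in for the trait; the error type is an assumption.
pub trait DcvBackend {
    fn supported_kinds(&self) -> Vec<ChallengeKind>;

    /// Default impl: match the challenge's kind against supported_kinds().
    fn supports(&self, challenge: &DcvChallenge) -> bool {
        self.supported_kinds().contains(&challenge.kind())
    }

    fn publish(&mut self, challenge: &DcvChallenge) -> Result<(), String>;
    fn remove(&mut self, challenge: &DcvChallenge) -> Result<(), String>;
}

/// Hypothetical in-memory DNS solver, used only to exercise the trait.
#[derive(Default)]
pub struct MemoryDns {
    pub records: Vec<(String, String)>,
}

impl DcvBackend for MemoryDns {
    fn supported_kinds(&self) -> Vec<ChallengeKind> {
        vec![ChallengeKind::Dns01]
    }

    fn publish(&mut self, challenge: &DcvChallenge) -> Result<(), String> {
        if let DcvChallenge::Dns01 { record_name, record_value, .. } = challenge {
            let entry = (record_name.clone(), record_value.clone());
            // Idempotent: publishing the same record twice leaves one copy.
            if !self.records.contains(&entry) {
                self.records.push(entry);
            }
            Ok(())
        } else {
            Err("unsupported challenge kind".to_string())
        }
    }

    fn remove(&mut self, challenge: &DcvChallenge) -> Result<(), String> {
        if let DcvChallenge::Dns01 { record_name, record_value, .. } = challenge {
            // Idempotent: removing an absent record is not an error.
            self.records.retain(|(n, v)| !(n == record_name && v == record_value));
        }
        Ok(())
    }
}
```

The default supports() is what lets a solver declare its kinds once and still override the check for challenges it cannot handle for other reasons (for example, a zone it does not manage).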

InstallBackend

Places the issued cert and chain where the system serving the domain can read them. Implementations may also trigger a service reload.

  • install(cert, private_key_pem, domains). Land the artifacts.
  • current_cert_pem(cert_id). Read back the installed leaf cert for the scheduler's days-until-expiry calculation. Default returns None; backends opt in.

Today: DsmInstall (Synology), FilesystemInstall (mode-600 key + mode-644 cert + chain + fullchain), NginxInstall (filesystem + reload subprocess), HaproxyInstall (filesystem + runtime API hot-swap), K8sSecretInstall (server-side-apply a kubernetes.io/tls Secret).

AlertBackend

Daemon-wide notification sinks. Every renewal failure fans out to every configured sink.

  • dispatch(event). Deliver. Errors are logged but never break the renewal pipeline.

Today: EmailAlert (lettre, STARTTLS / implicit TLS / plaintext) and WebhookAlert (generic JSON envelope POST).

Renewer pipeline

For one cert, one renewal:

  1. Load (or generate) the persistent private key from key_path.
  2. Generate a CSR against that key.
  3. ca.submit() returns DCV challenges.
  4. Pre-flight check: dcv.supports() for each challenge. Fast-fail if the configured solver can't handle what the CA returned.
  5. dcv.publish() every challenge.
  6. ca.await_issuance() waits for the CA to validate and sign.
  7. dcv.remove() cleans up. Runs unconditionally, even if issuance failed, so a partial run doesn't leave stray records.
  8. Persist the issued cert and chain to the audit store (for cluster cert distribution).
  9. install.install() writes locally.
  10. The audit log records every step.
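The ordering of steps 3 through 9 can be sketched roughly as follows. The traits here are deliberately cut down (challenges are plain strings, errors are String, everything is synchronous), key loading, CSR generation, and the audit-store write are left out, and the Fake* types exist only to exercise the flow. Treat it as an illustration of the "cleanup always runs" shape, not rota's implementation.

```rust
pub trait Ca {
    fn submit(&mut self, csr_pem: &str) -> Result<Vec<String>, String>;
    fn await_issuance(&mut self) -> Result<String, String>;
}
pub trait Dcv {
    fn supports(&self, challenge: &str) -> bool;
    fn publish(&mut self, challenge: &str) -> Result<(), String>;
    fn remove(&mut self, challenge: &str) -> Result<(), String>;
}
pub trait Install {
    fn install(&mut self, cert_pem: &str) -> Result<(), String>;
}

pub fn renew_once(
    ca: &mut dyn Ca,
    dcv: &mut dyn Dcv,
    install: &mut dyn Install,
    csr_pem: &str,
    audit: &mut Vec<String>,
) -> Result<(), String> {
    let challenges = ca.submit(csr_pem)?;
    audit.push("ca_submitted".to_string());

    // Pre-flight: fail fast before anything is published.
    if let Some(bad) = challenges.iter().find(|c| !dcv.supports(c.as_str())) {
        return Err(format!("solver cannot satisfy challenge {bad}"));
    }

    // Publish, then wait. The outcome is held rather than returned early
    // so the cleanup below runs on both success and failure.
    let published: Result<(), String> =
        challenges.iter().try_for_each(|c| dcv.publish(c.as_str()));
    let outcome = match published {
        Ok(()) => {
            audit.push("dcv_published".to_string());
            ca.await_issuance()
        }
        Err(e) => Err(e),
    };
    if outcome.is_ok() {
        audit.push("cert_issued".to_string());
    }

    // Unconditional, best-effort cleanup of every challenge.
    for c in &challenges {
        let _ = dcv.remove(c.as_str());
    }
    audit.push("dcv_removed".to_string());

    let cert_pem = outcome?;
    install.install(&cert_pem)?;
    audit.push("cert_installed".to_string());
    Ok(())
}

/// Hypothetical fakes, for illustration only.
pub struct FakeCa { pub fail_issuance: bool }
impl Ca for FakeCa {
    fn submit(&mut self, _csr_pem: &str) -> Result<Vec<String>, String> {
        Ok(vec!["dns-01:example.com".to_string()])
    }
    fn await_issuance(&mut self) -> Result<String, String> {
        if self.fail_issuance { Err("validation failed".to_string()) } else { Ok("CERT".to_string()) }
    }
}
#[derive(Default)]
pub struct FakeDcv { pub published: Vec<String> }
impl Dcv for FakeDcv {
    fn supports(&self, _challenge: &str) -> bool { true }
    fn publish(&mut self, challenge: &str) -> Result<(), String> {
        self.published.push(challenge.to_string());
        Ok(())
    }
    fn remove(&mut self, challenge: &str) -> Result<(), String> {
        self.published.retain(|p| p.as_str() != challenge);
        Ok(())
    }
}
#[derive(Default)]
pub struct FakeInstall { pub installed: Option<String> }
impl Install for FakeInstall {
    fn install(&mut self, cert_pem: &str) -> Result<(), String> {
        self.installed = Some(cert_pem.to_string());
        Ok(())
    }
}
```

Holding the issuance result in a variable instead of using ? directly is what guarantees the remove pass runs even when the CA rejects validation.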

Scheduler

Ticks every check_interval_seconds. For each cert: read the install backend's current_cert_pem, parse notAfter, compare to renew_threshold_days, queue a renewal if due. A per-cert failure cooldown prevents a flaky CA from getting hammered every tick.
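As a minimal sketch of the per-cert decision the scheduler makes each tick, assuming notAfter has already been parsed into a SystemTime. The function names and the cooldown parameter are hypothetical; the source does not say how long the cooldown is or how it is stored.

```rust
use std::time::{Duration, SystemTime};

/// A cert is due when less than `threshold_days` remain before notAfter.
pub fn is_due(not_after: SystemTime, now: SystemTime, threshold_days: u64) -> bool {
    let threshold = Duration::from_secs(threshold_days * 86_400);
    match not_after.duration_since(now) {
        Ok(remaining) => remaining < threshold,
        // notAfter is already in the past: definitely due.
        Err(_) => true,
    }
}

/// True while the last failure is more recent than the cooldown.
pub fn in_cooldown(last_failure: Option<SystemTime>, now: SystemTime, cooldown: Duration) -> bool {
    match last_failure {
        // A failure timestamp in the future (clock skew) is treated as still cooling down.
        Some(t) => now.duration_since(t).map(|d| d < cooldown).unwrap_or(true),
        None => false,
    }
}

/// Queue a renewal only when the cert is due and not cooling down.
pub fn should_queue(
    not_after: SystemTime,
    now: SystemTime,
    threshold_days: u64,
    last_failure: Option<SystemTime>,
    cooldown: Duration,
) -> bool {
    is_due(not_after, now, threshold_days) && !in_cooldown(last_failure, now, cooldown)
}
```

With renew_threshold_days: 30, a cert with 10 days left is due, but a failure one minute ago under a 15-minute cooldown keeps it out of the queue for this tick.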

In a cluster, the scheduler's sweep is gated on cluster.is_leader(). Followers skip silently; the leader keeps doing the work.

Audit store

Every renewal opens a row, appends step events, and closes with a status. Two backends:

  • SqliteAuditStore (default). Single-file SQLite, no external service. Good for single-node deployments.
  • SurrealAuditStore. Connects to an existing SurrealDB. Required for cluster federation: the lock and cert distribution rows live in the same database.

Cluster

When cluster.enabled: true is set and audit.kind is surrealdb, the daemon runs a SurrealClusterCoordinator that holds a lock at cluster_lock:singleton with a TTL refresh. The leader's renewer pipeline writes successful issuances to issued_cert rows. An InstallSyncTask on every node polls the table and runs the local install backend with the operator-pre-provisioned private key when the audit cert is fresher than what's installed locally. Private keys are never distributed through the audit store.

See the federation runbook for the operator-side walkthrough.

Backends

What ships today, what's coming, and the design choices behind each.

CA backends

Namecheap (traditional reissue)

kind: namecheap with ssl_id: <numeric SSL id>. Uses Namecheap's namecheap.ssl.reissue and namecheap.ssl.getInfo flow. Activation is one-shot: rota only handles reissue within an existing SSL subscription. First-time activation requires a long list of admin-contact fields rota does not model in the config. Operators activate once by hand in the Namecheap dashboard; rota handles every renewal after that.

DNS-01 only. The Namecheap reissue API folds every SAN under one DCV record, so rota's multi-challenge trait sees a single-element vec.

ACME (RFC 8555)

kind: acme. Speaks Let's Encrypt, ZeroSSL (with External Account Binding), BuyPass, and any directory that follows the spec. Uses the instant-acme crate.

rota manages its own persistent ECDSA key per cert because operators rely on key continuity for cert pinning. The ACME submit path uses finalize_csr(csr_der) so the operator's key stays canonical across renewals.

The ACME backend walks the configured DCV solver's supported_kinds() to pick a challenge type per authorization. So dcv: { kind: webroot } automatically gets HTTP-01; dcv: { kind: cloudflare } automatically gets DNS-01.
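The selection could be as small as the sketch below: walk the solver's kinds in order and take the first one the CA also offers for the authorization. pick_kind is a hypothetical name, and the tie-breaking order (solver preference first) is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChallengeKind {
    Dns01,
    Http01,
}

/// Return the first solver-supported kind that the CA also offers.
pub fn pick_kind(
    solver_supported: &[ChallengeKind],
    ca_offered: &[ChallengeKind],
) -> Option<ChallengeKind> {
    solver_supported
        .iter()
        .copied()
        .find(|kind| ca_offered.contains(kind))
}
```

A None result corresponds to the pre-flight failure case: the configured solver cannot satisfy anything the CA is willing to accept.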

DCV backends

Namecheap DNS

kind: namecheap. DNS-01 via namecheap.domains.dns.{getHosts,setHosts}.

Watch out: Namecheap's setHosts is a full replacement of every record on the domain, not a per-record edit. Publishing one TXT therefore requires reading every existing record first, merging the new one in, and writing the merged set back. rota does this transparently.
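The read-merge-write step can be pictured like this. HostRecord and its field names are made up for the example (they are not Namecheap's wire format), and the network calls around the merge are omitted: the input is whatever getHosts returned, the output is the full set handed to setHosts.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HostRecord {
    pub host: String,
    pub record_type: String,
    pub value: String,
    pub ttl: u32,
}

/// Merge one TXT record into the complete record set for the domain.
/// Every pre-existing record is carried over untouched.
pub fn merge_txt(mut existing: Vec<HostRecord>, host: &str, value: &str, ttl: u32) -> Vec<HostRecord> {
    let present = existing
        .iter()
        .any(|r| r.record_type == "TXT" && r.host == host && r.value == value);
    // Idempotent: do not add a second copy of the same challenge record.
    if !present {
        existing.push(HostRecord {
            host: host.to_string(),
            record_type: "TXT".to_string(),
            value: value.to_string(),
            ttl,
        });
    }
    existing
}

/// Drop only the matching TXT record; everything else survives the rewrite.
pub fn remove_txt(mut existing: Vec<HostRecord>, host: &str, value: &str) -> Vec<HostRecord> {
    existing.retain(|r| !(r.record_type == "TXT" && r.host == host && r.value == value));
    existing
}
```

The important property is that an A or MX record present before the merge is still present in the set written back.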

Cloudflare DNS

kind: cloudflare. DNS-01 via Cloudflare's v4 API with Bearer-token auth. Token scopes: Zone.DNS:Edit on every zone rota will manage. rota does not support the legacy Global API Key.

Cloudflare's per-record edit API means rota doesn't have to read every record on the zone first. The flow: resolve the apex zone for the record name, look for an existing TXT match (idempotency), POST the record if absent. Removal mirrors the lookup-and-delete shape.

Webroot (HTTP-01)

kind: webroot with directory: <document root>. rota writes the key authorization to <directory>/.well-known/acme-challenge/<token> (mode 644) and removes it after issuance. The operator's existing webserver (nginx, Caddy, Apache, anything that serves static files over HTTP on port 80) is responsible for actually exposing the directory.

Why webroot rather than a daemon-internal listener: most self-hosters already run a webserver on 80 and 443. Asking rota to bind 80 means coordinating port handoff (or running rota as root) for one purpose: serving a tiny static file the existing webserver could serve in its sleep.

Defensive against malformed challenge tokens: rota refuses path-shaped tokens (/, \, .., empty) so a misbehaving CA can't traverse out of the challenge directory.
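A minimal sketch of that check, covering exactly the cases listed above (empty, /, \, ..). The function names are hypothetical, and a stricter implementation might instead allow only the base64url alphabet that RFC 8555 specifies for tokens.

```rust
use std::path::{Path, PathBuf};

/// Reject tokens that are empty or could act as path components.
pub fn is_safe_token(token: &str) -> bool {
    !token.is_empty()
        && !token.contains('/')
        && !token.contains('\\')
        && !token.contains("..")
}

/// Build the challenge file path, or None when the token is refused.
pub fn challenge_path(webroot: &Path, token: &str) -> Option<PathBuf> {
    if !is_safe_token(token) {
        return None;
    }
    Some(
        webroot
            .join(".well-known")
            .join("acme-challenge")
            .join(token),
    )
}
```

Checking before the join matters because Path::join replaces the base entirely when given an absolute component.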

Install backends

Synology DSM

kind: dsm with description: <DSM panel label>. Uses synowebapi to install the cert into DSM's certificate store. The cert id surfaces in the DSM Control Panel under the configured description.

Filesystem

kind: filesystem with directory: <path>. Lays the issued cert, chain, and private key down under predictable filenames so any service that reads disk-based PEM (nginx, HAProxy, Caddy, custom Rust + rustls) can pick them up.

Filenames mirror the certbot convention so existing reload scripts that grep for fullchain.pem and privkey.pem work unchanged. Writes are atomic per file: each artifact goes to a sibling .tmp, fsync, rename.
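The tmp, fsync, rename sequence looks roughly like this with only the standard library. write_atomic and its mode parameter are illustrative names, the example is Unix-only, and it does not fsync the parent directory, which a stricter implementation may also want.

```rust
use std::fs::{self, OpenOptions, Permissions};
use std::io::{self, Write};
use std::os::unix::fs::{OpenOptionsExt, PermissionsExt};
use std::path::{Path, PathBuf};

/// Write `contents` to `path` so readers see either the old file or the
/// new one, never a partial write.
pub fn write_atomic(path: &Path, contents: &[u8], mode: u32) -> io::Result<()> {
    // Sibling temp file: same directory, so the rename stays on one filesystem.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .mode(mode)
        .open(&tmp)?;
    // Enforce the mode even if a stale temp file existed or umask altered it.
    file.set_permissions(Permissions::from_mode(mode))?;
    file.write_all(contents)?;
    // Flush data to disk before the rename makes it visible.
    file.sync_all()?;
    drop(file);

    fs::rename(&tmp, path)
}
```

Usage would be one call per artifact, for example write_atomic(dir.join("privkey.pem").as_path(), key_pem, 0o600) and 0o644 for the cert files.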

nginx

kind: nginx with directory: <path> and optional reload_command: [<argv>]. Filesystem write plus an nginx reload subprocess.

Default reload is ["nginx", "-s", "reload"]. Operators on systemd typically override with ["systemctl", "reload", "nginx"] and a sudoers rule that keeps the daemon unprivileged. The reload runs without a shell wrapper, so argv entries are not interpreted: no globbing, no env interpolation. A non-zero exit surfaces as an Install error so the renewer records the failure on the audit log.
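Running the reload without a shell amounts to something like the following. run_reload and the String error are placeholders for illustration; the point is that argv[0] is executed directly and the remaining entries are passed through verbatim.

```rust
use std::process::Command;

/// Run the reload command from an argv list, with no shell in between.
pub fn run_reload(argv: &[String]) -> Result<(), String> {
    let (program, args) = argv
        .split_first()
        .ok_or_else(|| "reload_command is empty".to_string())?;

    // Command::new executes the program directly: no globbing and no
    // variable expansion is applied to the arguments.
    let status = Command::new(program)
        .args(args)
        .status()
        .map_err(|e| format!("failed to spawn {program}: {e}"))?;

    if status.success() {
        Ok(())
    } else {
        // A non-zero exit is reported to the caller as an install failure.
        Err(format!("reload exited with {status}"))
    }
}
```

Because there is no shell, an entry such as "$HOME/*.conf" reaches the program as that literal string.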

HAProxy

kind: haproxy with directory:, socket_path:, and cert_storage_name:. Filesystem write plus HAProxy runtime API hot-swap.

The runtime API sequence:

set ssl cert <storage_name> <<EOL
<leaf + chain + key bundle>
EOL
commit ssl cert <storage_name>

No reload, no dropped TCP connections. HAProxy hands the new certificate to live SNI lookups on the next handshake. Requires HAProxy 2.1 or later (the first release with set ssl cert / commit ssl cert) with the admin socket exposed:

global
    stats socket /run/haproxy/admin.sock mode 660 level admin

Kubernetes Secret

kind: k8s_secret with namespace:, secret_name:, and optional kubeconfig_path:. Server-side applies a kubernetes.io/tls Secret. Drop-in for Ingress, Gateway, and any controller that consumes the standard TLS Secret shape.

Auth resolution:

  • kubeconfig_path omitted: in-cluster ServiceAccount (run rotad as a Pod).
  • kubeconfig_path set: load the named kubeconfig (run rotad outside the cluster).

Required RBAC on secrets in the target namespace: get, create, patch. Server-side apply with FieldManager "rota" so concurrent managers (cert-manager, helm, etc.) get clean conflict signaling rather than silent overwrites.

Alert backends

Email

kind: email. Lettre-backed SMTP. Submission ports (587 STARTTLS, 465 SMTPS) both supported via tls:. Auth is username + password from a file the daemon reads at runtime; the password never sits in the parsed config tree.

Webhook

kind: webhook. POSTs a generic JSON envelope to a URL:

{"cert_id": "...", "kind": "renewal_failed", "message": "...", "timestamp": "RFC3339"}

Vendor-neutral on the wire. Slack-incoming, Discord, Microsoft Teams, and similar opinionated formats are out of scope: point a small relay (n8n, Pipedream, your own service) at this URL and translate. Keeps rota's wire format flat instead of growing a per-vendor bestiary.

Optional Bearer token auth from a file. Per-request timeout defaults to 10 seconds.
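For relay authors, here is a std-only sketch of how the envelope above could be produced. AlertEvent, escape_json, and envelope are hypothetical names, and a real implementation would more likely lean on a JSON library; the hand-rolled escaping is only there to keep the example dependency-free.

```rust
/// The four fields of the envelope shown above.
pub struct AlertEvent<'a> {
    pub cert_id: &'a str,
    pub kind: &'a str,
    pub message: &'a str,
    pub timestamp: &'a str, // RFC 3339
}

/// Escape a string for use inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize the event as the flat JSON envelope.
pub fn envelope(event: &AlertEvent<'_>) -> String {
    format!(
        "{{\"cert_id\": \"{}\", \"kind\": \"{}\", \"message\": \"{}\", \"timestamp\": \"{}\"}}",
        escape_json(event.cert_id),
        escape_json(event.kind),
        escape_json(event.message),
        escape_json(event.timestamp),
    )
}
```

A relay only needs to parse these four string fields and map them onto whatever shape its target expects.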

Roadmap

More CAs: Sectigo direct, GoDaddy. More DCV solvers: Route 53, DigitalOcean, and Porkbun for DNS-01, plus a native HTTP-01 listener (as an alternative to webroot). More install targets and reload integrations as operators surface needs.

Configuration reference

rota.yaml is the single source of truth for daemon settings, CAs, DCV solvers, install targets, alerts, and federation. The path defaults to /etc/rota/rota.yaml; override with --config <path> or the ROTA_CONFIG env var.

Top-level shape

daemon: {...}            # daemon-wide settings
audit: {...}             # optional; defaults to SQLite at daemon.database_path
namecheap: {...}         # account-wide, required if any cert names namecheap
cloudflare: {...}        # account-wide, required if any cert names cloudflare
acme: {...}              # account-wide, required if any cert names acme
cluster: {...}           # optional federation block
alerts: [...]            # optional list of notification sinks
certs: [...]             # required list of cert configs

daemon

daemon:
  database_path: /var/lib/rota/rota.db
  listen_addr: 127.0.0.1:7878
  socket_path: /var/run/rota.sock
  check_interval_seconds: 3600
  renew_threshold_days: 30

Field                    Default                 Notes
database_path            /var/lib/rota/rota.db   SQLite audit DB. Auto-created mode 600.
listen_addr              127.0.0.1:7878          Dashboard HTTP listen. Bind behind a reverse proxy for external access.
socket_path              /var/run/rota.sock      UNIX socket the rota CLI talks to.
check_interval_seconds   3600                    Scheduler sweep cadence.
renew_threshold_days     30                      Renew when the installed cert's notAfter is closer than this.

audit

Omit for SQLite at daemon.database_path (single-node default). For SurrealDB:

audit:
  kind: surrealdb
  endpoint: ws://surreal.internal:8000
  namespace: rota
  database: prod
  username: rota
  password_file: /etc/rota/secrets/surreal.password

endpoint accepts mem://, file://path, ws://, wss://, http://, https://. Embedded engines (mem://, file://) skip auth; remote engines need username and password_file.

Secrets and environment variables

Every *_file: field in this config can read from an environment variable instead of a file by setting the path to env:VAR_NAME. Every operator-set String field accepts ${VAR} interpolation against the process environment at config-load time. An unset referenced variable is a fatal startup error.

namecheap:
  api_key_file: env:NAMECHEAP_API_KEY     # secret comes from env, not a file
  username: ${NAMECHEAP_USERNAME}         # inline interpolation
  client_ip: ${NAMECHEAP_CLIENT_IP}

This pairs with any secret-injection mechanism that exports env vars to the daemon process. Common shapes:

  • Doppler: run rotad as doppler run --project rota --config prd -- rotad --config /etc/rota/rota.yaml. Doppler injects NAMECHEAP_API_KEY, etc. into the child process.
  • systemd LoadCredentialEncrypted=: drops decrypted material into $CREDENTIALS_DIRECTORY/<name>. Reference via env: if you also export the value through Environment=, or point *_file: directly at the credentials path.
  • HashiCorp Vault Agent: writes a templated env file the systemd unit reads via EnvironmentFile=.
  • Plain compose env_file: for hosts where Doppler is overkill.

Refusing to start when a referenced variable is unset means a misconfigured deploy fails loud at boot rather than silently calling a vendor API with an empty key.
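The two mechanisms can be sketched as below. interpolate and resolve_secret are hypothetical names, the lookup is passed in as a closure so the logic can be exercised without touching the real environment, and details such as escaping a literal ${ or trimming trailing newlines from secret files are assumptions.

```rust
/// Expand every ${VAR} in `input`. An unset variable is an error.
pub fn interpolate<F>(input: &str, lookup: F) -> Result<String, String>
where
    F: Fn(&str) -> Option<String>,
{
    let mut out = String::new();
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after
            .find('}')
            .ok_or_else(|| format!("unterminated ${{ in {input:?}"))?;
        let name = &after[..end];
        let value = lookup(name)
            .ok_or_else(|| format!("environment variable {name} is not set"))?;
        out.push_str(&value);
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}

/// Resolve a *_file value: "env:NAME" reads the variable, anything else
/// is treated as a path on disk.
pub fn resolve_secret<F>(spec: &str, lookup: F) -> Result<String, String>
where
    F: Fn(&str) -> Option<String>,
{
    match spec.strip_prefix("env:") {
        Some(var) => lookup(var)
            .ok_or_else(|| format!("environment variable {var} is not set")),
        None => std::fs::read_to_string(spec)
            // Assumption: trailing newline from the secret file is dropped.
            .map(|s| s.trim_end().to_string())
            .map_err(|e| format!("cannot read {spec}: {e}")),
    }
}
```

At config-load time the lookup would be something like |name| std::env::var(name).ok(), and any Err would abort startup.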

CA accounts

namecheap

namecheap:
  api_key_file: /etc/rota/secrets/namecheap-api.key
  username: your-namecheap-username
  api_user: optional-sub-account-user   # defaults to username
  client_ip: 192.0.2.1

client_ip must be on the account's whitelisted IPs in Namecheap, or the API rejects every call. The same credentials authenticate both the CA backend (reissue) and the DCV backend (DNS).

cloudflare

cloudflare:
  api_token_file: /etc/rota/secrets/cloudflare.token

Token scope: Zone.DNS:Edit on every zone rota manages. rota does not support the legacy Global API Key.

acme

acme:
  directory_url: https://acme-v02.api.letsencrypt.org/directory
  contact_email: ops@example.com
  account_credentials_file: /etc/rota/secrets/acme-account.json
  external_account_binding:           # optional; ZeroSSL et al.
    kid: <CA-assigned key id>
    hmac_key_file: /etc/rota/secrets/zerossl.hmac

Common directory URLs:

  • Let's Encrypt prod: https://acme-v02.api.letsencrypt.org/directory
  • Let's Encrypt staging: https://acme-staging-v02.api.letsencrypt.org/directory
  • ZeroSSL: https://acme.zerossl.com/v2/DV90
  • BuyPass: https://api.buypass.com/acme/directory

account_credentials_file is created on first run; treat it like a private key (mode 0600).

cluster

Omit for single-node. To enable federation:

cluster:
  enabled: true
  node_id: host-a       # unique per node
  lease_seconds: 60     # refresh cadence is lease/3 (~20s here)

Requires audit.kind: surrealdb because the lock and cert blobs live in that database. See the federation runbook for end-to-end setup.

alerts

A list. Every event fans out to every entry, so operators can mix sinks:

alerts:
  - kind: email
    smtp_host: smtp.example.com
    smtp_port: 587
    tls: starttls            # starttls (587), implicit (465), or none
    username: alerts@example.com
    password_file: /etc/rota/secrets/smtp.password
    from: rota@example.com
    to: [oncall@example.com]
  - kind: webhook
    url: https://hooks.example.com/incoming/abc
    bearer_token_file: /etc/rota/secrets/webhook.token  # optional
    timeout_seconds: 10                                  # optional, default 10

certs

Each cert picks one CA, one DCV solver, one install target:

certs:
  - id: example-public                # stable; used in logs, CLI, dashboard
    description: example.com marketing site
    domains: [example.com, www.example.com]
    key_path: /var/lib/rota/keys/example.com.key
    ca:
      kind: <namecheap | acme>
    dcv:
      kind: <namecheap | cloudflare | webroot>
    install:
      kind: <dsm | filesystem | nginx | haproxy | k8s_secret>

ca variants

ca: { kind: namecheap, ssl_id: 12345678 }
ca: { kind: acme }

dcv variants

dcv: { kind: namecheap }
dcv: { kind: cloudflare }
dcv: { kind: webroot, directory: /var/www/example }

install variants

install: { kind: dsm, description: My Public Site }
install: { kind: filesystem, directory: /etc/ssl/example }
install:
  kind: nginx
  directory: /etc/nginx/certs/example
  reload_command: [systemctl, reload, nginx]      # optional, default [nginx, -s, reload]
install:
  kind: haproxy
  directory: /etc/haproxy/certs
  socket_path: /run/haproxy/admin.sock
  cert_storage_name: /etc/haproxy/certs/example.pem
install:
  kind: k8s_secret
  namespace: ingress-nginx
  secret_name: example-tls
  kubeconfig_path: /etc/rota/kubeconfig            # optional, omit for in-cluster SA

Migration from earlier versions

v0.5 to v0.6

  • rota.yaml: rename registrar: to dcv: on every cert. The kind values (namecheap, cloudflare) are unchanged; only the parent field name moves.
  • New optional cluster: block enables multi-host federation.
  • Wire protocol bumped from 1 to 2 (CertSummary.registrar_backend becomes dcv_backend). The rota CLI must upgrade alongside rotad; older clients hit a clean version-mismatch error rather than silent misparse.

Federation runbook

Multiple rotad instances pointing at the same SurrealDB elect a single leader to run the renewal scheduler. Followers stand by for failover and install cluster-distributed certs locally with operator-pre-provisioned private keys.

Why this exists

Two operator-side use cases:

  1. High-availability renewer. A single rotad is a single point of failure. If its host goes down within renew_threshold_days of a cert's notAfter and stays down, the cert lapses. A two-node cluster with leader election keeps renewals running through host failures.
  2. Multi-host install. A cert that fronts multiple machines (load balancers, service mesh ingress, redundant API servers) needs to land on each host. With federation, one node renews and every node installs locally.

Architecture

                ┌──────────────────────────┐
                │   SurrealDB (operator)   │
                │   namespace: rota        │
                │   database: prod         │
                │                          │
                │   cluster_lock:singleton │
                │   issued_cert:<rows>     │
                │   renewal:<rows>         │
                │   renewal_event:<rows>   │
                └──────────────────────────┘
                       ▲       ▲       ▲
                       │       │       │
              ┌────────┘       │       └────────┐
              │                │                │
        ┌─────┴────┐     ┌─────┴────┐     ┌─────┴────┐
        │ rotad A  │     │ rotad B  │     │ rotad C  │
        │ leader   │     │ follower │     │ follower │
        └──────────┘     └──────────┘     └──────────┘
        ↓ scheduler        ↓ install_sync   ↓ install_sync
        ↓ renewer          ↓ poll           ↓ poll
        ↓ install (local)

All nodes share one SurrealDB. (rota doesn't care whether that's a single instance or a SurrealDB cluster.) One node holds the lock at cluster_lock:singleton and runs the renewal scheduler; the others have their schedulers gated on is_leader() and skip silently.

The leader's renewer pipeline persists every successful issuance to the issued_cert table. Every node (including the leader, but the leader's install_sync self-suppresses) runs an InstallSyncTask that polls issued_cert and runs the local InstallBackend when the audit cert is fresher than what's installed locally.
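One way to picture the follower-side decision, under two loudly stated assumptions: rows carry issued_at and not_after as Unix timestamps, and "fresher" means a later notAfter than the locally installed cert. The struct and function names are invented for the sketch; the source does not define how freshness is compared.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IssuedCert {
    pub cert_id: String,
    pub issued_at: u64, // Unix seconds (assumed representation)
    pub not_after: u64, // Unix seconds (assumed representation)
    pub cert_pem: String,
}

/// Pick the most recently issued row for one cert id.
pub fn latest_issued<'a>(rows: &'a [IssuedCert], cert_id: &str) -> Option<&'a IssuedCert> {
    rows.iter()
        .filter(|r| r.cert_id == cert_id)
        .max_by_key(|r| r.issued_at)
}

/// Install when nothing is installed locally, or when the shared cert
/// expires later than the local one.
pub fn needs_install(local_not_after: Option<u64>, shared: &IssuedCert) -> bool {
    match local_not_after {
        None => true,
        Some(local) => shared.not_after > local,
    }
}
```

Under this reading the leader's own task is naturally a no-op right after a renewal: its installed cert already has the same notAfter as the newest row.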

Trust model

The audit store carries cert PEM and chain PEM. Private keys are never written to the audit store.

Each cluster member's key_path private key is provisioned out-of-band: config-management, secrets manager, manual scp, whatever the operator already uses for sensitive material. The shared SurrealDB is in the trust boundary for cert metadata and renewal history but not for key material. If the database is compromised, an attacker can read which certs exist and when they were renewed; they cannot forge requests against the CA or impersonate any host.

Setup

1. Provision SurrealDB

Operators who already run SurrealDB can skip ahead. Otherwise the simplest setup is one surreal instance behind a reverse proxy on a stable host:

surreal start --user root --pass <root-password> file:///var/lib/surrealdb

Then create the namespace and database for rota:

surreal sql --user root --pass <root-password> --ns rota --db prod
> DEFINE NAMESPACE rota;
> DEFINE DATABASE prod;

2. Provision per-cert private keys on each node

Pick the key_path directory each node will use. Mode 0700. Place the same private key file on every cluster member that participates in installing this cert:

# On every node:
install -d -m 0700 /var/lib/rota/keys
install -m 0600 example.com.key /var/lib/rota/keys/example.com.key

The keys must be byte-identical across nodes. rota uses one key per cert (no per-node keys), so the issued cert matches the private key on every node.

3. Configure each node

Each rota.yaml is the same except for cluster.node_id:

daemon:
  database_path: /var/lib/rota/rota.db   # local SQLite for local audit only; the shared audit lives in SurrealDB
  listen_addr: 127.0.0.1:7878
  socket_path: /var/run/rota.sock
  check_interval_seconds: 3600
  renew_threshold_days: 30

audit:
  kind: surrealdb
  endpoint: wss://surreal.internal:8000
  namespace: rota
  database: prod
  username: rota
  password_file: /etc/rota/secrets/surreal.password

cluster:
  enabled: true
  node_id: host-a            # different per node: host-a, host-b, host-c
  lease_seconds: 60

# ... ca / dcv / alerts / certs blocks identical across nodes

4. Start each node

# host-a:
rotad --config /etc/rota/rota.yaml &

# host-b:
rotad --config /etc/rota/rota.yaml &

# host-c:
rotad --config /etc/rota/rota.yaml &

Whichever node wins the initial lock acquisition becomes leader. The others log cluster: still follower and stand by.

Verifying

Who's the leader?

# On every node:
rota status

Each node shows the same cert table (it's pulled from the shared audit). rotad's logs differentiate:

INFO cluster: acquired leader lock     # leader
INFO cluster: still follower           # followers

A direct query against SurrealDB:

SELECT * FROM cluster_lock:singleton;

returns the holder's node_id and lease expiry.

Did the cert distribute?

After a successful renewal:

SELECT * FROM issued_cert WHERE cert_id = 'example-public' ORDER BY issued_at DESC LIMIT 1;

shows the fresh cert blob. Each follower's install_sync task picks it up on its next poll (at most one check_interval_seconds later) and installs locally; after that, the last_renewal_status in each node's rota status output reflects the fresh cert.

Failover

When the leader dies (host crash, kernel OOM kill, network partition), its lease lapses after lease_seconds (default 60s). The next polling follower acquires the lock and becomes the new leader; renewals pick back up automatically. No operator intervention needed.

If a leader recovers from a transient failure and re-acquires the lock, no harm done: the record_issued_cert writes are append-only, and latest_issued_cert is monotonic by issued_at.
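The lease behaviour can be modelled in memory as follows. This is only the decision logic: in the real system the check-and-set would have to be a single atomic operation against SurrealDB, and the Lock struct, try_acquire, and the use of plain integer seconds are assumptions of the sketch.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Lock {
    pub holder: String,
    pub expires_at: u64, // seconds on some shared clock (assumed)
}

/// Take the lock if it is free or expired, or refresh it if `node_id`
/// already holds it. Returns true when `node_id` is leader afterwards.
pub fn try_acquire(lock: &mut Option<Lock>, node_id: &str, now: u64, lease_seconds: u64) -> bool {
    let available = match lock {
        None => true,
        Some(current) => current.holder == node_id || current.expires_at <= now,
    };
    if available {
        *lock = Some(Lock {
            holder: node_id.to_string(),
            expires_at: now + lease_seconds,
        });
    }
    available
}

/// Refresh cadence from the config comment: a third of the lease.
pub fn refresh_interval(lease_seconds: u64) -> u64 {
    lease_seconds / 3
}
```

With lease_seconds: 60 the leader refreshes about every 20 seconds, so it has to miss roughly three refreshes in a row before a follower can take over.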

Failure modes worth knowing

SurrealDB unreachable from the leader. The lease loop logs lock-check failures and demotes defensively. Followers see no leader; on their next sweep one of them tries to acquire and may succeed (if their network sees SurrealDB) or also fail. Renewals pause until SurrealDB is reachable from at least one node.

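Private key drift across nodes. If the key file at key_path is not byte-identical on every node, a follower's install still succeeds locally, but the distributed cert's public key won't match that node's private key, so the node cannot serve it. Audit this on each node by comparing the output of openssl x509 -noout -pubkey -in <cert> with openssl pkey -pubout -in <key>. (rota's keys are ECDSA, so an RSA modulus comparison does not apply.)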

Cert distribution lag. Followers poll on check_interval_seconds. With the default 1h, a follower can be up to 1h behind the leader's renewal. Tune the interval down if you need tighter sync (the cost is more SurrealDB traffic, but it's a single SELECT per cert per tick).

Rolling back to single-node

Set cluster.enabled: false (or remove the block entirely) on the surviving node and restart it. The leader lock will lapse; no other node tries to acquire. The audit store retains its history. Point the surviving node at SQLite instead of SurrealDB if you want to fully decouple.