Operations
Running gdsgate day to day: logs, audit, release verification, backup, and publishing this documentation.
Logs
gdsgate emits structured logs to stderr (stdout is reserved for
data, for example the proxy-ssh byte stream). Filter them with RUST_LOG
(default info):

```shell
RUST_LOG=info gdsgate --config auth.toml auth
RUST_LOG=gdsgate_proxy=debug,info gdsgate --config proxy.toml proxy
RUST_LOG=gdsgate_ssh=trace,info gdsgate --config agent.toml agent
```
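A filter string combines a bare default level with per-target overrides. For example, gdsgate_proxy=debug,info means debug for the proxy crate and info for everything else. The following is a simplified model of how filters in the style of env_logger or tracing-subscriber's EnvFilter resolve such a string. It assumes gdsgate uses one of them; real implementations also understand span and field filters and an off level. The target names in the usage are hypothetical.

```rust
/// Verbosity levels, ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Parses one level name; unknown names yield None and are skipped.
fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Simplified model of a "target=level,...,level" filter string.
struct Filter {
    /// Level for targets without a directive of their own; None means off.
    default: Option<Level>,
    targets: Vec<(String, Level)>,
}

impl Filter {
    fn parse(spec: &str) -> Filter {
        let mut filter = Filter { default: None, targets: Vec::new() };
        for directive in spec.split(',').map(str::trim).filter(|d| !d.is_empty()) {
            match directive.split_once('=') {
                // "target=level": an override for one target and its submodules.
                Some((target, level)) => {
                    if let Some(level) = parse_level(level) {
                        filter.targets.push((target.trim().to_string(), level));
                    }
                }
                // A bare level sets the default for everything else.
                None => {
                    if let Some(level) = parse_level(directive) {
                        filter.default = Some(level);
                    }
                }
            }
        }
        filter
    }

    /// True if an event at `level` from `target` would be emitted.
    /// The longest matching target directive wins; a target matches
    /// itself and its "::" submodules.
    fn enabled(&self, target: &str, level: Level) -> bool {
        let max = self
            .targets
            .iter()
            .filter(|(t, _)| target == t || target.starts_with(&format!("{t}::")))
            .max_by_key(|(t, _)| t.len())
            .map(|(_, l)| *l)
            .or(self.default);
        matches!(max, Some(m) if level <= m)
    }
}
```

With Filter::parse("gdsgate_proxy=debug,info"), a debug event from gdsgate_proxy::tunnel is enabled. A debug event from a hypothetical gdsgate_auth target is not, because that target falls back to the bare info default.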
Collect stderr with your platform's log pipeline (journald, the container log driver, Loki, etc.). OpenTelemetry export is a planned addition behind the single telemetry entry point; structured stderr logging is the supported path today.
Audit
Every privileged action — every authorisation decision (allow and deny), every node registration, every administrative operation, every opened session — is appended to a hash-chained audit log in Auth's state store. Each record links to the previous record's hash, so any tampering or gap is detectable. Auth refuses any grant whose audit record cannot first be made durable — persist before grant.
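The following minimal sketch makes "persist before grant" concrete. It assumes a store that either durably accepts a record or returns an error. Store, MemStore, AuditLog and authorize are hypothetical names, not the gdsgate-audit API. DefaultHasher is a non-cryptographic placeholder for whatever hash the real chain uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One chained audit record; `prev_hash` links it to its predecessor.
#[derive(Debug, Clone)]
struct Record {
    serial: u64,
    payload: String,
    prev_hash: u64,
    hash: u64,
}

/// Hypothetical stand-in for Auth's durable state store.
trait Store {
    fn persist(&mut self, rec: &Record) -> Result<(), String>;
}

/// In-memory store, for demonstration only.
#[derive(Default)]
struct MemStore {
    records: Vec<Record>,
}

impl Store for MemStore {
    fn persist(&mut self, rec: &Record) -> Result<(), String> {
        self.records.push(rec.clone());
        Ok(())
    }
}

/// ILLUSTRATION ONLY: DefaultHasher is not a cryptographic hash.
fn link_hash(serial: u64, prev_hash: u64, payload: &str) -> u64 {
    let mut h = DefaultHasher::new();
    serial.hash(&mut h);
    prev_hash.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

struct AuditLog<S: Store> {
    store: S,
    next_serial: u64,
    last_hash: u64,
}

impl<S: Store> AuditLog<S> {
    fn new(store: S) -> Self {
        AuditLog { store, next_serial: 1, last_hash: 0 }
    }

    /// Chains the payload onto the log. The chain head advances only
    /// after the store has accepted the record.
    fn append(&mut self, payload: &str) -> Result<Record, String> {
        let rec = Record {
            serial: self.next_serial,
            payload: payload.to_string(),
            prev_hash: self.last_hash,
            hash: link_hash(self.next_serial, self.last_hash, payload),
        };
        self.store.persist(&rec)?;
        self.next_serial += 1;
        self.last_hash = rec.hash;
        Ok(rec)
    }
}

#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny,
}

/// Persist before grant: Allow is returned only once its record is durable.
/// Denials are recorded too; a record that cannot be written never grants.
fn authorize<S: Store>(log: &mut AuditLog<S>, principal: &str, resource: &str, policy_allows: bool) -> Decision {
    let verdict = if policy_allows { "allow" } else { "deny" };
    let payload = format!("sshConnect principal={principal} resource={resource} decision={verdict}");
    match log.append(&payload) {
        Ok(_) if policy_allows => Decision::Allow,
        _ => Decision::Deny,
    }
}
```

With a store that always fails, authorize returns Deny even when policy allows, and the chain head does not advance.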
Export formats
Records export to the formats your SIEM expects:
- Canonical JSON — event fields plus hex hashes of this and the previous record.
- Splunk HEC — the JSON event wrapped in a HEC envelope.
- CEF / syslog — a single CEF line for classic SIEMs.
The export helpers live in the gdsgate-audit library; in a real
deployment a small sidecar tails the store and ships records to the
SIEM in whichever format is required.
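To illustrate the CEF option, here is a sketch of a formatter that a sidecar might use. It assumes a flat event with action, principal, resource, decision and reason fields. The vendor and product strings, the extension keys, and the severity mapping are assumptions. The escaping follows the CEF rules: backslash and pipe in header fields; backslash, equals sign and line breaks in extension values.

```rust
/// Escapes a CEF header field: backslash first, then pipe.
fn esc_header(s: &str) -> String {
    s.replace('\\', "\\\\").replace('|', "\\|")
}

/// Escapes a CEF extension value: backslash, equals sign and line breaks.
fn esc_ext(s: &str) -> String {
    s.replace('\\', "\\\\")
        .replace('=', "\\=")
        .replace('\r', "\\r")
        .replace('\n', "\\n")
}

/// Hypothetical audit event shape; not the gdsgate-audit type.
struct AuditEvent<'a> {
    action: &'a str,
    principal: &'a str,
    resource: &'a str,
    decision: &'a str,
    reason: &'a str,
}

/// Renders one CEF line (without the syslog header a shipper would prepend).
fn to_cef(ev: &AuditEvent, product_version: &str) -> String {
    // Hypothetical severity mapping: denials rank above allows.
    let severity = if ev.decision == "deny" { 7 } else { 3 };
    format!(
        "CEF:0|{}|{}|{}|{}|{}|{}|suser={} act={} cs1Label=resource cs1={} msg={}",
        esc_header("gdsgate"),
        esc_header("gdsgate"),
        esc_header(product_version),
        esc_header(ev.action),
        esc_header(&format!("{} {}", ev.action, ev.decision)),
        severity,
        esc_ext(ev.principal),
        esc_ext(ev.decision),
        esc_ext(ev.resource),
        esc_ext(ev.reason),
    )
}
```

An equals sign inside a value such as a Cedar reason is escaped, so a SIEM parser does not split it into a bogus key.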
Chain verification
Ship the log to your SIEM and periodically verify the chain on the source store. A verification gap (missing serials, broken hash link, fork) is an alert. A good cadence is once an hour from a host that has read access to the store but is otherwise outside the control plane.
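The following sketches the structural part of that check. It assumes exported records carry a serial plus the hex hashes of the record and its predecessor, as the canonical JSON export does. The type names are hypothetical. The exact hashing scheme is not documented here, so the sketch only checks linkage. A complete verifier would also recompute each record's hash to catch edited payloads.

```rust
/// Fields taken from the canonical JSON export.
#[derive(Debug, Clone)]
struct Exported {
    serial: u64,
    hash: String,
    prev_hash: String,
}

#[derive(Debug, PartialEq)]
enum ChainError {
    /// Two records share a serial.
    Fork { serial: u64 },
    /// One or more serials are missing.
    Gap { expected: u64, found: u64 },
    /// A record's prev_hash does not match its predecessor's hash.
    BrokenLink { serial: u64 },
}

/// Structural check of an exported chain. It does NOT recompute hashes.
fn verify_chain(records: &[Exported]) -> Result<(), ChainError> {
    let mut sorted: Vec<&Exported> = records.iter().collect();
    sorted.sort_by_key(|r| r.serial);
    for pair in sorted.windows(2) {
        let (prev, cur) = (pair[0], pair[1]);
        if cur.serial == prev.serial {
            return Err(ChainError::Fork { serial: cur.serial });
        }
        if cur.serial != prev.serial + 1 {
            return Err(ChainError::Gap { expected: prev.serial + 1, found: cur.serial });
        }
        if cur.prev_hash != prev.hash {
            return Err(ChainError::BrokenLink { serial: cur.serial });
        }
    }
    Ok(())
}
```

Any Err from a scheduled run maps directly onto the alert described above.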
What lands in audit
| Event class | Examples |
|---|---|
| Authorisation | sshConnect, dbConnect, k8sAccess, tcpConnect, view, sshForwardLocal, sshForwardRemote, mintOnwardSshCert — each with the principal, resource, decision, and Cedar reason. |
| Capability grant | Node registration (one-time token consumed; signed transport certificate handed out). |
| Administration | rotateCA, editPolicy, createUser, disableUser, JIT approvals. |
| Session data | SSH session recording metadata, database query log, MCP tool calls. |
Release verification
Releases are signed and byte-for-byte reproducible. Re-verify
what you run, and (for the highest assurance) rebuild from the
tagged commit and confirm the binary's sha256 matches the published
one.
Verifying signatures
Verify against the cosign.pub that belongs to the release tooling, not
against a developer's local key.
If verification fails, the artefacts are not from the project (or were tampered with) — stop and investigate.
Reproducing the build
Each release pins the toolchain version and the build environment so
an independent rebuild from the same source tag produces a sha256
identical to the published one. Build determinism comes from a fixed
SOURCE_DATE_EPOCH, TZ=UTC, a C.UTF-8 locale, a pinned
rustc, and a reproducible container image.
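The settings above map onto environment variables that a build wrapper can set. This hypothetical sketch only assembles the command. Several details are assumptions: SOURCE_DATE_EPOCH is passed in from the tagged commit's timestamp, the build runs cargo build --release --locked, and LC_ALL sets the locale. The toolchain pin and container image are left to the surrounding pipeline.

```rust
use std::process::Command;

/// Hypothetical helper: builds (but does not run) a `cargo build`
/// invocation carrying the determinism settings listed above.
fn reproducible_build(source_date_epoch: u64) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(["build", "--release", "--locked"])
        // Fixed timestamp for anything that embeds a build time.
        .env("SOURCE_DATE_EPOCH", source_date_epoch.to_string())
        // Fixed time zone and locale so output does not vary per host.
        .env("TZ", "UTC")
        .env("LC_ALL", "C.UTF-8");
    cmd
}
```

To complete the check, run the command with .status() inside the pinned build image, then compare the resulting binary's sha256 with the published one.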
SBOM
A CycloneDX SBOM is published with each release.
Consume it from your supply-chain pipeline (Trivy, Grype, Snyk, etc.).
Backup and disaster recovery
- Back up Auth's state store. It holds the audit chain, the transport CA's private key, the SSH CAs' private keys, the registration-token registry, the resource catalog, and the access-request queue. Losing it loses cluster trust continuity. A restored store lets Auth resume the same transport CA, so already-registered nodes keep their identity.
- Persist each node's state_dir. Combined with the store, it lets nodes reuse their identity across restarts without re-registering. It is also where an agent's persistent SSH host key lives, so the user's known_hosts entry keeps verifying.
- Export the audit log off-box (above) so history survives store loss.
- Credentials are short-lived: after a control-plane outage, clients simply re-run login and nodes re-register near expiry.
Restoring a cluster
- Restore the store from backup.
- Bring Auth up against the restored store. Auth resumes the transport CA from the store, so existing client transport-ca.pem anchors still verify.
- Bring Proxy and Agents up against their persisted state_dirs. They reuse their certificates without re-registering.
- Clients reuse their cached identity tokens until they expire, then re-run login against the identity provider.
A restored cluster has come up cleanly when chain verification passes against the restored store.
Operational sanity checks
A few things worth wiring into ongoing monitoring:
- Auth liveness: gRPC health check on endpoints.auth.
- Proxy liveness: gRPC health check on endpoints.proxy_public.
- Tunnel registrations: Proxy logs an event for each agent registration and each loss; alert on a loss that persists.
- Audit chain depth: monotonic; alert when it stalls.
- CA rotation: should complete within the configured propagation_secs and retire_secs windows; alert on a stuck rotation.
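Two of these checks reduce to small pieces of state, shown here with hypothetical names. DepthWatch flags a chain depth that goes backwards or stops growing for longer than max_idle. rotation_stuck assumes a rotation runs its propagation_secs window and then its retire_secs window back to back, plus an alerting margin you choose.

```rust
use std::time::{Duration, Instant};

/// Tracks the audit chain depth between polls and flags stalls.
struct DepthWatch {
    last_depth: u64,
    last_change: Instant,
    max_idle: Duration,
}

impl DepthWatch {
    fn new(depth: u64, now: Instant, max_idle: Duration) -> Self {
        DepthWatch { last_depth: depth, last_change: now, max_idle }
    }

    /// Returns true when an alert should fire: the depth went backwards
    /// (it must be monotonic) or has not grown for longer than `max_idle`.
    fn observe(&mut self, depth: u64, now: Instant) -> bool {
        if depth < self.last_depth {
            return true;
        }
        if depth > self.last_depth {
            self.last_depth = depth;
            self.last_change = now;
            return false;
        }
        now.duration_since(self.last_change) > self.max_idle
    }
}

/// Assumes propagation then retirement run back to back; `grace_secs`
/// is a hypothetical alerting margin.
fn rotation_stuck(elapsed_secs: u64, propagation_secs: u64, retire_secs: u64, grace_secs: u64) -> bool {
    elapsed_secs > propagation_secs.saturating_add(retire_secs).saturating_add(grace_secs)
}
```

Pick max_idle with traffic in mind: a quiet cluster may legitimately append nothing for a while.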