Admin guide¶

This page is for operators bringing up a gdsgate cluster and keeping it running. It covers the supported deployment shapes, installation, registration and PKI lifecycle, network zoning, the state store, audit export, and the hardening checklist.

For the what and why, see Concepts. For every config field with defaults and examples, see Configuration.

Deployment shapes¶

All-in-one¶

One process runs Auth, Proxy, and an embedded agent against one state store. It is the simplest topology — useful for development, demos, and a small single-node deployment. Set store_url to a persistent location (a file-backed SQLite or PostgreSQL) and the cluster survives restarts: the transport CA is restored from the store and registered nodes keep trusting it. With an in-memory store_url (default), every restart yields a fresh CA — only useful for ephemeral tests.

An all-in-one cluster also accepts externally registered agents (your laptop, a remote sidecar), because the registration listener and the transport CA used on the internal channel come from the same persistent store. There is no separate "join-capable" mode.

Pros:

one binary, one config, one process to operate;
the simplest path to first traffic.

Cons:

single process, single failure domain — there is no internal failover;
the public listener stays plaintext by default in this mode, because the same anchor is used internally. Turning on public TLS still works, but you then have to distribute the CA anchor to clients.

Multi-node¶

Separate auth, proxy, and agent processes that authenticate to each other with mutual TLS. New nodes obtain their transport identity by registration (a one-time bootstrap token) and persist it. This is the production shape.

Pros:

horizontal scaling: many proxies in front of one Auth, many agents in the protected zone;
failure isolation: a proxy or agent restart does not touch the control plane;
HA: several Auth instances sharing one store, with a single audit write-leader at a time.

High availability¶

Set [ha].enabled = true on every Auth instance and point them at one shared PostgreSQL store_url. Each instance starts as a follower (refusing to write) and races for the audit write-leader lease through the store. The single leader handles authorisation writes; the followers serve only reads. A follower takes over within roughly lease_ttl_secs of the leader's death. This is failover for the single linear audit chain, not horizontal write scaling.
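The lease race can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `LeaseStore`, `try_acquire`, and the 10-second TTL are hypothetical names and values standing in for the shared PostgreSQL store and `lease_ttl_secs`. It does not show how gdsgate actually implements the lease.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseStore:
    """Hypothetical stand-in for the shared store holding the write-leader lease."""

    def __init__(self, ttl_secs: float) -> None:
        self.ttl_secs = ttl_secs
        self.lease: Optional[Lease] = None

    def try_acquire(self, instance: str, now: float) -> bool:
        """Take or renew the lease if it is free, expired, or already ours."""
        cur = self.lease
        if cur is None or cur.expires_at <= now or cur.holder == instance:
            self.lease = Lease(holder=instance, expires_at=now + self.ttl_secs)
            return True
        return False


store = LeaseStore(ttl_secs=10)
a_leads = store.try_acquire("auth-a", now=0)        # auth-a wins the race
b_blocked = store.try_acquire("auth-b", now=5)      # auth-b stays a follower
a_renews = store.try_acquire("auth-a", now=8)       # leader renews, lease ends at t=18
b_early = store.try_acquire("auth-b", now=17)       # auth-a died after t=8, lease still valid
b_takes_over = store.try_acquire("auth-b", now=18)  # lease expired, follower takes over
```

In this model the follower can only take over once the last renewal has aged out, which is why failover time is bounded by roughly one lease TTL.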

Installation¶

gdsgate ships as one binary. The CLI is a subcommand of it — there is no separate client to install.

From a release¶

Two artefacts per release:

Artefact	Notes
`gdsgate-<tag>-x86_64-unknown-linux-gnu`	dynamically linked (glibc)
`gdsgate-<tag>-x86_64-unknown-linux-musl`	static, self-contained

Download the binary and the integrity files, then verify before use. Verification needs cosign and the project's published cosign.pub public key.

# 1. checksums file signed by the project's cosign key
cosign verify-blob --key cosign.pub --signature SHA256SUMS.sig SHA256SUMS

# 2. artefacts match the checksums
sha256sum -c SHA256SUMS

# 3. (optional) inspect the CycloneDX SBOM
jq '.metadata.component.name, (.components | length)' gdsgate-<tag>.cdx.json

install -m 0755 gdsgate-<tag>-x86_64-unknown-linux-musl /usr/local/bin/gdsgate
gdsgate --version

If step 1 fails, the checksums file is not from the project or was tampered with. If step 2 fails, an artefact does not match the signed checksums. In either case, stop.

Each release is byte-for-byte reproducible: an independent rebuild from the same commit produces an identical sha256. See Operations → Release verification.
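Comparing an independent rebuild against the published checksums can be scripted. The sketch below assumes SHA256SUMS uses the usual `sha256sum` layout (`<hex digest>  <file name>` per line); `parse_sums`, `file_sha256`, and `verify` are illustrative helper names, not part of gdsgate. It covers only the checksum comparison and is not a substitute for the cosign signature check in step 1.

```python
import hashlib
from pathlib import Path


def parse_sums(text: str) -> dict[str, str]:
    """Parse sha256sum output into {file name: lowercase hex digest}."""
    sums = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        sums[name.lstrip("*")] = digest.lower()  # '*' marks binary mode
    return sums


def file_sha256(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large binaries do not fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path: Path, sums: dict[str, str]) -> bool:
    """True only if the file is listed and its digest matches the listed one."""
    expected = sums.get(path.name)
    return expected is not None and file_sha256(path) == expected
```

A rebuilt binary that is byte-for-byte identical passes `verify` against the release's SHA256SUMS; any difference, or a file that is not listed, fails.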

As a systemd unit¶

Run one role per unit, each with its own config. Example for an agent:

# /etc/systemd/system/gdsgate-agent.service
[Unit]
Description=gdsgate agent
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/gdsgate --config /etc/gdsgate/agent.toml agent
# Bootstrap token (one-time secret) — keep out of the long-lived config file.
# Use an EnvironmentFile with mode 0600 to ship it on boot.
EnvironmentFile=-/etc/gdsgate/agent.env
DynamicUser=yes
StateDirectory=gdsgate
WorkingDirectory=/var/lib/gdsgate
CapabilityBoundingSet=
AmbientCapabilities=
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/var/lib/gdsgate
Restart=on-failure
RestartSec=2s
LimitNOFILE=65535

[Install]
WantedBy=multi-user.target

/etc/gdsgate/agent.env:

GDSGATE_ENROLL_TOKEN=<one-time token>

chmod 0600 /etc/gdsgate/agent.env
systemctl daemon-reload
systemctl enable --now gdsgate-agent

Point [enroll].state_dir at /var/lib/gdsgate (the directory systemd provisions for StateDirectory=gdsgate) so the persisted identity survives restarts.

As a container¶

The same binary runs in a minimal image. It needs no root and no Linux capabilities, so run it unprivileged with a read-only root filesystem:

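FROM debian:bookworm-slim
COPY gdsgate /usr/local/bin/gdsgate
# Create the unprivileged user and a state directory it owns; an empty named
# volume mounted at /var/lib/gdsgate takes over this ownership on first use.
RUN useradd --uid 10001 --create-home gdsgate \
 && install -d -o gdsgate -g gdsgate /var/lib/gdsgate
USER gdsgate
ENTRYPOINT ["gdsgate"]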

docker run --rm \
  --read-only --tmpfs /tmp \
  --cap-drop ALL --security-opt no-new-privileges \
  -v gdsgate-state:/var/lib/gdsgate \
  -v /etc/gdsgate:/etc/gdsgate:ro \
  -e GDSGATE_ENROLL_TOKEN="$token" \
  myorg/gdsgate --config /etc/gdsgate/agent.toml agent

Bringing up a multi-node cluster¶

This section walks through a production-shaped cluster, end to end.

1 — Auth + the state store¶

Auth holds the audit chain, the certificate authorities (their private keys), and the registration-token registry. Give it a persistent store:

# auth.toml
profile = "prod"
store_url = "postgres://gdsgate:…@store-db:5432/gdsgate"

[endpoints]
auth = "0.0.0.0:50051"          # mTLS control plane
auth_enroll = "0.0.0.0:50050"   # plaintext bootstrap registration

[policy]
path = "/etc/gdsgate/policy.cedar"

[oidc]
issuer = "https://idp.example.com/realms/gdsgate"
client_id = "gdsgate"

gdsgate --config auth.toml auth

On first start Auth generates and persists the transport CA, the User SSH CA, and the Onward SSH CA. On restart it resumes the same authorities, so already-registered nodes keep trusting it.

2 — Bootstrap tokens¶

Auth's control plane is mTLS-only, so a standalone Proxy or Agent must register before it can talk to it. Generate a one-time token per node. The command works against Auth's store, so run it with the same store_url (here, the same config file):

gdsgate --config auth.toml auth create-token --role proxy --ttl 3600
gdsgate --config auth.toml auth create-token --role agent --ttl 3600

Each prints one token on stdout (logs go to stderr). The token is single-use and short-lived. Hand it to the joining node out of band (a systemd EnvironmentFile, a container secret, a Vault retrieval — not a long-lived config file).
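The two token properties, single-use and short-lived, can be modelled in a few lines. This is a sketch only: `TokenRegistry`, `create`, and `redeem` are hypothetical names, and binding a token to the role it was minted for is an assumption suggested by the `--role` flag, not a documented guarantee.

```python
import secrets
from dataclasses import dataclass


@dataclass
class TokenRecord:
    role: str
    expires_at: float
    used: bool = False


class TokenRegistry:
    """Hypothetical model of a registration-token registry."""

    def __init__(self) -> None:
        self._tokens: dict[str, TokenRecord] = {}

    def create(self, role: str, ttl_secs: int, now: float) -> str:
        """Mint an unguessable token that expires ttl_secs after now."""
        token = secrets.token_urlsafe(32)
        self._tokens[token] = TokenRecord(role=role, expires_at=now + ttl_secs)
        return token

    def redeem(self, token: str, role: str, now: float) -> bool:
        """Accept a token once, for its own role, and only before it expires."""
        rec = self._tokens.get(token)
        if rec is None or rec.used or rec.role != role or now >= rec.expires_at:
            return False
        rec.used = True
        return True
```

Because a redeemed token is marked used, a copy that leaks after the node has joined is worthless, which is the reason to keep it out of long-lived config files only until first use.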

3 — Register the Proxy¶

# proxy.toml
[endpoints]
auth = "auth:50051"
proxy_public = "0.0.0.0:50061"
proxy_internal = "0.0.0.0:50062"
proxy_ws = "0.0.0.0:50063"

[enroll]
endpoint = "http://auth:50050"
state_dir = "/var/lib/gdsgate/proxy"

GDSGATE_ENROLL_TOKEN="$proxy_token" \
    gdsgate --config proxy.toml proxy

The Proxy presents the token plus a certificate-signing request, receives a signed transport leaf and the trust bundle, and persists them to state_dir. From now on:

the public listener serves TLS verified by the cluster transport CA;
the internal listener requires mutual TLS;
the Auth client runs mutual TLS off the same identity.

4 — Register the Agent¶

The agent declares the resources it serves and the tunnel target:

# agent.toml
[endpoints]
auth = "auth:50051"
proxy_internal = "proxy:50062"
proxy_ws = "proxy:50063"

[enroll]
endpoint = "http://auth:50050"
state_dir = "/var/lib/gdsgate/agent"

[agent]
id = "edge-1"

[[agent.backends]]
resource = "prod-db"
kind = "postgres"
addr = "10.0.0.5:5432"

[[agent.backends]]
resource = "jump-host"
kind = "ssh"

GDSGATE_ENROLL_TOKEN="$agent_token" \
    gdsgate --config agent.toml agent

The Agent registers, opens the reverse tunnel to the Proxy, registers its resources, and is ready to serve.

5 — Give clients the transport CA¶

Clients verify the Proxy's public TLS against the cluster transport CA. That anchor is the transport-ca.pem an enrolled node writes into its state_dir (every node receives the same cluster CA). Distribute that file to your users and point their client config at it:

# client.toml
[endpoints]
proxy_public = "gdsgate.example.com:50061"

[client]
transport_ca = "/etc/gdsgate/transport-ca.pem"

[oidc]
issuer = "https://idp.example.com/realms/gdsgate"
client_id = "gdsgate"

After that, the User guide is everything a user needs.

Registration lifecycle and PKI¶

What gets persisted where¶

Auth's state store: the audit chain, the transport CA private key, the User SSH CA and Onward SSH CA private keys (all rotations retained), the registration-token registry, the resource catalogue (seeded from [discovery]), and the access-request queue.
Each node's state_dir: its own transport key and certificate, the cluster's transport CA anchor (transport-ca.pem), and (for agents serving SSH model A) the persistent SSH host key.

Together they keep the cluster consistent across restarts. Back up the store; persist the state dir. A restored store lets Auth resume the same transport CA, so already-registered nodes keep working.

Re-registration near expiry¶

Node transport certificates are short-lived. A long-running node should have a fresh token available near expiry so it can re-register. Two common ways:

ship a fresh token through the systemd EnvironmentFile on the next restart, and let your supervisor restart the process well before expiry;
if you orchestrate from CI, generate a new token on schedule and roll the unit.
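A scheduler needs a concrete rule for "well before expiry". One common convention, assumed here and not taken from gdsgate, is to roll the unit once a fixed fraction of the certificate's lifetime has elapsed. The sketch below only does the date arithmetic; `renew_after` and `should_renew` are hypothetical helpers, and reading the validity window from the persisted certificate is left out.

```python
from datetime import datetime, timedelta, timezone


def renew_after(not_before: datetime, not_after: datetime,
                fraction: float = 0.5) -> datetime:
    """Instant after which the node should be rolled with a fresh token."""
    return not_before + (not_after - not_before) * fraction


def should_renew(not_before: datetime, not_after: datetime,
                 now: datetime, fraction: float = 0.5) -> bool:
    """True once the chosen fraction of the certificate lifetime has passed."""
    return now >= renew_after(not_before, not_after, fraction)


# Example: a certificate valid for 24 hours, renewed at the halfway point.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)
too_early = should_renew(issued, expires, issued + timedelta(hours=10))
due = should_renew(issued, expires, issued + timedelta(hours=20))
```

Whatever fraction you pick, the interval between the renewal point and expiry should comfortably cover a failed roll and a retry.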

CA rotation¶

Two CAs are operator-rotatable, on top of the persistent transport CA:

User SSH CA: issues access certificates. Rotate it to limit the blast radius of a compromised signer, or simply on a schedule. Auth drives a paced double-signing rotation: the new generation is published as a candidate, verifiers refresh their trust bundle (propagation_secs), the candidate is promoted to the signing CA, and the old generation is retired only after retire_secs. Set retire_secs above the issued certificate TTL so existing certificates expire naturally before their CA is dropped.
Onward SSH CA: issues the OpenSSH user certificates the agent presents to downstream sshd in the model-B jump-host path. It follows the same paced rotation, and downstream sshd configurations must trust the rotating CA bundle: gdsgate auth onward-ca-pubs prints every active and retiring public key, one OpenSSH line each, ready to drop into the file named by sshd's TrustedUserCAKeys.

Enable scheduled rotation via [ca_rotation], or trigger one manually:

gdsgate --config auth.toml auth rotate-ca          # User SSH CA
gdsgate --config auth.toml auth rotate-onward-ca   # Onward SSH CA
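The timing constraint on retire_secs is easy to check before enabling rotation. The sketch below assumes that retire_secs is counted from the moment the candidate is promoted; that reading, and the helper names `rotation_timeline` and `old_ca_outlives_certs`, are assumptions made for illustration.

```python
def rotation_timeline(start: int, propagation_secs: int, retire_secs: int) -> dict:
    """Phase boundaries of one paced rotation, in seconds since `start`."""
    promoted = start + propagation_secs
    return {
        "candidate_published": start,
        "promoted": promoted,                   # new CA starts signing
        "old_retired": promoted + retire_secs,  # old CA leaves the trust bundle
    }


def old_ca_outlives_certs(cert_ttl_secs: int, retire_secs: int) -> bool:
    """The last certificate signed by the old CA is issued just before promotion
    and lives cert_ttl_secs, so the old CA must be trusted for longer than that."""
    return retire_secs > cert_ttl_secs


timeline = rotation_timeline(start=0, propagation_secs=300, retire_secs=7200)
last_old_cert_expiry = timeline["promoted"] + 3600  # with a 1-hour certificate TTL
```

With these example numbers the last old-generation certificate expires at 3900 s, well before the old CA is retired at 7500 s.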

The transport CA is not runtime-rotatable in v1: rotating it would invalidate every enrolled node at once. Plan it as a controlled re-key event (re-register every node).

Just-in-time access — configuring approvers¶

JIT approval has two independent controls.

Control	Decides	Configured in
Cedar `approveRequest`	Who may sign off on a pending request	`[policy].path`
`[approvals]` cascade	How many distinct approvers are required	`[approvals]` and `[[discovery.resources]].min_approvers`

Who: Cedar `approveRequest`¶

Permit the approver group(s) in your Cedar policy. The simplest case:

permit(principal, action == Action::"approveRequest", resource)
when { principal in Group::"sre" };

Tighten with MFA-age, an open ticket, and so on. See Policy → Approving access requests for the patterns and current Cedar-context limitations.

How many: the `[approvals]` cascade¶

The number of distinct approvers a request needs is resolved by a narrowest-wins cascade, evaluated against the resource the request targets. The first level that sets a value wins:

Per-resource — min_approvers on the catalogue entry:

[[discovery.resources]]
id            = "db-orders_prod"
kind          = "postgres"
min_approvers = 3

Per-environment — [approvals].per_environment:

[approvals]
min_approvers = 1

[approvals.per_environment]
prod    = 2
staging = 1
dev     = 1

Global — [approvals].min_approvers. The floor is 1.

The resource's environment comes from the catalogue (seeded from [discovery]) — so JIT thresholds rely on [discovery] being populated for every JIT-gated resource.
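The cascade can be written as a short lookup. This is a sketch, not gdsgate's code: `required_approvers` is a hypothetical function, and applying the floor of 1 to the final result (not only to the global value) is an assumption.

```python
from typing import Optional


def required_approvers(resource_min: Optional[int],
                       environment: Optional[str],
                       per_environment: dict[str, int],
                       global_min: int) -> int:
    """Narrowest-wins: per-resource, then per-environment, then global."""
    if resource_min is not None:
        n = resource_min
    elif environment is not None and environment in per_environment:
        n = per_environment[environment]
    else:
        n = global_min
    return max(1, n)  # assumed: the floor of 1 applies to the final result


per_env = {"prod": 2, "staging": 1, "dev": 1}
strict_db = required_approvers(3, "prod", per_env, 1)     # per-resource override
other_prod = required_approvers(None, "prod", per_env, 1) # per-environment value
uncatalogued = required_approvers(None, None, per_env, 1) # falls back to global
```

The last call shows why the catalogue matters: a resource with no known environment silently falls back to the global value.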

Day-to-day flow¶

# Requester
gdsgate request-access db-orders_prod \
    --reason "PROD-1234, read-only on orders" --ttl 3600

# Approvers
gdsgate requests              # list pending
gdsgate approve <request-id>

Every review (allow and deny) is recorded in the audit log. Once the threshold is reached, the request transitions to approved with expires_at = now + requested_ttl. An active approval is exposed to the Cedar context of the requester's subsequent access call as context.approved_request.expires — your policy then unblocks the connect for the remaining TTL (see Policy → JIT approval as the only way in).
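The threshold and expiry rules above can be sketched as a tiny state machine. `AccessRequest`, `approve`, and `is_active` are hypothetical names; deny handling is omitted because this guide does not describe it. The `is_active` check mirrors the policy-side comparison of the approval's expiry against the current time.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AccessRequest:
    requester: str
    requested_ttl: int
    min_approvers: int
    approvers: set[str] = field(default_factory=set)
    status: str = "pending"
    expires_at: Optional[float] = None

    def approve(self, approver: str, now: float) -> str:
        """Record one approval; flip to approved once enough distinct people agree."""
        if self.status != "pending":
            return self.status
        self.approvers.add(approver)  # a set, so repeat approvals count once
        if len(self.approvers) >= self.min_approvers:
            self.status = "approved"
            self.expires_at = now + self.requested_ttl
        return self.status

    def is_active(self, now: float) -> bool:
        """True while the approval exists and has not yet expired."""
        return (self.status == "approved"
                and self.expires_at is not None
                and self.expires_at > now)


req = AccessRequest(requester="dana", requested_ttl=3600, min_approvers=2)
first = req.approve("alice", now=50)
repeat = req.approve("alice", now=60)  # same approver again, still pending
second = req.approve("bob", now=100)   # threshold reached
```

Note that the TTL clock starts when the threshold is reached, not when the request was filed.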

A worked deployment¶

# auth.toml — operator-side cascade
[approvals]
min_approvers = 1

[approvals.per_environment]
prod    = 2
staging = 1

# Per-resource override for the strictest backend
[[discovery.resources]]
id            = "db-orders_prod"
kind          = "postgres"
min_approvers = 3

// policy.cedar (fragment)
permit(principal, action == Action::"approveRequest", resource)
when { principal in Group::"sre" };

permit(principal, action == Action::"dbConnect", resource)
when {
    resource.environment == "prod"
    && context has approved_request
    && context.approved_request.expires > context.timestamp
};

forbid(principal, action == Action::"dbConnect", resource)
when {
    resource.environment == "prod"
    && !(context has approved_request)
};

Validate before deploy:

gdsgate auth policy validate /etc/gdsgate/policy.cedar

Operational notes¶

Self-approval is not blocked by Cedar in v1, because the request's requester is not in the Cedar context for approveRequest. Enforce the separation operationally: keep requesters and approvers in distinct Cedar groups, and monitor the audit log for a requester who also appears as a reviewer of the same request.
There are no built-in notifications. Wire alerts off the audit log's createAccessRequest and approveRequest events.
Per-environment thresholds work today; per-environment approver groups (e.g. prod-approvers for prod-only) cannot yet be expressed in Cedar because the target resource of the request is not in the context for approveRequest.
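The self-approval check from the first note can run over exported audit events. The event names createAccessRequest and approveRequest come from this guide; the field names `action`, `request_id`, and `principal` are assumptions about the export schema, so adapt them to what your export actually contains.

```python
def find_self_approvals(events: list[dict]) -> list[str]:
    """Return the ids of requests that were approved by their own requester."""
    requester_of: dict[str, str] = {}
    flagged: list[str] = []
    for ev in events:  # events are assumed to be in chain order
        if ev["action"] == "createAccessRequest":
            requester_of[ev["request_id"]] = ev["principal"]
        elif ev["action"] == "approveRequest":
            if requester_of.get(ev["request_id"]) == ev["principal"]:
                flagged.append(ev["request_id"])
    return flagged


sample = [
    {"action": "createAccessRequest", "request_id": "r1", "principal": "dana"},
    {"action": "approveRequest", "request_id": "r1", "principal": "alice"},
    {"action": "createAccessRequest", "request_id": "r2", "principal": "bob"},
    {"action": "approveRequest", "request_id": "r2", "principal": "bob"},
]
suspicious = find_self_approvals(sample)
```

The same loop is a natural place to raise a notification on every createAccessRequest event.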

Network zoning¶

Place the components in segmented networks so the topology itself enforces the access path. A reference layout uses four zones:

Zone	Members	Purpose
`edge`	client, Proxy public listener, identity provider	Public client access
`control`	Auth, Proxy, state store, identity provider (JWKS)	Control plane + state
`backend`	agent, resources	Protected zone — resources reachable only here
`uplink`	agent → Proxy, Auth	Agent egress: registration + reverse tunnel

flowchart LR
  client([client]) --- edge
  edge --- proxy
  proxy --- control
  control --- auth & store[(store)]
  agent --- backend
  backend --- resources[(resources)]
  agent --- uplink
  uplink --- proxy

Key properties:

Agent sits in backend with the resources and reaches out over uplink to the Proxy. Nothing connects into the backend zone — the reverse tunnel is outbound-only.
Proxy, Auth, store are not on backend — even they cannot reach a resource directly. The only path is the agent's tunnel.
Clients are on edge only — they can reach neither the protected zone nor the control plane; only the Proxy's public TLS.

With real firewalling, this reduces to two rules: "allow egress from the agent to the Proxy and Auth; deny all inbound to the protected zone".
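The key properties follow directly from zone membership, which makes them easy to assert in a test for your own layout. The sketch below encodes the reference table with shortened, hypothetical member names (`idp` for the identity provider) and treats two components as mutually reachable only if they share a zone. It models adjacency, not connection direction.

```python
ZONES: dict[str, set[str]] = {
    "edge": {"client", "proxy", "idp"},
    "control": {"auth", "proxy", "store", "idp"},
    "backend": {"agent", "resources"},
    "uplink": {"agent", "proxy", "auth"},
}


def share_zone(a: str, b: str) -> bool:
    """True if both components are attached to at least one common zone."""
    return any(a in members and b in members for members in ZONES.values())


# Components that must never sit on a network with the protected resources.
must_not_reach_resources = ["client", "proxy", "auth", "store"]
violations = [c for c in must_not_reach_resources if share_zone(c, "resources")]
```

An empty `violations` list confirms that the agent is the only component sharing a network with the resources.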

State store¶

Auth's store is the cluster's source of truth.

Store	When	Notes
`sqlite::memory:`	development, unit tests, `gdsgate all` quick demo	Lost on restart — every restart rolls fresh CAs.
`sqlite:///var/lib/gdsgate/state.db?mode=rwc`	single-node, small deployments	File-backed. Back up the file. Survives restarts; external nodes can register.
`postgres://…`	multi-node, HA	Required for multiple Auth instances sharing the audit chain. Use a real backup strategy.

Whatever you choose, back it up. A restored store lets Auth resume the same transport CA, so already-registered nodes keep their identity across a restore. Without it you would re-register every node and re-issue every credential.

Audit export¶

Records are appended to a hash-chained log. They export as:

Canonical JSON — event fields plus hex hashes of this and the previous record;
Splunk HEC — the JSON event wrapped in a HEC envelope;
CEF / syslog — a single CEF line for classic SIEMs.

Ship the log to your SIEM and periodically verify the chain (a gap or a broken link is an alert). See Operations → Audit.
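Periodic chain verification amounts to recomputing each link. The sketch below illustrates the idea on a made-up record layout: the field names `event`, `prev_hash`, and `hash`, the all-zero genesis value, and the hash input (previous hash followed by sorted-key JSON) are all assumptions. gdsgate's actual canonical form may differ, so treat this as a model of the check, not a verifier for real exports.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed previous-hash value for the first record


def record_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous link together with a canonical form of the event."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode()).hexdigest()


def append(chain: list[dict], event: dict) -> None:
    """Add a record that commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev_hash": prev,
                  "hash": record_hash(prev, event)})


def first_broken_link(chain: list[dict]) -> Optional[int]:
    """Index of the first record that fails verification, or None if intact."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        if rec["prev_hash"] != prev or rec["hash"] != record_hash(prev, rec["event"]):
            return i
        prev = rec["hash"]
    return None
```

Both failure modes named above surface the same way: an edited record no longer matches its own hash, and a removed record leaves its successor pointing at a hash that is not the one before it.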

Hardening checklist¶

The reference test environment runs every gdsgate node with all of the settings below and stays green, which shows that the hardening does not break the data plane.

Network¶

[ ] Backends only on the protected network; the agent bridges out, nothing in.
[ ] Proxy / Auth not on the backend network.
[ ] Clients reach only the proxy's public TLS, verified by the cluster transport CA.

Transport¶

[ ] Internal mutual TLS on every control hop, off the transport CA.
[ ] Registration tokens are single-use and short-lived; minted per node, out of band.
[ ] Short credential TTLs; a renewal path for long-running nodes.

Process¶

[ ] Non-root user; no Linux capabilities; NoNewPrivileges=yes.
[ ] Read-only root filesystem; only the state directory writable; tmpfs /tmp.
[ ] One role per unit / container.

State¶

[ ] Auth's state store private to the control plane; backed up.
[ ] Each node's state directory persisted on durable storage.
[ ] Audit log exported off-box; chain verification scheduled.

Policy¶

[ ] A real Cedar policy is loaded (without one Auth runs deny-all).
[ ] The Cedar policy is strict-validated before deploy (gdsgate auth policy validate).
[ ] Per-environment thresholds set in [approvals] for sensitive resources.