Auto cert provisioning by lwlee2608 · Pull Request #14 · EternisAI/silo-proxy

lwlee2608 · 2026-02-09T08:56:28Z

Summary by CodeRabbit

New Features
- Automated certificate provisioning for agents via one-time provision keys; new agent CLI provision subcommand and server provision endpoint
- Admin endpoints to create, list, and revoke provision keys; provisioning returns agent and CA certificates
Configuration
- New provision section: enable toggle, key TTL, and cleanup interval
Documentation
- Comprehensive cert-provisioning docs with phased design and implementation guides
Tests
- Added unit and HTTP tests covering key store and provisioning flows
Tools
- Helper scripts to create, list, and revoke provision keys via REST API

Adds a one-time provision key mechanism: admin creates a key tied to an agent ID, shares it with the agent operator, and the agent uses it to obtain certificates via REST — no admin API key needed. - Add in-memory KeyStore with create/validate/revoke/cleanup - Add admin endpoints (POST/GET/DELETE /api/v1/provision-keys) - Add agent endpoint (POST /api/v1/provision) authenticated by key - Add agent CLI subcommand: silo-proxy-agent provision --server --key - Add provision config section (disabled by default)

coderabbitai · 2026-02-09T08:56:38Z

Caution

Review failed

The pull request is closed.

Walkthrough

Adds a certificate provisioning subsystem and agent provisioning CLI support. Introduces an in-memory one-time provision KeyStore with TTL and cleanup, admin HTTP endpoints to create/list/revoke provision keys, and a provision HTTP endpoint that validates a key, signs a CSR via a CertSigner, marks keys used, and returns agent and CA certificates. Adds agent CLI provision command that posts a CSR and provision key to the server and writes returned certs to disk. Adds configuration options (provision.enabled, key_ttl_hours, cleanup_interval_minutes), corresponding wiring into server startup, DTOs, tests, scripts, and documentation for phased implementation.

Sequence Diagram(s)

sequenceDiagram
    participant AgentCLI as Agent CLI
    participant ServerHTTP as Server HTTP
    participant KeyStore as KeyStore
    participant CertService as CertService

    AgentCLI->>ServerHTTP: POST /api/v1/provision { key, csr }
    ServerHTTP->>KeyStore: Validate(key)
    KeyStore-->>ServerHTTP: provision key (valid) / error
    alt key valid
        ServerHTTP->>CertService: Sign CSR for agentID
        CertService-->>ServerHTTP: agent_cert PEM
        ServerHTTP->>KeyStore: MarkUsed(key)
        ServerHTTP-->>AgentCLI: 200 { agent_id, cert_pem, key_pem, ca_cert_pem }
        AgentCLI->>AgentCLI: write certs to disk
    else invalid/expired/used
        KeyStore-->>ServerHTTP: error
        ServerHTTP-->>AgentCLI: 401/400 error
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I found a sk_ key in a twilight patch,
I sent a brave CSR to the server's hatch,
The CA stamped a seal, a soft certificate tune,
I saved tiny files by the light of the moon,
Hooray — hop, provision complete! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Auto cert provisioning' is directly related to the main changes in the PR, which implement automated certificate provisioning for silo proxy agents.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/agent-cert-provision

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

🤖 Fix all issues with AI agents

In `@cmd/silo-proxy-agent/provision.go`:
- Around line 37-42: Replace the use of the default http.Post with a bespoke
http.Client that has a timeout to avoid indefinite hangs: create an
http.Client{Timeout: time.Second * <N>} (choose appropriate N), build the
request for url (using http.NewRequest with reqBody) and call client.Do(req)
instead of http.Post, handle errors the same way and defer resp.Body.Close() as
before; update imports to include "time". Use the existing local symbols url,
reqBody and resp handling code in provision.go to slot this change in.

In `@cmd/silo-proxy-server/main.go`:
- Around line 96-107: When provisioning is enabled, validate
config.Provision.KeyTTLHours and config.Provision.CleanupIntervalMinutes before
creating provision.NewKeyStore and starting keyStore.StartCleanup: ensure both
values are > 0 (reject zero/negative), log a clear error via slog.Error and
os.Exit(1) if invalid, and only then compute ttl :=
time.Duration(config.Provision.KeyTTLHours)*time.Hour, cleanupInterval :=
time.Duration(config.Provision.CleanupIntervalMinutes)*time.Minute and proceed
to call provision.NewKeyStore and go keyStore.StartCleanup(context.Background(),
cleanupInterval); keep existing certService nil check intact.

In `@docs/cert-provisioning/overview.md`:
- Around line 22-54: The fenced code blocks in
docs/cert-provisioning/overview.md are missing language identifiers (MD040);
update each triple-backtick block (for example the large ASCII diagram and the
blocks containing "POST /api/v1/provision-keys", "GET /api/v1/provision-keys",
"DELETE /api/v1/provision-keys/:agent_id", "POST /api/v1/provision", and the
certs tree) to include a language tag such as text (e.g., ```text) so
markdownlint passes; scan the file for all similar blocks (noted around the
diagram and ranges 74-116, 122-149, 199-216) and add the same tag consistently.
- Around line 86-92: Replace any example secret-like values used in the examples
(e.g., the "provision_key" value "sk_a1b2c3d4e5f6..." and similar strings) with
clearly fake placeholders such as "sk_example_..." or "provision_key_example"
and likewise use non-secret example values for "agent_id" (e.g.,
"agent_example") in both occurrences mentioned; update all instances (including
the second occurrence) so example values cannot be mistaken for real secrets and
will not trigger secret scanners.

In `@docs/cert-provisioning/phase1.md`:
- Around line 52-59: Update the Validate documentation to correctly state the
error conditions: change the first bullet to "Key not found in map → 'invalid
provision key'", and rephrase bullets 2–3 to read "Key is expired → 'provision
key expired'" and "Key already used → 'provision key already used'"; ensure the
paragraph referencing Validate and the returned ProvisionKey reflects these
corrected conditions.

In `@internal/api/http/handler/provision.go`:
- Around line 82-147: The ProvisionHandler.Provision currently calls
keyStore.Validate(...) then later keyStore.MarkUsed(...), causing a race; add an
atomic KeyStore.Consume(key string) method that acquires the write lock,
verifies the key exists and is unused, marks it used, and returns the parsed
provision key (or an error) in one operation, then in Provision replace the
Validate(...) + MarkUsed(...) sequence with a single call to Consume(req.Key),
use the returned pk.AgentID as before, and handle/log errors exactly where
Validate/MarkUsed were handled.

In `@internal/provision/key_store.go`:
- Around line 66-81: The Validate function releases ks.mu.RLock() too early
causing a data race when MarkUsed modifies the same ProvisionKey; keep the read
lock held while checking pk.Used and pk.ExpiresAt, then return a shallow copy of
the ProvisionKey (not the pointer stored in ks.keys) before releasing the lock
so callers cannot mutate the internal struct; update KeyStore.Validate to
perform all reads under ks.mu.RLock()/RUnlock() and return a copy of the
ProvisionKey to avoid concurrent mutation.

🧹 Nitpick comments (4)

docs/cert-provisioning/phase2.md (1)
20-24: Add language specifier to fenced code block.

The route listing block should have a language specifier for consistency with markdown best practices.
📝 Proposed fix
-```
+```text
 POST   /api/v1/provision-keys      → CreateProvisionKey
 GET    /api/v1/provision-keys      → ListProvisionKeys
 DELETE /api/v1/provision-keys/:id  → RevokeProvisionKey
</details>

</blockquote></details>
<details>
<summary>docs/cert-provisioning/phase3.md (1)</summary><blockquote>

`54-63`: **Minor: Clarify validity period calculation in pseudocode.**

Line 58 shows `NotAfter: time.Now().Add(validityDays)` but `validityDays` is an `int`. The actual implementation should multiply by `24 * time.Hour` to convert days to duration. This is just pseudocode, but clarifying it would help implementers.

<details>
<summary>📝 Suggested clarification</summary>

```diff
 //    - NotBefore: time.Now()
-//    - NotAfter: time.Now().Add(validityDays)
+//    - NotAfter: time.Now().Add(time.Duration(validityDays) * 24 * time.Hour)
docs/cert-provisioning/phase4.md (1)
45-62: Add language specifier to flow block.

The flow block should have a language specifier for consistency.
📝 Proposed fix
-```
+```text
 1. Parse and validate request body
    - provision_key: required, non-empty
    - csr: required, non-empty
 ...
internal/api/http/handler/provision_test.go (1)

145-181: Invalid/missing-key tests only hit the TLS-disabled branch.

Both tests build the handler with a nil certService, so they never reach key validation or missing-key handling. Consider renaming these tests or adding a non-nil certService (or a lightweight test helper) to exercise invalid/missing key behavior.

coderabbitai · 2026-02-09T09:16:11Z

cmd/silo-proxy-agent/provision.go

+	url := *server + "/api/v1/provision"
+	resp, err := http.Post(url, "application/json", bytes.NewBuffer(reqBody))
+	if err != nil {
+		return fmt.Errorf("failed to connect to server: %w", err)
+	}
+	defer resp.Body.Close()


⚠️ Potential issue | 🟡 Minor

Add HTTP client timeout to prevent indefinite hangs.

The http.Post call uses the default HTTP client which has no timeout. If the server is unresponsive, the agent will hang indefinitely.

🛡️ Proposed fix to add timeout

+ client := &http.Client{Timeout: 30 * time.Second} url := *server + "/api/v1/provision" - resp, err := http.Post(url, "application/json", bytes.NewBuffer(reqBody)) + resp, err := client.Post(url, "application/json", bytes.NewBuffer(reqBody)) if err != nil { return fmt.Errorf("failed to connect to server: %w", err) }

You'll also need to add "time" to imports.

🤖 Prompt for AI Agents

In `@cmd/silo-proxy-agent/provision.go` around lines 37 - 42, Replace the use of the default http.Post with a bespoke http.Client that has a timeout to avoid indefinite hangs: create an http.Client{Timeout: time.Second * <N>} (choose appropriate N), build the request for url (using http.NewRequest with reqBody) and call client.Do(req) instead of http.Post, handle errors the same way and defer resp.Body.Close() as before; update imports to include "time". Use the existing local symbols url, reqBody and resp handling code in provision.go to slot this change in.

coderabbitai · 2026-02-09T09:16:11Z

cmd/silo-proxy-server/main.go

+	var keyStore *provision.KeyStore
+	if config.Provision.Enabled {
+		if certService == nil {
+			slog.Error("Provisioning requires TLS to be enabled")
+			os.Exit(1)
+		}
+		ttl := time.Duration(config.Provision.KeyTTLHours) * time.Hour
+		keyStore = provision.NewKeyStore(ttl)
+		cleanupInterval := time.Duration(config.Provision.CleanupIntervalMinutes) * time.Minute
+		go keyStore.StartCleanup(context.Background(), cleanupInterval)
+		slog.Info("Provisioning enabled", "key_ttl_hours", config.Provision.KeyTTLHours)
+	}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Find the provision package and StartCleanup function find . -path ./vendor -prune -o -type f -name "*.go" -print | xargs grep -l "func.*StartCleanup" | head -5

Repository: EternisAI/silo-proxy

Length of output: 97

🏁 Script executed:

#!/bin/bash # Search for StartCleanup implementation across the codebase rg -n "func.*StartCleanup" -A 10 --type go

Repository: EternisAI/silo-proxy

Length of output: 685

🏁 Script executed:

#!/bin/bash # Check the config structure for Provision settings rg -n "type.*Provision|KeyTTLHours|CleanupIntervalMinutes" --type go | head -20

Repository: EternisAI/silo-proxy

Length of output: 1245

🏁 Script executed:

#!/bin/bash # Check NewKeyStore implementation and how it handles TTL rg -n "func NewKeyStore" -A 10 --type go

Repository: EternisAI/silo-proxy

Length of output: 773

🏁 Script executed:

#!/bin/bash # Check config loading and validation rg -n "config.*Provision|viper|unmarshal" --type go | grep -i "provision\|config" | head -20

Repository: EternisAI/silo-proxy

Length of output: 467

🏁 Script executed:

#!/bin/bash # Search for any existing validation of these config values rg -n "KeyTTLHours|CleanupIntervalMinutes" --type go

Repository: EternisAI/silo-proxy

Length of output: 604

🏁 Script executed:

#!/bin/bash # Check time.NewTicker behavior with documentation or test references rg -n "time.NewTicker|StartCleanup" -B 2 -A 5 --type go

Repository: EternisAI/silo-proxy

Length of output: 2172

🏁 Script executed:

#!/bin/bash # Check the cleanup implementation and how it uses TTL rg -n "func.*cleanup|ttl" -A 8 --type go internal/provision/key_store.go

Repository: EternisAI/silo-proxy

Length of output: 917

🏁 Script executed:

#!/bin/bash # Look at the complete ProvisionConfig definition cat -n cmd/silo-proxy-server/config.go | head -40

Repository: EternisAI/silo-proxy

Length of output: 1436

🏁 Script executed:

#!/bin/bash # Check if there's any config validation layer rg -n "validate|Validate|Config\(\)" -B 2 -A 5 --type go cmd/silo-proxy-server/

Repository: EternisAI/silo-proxy

Length of output: 1366

🏁 Script executed:

#!/bin/bash # Check Go time.NewTicker behavior with zero or negative duration # Search for any tests or documentation about this rg -n "NewTicker.*0|NewTicker.*-" --type go | head -10

Repository: EternisAI/silo-proxy

Length of output: 46

🏁 Script executed:

#!/bin/bash # Check if there's a default configuration file or environment setup find . -type f \( -name "*.yaml" -o -name "*.yml" -o -name "*.env*" -o -name "application*" \) 2>/dev/null | head -10

Repository: EternisAI/silo-proxy

Length of output: 236

🏁 Script executed:

#!/bin/bash # Check what values are used in tests or default configs rg -n "key_ttl_hours|cleanup_interval_minutes" -A 1 -B 1 --type go --type yaml

Repository: EternisAI/silo-proxy

Length of output: 844

🏁 Script executed:

#!/bin/bash # Check Go's time.NewTicker behavior - let's see if there's any panic handling or test refs rg -n "panic|NewTicker" -B 2 -A 2 --type go | grep -A 5 -B 5 "NewTicker"

Repository: EternisAI/silo-proxy

Length of output: 1375

🏁 Script executed:

#!/bin/bash # Get the full main.go context around the provisioning setup sed -n '96,110p' cmd/silo-proxy-server/main.go

Repository: EternisAI/silo-proxy

Length of output: 638

Add validation for provisioning TTL and cleanup interval before initialization.

The code directly passes config.Provision.KeyTTLHours and config.Provision.CleanupIntervalMinutes to duration calculations without validation. If either is zero or negative, time.NewTicker will panic, and keys will be created with immediate expiry. Add explicit validation to fail fast when provisioning is enabled.

Suggested fix

if config.Provision.Enabled { if certService == nil { slog.Error("Provisioning requires TLS to be enabled") os.Exit(1) } + if config.Provision.KeyTTLHours <= 0 { + slog.Error("Provisioning requires key_ttl_hours to be > 0") + os.Exit(1) + } + if config.Provision.CleanupIntervalMinutes <= 0 { + slog.Error("Provisioning requires cleanup_interval_minutes to be > 0") + os.Exit(1) + } ttl := time.Duration(config.Provision.KeyTTLHours) * time.Hour keyStore = provision.NewKeyStore(ttl) cleanupInterval := time.Duration(config.Provision.CleanupIntervalMinutes) * time.Minute go keyStore.StartCleanup(context.Background(), cleanupInterval) slog.Info("Provisioning enabled", "key_ttl_hours", config.Provision.KeyTTLHours) }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

var keyStore *provision.KeyStore

if config.Provision.Enabled {

if certService == nil {

slog.Error("Provisioning requires TLS to be enabled")

os.Exit(1)

}

ttl := time.Duration(config.Provision.KeyTTLHours) * time.Hour

keyStore = provision.NewKeyStore(ttl)

cleanupInterval := time.Duration(config.Provision.CleanupIntervalMinutes) * time.Minute

go keyStore.StartCleanup(context.Background(), cleanupInterval)

slog.Info("Provisioning enabled", "key_ttl_hours", config.Provision.KeyTTLHours)

}

if config.Provision.Enabled {

if certService == nil {

slog.Error("Provisioning requires TLS to be enabled")

os.Exit(1)

}

if config.Provision.KeyTTLHours <= 0 {

slog.Error("Provisioning requires key_ttl_hours to be > 0")

os.Exit(1)

}

if config.Provision.CleanupIntervalMinutes <= 0 {

slog.Error("Provisioning requires cleanup_interval_minutes to be > 0")

os.Exit(1)

}

ttl := time.Duration(config.Provision.KeyTTLHours) * time.Hour

keyStore = provision.NewKeyStore(ttl)

cleanupInterval := time.Duration(config.Provision.CleanupIntervalMinutes) * time.Minute

go keyStore.StartCleanup(context.Background(), cleanupInterval)

slog.Info("Provisioning enabled", "key_ttl_hours", config.Provision.KeyTTLHours)

}

🤖 Prompt for AI Agents

In `@cmd/silo-proxy-server/main.go` around lines 96 - 107, When provisioning is enabled, validate config.Provision.KeyTTLHours and config.Provision.CleanupIntervalMinutes before creating provision.NewKeyStore and starting keyStore.StartCleanup: ensure both values are > 0 (reject zero/negative), log a clear error via slog.Error and os.Exit(1) if invalid, and only then compute ttl := time.Duration(config.Provision.KeyTTLHours)*time.Hour, cleanupInterval := time.Duration(config.Provision.CleanupIntervalMinutes)*time.Minute and proceed to call provision.NewKeyStore and go keyStore.StartCleanup(context.Background(), cleanupInterval); keep existing certService nil check intact.

coderabbitai · 2026-02-09T09:16:11Z

docs/cert-provisioning/overview.md

+```
+ Admin                    Server                     Agent Device
+   |                        |                            |
+   | POST /api/v1/          |                            |
+   |   provision-keys       |                            |
+   | {agent_id: "agent-5"}  |                            |
+   |----------------------->|                            |
+   |   provision_key:       |                            |
+   |   "sk_a1b2c3..."      |                            |
+   |<-----------------------|                            |
+   |                        |                            |
+   |  (copy key to device)  |                            |
+   |~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~>|
+   |                        |                            |
+   |                        |  1. Generate private key   |
+   |                        |  2. Create CSR             |
+   |                        |                            |
+   |                        |  POST /api/v1/provision    |
+   |                        |  {key, csr}                |
+   |                        |<---------------------------|
+   |                        |                            |
+   |                        |  3. Validate key           |
+   |                        |  4. Sign CSR with CA       |
+   |                        |  5. Mark key as used       |
+   |                        |                            |
+   |                        |  {agent_cert, ca_cert}     |
+   |                        |--------------------------->|
+   |                        |                            |
+   |                        |  6. Save certs to disk     |
+   |                        |  7. Connect via mTLS       |
+   |                        |<===========================|
+   |                        |     (gRPC stream)          |
+```


⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks (MD040).

markdownlint flags these blocks as missing a language. Use text (or a more specific language) to satisfy MD040.

🔧 Proposed fix (representative blocks)

-``` +```text Admin Server Agent Device | | | @@ -``` +``` -``` +```text POST /api/v1/provision-keys -``` +``` -``` +```text GET /api/v1/provision-keys -``` +``` -``` +```text DELETE /api/v1/provision-keys/:agent_id -``` +``` -``` +```text POST /api/v1/provision -``` +``` -``` +```text ~/.silo-proxy/certs/ ├── agent-key.pem # generated locally ├── agent-cert.pem # signed by server CA └── ca-cert.pem # for verifying server -``` +```

Also applies to: 74-116, 122-149, 199-216

🧰 Tools

🪛 markdownlint-cli2 (0.20.0)

[warning] 22-22: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

In `@docs/cert-provisioning/overview.md` around lines 22 - 54, The fenced code blocks in docs/cert-provisioning/overview.md are missing language identifiers (MD040); update each triple-backtick block (for example the large ASCII diagram and the blocks containing "POST /api/v1/provision-keys", "GET /api/v1/provision-keys", "DELETE /api/v1/provision-keys/:agent_id", "POST /api/v1/provision", and the certs tree) to include a language tag such as text (e.g., ```text) so markdownlint passes; scan the file for all similar blocks (noted around the diagram and ranges 74-116, 122-149, 199-216) and add the same tag consistently.

coderabbitai · 2026-02-09T09:16:11Z

docs/cert-provisioning/overview.md

+Response:
+```json
+{
+  "provision_key": "sk_a1b2c3d4e5f6...",
+  "agent_id": "agent-5",
+  "expires_at": "2026-02-10T12:00:00Z"
+}


⚠️ Potential issue | 🟡 Minor

Use clearly fake placeholders for example provision keys.

The example sk_a1b2c3... strings trigger secret scanners. Swap to explicit placeholders (e.g., sk_example_...) to avoid false positives.

🧹 Proposed fix

- "provision_key": "sk_a1b2c3d4e5f6...", + "provision_key": "sk_example_123456...", @@ - "provision_key": "sk_a1b2c3d4e5f6...", + "provision_key": "sk_example_123456...",

Also applies to: 134-139

🧰 Tools

🪛 Gitleaks (8.30.0)

[high] 89-89: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents

In `@docs/cert-provisioning/overview.md` around lines 86 - 92, Replace any example secret-like values used in the examples (e.g., the "provision_key" value "sk_a1b2c3d4e5f6..." and similar strings) with clearly fake placeholders such as "sk_example_..." or "provision_key_example" and likewise use non-secret example values for "agent_id" (e.g., "agent_example") in both occurrences mentioned; update all instances (including the second occurrence) so example values cannot be mistaken for real secrets and will not trigger secret scanners.

coderabbitai · 2026-02-09T09:16:11Z

docs/cert-provisioning/phase1.md

+### Validate Logic
+
+`Validate` checks three conditions and returns a specific error for each:
+1. Key exists in the map → `"invalid provision key"`
+2. Key not expired → `"provision key expired"`
+3. Key not already used → `"provision key already used"`
+
+Returns the `ProvisionKey` on success.


⚠️ Potential issue | 🟡 Minor

Clarify validation error conditions.

Line 55 states "Key exists in the map → invalid provision key" which is confusing. The error "invalid provision key" should occur when the key does not exist in the map.

📝 Suggested fix

`Validate` checks three conditions and returns a specific error for each: -1. Key exists in the map → `"invalid provision key"` +1. Key does not exist in the map → `"invalid provision key"` 2. Key not expired → `"provision key expired"` 3. Key not already used → `"provision key already used"`

The conditions on lines 56-57 also read awkwardly. Consider rewording to describe when errors occur:

1. Key not found in map → "invalid provision key" 2. Key is expired → "provision key expired" 3. Key already used → "provision key already used"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

### Validate Logic

`Validate` checks three conditions and returns a specific error for each:

1. Key exists in the map → `"invalid provision key"`

2. Key not expired → `"provision key expired"`

3. Key not already used → `"provision key already used"`

Returns the `ProvisionKey` on success.

### Validate Logic

`Validate` checks three conditions and returns a specific error for each:

1. Key does not exist in the map → `"invalid provision key"`

2. Key not expired → `"provision key expired"`

3. Key not already used → `"provision key already used"`

Returns the `ProvisionKey` on success.

🤖 Prompt for AI Agents

In `@docs/cert-provisioning/phase1.md` around lines 52 - 59, Update the Validate documentation to correctly state the error conditions: change the first bullet to "Key not found in map → 'invalid provision key'", and rephrase bullets 2–3 to read "Key is expired → 'provision key expired'" and "Key already used → 'provision key already used'"; ensure the paragraph referencing Validate and the returned ProvisionKey reflects these corrected conditions.

coderabbitai · 2026-02-09T09:16:11Z

internal/api/http/handler/provision.go

+func (h *ProvisionHandler) Provision(ctx *gin.Context) {
+	if h.certService == nil {
+		slog.Warn("Provision requested but TLS is disabled")
+		ctx.JSON(http.StatusBadRequest, gin.H{"error": "TLS is not enabled on this server"})
+		return
+	}
+
+	var req dto.ProvisionRequest
+	if err := ctx.ShouldBindJSON(&req); err != nil {
+		ctx.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
+		return
+	}
+
+	pk, err := h.keyStore.Validate(req.Key)
+	if err != nil {
+		slog.Warn("Provision key validation failed", "error", err)
+		ctx.JSON(http.StatusUnauthorized, gin.H{"error": err.Error()})
+		return
+	}
+
+	agentID := pk.AgentID
+
+	if h.certService.AgentCertExists(agentID) {
+		slog.Warn("Certificate already exists for agent", "agent_id", agentID)
+		ctx.JSON(http.StatusConflict, gin.H{"error": "Certificate already exists for this agent"})
+		return
+	}
+
+	agentCert, agentKey, err := h.certService.GenerateAgentCert(agentID)
+	if err != nil {
+		slog.Error("Failed to generate agent certificate", "error", err, "agent_id", agentID)
+		ctx.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to generate agent certificate"})
+		return
+	}
+
+	certPEM, err := cert.CertToPEM(agentCert)
+	if err != nil {
+		slog.Error("Failed to encode certificate", "error", err, "agent_id", agentID)
+		ctx.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to encode certificate"})
+		return
+	}
+
+	keyPEM, err := cert.KeyToPEM(agentKey)
+	if err != nil {
+		slog.Error("Failed to encode key", "error", err, "agent_id", agentID)
+		ctx.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to encode key"})
+		return
+	}
+
+	caCertBytes, err := h.certService.GetCACert()
+	if err != nil {
+		slog.Error("Failed to read CA certificate", "error", err, "agent_id", agentID)
+		ctx.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to read CA certificate"})
+		return
+	}
+
+	h.keyStore.MarkUsed(req.Key)
+
+	slog.Info("Agent provisioned successfully", "agent_id", agentID)
+	ctx.JSON(http.StatusOK, dto.ProvisionResponse{
+		AgentID:   agentID,
+		CertPEM:   string(certPEM),
+		KeyPEM:    string(keyPEM),
+		CACertPEM: string(caCertBytes),
+	})
+}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Find KeyStore interface and implementation find . -type f -name "*.go" | xargs grep -l "type KeyStore" | head -20

Repository: EternisAI/silo-proxy

Length of output: 97

🏁 Script executed:

#!/bin/bash # Search for Validate and MarkUsed method definitions rg "func.*Validate\(|func.*MarkUsed\(" --type go -B 2 -A 10

Repository: EternisAI/silo-proxy

Length of output: 2072

🏁 Script executed:

#!/bin/bash # Check if a Consume method already exists rg "Consume\(" --type go -B 2 -A 5

Repository: EternisAI/silo-proxy

Length of output: 46

🏁 Script executed:

#!/bin/bash # Look at the provision handler file to understand context cat -n internal/api/http/handler/provision.go | head -150

Repository: EternisAI/silo-proxy

Length of output: 5337

Provision key race condition allows issuing multiple certificates for a single key.

Validate() releases its read lock immediately after checking the key is unused (line 95), then certificate generation, encoding, and CA cert retrieval occur without holding the lock (lines 110–136). Another concurrent request can pass Validate() during this window before the first request calls MarkUsed() (line 138), resulting in two requests using the same provision key.

Implement an atomic Consume(key string) method in KeyStore that acquires the write lock, validates the key exists and is unused, marks it used, and releases the lock—all as a single atomic operation. Replace both Validate() and MarkUsed() calls with this single operation.

🤖 Prompt for AI Agents

In `@internal/api/http/handler/provision.go` around lines 82 - 147, The ProvisionHandler.Provision currently calls keyStore.Validate(...) then later keyStore.MarkUsed(...), causing a race; add an atomic KeyStore.Consume(key string) method that acquires the write lock, verifies the key exists and is unused, marks it used, and returns the parsed provision key (or an error) in one operation, then in Provision replace the Validate(...) + MarkUsed(...) sequence with a single call to Consume(req.Key), use the returned pk.AgentID as before, and handle/log errors exactly where Validate/MarkUsed were handled.

coderabbitai · 2026-02-09T09:16:12Z

internal/provision/key_store.go

+func (ks *KeyStore) Validate(key string) (*ProvisionKey, error) {
+	ks.mu.RLock()
+	pk, exists := ks.keys[key]
+	ks.mu.RUnlock()
+
+	if !exists {
+		return nil, ErrKeyNotFound
+	}
+	if pk.Used {
+		return nil, ErrKeyAlreadyUsed
+	}
+	if time.Now().After(pk.ExpiresAt) {
+		return nil, ErrKeyExpired
+	}
+	return pk, nil
+}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

fd "key_store.go" --type f

Repository: EternisAI/silo-proxy

Length of output: 95

🏁 Script executed:

cat -n internal/provision/key_store.go | head -100

Repository: EternisAI/silo-proxy

Length of output: 2632

🏁 Script executed:

rg -n "func.*MarkUsed|func.*cleanup" internal/provision/

Repository: EternisAI/silo-proxy

Length of output: 209

🏁 Script executed:

sed -n '137,170p' internal/provision/key_store.go

Repository: EternisAI/silo-proxy

Length of output: 368

Data race in Validate due to unlock-before-read.

Validate releases the read lock at line 69 before checking pk.Used (line 74) and pk.ExpiresAt (line 77), while MarkUsed can acquire the write lock and modify pk.Used concurrently on the same pointer. This violates Go's memory model and produces a data race. Keep the lock held while reading all fields and return a copy to prevent inconsistent validation and external mutation.

🛠️ Proposed fix to keep reads under lock

func (ks *KeyStore) Validate(key string) (*ProvisionKey, error) { - ks.mu.RLock() - pk, exists := ks.keys[key] - ks.mu.RUnlock() - - if !exists { - return nil, ErrKeyNotFound - } - if pk.Used { - return nil, ErrKeyAlreadyUsed - } - if time.Now().After(pk.ExpiresAt) { - return nil, ErrKeyExpired - } - return pk, nil + ks.mu.RLock() + pk, exists := ks.keys[key] + if !exists { + ks.mu.RUnlock() + return nil, ErrKeyNotFound + } + if pk.Used { + ks.mu.RUnlock() + return nil, ErrKeyAlreadyUsed + } + if time.Now().After(pk.ExpiresAt) { + ks.mu.RUnlock() + return nil, ErrKeyExpired + } + pkCopy := *pk + ks.mu.RUnlock() + return &pkCopy, nil }

🤖 Prompt for AI Agents

In `@internal/provision/key_store.go` around lines 66 - 81, The Validate function releases ks.mu.RLock() too early causing a data race when MarkUsed modifies the same ProvisionKey; keep the read lock held while checking pk.Used and pk.ExpiresAt, then return a shallow copy of the ProvisionKey (not the pointer stored in ks.keys) before releasing the lock so callers cannot mutate the internal struct; update KeyStore.Validate to perform all reads under ks.mu.RLock()/RUnlock() and return a copy of the ProvisionKey to avoid concurrent mutation.

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@cmd/silo-proxy-agent/provision.go`:
- Around line 67-70: The code unmarshals into provResp (dto.ProvisionResponse)
but does not validate provResp.AgentID before using it to build file paths; add
a check immediately after json.Unmarshal to trim whitespace and ensure
provResp.AgentID is non-empty and contains no path separators (e.g., reject
values with "/" or "\"), returning a descriptive error if invalid, so subsequent
code that constructs certs/agents/<AgentID>-cert.pem or creates directories will
not operate on an empty or malicious agent id.

In `@internal/api/http/handler/provision.go`:
- Around line 111-115: When the provision handler detects an existing
certificate (!created) it currently returns 409 without marking the provision
key used; call the provision key store's MarkUsed method for the key (same key
that was validated earlier) before returning the conflict response so the key's
Used flag is set and cannot be reused—update the branch in the handler (the code
around the created check in provision.go handling agentID) to invoke
MarkUsed(key) then log and ctx.JSON the 409 response; no error handling is
required for MarkUsed since it returns void.

coderabbitai · 2026-02-09T13:46:50Z

cmd/silo-proxy-agent/provision.go

+	var provResp dto.ProvisionResponse
+	if err := json.Unmarshal(body, &provResp); err != nil {
+		return fmt.Errorf("failed to parse response: %w", err)
+	}


⚠️ Potential issue | 🟡 Minor

Validate AgentID before using it in file paths.

If the server returns an empty AgentID, the code will create paths like certs/agents//-cert.pem and a directory named with an empty string, which could cause unexpected behavior.

🛡️ Proposed fix

var provResp dto.ProvisionResponse if err := json.Unmarshal(body, &provResp); err != nil { return fmt.Errorf("failed to parse response: %w", err) } + if provResp.AgentID == "" { + return fmt.Errorf("server returned empty agent ID") + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

var provResp dto.ProvisionResponse

if err := json.Unmarshal(body, &provResp); err != nil {

return fmt.Errorf("failed to parse response: %w", err)

}

var provResp dto.ProvisionResponse

if err := json.Unmarshal(body, &provResp); err != nil {

return fmt.Errorf("failed to parse response: %w", err)

}

if provResp.AgentID == "" {

return fmt.Errorf("server returned empty agent ID")

}

🤖 Prompt for AI Agents

In `@cmd/silo-proxy-agent/provision.go` around lines 67 - 70, The code unmarshals into provResp (dto.ProvisionResponse) but does not validate provResp.AgentID before using it to build file paths; add a check immediately after json.Unmarshal to trim whitespace and ensure provResp.AgentID is non-empty and contains no path separators (e.g., reject values with "/" or "\"), returning a descriptive error if invalid, so subsequent code that constructs certs/agents/<AgentID>-cert.pem or creates directories will not operate on an empty or malicious agent id.

coderabbitai · 2026-02-09T13:46:50Z

internal/api/http/handler/provision.go

+	if !created {
+		slog.Warn("Certificate already exists for agent", "agent_id", agentID)
+		ctx.JSON(http.StatusConflict, gin.H{"error": "Certificate already exists for this agent"})
+		return
+	}


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find key_store.go and examine MarkUsed signature and behavior fd -a "key_store.go" -type f

Repository: EternisAI/silo-proxy

Length of output: 234

🏁 Script executed:

# Also search for KeyStore interface/type definition rg -n "type KeyStore|func.*MarkUsed|func.*Validate" --max-count 50 -A 3

Repository: EternisAI/silo-proxy

Length of output: 3770

🏁 Script executed:

# Examine the provision.go file around the specified lines cat -n internal/api/http/handler/provision.go | head -160 | tail -60

Repository: EternisAI/silo-proxy

Length of output: 2347

🏁 Script executed:

cat -n internal/provision/key_store.go

Repository: EternisAI/silo-proxy

Length of output: 3908

🏁 Script executed:

# Also check the Validate function to understand key consumption/reuse behavior rg -n "func.*Validate|Used|func.*Revoke" internal/provision/key_store.go -A 5

Repository: EternisAI/silo-proxy

Length of output: 1380

Mark the provision key as used before returning 409 when certificate already exists.

When the certificate already exists (!created at line 111), the handler returns 409 without calling MarkUsed, allowing the key to potentially be reused. The Validate function checks the Used flag and rejects already-used keys, so the key should be marked used before returning the error to prevent reuse attempts.

Note: MarkUsed is a void function that does not return errors, so error handling around it is not applicable. The current design relies on the Used flag to prevent reuse.

🤖 Prompt for AI Agents

In `@internal/api/http/handler/provision.go` around lines 111 - 115, When the provision handler detects an existing certificate (!created) it currently returns 409 without marking the provision key used; call the provision key store's MarkUsed method for the key (same key that was validated earlier) before returning the conflict response so the key's Used flag is set and cannot be reused—update the branch in the handler (the code around the created check in provision.go handling agentID) to invoke MarkUsed(key) then log and ctx.JSON the 409 response; no error handling is required for MarkUsed since it returns void.

lwlee2608 added 3 commits February 9, 2026 16:28

automated cert provisioning steps

e982bf5

add cert provisioning phase docs

06e3fc1

lwlee2608 marked this pull request as ready for review February 9, 2026 09:05

coderabbitai bot reviewed Feb 9, 2026

View reviewed changes

lwlee2608 changed the title ~~Feat/agent cert provision~~ Auto cert provisioning Feb 9, 2026

lwlee2608 assigned JckHoe Feb 9, 2026

Add race condition fix and file permission for cert directory

37c54a1

coderabbitai bot reviewed Feb 9, 2026

View reviewed changes

Add simple test scripts

e36932f

JckHoe merged commit b58f7a7 into main Feb 10, 2026
2 of 3 checks passed

JckHoe deleted the feat/agent-cert-provision branch February 10, 2026 05:31

Conversation

lwlee2608 commented Feb 9, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lwlee2608 commented Feb 9, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 9, 2026 •

edited

Loading