Skip to content

Auto cert provisioning#14

Merged
JckHoe merged 5 commits intomainfrom
feat/agent-cert-provision
Feb 10, 2026
Merged

Auto cert provisioning#14
JckHoe merged 5 commits intomainfrom
feat/agent-cert-provision

Conversation

@lwlee2608
Copy link
Member

@lwlee2608 lwlee2608 commented Feb 9, 2026

Summary by CodeRabbit

  • New Features

    • Automated certificate provisioning for agents via one-time provision keys; new agent CLI provision subcommand and server provision endpoint
    • Admin endpoints to create, list, and revoke provision keys; provisioning returns agent and CA certificates
  • Configuration

    • New provision section: enable toggle, key TTL, and cleanup interval
  • Documentation

    • Comprehensive cert-provisioning docs with phased design and implementation guides
  • Tests

    • Added unit and HTTP tests covering key store and provisioning flows
  • Tools

    • Helper scripts to create, list, and revoke provision keys via REST API

Adds a one-time provision key mechanism: admin creates a key tied to an
agent ID, shares it with the agent operator, and the agent uses it to
obtain certificates via REST — no admin API key needed.

- Add in-memory KeyStore with create/validate/revoke/cleanup
- Add admin endpoints (POST/GET/DELETE /api/v1/provision-keys)
- Add agent endpoint (POST /api/v1/provision) authenticated by key
- Add agent CLI subcommand: silo-proxy-agent provision --server --key
- Add provision config section (disabled by default)
@coderabbitai
Copy link

coderabbitai bot commented Feb 9, 2026

Caution

Review failed

The pull request is closed.

Walkthrough

Adds a certificate provisioning subsystem and agent provisioning CLI support. Introduces an in-memory one-time provision KeyStore with TTL and cleanup, admin HTTP endpoints to create/list/revoke provision keys, and a provision HTTP endpoint that validates a key, signs a CSR via a CertSigner, marks keys used, and returns agent and CA certificates. Adds agent CLI provision command that posts a CSR and provision key to the server and writes returned certs to disk. Adds configuration options (provision.enabled, key_ttl_hours, cleanup_interval_minutes), corresponding wiring into server startup, DTOs, tests, scripts, and documentation for phased implementation.

Sequence Diagram(s)

sequenceDiagram
    participant AgentCLI as Agent CLI
    participant ServerHTTP as Server HTTP
    participant KeyStore as KeyStore
    participant CertService as CertService

    AgentCLI->>ServerHTTP: POST /api/v1/provision { key, csr }
    ServerHTTP->>KeyStore: Validate(key)
    KeyStore-->>ServerHTTP: provision key (valid) / error
    alt key valid
        ServerHTTP->>CertService: Sign CSR for agentID
        CertService-->>ServerHTTP: agent_cert PEM
        ServerHTTP->>KeyStore: MarkUsed(key)
        ServerHTTP-->>AgentCLI: 200 { agent_id, cert_pem, key_pem, ca_cert_pem }
        AgentCLI->>AgentCLI: write certs to disk
    else invalid/expired/used
        KeyStore-->>ServerHTTP: error
        ServerHTTP-->>AgentCLI: 401/400 error
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I found a sk_ key in a twilight patch,
I sent a brave CSR to the server's hatch,
The CA stamped a seal, a soft certificate tune,
I saved tiny files by the light of the moon,
Hooray — hop, provision complete! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Auto cert provisioning' is directly related to the main changes in the PR, which implement automated certificate provisioning for silo proxy agents.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/agent-cert-provision

Comment @coderabbitai help to get the list of available commands and usage tips.

@lwlee2608 lwlee2608 marked this pull request as ready for review February 9, 2026 09:05
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@cmd/silo-proxy-agent/provision.go`:
- Around line 37-42: Replace the use of the default http.Post with a bespoke
http.Client that has a timeout to avoid indefinite hangs: create an
http.Client{Timeout: time.Second * <N>} (choose appropriate N), build the
request for url (using http.NewRequest with reqBody) and call client.Do(req)
instead of http.Post, handle errors the same way and defer resp.Body.Close() as
before; update imports to include "time". Use the existing local symbols url,
reqBody and resp handling code in provision.go to slot this change in.

In `@cmd/silo-proxy-server/main.go`:
- Around line 96-107: When provisioning is enabled, validate
config.Provision.KeyTTLHours and config.Provision.CleanupIntervalMinutes before
creating provision.NewKeyStore and starting keyStore.StartCleanup: ensure both
values are > 0 (reject zero/negative), log a clear error via slog.Error and
os.Exit(1) if invalid, and only then compute ttl :=
time.Duration(config.Provision.KeyTTLHours)*time.Hour, cleanupInterval :=
time.Duration(config.Provision.CleanupIntervalMinutes)*time.Minute and proceed
to call provision.NewKeyStore and go keyStore.StartCleanup(context.Background(),
cleanupInterval); keep existing certService nil check intact.

In `@docs/cert-provisioning/overview.md`:
- Around line 22-54: The fenced code blocks in
docs/cert-provisioning/overview.md are missing language identifiers (MD040);
update each triple-backtick block (for example the large ASCII diagram and the
blocks containing "POST /api/v1/provision-keys", "GET /api/v1/provision-keys",
"DELETE /api/v1/provision-keys/:agent_id", "POST /api/v1/provision", and the
certs tree) to include a language tag such as text (e.g., ```text) so
markdownlint passes; scan the file for all similar blocks (noted around the
diagram and ranges 74-116, 122-149, 199-216) and add the same tag consistently.
- Around line 86-92: Replace any example secret-like values used in the examples
(e.g., the "provision_key" value "sk_a1b2c3d4e5f6..." and similar strings) with
clearly fake placeholders such as "sk_example_..." or "provision_key_example"
and likewise use non-secret example values for "agent_id" (e.g.,
"agent_example") in both occurrences mentioned; update all instances (including
the second occurrence) so example values cannot be mistaken for real secrets and
will not trigger secret scanners.

In `@docs/cert-provisioning/phase1.md`:
- Around line 52-59: Update the Validate documentation to correctly state the
error conditions: change the first bullet to "Key not found in map → 'invalid
provision key'", and rephrase bullets 2–3 to read "Key is expired → 'provision
key expired'" and "Key already used → 'provision key already used'"; ensure the
paragraph referencing Validate and the returned ProvisionKey reflects these
corrected conditions.

In `@internal/api/http/handler/provision.go`:
- Around line 82-147: The ProvisionHandler.Provision currently calls
keyStore.Validate(...) then later keyStore.MarkUsed(...), causing a race; add an
atomic KeyStore.Consume(key string) method that acquires the write lock,
verifies the key exists and is unused, marks it used, and returns the parsed
provision key (or an error) in one operation, then in Provision replace the
Validate(...) + MarkUsed(...) sequence with a single call to Consume(req.Key),
use the returned pk.AgentID as before, and handle/log errors exactly where
Validate/MarkUsed were handled.

In `@internal/provision/key_store.go`:
- Around line 66-81: The Validate function releases ks.mu.RLock() too early
causing a data race when MarkUsed modifies the same ProvisionKey; keep the read
lock held while checking pk.Used and pk.ExpiresAt, then return a shallow copy of
the ProvisionKey (not the pointer stored in ks.keys) before releasing the lock
so callers cannot mutate the internal struct; update KeyStore.Validate to
perform all reads under ks.mu.RLock()/RUnlock() and return a copy of the
ProvisionKey to avoid concurrent mutation.
🧹 Nitpick comments (4)
docs/cert-provisioning/phase2.md (1)

20-24: Add language specifier to fenced code block.

The route listing block should have a language specifier for consistency with markdown best practices.

📝 Proposed fix
-```
+```text
 POST   /api/v1/provision-keys      → CreateProvisionKey
 GET    /api/v1/provision-keys      → ListProvisionKeys
 DELETE /api/v1/provision-keys/:id  → RevokeProvisionKey
</details>

</blockquote></details>
<details>
<summary>docs/cert-provisioning/phase3.md (1)</summary><blockquote>

`54-63`: **Minor: Clarify validity period calculation in pseudocode.**

Line 58 shows `NotAfter: time.Now().Add(validityDays)` but `validityDays` is an `int`. The actual implementation should multiply by `24 * time.Hour` to convert days to duration. This is just pseudocode, but clarifying it would help implementers.

<details>
<summary>📝 Suggested clarification</summary>

```diff
 //    - NotBefore: time.Now()
-//    - NotAfter: time.Now().Add(validityDays)
+//    - NotAfter: time.Now().Add(time.Duration(validityDays) * 24 * time.Hour)
docs/cert-provisioning/phase4.md (1)

45-62: Add language specifier to flow block.

The flow block should have a language specifier for consistency.

📝 Proposed fix
-```
+```text
 1. Parse and validate request body
    - provision_key: required, non-empty
    - csr: required, non-empty
 ...
internal/api/http/handler/provision_test.go (1)

145-181: Invalid/missing-key tests only hit the TLS-disabled branch.

Both tests build the handler with a nil certService, so they never reach key validation or missing-key handling. Consider renaming these tests or adding a non-nil certService (or a lightweight test helper) to exercise invalid/missing key behavior.

Comment on lines 37 to 42
url := *server + "/api/v1/provision"
resp, err := http.Post(url, "application/json", bytes.NewBuffer(reqBody))
if err != nil {
return fmt.Errorf("failed to connect to server: %w", err)
}
defer resp.Body.Close()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add HTTP client timeout to prevent indefinite hangs.

The http.Post call uses the default HTTP client which has no timeout. If the server is unresponsive, the agent will hang indefinitely.

🛡️ Proposed fix to add timeout
+	client := &http.Client{Timeout: 30 * time.Second}
 	url := *server + "/api/v1/provision"
-	resp, err := http.Post(url, "application/json", bytes.NewBuffer(reqBody))
+	resp, err := client.Post(url, "application/json", bytes.NewBuffer(reqBody))
 	if err != nil {
 		return fmt.Errorf("failed to connect to server: %w", err)
 	}

You'll also need to add "time" to imports.

🤖 Prompt for AI Agents
In `@cmd/silo-proxy-agent/provision.go` around lines 37 - 42, Replace the use of
the default http.Post with a bespoke http.Client that has a timeout to avoid
indefinite hangs: create an http.Client{Timeout: time.Second * <N>} (choose
appropriate N), build the request for url (using http.NewRequest with reqBody)
and call client.Do(req) instead of http.Post, handle errors the same way and
defer resp.Body.Close() as before; update imports to include "time". Use the
existing local symbols url, reqBody and resp handling code in provision.go to
slot this change in.

Comment on lines +96 to +107
var keyStore *provision.KeyStore
if config.Provision.Enabled {
if certService == nil {
slog.Error("Provisioning requires TLS to be enabled")
os.Exit(1)
}
ttl := time.Duration(config.Provision.KeyTTLHours) * time.Hour
keyStore = provision.NewKeyStore(ttl)
cleanupInterval := time.Duration(config.Provision.CleanupIntervalMinutes) * time.Minute
go keyStore.StartCleanup(context.Background(), cleanupInterval)
slog.Info("Provisioning enabled", "key_ttl_hours", config.Provision.KeyTTLHours)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the provision package and StartCleanup function
find . -path ./vendor -prune -o -type f -name "*.go" -print | xargs grep -l "func.*StartCleanup" | head -5

Repository: EternisAI/silo-proxy

Length of output: 97


🏁 Script executed:

#!/bin/bash
# Search for StartCleanup implementation across the codebase
rg -n "func.*StartCleanup" -A 10 --type go

Repository: EternisAI/silo-proxy

Length of output: 685


🏁 Script executed:

#!/bin/bash
# Check the config structure for Provision settings
rg -n "type.*Provision|KeyTTLHours|CleanupIntervalMinutes" --type go | head -20

Repository: EternisAI/silo-proxy

Length of output: 1245


🏁 Script executed:

#!/bin/bash
# Check NewKeyStore implementation and how it handles TTL
rg -n "func NewKeyStore" -A 10 --type go

Repository: EternisAI/silo-proxy

Length of output: 773


🏁 Script executed:

#!/bin/bash
# Check config loading and validation
rg -n "config.*Provision|viper|unmarshal" --type go | grep -i "provision\|config" | head -20

Repository: EternisAI/silo-proxy

Length of output: 467


🏁 Script executed:

#!/bin/bash
# Search for any existing validation of these config values
rg -n "KeyTTLHours|CleanupIntervalMinutes" --type go

Repository: EternisAI/silo-proxy

Length of output: 604


🏁 Script executed:

#!/bin/bash
# Check time.NewTicker behavior with documentation or test references
rg -n "time.NewTicker|StartCleanup" -B 2 -A 5 --type go

Repository: EternisAI/silo-proxy

Length of output: 2172


🏁 Script executed:

#!/bin/bash
# Check the cleanup implementation and how it uses TTL
rg -n "func.*cleanup|ttl" -A 8 --type go internal/provision/key_store.go

Repository: EternisAI/silo-proxy

Length of output: 917


🏁 Script executed:

#!/bin/bash
# Look at the complete ProvisionConfig definition
cat -n cmd/silo-proxy-server/config.go | head -40

Repository: EternisAI/silo-proxy

Length of output: 1436


🏁 Script executed:

#!/bin/bash
# Check if there's any config validation layer
rg -n "validate|Validate|Config\(\)" -B 2 -A 5 --type go cmd/silo-proxy-server/

Repository: EternisAI/silo-proxy

Length of output: 1366


🏁 Script executed:

#!/bin/bash
# Check Go time.NewTicker behavior with zero or negative duration
# Search for any tests or documentation about this
rg -n "NewTicker.*0|NewTicker.*-" --type go | head -10

Repository: EternisAI/silo-proxy

Length of output: 46


🏁 Script executed:

#!/bin/bash
# Check if there's a default configuration file or environment setup
find . -type f \( -name "*.yaml" -o -name "*.yml" -o -name "*.env*" -o -name "application*" \) 2>/dev/null | head -10

Repository: EternisAI/silo-proxy

Length of output: 236


🏁 Script executed:

#!/bin/bash
# Check what values are used in tests or default configs
rg -n "key_ttl_hours|cleanup_interval_minutes" -A 1 -B 1 --type go --type yaml

Repository: EternisAI/silo-proxy

Length of output: 844


🏁 Script executed:

#!/bin/bash
# Check Go's time.NewTicker behavior - let's see if there's any panic handling or test refs
rg -n "panic|NewTicker" -B 2 -A 2 --type go | grep -A 5 -B 5 "NewTicker"

Repository: EternisAI/silo-proxy

Length of output: 1375


🏁 Script executed:

#!/bin/bash
# Get the full main.go context around the provisioning setup
sed -n '96,110p' cmd/silo-proxy-server/main.go

Repository: EternisAI/silo-proxy

Length of output: 638


Add validation for provisioning TTL and cleanup interval before initialization.

The code directly passes config.Provision.KeyTTLHours and config.Provision.CleanupIntervalMinutes to duration calculations without validation. If either is zero or negative, time.NewTicker will panic, and keys will be created with immediate expiry. Add explicit validation to fail fast when provisioning is enabled.

Suggested fix
 	if config.Provision.Enabled {
 		if certService == nil {
 			slog.Error("Provisioning requires TLS to be enabled")
 			os.Exit(1)
 		}
+		if config.Provision.KeyTTLHours <= 0 {
+			slog.Error("Provisioning requires key_ttl_hours to be > 0")
+			os.Exit(1)
+		}
+		if config.Provision.CleanupIntervalMinutes <= 0 {
+			slog.Error("Provisioning requires cleanup_interval_minutes to be > 0")
+			os.Exit(1)
+		}
 		ttl := time.Duration(config.Provision.KeyTTLHours) * time.Hour
 		keyStore = provision.NewKeyStore(ttl)
 		cleanupInterval := time.Duration(config.Provision.CleanupIntervalMinutes) * time.Minute
 		go keyStore.StartCleanup(context.Background(), cleanupInterval)
 		slog.Info("Provisioning enabled", "key_ttl_hours", config.Provision.KeyTTLHours)
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var keyStore *provision.KeyStore
if config.Provision.Enabled {
if certService == nil {
slog.Error("Provisioning requires TLS to be enabled")
os.Exit(1)
}
ttl := time.Duration(config.Provision.KeyTTLHours) * time.Hour
keyStore = provision.NewKeyStore(ttl)
cleanupInterval := time.Duration(config.Provision.CleanupIntervalMinutes) * time.Minute
go keyStore.StartCleanup(context.Background(), cleanupInterval)
slog.Info("Provisioning enabled", "key_ttl_hours", config.Provision.KeyTTLHours)
}
if config.Provision.Enabled {
if certService == nil {
slog.Error("Provisioning requires TLS to be enabled")
os.Exit(1)
}
if config.Provision.KeyTTLHours <= 0 {
slog.Error("Provisioning requires key_ttl_hours to be > 0")
os.Exit(1)
}
if config.Provision.CleanupIntervalMinutes <= 0 {
slog.Error("Provisioning requires cleanup_interval_minutes to be > 0")
os.Exit(1)
}
ttl := time.Duration(config.Provision.KeyTTLHours) * time.Hour
keyStore = provision.NewKeyStore(ttl)
cleanupInterval := time.Duration(config.Provision.CleanupIntervalMinutes) * time.Minute
go keyStore.StartCleanup(context.Background(), cleanupInterval)
slog.Info("Provisioning enabled", "key_ttl_hours", config.Provision.KeyTTLHours)
}
🤖 Prompt for AI Agents
In `@cmd/silo-proxy-server/main.go` around lines 96 - 107, When provisioning is
enabled, validate config.Provision.KeyTTLHours and
config.Provision.CleanupIntervalMinutes before creating provision.NewKeyStore
and starting keyStore.StartCleanup: ensure both values are > 0 (reject
zero/negative), log a clear error via slog.Error and os.Exit(1) if invalid, and
only then compute ttl := time.Duration(config.Provision.KeyTTLHours)*time.Hour,
cleanupInterval :=
time.Duration(config.Provision.CleanupIntervalMinutes)*time.Minute and proceed
to call provision.NewKeyStore and go keyStore.StartCleanup(context.Background(),
cleanupInterval); keep existing certService nil check intact.

Comment on lines +22 to +54
```
Admin Server Agent Device
| | |
| POST /api/v1/ | |
| provision-keys | |
| {agent_id: "agent-5"} | |
|----------------------->| |
| provision_key: | |
| "sk_a1b2c3..." | |
|<-----------------------| |
| | |
| (copy key to device) | |
|~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~>|
| | |
| | 1. Generate private key |
| | 2. Create CSR |
| | |
| | POST /api/v1/provision |
| | {key, csr} |
| |<---------------------------|
| | |
| | 3. Validate key |
| | 4. Sign CSR with CA |
| | 5. Mark key as used |
| | |
| | {agent_cert, ca_cert} |
| |--------------------------->|
| | |
| | 6. Save certs to disk |
| | 7. Connect via mTLS |
| |<===========================|
| | (gRPC stream) |
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks (MD040).

markdownlint flags these blocks as missing a language. Use text (or a more specific language) to satisfy MD040.

🔧 Proposed fix (representative blocks)
-```
+```text
 Admin                    Server                     Agent Device
   |                        |                            |
@@
-```
+```

-```
+```text
 POST /api/v1/provision-keys
-```
+```

-```
+```text
 GET /api/v1/provision-keys
-```
+```

-```
+```text
 DELETE /api/v1/provision-keys/:agent_id
-```
+```

-```
+```text
 POST /api/v1/provision
-```
+```

-```
+```text
   ~/.silo-proxy/certs/
   ├── agent-key.pem    # generated locally
   ├── agent-cert.pem   # signed by server CA
   └── ca-cert.pem      # for verifying server
-```
+```

Also applies to: 74-116, 122-149, 199-216

🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 22-22: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/cert-provisioning/overview.md` around lines 22 - 54, The fenced code
blocks in docs/cert-provisioning/overview.md are missing language identifiers
(MD040); update each triple-backtick block (for example the large ASCII diagram
and the blocks containing "POST /api/v1/provision-keys", "GET
/api/v1/provision-keys", "DELETE /api/v1/provision-keys/:agent_id", "POST
/api/v1/provision", and the certs tree) to include a language tag such as text
(e.g., ```text) so markdownlint passes; scan the file for all similar blocks
(noted around the diagram and ranges 74-116, 122-149, 199-216) and add the same
tag consistently.

Comment on lines +86 to +92
Response:
```json
{
"provision_key": "sk_a1b2c3d4e5f6...",
"agent_id": "agent-5",
"expires_at": "2026-02-10T12:00:00Z"
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use clearly fake placeholders for example provision keys.

The example sk_a1b2c3... strings trigger secret scanners. Swap to explicit placeholders (e.g., sk_example_...) to avoid false positives.

🧹 Proposed fix
-  "provision_key": "sk_a1b2c3d4e5f6...",
+  "provision_key": "sk_example_123456...",
@@
-  "provision_key": "sk_a1b2c3d4e5f6...",
+  "provision_key": "sk_example_123456...",

Also applies to: 134-139

🧰 Tools
🪛 Gitleaks (8.30.0)

[high] 89-89: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents
In `@docs/cert-provisioning/overview.md` around lines 86 - 92, Replace any example
secret-like values used in the examples (e.g., the "provision_key" value
"sk_a1b2c3d4e5f6..." and similar strings) with clearly fake placeholders such as
"sk_example_..." or "provision_key_example" and likewise use non-secret example
values for "agent_id" (e.g., "agent_example") in both occurrences mentioned;
update all instances (including the second occurrence) so example values cannot
be mistaken for real secrets and will not trigger secret scanners.

Comment on lines +52 to +59
### Validate Logic

`Validate` checks three conditions and returns a specific error for each:
1. Key exists in the map → `"invalid provision key"`
2. Key not expired → `"provision key expired"`
3. Key not already used → `"provision key already used"`

Returns the `ProvisionKey` on success.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Clarify validation error conditions.

Line 55 states "Key exists in the map → invalid provision key" which is confusing. The error "invalid provision key" should occur when the key does not exist in the map.

📝 Suggested fix
 `Validate` checks three conditions and returns a specific error for each:
-1. Key exists in the map → `"invalid provision key"`
+1. Key does not exist in the map → `"invalid provision key"`
 2. Key not expired → `"provision key expired"`
 3. Key not already used → `"provision key already used"`

The conditions on lines 56-57 also read awkwardly. Consider rewording to describe when errors occur:

1. Key not found in map → "invalid provision key"
2. Key is expired → "provision key expired"  
3. Key already used → "provision key already used"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### Validate Logic
`Validate` checks three conditions and returns a specific error for each:
1. Key exists in the map → `"invalid provision key"`
2. Key not expired → `"provision key expired"`
3. Key not already used → `"provision key already used"`
Returns the `ProvisionKey` on success.
### Validate Logic
`Validate` checks three conditions and returns a specific error for each:
1. Key does not exist in the map → `"invalid provision key"`
2. Key not expired → `"provision key expired"`
3. Key not already used → `"provision key already used"`
Returns the `ProvisionKey` on success.
🤖 Prompt for AI Agents
In `@docs/cert-provisioning/phase1.md` around lines 52 - 59, Update the Validate
documentation to correctly state the error conditions: change the first bullet
to "Key not found in map → 'invalid provision key'", and rephrase bullets 2–3 to
read "Key is expired → 'provision key expired'" and "Key already used →
'provision key already used'"; ensure the paragraph referencing Validate and the
returned ProvisionKey reflects these corrected conditions.

Comment on lines 82 to 147
func (h *ProvisionHandler) Provision(ctx *gin.Context) {
if h.certService == nil {
slog.Warn("Provision requested but TLS is disabled")
ctx.JSON(http.StatusBadRequest, gin.H{"error": "TLS is not enabled on this server"})
return
}

var req dto.ProvisionRequest
if err := ctx.ShouldBindJSON(&req); err != nil {
ctx.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
return
}

pk, err := h.keyStore.Validate(req.Key)
if err != nil {
slog.Warn("Provision key validation failed", "error", err)
ctx.JSON(http.StatusUnauthorized, gin.H{"error": err.Error()})
return
}

agentID := pk.AgentID

if h.certService.AgentCertExists(agentID) {
slog.Warn("Certificate already exists for agent", "agent_id", agentID)
ctx.JSON(http.StatusConflict, gin.H{"error": "Certificate already exists for this agent"})
return
}

agentCert, agentKey, err := h.certService.GenerateAgentCert(agentID)
if err != nil {
slog.Error("Failed to generate agent certificate", "error", err, "agent_id", agentID)
ctx.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to generate agent certificate"})
return
}

certPEM, err := cert.CertToPEM(agentCert)
if err != nil {
slog.Error("Failed to encode certificate", "error", err, "agent_id", agentID)
ctx.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to encode certificate"})
return
}

keyPEM, err := cert.KeyToPEM(agentKey)
if err != nil {
slog.Error("Failed to encode key", "error", err, "agent_id", agentID)
ctx.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to encode key"})
return
}

caCertBytes, err := h.certService.GetCACert()
if err != nil {
slog.Error("Failed to read CA certificate", "error", err, "agent_id", agentID)
ctx.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to read CA certificate"})
return
}

h.keyStore.MarkUsed(req.Key)

slog.Info("Agent provisioned successfully", "agent_id", agentID)
ctx.JSON(http.StatusOK, dto.ProvisionResponse{
AgentID: agentID,
CertPEM: string(certPEM),
KeyPEM: string(keyPEM),
CACertPEM: string(caCertBytes),
})
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find KeyStore interface and implementation
find . -type f -name "*.go" | xargs grep -l "type KeyStore" | head -20

Repository: EternisAI/silo-proxy

Length of output: 97


🏁 Script executed:

#!/bin/bash
# Search for Validate and MarkUsed method definitions
rg "func.*Validate\(|func.*MarkUsed\(" --type go -B 2 -A 10

Repository: EternisAI/silo-proxy

Length of output: 2072


🏁 Script executed:

#!/bin/bash
# Check if a Consume method already exists
rg "Consume\(" --type go -B 2 -A 5

Repository: EternisAI/silo-proxy

Length of output: 46


🏁 Script executed:

#!/bin/bash
# Look at the provision handler file to understand context
cat -n internal/api/http/handler/provision.go | head -150

Repository: EternisAI/silo-proxy

Length of output: 5337


Provision key race condition allows issuing multiple certificates for a single key.

Validate() releases its read lock immediately after checking the key is unused (line 95), then certificate generation, encoding, and CA cert retrieval occur without holding the lock (lines 110–136). Another concurrent request can pass Validate() during this window before the first request calls MarkUsed() (line 138), resulting in two requests using the same provision key.

Implement an atomic Consume(key string) method in KeyStore that acquires the write lock, validates the key exists and is unused, marks it used, and releases the lock—all as a single atomic operation. Replace both Validate() and MarkUsed() calls with this single operation.

🤖 Prompt for AI Agents
In `@internal/api/http/handler/provision.go` around lines 82 - 147, The
ProvisionHandler.Provision currently calls keyStore.Validate(...) then later
keyStore.MarkUsed(...), causing a race; add an atomic KeyStore.Consume(key
string) method that acquires the write lock, verifies the key exists and is
unused, marks it used, and returns the parsed provision key (or an error) in one
operation, then in Provision replace the Validate(...) + MarkUsed(...) sequence
with a single call to Consume(req.Key), use the returned pk.AgentID as before,
and handle/log errors exactly where Validate/MarkUsed were handled.

Comment on lines +66 to +81
func (ks *KeyStore) Validate(key string) (*ProvisionKey, error) {
ks.mu.RLock()
pk, exists := ks.keys[key]
ks.mu.RUnlock()

if !exists {
return nil, ErrKeyNotFound
}
if pk.Used {
return nil, ErrKeyAlreadyUsed
}
if time.Now().After(pk.ExpiresAt) {
return nil, ErrKeyExpired
}
return pk, nil
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

fd "key_store.go" --type f

Repository: EternisAI/silo-proxy

Length of output: 95


🏁 Script executed:

cat -n internal/provision/key_store.go | head -100

Repository: EternisAI/silo-proxy

Length of output: 2632


🏁 Script executed:

rg -n "func.*MarkUsed|func.*cleanup" internal/provision/

Repository: EternisAI/silo-proxy

Length of output: 209


🏁 Script executed:

sed -n '137,170p' internal/provision/key_store.go

Repository: EternisAI/silo-proxy

Length of output: 368


Data race in Validate due to unlock-before-read.

Validate releases the read lock at line 69 before checking pk.Used (line 74) and pk.ExpiresAt (line 77), while MarkUsed can acquire the write lock and modify pk.Used concurrently on the same pointer. This violates Go's memory model and produces a data race. Keep the lock held while reading all fields and return a copy to prevent inconsistent validation and external mutation.

🛠️ Proposed fix to keep reads under lock
 func (ks *KeyStore) Validate(key string) (*ProvisionKey, error) {
-	ks.mu.RLock()
-	pk, exists := ks.keys[key]
-	ks.mu.RUnlock()
-
-	if !exists {
-		return nil, ErrKeyNotFound
-	}
-	if pk.Used {
-		return nil, ErrKeyAlreadyUsed
-	}
-	if time.Now().After(pk.ExpiresAt) {
-		return nil, ErrKeyExpired
-	}
-	return pk, nil
+	ks.mu.RLock()
+	pk, exists := ks.keys[key]
+	if !exists {
+		ks.mu.RUnlock()
+		return nil, ErrKeyNotFound
+	}
+	if pk.Used {
+		ks.mu.RUnlock()
+		return nil, ErrKeyAlreadyUsed
+	}
+	if time.Now().After(pk.ExpiresAt) {
+		ks.mu.RUnlock()
+		return nil, ErrKeyExpired
+	}
+	pkCopy := *pk
+	ks.mu.RUnlock()
+	return &pkCopy, nil
 }
🤖 Prompt for AI Agents
In `@internal/provision/key_store.go` around lines 66 - 81, The Validate function
releases ks.mu.RLock() too early causing a data race when MarkUsed modifies the
same ProvisionKey; keep the read lock held while checking pk.Used and
pk.ExpiresAt, then return a shallow copy of the ProvisionKey (not the pointer
stored in ks.keys) before releasing the lock so callers cannot mutate the
internal struct; update KeyStore.Validate to perform all reads under
ks.mu.RLock()/RUnlock() and return a copy of the ProvisionKey to avoid
concurrent mutation.

@lwlee2608 lwlee2608 changed the title Feat/agent cert provision Auto cert provisioning Feb 9, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@cmd/silo-proxy-agent/provision.go`:
- Around line 67-70: The code unmarshals into provResp (dto.ProvisionResponse)
but does not validate provResp.AgentID before using it to build file paths; add
a check immediately after json.Unmarshal to trim whitespace and ensure
provResp.AgentID is non-empty and contains no path separators (e.g., reject
values with "/" or "\"), returning a descriptive error if invalid, so subsequent
code that constructs certs/agents/<AgentID>-cert.pem or creates directories will
not operate on an empty or malicious agent id.

In `@internal/api/http/handler/provision.go`:
- Around line 111-115: When the provision handler detects an existing
certificate (!created) it currently returns 409 without marking the provision
key used; call the provision key store's MarkUsed method for the key (same key
that was validated earlier) before returning the conflict response so the key's
Used flag is set and cannot be reused—update the branch in the handler (the code
around the created check in provision.go handling agentID) to invoke
MarkUsed(key) then log and ctx.JSON the 409 response; no error handling is
required for MarkUsed since it returns void.

Comment on lines +67 to +70
var provResp dto.ProvisionResponse
if err := json.Unmarshal(body, &provResp); err != nil {
return fmt.Errorf("failed to parse response: %w", err)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Validate AgentID before using it in file paths.

If the server returns an empty AgentID, the code will create paths like certs/agents//-cert.pem and a directory named with an empty string, which could cause unexpected behavior.

🛡️ Proposed fix
 	var provResp dto.ProvisionResponse
 	if err := json.Unmarshal(body, &provResp); err != nil {
 		return fmt.Errorf("failed to parse response: %w", err)
 	}
+	if provResp.AgentID == "" {
+		return fmt.Errorf("server returned empty agent ID")
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var provResp dto.ProvisionResponse
if err := json.Unmarshal(body, &provResp); err != nil {
return fmt.Errorf("failed to parse response: %w", err)
}
var provResp dto.ProvisionResponse
if err := json.Unmarshal(body, &provResp); err != nil {
return fmt.Errorf("failed to parse response: %w", err)
}
if provResp.AgentID == "" {
return fmt.Errorf("server returned empty agent ID")
}
🤖 Prompt for AI Agents
In `@cmd/silo-proxy-agent/provision.go` around lines 67 - 70, The code unmarshals
into provResp (dto.ProvisionResponse) but does not validate provResp.AgentID
before using it to build file paths; add a check immediately after
json.Unmarshal to trim whitespace and ensure provResp.AgentID is non-empty and
contains no path separators (e.g., reject values with "/" or "\"), returning a
descriptive error if invalid, so subsequent code that constructs
certs/agents/<AgentID>-cert.pem or creates directories will not operate on an
empty or malicious agent id.

Comment on lines +111 to +115
if !created {
slog.Warn("Certificate already exists for agent", "agent_id", agentID)
ctx.JSON(http.StatusConflict, gin.H{"error": "Certificate already exists for this agent"})
return
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find key_store.go and examine MarkUsed signature and behavior
fd -a "key_store.go" -type f

Repository: EternisAI/silo-proxy

Length of output: 234


🏁 Script executed:

# Also search for KeyStore interface/type definition
rg -n "type KeyStore|func.*MarkUsed|func.*Validate" --max-count 50 -A 3

Repository: EternisAI/silo-proxy

Length of output: 3770


🏁 Script executed:

# Examine the provision.go file around the specified lines
cat -n internal/api/http/handler/provision.go | head -160 | tail -60

Repository: EternisAI/silo-proxy

Length of output: 2347


🏁 Script executed:

cat -n internal/provision/key_store.go

Repository: EternisAI/silo-proxy

Length of output: 3908


🏁 Script executed:

# Also check the Validate function to understand key consumption/reuse behavior
rg -n "func.*Validate|Used|func.*Revoke" internal/provision/key_store.go -A 5

Repository: EternisAI/silo-proxy

Length of output: 1380


Mark the provision key as used before returning 409 when certificate already exists.

When the certificate already exists (!created at line 111), the handler returns 409 without calling MarkUsed, allowing the key to potentially be reused. The Validate function checks the Used flag and rejects already-used keys, so the key should be marked used before returning the error to prevent reuse attempts.

Note: MarkUsed is a void function that does not return errors, so error handling around it is not applicable. The current design relies on the Used flag to prevent reuse.

🤖 Prompt for AI Agents
In `@internal/api/http/handler/provision.go` around lines 111 - 115, When the
provision handler detects an existing certificate (!created) it currently
returns 409 without marking the provision key used; call the provision key
store's MarkUsed method for the key (same key that was validated earlier) before
returning the conflict response so the key's Used flag is set and cannot be
reused—update the branch in the handler (the code around the created check in
provision.go handling agentID) to invoke MarkUsed(key) then log and ctx.JSON the
409 response; no error handling is required for MarkUsed since it returns void.

@JckHoe JckHoe merged commit b58f7a7 into main Feb 10, 2026
2 of 3 checks passed
@JckHoe JckHoe deleted the feat/agent-cert-provision branch February 10, 2026 05:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants