Zero-Trust IAM reference implementation: OIDC/Keycloak authentication, RBAC authorization, multi-tenant isolation, structured audit logging. Production patterns with FastAPI, JWT RS256, mTLS. Includes threat model (STRIDE), ADRs, and 80%+ test coverage.

laugiov/iam-zero-trust-reference

IAM Zero-Trust Reference

A working example of Zero-Trust IAM architecture: OIDC authentication, RBAC authorization, multi-tenant isolation, and audit logging.


Why this project?

After 20 years building and securing platforms as CTO/VP Engineering, I've seen "Zero-Trust" used as a buzzword more often than as an actual implementation. This repo is my way of showing what Zero-Trust IAM looks like when you actually build it.

If you have a valid Keycloak token with the right claims, you get access. If Alice (a viewer) tries to create a resource, she gets a 403. If Eve (a GLOBEX admin with full permissions) tries to access ACME data, she also gets a 403 because tenant isolation is enforced regardless of role. And every single authorization decision is logged with the reason for denial.

The code reflects the security decisions I've made on real production systems: short-lived tokens, deny-by-default RBAC, hard tenant boundaries, and comprehensive audit trails. Each design choice is documented in an ADR with the reasoning behind it.

You can spin up the stack and verify all of this in about 15 minutes: docs/00-evaluate-in-15min.md


Quick Start

# Start with Keycloak
docker compose --profile idp up -d

# Wait for Keycloak to be ready (~60s)
docker compose logs -f keycloak | grep -m1 "Running the server"

# Run the demo
./scripts/demo-keycloak.sh

The script fetches tokens for different users and tests authorization scenarios. You should see output like:

Test 1: Alice (viewer) lists ACME resources - should succeed
  PASS: HTTP 200
Test 2: Alice (viewer) creates ACME resource - should fail (403)
  PASS: HTTP 403
Test 3: Charlie (admin) creates ACME resource - should succeed
  PASS: HTTP 201
Test 4: Eve (GLOBEX admin) accesses ACME resources - should fail (403)
  PASS: HTTP 403
Test 5: Eve (GLOBEX admin) accesses GLOBEX resources - should succeed
  PASS: HTTP 200

What's implemented

Authentication (OIDC)

Keycloak acts as the Identity Provider. It issues RS256-signed JWTs with custom claims (tenant_id, roles). The gateway validates tokens against the public keys published at Keycloak's JWKS endpoint, so no keys are hardcoded in the gateway and rotated keys are picked up automatically.

Tokens expire in 15 minutes. See ADR-003 for the rationale.
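
The claim checks that follow signature verification (issuer, audience, expiry) can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the function name check_claims, the issuer URL, and the audience value are assumptions made for the example, and the signature is assumed to have been verified already against the JWKS key (the Stack section lists PyJWT for that step).

```python
import time

# Hypothetical values for illustration; the real ones live in the gateway config.
EXPECTED_ISSUER = "http://localhost:8080/realms/iam-demo"
EXPECTED_AUDIENCE = "iam-api"


def check_claims(claims, now=None):
    """Return None if the claims are acceptable, else a short denial reason.

    Assumes the token signature was already verified against the JWKS key.
    """
    now = time.time() if now is None else now
    if claims.get("iss") != EXPECTED_ISSUER:
        return "wrong_issuer"
    # The aud claim may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if EXPECTED_AUDIENCE not in audiences:
        return "wrong_audience"
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return "expired"
    return None


# A token issued at t=1000 with a 15-minute lifetime.
claims = {
    "iss": EXPECTED_ISSUER,
    "aud": "iam-api",
    "exp": 1000 + 15 * 60,
    "tenant_id": "acme-corp",
    "roles": ["viewer"],
}
print(check_claims(claims, now=1000 + 60))       # one minute in: accepted
print(check_claims(claims, now=1000 + 16 * 60))  # sixteen minutes in: rejected
```

Returning a reason string rather than a bare boolean keeps the denial explainable, which matches the logging requirement stated elsewhere in this README.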

Authorization (RBAC)

4 roles with granular permissions:

Role      resource:read  resource:write  resource:delete  config:read  config:write  audit:read
viewer    yes            -               -                -            -             -
editor    yes            yes             -                -            -             -
operator  yes            yes             yes              yes          -             yes
admin     yes            yes             yes              yes          yes           yes

A viewer can list resources but cannot create them. An editor can create but not delete. Deny-by-default: if you don't have the permission, you don't get in.
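
A deny-by-default lookup over the table above could look roughly like this. It is a sketch only: ROLE_PERMISSIONS and has_permission are hypothetical names chosen for the example, not necessarily what iam/rbac_roles.py defines.

```python
# Role -> permissions, transcribed from the table above.
ROLE_PERMISSIONS = {
    "viewer": frozenset({"resource:read"}),
    "editor": frozenset({"resource:read", "resource:write"}),
    "operator": frozenset({
        "resource:read", "resource:write", "resource:delete",
        "config:read", "audit:read",
    }),
    "admin": frozenset({
        "resource:read", "resource:write", "resource:delete",
        "config:read", "config:write", "audit:read",
    }),
}


def has_permission(roles, permission):
    """Return True only if one of the roles explicitly grants the permission.

    Unknown roles map to an empty set, so anything not listed is denied.
    """
    return any(
        permission in ROLE_PERMISSIONS.get(role, frozenset())
        for role in roles
    )


print(has_permission(["viewer"], "resource:read"))    # viewer can list
print(has_permission(["viewer"], "resource:write"))   # viewer cannot create
print(has_permission(["editor"], "resource:delete"))  # editor cannot delete
print(has_permission(["intern"], "resource:read"))    # unknown role: denied
```

There is no "allow" branch for missing data: an empty role list, an unknown role, or an unknown permission all fall through to False.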

Tenant Isolation

The tenant_id is embedded in the JWT. When you hit /tenants/acme-corp/resources, the gateway checks that your token's tenant_id matches acme-corp. If you're from GLOBEX, you get a 403, even if you're an admin.

This is enforced at the gateway level, before any business logic runs. There's no parameter to bypass it, no "super-admin" escape hatch.
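
The tenant check itself is small. A minimal sketch, assuming the tenant from the URL path and the tenant_id claim are both plain strings (check_tenant is a hypothetical name, not necessarily what iam/tenant.py exposes):

```python
def check_tenant(token_tenant_id, path_tenant_id):
    """Allow only when the token's tenant matches the tenant in the URL path."""
    # A missing or empty claim never matches: deny by default.
    if not token_tenant_id:
        return False
    return token_tenant_id == path_tenant_id


# Charlie (ACME admin) on an ACME path.
print(check_tenant("acme-corp", "acme-corp"))
# Eve (GLOBEX admin) on an ACME path.
print(check_tenant("globex-inc", "acme-corp"))
```

The function takes no role argument at all, which is the point: there is nothing a role could do to change the outcome.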

Audit Logging

Every authorization decision produces a structured JSON event:

{
  "event_type": "AUTHZ_FAILURE",
  "actor": {"id": "alice-uuid", "tenant_id": "acme-corp"},
  "details": {
    "role": "viewer",
    "required_permission": "resource:write",
    "reason": "missing_permission"
  },
  "outcome": "denied"
}

The event types are AUTHZ_SUCCESS, AUTHZ_FAILURE, and TENANT_ACCESS_DENIED. They are useful for debugging, security monitoring, and compliance audits.
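
For illustration, an event with the shape shown above can be assembled and serialized with the standard library. This is a sketch under assumptions: build_authz_event is a hypothetical helper, and the "allowed" outcome value for successful requests is a guess, since the sample only shows a denial.

```python
import json


def build_authz_event(actor_id, tenant_id, role, required_permission, reason=None):
    """Build an event shaped like the sample above.

    reason=None means the request was allowed; any string means it was denied.
    """
    denied = reason is not None
    details = {"role": role, "required_permission": required_permission}
    if denied:
        details["reason"] = reason
    return {
        "event_type": "AUTHZ_FAILURE" if denied else "AUTHZ_SUCCESS",
        "actor": {"id": actor_id, "tenant_id": tenant_id},
        "details": details,
        # "allowed" is an assumed value; the sample above only shows "denied".
        "outcome": "denied" if denied else "allowed",
    }


event = build_authz_event(
    "alice-uuid", "acme-corp", "viewer", "resource:write",
    reason="missing_permission",
)
print(json.dumps(event))  # one JSON object per line
```

Emitting one JSON object per line keeps the log easy to parse with standard tooling.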

Security Testing

The test suite includes negative test cases that validate the security model:

  • Token validation failures (expired, wrong signature, wrong issuer, wrong audience)
  • RBAC denials (missing permissions for each role)
  • Cross-tenant access attempts (admin from tenant A accessing tenant B)
  • Policy regression tests (ensuring changes don't open unintended access)

See docs/06-threat-model.md for STRIDE analysis and compensating controls.


Test Users

The Keycloak realm includes 5 pre-configured users across 2 tenants:

ACME Corp (tenant_id: acme-corp):
  alice-viewer / alice123     → viewer role
  bob-editor / bob123         → editor role
  charlie-admin / charlie123  → admin role (all permissions)

GLOBEX Inc (tenant_id: globex-inc):
  diana-viewer / diana123     → viewer role
  eve-admin / eve123          → admin role (all permissions)

Charlie has full admin rights on ACME resources. Eve has full admin rights on GLOBEX resources. Neither can touch the other tenant's data.


Architecture

                    ┌─────────────────────────────────────┐
                    │         Keycloak (:8080)            │
                    │                                     │
                    │  Realm: iam-demo                    │
                    │  Clients: iam-cli, iam-api          │
                    │  Users: alice, bob, charlie,        │
                    │         diana, eve                  │
                    │  Roles: viewer, editor,             │
                    │         operator, admin             │
                    └──────────────┬──────────────────────┘
                                   │
                                   │ JWT (RS256)
                                   │ Claims: sub, tenant_id, roles
                                   ▼
                    ┌─────────────────────────────────────┐
                    │       API Gateway (:8001)           │
                    │                                     │
                    │  1. JWKS validation (signature,     │
                    │     issuer, audience, expiry)       │
                    │  2. RBAC check (role → permission)  │
                    │  3. Tenant check (token tenant      │
                    │     == resource tenant)             │
                    │  4. Audit logging                   │
                    │                                     │
                    └──────────────┬──────────────────────┘
                                   │
                                   ▼
                    ┌─────────────────────────────────────┐
                    │  /tenants/{tenant_id}/resources     │
                    │                                     │
                    │  GET    → resource:read             │
                    │  POST   → resource:write            │
                    │  PUT    → resource:write            │
                    │  DELETE → resource:delete           │
                    └─────────────────────────────────────┘

The gateway can run in two modes:

  • OIDC mode (production): validates tokens via Keycloak's JWKS endpoint
  • Local mode (testing): validates tokens with RSA keys from environment variables

API Endpoints

Method  Endpoint                          Permission       Description
GET     /health                           none             Health check
GET     /metrics                          none             Prometheus metrics
GET     /tenants/{tenant}/resources       resource:read    List resources
POST    /tenants/{tenant}/resources       resource:write   Create resource
GET     /tenants/{tenant}/resources/{id}  resource:read    Get resource
PUT     /tenants/{tenant}/resources/{id}  resource:write   Update resource
DELETE  /tenants/{tenant}/resources/{id}  resource:delete  Delete resource

All /tenants/* endpoints require a valid JWT and enforce tenant isolation.
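
The endpoint table can be read as a lookup from (method, path) to the tenant and permission a request must satisfy. The sketch below is framework-free and purely illustrative: in the actual gateway FastAPI does the routing, and route_requirements and the pattern names are hypothetical.

```python
import re

PUBLIC_PATHS = {"/health", "/metrics"}

# Path shapes from the endpoint table.
COLLECTION = re.compile(r"/tenants/(?P<tenant>[^/]+)/resources")
ITEM = re.compile(r"/tenants/(?P<tenant>[^/]+)/resources/(?P<id>[^/]+)")

COLLECTION_PERMISSIONS = {"GET": "resource:read", "POST": "resource:write"}
ITEM_PERMISSIONS = {
    "GET": "resource:read",
    "PUT": "resource:write",
    "DELETE": "resource:delete",
}


def route_requirements(method, path):
    """Return (tenant, permission) for a request.

    Public routes return (None, None). Anything not in the table raises
    LookupError, so an unmapped route is never silently allowed.
    """
    if method == "GET" and path in PUBLIC_PATHS:
        return None, None
    for pattern, table in (
        (COLLECTION, COLLECTION_PERMISSIONS),
        (ITEM, ITEM_PERMISSIONS),
    ):
        match = pattern.fullmatch(path)
        if match and method in table:
            return match.group("tenant"), table[method]
    raise LookupError(f"no route for {method} {path}")


print(route_requirements("POST", "/tenants/acme-corp/resources"))
print(route_requirements("DELETE", "/tenants/acme-corp/resources/42"))
print(route_requirements("GET", "/health"))
```

The returned tenant is what the tenant check compares against the token's tenant_id, and the returned permission is what the RBAC check looks up for the caller's roles.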


Documentation

Core docs

Architecture Decision Records


Tests

poetry run pytest

210+ tests covering:

  • JWT validation (valid, expired, wrong signature, wrong issuer, wrong audience)
  • RBAC enforcement (each role/permission combination)
  • Tenant isolation (same tenant allowed, cross-tenant denied)
  • Audit logging (events are emitted with correct structure)
  • Security headers (OWASP recommended headers present)
  • Rate limiting (429 after threshold)

Stack

  • Python 3.12 with FastAPI for the API
  • Keycloak 23.0 as Identity Provider
  • PyJWT + PyJWKClient for token validation
  • Pydantic for input validation
  • slowapi for rate limiting
  • Docker Compose for local development

Project Structure

├── iam/                      # API Gateway code
│   ├── auth.py               # JWT verification (local + OIDC)
│   ├── oidc.py               # Keycloak JWKS integration
│   ├── rbac.py               # Permission decorators
│   ├── rbac_roles.py         # Role/permission definitions
│   ├── tenant.py             # Tenant isolation logic
│   ├── audit.py              # Audit logger
│   └── routers/
│       └── resources.py      # /tenants/{tenant}/resources endpoints
├── infra/
│   └── keycloak/
│       └── iam-realm.json    # Keycloak realm configuration
├── scripts/
│   └── demo-keycloak.sh      # Demo script
├── tests/                    # Test suite
├── docs/                     # Documentation
└── decision-records/         # ADRs

Limitations

This is a reference implementation, not a production-ready product. In a real deployment, you'd want proper secret management (Vault, AWS Secrets Manager) instead of environment variables. You'd also need a managed IdP or a hardened Keycloak deployment, more sophisticated monitoring and alerting, and probably some form of token revocation mechanism.

The goal here is to show the patterns, not to provide a turnkey solution.


Design Principles

This implementation follows principles I apply to production systems:

What I refuse to implement:

  • Shared admin tokens or service accounts with broad access
  • Long-lived secrets or tokens without rotation
  • Tenant bypass mechanisms, even for "super-admins"
  • Silent failures in authorization (every denial must be logged with reason)

Where I draw the line between DX and security:

  • Developers get clear error messages, but never internal implementation details
  • Token lifetime is short (15 min) even if it means more refresh cycles
  • Input validation is strict (extra="forbid") even if it breaks some clients

What I automate vs. what requires governance:

  • Automated: key rotation, SAST/SCA scanning, security headers, rate limiting
  • Governed: role definitions, permission assignments, exception requests, break-glass access

Author

Laurent Giovannoni

20 years as CTO and VP Engineering, building and securing platforms that handle sensitive data at scale. This project reflects the security patterns I've implemented in production: defense in depth, explicit authorization, and complete auditability.

My approach to IAM is straightforward: every access decision should be traceable, every denial should have a clear reason, and no "magic" should hide behind the scenes. If you can't explain why a request was denied, your security model has a problem.

This repo exists because I believe the best way to demonstrate expertise is to show working code. Not slides, not whitepapers. Code that you can run, test, and read.


License

MIT License. See LICENSE for details.
