architecture-docs

Summary

This repository contains unofficial architecture documentation on the Rucio project. Its primary goal is to provide a minimal and focused understanding of the system (building blocks, communication between components and systems, relevant cross-cutting concepts, etc.). It is intended as a lightweight reference to support personal comprehension and quick lookup, rather than as a comprehensive or authoritative source.

References

Preconditions

1. Install Python dependencies

pip install diagrams

2. Install Graphviz

macOS:

brew install graphviz

Ubuntu/Debian:

sudo apt-get install -y graphviz

Windows (using Chocolatey):

choco install graphviz

arc42 chapters

TODO(mgajek-cern): Eventually set up a Jekyll-based project or use another static site generator

1. Introduction and Goals

See What is Rucio?

2. Constraints

An overview of the constraints influencing the architecture can be found here

3. Context & Scope

Context View

| Name | Type | Description |
| --- | --- | --- |
| Rucio | Internal System | Scientific data management framework providing declarative, policy-based data organization, transfer, and lifecycle management across distributed heterogeneous storage infrastructure |
| Workflow Management Systems | External System | Job and task orchestration platforms, including HPC clusters with batch scheduling systems, container orchestration, scientific workflow engines, and data preparation pipelines that coordinate with Rucio for data availability, computational processing, and output registration |
| Authentication Systems | External System | Identity and access management services providing user authentication and authorization through various protocols and credential mechanisms, including federated multi-issuer token services |
| Storage Systems | External System | Heterogeneous storage backends, including traditional filesystems, object storage, tape archives, cloud storage, and external data repositories |
| Monitoring Systems | External System | Analytics and observability platforms that collect, process, and visualize system performance metrics, usage statistics, and operational health data |
| Logging Systems | External System | Centralized logging infrastructure that aggregates, stores, and provides search capabilities for system events, audit trails, and troubleshooting information |
| Data Discovery Systems | External System | Catalogue and metadata federation services, including ESGF Search, FAIR Data Points, and domain-specific discovery interfaces that enable cross-repository data location and metadata harmonization |
| Database Systems | External System | Transactional relational database management systems that serve as the persistence layer for catalog metadata, system state, and configuration data |
| Transfer Systems | External System | Data movement services and protocols that handle the physical transfer of files between storage endpoints with reliability, scheduling, and error handling capabilities |
| Messaging Systems | External System | Messaging services that enable asynchronous communication between distributed components, supporting event-driven architectures, decoupling, reliable message delivery, and catalogue change notifications to external applications |
| Caching Systems | External System | High-speed data stores that temporarily hold frequently accessed data to reduce latency, decrease load on primary data sources, and improve overall system performance through intelligent data placement algorithms |
| Email Systems | External System | SMTP-based notification services that deliver system alerts, status updates, operational notifications, and administrative communications to users and operators for workflow management and incident response |

Stakeholder details are provided here

4. Solution strategy

For comprehensive information about the solution strategy, refer to the following markdown file

5. Building Block views

Lvl 1

Building Block Lvl 1 View

| Name | Type | Description |
| --- | --- | --- |
| REST API | Internal Component | HTTPS interface providing programmatic access to Rucio functionality through standardized endpoints for authentication, data management, and system operations |
| Web UI | Internal Component | JavaScript-based web user interface for browser-based interaction with Rucio services |
| Client CLIs | Internal Component | User command-line tools (`bin/rucio` and `bin/rucio-admin`) that interact with Rucio through the REST API for data operations such as upload, download, and management |
| Daemon CLIs | Internal Component | System administration command-line tools; each daemon has a corresponding CLI application that bypasses the API and accesses the database directly, using the same logic as the daemon processes |
| Daemons | Internal Component | Background processes that orchestrate data management through a database-driven workflow pipeline, handling asynchronous tasks such as rule evaluation, data transfers, and cleanup operations |
| Log Collector | External System | Centralized logging infrastructure that aggregates and forwards logs from Rucio components through polling, file monitoring, or direct ingestion; provides storage and search capabilities for system events, audit trails, and troubleshooting information |
| Rucio | Internal System | Scientific data management framework providing declarative, policy-based data organization, transfer, and lifecycle management across distributed heterogeneous storage infrastructure |

Lvl 2

Building Block Lvl 2 View

See also: Rucio Project Structure

Lvl 3

Daemons

For comprehensive information about Rucio daemons, see the official documentation

For detailed information about which external systems each daemon communicates with, refer to the daemon external communications analysis.

Rucio daemons orchestrate data management through a database-driven workflow pipeline where each daemon specializes in a specific task and communicates with others through shared database state rather than direct messaging. This creates a robust, scalable architecture:

Rule Created → Judge Evaluator → Conveyor Submitter → Transfer Tool → Conveyor Poller → Conveyor Finisher
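The shared-database handoff above can be sketched as follows. The table, state names, and daemon mapping are hypothetical, chosen only to illustrate how one daemon's output state becomes the next daemon's input query; Rucio's real schema and state machine differ:

```python
import sqlite3

# Hypothetical pipeline: each "daemon" claims rows in one state and
# advances them to the next, with the database as the only channel.
PIPELINE = [("QUEUED", "SUBMITTED"),   # e.g. a submitter-style daemon
            ("SUBMITTED", "DONE")]     # e.g. a finisher-style daemon

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE requests (id INTEGER PRIMARY KEY, state TEXT)")
db.execute("INSERT INTO requests (state) VALUES ('QUEUED')")

def run_daemon(conn, src, dst):
    """One polling cycle: pick up work in state `src`, hand it off as `dst`."""
    conn.execute("UPDATE requests SET state = ? WHERE state = ?", (dst, src))
    conn.commit()

for src, dst in PIPELINE:
    run_daemon(db, src, dst)

final_state = db.execute("SELECT state FROM requests").fetchone()[0]
print(final_state)  # DONE
```

Because the daemons never call each other directly, any of them can crash and restart without losing work: pending rows simply wait in their current state until the responsible daemon's next polling cycle.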

6. Runtime view

Data Replication Workflow

  1. Client CLIs/Web UI → REST API: "Create replication rule: 3 copies on different continents"
  2. REST API → Database: Records the rule as a database entry
  3. Daemons → Database: Query for pending rules/tasks
  4. Daemons → Storage/Transfer systems: Execute the actual data operations (hours/days)
  5. Daemons → Database: Update completion status
  6. Daemons → Database: Track transfer progress and report metrics
  7. Error path: Storage/Transfer systems → Daemons: "Transfer failed" → Database updated with error status

Authentication Workflow

  1. Client CLIs/Web UI → REST API: Sends login request with credentials/token
  2. REST API → Authentication System: Validates credentials
  3. Authentication System → REST API: Returns success/failure
  4. REST API → Client CLIs/Web UI: Issues session/token if valid
  5. Error path: If validation fails → REST API → Client CLIs/Web UI: "Authentication denied"
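Step 1 of this workflow can be sketched as an HTTP request. The endpoint path and `X-Rucio-*` header names follow Rucio's userpass authentication flow, but treat them as illustrative; the server URL and credentials are placeholders:

```python
import urllib.request

# Sketch of the login request a client sends to the REST API.
# Server URL, account, and credentials are hypothetical.
req = urllib.request.Request(
    "https://rucio.example.org/auth/userpass",
    headers={
        "X-Rucio-Account": "root",    # account to authenticate as
        "X-Rucio-Username": "alice",  # placeholder credentials
        "X-Rucio-Password": "secret",
    },
)
# On success the server answers with an X-Rucio-Auth-Token response
# header, which subsequent API calls send back verbatim.
# (urllib normalizes stored header keys via str.capitalize.)
print(req.get_header("X-rucio-account"))  # root
```

The request object is only constructed here, not sent; issuing it with `urllib.request.urlopen(req)` would perform the actual round trip of steps 2-4.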

Data Query Workflow

  1. Client CLIs/Web UI → REST API: Requests dataset metadata
  2. REST API → Database: Retrieves dataset info
  3. Database → REST API: Returns query result
  4. REST API → Client CLIs/Web UI: Sends dataset details back to user
  5. Error path: If dataset not found → Database → REST API → Client CLIs/Web UI: "Dataset not found"

Database Migration Workflows

The database migration workflows can be found in the following markdown file.

Third-Party-Copy Sequence

The Third-Party-Copy Sequence illustrating interactions between Rucio, FTS and two RSEs for large file transfers can be found here.

7. Deployment view

To be defined with stakeholders – example below.

Environments

  • Development – for active feature work and integration (typically single-node docker-compose or Kubernetes). Includes automated code quality checks, security scanning, and unit testing to catch issues early.
  • QA – functional, integration, and security testing with controlled datasets. Validates feature requirements and security standards before promotion to staging.
  • Staging – production-like environment for final acceptance, performance, and security validation. Serves as the final checkpoint before production deployment.
  • Production – live operational environment with appropriate redundancy, continuous monitoring, and incident response capabilities.

Based on industry best practices from:

Environment Flow Activity Diagram

```mermaid
graph TD
    A[Developer commits code] --> B[Development Environment]

    B --> B1[Unit Tests]
    B --> B2[Code Quality Checks]
    B --> B3[Security Scanning]

    B1 --> C{All Dev Tests Pass?}
    B2 --> C
    B3 --> C

    C -->|No| D[Fix Issues & Recommit]
    D --> A

    C -->|Yes| E[QA Environment]

    E --> E1[Functional Testing]
    E --> E2[Integration Testing]
    E --> E3[Security Validation]
    E --> E4[Compliance Checks]

    E1 --> F{QA Validation Complete?}
    E2 --> F
    E3 --> F
    E4 --> F

    F -->|Issues Found| G[Create Bug Reports]
    G --> D

    F -->|Pass| H[Staging Environment]

    H --> H1[End-to-End Testing]
    H --> H2[Performance Testing]
    H --> H3[User Acceptance Testing]
    H --> H4[Final Security Review]

    H1 --> I{Production Ready?}
    H2 --> I
    H3 --> I
    H4 --> I

    I -->|Not Ready| J[Address Issues]
    J --> G

    I -->|Approved| K[Production Deployment]

    K --> K1[Blue/Green Switch]
    K --> K2[Canary Release]
    K --> K3[Rolling Update]

    K1 --> L[Production Environment]
    K2 --> L
    K3 --> L

    L --> L1[Continuous Monitoring]
    L --> L2[Incident Response]
    L --> L3[Performance Metrics]

    L1 --> M{Issues Detected?}
    L2 --> M
    L3 --> M

    M -->|Yes| N[Rollback/Hotfix]
    N --> K

    M -->|No| O[Success - Monitor & Maintain]
```

Key Testing Gates

```mermaid
graph LR
    subgraph "Development"
        D1[Unit Tests] --> D2[SAST] --> D3[Dependency Scan]
    end

    subgraph "QA"
        Q1[Integration Tests] --> Q2[API Tests] --> Q3[Security Tests]
    end

    subgraph "Staging"
        S1[E2E Tests] --> S2[Load Tests] --> S3[UAT]
    end

    subgraph "Production"
        P1[Health Checks] --> P2[Monitoring] --> P3[Alerts]
    end

    D3 --> Q1
    Q3 --> S1
    S3 --> P1
```

Deployment strategies

Modern deployment approaches ensure safe, reliable software releases:

  • Blue-Green Deployments – maintain two identical environments, switching traffic after validation
  • Canary Releases – gradual rollout to subset of users with monitoring and rollback capabilities
  • GitOps – infrastructure and deployments managed through version control for consistency and traceability

Security validation and policy enforcement are integrated throughout the deployment pipeline.

Ephemeral environments

Preview environments are automatically created for feature branches and pull requests, enabling early stakeholder feedback and isolated testing without environment conflicts. These temporary environments are provisioned and destroyed based on development lifecycle needs.

Learn more: What is an ephemeral environment?

Development environment – Single Node (docker-compose)

For comprehensive information about the deployment view, refer to the diagrams in the following directory.

Development, QA, Staging and Production environments - Multi-Node (Kubernetes)

For comprehensive information about the deployment view, refer to the diagrams in the following directories:

Production environments - Multi-Site (Kubernetes Federation)

TODO(mgajek-cern): add diagram, links, some content

8. Crosscutting concepts

Refer to the Accounting and quota web page and its subsequent sections. Also check out the newly added concepts here.

9. Architectural decisions

For comprehensive information about the architectural decisions, refer to the following folder of markdown files

10. Quality requirements

An overview of the quality requirements with the SLA/SLO/SLI framework can be found here

This section defines measurable quality requirements using Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) to ensure system reliability, performance, and maintainability standards.

11. Risks & technical debt

TODO(mgajek-cern): Add links if existing

12. Glossary

TODO(mgajek-cern): Add links if existing

Testing definitions for different environments

  • Functional/API testing – verifies each feature or API behaves as expected.
  • Regression testing – re-runs existing tests automatically on code changes to catch breaks early.
  • Integration testing – verifies components/services interact correctly.
  • Acceptance testing – final check that system meets business and security requirements before release.
  • Load testing – measures performance under expected or heavy usage.
  • Security testing – includes static analysis, vulnerability scans, penetration tests, and compliance validation across environments.
  • Release validation – confirms the build is correct, stable, secure, and production-ready.

SLA/SLO/SLI Framework

  • SLA (Service Level Agreement): External commitments to users/customers (e.g., 99.9% uptime)
  • SLO (Service Level Objective): Internal targets to meet SLAs (e.g., 99.95% uptime target)
  • SLI (Service Level Indicator): Measurable metrics that track SLO performance (e.g., actual uptime percentage)
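As a worked example of how the three relate, an SLO expressed as an uptime percentage implies a concrete monthly error budget (the numbers below match the examples above; the measured uptime figure is made up):

```python
# Error budget implied by an uptime SLO over a 30-day month.
slo = 0.9995                      # internal 99.95% uptime target
minutes_per_month = 30 * 24 * 60  # 43,200 minutes
error_budget = minutes_per_month * (1 - slo)
print(f"Allowed downtime: {error_budget:.1f} minutes/month")  # 21.6

# The corresponding SLI is simply the measured ratio:
uptime_minutes = 43_190           # hypothetical measurement
sli = uptime_minutes / minutes_per_month
print(f"Measured uptime SLI: {sli:.4%}")  # within budget if >= 99.95%
```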

As-a-Service (XaaS) Options

  • SaaS (Software as a Service) – Ready-to-use applications delivered over the network; provider manages infrastructure, OS, and updates. Example: Gmail, Slack.
  • PaaS (Platform as a Service) – Development and runtime environment; provider handles OS/runtime and scaling, user manages app code. Example: Heroku, Google App Engine.
  • IaaS (Infrastructure as a Service) – Virtualized compute, storage, networking; user manages OS and apps, provider manages hardware and virtualization. Example: AWS EC2, Azure VMs.
  • FaaS (Function as a Service / Serverless) – Run individual functions or code snippets without managing servers; provider handles runtime and scaling. Example: AWS Lambda.
  • DaaS (Data as a Service) – Access to curated datasets via API; provider manages storage and delivery. Example: Snowflake, OpenWeatherMap API.
  • MLaaS (Machine Learning as a Service) – Pre-built ML models and training/deployment tools; user provides input and tuning. Example: Google Vertex AI.
  • AIaaS (Artificial Intelligence as a Service) – AI capabilities via API; provider manages models and infrastructure. Example: ChatGPT, Claude.

Cloud Deployment Types

  • Public Cloud – Shared infrastructure available to the general public, hosted by a provider. Example: AWS, Azure, GCP.
  • Private Cloud – Exclusive infrastructure for a single organization, either on-premises or hosted.
  • Community Cloud – Shared infrastructure among a group of organizations with common requirements.
  • Hybrid Cloud – Combination of private and public cloud, allowing flexibility for sensitive data or scaling.
  • Multi-Cloud – Use of multiple public cloud providers simultaneously for redundancy, performance, or avoiding vendor lock-in.
