This repository contains unofficial architecture documentation on the Rucio project. Its primary goal is to provide a minimal and focused understanding of the system (building blocks, communication between components and systems, relevant cross-cutting concepts, etc.). It is intended as a lightweight reference to support personal comprehension and quick lookup, rather than as a comprehensive or authoritative source.
- What is Rucio?
- Rucio daemons
- Rucio Project Structure
- arc42 overview
- Markdown Architectural Decision Records
Install the `diagrams` Python package:

```shell
pip install diagrams
```

macOS:

```shell
brew install graphviz
```

Ubuntu/Debian:

```shell
sudo apt-get install -y graphviz
```

Windows (using Chocolatey):

```shell
choco install graphviz
```

TODO(mgajek-cern): Eventually set up a Jekyll-based project or use another static site generator tool
See What is Rucio?
An overview of the constraints influencing the architecture can be found here
| Name | Type | Description |
|---|---|---|
| Rucio | Internal System | Scientific data management framework providing declarative policy-based data organization, transfer, and lifecycle management across distributed heterogeneous storage infrastructure |
| Workflow Management Systems | External System | Job and task orchestration platforms including HPC clusters with batch scheduling systems, container orchestration, scientific workflow engines and data preparation pipelines that coordinate with Rucio for data availability, computational processing, and output registration |
| Authentication Systems | External System | Identity and access management services providing user authentication and authorization through various protocols and credential mechanisms, including federated multi-issuer token services |
| Storage Systems | External System | Heterogeneous storage backends including traditional filesystems, object storage, tape archives, cloud storage and external data repositories |
| Monitoring Systems | External System | Analytics and observability platforms that collect, process, and visualize system performance metrics, usage statistics and operational health data |
| Logging Systems | External System | Centralized logging infrastructure that aggregates, stores, and provides search capabilities for system events, audit trails, and troubleshooting information |
| Data Discovery Systems | External System | Catalogue and metadata federation services including ESGF Search, FAIR Data Points, and domain-specific discovery interfaces that enable cross-repository data location and metadata harmonization |
| Database Systems | External System | Transactional relational database management systems that serve as the persistence layer for catalog metadata, system state, and configuration data, including FAIR Data Points providing standardized metadata and data discovery services |
| Transfer Systems | External System | Data movement services and protocols that handle the physical transfer of files between storage endpoints with reliability, scheduling and error handling capabilities |
| Messaging Systems | External System | Messaging services that enable asynchronous communication between distributed components, supporting event-driven architectures, decoupling, reliable message delivery and catalogue change notifications to external applications |
| Email Systems | External System | SMTP-based notification services that deliver system alerts, status updates, operational notifications, and administrative communications to users and operators for workflow management and incident response |
Stakeholder details are provided here
For comprehensive information about the solution strategy, refer to the following markdown file
| Name | Type | Description |
|---|---|---|
| REST API | Internal Component | HTTPS interface providing programmatic access to Rucio functionality through standardized endpoints for authentication, data management, and system operations |
| Web UI | Internal Component | JavaScript-based web user interface for browser-based interaction with Rucio services |
| Client CLIs | Internal Component | User command-line tools (bin/rucio and bin/rucio-admin) that interact with Rucio through the REST API for data operations like upload, download, and management |
| Daemon CLIs | Internal Component | System administration command-line tools where each daemon has a corresponding CLI application that bypasses the API and accesses the database directly with the same logic as daemon processes |
| Daemons | Internal Component | Background processes that orchestrate data management through a database-driven workflow pipeline, handling asynchronous tasks like rule evaluation, data transfers, and cleanup operations |
| Log Collector | External System | Centralized logging infrastructure that aggregates and forwards logs from Rucio components through polling, file monitoring, or direct ingestion; provides storage and search capabilities for system events, audit trails, and troubleshooting information |
| Rucio | Internal System | Scientific data management framework providing declarative policy-based data organization, transfer, and lifecycle management across distributed heterogeneous storage infrastructure |
See also: Rucio Project Structure
For comprehensive information about Rucio daemons, see the official documentation
For detailed information about which external systems each daemon communicates with, refer to the daemon external communications analysis.
Rucio daemons orchestrate data management through a database-driven workflow pipeline where each daemon specializes in a specific task and communicates with others through shared database state rather than direct messaging. This creates a robust, scalable architecture:
Rule Created → Judge Evaluator → Conveyor Submitter → Transfer Tool → Conveyor Poller → Conveyor Finisher
- Client CLIs/Web UI → REST API: "Create replication rule: 3 copies on different continents"
- REST API → Database: Records the rule as a database entry
- Daemons → Database: Query for pending rules/tasks
- Daemons → Storage/Transfer systems: Execute the actual data operations (hours/days)
- Daemons → Database: Update completion status
- Daemons → Database: Track transfer progress and report metrics
- Error path: Storage/Transfer systems → Daemons: "Transfer failed" → Database updated with error status
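The database-driven coordination described above can be sketched in a few lines of Python. This is an illustrative simulation, not actual Rucio code: a dict stands in for the rules table, and the daemon names and states (`INJECTED`, `TRANSFERS_REQUESTED`, `OK`, `STUCK`) are simplified stand-ins for Rucio's real state machine.

```python
# Illustrative sketch (not Rucio code): daemons coordinate through shared
# database state rather than direct messaging. A dict stands in for the
# rules table; each "daemon" is a function that polls and advances state.

RULES = {}  # rule_id -> {"state": ..., "copies": ...}

def create_rule(rule_id, copies):
    """REST API path: record the replication rule as a database entry."""
    RULES[rule_id] = {"state": "INJECTED", "copies": copies}

def judge_evaluator():
    """Daemon: pick up newly injected rules and request transfers."""
    for rule in RULES.values():
        if rule["state"] == "INJECTED":
            rule["state"] = "TRANSFERS_REQUESTED"

def conveyor_cycle(transfer_ok=True):
    """Daemon chain (submitter -> poller -> finisher), collapsed into one
    step: execute transfers and write the outcome back to the database."""
    for rule in RULES.values():
        if rule["state"] == "TRANSFERS_REQUESTED":
            rule["state"] = "OK" if transfer_ok else "STUCK"

create_rule("rule-1", copies=3)
judge_evaluator()
conveyor_cycle()
print(RULES["rule-1"]["state"])  # -> OK
```

Note how the error path needs no extra wiring: a failed transfer is just another state written to the database, which later daemon cycles can pick up and retry.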
- Client CLIs/Web UI → REST API: Sends login request with credentials/token
- REST API → Authentication System: Validates credentials
- Authentication System → REST API: Returns success/failure
- REST API → Client CLIs/Web UI: Issues session/token if valid
- Error path: If validation fails → REST API → Client CLIs/Web UI: "Authentication denied"
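The login exchange can be sketched as two pure functions: one building the request, one interpreting the response. The header names follow Rucio's userpass convention (`X-Rucio-Account`, `X-Rucio-Auth-Token`); treat the exact endpoint and header set as an assumption drawn from that convention, not as a normative spec.

```python
# Sketch of the token-based login exchange. Header names follow Rucio's
# X-Rucio-* userpass convention; endpoint and header set are assumptions.

def build_login_request(base_url, account, username, password):
    """Client CLIs/Web UI -> REST API: login request with credentials."""
    return {
        "url": f"{base_url}/auth/userpass",
        "headers": {
            "X-Rucio-Account": account,
            "X-Rucio-Username": username,
            "X-Rucio-Password": password,
        },
    }

def handle_login_response(status_code, headers):
    """REST API -> client: issue the session token if valid, else deny."""
    if status_code == 200 and "X-Rucio-Auth-Token" in headers:
        return headers["X-Rucio-Auth-Token"]
    raise PermissionError("Authentication denied")

req = build_login_request("https://rucio.example.org", "root", "alice", "s3cret")
token = handle_login_response(200, {"X-Rucio-Auth-Token": "tok-123"})
```

A real client would send `req` over HTTPS and attach the returned token to subsequent requests; the failure branch maps directly to the "Authentication denied" error path above.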
- Client CLIs/Web UI → REST API: Requests dataset metadata
- REST API → Database: Retrieves dataset info
- Database → REST API: Returns query result
- REST API → Client CLIs/Web UI: Sends dataset details back to user
- Error path: If dataset not found → Database → REST API → Client CLIs/Web UI: "Dataset not found"
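The lookup flow can be illustrated with Rucio's `scope:name` data identifier (DID) convention. The in-memory `CATALOG` dict below is a hypothetical stand-in for the database; only the DID format itself is taken from Rucio.

```python
# Sketch of the metadata lookup, assuming Rucio's scope:name DID
# convention. CATALOG is a hypothetical stand-in for the database.

CATALOG = {("user.alice", "dataset1"): {"type": "DATASET", "bytes": 1024}}

def parse_did(did):
    """Split a data identifier of the form 'scope:name'."""
    scope, sep, name = did.partition(":")
    if not sep:
        raise ValueError(f"Invalid DID: {did!r}")
    return scope, name

def get_dataset_info(did):
    """REST API -> Database: retrieve dataset info, or report not found."""
    key = parse_did(did)
    if key not in CATALOG:
        raise LookupError("Dataset not found")
    return CATALOG[key]

info = get_dataset_info("user.alice:dataset1")
```

The `LookupError` branch corresponds to the "Dataset not found" error path, which the REST API would translate into an HTTP error response for the client.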
The database migration workflows can be found in the following markdown file.
The Third-Party-Copy Sequence illustrating interactions between Rucio, FTS and two RSEs for large file transfers can be found here.
To be defined with stakeholders – example below.
- Development – for active feature work and integration (typically single-node `docker-compose` or Kubernetes). Includes automated code quality checks, security scanning, and unit testing to catch issues early.
- QA – functional, integration, and security testing with controlled datasets. Validates feature requirements and security standards before promotion to staging.
- Staging – production-like environment for final acceptance, performance, and security validation. Serves as the final checkpoint before production deployment.
- Production – live operational environment with appropriate redundancy, continuous monitoring, and incident response capabilities.
Based on industry best practices from:
Environment Flow Activity Diagram
```mermaid
graph TD
    A[Developer commits code] --> B[Development Environment]
    B --> B1[Unit Tests]
    B --> B2[Code Quality Checks]
    B --> B3[Security Scanning]
    B1 --> C{All Dev Tests Pass?}
    B2 --> C
    B3 --> C
    C -->|No| D[Fix Issues & Recommit]
    D --> A
    C -->|Yes| E[QA Environment]
    E --> E1[Functional Testing]
    E --> E2[Integration Testing]
    E --> E3[Security Validation]
    E --> E4[Compliance Checks]
    E1 --> F{QA Validation Complete?}
    E2 --> F
    E3 --> F
    E4 --> F
    F -->|Issues Found| G[Create Bug Reports]
    G --> D
    F -->|Pass| H[Staging Environment]
    H --> H1[End-to-End Testing]
    H --> H2[Performance Testing]
    H --> H3[User Acceptance Testing]
    H --> H4[Final Security Review]
    H1 --> I{Production Ready?}
    H2 --> I
    H3 --> I
    H4 --> I
    I -->|Not Ready| J[Address Issues]
    J --> G
    I -->|Approved| K[Production Deployment]
    K --> K1[Blue/Green Switch]
    K --> K2[Canary Release]
    K --> K3[Rolling Update]
    K1 --> L[Production Environment]
    K2 --> L
    K3 --> L
    L --> L1[Continuous Monitoring]
    L --> L2[Incident Response]
    L --> L3[Performance Metrics]
    L1 --> M{Issues Detected?}
    L2 --> M
    L3 --> M
    M -->|Yes| N[Rollback/Hotfix]
    N --> K
    M -->|No| O[Success - Monitor & Maintain]
```
Key Testing Gates
```mermaid
graph LR
    subgraph "Development"
        D1[Unit Tests] --> D2[SAST] --> D3[Dependency Scan]
    end
    subgraph "QA"
        Q1[Integration Tests] --> Q2[API Tests] --> Q3[Security Tests]
    end
    subgraph "Staging"
        S1[E2E Tests] --> S2[Load Tests] --> S3[UAT]
    end
    subgraph "Production"
        P1[Health Checks] --> P2[Monitoring] --> P3[Alerts]
    end
    D3 --> Q1
    Q3 --> S1
    S3 --> P1
```
Modern deployment approaches ensure safe, reliable software releases:
- Blue-Green Deployments – maintain two identical environments, switching traffic after validation
- Canary Releases – gradual rollout to subset of users with monitoring and rollback capabilities
- GitOps – infrastructure and deployments managed through version control for consistency and traceability
Security validation and policy enforcement are integrated throughout the deployment pipeline.
Preview environments are automatically created for feature branches and pull requests, enabling early stakeholder feedback and isolated testing without environment conflicts. These temporary environments are provisioned and destroyed based on development lifecycle needs.
Learn more: What is an ephemeral environment?
For comprehensive information about the deployment view, refer to the diagrams in the following directory.
For comprehensive information about the deployment view, refer to the diagrams in the following directories:
TODO(mgajek-cern): add diagram, links, some content
Refer to the Accounting and quota web page and subsequent sections. Also check out the newly added concepts here.
For comprehensive information about the architectural decisions, refer to the following folder of markdown files
An overview of the quality requirements, using the SLA/SLO/SLI framework, can be found here
This section defines measurable quality requirements using Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) to ensure system reliability, performance, and maintainability standards.
TODO(mgajek-cern): Add links if existing
TODO(mgajek-cern): Add links if existing
- Functional/API testing – verifies each feature or API behaves as expected.
- Regression testing – re-runs existing tests automatically on code changes to catch breaks early.
- Integration testing – verifies components/services interact correctly.
- Acceptance testing – final check that system meets business and security requirements before release.
- Load testing – measures performance under expected or heavy usage.
- Security testing – includes static analysis, vulnerability scans, penetration tests, and compliance validation across environments.
- Release validation – confirms the build is correct, stable, secure, and production-ready.
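The first two test types above can be illustrated with a minimal, self-checking example. The helper below is hypothetical, not part of Rucio; the point is the shape of a functional check that a CI pipeline re-runs on every change as a regression guard.

```python
# Minimal functional/regression check. The helper is a hypothetical toy
# (loosely inspired by Rucio's RSE expressions), not real Rucio code:
# an "RSE" matches an expression if the expression names one of its tags.

def rse_expression_matches(expression, rse_tags):
    """Toy evaluator: does the expression name one of the RSE's tags?"""
    return expression in rse_tags

def test_rse_expression_matches():
    # Functional check: expected behavior for matching and non-matching tags.
    assert rse_expression_matches("tier1", {"tier1", "europe"})
    assert not rse_expression_matches("tier2", {"tier1", "europe"})

test_rse_expression_matches()
```

Re-running such checks automatically on each commit is what turns a one-off functional test into regression coverage.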
- SLA (Service Level Agreement): External commitments to users/customers (e.g., 99.9% uptime)
- SLO (Service Level Objective): Internal targets to meet SLAs (e.g., 99.95% uptime target)
- SLI (Service Level Indicator): Measurable metrics that track SLO performance (e.g., actual uptime percentage)
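A short worked example makes the relationship concrete: the SLI is the measured value, which is compared against the internal SLO target and the external SLA commitment (the targets below are illustrative, not committed Rucio numbers).

```python
# Worked example of the SLA/SLO/SLI relationship: compute an uptime SLI
# from raw measurements and compare it against the targets. The target
# values are illustrative, not real commitments.

def uptime_sli(total_minutes, downtime_minutes):
    """SLI: measured uptime as a percentage of the observation window."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

SLA_TARGET = 99.9    # external commitment to users
SLO_TARGET = 99.95   # stricter internal target, leaving headroom

# A 30-day month with 20 minutes of total downtime:
sli = uptime_sli(30 * 24 * 60, 20)   # ~99.954%
meets_slo = sli >= SLO_TARGET
meets_sla = sli >= SLA_TARGET
```

Setting the SLO stricter than the SLA gives an error budget: the SLO can be missed slightly before the external commitment is actually breached.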
- SaaS (Software as a Service) – Ready-to-use applications delivered over the network; provider manages infrastructure, OS, and updates. Example: Gmail, Slack.
- PaaS (Platform as a Service) – Development and runtime environment; provider handles OS/runtime and scaling, user manages app code. Example: Heroku, Google App Engine.
- IaaS (Infrastructure as a Service) – Virtualized compute, storage, networking; user manages OS and apps, provider manages hardware and virtualization. Example: AWS EC2, Azure VMs.
- FaaS (Function as a Service / Serverless) – Run individual functions or code snippets without managing servers; provider handles runtime and scaling. Example: AWS Lambda.
- DaaS (Data as a Service) – Access to curated datasets via API; provider manages storage and delivery. Example: Snowflake, OpenWeatherMap API.
- MLaaS (Machine Learning as a Service) – Pre-built ML models and training/deployment tools; user provides input and tuning. Example: Google Vertex AI.
- AIaaS (Artificial Intelligence as a Service) – AI capabilities via API; provider manages models and infrastructure. Example: ChatGPT, Claude.
- Public Cloud – Shared infrastructure available to the general public, hosted by a provider. Example: AWS, Azure, GCP.
- Private Cloud – Exclusive infrastructure for a single organization, either on-premises or hosted.
- Community Cloud – Shared infrastructure among a group of organizations with common requirements.
- Hybrid Cloud – Combination of private and public cloud, allowing flexibility for sensitive data or scaling.
- Multi-Cloud – Use of multiple public cloud providers simultaneously for redundancy, performance, or avoiding vendor lock-in.