architecture-docs

Summary

This repository contains unofficial architecture documentation on the Rucio project. Its primary goal is to provide a minimal and focused understanding of the system (building blocks, communication between components and systems, relevant cross-cutting concepts, etc.). It is intended as a lightweight reference to support personal comprehension and quick lookup, rather than as a comprehensive or authoritative source.

References

Preconditions

1. Install Python dependencies

pip install diagrams

2. Install Graphviz

macOS:

brew install graphviz

Ubuntu/Debian:

sudo apt-get install -y graphviz

Windows (using Chocolatey):

choco install graphviz

arc42 chapters

TODO(mgajek-cern): Eventually set up a Jekyll-based project or use another static site generator

1. Introduction and Goals

See What is Rucio?

2. Constraints

An overview of the constraints influencing the architecture can be found here

3. Context & Scope

Context View

| Name | Type | Description |
| --- | --- | --- |
| Rucio | Internal System | Scientific data management framework providing declarative, policy-based data organization, transfer, and lifecycle management across distributed heterogeneous storage infrastructure |
| Workflow Management Systems | External System | Job and task orchestration platforms, including HPC clusters with batch scheduling systems, container orchestration, scientific workflow engines, and data preparation pipelines that coordinate with Rucio for data availability, computational processing, and output registration |
| Authentication Systems | External System | Identity and access management services providing user authentication and authorization through various protocols and credential mechanisms, including federated multi-issuer token services |
| Storage Systems | External System | Heterogeneous storage backends, including traditional filesystems, object storage, tape archives, cloud storage, and external data repositories |
| Monitoring Systems | External System | Analytics and observability platforms that collect, process, and visualize system performance metrics, usage statistics, and operational health data |
| Logging Systems | External System | Centralized logging infrastructure that aggregates, stores, and provides search capabilities for system events, audit trails, and troubleshooting information |
| Data Discovery Systems | External System | Catalogue and metadata federation services, including ESGF Search, FAIR Data Points, and domain-specific discovery interfaces that enable cross-repository data location and metadata harmonization |
| Database Systems | External System | Transactional relational database management systems that serve as the persistence layer for catalog metadata, system state, and configuration data |
| Transfer Systems | External System | Data movement services and protocols that handle the physical transfer of files between storage endpoints with reliability, scheduling, and error handling capabilities |
| Messaging Systems | External System | Messaging services that enable asynchronous communication between distributed components, supporting event-driven architectures, decoupling, reliable message delivery, and catalogue change notifications to external applications |
| Caching Systems | External System | High-speed data stores that temporarily hold frequently accessed data to reduce latency, decrease load on primary data sources, and improve overall system performance through intelligent data placement algorithms |
| Email Systems | External System | SMTP-based notification services that deliver system alerts, status updates, operational notifications, and administrative communications to users and operators for workflow management and incident response |

Stakeholder details are provided here

4. Solution strategy

For comprehensive information about the solution strategy, refer to the following markdown file

5. Building Block views

Lvl 1

Building Block Lvl 1 View

| Name | Type | Description |
| --- | --- | --- |
| REST API | Internal Component | HTTPS interface providing programmatic access to Rucio functionality through standardized endpoints for authentication, data management, and system operations |
| Web UI | Internal Component | JavaScript-based web user interface for browser-based interaction with Rucio services |
| Client CLIs | Internal Component | User command-line tools (`bin/rucio` and `bin/rucio-admin`) that interact with Rucio through the REST API for data operations such as upload, download, and management |
| Daemon CLIs | Internal Component | System administration command-line tools; each daemon has a corresponding CLI application that bypasses the API and accesses the database directly, using the same logic as the daemon processes |
| Daemons | Internal Component | Background processes that orchestrate data management through a database-driven workflow pipeline, handling asynchronous tasks such as rule evaluation, data transfers, and cleanup operations |
| Log Collector | External System | Centralized logging infrastructure that aggregates and forwards logs from Rucio components through polling, file monitoring, or direct ingestion; provides storage and search capabilities for system events, audit trails, and troubleshooting information |
| Rucio | Internal System | Scientific data management framework providing declarative, policy-based data organization, transfer, and lifecycle management across distributed heterogeneous storage infrastructure |

Lvl 2

Building Block Lvl 2 View

See also: Rucio Project Structure

Lvl 3

Daemons

For comprehensive information about Rucio daemons, see the official documentation

For detailed information about which external systems each daemon communicates with, refer to the daemon external communications analysis.

Rucio daemons orchestrate data management through a database-driven workflow pipeline where each daemon specializes in a specific task and communicates with others through shared database state rather than direct messaging. This creates a robust, scalable architecture:

Rule Created → Judge Evaluator → Conveyor Submitter → Transfer Tool → Conveyor Poller → Conveyor Finisher
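The shared-database handoff above can be sketched as follows. The table, state names, and daemon mapping are hypothetical, chosen only to illustrate how one daemon's output state becomes the next daemon's input query; Rucio's real schema and state machine differ:

```python
import sqlite3

# Hypothetical pipeline: each "daemon" claims rows in one state and
# advances them to the next, with the database as the only channel.
PIPELINE = [("QUEUED", "SUBMITTED"),   # e.g. a submitter-style daemon
            ("SUBMITTED", "DONE")]     # e.g. a finisher-style daemon

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE requests (id INTEGER PRIMARY KEY, state TEXT)")
db.execute("INSERT INTO requests (state) VALUES ('QUEUED')")

def run_daemon(conn, src, dst):
    """One polling cycle: pick up work in state `src`, hand it off as `dst`."""
    conn.execute("UPDATE requests SET state = ? WHERE state = ?", (dst, src))
    conn.commit()

for src, dst in PIPELINE:
    run_daemon(db, src, dst)

final_state = db.execute("SELECT state FROM requests").fetchone()[0]
print(final_state)  # DONE
```

Because the daemons never call each other directly, any of them can crash and restart without losing work: pending rows simply wait in their current state until the responsible daemon's next polling cycle.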

6. Runtime view

Data Replication Workflow

  1. Client CLIs/Web UI → REST API: "Create replication rule: 3 copies on different continents"
  2. REST API → Database: Records the rule as a database entry
  3. Daemons → Database: Query for pending rules/tasks
  4. Daemons → Storage/Transfer systems: Execute the actual data operations (hours/days)
  5. Daemons → Database: Update completion status
  6. Daemons → Database: Track transfer progress and report metrics
  7. Error path: Storage/Transfer systems → Daemons: "Transfer failed" → Database updated with error status

Authentication Workflow

  1. Client CLIs/Web UI → REST API: Sends login request with credentials/token
  2. REST API → Authentication System: Validates credentials
  3. Authentication System → REST API: Returns success/failure
  4. REST API → Client CLIs/Web UI: Issues session/token if valid
  5. Error path: If validation fails → REST API → Client CLIs/Web UI: "Authentication denied"
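Step 1 of this workflow can be sketched as an HTTP request. The endpoint path and `X-Rucio-*` header names follow Rucio's userpass authentication flow, but treat them as illustrative; the server URL and credentials are placeholders:

```python
import urllib.request

# Sketch of the login request a client sends to the REST API.
# Server URL, account, and credentials are hypothetical.
req = urllib.request.Request(
    "https://rucio.example.org/auth/userpass",
    headers={
        "X-Rucio-Account": "root",    # account to authenticate as
        "X-Rucio-Username": "alice",  # placeholder credentials
        "X-Rucio-Password": "secret",
    },
)
# On success the server answers with an X-Rucio-Auth-Token response
# header, which subsequent API calls send back verbatim.
# (urllib normalizes stored header keys via str.capitalize.)
print(req.get_header("X-rucio-account"))  # root
```

The request object is only constructed here, not sent; issuing it with `urllib.request.urlopen(req)` would perform the actual round trip of steps 2-4.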

Data Query Workflow

  1. Client CLIs/Web UI → REST API: Requests dataset metadata
  2. REST API → Database: Retrieves dataset info
  3. Database → REST API: Returns query result
  4. REST API → Client CLIs/Web UI: Sends dataset details back to user
  5. Error path: If dataset not found → Database → REST API → Client CLIs/Web UI: "Dataset not found"

Database Migration Workflows

The database migration workflows can be found in the following markdown file.

Third-Party-Copy Sequence

The Third-Party-Copy Sequence illustrating interactions between Rucio, FTS and two RSEs for large file transfers can be found here.

7. Deployment view

To be defined with stakeholders – example below.

Environments

  • Development – for active feature work and integration (typically single-node docker-compose or Kubernetes). Includes automated code quality checks, security scanning, and unit testing to catch issues early.
  • QA – functional, integration, and security testing with controlled datasets. Validates feature requirements and security standards before promotion to staging.
  • Staging – production-like environment for final acceptance, performance, and security validation. Serves as the final checkpoint before production deployment.
  • Production – live operational environment with appropriate redundancy, continuous monitoring, and incident response capabilities.

Based on industry best practices from:

Environment Flow Activity Diagram

```mermaid
graph TD
    A[Developer commits code] --> B[Development Environment]

    B --> B1[Unit Tests]
    B --> B2[Code Quality Checks]
    B --> B3[Security Scanning]

    B1 --> C{All Dev Tests Pass?}
    B2 --> C
    B3 --> C

    C -->|No| D[Fix Issues & Recommit]
    D --> A

    C -->|Yes| E[QA Environment]

    E --> E1[Functional Testing]
    E --> E2[Integration Testing]
    E --> E3[Security Validation]
    E --> E4[Compliance Checks]

    E1 --> F{QA Validation Complete?}
    E2 --> F
    E3 --> F
    E4 --> F

    F -->|Issues Found| G[Create Bug Reports]
    G --> D

    F -->|Pass| H[Staging Environment]

    H --> H1[End-to-End Testing]
    H --> H2[Performance Testing]
    H --> H3[User Acceptance Testing]
    H --> H4[Final Security Review]

    H1 --> I{Production Ready?}
    H2 --> I
    H3 --> I
    H4 --> I

    I -->|Not Ready| J[Address Issues]
    J --> G

    I -->|Approved| K[Production Deployment]

    K --> K1[Blue/Green Switch]
    K --> K2[Canary Release]
    K --> K3[Rolling Update]

    K1 --> L[Production Environment]
    K2 --> L
    K3 --> L

    L --> L1[Continuous Monitoring]
    L --> L2[Incident Response]
    L --> L3[Performance Metrics]

    L1 --> M{Issues Detected?}
    L2 --> M
    L3 --> M

    M -->|Yes| N[Rollback/Hotfix]
    N --> K

    M -->|No| O[Success - Monitor & Maintain]
```

Key Testing Gates

```mermaid
graph LR
    subgraph "Development"
        D1[Unit Tests] --> D2[SAST] --> D3[Dependency Scan]
    end

    subgraph "QA"
        Q1[Integration Tests] --> Q2[API Tests] --> Q3[Security Tests]
    end

    subgraph "Staging"
        S1[E2E Tests] --> S2[Load Tests] --> S3[UAT]
    end

    subgraph "Production"
        P1[Health Checks] --> P2[Monitoring] --> P3[Alerts]
    end

    D3 --> Q1
    Q3 --> S1
    S3 --> P1
```

Deployment strategies

Modern deployment approaches ensure safe, reliable software releases:

  • Blue-Green Deployments – maintain two identical environments, switching traffic after validation
  • Canary Releases – gradual rollout to subset of users with monitoring and rollback capabilities
  • GitOps – infrastructure and deployments managed through version control for consistency and traceability

Security validation and policy enforcement are integrated throughout the deployment pipeline.

Ephemeral environments

Preview environments are automatically created for feature branches and pull requests, enabling early stakeholder feedback and isolated testing without environment conflicts. These temporary environments are provisioned and destroyed based on development lifecycle needs.

Learn more: What is an ephemeral environment?

Development environment – Single Node (docker-compose)

For comprehensive information about the deployment view, refer to the diagrams in the following directory.

Development, QA, Staging and Production environments - Multi-Node (Kubernetes)

For comprehensive information about the deployment view, refer to the diagrams in the following directories:

Production environments - Multi-Site (Kubernetes Federation)

TODO(mgajek-cern): add diagram, links, some content

8. Crosscutting concepts

Refer to the Accounting and quota web page and its subsequent sections. Also check out the newly added concepts here.

9. Architectural decisions

For comprehensive information about the architectural decisions, refer to the following folder of markdown files

10. Quality requirements

An overview of the quality requirements with the SLA/SLO/SLI framework can be found here

This section defines measurable quality requirements using Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) to ensure system reliability, performance, and maintainability standards.

11. Risks & technical debt

TODO(mgajek-cern): Add links if existing

12. Glossary

TODO(mgajek-cern): Add links if existing

Testing definitions for different environments

  • Functional/API testing – verifies each feature or API behaves as expected.
  • Regression testing – re-runs existing tests automatically on code changes to catch breaks early.
  • Integration testing – verifies components/services interact correctly.
  • Acceptance testing – final check that system meets business and security requirements before release.
  • Load testing – measures performance under expected or heavy usage.
  • Security testing – includes static analysis, vulnerability scans, penetration tests, and compliance validation across environments.
  • Release validation – confirms the build is correct, stable, secure, and production-ready.

SLA/SLO/SLI Framework

  • SLA (Service Level Agreement): External commitments to users/customers (e.g., 99.9% uptime)
  • SLO (Service Level Objective): Internal targets to meet SLAs (e.g., 99.95% uptime target)
  • SLI (Service Level Indicator): Measurable metrics that track SLO performance (e.g., actual uptime percentage)
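As a worked example of how the three relate, an SLO expressed as an uptime percentage implies a concrete monthly error budget (the numbers below match the examples above; the measured uptime figure is made up):

```python
# Error budget implied by an uptime SLO over a 30-day month.
slo = 0.9995                      # internal 99.95% uptime target
minutes_per_month = 30 * 24 * 60  # 43,200 minutes
error_budget = minutes_per_month * (1 - slo)
print(f"Allowed downtime: {error_budget:.1f} minutes/month")  # 21.6

# The corresponding SLI is simply the measured ratio:
uptime_minutes = 43_190           # hypothetical measurement
sli = uptime_minutes / minutes_per_month
print(f"Measured uptime SLI: {sli:.4%}")  # within budget if >= 99.95%
```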

As-a-Service (XaaS) Options

  • SaaS (Software as a Service) – Ready-to-use applications delivered over the network; provider manages infrastructure, OS, and updates. Example: Gmail, Slack.
  • PaaS (Platform as a Service) – Development and runtime environment; provider handles OS/runtime and scaling, user manages app code. Example: Heroku, Google App Engine.
  • IaaS (Infrastructure as a Service) – Virtualized compute, storage, networking; user manages OS and apps, provider manages hardware and virtualization. Example: AWS EC2, Azure VMs.
  • FaaS (Function as a Service / Serverless) – Run individual functions or code snippets without managing servers; provider handles runtime and scaling. Example: AWS Lambda.
  • DaaS (Data as a Service) – Access to curated datasets via API; provider manages storage and delivery. Example: Snowflake, OpenWeatherMap API.
  • MLaaS (Machine Learning as a Service) – Pre-built ML models and training/deployment tools; user provides input and tuning. Example: Google Vertex AI.
  • AIaaS (Artificial Intelligence as a Service) – AI capabilities via API; provider manages models and infrastructure. Example: ChatGPT, Claude.

Cloud Deployment Types

  • Public Cloud – Shared infrastructure available to the general public, hosted by a provider. Example: AWS, Azure, GCP.
  • Private Cloud – Exclusive infrastructure for a single organization, either on-premises or hosted.
  • Community Cloud – Shared infrastructure among a group of organizations with common requirements.
  • Hybrid Cloud – Combination of private and public cloud, allowing flexibility for sensitive data or scaling.
  • Multi-Cloud – Use of multiple public cloud providers simultaneously for redundancy, performance, or avoiding vendor lock-in.
