Author: Stephen Elliott
Institution: UNSW Sydney
Project: COMP4121 Course Project (2024)
Status: Incomplete, Archived
Catalyst is a deep learning framework I designed and built to learn the architectural principles of modern deep learning systems such as PyTorch and TensorFlow. The project examines how the mathematics of neural networks is implemented and executed on hardware, through computational graphs, automatic differentiation, backpropagation, and hardware abstraction layers.
The framework develops architectural decisions from first principles, with emphasis on modularity and explicit design patterns over quick-and-dirty tricks. I originally aimed to train a GPT on the framework, but I soon realised the scope of that ambition and decided to prioritise depth of systems understanding instead.
Over the 10-week course I researched established deep learning backends, learned C++, designed the system, and implemented substantial parts of it. This was a fantastic learning experience as an intermediate undergraduate, and I am pleased to share it.
Focus Areas:
- Computational graph design and execution
- Hardware abstraction with strategy patterns
- Automatic differentiation and backpropagation (see the sketch after this list)
- Asynchronous process and hardware management
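To make the autodiff focus concrete, here is a minimal micrograd-style sketch of reverse-mode differentiation, assuming hypothetical names (Value, add, mul, backward); it illustrates the general technique, not Catalyst's actual classes. Each node records its value, its gradient, and a closure that pushes its gradient to its parents; backward() runs those closures in reverse topological order.

```cpp
// Minimal reverse-mode autodiff sketch (illustrative names only, not Catalyst's API).
#include <functional>
#include <iostream>
#include <memory>
#include <unordered_set>
#include <vector>

struct Value {
    double data = 0.0;
    double grad = 0.0;                         // dL/d(this node)
    std::vector<std::shared_ptr<Value>> parents;
    std::function<void()> backward_fn = [] {}; // pushes grad to parents

    explicit Value(double d) : data(d) {}
};

using V = std::shared_ptr<Value>;

V add(const V& a, const V& b) {
    auto out = std::make_shared<Value>(a->data + b->data);
    out->parents = {a, b};
    Value* o = out.get();                      // raw pointer avoids a shared_ptr cycle
    out->backward_fn = [a, b, o] { a->grad += o->grad; b->grad += o->grad; };
    return out;
}

V mul(const V& a, const V& b) {
    auto out = std::make_shared<Value>(a->data * b->data);
    out->parents = {a, b};
    Value* o = out.get();
    out->backward_fn = [a, b, o] {
        a->grad += b->data * o->grad;          // d(ab)/da = b
        b->grad += a->data * o->grad;          // d(ab)/db = a
    };
    return out;
}

// Seed dL/dL = 1, then run the recorded closures in reverse topological order.
void backward(const V& loss) {
    std::vector<V> order;
    std::unordered_set<Value*> seen;
    std::function<void(const V&)> visit = [&](const V& node) {
        if (!seen.insert(node.get()).second) return;
        for (const auto& p : node->parents) visit(p);
        order.push_back(node);                 // parents first, node last
    };
    visit(loss);
    loss->grad = 1.0;
    for (auto it = order.rbegin(); it != order.rend(); ++it) (*it)->backward_fn();
}

int main() {
    auto w = std::make_shared<Value>(3.0);
    auto x = std::make_shared<Value>(2.0);
    auto b = std::make_shared<Value>(1.0);
    auto y = add(mul(w, x), b);                // y = w*x + b
    backward(y);
    std::cout << "dy/dw = " << w->grad << ", dy/dx = " << x->grad << '\n';  // 2, 3
}
```

Catalyst's CGGraph/CGNode layer, described below, applies the same idea at tensor rather than scalar granularity.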
Beyond learning C++, this project provided a systematic investigation of:
- Framework Design Complexity: Translating a mathematical specification (∂L/∂w) into executable code requires addressing graph intermediate representations, memory management, execution scheduling, and numerical stability
- Hardware Abstraction: Strategy patterns enable transparent operation dispatch to CPU (sequential, cached execution) versus GPU (parallel, high-bandwidth execution) contexts based on hardware characteristics (see the sketch after this list)
- HPC Software Engineering: Complexity management through smart pointers, RAII, mutex protection, and design patterns (Strategy, Factory, Singleton) in performance-critical implementations
- Systems Understanding: Machine learning as a high-performance computing application with concrete hardware constraints, memory hierarchies, and parallelisation requirements
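As an illustration of the hardware-abstraction point, the sketch below dispatches an element-wise add through a common interface to a CPU backend or a stubbed GPU backend. All class and function names here (TensorOpsStrategy, CpuStrategy, GpuStrategy, make_strategy) are assumptions for illustration, not Catalyst's actual API.

```cpp
// Strategy-pattern sketch for hardware dispatch (hypothetical names, not Catalyst's API).
#include <cstddef>
#include <iostream>
#include <memory>
#include <stdexcept>
#include <vector>

// Common interface: callers never know which device runs the operation.
class TensorOpsStrategy {
public:
    virtual ~TensorOpsStrategy() = default;
    virtual std::vector<float> add(const std::vector<float>& a,
                                   const std::vector<float>& b) const = 0;
};

// Sequential, cache-friendly CPU implementation.
class CpuStrategy final : public TensorOpsStrategy {
public:
    std::vector<float> add(const std::vector<float>& a,
                           const std::vector<float>& b) const override {
        std::vector<float> out(a.size());
        for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
        return out;
    }
};

// Placeholder for a parallel OpenCL implementation (no kernels in this sketch).
class GpuStrategy final : public TensorOpsStrategy {
public:
    std::vector<float> add(const std::vector<float>&,
                           const std::vector<float>&) const override {
        throw std::runtime_error("GPU kernel not implemented in this sketch");
    }
};

// Factory keeps device selection in one place, so calling code never branches on hardware.
std::unique_ptr<TensorOpsStrategy> make_strategy(bool gpu_available) {
    if (gpu_available) return std::make_unique<GpuStrategy>();
    return std::make_unique<CpuStrategy>();
}

int main() {
    auto ops = make_strategy(/*gpu_available=*/false);
    for (float v : ops->add({1.f, 2.f, 3.f}, {4.f, 5.f, 6.f})) std::cout << v << ' ';  // 5 7 9
    std::cout << '\n';
}
```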
Because this was a time-limited educational project, I emphasised design depth over implementation breadth, treating architectural decisions as learning opportunities rather than optimising for feature completeness or efficiency. In the end the system gave me a good understanding of deep learning backends and larger-scale system design, and prepared me to write better code in future research projects. It also opened the door to contributing to open-source HPC backends, though I discovered through the project that my interest really lies in models. Hence, I refocused my attention on models and frontier theory after this project.
Initial objectives: Primarily, to learn about industrial deep learning backends; in particular, by designing and implementing my own framework, training a GPT on it, and, in that context, analysing the transformer architecture.
Complexity was higher than I expected: The design required much more implementation than I anticipated. Given the scope and interconnectedness of a flexible system like this, core components (graph traversal algorithms, cache invalidation, pointer management, LibTorch integration) each expanded into their own engineering problem.
I adjusted the scope: I shifted from outcomes (training a GPT) to educational benefit, concentrating on a strong design. My final report and system diagrams focus on this architectural understanding rather than on chasing a complete implementation, which I believe would have taught me less about systems in general (my objective).
Model (API Layer)
↓
Computational Graph (CGGraph, CGNode)
↓
Tensor Operations (Strategy Pattern)
↓
CPU Strategy ← → GPU Strategy (OpenCL)
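To make the Computational Graph layer of this diagram concrete, the sketch below derives a forward execution order for a small dependency DAG using Kahn's algorithm; the backward pass can then visit the same order in reverse. The Node structure and field names are assumptions for illustration, not the real CGGraph/CGNode classes.

```cpp
// Illustrative DAG-scheduling sketch (assumed names, not the real CGGraph/CGNode).
#include <cstddef>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct Node {
    std::string name;
    std::vector<std::size_t> inputs;   // indices of nodes this one depends on
};

// Kahn's algorithm: repeatedly emit nodes whose dependencies are all satisfied.
std::vector<std::size_t> topological_order(const std::vector<Node>& nodes) {
    std::vector<std::size_t> indegree(nodes.size(), 0);
    std::vector<std::vector<std::size_t>> consumers(nodes.size());
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        indegree[i] = nodes[i].inputs.size();
        for (std::size_t dep : nodes[i].inputs) consumers[dep].push_back(i);
    }

    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < nodes.size(); ++i)
        if (indegree[i] == 0) ready.push(i);

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        std::size_t i = ready.front();
        ready.pop();
        order.push_back(i);
        for (std::size_t c : consumers[i])
            if (--indegree[c] == 0) ready.push(c);
    }
    return order;   // forward pass runs in this order; backward walks it in reverse
}

int main() {
    // y = matmul(x, w) + b, expressed as a tiny dependency graph.
    std::vector<Node> graph = {
        {"x", {}}, {"w", {}}, {"b", {}},
        {"matmul", {0, 1}}, {"add", {3, 2}},
    };
    for (std::size_t i : topological_order(graph)) std::cout << graph[i].name << ' ';
    std::cout << '\n';   // x w b matmul add
}
```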
Core Components:
- Computational Graph: DAG-based execution with topological sorting and automatic differentiation
- Tensor Operations: Hardware-agnostic interface with CPU/GPU strategy implementations
- Data Pipeline: Dataset loading, batching, train/val/test splitting
- Logging System: Async JSON logging with system metrics and training diagnostics (see the sketch after this list)
- GPU Infrastructure: OpenCL kernel management (architecture defined, kernels not implemented)
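The logging system follows a standard producer/consumer pattern: training code enqueues entries, and a worker thread serialises and writes them off the hot path. The sketch below shows that pattern with a mutex-protected queue; the AsyncLogger class, its fields, and the hand-rolled JSON line are assumptions for illustration, not the framework's actual serialisation.

```cpp
// Producer/consumer logging sketch (illustrative; not Catalyst's actual logger).
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLogger {
public:
    explicit AsyncLogger(const std::string& path)
        : out_(path), worker_([this] { drain(); }) {}

    ~AsyncLogger() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();                       // flushes remaining entries
    }

    // Producers format a minimal JSON line and enqueue it without blocking on file I/O.
    void log(const std::string& kind, double value) {
        std::string line =
            "{\"kind\":\"" + kind + "\",\"value\":" + std::to_string(value) + "}";
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(line));
        }
        cv_.notify_one();
    }

private:
    // Worker thread: wait for entries, then write them out away from the training loop.
    void drain() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!done_ || !queue_.empty()) {
            cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
            while (!queue_.empty()) {
                out_ << queue_.front() << '\n';
                queue_.pop();
            }
        }
    }

    std::ofstream out_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::string> queue_;
    bool done_ = false;
    std::thread worker_;                      // declared last so the other members exist first
};

int main() {
    AsyncLogger logger("training_log.jsonl");
    logger.log("loss", 0.42);
    logger.log("epoch_time_s", 1.7);
}   // destructor drains the queue and joins the worker
```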
See also:
- The technical report for a detailed design discussion, including a review of major frameworks;
- The architecture review for a spotlight on the main features of each module, and the system diagrams for an overview of the system architecture;
- The branch guide for what's on each of the dev branches.
Completed:
- ✓ Data handling (Dataset, DataLoader with proper splitting; see the splitting sketch after this list)
- ✓ Async logging system (multiple entry types, serialisation to JSON)
- ✓ Tensor operations interface with strategy pattern
- ✓ Comprehensive unit tests (Google Test)
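For the data handling item above, here is a minimal sketch of index-based train/val/test splitting; the helper names (split_indices, Splits) and ratios are assumptions for illustration, not the project's Dataset/DataLoader interface.

```cpp
// Index-based dataset splitting sketch (assumed names, not the project's DataLoader API).
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

struct Splits {
    std::vector<std::size_t> train, val, test;
};

// Shuffle indices once, then carve them into contiguous train/val/test ranges.
Splits split_indices(std::size_t n, double train_frac, double val_frac, unsigned seed) {
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::mt19937 rng(seed);
    std::shuffle(idx.begin(), idx.end(), rng);

    auto n_train = static_cast<std::size_t>(n * train_frac);
    auto n_val = static_cast<std::size_t>(n * val_frac);

    Splits s;
    s.train.assign(idx.begin(), idx.begin() + n_train);
    s.val.assign(idx.begin() + n_train, idx.begin() + n_train + n_val);
    s.test.assign(idx.begin() + n_train + n_val, idx.end());
    return s;
}

int main() {
    auto s = split_indices(100, /*train_frac=*/0.8, /*val_frac=*/0.1, /*seed=*/42);
    std::cout << s.train.size() << " train, " << s.val.size() << " val, "
              << s.test.size() << " test\n";   // 80 train, 10 val, 10 test
}
```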
Partially Implemented:
- ◐ Computational graph (forward/backward passes with known segfault issues)
- ◐ GPU operations infrastructure (headers and interfaces only)
- ◐ Model training loop (stubs present)
Designed but Not Implemented:
- ✗ Custom OpenCL kernels
- ✗ Computational graph optimisation
- ✗ Multi-device scheduling
- ✗ Distributed training (Bonzo-HPC library described in report)
Dependencies:
- C++20 compiler
- CMake 3.18+
- LibTorch 2.5.1 (CUDA 12.4)
- OpenCL 1.2+
- Google Test (included)
# Download LibTorch
wget https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcu124.zip
unzip libtorch-*.zip
# Build
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=../libtorch ..
cmake --build . --config Release
# Run tests
ctest --output-on-failure

Note: Not all branches compile successfully. The tensorops branch contains the most complete tested implementation.
source/
├── computationgraph/ # CGGraph, CGNode (DAG execution)
├── tensorops/ # Hardware abstraction (Strategy pattern)
├── gpuops/ # OpenCL kernel infrastructure
├── datahandlers/ # Dataset, DataLoader
├── logging/ # Async logging system
├── io/ # File I/O utilities
└── Model/ # Top-level API
tests/ # Unit tests (Google Test)
docs/ # Technical reports
design/ # Figma architecture diagrams
Known Issues:
- Segmentation faults in multi-layer forward passes (pointer invalidation)
- Type mismatches between tensor and scalar operations in gradient updates
- OpenCL kernel source code unimplemented (infrastructure only)
- No end-to-end training loop despite component implementations
These issues are documented to indicate how far the implementation progressed. The project's educational objectives have been achieved, so I don't intend to resolve them.
Inspired by:
- PyTorch: Dynamic graphs, autograd
- TensorFlow: Grappler optimisation, XLA compilation
- Karpathy's micrograd: Minimal autograd reference
- Horovod: Distributed training patterns
Academic foundations:
- Vaswani et al., "Attention is All You Need"
- Dao et al., "FlashAttention"
- CLRS, Introduction to Algorithms (graph algorithms)
MIT License - See LICENSE file for details.
GitHub: one-2/catalyst