HPC Project: Distributed Parallel MST

A highly optimized, hybrid MPI + OpenMP implementation of Boruvka's algorithm for finding the Minimum Spanning Tree (MST) on distributed clusters, developed for the High-Performance Computing graduate course at the University of Trento (academic year 2024/2025). Designed for speed, scalability, and memory efficiency.
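For orientation, the sketch below outlines one Borůvka round over the dense weight matrix this project uses. It is a minimal, serial sketch only; `MAX_VALUE`, `find_root`, and `boruvka_round` are illustrative names, not the project's actual API.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_VALUE UINT16_MAX   /* sentinel weight: "no edge" (illustrative) */

/* Illustrative union-find root lookup with path halving. */
static uint32_t find_root(uint32_t *parent, uint32_t v) {
    while (parent[v] != v) {
        parent[v] = parent[parent[v]];
        v = parent[v];
    }
    return v;
}

/* One Borůvka round: every component selects its cheapest outgoing edge. */
static void boruvka_round(const uint16_t *graph, uint32_t n, uint32_t *parent,
                          uint16_t *cheapest_w, uint32_t *cheapest_u, uint32_t *cheapest_v) {
    for (uint32_t c = 0; c < n; c++) cheapest_w[c] = MAX_VALUE;

    /* Every vertex scans its row; the lightest edge leaving its component wins. */
    for (uint32_t v = 0; v < n; v++) {
        uint32_t rv = find_root(parent, v);
        for (uint32_t u = 0; u < n; u++) {
            uint16_t w = graph[(size_t)v * n + u];
            if (w < cheapest_w[rv] && find_root(parent, u) != rv) {
                cheapest_w[rv] = w;
                cheapest_u[rv] = v;
                cheapest_v[rv] = u;
            }
        }
    }
    /* ...then merge each component with the far endpoint of its cheapest edge,
       record those edges in the MST, and repeat until one component remains. */
}
```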


🚀 Performance Highlights

| Metric | Baseline | Optimized | Improvement |
| --- | --- | --- | --- |
| Total Execution (20k Nodes) | 94.0s | 20.0s | 4.7x Faster |
| I/O Overhead | 93s | ~1.2s | 77x Faster |
| Memory Access | Indirect (Ptr-to-Ptr) | Flat 1D (Contiguous) | Cache Optimal |


✨ Key Features

  • Hybrid Parallelism: Uses MPI for inter-node communication and OpenMP for intra-node threading.
  • Vectorized Compute: SIMD-friendly loops utilizing AVX2/AVX-512 instructions.
  • Zero-Copy I/O: Memory-mapped file reading (mmap), so the graph is paged in on demand with no intermediate buffer copies (see the sketch after this list).
  • Automated Tooling: Complete suite for graph generation, benchmarking, and result plotting.
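As a concrete illustration of the zero-copy I/O idea, here is a minimal sketch using mmap. The binary file layout (a raw dump of the uint16_t weight matrix) and the name `map_graph` are assumptions for illustration, not the project's exact format.

```c
#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a binary graph file directly into memory: the kernel pages the data in
 * on demand, so no read()/memcpy pass over the whole file is needed. */
static const uint16_t *map_graph(const char *path, size_t *out_bytes) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return NULL; }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after closing the descriptor */
    if (base == MAP_FAILED) { perror("mmap"); return NULL; }

    *out_bytes = (size_t)st.st_size;
    return (const uint16_t *)base;  /* assumed layout: raw uint16_t weights */
}
```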

🛠 Architecture & Optimizations

This project moves beyond standard implementations by addressing hardware-level bottlenecks:

1. Memory Layout

  • Flattened 1D Arrays: Replaced uint16_t** with a single uint16_t* buffer aligned to 64-byte boundaries. This enables hardware prefetching and eliminates pointer-chasing overhead (see the sketch below).
  • Branchless Logic: Stores a full symmetric $N \times N$ matrix so hot loops can index (i, j) directly, removing the if (row < col) branch and the pipeline flushes it can cause.
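A minimal sketch of the flat, 64-byte-aligned layout and the branch-free symmetric indexing described above; allocation details and names are illustrative rather than taken from the source.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate the N x N weight matrix as one contiguous, 64-byte-aligned block.
 * A single base pointer plus index arithmetic replaces uint16_t** pointer chasing. */
static uint16_t *alloc_graph(size_t n) {
    size_t bytes = n * n * sizeof(uint16_t);
    bytes = (bytes + 63) & ~(size_t)63;     /* aligned_alloc needs a multiple of 64 */
    uint16_t *g = aligned_alloc(64, bytes);
    if (g) memset(g, 0xFF, bytes);          /* initialise every edge to MAX_VALUE */
    return g;
}

/* Symmetric storage: (i, j) and (j, i) hold the same weight, so hot loops can
 * read graph[i*n + j] directly with no `if (row < col)` branch. */
static inline uint16_t edge_weight(const uint16_t *g, size_t n, size_t i, size_t j) {
    return g[i * n + j];
}

static inline void set_edge(uint16_t *g, size_t n, size_t i, size_t j, uint16_t w) {
    g[i * n + j] = w;
    g[j * n + i] = w;   /* keep the matrix symmetric */
}
```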

2. Computation

  • SIMD Vectorization: Inner loops are refactored to be auto-vectorizable, letting the compiler process 16 (AVX2) or 32 (AVX-512) uint16_t edge weights per vector instruction.
  • Logical Edge Pruning: Edges connecting vertices within the same component are overwritten with MAX_VALUE, so subsequent iterations skip them without expensive union-find lookups.
  • Static Scheduling: OpenMP schedule(static) assigns fixed chunks of rows to threads, eliminating the lock contention of dynamic scheduling (a combined sketch follows this list).
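Putting these three points together, a minimal sketch of the per-row minimum-edge scan might look like the following (assuming the flat uint16_t matrix from the previous section; function and array names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>
#include <omp.h>

#define MAX_VALUE UINT16_MAX   /* pruned / non-existent edge */

/* For each row owned by this rank, find the lightest outgoing edge.
 * schedule(static) gives each thread a fixed, contiguous block of rows, and the
 * inner loop is a plain scan over contiguous memory, which is a good candidate
 * for auto-vectorization with AVX2/AVX-512. */
void find_min_edges(const uint16_t *graph, size_t n,
                    size_t row_begin, size_t row_end,
                    uint16_t *min_w, uint32_t *min_to) {
    #pragma omp parallel for schedule(static)
    for (size_t i = row_begin; i < row_end; i++) {
        const uint16_t *row = &graph[i * n];
        uint16_t best   = MAX_VALUE;
        uint32_t best_j = 0;
        for (size_t j = 0; j < n; j++) {
            /* intra-component edges were already overwritten with MAX_VALUE,
               so no union-find lookup is needed here */
            if (row[j] < best) { best = row[j]; best_j = (uint32_t)j; }
        }
        min_w[i]  = best;
        min_to[i] = best_j;
    }
}
```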

3. Communication

  • Bulk Broadcasting: Replaced row-by-row broadcasting with a single bulk MPI_Bcast, saturating the interconnect bandwidth.
  • Tree Flattening: Explicit path compression before parallel regions ensures $O(1)$ component lookups inside them (see the sketch below).
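A minimal sketch of both communication-side ideas, assuming the flattened n*n uint16_t matrix and an illustrative parent[] union-find array (note that MPI_Bcast's int count limits a single call to about 2^31 elements, which is not handled here):

```c
#include <stdint.h>
#include <stddef.h>
#include <mpi.h>

/* One bulk broadcast of the whole flattened matrix instead of n row-by-row
 * broadcasts: a single large message saturates the interconnect and avoids
 * per-call latency. */
static void broadcast_graph(uint16_t *graph, size_t n, int root) {
    MPI_Bcast(graph, (int)(n * n), MPI_UINT16_T, root, MPI_COMM_WORLD);
}

/* Flatten the union-find forest before entering a parallel region, so every
 * later lookup is a single read: parent[v] already points at the component root. */
static void flatten_components(uint32_t *parent, size_t n) {
    for (size_t v = 0; v < n; v++) {
        uint32_t r = parent[v];
        while (parent[r] != r)   /* walk up to the root */
            r = parent[r];
        parent[v] = r;           /* point v directly at its root */
    }
}
```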

📂 Project Structure

├── src/
│   ├── new_main.c                  # Optimized MPI+OpenMP implementation
│   └── main.c                      # Baseline implementation
├── utils/
│   ├── plot_tools.py               # Visualization tools (example)
├── hpc_tools.sh                    # Unified script for cluster/remote usage
├── hpc_tools_local.sh              # Unified script for local usage
├── Makefile                        # Build automation
└── README.md

⚙️ Build & Run

Prerequisites

  • GCC (with OpenMP support)
  • MPICH or OpenMPI

1. Compile

# For cluster/remote usage:
./hpc_tools.sh compile

# For local usage:
./hpc_tools_local.sh compile

Compiler Flags Used: -O3 -fopenmp -march=native -ftree-vectorize -funroll-loops -std=gnu11

2. Generate Data

# To generate graphs (cluster/remote):
./hpc_tools.sh generate-graphs

# To generate graphs (local):
./hpc_tools_local.sh generate-graphs

3. Run Benchmarks & Comparisons

# To benchmark (cluster/remote):
./hpc_tools.sh benchmark

# To benchmark (local):
./hpc_tools_local.sh benchmark

# To compare implementations (cluster/remote):
./hpc_tools.sh compare

# To compare implementations (local):
./hpc_tools_local.sh compare

☁️ Cluster Deployment

Designed for PBS/Torque-based HPC clusters.

| Command | Description |
| --- | --- |
| `./hpc_tools.sh pbs-job serial` | Print a PBS job script for the serial (baseline) implementation |
| `./hpc_tools.sh pbs-job parallel` | Print a PBS job script for the parallel (MPI+OpenMP) implementation |

PBS job script generation example:

# Print a PBS script for a parallel job:
./hpc_tools.sh pbs-job parallel > submit_parallel.pbs

📊 Analysis & Benchmarking

Use the compare command of the unified scripts to run both implementations and collect their results. The resulting logs can then be analyzed or plotted with the tools in utils/ (e.g. plot_tools.py).


🛣 Future Roadmap

  • Sparse Matrix Support: Switch to CSR (Compressed Sparse Row) format to support graphs with N > 100,000 vertices (see the sketch after this list).
  • Asynchronous MPI: Implement MPI_Iallreduce to overlap communication with computation.
  • Hierarchical Merging: Implement a divide-and-conquer strategy for massive scale (>1M nodes).
  • GPU Offloading: Port local edge scanning to CUDA/OpenACC.
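For reference, the CSR layout mentioned in the first roadmap item could look roughly like this; it is purely an illustrative sketch, not code from the repository.

```c
#include <stdint.h>

/* Compressed Sparse Row graph: only existing edges are stored, so memory
 * grows with the number of edges instead of N*N. */
typedef struct {
    uint32_t  n;          /* number of vertices */
    uint64_t  m;          /* number of stored (directed) edges */
    uint64_t *row_ptr;    /* n+1 entries: edges of vertex v live in
                             col_idx[row_ptr[v] .. row_ptr[v+1]-1] */
    uint32_t *col_idx;    /* m entries: destination vertex of each edge */
    uint16_t *weight;     /* m entries: edge weights */
} csr_graph;
```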
