
Performance Tuning

github-actions[bot] edited this page Jan 28, 2026 · 1 revision

Performance Tuning Guide

This guide covers performance optimization techniques for Elio applications.

Overview

Elio is designed for high performance through:

  • Lock-free data structures (Chase-Lev deque)
  • Work-stealing scheduler
  • Efficient I/O backends (io_uring, epoll)
  • Custom coroutine frame allocator
  • Minimal synchronization overhead

Scheduler Tuning

Thread Count

#include <thread>
#include <elio/runtime/scheduler.hpp>

// The declarations below are alternatives; use exactly one of them.

// Default: matches hardware concurrency
scheduler sched;

// Custom thread count
scheduler sched(8);  // 8 worker threads

// For I/O-bound workloads, consider more threads than cores
scheduler sched(std::thread::hardware_concurrency() * 2);

// For CPU-bound workloads, match core count
scheduler sched(std::thread::hardware_concurrency());
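
One detail the snippets above gloss over: `std::thread::hardware_concurrency()` is allowed to return 0 when the core count cannot be determined. A minimal sketch of a guard, assuming the 1x and 2x multipliers suggested above; the names `workload` and `worker_count` are hypothetical and not part of Elio:

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper (not an Elio API): choose a worker count from the
// detected core count, falling back to one core when detection fails.
enum class workload { cpu_bound, io_bound };

inline unsigned worker_count(workload kind, unsigned detected_cores) {
    // hardware_concurrency() may report 0 if the value is unknown.
    const unsigned cores = std::max(1u, detected_cores);
    // 2x cores for I/O-bound work, 1x for CPU-bound work.
    return kind == workload::io_bound ? cores * 2 : cores;
}

inline unsigned worker_count(workload kind) {
    return worker_count(kind, std::thread::hardware_concurrency());
}
```

The result could then be passed to the scheduler constructor shown above, for example `scheduler sched(worker_count(workload::io_bound));`.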

Dynamic Thread Adjustment

// Enable dynamic thread pool sizing
sched.set_min_threads(2);
sched.set_max_threads(16);

// Scheduler will add threads under load and remove idle threads

Thread Affinity

Pin coroutines to specific workers for cache locality:

#include <elio/coro/affinity.hpp>

coro::task<void> cache_sensitive_work() {
    // Pin to current worker
    co_await coro::pin_to_current_worker();

    // All subsequent work stays on this worker
    process_data();
}

// Or set affinity explicitly
coro::set_affinity(handle, worker_id);

I/O Backend Selection

io_uring vs epoll

Elio auto-detects the best available backend:

#include <iostream>
#include <elio/io/io_context.hpp>

// The constructions below are alternatives; use exactly one of them.

// Auto-detect (prefers io_uring)
io::io_context ctx;

// Force specific backend
io::io_context ctx(io::io_context::backend_type::io_uring);
io::io_context ctx(io::io_context::backend_type::epoll);

// Check active backend
std::cout << "Backend: " << ctx.get_backend_name() << std::endl;

io_uring advantages:

  • Batched syscalls (fewer context switches)
  • Kernel-side I/O completion
  • Better for high-throughput scenarios

epoll fallback:

  • Works on older kernels (pre-5.1)
  • Lower memory overhead
  • Adequate for moderate workloads

io_uring Kernel Requirements

For best io_uring performance:

  • Linux 5.1+: Basic io_uring
  • Linux 5.6+: Full features
  • Linux 5.19+: Multi-shot accept
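
To see which of these tiers a host falls into, the running kernel version can be read with `uname(2)`. The following is a sketch using only POSIX calls; the type `kernel_version` and the helper names are hypothetical. A version check is only a hint, since distributions backport features and io_uring can be disabled by policy, so a runtime probe (as the auto-detection above presumably does) remains the more reliable test.

```cpp
#include <cstdio>
#include <sys/utsname.h>

// Hypothetical helper type: major.minor of a Linux kernel release.
struct kernel_version {
    int major = 0;
    int minor = 0;
};

// Parse strings such as "5.15.0-91-generic"; returns false if malformed.
inline bool parse_kernel_release(const char* release, kernel_version& out) {
    return std::sscanf(release, "%d.%d", &out.major, &out.minor) == 2;
}

// True when v is the given version or newer.
inline bool at_least(kernel_version v, int major, int minor) {
    return v.major > major || (v.major == major && v.minor >= minor);
}

// Query the running kernel via uname(2).
inline bool running_kernel(kernel_version& out) {
    struct utsname info {};
    return uname(&info) == 0 && parse_kernel_release(info.release, out);
}
```

For example, `running_kernel(v) && at_least(v, 5, 1)` indicates that basic io_uring support is at least plausible on this host.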

Memory Management

Coroutine Frame Allocator

Elio uses a thread-local pool allocator for coroutine frames:

// Configured in frame_allocator.hpp
static constexpr size_t MAX_FRAME_SIZE = 256;  // Max pooled size
static constexpr size_t POOL_SIZE = 1024;       // Pool capacity

// Statistics (if enabled)
auto stats = coro::frame_allocator::get_stats();
std::cout << "Allocations: " << stats.allocations << std::endl;
std::cout << "Pool hits: " << stats.pool_hits << std::endl;
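
To make the two constants and the hit counter concrete, here is a minimal free-list pool written in standard C++. It is an illustrative sketch, not Elio's allocator: the class `frame_pool` and its members are hypothetical, and the limits simply mirror the values quoted above. One instance per thread (for example declared `thread_local`) needs no locking.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Illustrative sketch only; Elio's real frame allocator may differ.
class frame_pool {
public:
    static constexpr std::size_t max_frame_size = 256;  // mirrors MAX_FRAME_SIZE
    static constexpr std::size_t pool_capacity = 1024;   // mirrors POOL_SIZE

    frame_pool() { free_.reserve(pool_capacity); }
    frame_pool(const frame_pool&) = delete;
    frame_pool& operator=(const frame_pool&) = delete;
    ~frame_pool() {
        for (void* block : free_) ::operator delete(block);
    }

    void* allocate(std::size_t size) {
        ++allocations_;
        if (size > max_frame_size) return ::operator new(size);  // bypasses the pool
        if (!free_.empty()) {
            ++pool_hits_;  // recycled block, no call into the heap
            void* block = free_.back();
            free_.pop_back();
            return block;
        }
        // Round small requests up so the block can be recycled later.
        return ::operator new(max_frame_size);
    }

    void deallocate(void* block, std::size_t size) noexcept {
        if (size <= max_frame_size && free_.size() < pool_capacity) {
            free_.push_back(block);  // capacity was reserved, so no reallocation
        } else {
            ::operator delete(block);
        }
    }

    std::size_t allocations() const { return allocations_; }
    std::size_t pool_hits() const { return pool_hits_; }
    double hit_rate() const {
        return allocations_ == 0
                   ? 0.0
                   : static_cast<double>(pool_hits_) / static_cast<double>(allocations_);
    }

private:
    std::vector<void*> free_;
    std::size_t allocations_ = 0;
    std::size_t pool_hits_ = 0;
};
```

A low hit rate in a steady-state workload suggests that frames exceed the pooled size or that more frames are live at once than the pool can hold.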

Avoiding Allocations

Keep coroutine frames small enough to be served from the pool:

// Bad: Large array in coroutine frame (can't use pool)
coro::task<void> large_frame() {
    char buffer[8192];  // Too large for pool
    co_await read_data(buffer);
}

// Good: Allocate separately
coro::task<void> small_frame() {
    auto buffer = std::make_unique<char[]>(8192);
    co_await read_data(buffer.get());
}
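
A local variable that lives across a `co_await` is stored in the coroutine frame, so its size is a lower bound on the frame size. The sketch below turns that into a rough compile-time check, assuming the 256-byte limit quoted earlier; `pooled_limit` and `may_fit_in_pool` are hypothetical names. Passing the check is necessary but not sufficient, because the frame also holds the promise object, parameters, and compiler bookkeeping.

```cpp
#include <cstddef>
#include <memory>

// Assumed to mirror MAX_FRAME_SIZE; adjust if the real limit differs.
constexpr std::size_t pooled_limit = 256;

// True when a local of this type could, on its own, fit in a pooled frame.
template <typename Local>
constexpr bool may_fit_in_pool() {
    return sizeof(Local) <= pooled_limit;
}

// The inline buffer alone already exceeds the limit...
static_assert(!may_fit_in_pool<char[8192]>(), "inline buffer is too large");
// ...while an owning pointer to a heap buffer is only pointer-sized.
static_assert(may_fit_in_pool<std::unique_ptr<char[]>>(), "owning pointer is small");
```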

Synchronization Primitives

Mutex Performance

Elio's mutex takes an atomic fast path when the lock is uncontended:

#include <elio/sync/primitives.hpp>

sync::mutex mtx;

// Fast path: atomic CAS (~10ns)
// Slow path: suspend and queue (~100ns + context switch)

coro::task<void> critical_section() {
    co_await mtx.lock();
    // ... critical section ...
    mtx.unlock();
}

// Use try_lock to avoid blocking
if (mtx.try_lock()) {
    // Got lock immediately
    mtx.unlock();
} else {
    // Skip or retry later
}
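
The fast path mentioned above amounts to a single compare-and-swap. A minimal sketch of that idea in standard C++ follows; `fast_path_lock` is a hypothetical class, not Elio's mutex, and it omits the slow path entirely (a real coroutine mutex would suspend the caller and queue it instead of returning false).

```cpp
#include <atomic>

// Sketch of an uncontended fast path; not Elio's implementation.
class fast_path_lock {
public:
    // One compare-and-swap: succeeds only if the lock was free.
    bool try_lock() noexcept {
        bool expected = false;
        return locked_.compare_exchange_strong(expected, true,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed);
    }

    // Release ordering publishes the critical section's writes to the next owner.
    void unlock() noexcept { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};
```

When the lock is free, acquiring it costs one atomic instruction and no system call, which is why uncontended locking is cheap.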

Reader-Writer Lock

For read-heavy workloads:

sync::shared_mutex rw_mtx;

// Multiple concurrent readers (atomic counter, no blocking)
coro::task<void> reader() {
    co_await rw_mtx.lock_shared();
    auto data = read_data();
    rw_mtx.unlock_shared();
}

// Exclusive writers
coro::task<void> writer() {
    co_await rw_mtx.lock();
    write_data();
    rw_mtx.unlock();
}

Channel Selection

Choose the channel type that matches your workload:

// Bounded channel: back-pressure, bounded memory
sync::channel<int> ch(100);

// Unbounded channel: faster but can grow indefinitely
sync::unbounded_channel<int> uch;

// SPSC queue: single producer/consumer (fastest)
runtime::spsc_queue<int> spsc(1000);
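
A single-producer/single-consumer queue can be fast because each index is written by exactly one thread, so no lock or compare-and-swap loop is needed. The ring buffer below is a sketch of that technique in standard C++, not the implementation behind `runtime::spsc_queue`; the class name `spsc_ring` is hypothetical and `T` must be default-constructible. A full queue makes `try_push` fail, which is the same back-pressure a bounded channel provides.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of a bounded SPSC ring buffer; not Elio's implementation.
template <typename T>
class spsc_ring {
public:
    // One extra slot stays empty so "full" and "empty" are distinguishable.
    explicit spsc_ring(std::size_t capacity) : slots_(capacity + 1) {
        assert(capacity > 0);
    }

    // Call from the producer thread only. Returns false when the queue is full.
    bool try_push(T value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = advance(tail);
        if (next == head_.load(std::memory_order_acquire)) return false;
        slots_[tail] = std::move(value);
        tail_.store(next, std::memory_order_release);  // publish the element
        return true;
    }

    // Call from the consumer thread only. Returns false when the queue is empty.
    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        out = std::move(slots_[head]);
        head_.store(advance(head), std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::size_t advance(std::size_t index) const { return (index + 1) % slots_.size(); }

    std::vector<T> slots_;
    std::atomic<std::size_t> head_{0};  // written by the consumer only
    std::atomic<std::size_t> tail_{0};  // written by the producer only
};
```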

Network Performance

Connection Pooling

The HTTP client pools connections by default:

http::client_config config;
config.max_connections_per_host = 10;  // Pool size per host
config.pool_idle_timeout = std::chrono::seconds(60);

http::client client(ctx, config);

Buffer Sizes

Tune read buffer sizes for your workload:

http::client_config config;
config.read_buffer_size = 16384;  // 16 KiB (default: 8 KiB)

// For large payloads
config.read_buffer_size = 65536;  // 64 KiB

TCP Settings

Configure TCP options for performance:

// Enable TCP_NODELAY for latency-sensitive applications
net::tcp_stream stream = /* ... */;
stream.set_nodelay(true);

// Adjust send/receive buffers
stream.set_send_buffer_size(65536);
stream.set_recv_buffer_size(65536);
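
Setters like these typically map onto `setsockopt(2)`. The sketch below shows the equivalent POSIX calls on a raw file descriptor, assuming (not confirmed by this guide) that this is what `net::tcp_stream` does underneath; the free functions are hypothetical helpers. Note that buffer sizes are requests: on Linux the kernel doubles the value for bookkeeping and caps it at a system limit, so reading the option back is the only way to know the effective size.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Enable or disable Nagle's algorithm on a TCP socket. Returns true on success.
inline bool set_nodelay(int fd, bool enabled) {
    const int flag = enabled ? 1 : 0;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) == 0;
}

// Read back the TCP_NODELAY state. Returns false if the query fails.
inline bool get_nodelay(int fd, bool& enabled) {
    int flag = 0;
    socklen_t len = sizeof(flag);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, &len) != 0) return false;
    enabled = flag != 0;
    return true;
}

// Request a send buffer size; the kernel may adjust the value it applies.
inline bool set_send_buffer(int fd, int bytes) {
    return setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) == 0;
}

// Effective send buffer size in bytes, or -1 if the query fails.
inline int get_send_buffer(int fd) {
    int bytes = 0;
    socklen_t len = sizeof(bytes);
    return getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, &len) == 0 ? bytes : -1;
}
```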

Profiling and Monitoring

Scheduler Statistics

// Get scheduler metrics
auto stats = sched.get_stats();
std::cout << "Tasks spawned: " << stats.tasks_spawned << std::endl;
std::cout << "Tasks completed: " << stats.tasks_completed << std::endl;
std::cout << "Steals: " << stats.work_steals << std::endl;
std::cout << "Average queue depth: " << stats.avg_queue_depth << std::endl;
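
Raw counters are easier to interpret as ratios. The sketch below derives two such figures from a snapshot; the struct `scheduler_stats_view` is a hypothetical stand-in that copies three of the fields printed above, and their integer types are an assumption.

```cpp
#include <cstdint>

// Hypothetical snapshot of the counters shown above; field types are assumed.
struct scheduler_stats_view {
    std::uint64_t tasks_spawned = 0;
    std::uint64_t tasks_completed = 0;
    std::uint64_t work_steals = 0;
};

// Tasks that have been spawned but have not finished yet.
inline std::uint64_t tasks_in_flight(const scheduler_stats_view& s) {
    return s.tasks_spawned >= s.tasks_completed
               ? s.tasks_spawned - s.tasks_completed
               : 0;
}

// Average number of steals per completed task.
inline double steals_per_task(const scheduler_stats_view& s) {
    return s.tasks_completed == 0
               ? 0.0
               : static_cast<double>(s.work_steals) /
                     static_cast<double>(s.tasks_completed);
}
```

A steadily growing in-flight count points to producers outpacing workers, while a high steal ratio may indicate that tasks are being spawned unevenly across workers.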

Logging Overhead

Debug logging adds overhead on hot paths; disable it in production builds:

// Set at compile time
// cmake -DELIO_DEBUG=OFF ..

// Or at runtime
elio::log::set_level(elio::log::level::warning);
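
Much of the cost of a disabled log statement comes from building the message, not from writing it. The sketch below shows one common way to avoid that cost: check the level first and only then evaluate the message expression. It is a generic illustration with hypothetical names (`log_level`, `SKETCH_LOG`), not a description of how `elio::log` is implemented.

```cpp
#include <atomic>
#include <iostream>

// Hypothetical levels, ordered from most to least verbose.
enum class log_level : int { debug = 0, info = 1, warning = 2, error = 3 };

// Process-wide threshold; messages below it are skipped.
inline std::atomic<log_level>& current_level() {
    static std::atomic<log_level> level{log_level::warning};
    return level;
}

inline bool enabled(log_level lvl) {
    return static_cast<int>(lvl) >=
           static_cast<int>(current_level().load(std::memory_order_relaxed));
}

// The message expression is evaluated only when the level is enabled, so a
// disabled statement costs one relaxed atomic load and a branch.
#define SKETCH_LOG(lvl, expr)                           \
    do {                                                \
        if (enabled(lvl)) { std::clog << expr << '\n'; } \
    } while (0)
```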

Coroutine Stack Tracing

Use the virtual stack to debug without significant overhead:

// Enable in debug builds only
#ifdef ELIO_DEBUG
    auto* frame = coro::current_frame();
    print_stack_trace(frame);
#endif

Benchmarking Tips

Warm-up

// Warm up allocators and caches
for (int i = 0; i < 1000; i++) {
    warmup_task().go();
}
sched.sync();

// Now measure
auto start = std::chrono::steady_clock::now();
// ... actual benchmark ...
auto end = std::chrono::steady_clock::now();

Avoid Measurement Overhead

// Bad: timing inside hot loop
for (int i = 0; i < 1000000; i++) {
    auto start = now();  // Overhead!
    do_work();
    auto end = now();
    record(end - start);
}

// Good: time the whole batch
auto start = now();
for (int i = 0; i < 1000000; i++) {
    do_work();
}
auto end = now();
auto avg = (end - start) / 1000000;
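
The warm-up and batch-timing advice can be combined into one helper. This is a sketch in standard C++; `average_duration` is a hypothetical name, and the caller is responsible for making sure the compiler cannot optimize the measured work away (for example by writing results to a `volatile` variable).

```cpp
#include <chrono>
#include <cstddef>

// Runs `work` `warmup` times untimed, then times `iterations` calls as one
// batch and returns the average duration per call.
template <typename F>
std::chrono::nanoseconds average_duration(F&& work, std::size_t warmup,
                                          std::size_t iterations) {
    for (std::size_t i = 0; i < warmup; ++i) work();  // warm caches and allocators

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) work();
    const auto end = std::chrono::steady_clock::now();

    if (iterations == 0) return std::chrono::nanoseconds{0};
    const auto total =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    return total / static_cast<std::chrono::nanoseconds::rep>(iterations);
}
```

The clock is read twice per batch instead of twice per iteration, so the timer's own cost is amortized over all iterations.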

Use Release Builds

Always benchmark with optimizations:

cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .

Common Performance Issues

Problem: High Latency Spikes

Causes:

  • Work stealing delays
  • CPU contention from other processes (for example, GC pauses in co-located services)
  • Kernel scheduling

Solutions:

  • Pin critical tasks to workers
  • Use CPU affinity for scheduler threads
  • Consider real-time scheduling
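
On Linux, CPU affinity for a thread can be set with `sched_setaffinity(2)`. The sketch below restricts the calling thread to a single CPU chosen from the set it is already allowed to run on; the function name is hypothetical, and whether Elio exposes a hook to run such code on its worker threads is not covered by this guide.

```cpp
#include <sched.h>

// Linux-specific sketch. Restricts the calling thread to one CPU taken from
// its currently allowed set. Returns the chosen CPU index, or -1 on failure.
inline int pin_current_thread_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // A pid of 0 refers to the calling thread.
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed)) continue;
        cpu_set_t target;
        CPU_ZERO(&target);
        CPU_SET(cpu, &target);
        return sched_setaffinity(0, sizeof(target), &target) == 0 ? cpu : -1;
    }
    return -1;  // no CPU was available in the allowed set
}
```

Pinning trades flexibility for predictability: the thread keeps its cache contents but can no longer migrate away from a busy core, so it is best reserved for latency-critical threads.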

Problem: Low Throughput

Causes:

  • Lock contention
  • Inefficient I/O batching
  • Small buffer sizes

Solutions:

  • Profile lock contention
  • Use io_uring for batching
  • Increase buffer sizes

Problem: High Memory Usage

Causes:

  • Unbounded channels
  • Large coroutine frames
  • Connection pool growth

Solutions:

  • Use bounded channels
  • Allocate large buffers separately
  • Limit connection pool size

Quick Reference

| Scenario | Recommendation |
|---|---|
| I/O-bound | 2x core count threads |
| CPU-bound | 1x core count threads |
| Latency-critical | Pin to workers, io_uring |
| Throughput-critical | Large buffers, batching |
| Memory-constrained | Bounded channels, small pools |
| Read-heavy sync | Use shared_mutex |
