A dual-mode llama.cpp integration project supporting both desktop (Electron) and distributed cluster (Inferno OS) deployments.
This project provides two ways to run llama.cpp inference:
- Desktop Mode: An Electron application with a Node.js native addon for local, single-user inference
- Distributed Mode: A pure Limbo implementation for Inferno OS, optimized for distributed cognition across thousands of tiny inference engines in load-balanced clusters
Desktop Mode:
- Load LLM models through a user-friendly interface
- Process text prompts asynchronously in a separate thread
- Built with Electron for cross-platform compatibility
- Direct integration with llama.cpp via a Node.js addon

Distributed Mode:
- Deploy thousands of modular isolates as Dis VM instances
- Load balancing across the cluster with multiple strategies
- Distributed cognition with collective inference capacity
- Auto-scaling based on load and resource availability
- Aggregate throughput of 10,000+ tokens/sec with 1,000+ nodes
- Limbot: AI chat assistant CLI with conversation history
- Dish Integration: Interactive distributed shell for cluster access
Choose your deployment mode:
- Desktop Mode Setup - For local single-user inference
- Distributed Mode Setup - For cluster deployment with thousands of nodes
For local, single-user inference with the Electron desktop app.

Prerequisites:
- Node.js (v16+)
- npm or yarn
- A C++ compiler (GCC, Clang, or MSVC)
- CMake (for building llama.cpp)
- Git
- Clone this repository:

      git clone https://github.com/aruntemme/llama.cpp-electron.git
      cd llama.cpp-electron

- Install dependencies:

      npm install

- Clone and build llama.cpp (required before building the Node.js addon):

      git clone https://github.com/ggerganov/llama.cpp.git
      cd llama.cpp
      mkdir build
      cd build
      cmake ..
      cmake --build . --config Release
      cd ../..

- Build the Node.js addon:

      npm run build

- Start the application:

      npm start
- Launch the application
- Click "Select Model" to choose a llama.cpp compatible model file (.bin or .gguf)
- Enter a prompt in the text area
- Click "Process Prompt" to analyze the text
- View the results in the results section
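Under the hood, "Process Prompt" goes from the renderer through the preload IPC bridge to the Electron main process, which calls into the native addon. The sketch below illustrates that flow; the channel names, the addon exports (loadModel, processPrompt), and the addon's build path are assumptions for illustration, not the project's exact API.

```js
// Sketch only: channel names, addon exports, and paths are assumptions.

// --- main.js (Electron main process) ---
const { ipcMain } = require('electron');
const llama = require('./addon/build/Release/llama_addon.node'); // hypothetical build output path

ipcMain.handle('load-model', (_event, modelPath) => {
  // Hand the selected .gguf/.bin file to the native addon
  return llama.loadModel(modelPath);
});

ipcMain.handle('process-prompt', (_event, prompt) => {
  // Inference runs off the UI thread, so the window stays responsive
  return llama.processPrompt(prompt);
});

// --- preload.js (IPC bridge) ---
const { contextBridge, ipcRenderer } = require('electron');

contextBridge.exposeInMainWorld('llamaAPI', {
  loadModel: (modelPath) => ipcRenderer.invoke('load-model', modelPath),
  processPrompt: (prompt) => ipcRenderer.invoke('process-prompt', prompt),
});

// --- renderer.js (frontend) ---
// const result = await window.llamaAPI.processPrompt(promptText);
```

Keeping the addon call in the main process and exposing only a narrow API through contextBridge keeps the renderer sandboxed.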
For distributed cluster deployment with thousands of tiny inference engines.

Prerequisites:
- Inferno OS installed (or the Inferno emulator)
- Limbo compiler
- llama.cpp compatible models
- Check Inferno installation:

      cd inferno
      ./deploy.sh check

- Compile Limbo modules:

      ./deploy.sh compile

- Deploy to cluster:

      ./deploy.sh deploy-local    # For local testing
      # or
      ./deploy.sh deploy-cluster  # For distributed cluster

- Initialize cluster:

      cd inferno
      ./llamboctl init

- Spawn inference nodes:

      ./llamboctl spawn --count 1000 --type tiny

- Check cluster status:

      ./llamboctl status

- Process inference requests:

      # Requests are automatically load-balanced across nodes
      ./deploy.sh test

- Monitor cluster:

      ./llamboctl health
      ./llamboctl metrics --export prometheus

- Use Limbot AI chat assistant:

      # Interactive chat mode
      ./llamboctl limbot

      # One-shot inference
      ./llamboctl limbot "What is distributed computing?"

- Use Dish distributed shell:

      # Launch interactive shell
      ./llamboctl dish
Edit inferno/cluster-config.yaml to configure:
- Node types (tiny: 128MB, medium: 1GB, large: 8GB)
- Node counts (100 to 10,000+)
- Load balancing strategy (round-robin, least-loaded, random)
- Auto-scaling parameters
- Network topology
See inferno/README.md for complete documentation.
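To make the load-balancing options above concrete, here is a small illustrative sketch of how a dispatcher could pick a node under each strategy. This is plain JavaScript for explanation only; the actual selection logic lives in the Limbo modules and is driven by cluster-config.yaml.

```js
// Illustrative only -- the real implementation is in the Limbo modules.
// A "node" here is just { id, activeRequests }.
function pickNode(nodes, strategy, state = { rrIndex: 0 }) {
  switch (strategy) {
    case 'round-robin': {
      // Cycle through nodes in order
      const node = nodes[state.rrIndex % nodes.length];
      state.rrIndex += 1;
      return node;
    }
    case 'least-loaded':
      // Choose the node currently handling the fewest requests
      return nodes.reduce((a, b) => (a.activeRequests <= b.activeRequests ? a : b));
    case 'random':
      // Uniform random choice
      return nodes[Math.floor(Math.random() * nodes.length)];
    default:
      throw new Error(`Unknown strategy: ${strategy}`);
  }
}

const nodes = [
  { id: 'tiny-0', activeRequests: 3 },
  { id: 'tiny-1', activeRequests: 1 },
  { id: 'tiny-2', activeRequests: 7 },
];
console.log(pickNode(nodes, 'least-loaded').id); // -> "tiny-1"
```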
You'll need to download LLM model files separately. Compatible models include:
- GGUF format models (recommended)
- Quantized models for better performance
- Other formats supported by llama.cpp
You can download models from Hugging Face or other repositories.
For Desktop Mode: Place models in a location accessible by the application.
For Distributed Mode: Place models in /models directory for cluster nodes to access.
See ARCHITECTURE.md for detailed documentation on both architectures.
Desktop Mode: Single Electron process with Node.js addon → llama.cpp C++ library
- Performance: ~10 tokens/sec
- Resource: Several GB of RAM required, depending on model size
- Use case: Single-user desktop application
Distributed Mode: Thousands of Inferno Dis VM instances with Limbo implementation
- Performance: ~10,000+ tokens/sec aggregate with 1,000 nodes (roughly the single-node rate multiplied across the cluster)
- Resource: 128MB per tiny node, scales horizontally
- Use case: Distributed cluster, massive parallel inference
Desktop Mode:
- Model loading errors: Ensure your model file is compatible with llama.cpp
- Addon building errors: Make sure llama.cpp is properly built before building the addon
- Performance issues: Large models may require more memory and processing power
- Cannot find llama.h: Make sure you've built llama.cpp using the steps above
- Loading model fails: Verify the model path is correct and the model is in a supported format
- Electron startup errors: Check the terminal output for detailed error messages
Distributed Mode:
- Compilation errors: Ensure the Inferno environment is properly configured
- Node spawn failures: Check resource limits (ulimit) and available ports
- Load balancing issues: Verify the cluster configuration in cluster-config.yaml
- Module loading errors: Ensure Limbo modules are compiled to .dis bytecode
See inferno/README.md for detailed troubleshooting.
llama.cpp-electron/
├── src/ # Desktop mode (Electron)
│ ├── addon/ # C++ Node.js addon
│ │ ├── llama_addon.cpp
│ │ └── binding.gyp
│ ├── main.js # Electron main process
│ ├── renderer.js # Frontend logic
│ ├── preload.js # IPC bridge
│ ├── index.html # UI
│ └── styles.css
├── inferno/ # Distributed mode (Inferno OS)
│ ├── llambo.m # Module definition
│ ├── llambo.b # Implementation
│ ├── llambotest.b # Test suite
│ ├── cluster-config.yaml # Cluster configuration
│ ├── deploy.sh # Deployment script
│ ├── llamboctl # Cluster control utility
│ └── README.md # Detailed documentation
├── llama.cpp/ # Submodule
├── ARCHITECTURE.md # Architecture documentation
├── README.md # This file
└── package.json
| Mode | Deployment | Throughput | Latency | Scalability |
|---|---|---|---|---|
| Desktop | Single machine | ~10 tok/s | 100ms | Limited by local resources |
| Distributed (100 nodes) | Cluster | ~1,000 tok/s | 50ms | Horizontal scaling |
| Distributed (1000 nodes) | Cluster | ~10,000 tok/s | 45ms | Thousands of nodes |
Desktop Mode:
- Personal AI assistant
- Local development and testing
- Single-user applications
- Privacy-focused deployments
Distributed Mode:
- Large-scale inference services
- Multi-tenant platforms
- Research clusters
- Edge computing networks
- Distributed AI systems
This project is licensed under the ISC License - see the LICENSE file for details.
- llama.cpp - Inference engine
- Electron - Desktop application framework
- Node.js - JavaScript runtime
- Inferno OS - Distributed operating system
- Limbo - Programming language for Inferno
- ARCHITECTURE.md - Detailed architecture for both modes
- inferno/README.md - Complete Inferno/Limbo documentation
- Desktop Mode: See the sections above
- Distributed Mode: See the inferno/ directory
Contributions are welcome! Areas of interest:
Desktop Mode:
- UI/UX improvements
- Additional llama.cpp features
- Performance optimizations
Distributed Mode:
- FFI bindings to llama.cpp C library
- Advanced load balancing algorithms
- Consensus and cognitive fusion strategies
- Monitoring and telemetry
- Production deployment tools