feat: [PLAN-6] Implement real LLM inference engine with llama-cpp-python #3

mananmalik1807 wants to merge 1 commit into dev
Conversation
Sprint 1 Deliverable: Real text rephrasing using Qwen2.5-0.5B-Instruct GGUF

Implementation:
- Created llm_engine.py with robust path resolution
- Updated service.py for hybrid LLM/mock mode
- Added personal notes pattern to .gitignore (*.md for developer notes)
Pull request overview
Implements a real local LLM rephrasing path using llama-cpp-python (Qwen2.5-0.5B-Instruct GGUF) and wires it into the backend’s existing rephrasing service with a hybrid “real LLM or mock” mode.
Changes:
- Added a llama-cpp-python-backed inference engine module with model path resolution and a rephrase() API.
- Updated the domain rephrasing service to optionally call the real LLM (feature-flagged) and fall back to the mock implementation on failure.
- Updated .gitignore to ignore an additional markdown file.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| backend/src/idio/infrastructure/ai/llm_engine.py | Adds GGUF model loading + prompt construction + inference wrapper for rephrasing. |
| backend/src/idio/domain/inference/service.py | Adds hybrid routing to real LLM inference behind an env flag with fallback to existing mock logic. |
| .gitignore | Adds an ignore entry for a developer notes file. |
| f"Model not found at {model_path}\n" | ||
| f"Expected location: /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | ||
| f"Current working directory: {Path.cwd()}\n" | ||
| f"Current file location: {Path(__file__)}\n" | ||
| f"\nTroubleshooting:\n" | ||
| f" 1. Verify model exists: ls /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | ||
| f" 2. Run from project root: cd /Users/mananmalik/Developer/idio-dev && python -m idio.infrastructure.ai.llm_engine\n" | ||
| f" 3. Set explicit path: export IDIO_MODEL_PATH=/path/to/model.gguf" |
The FileNotFoundError message hard-codes a developer-specific absolute path (/Users/...) and troubleshooting commands referencing that local directory. This can leak local environment details and will mislead other developers/CI; prefer only reporting the resolved/attempted paths and guiding users to set IDIO_MODEL_PATH or run the model download script.
| f"Model not found at {model_path}\n" | |
| f"Expected location: /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | |
| f"Current working directory: {Path.cwd()}\n" | |
| f"Current file location: {Path(__file__)}\n" | |
| f"\nTroubleshooting:\n" | |
| f" 1. Verify model exists: ls /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | |
| f" 2. Run from project root: cd /Users/mananmalik/Developer/idio-dev && python -m idio.infrastructure.ai.llm_engine\n" | |
| f" 3. Set explicit path: export IDIO_MODEL_PATH=/path/to/model.gguf" | |
| "LLM model file not found.\n" | |
| f"Attempted path: {model_path}\n" | |
| "Resolution strategies used:\n" | |
| " - IDIO_MODEL_PATH environment variable (if set)\n" | |
| " - Searching upwards from this file for models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | |
| " - Fallback to ../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf relative to the backend directory\n" | |
| f"Current working directory: {Path.cwd()}\n" | |
| f"Current file location: {Path(__file__)}\n" | |
| "\nTroubleshooting:\n" | |
| " 1. Ensure the model file qwen2.5-0.5b-instruct-q4_k_m.gguf is present under your project's models/base/ directory, or\n" | |
| " 2. Set an explicit absolute path: export IDIO_MODEL_PATH=/absolute/path/to/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | |
| " 3. Re-run your command after verifying the model path." |
    def get_model() -> Llama:
        """
        Lazy-load the GGUF model (Singleton pattern).
        Only loads once at first call.
        """
        global _model_instance
        if _model_instance is None:
            # Try multiple path resolution strategies
get_model() implements a singleton via a module-level global without any synchronization. In multi-threaded execution (e.g., multiple workers/threads), concurrent first requests can race and initialize multiple Llama instances or partially initialize _model_instance. Use a lock around initialization or initialize once during application startup.
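A minimal sketch of a lock-guarded lazy initializer, assuming the existing module-level `_model_instance` global; `_resolve_model_path()` is a hypothetical placeholder for the PR's path-resolution strategies, and the `Llama(...)` arguments are illustrative:

```python
import threading
from typing import Optional

from llama_cpp import Llama

_model_instance: Optional[Llama] = None
_model_lock = threading.Lock()


def get_model() -> Llama:
    """Lazily load the GGUF model exactly once, even under concurrent first calls."""
    global _model_instance
    if _model_instance is None:
        with _model_lock:
            # Double-checked locking: another thread may have finished loading
            # while this one was waiting on the lock.
            if _model_instance is None:
                model_path = _resolve_model_path()  # hypothetical stand-in for the PR's strategies
                _model_instance = Llama(model_path=str(model_path), n_ctx=2048, verbose=False)
    return _model_instance
```

Loading once at application startup avoids the lock entirely and surfaces missing-model errors before the first request.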
    # Sprint 1: Import actual llm_engine for real inference
    try:
        from idio.infrastructure.ai.llm_engine import rephrase as llm_rephrase
        LLM_AVAILABLE = True
This domain-layer service now directly imports the infrastructure llm_engine, which conflicts with the PR checklist/architecture note about using domain Contracts (interfaces) rather than direct infrastructure imports. Consider defining an inference interface in domain/inference/contracts.py and wiring the infrastructure implementation from the API/composition root to keep domain independent of infrastructure.
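A rough sketch of the contract-based wiring, under the assumption that a Protocol lives in domain/inference/contracts.py; module layout, class names, and the adapter name are illustrative:

```python
# domain/inference/contracts.py (proposed module; names are illustrative)
from typing import Protocol, Tuple


class RephraseEngine(Protocol):
    """Contract the domain depends on; infrastructure supplies the implementation."""

    def rephrase(self, text: str, mode: str) -> Tuple[str, float]: ...


# domain/inference/service.py then depends only on the protocol
class RephrasingService:
    def __init__(self, engine: RephraseEngine) -> None:
        self._engine = engine

    def get_rephrasing(self, text: str, mode: str) -> str:
        result, _latency = self._engine.rephrase(text, mode)
        return result


# The API/composition root wires in the llama-cpp-backed adapter, e.g.:
# service = RephrasingService(engine=LlamaCppRephraseEngine())  # hypothetical adapter class
```

This keeps the domain package importable and testable without llama-cpp-python installed.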
    # Normalize mode aliases
    mode_mapping = {
        "casual": "friendly",
        "academic": "professional",
        "creative": "friendly"
    }
    normalized_mode = mode_mapping.get(mode, mode)

    # Call real LLM inference
    result, _ = llm_rephrase(text, normalized_mode)
    return result
Mode normalization for the real LLM path does not handle common input variations (case/whitespace), and any unrecognized mode will be passed through to llm_rephrase even though llm_engine.rephrase only defines professional/friendly/mirror. Normalize with mode.strip().lower() before mapping, and ensure the final mode is one of the supported values to avoid unintended prompt selection.
Suggested change:

    # Normalize and sanitize mode, then map aliases
    raw_mode = (mode or "").strip().lower()
    mode_mapping = {
        "casual": "friendly",
        "academic": "professional",
        "creative": "friendly"
    }
    normalized_mode = mode_mapping.get(raw_mode, raw_mode)
    allowed_modes = {"professional", "friendly", "mirror"}
    # Only call real LLM if mode is supported
    if normalized_mode in allowed_modes:
        result, _ = llm_rephrase(text, normalized_mode)
        return result
    if USE_REAL_LLM:
        try:
            # Normalize mode aliases
            mode_mapping = {
                "casual": "friendly",
                "academic": "professional",
                "creative": "friendly"
            }
            normalized_mode = mode_mapping.get(mode, mode)

            # Call real LLM inference
            result, _ = llm_rephrase(text, normalized_mode)
            return result
Real LLM inference is a CPU-bound synchronous call executed inside get_rephrasing(); the /rephrase route is declared async and calls this function directly, so enabling USE_REAL_LLM will block the event loop for the duration of inference and reduce API concurrency. Consider running inference in a threadpool/background worker (or making the route sync) to avoid event-loop starvation.
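For example, assuming a FastAPI/Starlette-style async route (the route code is not part of this diff), the blocking call could be moved off the event loop along these lines:

```python
import asyncio


async def rephrase_endpoint(text: str, mode: str) -> dict:
    # Push the CPU-bound llama.cpp call onto a worker thread so the event loop
    # keeps serving other requests while inference runs.
    result = await asyncio.to_thread(get_rephrasing, text, mode)  # existing service function
    return {"rephrased": result}
```

Starlette's run_in_threadpool would work equally well if the project prefers framework helpers over asyncio directly.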
    # Strategy 3: Fallback to relative path from backend/
    if model_path is None:
        model_path = Path("../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf").resolve()
The fallback Path("../models/base/... ").resolve() is resolved relative to the current working directory, not the package/file location, so it can point to the wrong place depending on how the app is started. Prefer constructing the fallback relative to __file__/the discovered project root, or rely solely on the upward-search + IDIO_MODEL_PATH.
Suggested change:

    # Strategy 3: Fallback to relative path from backend/, anchored to this file
    if model_path is None:
        model_path = (current_file.parent / "../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf").resolve()
    except Exception as e:
        print(f"[service] LLM inference failed: {str(e)}, falling back to mock")
        # Fall through to original mock implementation below
If model loading/inference fails, this will print an error and retry on every request, which can spam logs and add repeated overhead. Consider a one-time disable/circuit-breaker (e.g., flip USE_REAL_LLM/a module flag after first failure) or perform a startup-time health check so failures are surfaced once.
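One lightweight option, sketched under the assumption that USE_REAL_LLM stays a module-level flag in service.py and that llm_rephrase and a mock path already exist there (the mock function name below is assumed):

```python
import logging

logger = logging.getLogger(__name__)
USE_REAL_LLM = True  # module-level flag, as in the PR


def get_rephrasing(text: str, mode: str) -> str:
    global USE_REAL_LLM
    if USE_REAL_LLM:
        try:
            result, _ = llm_rephrase(text, mode)  # engine call imported in the PR
            return result
        except Exception:
            # Trip the breaker: disable the real-LLM path after the first failure so
            # later requests go straight to the mock without repeating this log line.
            USE_REAL_LLM = False
            logger.exception("LLM inference failed; using mock for the rest of this process")
    return mock_rephrase(text, mode)  # existing mock implementation (name assumed)
```

A startup-time health check (loading the model once at boot) would surface the same failure even earlier.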
    # Normalize mode aliases
    mode_mapping = {
        "casual": "friendly",
        "academic": "professional",
        "creative": "friendly"
    }
    normalized_mode = mode_mapping.get(mode, mode)
mode_mapping lookups are case-sensitive; if callers send "Professional"/"FRIENDLY" the mapping won’t apply and llm_rephrase will receive an unexpected mode value. Normalize mode (e.g., lowercasing/stripping) before mapping so behavior is consistent with API input.
    def rephrase(
        text: str,
        mode: Literal["professional", "friendly", "mirror"]
    ) -> Tuple[str, float]:
rephrase() annotates mode as Literal["professional", "friendly", "mirror"], but the implementation has an else branch and callers (e.g., domain service) can pass arbitrary strings. This will cause mypy/type-checking friction; either validate and raise on invalid modes, or widen the type to str (or include allowed aliases) to match runtime behavior.
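If the validation route is preferred over widening the type, a sketch along these lines keeps the annotation honest (not the PR's code; the `...` stands for the existing prompt construction and inference):

```python
from typing import Literal, Tuple, get_args

RephraseMode = Literal["professional", "friendly", "mirror"]
_ALLOWED_MODES = frozenset(get_args(RephraseMode))


def rephrase(text: str, mode: RephraseMode) -> Tuple[str, float]:
    # Fail fast on anything the Literal does not allow, so the annotation
    # and the runtime behavior stay in sync.
    if mode not in _ALLOWED_MODES:
        raise ValueError(f"Unsupported mode {mode!r}; expected one of {sorted(_ALLOWED_MODES)}")
    ...  # prompt construction and inference as in the PR
```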
    import os
    from llama_cpp import Llama
    import time
    from pathlib import Path
    from typing import Literal, Tuple
llm_engine.py imports llama_cpp, but the backend dependency list (pyproject.toml) does not include llama-cpp-python. As-is, installs will succeed but real inference will always fall back (or crash if this module is imported elsewhere). Add the package as a dependency (or an optional extra) and document how to enable it.
This seems like a legit issue. After completing the actions Copilot suggested above, please make sure we use the [cpu]/[cu121] extra tags to support different hardware.
    LLM_AVAILABLE = False

    # Feature flag: Set to False to force mock implementation
    USE_REAL_LLM = os.getenv("USE_REAL_LLM", "true").lower() == "true" and LLM_AVAILABLE
USE_REAL_LLM defaults to "true", which can unexpectedly enable heavy CPU-bound inference in environments that didn’t intend to run the real model (and may not have the model file). Consider defaulting this flag to false and requiring explicit opt-in via env var, especially for CI/dev setups.
Suggested change:

    USE_REAL_LLM = os.getenv("USE_REAL_LLM", "false").lower() == "true" and LLM_AVAILABLE
@mananmalik1807 Thanks for following a proper PR description.
    system_prompt = "Rewrite the following text clearly."

    # Format prompt for Qwen (uses ChatML format)
    prompt = f"""<|im_start|>system
We need to add a basic check here to ensure the user input doesn't contain the <|im_end|> token, which could be used for prompt injection. It's a low risk for a local HUD, but good 'Privacy-First' practice.
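A minimal guard could strip ChatML control tokens before the prompt is assembled; the token list below is an assumption based on Qwen's ChatML template, not taken from the PR:

```python
CHATML_CONTROL_TOKENS = ("<|im_start|>", "<|im_end|>")  # assumed sufficient for this model


def sanitize_user_text(text: str) -> str:
    """Strip ChatML control tokens so user input cannot break out of its message block."""
    for token in CHATML_CONTROL_TOKENS:
        text = text.replace(token, "")
    return text.strip()
```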
Please remove all the emojis you have added in the print statements (it shows very evidently that this has been vibe-coded). Using print statements is also generally not advisable; use the standard Python logging module.
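A sketch of the logging approach, assuming a module-level logger in llm_engine.py (the helper name is illustrative):

```python
import logging

logger = logging.getLogger(__name__)


def report_fallback(error: Exception) -> None:
    # Replaces the emoji-laden print(): level-appropriate, structured, and configurable.
    logger.warning("LLM inference failed (%s); falling back to mock", error)
```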
    if USE_REAL_LLM:
        try:
            # Normalize mode aliases
            mode_mapping = {
I think the mode mapping should be moved to a constants file so it is easily configurable; it is making the service logic too complicated.
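Something like this, with the module path being a suggestion rather than an existing file:

```python
# domain/inference/constants.py (proposed location)
MODE_ALIASES = {
    "casual": "friendly",
    "academic": "professional",
    "creative": "friendly",
}
SUPPORTED_MODES = frozenset({"professional", "friendly", "mirror"})


def normalize_mode(mode: str) -> str:
    """Lowercase, strip, and map aliases onto the modes the engine supports."""
    raw = (mode or "").strip().lower()
    return MODE_ALIASES.get(raw, raw)
```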
    # Strategy 3: Fallback to relative path from backend/
    if model_path is None:
        model_path = Path("../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf").resolve()
I think a 30-line loop to find a model path is overkill here and is also very difficult to maintain. MODEL_PATH can simply be defined in the .env file, with a clean error thrown if it is not present.
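The env-only approach could be as small as this; the variable name reuses the PR's IDIO_MODEL_PATH, and the error messages are illustrative:

```python
import os
from pathlib import Path


def get_model_path() -> Path:
    """Resolve the GGUF model path from the environment and fail fast with a clear message."""
    raw = os.getenv("IDIO_MODEL_PATH")
    if not raw:
        raise RuntimeError("IDIO_MODEL_PATH is not set; add it to your .env or shell environment")
    path = Path(raw).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"IDIO_MODEL_PATH points to a missing file: {path}")
    return path
```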
        return _model_instance


    def rephrase(
This rephrase function is trying to do too much. It handles prompt construction, model execution, latency timing, and error handling in one giant block. Please refactor and keep the code minimal, following the Single Responsibility Principle (ref: Clean Code).
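A possible decomposition, keeping the (text, mode) -> (result, latency) signature but splitting responsibilities; the system prompts, sampling parameters, and function names below are illustrative, not the PR's code:

```python
import time
from typing import Tuple

_SYSTEM_PROMPTS = {  # illustrative; the real prompts live in the engine module
    "professional": "Rewrite the following text in a professional tone.",
    "friendly": "Rewrite the following text in a friendly tone.",
    "mirror": "Rewrite the following text clearly, keeping its original tone.",
}


def build_prompt(text: str, mode: str) -> str:
    """Prompt construction only: pick a system prompt and format the ChatML template."""
    system_prompt = _SYSTEM_PROMPTS.get(mode, "Rewrite the following text clearly.")
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def run_inference(prompt: str) -> str:
    """Model execution only: call the loaded llama.cpp model and return its text."""
    model = get_model()  # existing singleton loader
    output = model(prompt, max_tokens=256, stop=["<|im_end|>"])
    return output["choices"][0]["text"].strip()


def rephrase(text: str, mode: str) -> Tuple[str, float]:
    """Thin orchestrator: build prompt, run inference, measure latency."""
    start = time.perf_counter()
    result = run_inference(build_prompt(text, mode))
    return result, time.perf_counter() - start
```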
        return avg_latency


    # For standalone testing
We need to keep our production modules as lean as possible. Please move this entire testing and benchmarking suite out of this file; there is an existing backend/tests directory created specifically for such tasks. Also consider using pytest for assertions and pytest-benchmark for latency tracking.
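For instance, the manual benchmarking block could become tests like these; the file path, the skip condition, and gating on IDIO_MODEL_PATH are assumptions about how the suite might guard a heavy local model:

```python
# backend/tests/test_llm_engine.py (proposed)
import os

import pytest

from idio.infrastructure.ai.llm_engine import rephrase

requires_model = pytest.mark.skipif(
    not os.getenv("IDIO_MODEL_PATH"),
    reason="real GGUF model not configured",
)


@requires_model
def test_rephrase_returns_text_and_latency():
    result, latency = rephrase("hey can u send me the file", "professional")
    assert isinstance(result, str) and result.strip()
    assert latency > 0


@requires_model
def test_rephrase_latency(benchmark):
    # pytest-benchmark's `benchmark` fixture collects latency statistics across runs.
    result, _ = benchmark(rephrase, "hey can u send me the file", "professional")
    assert result.strip()
```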
    *.swp
    Thumbs.db
    CommandsForManan.md
It is not good practice to add files like these to .gitignore; it is meant for the whole project, not for a single dev's personal files. Please remove this entry and manually make sure you don't commit your CommandsForManan.md to git.
Sprint 1 Deliverable: Real text rephrasing using Qwen2.5-0.5B-Instruct GGUF
Implementation:
Pull Request Checklist
Type:
For Backend (Dev R1, R2, M):
- … domain/, not direct imports.
- … scripts/setup.py to ensure dependencies are correct.
- … localhost:45500.
- … infrastructure/persistence/db.py.

For Frontend (Dev S):
Validation: