feat: [PLAN-6] Implement real LLM inference engine with llama-cpp-python #3

mananmalik1807 wants to merge 1 commit into dev
Conversation
Sprint 1 Deliverable: Real text rephrasing using Qwen2.5-0.5B-Instruct GGUF

Implementation:
- Created llm_engine.py with robust path resolution
- Updated service.py for hybrid LLM/mock mode
- Added personal notes pattern to .gitignore (*.md for developer notes)
Pull request overview
Implements a real local LLM rephrasing path using llama-cpp-python (Qwen2.5-0.5B-Instruct GGUF) and wires it into the backend’s existing rephrasing service with a hybrid “real LLM or mock” mode.
Changes:
- Added a llama-cpp-python-backed inference engine module with model path resolution and a rephrase() API.
- Updated the domain rephrasing service to optionally call the real LLM (feature-flagged) and fall back to the mock implementation on failure.
- Updated .gitignore to ignore an additional markdown file.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| backend/src/idio/infrastructure/ai/llm_engine.py | Adds GGUF model loading + prompt construction + inference wrapper for rephrasing. |
| backend/src/idio/domain/inference/service.py | Adds hybrid routing to real LLM inference behind an env flag with fallback to existing mock logic. |
| .gitignore | Adds an ignore entry for a developer notes file. |
| f"Model not found at {model_path}\n" | ||
| f"Expected location: /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | ||
| f"Current working directory: {Path.cwd()}\n" | ||
| f"Current file location: {Path(__file__)}\n" | ||
| f"\nTroubleshooting:\n" | ||
| f" 1. Verify model exists: ls /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | ||
| f" 2. Run from project root: cd /Users/mananmalik/Developer/idio-dev && python -m idio.infrastructure.ai.llm_engine\n" | ||
| f" 3. Set explicit path: export IDIO_MODEL_PATH=/path/to/model.gguf" |
The FileNotFoundError message hard-codes a developer-specific absolute path (/Users/...) and troubleshooting commands referencing that local directory. This can leak local environment details and will mislead other developers/CI; prefer only reporting the resolved/attempted paths and guiding users to set IDIO_MODEL_PATH or run the model download script.
| f"Model not found at {model_path}\n" | |
| f"Expected location: /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | |
| f"Current working directory: {Path.cwd()}\n" | |
| f"Current file location: {Path(__file__)}\n" | |
| f"\nTroubleshooting:\n" | |
| f" 1. Verify model exists: ls /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | |
| f" 2. Run from project root: cd /Users/mananmalik/Developer/idio-dev && python -m idio.infrastructure.ai.llm_engine\n" | |
| f" 3. Set explicit path: export IDIO_MODEL_PATH=/path/to/model.gguf" | |
| "LLM model file not found.\n" | |
| f"Attempted path: {model_path}\n" | |
| "Resolution strategies used:\n" | |
| " - IDIO_MODEL_PATH environment variable (if set)\n" | |
| " - Searching upwards from this file for models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | |
| " - Fallback to ../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf relative to the backend directory\n" | |
| f"Current working directory: {Path.cwd()}\n" | |
| f"Current file location: {Path(__file__)}\n" | |
| "\nTroubleshooting:\n" | |
| " 1. Ensure the model file qwen2.5-0.5b-instruct-q4_k_m.gguf is present under your project's models/base/ directory, or\n" | |
| " 2. Set an explicit absolute path: export IDIO_MODEL_PATH=/absolute/path/to/qwen2.5-0.5b-instruct-q4_k_m.gguf\n" | |
| " 3. Re-run your command after verifying the model path." |
    def get_model() -> Llama:
        """
        Lazy-load the GGUF model (Singleton pattern).
        Only loads once at first call.
        """
        global _model_instance
        if _model_instance is None:
            # Try multiple path resolution strategies
get_model() implements a singleton via a module-level global without any synchronization. In multi-threaded execution (e.g., multiple workers/threads), concurrent first requests can race and initialize multiple Llama instances or partially initialize _model_instance. Use a lock around initialization or initialize once during application startup.
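A minimal sketch of a lock-guarded lazy initializer, assuming the existing module-level `_model_instance` global; `_resolve_model_path()` is a hypothetical placeholder for the PR's path-resolution strategies, and the `Llama(...)` arguments are illustrative:

```python
import threading
from typing import Optional

from llama_cpp import Llama

_model_instance: Optional[Llama] = None
_model_lock = threading.Lock()


def get_model() -> Llama:
    """Lazily load the GGUF model exactly once, even under concurrent first calls."""
    global _model_instance
    if _model_instance is None:
        with _model_lock:
            # Double-checked locking: another thread may have finished loading
            # while this one was waiting on the lock.
            if _model_instance is None:
                model_path = _resolve_model_path()  # hypothetical stand-in for the PR's strategies
                _model_instance = Llama(model_path=str(model_path), n_ctx=2048, verbose=False)
    return _model_instance
```

Loading once at application startup avoids the lock entirely and surfaces missing-model errors before the first request.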
    # Sprint 1: Import actual llm_engine for real inference
    try:
        from idio.infrastructure.ai.llm_engine import rephrase as llm_rephrase
        LLM_AVAILABLE = True
This domain-layer service now directly imports the infrastructure llm_engine, which conflicts with the PR checklist/architecture note about using domain Contracts (interfaces) rather than direct infrastructure imports. Consider defining an inference interface in domain/inference/contracts.py and wiring the infrastructure implementation from the API/composition root to keep domain independent of infrastructure.
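A rough sketch of the contract-based wiring, under the assumption that a Protocol lives in domain/inference/contracts.py; module layout, class names, and the adapter name are illustrative:

```python
# domain/inference/contracts.py (proposed module; names are illustrative)
from typing import Protocol, Tuple


class RephraseEngine(Protocol):
    """Contract the domain depends on; infrastructure supplies the implementation."""

    def rephrase(self, text: str, mode: str) -> Tuple[str, float]: ...


# domain/inference/service.py then depends only on the protocol
class RephrasingService:
    def __init__(self, engine: RephraseEngine) -> None:
        self._engine = engine

    def get_rephrasing(self, text: str, mode: str) -> str:
        result, _latency = self._engine.rephrase(text, mode)
        return result


# The API/composition root wires in the llama-cpp-backed adapter, e.g.:
# service = RephrasingService(engine=LlamaCppRephraseEngine())  # hypothetical adapter class
```

This keeps the domain package importable and testable without llama-cpp-python installed.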
    # Normalize mode aliases
    mode_mapping = {
        "casual": "friendly",
        "academic": "professional",
        "creative": "friendly"
    }
    normalized_mode = mode_mapping.get(mode, mode)

    # Call real LLM inference
    result, _ = llm_rephrase(text, normalized_mode)
    return result
Mode normalization for the real LLM path does not handle common input variations (case/whitespace), and any unrecognized mode will be passed through to llm_rephrase even though llm_engine.rephrase only defines professional/friendly/mirror. Normalize with mode.strip().lower() before mapping, and ensure the final mode is one of the supported values to avoid unintended prompt selection.
Suggested change:

    # Normalize and sanitize mode, then map aliases
    raw_mode = (mode or "").strip().lower()
    mode_mapping = {
        "casual": "friendly",
        "academic": "professional",
        "creative": "friendly"
    }
    normalized_mode = mode_mapping.get(raw_mode, raw_mode)
    allowed_modes = {"professional", "friendly", "mirror"}
    # Only call real LLM if mode is supported
    if normalized_mode in allowed_modes:
        result, _ = llm_rephrase(text, normalized_mode)
        return result
    if USE_REAL_LLM:
        try:
            # Normalize mode aliases
            mode_mapping = {
                "casual": "friendly",
                "academic": "professional",
                "creative": "friendly"
            }
            normalized_mode = mode_mapping.get(mode, mode)

            # Call real LLM inference
            result, _ = llm_rephrase(text, normalized_mode)
            return result
Real LLM inference is a CPU-bound synchronous call executed inside get_rephrasing(); the /rephrase route is declared async and calls this function directly, so enabling USE_REAL_LLM will block the event loop for the duration of inference and reduce API concurrency. Consider running inference in a threadpool/background worker (or making the route sync) to avoid event-loop starvation.
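For example, assuming a FastAPI/Starlette-style async route (the route code is not part of this diff), the blocking call could be moved off the event loop along these lines:

```python
import asyncio


async def rephrase_endpoint(text: str, mode: str) -> dict:
    # Push the CPU-bound llama.cpp call onto a worker thread so the event loop
    # keeps serving other requests while inference runs.
    result = await asyncio.to_thread(get_rephrasing, text, mode)  # existing service function
    return {"rephrased": result}
```

Starlette's run_in_threadpool would work equally well if the project prefers framework helpers over asyncio directly.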
    # Strategy 3: Fallback to relative path from backend/
    if model_path is None:
        model_path = Path("../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf").resolve()
The fallback Path("../models/base/... ").resolve() is resolved relative to the current working directory, not the package/file location, so it can point to the wrong place depending on how the app is started. Prefer constructing the fallback relative to __file__/the discovered project root, or rely solely on the upward-search + IDIO_MODEL_PATH.
Suggested change:

    # Strategy 3: Fallback to relative path from backend/, anchored to this file
    if model_path is None:
        model_path = (current_file.parent / "../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf").resolve()
    except Exception as e:
        print(f"[service] LLM inference failed: {str(e)}, falling back to mock")
        # Fall through to original mock implementation below
If model loading/inference fails, this will print an error and retry on every request, which can spam logs and add repeated overhead. Consider a one-time disable/circuit-breaker (e.g., flip USE_REAL_LLM/a module flag after first failure) or perform a startup-time health check so failures are surfaced once.
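One lightweight option, sketched under the assumption that USE_REAL_LLM stays a module-level flag in service.py and that llm_rephrase and a mock path already exist there (the mock function name below is assumed):

```python
import logging

logger = logging.getLogger(__name__)
USE_REAL_LLM = True  # module-level flag, as in the PR


def get_rephrasing(text: str, mode: str) -> str:
    global USE_REAL_LLM
    if USE_REAL_LLM:
        try:
            result, _ = llm_rephrase(text, mode)  # engine call imported in the PR
            return result
        except Exception:
            # Trip the breaker: disable the real-LLM path after the first failure so
            # later requests go straight to the mock without repeating this log line.
            USE_REAL_LLM = False
            logger.exception("LLM inference failed; using mock for the rest of this process")
    return mock_rephrase(text, mode)  # existing mock implementation (name assumed)
```

A startup-time health check (loading the model once at boot) would surface the same failure even earlier.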
    # Normalize mode aliases
    mode_mapping = {
        "casual": "friendly",
        "academic": "professional",
        "creative": "friendly"
    }
    normalized_mode = mode_mapping.get(mode, mode)
mode_mapping lookups are case-sensitive; if callers send "Professional"/"FRIENDLY" the mapping won’t apply and llm_rephrase will receive an unexpected mode value. Normalize mode (e.g., lowercasing/stripping) before mapping so behavior is consistent with API input.
    def rephrase(
        text: str,
        mode: Literal["professional", "friendly", "mirror"]
    ) -> Tuple[str, float]:
rephrase() annotates mode as Literal["professional", "friendly", "mirror"], but the implementation has an else branch and callers (e.g., domain service) can pass arbitrary strings. This will cause mypy/type-checking friction; either validate and raise on invalid modes, or widen the type to str (or include allowed aliases) to match runtime behavior.
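If the validation route is preferred over widening the type, a sketch along these lines keeps the annotation honest (not the PR's code; the `...` stands for the existing prompt construction and inference):

```python
from typing import Literal, Tuple, get_args

RephraseMode = Literal["professional", "friendly", "mirror"]
_ALLOWED_MODES = frozenset(get_args(RephraseMode))


def rephrase(text: str, mode: RephraseMode) -> Tuple[str, float]:
    # Fail fast on anything the Literal does not allow, so the annotation
    # and the runtime behavior stay in sync.
    if mode not in _ALLOWED_MODES:
        raise ValueError(f"Unsupported mode {mode!r}; expected one of {sorted(_ALLOWED_MODES)}")
    ...  # prompt construction and inference as in the PR
```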
    import os
    from llama_cpp import Llama
    import time
    from pathlib import Path
    from typing import Literal, Tuple
llm_engine.py imports llama_cpp, but the backend dependency list (pyproject.toml) does not include llama-cpp-python. As-is, installs will succeed but real inference will always fall back (or crash if this module is imported elsewhere). Add the package as a dependency (or an optional extra) and document how to enable it.
This seems like a legit issue. After completing the actions Copilot suggested above, please make sure we use the [cpu]/[cu121] extra tags to support different hardware.
    LLM_AVAILABLE = False

    # Feature flag: Set to False to force mock implementation
    USE_REAL_LLM = os.getenv("USE_REAL_LLM", "true").lower() == "true" and LLM_AVAILABLE
USE_REAL_LLM defaults to "true", which can unexpectedly enable heavy CPU-bound inference in environments that didn’t intend to run the real model (and may not have the model file). Consider defaulting this flag to false and requiring explicit opt-in via env var, especially for CI/dev setups.
Suggested change:

    USE_REAL_LLM = os.getenv("USE_REAL_LLM", "false").lower() == "true" and LLM_AVAILABLE
@mananmalik1807 Thanks for following a proper PR description.
    system_prompt = "Rewrite the following text clearly."

    # Format prompt for Qwen (uses ChatML format)
    prompt = f"""<|im_start|>system
We need to add a basic check here to ensure the user input doesn't contain the <|im_end|> token, which could be used for prompt injection. It's a low risk for a local HUD, but good 'Privacy-First' practice.
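A minimal guard could strip ChatML control tokens before the prompt is assembled; the token list below is an assumption based on Qwen's ChatML template, not taken from the PR:

```python
CHATML_CONTROL_TOKENS = ("<|im_start|>", "<|im_end|>")  # assumed sufficient for this model


def sanitize_user_text(text: str) -> str:
    """Strip ChatML control tokens so user input cannot break out of its message block."""
    for token in CHATML_CONTROL_TOKENS:
        text = text.replace(token, "")
    return text.strip()
```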
Please remove all the emojis you have added in the print statements (it shows very evidently that this has been vibe-coded). Using print statements is also generally not advisable; use the standard Python logging module.
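A sketch of the logging approach, assuming a module-level logger in llm_engine.py (the helper name is illustrative):

```python
import logging

logger = logging.getLogger(__name__)


def report_fallback(error: Exception) -> None:
    # Replaces the emoji-laden print(): level-appropriate, structured, and configurable.
    logger.warning("LLM inference failed (%s); falling back to mock", error)
```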
    if USE_REAL_LLM:
        try:
            # Normalize mode aliases
            mode_mapping = {
I think the mode mapping should be moved to a constants file so it is easily configurable; it is making the service logic too complicated.
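Something like this, with the module path being a suggestion rather than an existing file:

```python
# domain/inference/constants.py (proposed location)
MODE_ALIASES = {
    "casual": "friendly",
    "academic": "professional",
    "creative": "friendly",
}
SUPPORTED_MODES = frozenset({"professional", "friendly", "mirror"})


def normalize_mode(mode: str) -> str:
    """Lowercase, strip, and map aliases onto the modes the engine supports."""
    raw = (mode or "").strip().lower()
    return MODE_ALIASES.get(raw, raw)
```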
    # Strategy 3: Fallback to relative path from backend/
    if model_path is None:
        model_path = Path("../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf").resolve()
I think a 30-line loop to find a model path is overkill here and is also very difficult to maintain. MODEL_PATH can simply be defined in the .env file, with a clean error thrown if it is not present.
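The env-only approach could be as small as this; the variable name reuses the PR's IDIO_MODEL_PATH, and the error messages are illustrative:

```python
import os
from pathlib import Path


def get_model_path() -> Path:
    """Resolve the GGUF model path from the environment and fail fast with a clear message."""
    raw = os.getenv("IDIO_MODEL_PATH")
    if not raw:
        raise RuntimeError("IDIO_MODEL_PATH is not set; add it to your .env or shell environment")
    path = Path(raw).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"IDIO_MODEL_PATH points to a missing file: {path}")
    return path
```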
        return _model_instance


    def rephrase(
This rephrase function is trying to do too much. It handles prompt construction, model execution, latency timing, and error handling in one giant block. Please refactor and keep the code minimal, following the Single Responsibility Principle (ref: Clean Code).
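A possible decomposition, keeping the (text, mode) -> (result, latency) signature but splitting responsibilities; the system prompts, sampling parameters, and function names below are illustrative, not the PR's code:

```python
import time
from typing import Tuple

_SYSTEM_PROMPTS = {  # illustrative; the real prompts live in the engine module
    "professional": "Rewrite the following text in a professional tone.",
    "friendly": "Rewrite the following text in a friendly tone.",
    "mirror": "Rewrite the following text clearly, keeping its original tone.",
}


def build_prompt(text: str, mode: str) -> str:
    """Prompt construction only: pick a system prompt and format the ChatML template."""
    system_prompt = _SYSTEM_PROMPTS.get(mode, "Rewrite the following text clearly.")
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def run_inference(prompt: str) -> str:
    """Model execution only: call the loaded llama.cpp model and return its text."""
    model = get_model()  # existing singleton loader
    output = model(prompt, max_tokens=256, stop=["<|im_end|>"])
    return output["choices"][0]["text"].strip()


def rephrase(text: str, mode: str) -> Tuple[str, float]:
    """Thin orchestrator: build prompt, run inference, measure latency."""
    start = time.perf_counter()
    result = run_inference(build_prompt(text, mode))
    return result, time.perf_counter() - start
```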
        return avg_latency


    # For standalone testing
We need to keep our production modules as lean as possible. Please move this entire testing and benchmarking suite out of this file; there is an existing backend/tests directory created specifically for such tasks. Also consider using pytest for assertions and pytest-benchmark for latency tracking.
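For instance, the manual benchmarking block could become tests like these; the file path, the skip condition, and gating on IDIO_MODEL_PATH are assumptions about how the suite might guard a heavy local model:

```python
# backend/tests/test_llm_engine.py (proposed)
import os

import pytest

from idio.infrastructure.ai.llm_engine import rephrase

requires_model = pytest.mark.skipif(
    not os.getenv("IDIO_MODEL_PATH"),
    reason="real GGUF model not configured",
)


@requires_model
def test_rephrase_returns_text_and_latency():
    result, latency = rephrase("hey can u send me the file", "professional")
    assert isinstance(result, str) and result.strip()
    assert latency > 0


@requires_model
def test_rephrase_latency(benchmark):
    # pytest-benchmark's `benchmark` fixture collects latency statistics across runs.
    result, _ = benchmark(rephrase, "hey can u send me the file", "professional")
    assert result.strip()
```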
    *.swp
    Thumbs.db
    CommandsForManan.md
It is not good practice to add files like these to .gitignore; it is meant for the whole project, not for a single dev's personal files. Please remove this entry and manually make sure you don't commit your CommandsForManan.md to git.
Sprint 1 Deliverable: Real text rephrasing using Qwen2.5-0.5B-Instruct GGUF
Implementation:
Pull Request Checklist
Type:
For Backend (Dev R1, R2, M):
- … domain/, not direct imports.
- … scripts/setup.py to ensure dependencies are correct.
- … localhost:45500.
- … infrastructure/persistence/db.py.

For Frontend (Dev S):
Validation: