feat: [PLAN-6] Implement real LLM inference engine with llama-cpp-python #3

Open
mananmalik1807 wants to merge 1 commit into dev from dev-PLAN-6

Conversation

@mananmalik1807
Collaborator

Sprint 1 Deliverable: Real text rephrasing using Qwen2.5-0.5B-Instruct GGUF

Implementation:

  • Created llm_engine.py with robust path resolution
  • Updated service.py for hybrid LLM/mock mode
  • Added personal notes pattern to .gitignore (*.md for developer notes)

Pull Request Checklist

Type:

  • Bug fix
  • [✅] New Feature
  • Refactor
  • Documentation

For Backend (Dev R1, R2, M):

  • [✅] I have used the Contracts (Interfaces) in domain/, not direct imports.
  • [✅] I have run scripts/setup.py to ensure dependencies are correct.
  • [✅] I have tested the API locally at localhost:45500.
  • If I changed the DB, I updated infrastructure/persistence/db.py.

For Frontend (Dev S):

  • I have tested with the Mock API running.
  • I have checked the UI in both Light and Dark modes.

Validation:

  • [✅] My code runs locally without crashing.
  • [✅] I have added/updated tests if applicable.

Copilot AI review requested due to automatic review settings February 16, 2026 07:56
Copilot AI left a comment

Pull request overview

Implements a real local LLM rephrasing path using llama-cpp-python (Qwen2.5-0.5B-Instruct GGUF) and wires it into the backend’s existing rephrasing service with a hybrid “real LLM or mock” mode.

Changes:

  • Added a llama-cpp-python-backed inference engine module with model path resolution and a rephrase() API.
  • Updated the domain rephrasing service to optionally call the real LLM (feature-flagged) and fall back to the mock implementation on failure.
  • Updated .gitignore to ignore an additional markdown file.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.

File: backend/src/idio/infrastructure/ai/llm_engine.py
Description: Adds GGUF model loading, prompt construction, and an inference wrapper for rephrasing.

File: backend/src/idio/domain/inference/service.py
Description: Adds hybrid routing to real LLM inference behind an env flag, with fallback to the existing mock logic.

File: .gitignore
Description: Adds an ignore entry for a developer notes file.


Comment on lines +53 to +60
f"Model not found at {model_path}\n"
f"Expected location: /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n"
f"Current working directory: {Path.cwd()}\n"
f"Current file location: {Path(__file__)}\n"
f"\nTroubleshooting:\n"
f" 1. Verify model exists: ls /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n"
f" 2. Run from project root: cd /Users/mananmalik/Developer/idio-dev && python -m idio.infrastructure.ai.llm_engine\n"
f" 3. Set explicit path: export IDIO_MODEL_PATH=/path/to/model.gguf"
Copilot AI Feb 16, 2026

The FileNotFoundError message hard-codes a developer-specific absolute path (/Users/...) and troubleshooting commands referencing that local directory. This can leak local environment details and will mislead other developers/CI; prefer only reporting the resolved/attempted paths and guiding users to set IDIO_MODEL_PATH or run the model download script.

Suggested change
f"Model not found at {model_path}\n"
f"Expected location: /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n"
f"Current working directory: {Path.cwd()}\n"
f"Current file location: {Path(__file__)}\n"
f"\nTroubleshooting:\n"
f" 1. Verify model exists: ls /Users/mananmalik/Developer/idio-dev/models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n"
f" 2. Run from project root: cd /Users/mananmalik/Developer/idio-dev && python -m idio.infrastructure.ai.llm_engine\n"
f" 3. Set explicit path: export IDIO_MODEL_PATH=/path/to/model.gguf"
"LLM model file not found.\n"
f"Attempted path: {model_path}\n"
"Resolution strategies used:\n"
" - IDIO_MODEL_PATH environment variable (if set)\n"
" - Searching upwards from this file for models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf\n"
" - Fallback to ../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf relative to the backend directory\n"
f"Current working directory: {Path.cwd()}\n"
f"Current file location: {Path(__file__)}\n"
"\nTroubleshooting:\n"
" 1. Ensure the model file qwen2.5-0.5b-instruct-q4_k_m.gguf is present under your project's models/base/ directory, or\n"
" 2. Set an explicit absolute path: export IDIO_MODEL_PATH=/absolute/path/to/qwen2.5-0.5b-instruct-q4_k_m.gguf\n"
" 3. Re-run your command after verifying the model path."

Comment on lines +18 to +25
def get_model() -> Llama:
    """
    Lazy-load the GGUF model (Singleton pattern).
    Only loads once at first call.
    """
    global _model_instance
    if _model_instance is None:
        # Try multiple path resolution strategies
Copilot AI Feb 16, 2026

get_model() implements a singleton via a module-level global without any synchronization. In multi-threaded execution (e.g., multiple workers/threads), concurrent first requests can race and initialize multiple Llama instances or partially initialize _model_instance. Use a lock around initialization or initialize once during application startup.
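A minimal sketch of the lock-guarded variant, assuming the existing module-level _model_instance and folding the PR's path-resolution logic behind a hypothetical _resolve_model_path() helper (constructor arguments are illustrative):

import threading

from llama_cpp import Llama

_model_instance = None
_model_lock = threading.Lock()

def get_model() -> Llama:
    """Thread-safe lazy loading: at most one thread ever constructs the model."""
    global _model_instance
    if _model_instance is None:
        with _model_lock:
            # Double-checked locking: a thread that waited on the lock must not reload.
            if _model_instance is None:
                model_path = _resolve_model_path()  # hypothetical wrapper around the PR's path strategies
                _model_instance = Llama(model_path=str(model_path), n_ctx=2048, verbose=False)
    return _model_instance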

Comment on lines +9 to +12
# Sprint 1: Import actual llm_engine for real inference
try:
    from idio.infrastructure.ai.llm_engine import rephrase as llm_rephrase
    LLM_AVAILABLE = True
Copilot AI Feb 16, 2026

This domain-layer service now directly imports the infrastructure llm_engine, which conflicts with the PR checklist/architecture note about using domain Contracts (interfaces) rather than direct infrastructure imports. Consider defining an inference interface in domain/inference/contracts.py and wiring the infrastructure implementation from the API/composition root to keep domain independent of infrastructure.
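A rough sketch of that separation, assuming a hypothetical domain/inference/contracts.py and a composition root in the API layer (module and class names are illustrative, not existing project code):

# domain/inference/contracts.py (hypothetical)
from typing import Optional, Protocol, Tuple

class RephrasingEngine(Protocol):
    def rephrase(self, text: str, mode: str) -> Tuple[str, float]: ...

# domain/inference/service.py depends only on the contract
class RephrasingService:
    def __init__(self, engine: Optional[RephrasingEngine] = None) -> None:
        self._engine = engine  # None means "mock mode"

    def get_rephrasing(self, text: str, mode: str) -> str:
        if self._engine is not None:
            result, _ = self._engine.rephrase(text, mode)
            return result
        return f"[mock rephrase:{mode}] {text}"  # stand-in for the existing mock logic

# The API/composition root, not the domain, imports the infrastructure implementation and injects it, e.g.:
#   from idio.infrastructure.ai.llm_engine import LlamaCppEngine  # hypothetical adapter class
#   service = RephrasingService(engine=LlamaCppEngine())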

Comment on lines +34 to +44
# Normalize mode aliases
mode_mapping = {
    "casual": "friendly",
    "academic": "professional",
    "creative": "friendly"
}
normalized_mode = mode_mapping.get(mode, mode)

# Call real LLM inference
result, _ = llm_rephrase(text, normalized_mode)
return result
Copilot AI Feb 16, 2026

Mode normalization for the real LLM path does not handle common input variations (case/whitespace), and any unrecognized mode will be passed through to llm_rephrase even though llm_engine.rephrase only defines professional/friendly/mirror. Normalize with mode.strip().lower() before mapping, and ensure the final mode is one of the supported values to avoid unintended prompt selection.

Suggested change
-    # Normalize mode aliases
-    mode_mapping = {
-        "casual": "friendly",
-        "academic": "professional",
-        "creative": "friendly"
-    }
-    normalized_mode = mode_mapping.get(mode, mode)
-    # Call real LLM inference
-    result, _ = llm_rephrase(text, normalized_mode)
-    return result
+    # Normalize and sanitize mode, then map aliases
+    raw_mode = (mode or "").strip().lower()
+    mode_mapping = {
+        "casual": "friendly",
+        "academic": "professional",
+        "creative": "friendly"
+    }
+    normalized_mode = mode_mapping.get(raw_mode, raw_mode)
+    allowed_modes = {"professional", "friendly", "mirror"}
+    # Only call real LLM if mode is supported
+    if normalized_mode in allowed_modes:
+        result, _ = llm_rephrase(text, normalized_mode)
+        return result

Comment on lines +32 to +44
if USE_REAL_LLM:
    try:
        # Normalize mode aliases
        mode_mapping = {
            "casual": "friendly",
            "academic": "professional",
            "creative": "friendly"
        }
        normalized_mode = mode_mapping.get(mode, mode)

        # Call real LLM inference
        result, _ = llm_rephrase(text, normalized_mode)
        return result
Copilot AI Feb 16, 2026

Real LLM inference is a CPU-bound synchronous call executed inside get_rephrasing(); the /rephrase route is declared async and calls this function directly, so enabling USE_REAL_LLM will block the event loop for the duration of inference and reduce API concurrency. Consider running inference in a threadpool/background worker (or making the route sync) to avoid event-loop starvation.
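One minimal way to keep the async route responsive is to push the call into a worker thread, sketched here with asyncio.to_thread; the FastAPI app, route shape, and request model are assumptions about the backend, and only get_rephrasing and the module path come from the PR:

import asyncio

from fastapi import FastAPI
from pydantic import BaseModel

from idio.domain.inference.service import get_rephrasing

app = FastAPI()

class RephraseRequest(BaseModel):
    text: str
    mode: str = "professional"

@app.post("/rephrase")
async def rephrase_endpoint(req: RephraseRequest) -> dict:
    # Run the CPU-bound llama.cpp call in a worker thread so the event loop
    # stays free to serve other requests while inference runs.
    result = await asyncio.to_thread(get_rephrasing, req.text, req.mode)
    return {"rephrased": result}

Starlette's run_in_threadpool (re-exported as fastapi.concurrency.run_in_threadpool) would work equally well here.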

Copilot AI left a comment

Comment on lines +46 to +48
# Strategy 3: Fallback to relative path from backend/
if model_path is None:
    model_path = Path("../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf").resolve()
Copilot AI Feb 16, 2026

The fallback Path("../models/base/... ").resolve() is resolved relative to the current working directory, not the package/file location, so it can point to the wrong place depending on how the app is started. Prefer constructing the fallback relative to __file__/the discovered project root, or rely solely on the upward-search + IDIO_MODEL_PATH.

Suggested change
-# Strategy 3: Fallback to relative path from backend/
-if model_path is None:
-    model_path = Path("../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf").resolve()
+# Strategy 3: Fallback to relative path from backend/, anchored to this file
+if model_path is None:
+    model_path = (current_file.parent / "../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf").resolve()

Comment on lines +46 to +48
except Exception as e:
    print(f"[service] LLM inference failed: {str(e)}, falling back to mock")
    # Fall through to original mock implementation below
Copilot AI Feb 16, 2026

If model loading/inference fails, this will print an error and retry on every request, which can spam logs and add repeated overhead. Consider a one-time disable/circuit-breaker (e.g., flip USE_REAL_LLM/a module flag after first failure) or perform a startup-time health check so failures are surfaced once.
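A minimal sketch of the one-time-disable idea as a module-level flag in service.py; the flag name, logger usage, and mock_rephrase helper are illustrative, and the snippet assumes the module's existing USE_REAL_LLM flag and llm_rephrase import:

import logging

logger = logging.getLogger(__name__)

_llm_disabled = False  # flipped after the first failure so we stop retrying on every request

def get_rephrasing(text: str, mode: str) -> str:
    global _llm_disabled
    if USE_REAL_LLM and not _llm_disabled:
        try:
            result, _ = llm_rephrase(text, mode)
            return result
        except Exception as exc:
            _llm_disabled = True  # circuit breaker: surface the failure once, then stay on the mock path
            logger.warning("LLM inference failed, disabling real LLM for this process: %s", exc)
    return mock_rephrase(text, mode)  # hypothetical name for the existing mock implementation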

Comment on lines +34 to +41
# Normalize mode aliases
mode_mapping = {
    "casual": "friendly",
    "academic": "professional",
    "creative": "friendly"
}
normalized_mode = mode_mapping.get(mode, mode)
Copilot AI Feb 16, 2026

mode_mapping lookups are case-sensitive; if callers send "Professional"/"FRIENDLY" the mapping won’t apply and llm_rephrase will receive an unexpected mode value. Normalize mode (e.g., lowercasing/stripping) before mapping so behavior is consistent with API input.

Comment on lines +78 to +81
def rephrase(
    text: str,
    mode: Literal["professional", "friendly", "mirror"]
) -> Tuple[str, float]:
Copilot AI Feb 16, 2026

rephrase() annotates mode as Literal["professional", "friendly", "mirror"], but the implementation has an else branch and callers (e.g., domain service) can pass arbitrary strings. This will cause mypy/type-checking friction; either validate and raise on invalid modes, or widen the type to str (or include allowed aliases) to match runtime behavior.
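If the validate-and-raise route is taken, a small sketch (widening the annotation to str; the message wording and placeholder body are illustrative):

from typing import Tuple

_ALLOWED_MODES = ("professional", "friendly", "mirror")

def rephrase(text: str, mode: str) -> Tuple[str, float]:
    # Fail fast instead of letting unexpected modes fall into the else branch.
    if mode not in _ALLOWED_MODES:
        raise ValueError(f"Unsupported rephrase mode {mode!r}; expected one of {_ALLOWED_MODES}")
    # ... existing prompt construction, inference, and latency timing continue unchanged
    ...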

Comment on lines +8 to +12
import os
from llama_cpp import Llama
import time
from pathlib import Path
from typing import Literal, Tuple
Copilot AI Feb 16, 2026

llm_engine.py imports llama_cpp, but the backend dependency list (pyproject.toml) does not include llama-cpp-python. As-is, installs will succeed but real inference will always fall back (or crash if this module is imported elsewhere). Add the package as a dependency (or an optional extra) and document how to enable it.

Collaborator

This seems like a legit issue. After completing the actions suggested by Copilot above, please make sure we use the [cpu] / [cu121] extras to support different hardware.

LLM_AVAILABLE = False

# Feature flag: Set to False to force mock implementation
USE_REAL_LLM = os.getenv("USE_REAL_LLM", "true").lower() == "true" and LLM_AVAILABLE
Copilot AI Feb 16, 2026

USE_REAL_LLM defaults to "true", which can unexpectedly enable heavy CPU-bound inference in environments that didn’t intend to run the real model (and may not have the model file). Consider defaulting this flag to false and requiring explicit opt-in via env var, especially for CI/dev setups.

Suggested change
-USE_REAL_LLM = os.getenv("USE_REAL_LLM", "true").lower() == "true" and LLM_AVAILABLE
+USE_REAL_LLM = os.getenv("USE_REAL_LLM", "false").lower() == "true" and LLM_AVAILABLE

@radhaneelamani
Collaborator

@mananmalik1807 Thanks for following the proper PR description format.

system_prompt = "Rewrite the following text clearly."

# Format prompt for Qwen (uses ChatML format)
prompt = f"""<|im_start|>system
Collaborator

We need to add a basic check here to ensure the user input doesn't contain the <|im_end|> token, which could be used for prompt injection. It's a low risk for a local HUD, but good 'Privacy-First' practice.
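A minimal sanitization sketch along those lines, run on the user text before it is spliced into the ChatML prompt (the helper name is illustrative; the token strings are the ones Qwen's ChatML format uses):

_CHATML_TOKENS = ("<|im_start|>", "<|im_end|>")

def sanitize_user_text(text: str) -> str:
    """Strip ChatML control tokens so user input cannot terminate or open message blocks."""
    for token in _CHATML_TOKENS:
        text = text.replace(token, "")
    return text

# Before building the prompt:
#   text = sanitize_user_text(text)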

@radhaneelamani
Collaborator

Please remove all the emojis you have added in the print statements (it shows very evidently that this has been vibe-coded). Using print statements is also generally not advisable; use the standard Python logging module.
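A short sketch of the logging-module equivalent, shown here as an illustrative wrapper around the existing inference call (the wrapper itself is not part of the PR):

import logging
from typing import Optional

from idio.infrastructure.ai.llm_engine import rephrase as llm_rephrase

logger = logging.getLogger(__name__)  # one logger per module, named after the module

def rephrase_or_none(text: str, mode: str) -> Optional[str]:
    try:
        result, _ = llm_rephrase(text, mode)
        return result
    except Exception as exc:
        # logger.warning replaces print: level-tagged, timestamped, and configurable centrally.
        logger.warning("LLM inference failed, falling back to mock: %s", exc)
        return None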

if USE_REAL_LLM:
    try:
        # Normalize mode aliases
        mode_mapping = {
Collaborator

I think the mode mapping should be moved to a constants file so it is easily configurable; it is making the service logic too complicated.
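One possible shape for that, assuming a new constants module alongside the service (file, constant, and function names are suggestions, not existing code):

# domain/inference/constants.py (hypothetical)
MODE_ALIASES = {
    "casual": "friendly",
    "academic": "professional",
    "creative": "friendly",
}
ALLOWED_MODES = frozenset({"professional", "friendly", "mirror"})

def normalize_mode(mode: str, default: str = "professional") -> str:
    """Map aliases onto the supported modes and fall back to a safe default."""
    cleaned = (mode or "").strip().lower()
    resolved = MODE_ALIASES.get(cleaned, cleaned)
    return resolved if resolved in ALLOWED_MODES else default

# service.py then reduces to:
#   normalized_mode = normalize_mode(mode)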

# Strategy 3: Fallback to relative path from backend/
if model_path is None:
    model_path = Path("../models/base/qwen2.5-0.5b-instruct-q4_k_m.gguf").resolve()

Collaborator

I think a 30-line loop to find a model path is overkill here and will also be very difficult to maintain. MODEL_PATH can simply be defined in the .env file, and a clean error can be thrown if it is not present.
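A minimal sketch of that simpler approach, reusing the IDIO_MODEL_PATH variable already mentioned in the engine's error message (how the .env file gets loaded, e.g. via python-dotenv, is assumed to be handled by the app's existing configuration):

import os
from pathlib import Path

def get_model_path() -> Path:
    """Read the model path from the environment and fail with one clean error if it is missing."""
    raw = os.getenv("IDIO_MODEL_PATH")
    if not raw:
        raise RuntimeError("IDIO_MODEL_PATH is not set; add it to your .env file.")
    path = Path(raw).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found at IDIO_MODEL_PATH={path}")
    return path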

    return _model_instance


def rephrase(
Collaborator

This rephrase function is trying to do too much: it handles prompt construction, model execution, latency timing, and error handling in one giant block. Please refactor and keep the code minimal; follow the Single Responsibility Principle (ref: Clean Code).
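One way that split could look, sketched against the engine's current public signature; the helper names and the mode-to-system-prompt texts are illustrative, and only the ChatML layout and the default system prompt come from the PR:

import time
from typing import Tuple

SYSTEM_PROMPTS = {
    "professional": "Rewrite the following text in a professional tone.",  # illustrative wording
    "friendly": "Rewrite the following text in a friendly tone.",          # illustrative wording
}

def build_prompt(text: str, mode: str) -> str:
    """Prompt construction only: pick the mode's system prompt and format it as Qwen ChatML."""
    system_prompt = SYSTEM_PROMPTS.get(mode, "Rewrite the following text clearly.")
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

def run_inference(prompt: str) -> str:
    """Model execution only."""
    model = get_model()  # the PR's existing lazy loader
    output = model(prompt, max_tokens=256, stop=["<|im_end|>"])
    return output["choices"][0]["text"].strip()

def rephrase(text: str, mode: str) -> Tuple[str, float]:
    """Orchestration only: build the prompt, run it, report latency; error handling stays with the caller."""
    start = time.perf_counter()
    result = run_inference(build_prompt(text, mode))
    return result, time.perf_counter() - start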

return avg_latency


# For standalone testing
Collaborator

We need to keep our production modules as lean as possible. Please move this entire testing and benchmarking suite out of this file. There is an existing backend/tests directory created specifically for such tasks. Also consider using pytest for assertions and pytest-benchmark for latency tracking.
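A small sketch of what that could look like under backend/tests/ (the file name, sample input, and the IDIO_MODEL_PATH-based skip are illustrative; the benchmark fixture is provided by pytest-benchmark):

# backend/tests/test_llm_engine.py (hypothetical path)
import os

import pytest

from idio.infrastructure.ai.llm_engine import rephrase

# Skip the whole module when no local GGUF model is configured, e.g. in CI.
pytestmark = pytest.mark.skipif(
    not os.getenv("IDIO_MODEL_PATH"), reason="requires a local GGUF model"
)

@pytest.mark.parametrize("mode", ["professional", "friendly", "mirror"])
def test_rephrase_returns_text_and_latency(mode):
    result, latency = rephrase("hey can u send the file", mode)
    assert isinstance(result, str) and result.strip()
    assert latency >= 0

def test_rephrase_latency_benchmark(benchmark):
    # pytest-benchmark runs the call repeatedly and records timing statistics.
    result, _ = benchmark(rephrase, "hey can u send the file", "professional")
    assert result.strip()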

 *.swp
-Thumbs.db (no newline at end of file)
+Thumbs.db
+CommandsForManan.md (no newline at end of file)
Collaborator

It is not good practice to add files like these to .gitignore; it is meant for the whole project and is not specific to a single dev. Please remove this entry and manually make sure you don't commit your commands .md file to git.
