NGen3 Terminal Chat
This project provides a full-featured, locally running terminal interface for interacting with the NGen3 Assistant, a flagship model from the NGen3 Series by TNSA AI. It is designed for users who want a powerful, customizable, offline chat experience with advanced debugging, context control, and VRAM-aware model loading.
- Local Model Runtime: Load merged NGen3 models directly from disk.
- Dynamic Context Management: Automatically trims conversation history to prevent context overflows.
- Custom Stopping Criteria: Supports multi-token stop sequences for clean, controlled output (see the sketch after this list).
- HTML-Safe Output: Responses are filtered to strip unsafe HTML using bleach.
- VRAM Awareness: Displays GPU memory usage before and after model loading.
- Configurable Generation Settings: Temperature, top-p, repetition penalty, and more.
- Debug Mode: Shows full prompts sent to the model for transparent debugging.
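The multi-token stop handling could be built on Hugging Face's StoppingCriteria interface. Below is a minimal sketch, not the script's exact implementation; the stop strings shown are hypothetical placeholders:

```python
import torch
from transformers import AutoTokenizer, StoppingCriteria, StoppingCriteriaList

MERGED_MODEL_PATH = r"C:/NGen3-7B/0625"
tokenizer = AutoTokenizer.from_pretrained(MERGED_MODEL_PATH)

class StopOnSequences(StoppingCriteria):
    """Stop generation once any stop sequence appears at the end of the output."""

    def __init__(self, stop_ids: list[list[int]]):
        self.stop_ids = [torch.tensor(ids) for ids in stop_ids]

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        for stop in self.stop_ids:
            n = stop.shape[0]
            # Compare the last n generated tokens against each stop sequence.
            if input_ids.shape[1] >= n and torch.equal(input_ids[0, -n:].cpu(), stop):
                return True
        return False

# Hypothetical stop strings; encode without special tokens so the IDs match raw text.
stop_strings = ["\nYou:", "\nUser:"]
stop_ids = [tokenizer.encode(s, add_special_tokens=False) for s in stop_strings]
stopping = StoppingCriteriaList([StopOnSequences(stop_ids)])
# Pass to generation: model.generate(..., stopping_criteria=stopping)
```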
Ensure the following Python packages are installed:

```bash
pip install torch transformers bleach
```

Place your merged NGen3 model in a folder and set the path in the script:

```python
MERGED_MODEL_PATH = r"C:/NGen3-7B/0625"
```

Run the script with:

```bash
python terminal_chat.py
```

If the model loads successfully, you will see:
- Model path
- Device information
- Effective max context length
- VRAM usage
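The before/after VRAM readout could be produced with PyTorch's CUDA memory APIs; a minimal sketch:

```python
import torch

def vram_report(label: str) -> None:
    # Report allocated vs. reserved CUDA memory in GiB; no-op message on CPU-only machines.
    if not torch.cuda.is_available():
        print(f"{label}: CUDA not available (running on CPU)")
        return
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"{label}: {allocated:.2f} GiB allocated / {reserved:.2f} GiB reserved / {total:.2f} GiB total")

vram_report("Before model load")
# ... load the model here ...
vram_report("After model load")
```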
Then the chat interface starts:
```text
You: <your message>
NGen3: <model response>
```
Commands:
- exit, quit, bye → Quit the program
- clear, /clear → Reset full conversation history
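A minimal sketch of a REPL loop implementing these commands; `generate_reply` is a hypothetical helper standing in for the script's actual generation call:

```python
def chat_loop() -> None:
    history: list[dict] = []  # running conversation as {"role", "content"} messages
    while True:
        user = input("You: ").strip()
        if user.lower() in ("exit", "quit", "bye"):
            break
        if user.lower() in ("clear", "/clear"):
            history.clear()
            print("(conversation history reset)")
            continue
        history.append({"role": "user", "content": user})
        reply = generate_reply(history)  # hypothetical helper wrapping model.generate
        history.append({"role": "assistant", "content": reply})
        print(f"NGen3: {reply}")
```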
The script:
- Uses a system prompt defining NGen3 Assistant
- Manages token budget dynamically (one possible trimming policy is sketched after this list)
- Supports streaming responses
- Cleans unwanted stop sequences and special tokens
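The token-budget trimming mentioned above could work like this; the helper name, the `reserve` headroom, and the prompt layout are illustrative assumptions, and `tokenizer` is assumed to be the loaded NGen3 tokenizer:

```python
def trim_history(history: list[dict], tokenizer, max_ctx: int, reserve: int = 512) -> list[dict]:
    """Drop the oldest non-system turns until the prompt fits the context window.

    `reserve` leaves headroom for the model's reply; the real script's
    budgeting may differ, this is only one plausible policy.
    """
    def prompt_tokens(msgs: list[dict]) -> int:
        text = "\n".join(f"{m['role']}: {m['content']}" for m in msgs)
        return len(tokenizer.encode(text))

    while len(history) > 1 and prompt_tokens(history) > max_ctx - reserve:
        del history[1]  # keep the system prompt at index 0; evict the oldest turn
    return history
```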
Hardware notes:
- Ensure the GPU has enough VRAM for the merged model
- Prefer BF16 when supported; fall back to FP16 or FP32 based on device (see the dtype sketch below)
- On CPU, inference will be slow
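A minimal dtype-selection sketch following that policy (the script's actual logic may differ):

```python
import torch

def pick_dtype(device: torch.device) -> torch.dtype:
    # BF16 on GPUs that support it, FP16 on other GPUs, FP32 on CPU.
    if device.type == "cuda" and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    if device.type == "cuda":
        return torch.float16
    return torch.float32

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = pick_dtype(device)
# e.g. AutoModelForCausalLM.from_pretrained(MERGED_MODEL_PATH, torch_dtype=dtype)
```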
When exiting, the script:
- Clears CUDA cache
- Frees memory
- Resets model and tokenizer
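An illustrative shutdown sequence matching that behavior (assumed, not the script's exact code):

```python
import gc
import torch

# Drop the last references so Python can reclaim host memory...
model = None
tokenizer = None
gc.collect()

# ...then return cached allocator blocks to the GPU driver.
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```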
This script is ideal for:
- Offline inference
- Debugging model behavior
- Building local AI tools on top of NGen3
- Testing merged or fine-tuned models
It gives you powerful control over your model's inputs, outputs, and system behavior, all inside a simple terminal UI.