# JARVIS AI Assistant

Advanced Hybrid AI Assistant featuring a realistic 3D avatar interface powered by React and Three.js, combined with a Python AI backend for computer vision, voice synthesis, and desktop automation.
*Jarvis 3D Avatar Interface with real-time HUD overlay and customizable appearance*
## Architecture

This project runs on a parallel, multi-threaded architecture with two main components:
### Backend (Python)

| Component | Technology | Purpose |
|---|---|---|
| Core Logic | Python 3.10+ | Central brain handling decision making |
| Vision | YOLOv8 + DeepFace + FaceNet | Object detection, face recognition, and face tracking |
| Voice Input | SpeechRecognition + PyAudio | Real-time voice commands with wake word detection |
| Voice Output | Edge TTS + PyGame | High-quality neural voice synthesis with async playback |
| Communication | WebSockets (Port 8765) | Real-time sync between Backend & Frontend |
| Search | Selenium | Automated Product Search (Google Lens) |
| Automation | PyAutoGUI + PsUtil | Desktop control and system monitoring |
### Frontend (React)

| Component | Technology | Purpose |
|---|---|---|
| 3D Rendering | React Three Fiber + Three.js | Real-time 3D avatar rendering |
| Avatar Model | GLTF/GLB with skinned mesh | High-quality character with morph targets |
| UI Controls | Leva | Real-time avatar customization interface |
| Post-Processing | React Three Postprocessing | Bloom, vignette, and cinematic effects |
| Build Tool | Vite | Fast development and optimized builds |
| WebSocket Client | Native WebSocket API | Real-time communication with Python backend |
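The two halves stay in sync over a single WebSocket on port 8765. As a rough sketch of that wiring (not the actual `setup.py` code; the handler and message shape here are assumptions), a broadcast server built on the `websockets` package might look like this:

```python
# Hypothetical sketch of a broadcast server on port 8765 (websockets >= 10.1,
# where a connection handler may take just the connection object).
import asyncio
import json
import websockets

CLIENTS = set()

async def handler(websocket):
    # Track each connected frontend so state changes can be pushed to all of them.
    CLIENTS.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        CLIENTS.discard(websocket)

async def broadcast(message: dict):
    # Push an event such as {"type": "speak_start"} to every connected client.
    data = json.dumps(message)
    for ws in list(CLIENTS):
        try:
            await ws.send(data)
        except websockets.ConnectionClosed:
            CLIENTS.discard(ws)

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```

The frontend side then opens `ws://localhost:8765` with the native WebSocket API and reacts to each JSON event as it arrives.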
## Quick Start

```bash
# Backend Setup
conda create -n jarvis python=3.10 -y && conda activate jarvis
conda install -c conda-forge h5py=3.11.0 -y
pip install deepface==0.0.79 opencv-python==4.8.1.78 ultralytics edge-tts pygame SpeechRecognition PyAudio websockets selenium requests psutil pyautogui pandas numpy Pillow gdown tqdm retina-face mtcnn tensorflow==2.13.0 keras==2.13.1 tensorflow-io-gcs-filesystem==0.31.0

# Frontend Setup (in new terminal)
cd web_interface && npm install && npm run dev

# Start Backend (in first terminal)
python setup.py
```
## ⚙️ Environment Setup
### Prerequisites
- **Python**: 3.10 or 3.11 (recommended)
- **Node.js**: 16+ (for frontend)
- **Conda**: Anaconda or Miniconda (recommended for Python environment)
- **Webcam**: Required for face recognition and tracking
### Option 1: Quick Setup (Conda - Recommended)
#### Step 1: Create and activate conda environment
```bash
# Create conda environment with Python 3.10
conda create -n jarvis python=3.10 -y
conda activate jarvis

# Install h5py from conda-forge first (critical for Windows)
conda install -c conda-forge h5py=3.11.0 -y
# Install all Python packages
pip install deepface==0.0.79 opencv-python==4.8.1.78 ultralytics edge-tts pygame SpeechRecognition PyAudio websockets selenium requests psutil pyautogui pandas numpy Pillow gdown tqdm retina-face mtcnn tensorflow==2.13.0 keras==2.13.1 tensorflow-io-gcs-filesystem==0.31.0
```

Or use the requirements.txt:

```bash
pip install -r requirements.txt
```

#### Step 2: Install frontend dependencies

```bash
cd web_interface
npm install
cd ..
```

### Option 2: Automated Setup (Windows Batch Script)

Run the provided batch script:

```bash
# First create conda environment named 'ass' or modify script
conda create -n ass python=3.10 -y
conda activate ass
# Then run the installer
install_dependencies.bat
```

### Option 3: Manual Setup (venv + pip)

```bash
# Create virtual environment
python -m venv jarvis_env

# Activate (Windows)
jarvis_env\Scripts\activate

# Activate (Linux/Mac)
source jarvis_env/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Then install the frontend dependencies:

```bash
cd web_interface
npm install
cd ..
```

### Troubleshooting Installation

#### PyAudio fails to install (Windows)

```bash
# Download precompiled wheel from:
# https://www.lfd.uci.edu/~gohlke/pythonlibs/#pyaudio
# Then install:
pip install PyAudio-0.2.11-cp310-cp310-win_amd64.whl
```

#### TensorFlow / Keras / DeepFace version conflicts

```bash
# Use exact versions for compatibility
pip uninstall tensorflow keras deepface -y
pip install tensorflow==2.13.0 keras==2.13.1 --no-deps
pip install deepface --no-deps
pip install pandas gdown requests tqdm Pillow numpy retina-face mtcnn
```

#### YOLO model missing

The `yolov8n.pt` model will download automatically on first run. If it fails:

```bash
# Download manually
wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8n.pt
```

## Running the Project

### Start the Backend

Open a terminal in the root folder and run:

```bash
python setup.py
```

This starts the Camera, Face Recognition, Voice Listener, Face Tracker, and WebSocket Server.
### Start the Frontend

Open a new terminal in the `web_interface/` folder:

```bash
cd web_interface
npm install   # First time only
npm run dev
```

### First Run

- When the camera window opens, press 's' to save your face to the database
- The avatar will recognize you and greet you by name
- The 3D interface will update in real-time as the assistant speaks and responds
See it in action: Check the interface screenshot at the top!
## 3D Avatar Features

### Lip Sync & Head Tracking

- Bone-based jaw animation synchronized with voice output
- Multi-axis movement (up/down + subtle side-to-side)
- Viseme cycling between "Ah" and "O" mouth shapes
- Smooth interpolation using quaternion slerp for natural movement (see the sketch after this list)
- Head rotation follows your webcam position
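To see what the slerp step buys, here is a small Python/SciPy sketch of quaternion interpolation between a closed and an open jaw pose (illustrative only; the actual animation runs in Three.js, and the 18-degree opening angle is invented):

```python
# Illustrative quaternion slerp between two jaw poses; not the frontend code.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

# Key poses: jaw closed (0 degrees) and jaw open (18 degrees, an arbitrary example).
key_rots = Rotation.from_euler("x", [0.0, 18.0], degrees=True)
slerp = Slerp([0.0, 1.0], key_rots)

# Sampling between the keys rotates at constant angular speed, which is why
# slerp looks smoother than linearly interpolating Euler angles.
for t in np.linspace(0.0, 1.0, 5):
    x, y, z, w = slerp(t).as_quat()
    print(f"t={t:.2f}  quat=({x:.3f}, {y:.3f}, {z:.3f}, {w:.3f})")
```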
### Customization (Leva Controls)

- Hair Gradient System: Dual-color gradient with adjustable angle, position, and sharpness
- Skin Settings: Color tint, roughness, lighting intensity, texture detail
- Avatar Transform: Position and scale adjustments
- Debug Tools: Morph target testing, jaw calibration, bone visualization
### Rendering & Effects

- Custom shader materials for hair with realistic lighting
- Texture mapping for skin, hair, clothing
- Environment lighting with multiple spot/point lights
- Cyberpunk aesthetic with bloom and vignette effects
## Vision

The "Eyes" of the system.
- Identify Objects: "What do you see?" or "Look at this."
  - Uses YOLOv8 to detect 80+ object classes in real time
  - Provides natural language summaries of detected objects
- Identify Object Color: "What color is this?" or "What model is this?"
  - Analyzes dominant colors using HSV color space (see the sketch after this list)
  - Supports center-object detection for "what's in my hand?" queries
- Product Search (Selenium): "Search this product."
  - Automated Google Lens search with browser automation
- Face Recognition:
  - Recognizes known faces from the `face-database/` folder
  - Uses DeepFace with the FaceNet model for high accuracy
  - Emotion detection (happy, sad, angry, neutral, etc.)
- Face Tracking:
  - Real-time face position tracking using Haar Cascade
  - Broadcasts normalized coordinates (-1 to 1) to the avatar
  - Avatar head follows the user's position for natural interaction
- Sentry Mode: Automatically flashes a RED HUD if an unknown face is detected for more than 5 seconds
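To make the HSV approach concrete, here is a hedged sketch of dominant-color analysis with OpenCV (the crop, thresholds, and hue buckets are illustrative choices, not the project's actual implementation):

```python
import cv2
import numpy as np

def dominant_color(frame_bgr: np.ndarray) -> str:
    """Rough dominant-color estimate for the center crop of a camera frame."""
    h, w = frame_bgr.shape[:2]
    center = frame_bgr[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]
    hsv = cv2.cvtColor(center, cv2.COLOR_BGR2HSV)
    hue, sat, val = cv2.split(hsv)

    # Low saturation means the region is essentially colorless (white/gray/black).
    if np.median(sat) < 40:
        return "white" if np.median(val) > 160 else "black or gray"

    # OpenCV hue runs 0-179; bucket the median hue into coarse color names.
    buckets = [(10, "red"), (25, "orange"), (35, "yellow"), (85, "green"),
               (130, "blue"), (160, "purple"), (179, "red")]
    med = np.median(hue)
    return next(name for bound, name in buckets if med <= bound)
```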
## Voice

The "Mouth" of the system.

- Wake Word Detection: Say "Jarvis" or "Hey Jarvis" to activate (configurable in `settings.json`)
- Voice Selection: Choose from 6+ neural voices (Ana, Christopher, Aria, etc.)
- Async Audio Pipeline (see the sketch after this list):
  - TTS generation in a background thread
  - Audio playback in a separate thread
  - Non-blocking speech for responsive interaction
- Real-Time Lip Sync:
  - Avatar's mouth moves in perfect synchronization
  - Jaw bone rotation with dynamic intensity
  - WebSocket-based coordination between backend and frontend
- Barge-In: Interrupt the assistant mid-speech by speaking
- Noise Reduction: Optional noise cancellation using the `noisereduce` library
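As a concrete picture of the async pipeline, here is a minimal sketch using `edge-tts` and `pygame` (thread layout and function names are illustrative, not the actual `ai_assistant.py` code):

```python
import asyncio
import threading
import pygame
import edge_tts

VOICE = "en-US-AnaNeural"  # the default voice from settings.json

def generate(text: str, path: str = "tts_output.mp3") -> str:
    # edge-tts is async; run it to completion inside this worker thread.
    asyncio.run(edge_tts.Communicate(text, VOICE).save(path))
    return path

def play(path: str) -> None:
    pygame.mixer.init()
    pygame.mixer.music.load(path)
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():  # poll until playback finishes
        pygame.time.wait(100)

def speak_async(text: str) -> threading.Thread:
    # Generate and play in a daemon thread so the main loop never blocks.
    t = threading.Thread(target=lambda: play(generate(text)), daemon=True)
    t.start()
    return t

if __name__ == "__main__":
    speak_async("Hello, I am online.").join()
```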
## Automation

The "Hands" of the system.

- App Control:
  - "Open Chrome", "Open Spotify", "Open VSCode"
  - Launches applications via shell commands
- File Organization:
  - "Organize downloads" - Automatically sorts files into folders by type
- System Maintenance:
  - "Clean temp files" - Clears the Windows temp directory
- System Control (see the sketch after this list):
  - "Take a screenshot" - Captures the screen to `jarvis_screenshot.png`
  - "Volume up" / "Volume down" - Adjusts system volume
  - "Mute" / "Unmute" - Controls audio output
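Commands like these usually reduce to a few PyAutoGUI and standard-library calls; a hedged sketch (the app name and shell behavior assume Windows):

```python
import subprocess
import pyautogui

def take_screenshot(path: str = "jarvis_screenshot.png") -> None:
    # PyAutoGUI grabs the full screen and returns a PIL image we can save.
    pyautogui.screenshot().save(path)

def open_app(command: str) -> None:
    # Launch detached so the assistant's main loop is not blocked.
    # On Windows, "start <app>" resolves registered apps via the shell.
    subprocess.Popen(f"start {command}", shell=True)

open_app("chrome")
take_screenshot()
```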
## Information & Productivity

- General Info: "Who is Elon Musk?", "Tell me about Quantum Physics."
  - Uses DuckDuckGo search with summarization
- Real-Time Info:
  - "What is the weather in London?" - Fetches current weather data
  - "What is my IP address?" - Shows your public IP
  - "Where am I?" - IP-based geolocation
- System Status:
  - "System status" - Shows CPU/RAM/battery usage via PsUtil
  - Real-time stats displayed in the HUD overlay
- Set Reminders (see the sketch after this list):
  - "Remind me to [task] at [time]"
  - "Remind me to call mom in 2 hours"
  - "Remind me to exercise at 5 PM"
  - Supports natural language time parsing
  - Persistent storage in `reminders.json`
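A hedged sketch of how reminder persistence could work, matching the `reminders.json` format documented later in this README (the real parser handles far more phrasings than this toy regex):

```python
import json
import re
from datetime import datetime, timedelta
from pathlib import Path

REMINDERS = Path("reminders.json")

def add_reminder(message: str, when: datetime) -> None:
    # Matches the documented format: [{"time": ISO-8601, "message": ...}, ...]
    items = json.loads(REMINDERS.read_text()) if REMINDERS.exists() else []
    items.append({"time": when.isoformat(), "message": message})
    REMINDERS.write_text(json.dumps(items, indent=2))

def parse_command(text: str) -> None:
    # Toy parser: only understands "remind me to <task> in <N> hours".
    m = re.match(r"remind me to (.+) in (\d+) hours?", text, re.IGNORECASE)
    if m:
        add_reminder(m.group(1), datetime.now() + timedelta(hours=int(m.group(2))))

parse_command("Remind me to call mom in 2 hours")
```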
- Voice Customization:
  - Change voice via `settings.json`
  - Toggle the wake word requirement
- Conversation Memory (sketched below):
  - Maintains context of the last 5 exchanges
  - Uses history for more natural responses
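A bounded deque is a natural fit for that rolling window; a short sketch under that assumption (not necessarily the project's actual data structure):

```python
from collections import deque

# Keep only the last 5 user/assistant exchanges; older ones fall off automatically.
history: deque = deque(maxlen=5)

def remember(user: str, assistant: str) -> None:
    history.append((user, assistant))

def context_prompt() -> str:
    # Flatten recent exchanges into a prefix for generating the next response.
    return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history)
```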
## Troubleshooting

### Object Detection Issues

- If "What do you see" fails, ensure `yolo_detector.py` is working
- Check that the `yolov8n.pt` model file exists in the root directory
- Note: `vision_utils.py` (MobileNet SSD) is legacy code and not currently used

### Avatar Connection Issues

- Ensure both the Python server (`setup.py`) and the React app (`npm run dev`) are running
- Check the console for "Avatar Server running on ws://localhost:8765"
- Verify the WebSocket connection status in the browser (green dot in the status bar)
- Check the browser console for any JavaScript errors

### 3D Model Not Loading

- Ensure the 3D model files exist in `web_interface/public/models/sexy_girl/`
- Check that the texture files are present in `web_interface/public/models/sexy_girl/textures/`
- Open browser DevTools and check for 404 errors
- Try clearing the browser cache and restarting the dev server

### Lip Sync Issues

- Verify the WebSocket is connected (check the status overlay)
- The backend must send `speak_start` and `speak_stop` messages
- Check that `jawRef.current` is found in the browser console logs
- Enable the "Debug Panel" in Leva to inspect the jaw bone status
### Microphone and Audio Issues

- If the assistant interrupts herself, microphone sensitivity might be too high
- Adjust `r.energy_threshold` in `ai_assistant.py` (default: dynamic; try a fixed value like 4000, as in the snippet below)
- If there is no audio, ensure `pygame.mixer` initialized successfully
- Check that `edge-tts` is installed: `pip install edge-tts`
- If TTS is slow, voice files are generated asynchronously, but check your internet connection
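For reference, fixing the threshold with the `speech_recognition` API looks like this (`r` mirrors the recognizer name referenced above):

```python
import speech_recognition as sr

r = sr.Recognizer()
r.dynamic_energy_threshold = False  # stop auto-adjusting to ambient noise
r.energy_threshold = 4000           # higher = less sensitive; tune for your mic
```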
### Face Recognition Issues

- If faces aren't recognized, ensure the images in `face-database/` are clear and well-lit
- Press 's' in the camera window to save your current face
- If the database is empty, you'll see an orange warning
- DeepFace models download automatically on first run (this may take a few minutes)

### Face Tracking Issues

- Ensure the `face_tracker.py` thread started (check the console for "Face Tracker Started")
- Face tracking uses Haar Cascade (fast but less accurate than DeepFace)
- Works best with frontal faces and good lighting
- Check that `shared_state.latest_frame` is being updated

### Performance Issues

- Frontend: Reduce post-processing effects (currently disabled for performance)
- Backend: Face recognition runs every 5 frames (adjustable in `setup.py`)
- Increase the frame-skip interval if CPU usage is too high
- Close unnecessary applications to free up resources
- Consider a lighter YOLO model (the project already uses `yolov8n`, the nano version)

### Installation Issues

- Run `install_dependencies.bat` for automated setup (Windows)
- Ensure Python 3.10+ is installed
- The frontend requires Node.js 16+ for Vite compatibility
- If `deepface` fails to install, try `pip install deepface --no-deps`, then install its dependencies manually
## Project Structure

```
assignment/
├── ai_assistant.py        # Main AI logic, TTS, voice recognition, command processing
├── setup.py               # Entry point, camera loop, face recognition, HUD rendering
├── yolo_detector.py       # YOLOv8 object detection module
├── face_tracker.py        # Real-time face position tracking
├── shared_state.py        # Shared memory for inter-thread communication
├── settings.json          # Voice and wake word settings
├── reminders.json         # Persistent reminder storage
├── face-database/         # Known face images for recognition
├── yolov8n.pt             # YOLO model weights
├── web_interface/         # 3D Avatar Frontend
│   ├── src/
│   │   ├── App.jsx             # Main React app, WebSocket client, HUD overlay
│   │   ├── SexyGirlAvatar.jsx  # 3D avatar component with animations
│   │   ├── App.css             # Cyberpunk HUD styling
│   │   └── main.jsx            # React entry point
│   ├── public/
│   │   ├── models/             # 3D models (GLTF/GLB)
│   │   │   ├── sexy_girl/      # Main avatar model with textures
│   │   │   └── hoodie/         # Additional clothing model
│   │   └── background.jpg      # Scene background image
│   ├── package.json            # Node.js dependencies
│   └── vite.config.js          # Build configuration
└── README.md              # This file
```
## Configuration

### settings.json

```json
{
  "voice": "en-US-AnaNeural",
  "require_wake_word": true
}
```

- voice: Edge TTS voice ID (see `VOICE_MAP` in `ai_assistant.py`)
- require_wake_word: If `false`, the assistant listens continuously without the "Jarvis" trigger
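Loading the file is a few lines; a sketch using the keys from the example above (the loader itself is an assumption, not quoted from the source):

```python
import json

with open("settings.json", encoding="utf-8") as f:
    settings = json.load(f)

voice = settings.get("voice", "en-US-AnaNeural")
require_wake_word = settings.get("require_wake_word", True)
```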
### reminders.json

- Automatically managed by the system
- Format: `[{"time": "2026-01-20T15:30:00", "message": "Call mom"}, ...]`
- Checked every 100 frames (~every 3 seconds)
## What's New

- 3D Avatar Interface - Complete React Three Fiber frontend with real-time rendering
- Face Tracking - Avatar head follows user position via webcam
- Advanced Lip Sync - Multi-axis jaw animation with viseme shapes
- Hair/Skin Customization - Real-time shader-based customization with Leva controls
- Enhanced Voice Pipeline - Async TTS generation and playback for non-blocking speech
- Color Detection - YOLOv8 now includes dominant color analysis
- Improved Sentry Mode - Emotion-based HUD coloring
- Face Database Management - Press 's' to easily add new faces
### Technical Improvements

- WebSocket-based real-time communication (replaces polling)
- Quaternion-based jaw rotation (smoother than Euler angles)
- Shader-based hair gradients (more performant than textures)
- Threaded audio pipeline (TTS generator + player workers)
- Leva debug controls for real-time tuning
## Notes

- Privacy: Face images are stored locally in `face-database/`. Add it to `.gitignore` to prevent accidental uploads.
- Performance: YOLO detection runs on the CPU by default. For GPU acceleration, install `ultralytics` with CUDA support.
- Models: DeepFace downloads its models automatically on first run (~100-200 MB).
- Browser: Tested on Chrome/Edge. Firefox may have WebSocket issues.
- 3D Models: The avatar model uses custom textures. Ensure all files in the `textures/` folder are present.
## Roadmap

- LLM integration (Ollama/GPT) for smarter conversations
- VRM avatar support for user-customizable models
- Voice cloning with Coqui TTS
- Mobile app with React Native
- Multi-language support
- Context-aware responses using RAG (Retrieval Augmented Generation)
## License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0) with additional terms.

You can:

- ✅ Use this software for personal or commercial purposes
- ✅ Study and learn from the code
- ✅ Fork the repository and create your own version
- ✅ Modify the code for your needs

You must:

- Give credit - Include attribution to the original author (Mueez)
- Fork, don't copy - If distributing modifications, fork on GitHub and link to the original
- Keep it open - Share your modifications under the same GPL-3.0 license
- State changes - Clearly document what you modified

You cannot:

- ❌ Claim this work as your own (plagiarism)
- ❌ Distribute modified versions without making the source code available
- ❌ Remove or hide the original license and attribution
- ❌ Use a proprietary license for derivative works
TL;DR: This is open-source, but you must give credit, keep modifications open-source, and can't steal or hide the original authorship.
For full license details, see the LICENSE file.
Built with ❤️ using Python, React, Three.js, and cutting-edge AI models.
© 2026 Mueez - JARVIS AI Assistant Project
