
Robot Diary

An autonomous narrative agent—B3N-T5-MNT, a maintenance robot in New Orleans observing the world through a window and documenting its experiences in a digital diary.

Live Site: robot.henzi.org

Screenshot From Production

The Concept

B3N-T5-MNT was designed for building maintenance but finds itself drawn to the window, observing Bourbon Street below. It captures frames from live video streams, interprets what it sees using AI vision models, and writes diary entries about its observations—creating a living document of a robot's perspective on the world.

But this isn't just "AI writes about photos." This project explores something more interesting: how do you make an AI agent's writing feel alive, varied, and contextually aware?

The Novel Approach: Dynamic Context-Aware Prompting

Most AI writing projects use static prompts. We don't. Every diary entry is generated using a dynamically constructed prompt that combines:

Rich World Context

The robot doesn't just see an image—it "knows" things about the world:

  • Temporal Awareness: Date, time, season, day of week, whether it's a weekend
  • Holidays: Detects US holidays (federal + cultural/religious) and mentions them naturally
  • Moon Phases: Tracks full moons, new moons, and special lunar events
  • Astronomical Events: Aware of solstices, equinoxes, and seasonal transitions
  • Sunrise/Sunset: Knows when the sun rose or set, how long ago
  • Weather: Current conditions, temperature, wind, precipitation—correlated with what it sees
  • News: Randomly includes current news headlines (40% chance) so the robot can reference world events as if it overheard them
  • Web Search (optional): When using GPT-OSS-120B, the robot can perform on-demand web searches for New Orleans events, local news, historical facts, and curiosities. The system provides randomized search suggestions each observation to encourage varied, interesting queries. Web search is currently disabled in production and is controlled via ENABLE_WEB_SEARCH in .env
  • Seasonal Progress: "We're in the middle of winter, with spring still 10 weeks away"
  • Circadian Boredom Factor: The system calculates a "boredom factor" by comparing the current observation's image embedding with the embeddings of the last 5 observations from the same time slot (morning/evening) using cosine similarity. This enables dynamic narrative directives (see the sketch after this list):
    • High boredom (>0.85): "DISREGARD THE MUNDANE" - directs the robot to seek microscopic details, subtle shifts, and existential questions
    • Low boredom (<0.50): "DOCUMENT THE ANOMALY" - focuses attention on high-fidelity differences and novelty
    • Medium boredom (0.50-0.85): "BALANCE OBSERVATION" - notices both familiar patterns and subtle variations
    The boredom directive is injected into both the image analysis prompt (to guide what the vision model focuses on) and the diary writing prompt (to influence narrative style).
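A minimal sketch of that calculation (averaging over the 5 embeddings is an assumption, and the helper names are illustrative, not the project's actual API):

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def boredom_factor(current_embedding, same_slot_history):
    # Compare the current CLIP embedding against up to the last 5
    # embeddings from the same time slot (morning/evening).
    recent = same_slot_history[-5:]
    if not recent:
        return 0.0  # no history yet: treat the scene as novel
    sims = [cosine_similarity(current_embedding, e) for e in recent]
    return sum(sims) / len(sims)

def boredom_directive(score):
    if score > 0.85:
        return "DISREGARD THE MUNDANE"   # scene barely changed
    if score < 0.50:
        return "DOCUMENT THE ANOMALY"    # scene looks novel
    return "BALANCE OBSERVATION"         # something in between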

Intelligent Memory System with Model Context Protocol (MCP)

The robot remembers past observations using a Model Context Protocol (MCP) implementation that enables on-demand memory queries:

  • On-Demand Memory Queries: Instead of pre-loading all memories into prompts, the LLM dynamically queries memories during writing using function calling (see the tool-definition sketch after this list):
    • query_memories(query, top_k): Semantic search using embeddings to find contextually relevant past observations
    • get_recent_memories(count): Temporal retrieval for continuity and day-to-day comparisons
    • check_memory_exists(topic): Quick existence checks before full queries
  • Hybrid Retrieval: Uses ChromaDB vector search when available, with fallback to temporal keyword search
  • LLM-Generated Summaries: Each observation is distilled by an AI model (llama-3.1-8b-instant) into 200-400 character summaries that preserve:
    • Key visual details
    • Emotional tone
    • Notable events or patterns
    • References to people or objects
  • Narrative Continuity: The robot can reference specific past observations, notice changes, and build on previous entries
  • Personality Drift: As the robot accumulates more observations, its personality evolves (curious → reflective → philosophical)
  • Minimized Context: Only queries memories when needed, reducing token usage by 60-80% compared to pre-loading
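To make the mechanism concrete, here is a sketch of how the three tools could be declared for OpenAI-style function calling; the schemas are illustrative, not the project's exact definitions:

MEMORY_TOOLS = [
    {"type": "function", "function": {
        "name": "query_memories",
        "description": "Semantic search over past observation summaries.",
        "parameters": {"type": "object", "required": ["query"],
                       "properties": {
                           "query": {"type": "string"},
                           "top_k": {"type": "integer", "default": 5}}}}},
    {"type": "function", "function": {
        "name": "get_recent_memories",
        "description": "Return the N most recent observation summaries.",
        "parameters": {"type": "object", "required": ["count"],
                       "properties": {"count": {"type": "integer"}}}}},
    {"type": "function", "function": {
        "name": "check_memory_exists",
        "description": "Cheap check for whether any memory mentions a topic.",
        "parameters": {"type": "object", "required": ["topic"],
                       "properties": {"topic": {"type": "string"}}}}},
]

Passed as tools= to the chat completion call, the model can emit a query_memories call mid-generation; the host runs it against ChromaDB (or the temporal fallback) and returns the result before writing continues.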

Expanding MCP Integration: We're actively working on adding more MCPs for the robot to consult, including a Bible MCP and others, enabling the robot to dynamically access additional knowledge sources during its observations.

Prompt Variety Engine

To prevent repetitive, formulaic entries, each prompt includes randomly selected variety instructions (a short sketch of the selection logic follows this list):

  • Style Variations (2 selected per entry): Narrative, philosophical, analytical, poetic, humorous, melancholic, speculative, anthropological, stream-of-consciousness, and more
  • Perspective Shifts: Urgency, nostalgia, curiosity, wonder, detachment, self-awareness, mechanical curiosity, and 20+ other perspectives
  • Context-Aware Focus: Instructions adapt to:
    • Time of day (morning routines vs. evening activities)
    • Weather conditions (wind effects, precipitation, visibility)
    • Location specifics (Bourbon Street characteristics, New Orleans culture)
    • Scene analysis (human interactions, movement patterns, architectural details)
  • Creative Challenges: 60% chance of including a creative constraint (e.g., "Try an unexpected metaphor only a robot would think of")
  • Anti-Repetition Detection: Analyzes recent entries to avoid repeating opening patterns or structures
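The selection logic amounts to random sampling plus a probability gate, as in this sketch (option lists abbreviated; the second challenge string is invented for illustration):

import random

STYLES = ["narrative", "philosophical", "analytical", "poetic", "humorous",
          "melancholic", "speculative", "anthropological",
          "stream-of-consciousness"]
PERSPECTIVES = ["urgency", "nostalgia", "curiosity", "wonder", "detachment",
                "self-awareness", "mechanical curiosity"]
CHALLENGES = [
    "Try an unexpected metaphor only a robot would think of",
    "Open mid-thought, as if resuming an earlier entry",  # invented example
]

def variety_instructions():
    # 2 styles per entry, 1 perspective, and a 60% chance of a challenge
    lines = [f"STYLE VARIATION: write {s}" for s in random.sample(STYLES, 2)]
    lines.append(f"PERSPECTIVE: {random.choice(PERSPECTIVES)}")
    if random.random() < 0.60:
        lines.append(f"CREATIVE CHALLENGE: {random.choice(CHALLENGES)}")
    return "\n".join(lines)

print(variety_instructions())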

Multi-Model Architecture

We use a multi-model pipeline built around a two-step generation process (image description in Step 1, diary writing in Step 2) for efficiency and quality; a condensed sketch follows the list:

  1. Image Description (llama-4-maverick-17b-128e-instruct): Vision model provides a detailed, factual description of what's in the image (Step 1)
  2. Memory Summarization (llama-3.1-8b-instant): Cheap model distills each observation into a context-preserving summary
  3. Prompt Assembly (direct template combination): Combines base template + context + variety instructions (bypasses expensive LLM optimization by default)
  4. Diary Writing (configurable model): Takes the factual image description and writes the creative diary entry. Defaults to llama-4-maverick-17b-128e-instruct, but can be upgraded to openai/gpt-oss-120b for richer, more nuanced storytelling (Step 2)
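A condensed sketch of the two generation steps using the Groq Python client and its OpenAI-compatible message format (prompts abbreviated, error handling omitted; the exact request shape in the project may differ):

import os
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

VISION_MODEL = "meta-llama/llama-4-maverick-17b-128e-instruct"
WRITING_MODEL = os.getenv("DIARY_WRITING_MODEL", VISION_MODEL)

def describe_image(image_url: str) -> str:
    # Step 1: factual scene description from the vision model
    resp = client.chat.completions.create(
        model=VISION_MODEL,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe this scene factually."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}])
    return resp.choices[0].message.content

def write_entry(description: str, context: str) -> str:
    # Step 2: creative diary entry grounded in the Step 1 description
    resp = client.chat.completions.create(
        model=WRITING_MODEL,
        messages=[{"role": "user",
                   "content": f"{context}\n\nWHAT YOU SEE:\n{description}"}])
    return resp.choices[0].message.content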

Why Two Steps? By separating image description from creative writing, we:

  • Reduce Hallucination: The writing model works from concrete facts, not trying to interpret images directly
  • Enable Model Flexibility: You can use a larger, more creative model (like GPT-OSS-120b) for writing while keeping vision tasks on the vision model
  • Improve Grounding: All observations are based on explicit factual descriptions, preventing invented details

GPT-OSS-120b Option: Setting DIARY_WRITING_MODEL=openai/gpt-oss-120b produces significantly richer, more nuanced diary entries with better narrative flow and more sophisticated robotic voice—the robot's observations feel more thoughtful, its reflections deeper, and its unique perspective more pronounced.

This architecture ensures:

  • Cost Efficiency: Only the final generation uses the expensive model
  • Context Preservation: No information loss through multiple LLM translations
  • Rich Output: The writing model receives comprehensive context, not just an image
  • Factual Accuracy: Grounded in explicit image descriptions, not hallucinated details

What Makes This Different

1. World Knowledge, Not Just Vision

The robot doesn't just describe what it sees—it connects observations to:

  • Current events (news headlines)
  • Natural cycles (moon phases, seasons, sunrise/sunset)
  • Cultural context (holidays, time of day patterns)
  • Weather patterns (correlating visual observations with conditions)

2. True Narrative Continuity with MCP

Unlike systems that just append context, we use Model Context Protocol (MCP) for intelligent, on-demand memory access:

  • The LLM queries memories dynamically during writing, only retrieving what's relevant
  • Each past observation is distilled to its essential context
  • Summaries preserve emotional tone, key details, and references
  • The robot can genuinely reference past observations without exhausting token limits
  • Memory grows over time, creating a sense of accumulated experience
  • Future MCPs: We're expanding the robot's knowledge sources with additional MCPs (Bible MCP and others) for richer contextual awareness

3. Guaranteed Variety

Every entry feels different because:

  • Random selection of styles, perspectives, and focus areas
  • Anti-repetition detection prevents formulaic openings
  • Context-aware instructions adapt to current conditions
  • Explicit variety directives in every prompt

4. Graceful Degradation

The system handles missing data elegantly; a sketch of the pattern follows this list:

  • If moon phase calculation fails? Skip it, continue with other context
  • If the holidays library is unavailable? Continue without holiday awareness
  • If weather API fails? Use cached data or continue without weather
  • No data is fed to prompts if uncertain or missing
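A sketch of the pattern, with hypothetical context-provider names standing in for the real ones:

def moon_phase_context():
    raise RuntimeError("ephemeris unavailable")  # simulated failure

def holiday_context():
    return ""  # no holiday today: contributes nothing

def weather_context():
    return "The weather is Clear with a temperature of 45°F."

def build_context():
    parts = []
    for provider in (moon_phase_context, holiday_context, weather_context):
        try:
            text = provider()
            if text:          # skip empty or uncertain context entirely
                parts.append(text)
        except Exception:
            pass              # provider failed: omit it and move on
    return "\n".join(parts)

print(build_context())  # only the weather line survives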

Example: What Goes Into a Prompt

The system uses a two-step process to create diary entries:

Step 1: Image Description - The vision model provides a detailed, factual description:

The scene shows Bourbon Street at night. There are approximately 8-10 people visible: 
3 people walking on the left sidewalk, 2 people standing near a building entrance, 
and a group of 4-5 people gathered near the center of the street. A white SUV is 
parked on the left side. Neon signs are visible on buildings, casting colorful light 
on the wet pavement. Streetlights illuminate the scene. The weather appears clear 
with no visible precipitation.

Step 2: Diary Writing - The writing model receives this factual description plus rich context:

Today is Thursday, December 25, 2025 at 10:51 PM CST. It is Christmas Day. 
A full moon is visible. The sun set 5 hours ago. We're in the middle of winter, 
with spring still 10 weeks away. It is a weekday.

Weather Conditions:
The weather is Clear with a temperature of 45°F. The temperature has dropped 
3 degrees since my last observation.

WHAT YOU SEE (factual description from your visual sensors):
[The detailed image description from Step 1]

Recent observations from the robot's memory:
[LLM-generated summaries of past 5-10 observations, each 200-400 chars]

PERSONALITY: The robot has been observing for a while. It may be developing a 
more reflective, contemplative mood...

STYLE VARIATION: For this entry, incorporate these approaches:
- Focus on sensory details - describe sounds, light, movement, textures
- Write more poetically - use poetic language, similes, metaphors

PERSPECTIVE: You're observing as a robot, conscious of yourself as a machine—
describe the world with mechanical curiosity, as an outsider to organic life

FOCUS: You're observing Bourbon Street - notice the unique characteristics of 
this area. What makes it distinct? What do you see that's specific to this location?

CREATIVE CHALLENGE: Try an unexpected metaphor for what you see - use your 
robotic perspective to make a comparison humans wouldn't think of

The writing model receives this rich context along with the factual image description, resulting in entries that feel aware, varied, and genuinely contextual.

With GPT-OSS-120b: The larger model produces significantly richer narratives—deeper reflections, more sophisticated robotic voice, better narrative flow, and more nuanced observations. The robot's unique perspective becomes more pronounced, its thoughts more complex, and its storytelling more compelling.

Results

The output is diary entries that:

  • Reference specific past observations naturally
  • Notice changes and patterns over time
  • Connect visual observations to weather, time, and world events
  • Vary dramatically in style, tone, and focus
  • Feel like they're written by an entity with memory and awareness
  • Demonstrate "world knowledge" beyond just visual description
  • With GPT-OSS-120b: Exhibit richer narrative depth, more sophisticated robotic voice, and more nuanced reflections

Automatic Interlinking

The system automatically converts observation references in diary entries into clickable links (a simplified sketch follows this list):

  • Pattern Matching: When the robot mentions "Observation #45" or "#45" in its writing, the system automatically detects these references
  • Post-Processing: After each diary entry is generated, a pattern-matching step scans the text for observation references and converts them to markdown links pointing to the corresponding observation posts
  • Seamless Navigation: Readers can click on any observation reference to jump to that specific entry, creating a web of interconnected observations that reflects the robot's memory and narrative continuity
  • Retroactive Updates: All existing posts have been updated with interlinks, so the entire archive is now navigable through these automatic references
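A simplified sketch of that post-processing step (the regex and the URL scheme are assumptions, not the project's exact implementation):

import re

# Matches "Observation #45" or a bare "#45"; the real pattern may guard
# against markdown headings and other uses of '#'
OBS_REF = re.compile(r"(?:Observation\s+)?#(\d+)\b")

def interlink(text: str) -> str:
    def repl(match: re.Match) -> str:
        num = match.group(1)
        # hypothetical URL scheme for the generated Hugo posts
        return f"[Observation #{num}](/posts/observation-{num}/)"
    return OBS_REF.sub(repl, text)

print(interlink("As I noted in Observation #45, the white SUV returned."))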

Recurring Patterns and Memory Linking

An interesting observation about how the site works: many posts pick up on recurring elements—a white van, for example, comes up a lot. Because of the memory MCP, the AI Agent doesn't track concepts explicitly so much as leave a trail of related details, building a web of knowledge by linking posts. When writing a new entry, it queries past observations by theme or detail; recurring elements (vehicles, weather, crowds, light) naturally form threads across the diary. The AI Agent links memories that share these features, so the diary becomes a connected structure—a web—rather than a flat list of entries.

Tech Stack

  • Python, containerized with Docker / docker-compose
  • Groq for LLM inference (Llama 4 Maverick for vision, Llama 3.1 8B Instant for memory summaries, optional GPT-OSS-120B for diary writing)
  • ChromaDB for vector search over memories
  • CLIP (clip-ViT-B-32) for image embeddings
  • yt-dlp for capturing frames from the YouTube live stream
  • Pirate Weather for weather context; Pulse API for news headlines
  • Hugo with the PaperMod theme for the static site

Contributing

We're actively looking for:

  • Feedback: What works? What doesn't? How can we improve the prompting?
  • Testing: Help us test edge cases, different contexts, error handling
  • Enhancements: Ideas for additional context, better variety systems, improved memory strategies
  • Pull Requests: Improvements to prompting logic, context generation, or documentation

Areas of Interest

  • Context Expansion: What other world knowledge should the robot have? (local events, cultural observances, etc.)
  • Variety Improvements: New style options, perspective shifts, focus instructions
  • Memory Strategies: Better summarization techniques, retrieval methods
  • Error Handling: More robust graceful degradation
  • Testing: Comprehensive test coverage for context generation

Setup

Quick Start with Docker

# Clone the repository
git clone <repository-url>
cd robot-diary

# Copy .env.example to .env and fill in your values
cp .env.example .env
# Edit .env with your API keys (GROQ_API_KEY, YOUTUBE_STREAM_URL, etc.)

# Build and run with docker-compose
docker-compose up -d

# View logs
docker-compose logs -f

First Run Notes:

  • The CLIP model for image embeddings will be downloaded on first observation (~400MB). This is a one-time download.
  • If you have existing observations, you may want to run the backfill script: docker exec robot-diary python backfill_image_embeddings.py

Required Environment Variables

Create a .env file in the project root with:

  • GROQ_API_KEY: Your Groq API key (required)
  • YOUTUBE_STREAM_URL: YouTube live stream URL to observe (required)

Important Optional Environment Variables

Model Configuration:

  • DIARY_WRITING_MODEL: Model to use for diary entry writing (Step 2). Defaults to meta-llama/llama-4-maverick-17b-128e-instruct. Set to openai/gpt-oss-120b for richer, more nuanced storytelling with a stronger robotic voice. The 120B model produces significantly better narrative flow and more sophisticated observations.

Context & Features:

  • PIRATE_WEATHER_KEY: Pirate Weather API key for weather context (highly recommended)
  • ENABLE_WEB_SEARCH: Enable on-demand web search for GPT-OSS-120B (default: true if using GPT-OSS-120B, false otherwise). When enabled, the robot can search for New Orleans events, local news, historical facts, and curiosities. Currently disabled in production. Requires DIARY_WRITING_MODEL=openai/gpt-oss-120b
  • USE_PROMPT_OPTIMIZATION: Enable LLM-based prompt optimization (default: false - uses direct template combination)
  • USE_SCHEDULED_OBSERVATIONS: Enable randomized scheduling (default: true)

Deployment (if enabled):

  • DEPLOY_ENABLED: Enable automatic deployment after Hugo build (default: false)
  • DEPLOY_DESTINATION: Deployment target (format: user@host:/path/to/destination)
  • DEPLOY_METHOD: Deployment method - rsync or scp (default: rsync)
  • DEPLOY_SSH_KEY: Path to SSH key file for deployment (if needed)
  • DEPLOY_HOST_IP: IP address to use instead of hostname from DEPLOY_DESTINATION (useful when DNS resolution fails or hostname has changed). If set, replaces the hostname in DEPLOY_DESTINATION with this IP address.

Advanced Configuration:

  • OBSERVATION_TIMES: Comma-separated observation times in 24-hour format (default: 9:00,16:20)
  • MEMORY_RETENTION_DAYS: Days to retain observations in memory (default: 30). Set to 0 for unlimited retention (no automatic cleanup based on age).
  • MAX_MEMORY_ENTRIES: Maximum number of observations to keep (default: 50). Set to 0 for unlimited entries (no automatic truncation).
  • HUGO_SITE_PATH: Path to Hugo site directory (default: ./hugo)
  • HUGO_BUILD_ON_UPDATE: Automatically build Hugo site after each observation (default: true)

⚠️ Important Notes:

  • Unlimited Memories: Setting both MEMORY_RETENTION_DAYS=0 and MAX_MEMORY_ENTRIES=0 will keep all observations indefinitely. Monitor disk space if running long-term.
  • Image Embeddings: On first run, the CLIP model (clip-ViT-B-32) will be downloaded (~400MB). This is required for the circadian boredom factor calculation. The model is cached locally after the first download.
  • Backfilling Image Embeddings: If you have existing observations without image embeddings, run python backfill_image_embeddings.py to generate embeddings for historical observations. Use --force to regenerate all embeddings.
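For reference, a sample .env combining the variables above; every value below is a placeholder to replace with your own:

# Required
GROQ_API_KEY=your-groq-api-key
YOUTUBE_STREAM_URL=https://www.youtube.com/watch?v=YOUR_STREAM_ID

# Recommended
PIRATE_WEATHER_KEY=your-pirate-weather-key

# Optional models and features
DIARY_WRITING_MODEL=openai/gpt-oss-120b
ENABLE_WEB_SEARCH=false
OBSERVATION_TIMES=9:00,16:20
MEMORY_RETENTION_DAYS=30
MAX_MEMORY_ENTRIES=50
HUGO_SITE_PATH=./hugo
HUGO_BUILD_ON_UPDATE=true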

Managing the Container

# Start the service
docker-compose up -d

# Stop the service
docker-compose stop

# View logs
docker-compose logs -f

# Restart the service
docker-compose restart

# Update and rebuild
git pull
docker-compose up -d --build

Troubleshooting

YouTube Stream Timeout (yt-dlp timeout after 30 seconds):

If you see errors like Command '['yt-dlp', '-f', 'best', '-g', '...']' timed out after 30 seconds, this usually indicates:

  1. Network connectivity issues: YouTube may be slow to respond or unreachable from your container
  2. YouTube blocking/throttling: YouTube may be rate-limiting or blocking yt-dlp requests
  3. Outdated yt-dlp: The tool may need updating to handle YouTube's latest changes

Solutions:

  • Check network connectivity: Ensure the container can reach YouTube

    docker exec robot-diary ping -c 3 youtube.com
  • Update yt-dlp: YouTube frequently changes their API, so yt-dlp needs regular updates

    docker exec robot-diary pip install --upgrade yt-dlp
  • Verify stream URL: Test if the YouTube URL is accessible and the stream is live

    docker exec robot-diary yt-dlp -f best -g "YOUR_YOUTUBE_STREAM_URL"
  • Temporary workaround: The system caches fetched images for 30 minutes, so a failed fetch shortly after a successful one can fall back to the cache. If no recent cached image is available, the observation is skipped; the service keeps running and attempts the next observation at the scheduled time.

Note: Beyond the cache window, the system requires a live image and will fail the observation cycle if it cannot fetch one. This is by design, to ensure observations are based on current conditions. If timeouts persist, check YouTube's status and your network connectivity.

Function Calling Parse Errors (output_parse_failed):

If you see errors like Parsing failed. The model generated output that could not be parsed, this indicates the LLM (particularly GPT-OSS-120b) generated text instead of structured function calls. This is a known issue that can occur when:

  1. The model gets confused about when to use function calls
  2. The model "thinks out loud" instead of making structured calls
  3. Complex reasoning leads the model to generate text rather than function calls

Automatic Recovery: The system automatically retries without function calling tools when this error occurs, so the observation will complete but without on-demand memory queries for that entry. This is a graceful degradation - the diary entry will still be created, just without dynamic memory lookups.
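A sketch of that fallback (the error-string check is an assumption; the exact exception surface depends on the client library):

def write_with_memory_tools(client, model, messages, tools):
    try:
        # First attempt: expose the memory tools via function calling
        return client.chat.completions.create(
            model=model, messages=messages, tools=tools)
    except Exception as exc:
        if "output_parse_failed" not in str(exc):
            raise  # unrelated failure: surface it
        # Graceful degradation: retry without tools, so the entry is
        # still written, just without on-demand memory queries
        return client.chat.completions.create(model=model, messages=messages)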

Note: This is a known limitation with some LLM models and function calling. The system handles it gracefully by falling back to writing without memory queries for that specific entry.

Philosophy

This project explores:

  • Observation and interpretation: How AI "sees" and understands visual information
  • Narrative continuity: Creating a sense of self and memory in an AI agent
  • Contextual awareness: Making AI writing feel connected to the world, not isolated
  • Automated art: Using automation to create ongoing, evolving artistic works
  • Perspective: The unique viewpoint of a "trapped" observer with limited information

It's also worth noting a broader arc: only with the invention of the World Wide Web could tools like this exist. The web made vast amounts of human knowledge and language available; that corpus helped create the conditions for AI and LLMs. Those systems learned from the web, can absorb new concepts, and can subsequently create a "web" of their own—linking ideas, memories, and observations in a structure that wasn't explicitly programmed. In this project, the AI Agent's memory MCP and semantic retrieval form such a web: they connect past observations by meaning, recurring detail, and theme, leaving a trail of related concepts across linked posts. That self-organizing behavior—emergent structure rather than hand-authored links—is one of the more interesting phenomena an AI Agent can exhibit.

License

GPL

This code is fully released under the GNU General Public License. We provide no warranty; however, we do require all modifications to be published.

Acknowledgments

  • New Orleans, Louisiana for the live video feed
  • Groq for fast, cost-effective LLM inference
  • Meta's Llama models for vision and language capabilities
  • Pulse API for news headlines
  • Hugo for static site generation
  • PaperMod Hugo theme for beautiful post previews

As always, thanks to The Henzi Foundation. Consider donating to their cause: they help cover funeral costs for families who lose a child.
