A high-performance, modular service for finding the most relevant agents to answer user queries. This standalone implementation supports both REST and MCP (Model Context Protocol) interfaces with swappable search and LLM backends.
The system is composed of 4 modular Python files:
- `agent_finder.py`: Web server with REST and MCP endpoints
- `who_handler.py`: Core orchestration logic with caching
- `search_backend.py`: Swappable search interface (Azure Search, Elasticsearch, etc.)
- `llm_backend.py`: Swappable LLM interface (Azure OpenAI, OpenAI, Anthropic, etc.)
Install dependencies:

```bash
pip install -r requirements.txt
```

Configure the service through environment variables:

```bash
# Search Backend Configuration
export SEARCH_PROVIDER=azure   # Options: azure, elasticsearch, qdrant
export SEARCH_ENDPOINT="https://your-search.search.windows.net"
export SEARCH_API_KEY="your-search-api-key"
export SEARCH_INDEX="nlweb_sites"

# LLM Backend Configuration
export LLM_PROVIDER=azure_openai   # Options: azure_openai, openai, anthropic
export LLM_ENDPOINT="https://your-openai.openai.azure.com"
export LLM_API_KEY="your-llm-api-key"
export LLM_MODEL="gpt-4"
export LLM_EMBEDDING_MODEL="text-embedding-3-large"
export LLM_MAX_CONCURRENT=50

# Optional: Server Configuration
export WHO_SERVER_PORT=8080
export WHO_SERVER_HOST=0.0.0.0

# Optional: WHO Handler Settings
export WHO_SCORE_THRESHOLD=70
export WHO_MAX_RESULTS=10
export WHO_SEARCH_TOP_K=50
export WHO_CACHE_TTL=3600
```

Then start the server:

```bash
python agent_finder.py
```

The server starts on http://localhost:8080 by default.
Request:

```bash
curl -X POST http://localhost:8080/who \
  -H "Content-Type: application/json" \
  -d '{"query": "where can I buy running shoes?"}'
```

Response:
```json
{
  "results": [
    {
      "name": "Nike.com",
      "url": "https://www.nike.com",
      "score": 95,
      "description": "Official Nike store with extensive running shoe collection"
    },
    {
      "name": "Adidas.com",
      "url": "https://www.adidas.com",
      "score": 92,
      "description": "Adidas official store featuring running footwear"
    }
  ],
  "query": "where can I buy running shoes?"
}
```
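For programmatic access, a minimal Python client might look like the sketch below. It assumes only the `/who` endpoint shown above; `httpx` is an arbitrary HTTP client choice, not a project requirement.

```python
# Minimal sketch of a Python client for the /who REST endpoint.
# Assumes the server is running locally on the default port.
import httpx

def find_agents(query: str, base_url: str = "http://localhost:8080") -> list[dict]:
    """POST a query to /who and return the ranked results."""
    response = httpx.post(
        f"{base_url}/who",
        json={"query": query},
        timeout=30.0,  # LLM-backed ranking can take a few seconds
    )
    response.raise_for_status()
    return response.json()["results"]

if __name__ == "__main__":
    for site in find_agents("where can I buy running shoes?"):
        print(f'{site["score"]:3d}  {site["name"]}  {site["url"]}')
```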
The MCP endpoint follows the Model Context Protocol specification for tool-based interactions.

Initialize:

```bash
curl -X POST http://localhost:8080/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "initialize",
    "params": {"protocolVersion": "2024-11-05"},
    "id": 1
  }'
```

List Tools:
```bash
curl -X POST http://localhost:8080/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "tools/list",
    "id": 2
  }'
```

Call WHO Tool:
```bash
curl -X POST http://localhost:8080/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "method": "tools/call",
    "params": {
      "name": "who",
      "arguments": {"query": "where can I buy running shoes?"}
    },
    "id": 3
  }'
```
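The same three calls can be scripted. Below is a small Python sketch of the JSON-RPC exchange using only the standard library; it assumes nothing beyond the `/mcp` endpoint and the methods shown above, and omits error handling.

```python
# Sketch of the MCP JSON-RPC exchange shown above, standard library only.
import json
import urllib.request
from typing import Optional

MCP_URL = "http://localhost:8080/mcp"

def mcp_call(method: str, params: Optional[dict] = None, req_id: int = 1) -> dict:
    """Send one JSON-RPC 2.0 request and return the parsed response."""
    payload = {"jsonrpc": "2.0", "method": method, "id": req_id}
    if params is not None:
        payload["params"] = params
    request = urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

if __name__ == "__main__":
    mcp_call("initialize", {"protocolVersion": "2024-11-05"}, req_id=1)
    tools = mcp_call("tools/list", req_id=2)
    result = mcp_call(
        "tools/call",
        {"name": "who", "arguments": {"query": "where can I buy running shoes?"}},
        req_id=3,
    )
    print(json.dumps(result, indent=2))
```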
Health Check:

```bash
curl http://localhost:8080/health
```

Statistics:

```bash
curl http://localhost:8080/stats
```

Clear Caches:

```bash
curl -X POST http://localhost:8080/clear-cache
```
| Variable | Description | Default |
|---|---|---|
| Search Backend | | |
| `SEARCH_PROVIDER` | Search backend provider | `azure` |
| `SEARCH_ENDPOINT` | Search service endpoint | Required |
| `SEARCH_API_KEY` | Search service API key | Required |
| `SEARCH_INDEX` | Search index name | `nlweb_sites` |
| LLM Backend | | |
| `LLM_PROVIDER` | LLM provider | `azure_openai` |
| `LLM_ENDPOINT` | LLM service endpoint | Required |
| `LLM_API_KEY` | LLM service API key | Required |
| `LLM_MODEL` | LLM model name | `gpt-4` |
| `LLM_EMBEDDING_MODEL` | Embedding model name | `text-embedding-3-large` |
| `LLM_MAX_CONCURRENT` | Max concurrent LLM calls | `50` |
| Server | | |
| `WHO_SERVER_PORT` | Server port | `8080` |
| `WHO_SERVER_HOST` | Server host | `0.0.0.0` |
| WHO Handler | | |
| `WHO_SCORE_THRESHOLD` | Minimum score for a site to be included | `70` |
| `WHO_MAX_RESULTS` | Maximum results to return | `10` |
| `WHO_SEARCH_TOP_K` | Sites to retrieve from search | `50` |
| `WHO_CACHE_TTL` | Cache TTL in seconds | `3600` |
| `WHO_MAX_CACHE_ENTRIES` | Max search cache entries | `10000` |
| `WHO_RANKING_CACHE_ENTRIES` | Max ranking cache entries | `100000` |
Edit `search_backend.py` and implement the `SearchBackend` interface:
```python
class MySearchBackend(SearchBackend):
    async def initialize(self):
        # Initialize your client
        pass

    async def search(self, query: str, vector: List[float], top_k: int = 30) -> List[Dict[str, Any]]:
        # Return a list of {"url", "json_ld", "name", "site"} dicts
        pass

    async def close(self):
        # Clean up connections
        pass
```

Then update the factory function:
```python
def get_search_backend() -> SearchBackend:
    if SEARCH_CONFIG["provider"] == "mysearch":
        return MySearchBackend()
```
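As a concrete illustration, here is a minimal in-memory backend useful for local testing. It assumes it lives alongside (or imports) the `SearchBackend` base class in `search_backend.py`; the cosine-similarity scoring and `add` helper are choices made for this sketch, not part of the shipped code.

```python
# Minimal in-memory SearchBackend for local testing: ranks a hand-loaded
# corpus by cosine similarity against the query vector.
import math
from typing import Any, Dict, List

class InMemorySearchBackend(SearchBackend):
    def __init__(self):
        # (vector, record) pairs; in real use the vectors would come from
        # the same embedding model the LLM backend exposes.
        self._docs: List[tuple] = []

    async def initialize(self):
        pass  # nothing to connect to

    def add(self, vector: List[float], record: Dict[str, Any]):
        self._docs.append((vector, record))

    async def search(self, query: str, vector: List[float], top_k: int = 30) -> List[Dict[str, Any]]:
        def cosine(a: List[float], b: List[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(self._docs, key=lambda d: cosine(d[0], vector), reverse=True)
        return [record for _, record in ranked[:top_k]]

    async def close(self):
        self._docs.clear()
```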
Edit `llm_backend.py` and implement the `LLMBackend` interface:

```python
class MyLLMBackend(LLMBackend):
    async def initialize(self):
        # Initialize your client
        pass

    async def get_embedding(self, text: str) -> List[float]:
        # Return embedding vector
        pass

    async def rank_site(self, query: str, site_json: str) -> Dict[str, Any]:
        # Return {"score": 0-100, "description": "..."}
        pass

    async def close(self):
        # Cleanup
        pass
```
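For example, a backend over the OpenAI Python SDK (`openai>=1.0`) might look like the following sketch. The prompt wording and error handling are illustrative; real code should guard against non-JSON model replies.

```python
# Sketch of an LLMBackend over the OpenAI Python SDK (openai>=1.0).
# Model names come from the LLM_MODEL / LLM_EMBEDDING_MODEL variables.
import json
import os
from typing import Any, Dict, List

from openai import AsyncOpenAI

class OpenAILLMBackend(LLMBackend):
    async def initialize(self):
        self.client = AsyncOpenAI(api_key=os.environ["LLM_API_KEY"])

    async def get_embedding(self, text: str) -> List[float]:
        response = await self.client.embeddings.create(
            model=os.environ.get("LLM_EMBEDDING_MODEL", "text-embedding-3-large"),
            input=text,
        )
        return response.data[0].embedding

    async def rank_site(self, query: str, site_json: str) -> Dict[str, Any]:
        completion = await self.client.chat.completions.create(
            model=os.environ.get("LLM_MODEL", "gpt-4"),
            messages=[{
                "role": "user",
                "content": (
                    "Score 0-100 how well this site can answer the query, "
                    'and reply as JSON {"score": ..., "description": "..."}.\n'
                    f"Query: {query}\nSite: {site_json}"
                ),
            }],
        )
        # Illustrative: assumes the model returned valid JSON.
        return json.loads(completion.choices[0].message.content)

    async def close(self):
        await self.client.close()
```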
The system uses three levels of caching:

- Embedding Cache: never expires; embeddings are stable
- Search Cache: TTL-based; caches query → search results
- Ranking Cache: TTL-based; caches (query, site) → ranking
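A minimal sketch of the TTL pattern behind the search and ranking caches follows; the actual implementation lives in `who_handler.py` and may differ in its eviction details.

```python
# Minimal TTL cache in the spirit of the search/ranking caches above:
# entries expire on read, with a crude size cap on write.
import time
from typing import Any, Dict, Optional, Tuple

class TTLCache:
    def __init__(self, ttl_seconds: float = 3600, max_entries: int = 10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._data: Dict[Any, Tuple[float, Any]] = {}

    def get(self, key: Any) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        inserted_at, value = entry
        if time.monotonic() - inserted_at > self.ttl:
            del self._data[key]  # expired: evict on read
            return None
        return value

    def put(self, key: Any, value: Any) -> None:
        if len(self._data) >= self.max_entries:
            # Size cap: drop the oldest insertion.
            oldest = min(self._data, key=lambda k: self._data[k][0])
            del self._data[oldest]
        self._data[key] = (time.monotonic(), value)
```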
The service is fully asynchronous, with concurrency limits at each layer:

- Search: up to 50 concurrent connections to the search backend
- LLM: configurable concurrent calls (default 50, via `LLM_MAX_CONCURRENT`)
- Request Handling: fully async; supports 50+ concurrent requests
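Bounding the LLM calls typically looks like the sketch below; the handler's actual structure may differ, but the pattern is a shared `asyncio.Semaphore`.

```python
# Sketch of how LLM_MAX_CONCURRENT can bound parallel ranking calls:
# a single asyncio.Semaphore shared by all rank_site invocations.
import asyncio
import os

LLM_MAX_CONCURRENT = int(os.environ.get("LLM_MAX_CONCURRENT", "50"))
_llm_semaphore = asyncio.Semaphore(LLM_MAX_CONCURRENT)

async def rank_with_limit(llm, query: str, site_json: str):
    async with _llm_semaphore:  # at most LLM_MAX_CONCURRENT in flight
        return await llm.rank_site(query, site_json)

async def rank_all(llm, query: str, sites: list):
    # Launch every ranking task at once; the semaphore does the throttling.
    return await asyncio.gather(
        *(rank_with_limit(llm, query, s) for s in sites)
    )
```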
Memory usage with default settings:
- Base: ~200MB
- Full caches: 2-4GB
- Can scale to 16GB+ with increased cache sizes
The `/stats` endpoint provides real-time metrics:

```json
{
  "queries_processed": 1234,
  "cache_hits": 890,
  "cache_misses": 344,
  "total_sites_ranked": 10280,
  "embedding_cache_size": 567,
  "search_cache_size": 234,
  "ranking_cache_size": 8901
}
```
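When tuning cache sizes, the hit rate is the number to watch. A small polling script might compute it as follows, using the field names from the payload above:

```python
# Poll /stats once a minute and report the cache hit rate.
import json
import time
import urllib.request

STATS_URL = "http://localhost:8080/stats"

while True:
    with urllib.request.urlopen(STATS_URL) as response:
        stats = json.load(response)
    lookups = stats["cache_hits"] + stats["cache_misses"]
    hit_rate = stats["cache_hits"] / lookups if lookups else 0.0
    print(f'hit rate {hit_rate:.1%} over {stats["queries_processed"]} queries')
    time.sleep(60)
```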
Create a Dockerfile:

```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY *.py .
CMD ["python", "agent_finder.py"]
```

Build and run:
```bash
docker build -t who-handler .
docker run -p 8080:8080 --env-file .env who-handler
```

Create `/etc/systemd/system/who-handler.service`:
```ini
[Unit]
Description=WHO Handler Service
After=network.target

[Service]
Type=simple
User=www-data
WorkingDirectory=/opt/who-handler
EnvironmentFile=/opt/who-handler/.env
ExecStart=/usr/bin/python3 /opt/who-handler/agent_finder.py
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

To put the service behind nginx, use a reverse-proxy server block:

```nginx
server {
    listen 80;
    server_name who.example.com;

    location / {
        proxy_pass http://localhost:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_connect_timeout 10s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}
```

Common issues and fixes:
"No search results found"
- Check search index name and credentials
- Verify the index contains data
-
"Embedding error"
- Verify LLM endpoint and API key
- Check embedding model name is correct
-
Slow responses
- Check
LLM_MAX_CONCURRENTsetting - Monitor
/statsfor cache effectiveness - Consider increasing cache sizes
- Check
-
High memory usage
- Reduce cache sizes via environment variables
- Monitor
/statsendpoint for cache sizes
This standalone implementation of the WHO handler functionality is made available under the MIT License.
For issues or questions, please refer to the main NLWeb project documentation.