Description
Question:
While running the RAG training task, I frequently see this error:
client_session: <aiohttp.client.ClientSession object at 0x7ff7e5760130>
ERROR:2026-01-24 05:17:38,366:Unclosed client session
The unclosed sessions leak resources, and MCP requests then start timing out:
ERROR:2026-01-24 08:17:42,169:MCP server error during rollout: Timed out while waiting for response to ClientRequest. Waited 5.0 seconds.
Ultimately, the MCP timeouts combined with the resource leak cause numerous rollout failures, and no triplets are produced.
Solution:
I found that the unclosed aiohttp.client.ClientSession is created and cached by LiteLLM when it sends the HTTP requests that your code issues through LitellmModel, and it is never closed.
My workaround is to switch LiteLLM's transport to httpx:
export DISABLE_AIOHTTP_TRANSPORT=True
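If exporting the variable in the shell is inconvenient (for example when the job is started by a launcher script), the same setting can probably be applied from Python. This is only a sketch: it assumes LiteLLM reads DISABLE_AIOHTTP_TRANSPORT from the process environment no earlier than import time, which I have not verified, so the assignment is placed before any litellm import.

```python
import os

# Same effect as `export DISABLE_AIOHTTP_TRANSPORT=True`, but in-process.
# Assumption: this must run before `import litellm` (or anything that imports
# it), so that the HTTP client is built with the httpx transport.
os.environ["DISABLE_AIOHTTP_TRANSPORT"] = "True"

# import litellm  # import only after the variable is set
```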
However, training becomes slower:
Before the change: Training Progress: 0%| | 319/125000 [4:52:35<1796:12:54, 51.86s/it]
After the change: Training Progress: 0%| | 319/125000 [10:27:08<2410:59:31, 69.61s/it]
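To put a number on the slowdown, the two progress lines can be compared in two ways: by the s/it figure that tqdm prints (as far as I know, a smoothed recent rate) and by the elapsed time divided by the 319 completed steps. The sketch below uses only the values from the two lines above; the helper name to_seconds is mine. By the printed rate the slowdown is about 1.34x, but by elapsed time it is about 2.1x (roughly 55 s versus 118 s per step on average).

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string from the progress bar into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

STEPS = 319  # completed steps in both progress lines

before_elapsed = to_seconds("4:52:35")   # aiohttp transport
after_elapsed = to_seconds("10:27:08")   # httpx transport

# Average time per step over the whole run so far.
avg_before = before_elapsed / STEPS
avg_after = after_elapsed / STEPS

# Slowdown by the s/it value printed by tqdm vs. by total elapsed time.
rate_ratio = 69.61 / 51.86
elapsed_ratio = after_elapsed / before_elapsed

print(f"average s/step: {avg_before:.1f} -> {avg_after:.1f}")
print(f"slowdown: {rate_ratio:.2f}x (printed rate), {elapsed_ratio:.2f}x (elapsed)")
```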
I previously tried adding the following code to the async def training_rollout_async function in agent-lightning/examples/rag/rag_agent.py, but it did not fix the problem. I'll try again later if needed.
finally:
    # Cancel the runner task if it is still running and give it 2 s to exit.
    if runner_task and not runner_task.done():
        try:
            runner_task.cancel()
            await asyncio.wait_for(runner_task, timeout=2.0)
        except (asyncio.TimeoutError, asyncio.CancelledError, Exception):
            pass
    # Best-effort close of the HTTP clients that LiteLLM caches.
    try:
        from litellm.llms.custom_httpx.async_client_cleanup import (
            close_litellm_async_clients,
        )
        await close_litellm_async_clients()
    except Exception:
        pass
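For anyone who wants to retry this, here is the same cancel-then-clean-up pattern pulled out into a small helper that can be tested without LiteLLM. It is a sketch, not a confirmed fix: run_with_cleanup and cancel_and_wait are hypothetical names, and cleanup stands in for close_litellm_async_clients. Unlike my snippet, it logs a failed cleanup instead of silently swallowing it, which should at least show whether the close call is failing.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger(__name__)

async def cancel_and_wait(task: "asyncio.Future[Any]", timeout: float = 2.0) -> bool:
    """Cancel the task and wait up to `timeout` seconds; return True if it finished."""
    if not task.done():
        task.cancel()
        # asyncio.wait does not raise on timeout or on the task's own error.
        await asyncio.wait({task}, timeout=timeout)
    return task.done()

async def run_with_cleanup(
    runner: Callable[[], Awaitable[Any]],
    cleanup: Callable[[], Awaitable[Any]],
    rollout_timeout: float,
) -> Any:
    """Run `runner` with a timeout and always call `cleanup` afterwards."""
    runner_task = asyncio.ensure_future(runner())
    try:
        # shield() keeps wait_for from cancelling the task itself, so the
        # cancellation is handled in one place in the finally block.
        return await asyncio.wait_for(asyncio.shield(runner_task), rollout_timeout)
    finally:
        await cancel_and_wait(runner_task)
        try:
            await cleanup()
        except Exception:
            # Log instead of `pass` so a failing cleanup is visible.
            logger.exception("client cleanup failed")
```

In the rollout function, cleanup would be close_litellm_async_clients, and runner would be whatever coroutine currently backs runner_task.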