feat(scrape_tool): add support for JavaScript rendering using Playwright#4402
feat(scrape_tool): add support for JavaScript rendering using Playwright#4402mrsadeghi wants to merge 1 commit intocrewAIInc:mainfrom
Conversation
asynchronous human-in-the-loop handling and related fixes. - Extend human_input provider with async support: AsyncExecutorContext, handle_feedback_async, async prompt helpers (_prompt_input_async, _async_readline), and async training/regular feedback loops in SyncHumanInputProvider. - Add async handler methods in CrewAgentExecutor and AgentExecutor (_ahandle_human_feedback, _ainvoke_loop) to integrate async provider flows. - Change PlusAPI.get_agent to an async httpx call and adapt caller in agent_utils to run it via asyncio.run. - Simplify listener execution in flow.Flow to correctly pass HumanFeedbackResult to listeners and unify execution path for router outcomes. - Remove deprecated types/hitl.py definitions. - Add tests covering chained router feedback, rejected paths, and mixed router/non-router listeners to prevent regressions.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
This PR is being reviewed by Cursor Bugbot
Details
Your team is on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle for each member of your team.
To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.
| client = PlusAPI(api_key=get_auth_token()) | ||
| _print_current_organization() | ||
| response = client.get_agent(from_repository) | ||
| response = asyncio.run(client.get_agent(from_repository)) |
There was a problem hiding this comment.
asyncio.run() fails inside already-running event loops
High Severity
asyncio.run(client.get_agent(from_repository)) raises RuntimeError when called from within an already-running event loop. Since load_agent_from_repository is invoked from a sync Pydantic model validator, constructing an Agent(from_repository=...) during kickoff_async or any other async context will crash. The previous sync requests-based get_agent worked in all contexts. Additionally, existing tests that @patch get_agent with a regular MagicMock will also break because asyncio.run() expects a coroutine, not a MagicMock.
Additional Locations (1)
| raw = await reader.readline() | ||
| return raw.decode().rstrip("\n") | ||
| except (OSError, NotImplementedError, ValueError): | ||
| return await asyncio.to_thread(input) |
There was a problem hiding this comment.
Repeated connect_read_pipe on stdin corrupts state
Medium Severity
_async_readline calls connect_read_pipe on sys.stdin every invocation, but the transport is never closed. The first call puts stdin's fd in non-blocking mode and registers a persistent transport. On subsequent calls (expected during the multi-round feedback loop in _handle_regular_feedback_async), connect_read_pipe fails, the fallback calls asyncio.to_thread(input), but input() may malfunction because stdin is still in non-blocking mode from the first call, and the lingering transport can race for bytes.
| "python-docx~=1.2.0", | ||
| "youtube-transcript-api~=1.2.2", | ||
| "pymupdf~=1.26.6", | ||
| "playwright>=1.57.0", |
There was a problem hiding this comment.
Playwright added as hard dependency instead of optional
Medium Severity
playwright is added to the required dependencies list, making it mandatory for all crewai-tools users. Playwright is a heavyweight package that also requires separate browser binary installation (playwright install). The implementation already handles ImportError at runtime (lines 87–88 of the scrape tool), clearly indicating it was designed to be optional. It belongs in [project.optional-dependencies] alongside similar optional packages like selenium and browserbase.
| async def get_agent(self, handle: str) -> httpx.Response: | ||
| url = urljoin(self.base_url, f"{self.AGENTS_RESOURCE}/{handle}") | ||
| async with httpx.AsyncClient() as client: | ||
| return await client.get(url, headers=self.headers) |
There was a problem hiding this comment.
Async get_agent drops trust_env=False proxy setting
Low Severity
The sync _make_request explicitly sets session.trust_env = False to ignore proxy environment variables, but the new async get_agent uses httpx.AsyncClient() which defaults to trust_env=True. This means the async version will pick up HTTP_PROXY/HTTPS_PROXY environment variables that the sync version intentionally ignores, potentially causing requests to route through unintended proxies or fail in corporate/CI environments.


Description
This PR introduces an optional
render_jsparameter to theScrapeWebsiteTool, enabling it to handle modern JavaScript-heavy websites (SPAs, React, etc.) that cannot be scraped with standard HTTP requests.Key Changes
render_js: bool = FalsetoScrapeWebsiteToolconstructor.playwrightas an optional rendering engine.render_jsis False, it continues to use the lightweightrequestslibrary.pyproject.tomlto includeplaywrightas a dependency.Why this is needed?
Currently, the
ScrapeWebsiteToolfails to capture content from websites that require client-side rendering. By adding Playwright support, we significantly expand the tool's capability to gather data from modern web applications.Testing
https://quotes.toscrape.com/js/.render_js: Captured ~200 chars (missing content).render_js: Captured ~1500 chars (full content).tests/tools/test_scrape_website_tool.pyand passed successfully usinguv run pytest.Note for Maintainers
Due to environment-specific pre-commit hook issues on Windows (related to
.venv/bin/activatepath), local hooks were bypassed after manual formatting withruff.Note
Medium Risk
Moderate risk due to new async execution paths (HITL feedback loops, async HTTP client) and adding Playwright-based rendering, which can introduce event-loop, dependency, and runtime environment issues.
Overview
Adds optional JavaScript rendering to
ScrapeWebsiteToolvia a newrender_jsflag that switches scraping fromrequeststo Playwright page rendering before parsing with BeautifulSoup;playwrightis added as a dependency and new unit tests cover both JS-rendered and default paths.Extends HITL handling to support async feedback loops: introduces
handle_feedback_asyncin the human input provider contract with non-blocking stdin reads, updates both agent executors to await async feedback processing, and adds_ainvoke_loopto the experimental executor to support async re-execution during feedback.Updates flow router listener dispatch to always execute listeners for router outcomes (passing through
last_human_feedbackwhen present) and adds regression tests for chained router outcomes. Separately, switchesPlusAPI.get_agentto asynchttpxand adapts call sites/tests accordingly (including usingasyncio.run).Written by Cursor Bugbot for commit fd457d1. This will update automatically on new commits. Configure here.