
feat(scrape_tool): add support for JavaScript rendering using Playwright #4402

Open

mrsadeghi wants to merge 1 commit into crewAIInc:main from mrsadeghi:feat/add-playwright-rendering

Conversation


mrsadeghi commented Feb 7, 2026

Description

This PR introduces an optional render_js parameter to the ScrapeWebsiteTool, enabling it to handle modern JavaScript-heavy websites (SPAs, React, etc.) that cannot be scraped with standard HTTP requests.

Key Changes

  • Added render_js: bool = False to ScrapeWebsiteTool constructor.
  • Integrated playwright as an optional rendering engine.
  • Kept the existing lightweight path: when render_js is False, the tool continues to use the requests library.
  • Added comprehensive unit tests with mocking to verify both JS and non-JS workflows.
  • Updated pyproject.toml to include playwright as a dependency.

Why is this needed?

Currently, the ScrapeWebsiteTool fails to capture content from websites that require client-side rendering. By adding Playwright support, we significantly expand the tool's capability to gather data from modern web applications.
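
For clarity, here is a minimal sketch of how the render_js switch can work. This is an illustration of the approach described above, not the exact code in this PR; the standalone scrape helper is an assumption.

import requests
from bs4 import BeautifulSoup

def scrape(website_url: str, render_js: bool = False, timeout: int = 30) -> str:
    # Illustrative sketch only; the PR's actual implementation may differ.
    if render_js:
        # Lazy import so users who never set render_js do not need Playwright.
        from playwright.sync_api import sync_playwright

        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(website_url, wait_until="networkidle")
            html = page.content()
            browser.close()
    else:
        # Default path: plain HTTP fetch, no browser required.
        html = requests.get(website_url, timeout=timeout).text

    # Both branches feed the same BeautifulSoup parsing step.
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)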

Testing

  • Manual Test: Verified on https://quotes.toscrape.com/js/.
    • Without render_js: Captured ~200 chars (missing content).
    • With render_js: Captured ~1500 chars (full content).
  • Automated Test: Added tests/tools/test_scrape_website_tool.py and passed successfully using uv run pytest.
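
The manual test above can be reproduced roughly as follows. This sketch assumes the tool keeps its existing website_url constructor argument and run() method; render_js is the new flag introduced by this PR.

from crewai_tools import ScrapeWebsiteTool

url = "https://quotes.toscrape.com/js/"
plain = ScrapeWebsiteTool(website_url=url)
rendered = ScrapeWebsiteTool(website_url=url, render_js=True)

print(len(plain.run()))     # ~200 chars: JS-injected quotes are missing
print(len(rendered.run()))  # ~1500 chars: fully rendered content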

Note for Maintainers

Due to environment-specific pre-commit hook issues on Windows (related to .venv/bin/activate path), local hooks were bypassed after manual formatting with ruff.


Note

Medium Risk
Moderate risk due to new async execution paths (HITL feedback loops, async HTTP client) and adding Playwright-based rendering, which can introduce event-loop, dependency, and runtime environment issues.

Overview
Adds optional JavaScript rendering to ScrapeWebsiteTool via a new render_js flag that switches scraping from requests to Playwright page rendering before parsing with BeautifulSoup; playwright is added as a dependency and new unit tests cover both JS-rendered and default paths.

Extends HITL handling to support async feedback loops: introduces handle_feedback_async in the human input provider contract with non-blocking stdin reads, updates both agent executors to await async feedback processing, and adds _ainvoke_loop to the experimental executor to support async re-execution during feedback.

Updates flow router listener dispatch to always execute listeners for router outcomes (passing through last_human_feedback when present) and adds regression tests for chained router outcomes. Separately, switches PlusAPI.get_agent to async httpx and adapts call sites/tests accordingly (including using asyncio.run).

Written by Cursor Bugbot for commit fd457d1.

Adds asynchronous human-in-the-loop handling and related fixes:

- Extend human_input provider with async support: AsyncExecutorContext, handle_feedback_async, async prompt helpers (_prompt_input_async, _async_readline), and async training/regular feedback loops in SyncHumanInputProvider.
- Add async handler methods in CrewAgentExecutor and AgentExecutor (_ahandle_human_feedback, _ainvoke_loop) to integrate async provider flows.
- Change PlusAPI.get_agent to an async httpx call and adapt caller in agent_utils to run it via asyncio.run.
- Simplify listener execution in flow.Flow to correctly pass HumanFeedbackResult to listeners and unify execution path for router outcomes.
- Remove deprecated types/hitl.py definitions.
- Add tests covering chained router feedback, rejected paths, and mixed router/non-router listeners to prevent regressions.

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.


  client = PlusAPI(api_key=get_auth_token())
  _print_current_organization()
- response = client.get_agent(from_repository)
+ response = asyncio.run(client.get_agent(from_repository))

asyncio.run() fails inside already-running event loops

High Severity

asyncio.run(client.get_agent(from_repository)) raises RuntimeError when called from within an already-running event loop. Since load_agent_from_repository is invoked from a sync Pydantic model validator, constructing an Agent(from_repository=...) during kickoff_async or any other async context will crash. The previous sync requests-based get_agent worked in all contexts. Additionally, existing tests that @patch get_agent with a regular MagicMock will also break because asyncio.run() expects a coroutine, not a MagicMock.
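
One common way to make a sync call site tolerant of both contexts is to detect a running loop and off-load to a worker thread when one exists. The snippet below is only a sketch of that pattern (the helper name is hypothetical), not necessarily the fix the maintainers will choose.

import asyncio
import concurrent.futures

def get_agent_blocking(client, handle):
    # Sketch of a loop-aware call site (illustrative).
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop running: asyncio.run() is safe here.
        return asyncio.run(client.get_agent(handle))
    # Already inside a loop (e.g. kickoff_async): run the coroutine on a
    # separate thread with its own event loop instead of raising RuntimeError.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, client.get_agent(handle)).result()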

Additional Locations (1)


    raw = await reader.readline()
    return raw.decode().rstrip("\n")
except (OSError, NotImplementedError, ValueError):
    return await asyncio.to_thread(input)

Repeated connect_read_pipe on stdin corrupts state

Medium Severity

_async_readline calls connect_read_pipe on sys.stdin every invocation, but the transport is never closed. The first call puts stdin's fd in non-blocking mode and registers a persistent transport. On subsequent calls (expected during the multi-round feedback loop in _handle_regular_feedback_async), connect_read_pipe fails, the fallback calls asyncio.to_thread(input), but input() may malfunction because stdin is still in non-blocking mode from the first call, and the lingering transport can race for bytes.
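
A thread-based read avoids touching stdin's file descriptor entirely and can be called repeatedly across feedback rounds. The snippet below is a sketch of that alternative; the function name and structure are assumptions, not the PR's actual helpers.

import asyncio
import sys

async def async_readline(prompt: str = "") -> str:
    # Sketch only: read stdin in a worker thread so the event loop stays
    # responsive and stdin is never switched to non-blocking mode.
    if prompt:
        print(prompt, end="", flush=True)
    line = await asyncio.to_thread(sys.stdin.readline)
    return line.rstrip("\n")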


"python-docx~=1.2.0",
"youtube-transcript-api~=1.2.2",
"pymupdf~=1.26.6",
"playwright>=1.57.0",

Playwright added as hard dependency instead of optional

Medium Severity

playwright is added to the required dependencies list, making it mandatory for all crewai-tools users. Playwright is a heavyweight package that also requires separate browser binary installation (playwright install). The implementation already handles ImportError at runtime (lines 87–88 of the scrape tool), clearly indicating it was designed to be optional. It belongs in [project.optional-dependencies] alongside similar optional packages like selenium and browserbase.
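
If playwright moves to an extra, the runtime side typically pairs it with a lazy-import guard that points users at the install step. The sketch below illustrates that pattern; the extra name crewai-tools[playwright] is hypothetical, and the PR's existing ImportError handling may already differ.

def _load_playwright():
    # Illustrative guard; the PR's actual ImportError handling may differ.
    try:
        from playwright.sync_api import sync_playwright
    except ImportError as e:
        raise ImportError(
            "render_js=True requires the optional Playwright dependency. "
            "Install it (e.g. pip install 'crewai-tools[playwright]', a "
            "hypothetical extra) and then run: playwright install"
        ) from e
    return sync_playwright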


async def get_agent(self, handle: str) -> httpx.Response:
    url = urljoin(self.base_url, f"{self.AGENTS_RESOURCE}/{handle}")
    async with httpx.AsyncClient() as client:
        return await client.get(url, headers=self.headers)

Async get_agent drops trust_env=False proxy setting

Low Severity

The sync _make_request explicitly sets session.trust_env = False to ignore proxy environment variables, but the new async get_agent uses httpx.AsyncClient() which defaults to trust_env=True. This means the async version will pick up HTTP_PROXY/HTTPS_PROXY environment variables that the sync version intentionally ignores, potentially causing requests to route through unintended proxies or fail in corporate/CI environments.
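
A minimal adjustment would be to pass trust_env=False to the async client, mirroring the sync session. This is a sketch of that one-line change, assuming the surrounding PlusAPI attributes shown in the snippet above; it is not necessarily the final fix.

async def get_agent(self, handle: str) -> httpx.Response:
    url = urljoin(self.base_url, f"{self.AGENTS_RESOURCE}/{handle}")
    # trust_env=False mirrors the sync client's session.trust_env = False,
    # so HTTP_PROXY/HTTPS_PROXY are ignored here as well.
    async with httpx.AsyncClient(trust_env=False) as client:
        return await client.get(url, headers=self.headers)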

