feat: Add StagehandCrawler with AI-powered browser automation #1854
Mantisus wants to merge 17 commits into apify:master from
Co-authored-by: Copilot <copilot@github.com>
Pull request overview
Adds first-class Stagehand integration to Crawlee Python by introducing a StagehandCrawler (built on PlaywrightCrawler) plus corresponding browser-pool plugin/controller, enabling AI-driven page actions (act, extract, observe, execute) while keeping Crawlee’s existing routing/sessions/proxy/navigation-hook features.
Changes:
- Introduces `StagehandCrawler` + Stagehand-specific crawling contexts and exports them from `crawlee.crawlers`.
- Adds `StagehandBrowserPlugin`/`StagehandBrowserController`, `StagehandOptions`, and `StagehandPage`, integrated with `BrowserPool`.
- Adds Stagehand documentation + examples, updates architecture docs, and replaces the older “Playwright with Stagehand” guide; updates dependencies and adds unit tests.
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `uv.lock` | Locks new optional Stagehand dependency set and adds stagehand extra resolution entries. |
| `pyproject.toml` | Adds stagehand optional dependency group and includes it in all. |
| `src/crawlee/browsers/__init__.py` | Exposes Stagehand browser plugin/controller and types via optional imports. |
| `src/crawlee/browsers/_stagehand_types.py` | Defines StagehandOptions and StagehandPage AI-method wrappers. |
| `src/crawlee/browsers/_stagehand_browser_plugin.py` | Implements StagehandBrowserPlugin lifecycle and Stagehand client initialization. |
| `src/crawlee/browsers/_stagehand_browser_controller.py` | Implements CDP connection + lazy session start, page creation, and header injection for Stagehand. |
| `src/crawlee/crawlers/__init__.py` | Exposes Stagehand crawler + contexts via optional imports. |
| `src/crawlee/crawlers/_stagehand/__init__.py` | Adds Stagehand crawler module exports with optional-deps handling. |
| `src/crawlee/crawlers/_stagehand/_stagehand_crawler.py` | Adds StagehandCrawler built on PlaywrightCrawler and auto-configures a Stagehand BrowserPool. |
| `src/crawlee/crawlers/_stagehand/_stagehand_crawling_context.py` | Adds Stagehand-specific crawling context dataclasses and type-narrowed page. |
| `src/crawlee/crawlers/_playwright/_playwright_crawler.py` | Refactors Playwright crawler to support overridable context classes and generic context typing via _build_context. |
| `tests/unit/browsers/test_stagehand_browser_plugin.py` | Adds unit tests for plugin activation and Stagehand client init parameter wiring. |
| `tests/unit/browsers/test_stagehand_browser_controller.py` | Adds unit tests for lazy session start, concurrency behavior, proxies, and header behavior. |
| `tests/unit/crawlers/_stagehand/test_stagehand_crawler.py` | Adds unit tests verifying context types, hook contexts, and StagehandPage AI-method delegation. |
| `docs/guides/stagehand_crawler.mdx` | New guide documenting StagehandCrawler, options, AI methods, and Browserbase usage. |
| `docs/guides/code_examples/stagehand_crawler/basic_example.py` | Example demonstrating act() + extract() with JSON schema. |
| `docs/guides/code_examples/stagehand_crawler/browserbase_example.py` | Example demonstrating Browserbase environment configuration. |
| `docs/guides/playwright_crawler_stagehand.mdx` | Removes old guide that described manual Stagehand integration with PlaywrightCrawler. |
| `docs/guides/code_examples/playwright_crawler_stagehand/support_classes.py` | Removes old example support classes for the manual Stagehand integration. |
| `docs/guides/code_examples/playwright_crawler_stagehand/browser_classes.py` | Removes old example browser plugin/controller classes for the manual Stagehand integration. |
| `docs/guides/code_examples/playwright_crawler_stagehand/stagehand_run.py` | Removes old “manual integration” runnable example. |
| `docs/guides/architecture_overview.mdx` | Updates architecture diagrams/text to include StagehandCrawler + contexts. |
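The table mentions a lazy session start in the controller and tests for its concurrency behavior. A minimal sketch of that general pattern follows, assuming an `asyncio.Lock` guards the first start; `LazySessionController`, `_start_session`, and `start_count` are hypothetical names for illustration, not the PR's actual code.

```python
import asyncio
from typing import Optional


class LazySessionController:
    """Hypothetical controller that starts its session on the first page request."""

    def __init__(self) -> None:
        self._session: Optional[str] = None
        self._lock = asyncio.Lock()
        self.start_count = 0  # Counts how often the session was actually started.

    async def _start_session(self) -> str:
        self.start_count += 1
        await asyncio.sleep(0)  # Yield control to simulate network I/O.
        return 'session'

    async def new_page(self) -> str:
        # The lock makes concurrent first calls wait instead of starting twice.
        async with self._lock:
            if self._session is None:
                self._session = await self._start_session()
        return f'page on {self._session}'


async def main() -> int:
    controller = LazySessionController()
    # Five concurrent page requests should share a single session start.
    await asyncio.gather(*(controller.new_page() for _ in range(5)))
    return controller.start_count
```

Running `asyncio.run(main())` exercises the concurrent path; without the lock, each of the five callers could observe `_session is None` and start its own session.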
Docs check fails due to the current versioning logic. |
vdusek left a comment
Mostly doc-related / style things. Maybe you could also align the `.rules.md` file (about the double backticks and line width for docstrings).
        """

        def __init__(self, page: Page, session: AsyncSession) -> None:
            # super().__init__() skipped - Page attribute access delegates to self._page via __getattr__.
This explains the intent, but not the fragility 🙂 In the case that `_impl_obj` on `Page` becomes a slot, for example, this proxy will stop working. I don't think this will ever happen, but perhaps we should add something like `if hasattr(type(page), '_impl_obj'): raise uh-oh` to actually verify our expectations.
Perhaps your comment refers to the previous implementation using `super().__init__(page._impl_obj)`? After @vdusek's feedback, I replaced it with delegation to `self._page` via `__getattr__`. Or am I missing something?
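For readers following this thread, the delegation approach can be sketched in isolation. This is a minimal illustration, not the PR's `StagehandPage`: `FakePage` and `PageProxy` are hypothetical stand-ins, and the AI call is replaced by a plain string.

```python
from typing import Any


class FakePage:
    """Stand-in for a Playwright page; hypothetical, for illustration only."""

    def __init__(self, url: str) -> None:
        self.url = url

    def title(self) -> str:
        return f'Title of {self.url}'


class PageProxy:
    """Wraps a page and forwards unknown attribute lookups to it."""

    def __init__(self, page: FakePage, session: str) -> None:
        # No super().__init__() call: all page state lives on the wrapped object.
        self._page = page
        self._session = session

    def __getattr__(self, name: str) -> Any:
        # Invoked only when normal lookup fails, so the proxy's own attributes win.
        return getattr(self._page, name)

    def act(self, instruction: str) -> str:
        # Proxy-specific method; a real implementation would call the AI session here.
        return f'{self._session}: {instruction} on {self._page.url}'
```

With this shape, `PageProxy(FakePage('https://example.com'), 'session-1').title()` resolves through `__getattr__`, while `act()` is found on the proxy itself.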
            self._browser_context = None

        def _on_page_close(self, page: StagehandPage) -> None:
            self._pages.remove(page)
This could throw in some edge cases - can you add a membership check in _pages to be sure?
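The edge case here is that `list.remove` raises `ValueError` when the item is absent, for example if a close event fires twice for the same page. A minimal sketch of the suggested guard, using a hypothetical `PageRegistry` with strings in place of page objects:

```python
class PageRegistry:
    """Hypothetical holder for open pages; strings stand in for page objects."""

    def __init__(self) -> None:
        self._pages: list[str] = []

    def add(self, page: str) -> None:
        self._pages.append(page)

    def on_page_close(self, page: str) -> None:
        # Guard against a repeated close event: remove only if still tracked.
        if page in self._pages:
            self._pages.remove(page)
```

Calling `on_page_close` twice for the same page is then a no-op the second time rather than an exception.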
            await self._browser.close()
        finally:
            self._session = None
            self._browser_context = None
Nit, but shouldn't we set _browser to None here, too?
This is done intentionally to keep the logic of `is_browser_connected` simple.
Because the browser connection is established on the first `new_page()` call:
- `self._browser is None` - connection not yet established, controller is ready to accept pages.
- `self._browser.is_connected()` - returns `False` after `close()`, correctly signalling a disconnected state.
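The two states described in this reply can be sketched as follows. This assumes the property simply combines both checks, which the thread implies but does not show; `FakeBrowser` and `ControllerSketch` are hypothetical stand-ins.

```python
from typing import Optional


class FakeBrowser:
    """Stand-in for a browser object exposing is_connected()."""

    def __init__(self) -> None:
        self._connected = True

    def is_connected(self) -> bool:
        return self._connected

    def close(self) -> None:
        self._connected = False


class ControllerSketch:
    def __init__(self) -> None:
        self._browser: Optional[FakeBrowser] = None

    @property
    def is_browser_connected(self) -> bool:
        # None means "not connected yet", which still counts as usable.
        return self._browser is None or self._browser.is_connected()

    def new_page(self) -> None:
        # The connection is established lazily on the first page request.
        if self._browser is None:
            self._browser = FakeBrowser()

    def close(self) -> None:
        if self._browser is not None:
            self._browser.close()
        # _browser is deliberately not reset, so the closed state stays visible.
```

Resetting `_browser` to `None` in `close()` would make a closed controller indistinguishable from a fresh one under this check.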
    class StagehandPreNavCrawlingContext(PlaywrightPreNavCrawlingContext):
        """The pre navigation crawling context used by the `StagehandCrawler`."""

        page: StagehandPage
The JS version also exposes the raw stagehand instance. Is this omission intentional?
This is related to the architectural differences in Stagehand between JS and Python. In JS, the raw stagehand object provides direct in-process calls to AI methods. In Python, this happens via a REST API exposed by AsyncSession.
However, we can expose the session as a public attribute on StagehandPage. This might be useful for some users.
            """
            return await self._session.extract(page=self._page, **kwargs)

        async def execute(self, **kwargs: Unpack[SessionExecuteParamsNonStreaming]) -> SessionExecuteResponse:
This is called agent in the JS version. Maybe we should match that?
My motivation was to keep the method name consistent with stagehand-python, for users who will be referring to the Stagehand documentation. But I won’t insist on it too much.
    @pytest.fixture
    async def patched_crawler(stagehand_session_mock: MagicMock) -> AsyncGenerator[StagehandCrawler, None]:
        """StagehandCrawler with real Playwright but Stagehand session mocked."""
        stagehand_client = MagicMock(spec=AsyncStagehand)
Makes a lot of sense to mock out stagehand for unit testing purposes, but some kind of e2e test that would actually go through the whole setup would be very useful, too. The JS version has one 🙂
Yeah. However, it requires configuring secrets in the Apify CI environment (model API keys, and maybe Browserbase credentials). But I don't have the permissions for that.
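One common way to keep such an e2e test in the repository before CI secrets exist is to skip it when the credentials are missing. The helper below is a sketch under assumptions: `missing_secrets` is a hypothetical name, and the secret names passed to it are up to the project (only `BROWSERBASE_PROJECT_ID` appears in this PR).

```python
import os
from collections.abc import Mapping, Sequence


def missing_secrets(required: Sequence[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]
```

With pytest, the result could feed `pytest.mark.skipif(bool(missing_secrets([...])), reason='e2e secrets not configured')`, so the test runs only where the secrets are provided.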
            context: TPostNavContext,
        ) -> TCrawlingContext: ...

        def _build_context(
I'm not sure about this refactor - it adds an overloaded method with one code path per call site, essentially. Why is it better than the previous state?
Honestly, I've long wanted to centralise context creation in one place - this was just a good occasion to do it. If you find this approach inconvenient, we can revert to the original or split into 3 methods (`_build_pre_nav_context`, `_build_post_nav_context`, `_build_crawling_context`).
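To make the three-method alternative concrete, here is a rough sketch of its shape. The context classes and their fields (`url`, `status`, `links`) are simplified, hypothetical stand-ins; only the three method names come from the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreNavContext:
    url: str


@dataclass(frozen=True)
class PostNavContext(PreNavContext):
    status: int


@dataclass(frozen=True)
class CrawlingContext(PostNavContext):
    links: tuple


class CrawlerSketch:
    """One builder per stage; a subclass can override any of them to narrow types."""

    def _build_pre_nav_context(self, url: str) -> PreNavContext:
        return PreNavContext(url=url)

    def _build_post_nav_context(self, context: PreNavContext, status: int) -> PostNavContext:
        # Carry forward the earlier stage's fields and add the new one.
        return PostNavContext(url=context.url, status=status)

    def _build_crawling_context(self, context: PostNavContext, links: tuple) -> CrawlingContext:
        return CrawlingContext(url=context.url, status=context.status, links=links)
```

Each call site then maps to exactly one method with a plain signature, at the cost of three override points instead of one.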
        """Browserbase project ID, required when `env='BROWSERBASE'`. If not provided, read from
        the `BROWSERBASE_PROJECT_ID` environment variable."""

        model: str = 'openai/gpt-4.1-mini'
That's a fairly dated model, wouldn't 5.4-nano work better?
I used the same model as JS. But if we're ready to upgrade the model, then yes, I think the 5.4-nano would be better.
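The docstring quoted at the top of this thread describes an environment-variable fallback for the Browserbase project ID. A sketch of how such a fallback might behave, with loud assumptions: `OptionsSketch`, its `env` and `project_id` field names, and the `ValueError` are hypothetical; only the `model` default and the `BROWSERBASE_PROJECT_ID` variable name come from the snippet.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class OptionsSketch:
    """Hypothetical stand-in for the options class."""

    env: str = 'LOCAL'
    project_id: Optional[str] = None
    model: str = 'openai/gpt-4.1-mini'

    def resolve_project_id(self) -> Optional[str]:
        # An explicit value wins; otherwise fall back to the environment variable.
        value = self.project_id or os.environ.get('BROWSERBASE_PROJECT_ID')
        if self.env == 'BROWSERBASE' and not value:
            # Assumed behaviour: the docstring only says the ID is "required" in this mode.
            raise ValueError('project_id is required when env is BROWSERBASE')
        return value
```

Changing the default model, as discussed above, would touch only the `model` field's default; callers that pass `model` explicitly would be unaffected.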
Description
Adds `StagehandCrawler` - a new browser crawler powered by Stagehand that lets users interact with pages using natural language instead of CSS selectors or XPath.
- Extends `PlaywrightCrawler` and inherits all of its features: routing, sessions, autoscaling, proxies, and navigation hooks.
- `StagehandPage` extends Playwright `Page` with four AI methods: `act()`, `extract()`, `observe()`, and `execute()`.
- `StagehandOptions` configures the AI model, execution environment (LOCAL/BROWSERBASE), and session parameters.
- `StagehandBrowserPlugin` and `StagehandBrowserController` integrate Stagehand into the browser pool, managing session lifecycle and fingerprint header injection.
- `BrowserLaunchOptions`.
Issues
Testing
Unit tests for `StagehandBrowserController`, `StagehandBrowserPlugin`, and `StagehandCrawler` with Stagehand mocked out - no real LLM connection required to run the test suite.