feat: Add `StagehandCrawler` with AI-powered browser automation by Mantisus · Pull Request #1854 · apify/crawlee-python

Mantisus · 2026-04-22T22:47:29Z

Description

Adds StagehandCrawler - a new browser crawler powered by Stagehand that lets users interact with pages using natural language instead of CSS selectors or XPath. Extends PlaywrightCrawler and inherits all of its features: routing, sessions, autoscaling, proxies, and navigation hooks.

StagehandPage extends Playwright Page with four AI methods: act(), extract(), observe(), and execute().
StagehandOptions configures the AI model, execution environment (LOCAL / BROWSERBASE), and session parameters.
StagehandBrowserPlugin and StagehandBrowserController integrate Stagehand into the browser pool, managing session lifecycle and fingerprint header injection.
Because Stagehand controls the browser launch internally and Playwright connects via CDP, only Chromium is supported, and browser configuration is limited to the subset accepted by Stagehand's BrowserLaunchOptions.
Added a new guide covering basic usage, AI page operations, and Browserbase integration.

Issues

Closes: StagehandCrawler + Stagehand browser plugin #1738

Testing

Added unit tests for the StagehandBrowserController, StagehandBrowserPlugin, and StagehandCrawler with Stagehand mocked out - no real LLM connection required to run the test suite.
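
As a rough illustration of what "Stagehand mocked out" can look like, the sketch below replaces a session object with a `MagicMock` so that no LLM backend is contacted. `FakeSession`, its `act(instruction=...)` signature, and `click_login` are hypothetical stand-ins invented for this example. They are not the real Stagehand types or the fixtures used in this PR.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


class FakeSession:
    """Hypothetical stand-in for a Stagehand session; not the real API."""

    async def act(self, *, instruction: str) -> str:
        raise RuntimeError('would call a real LLM backend')


async def click_login(session: FakeSession) -> str:
    # Code under test: forwards a natural-language instruction to the session.
    return await session.act(instruction='click the login button')


# spec= limits the mock to attributes that exist on FakeSession,
# so a typo such as session_mock.atc raises AttributeError.
session_mock = MagicMock(spec=FakeSession)
session_mock.act = AsyncMock(return_value='ok')

result = asyncio.run(click_login(session_mock))

# Verify the call was awaited exactly once with the expected arguments.
session_mock.act.assert_awaited_once_with(instruction='click the login button')
```

The trade-off is the one raised later in review: a mock verifies the wiring, not the behaviour of the real service.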

Co-authored-by: Copilot <copilot@github.com>

Copilot

Pull request overview

Adds first-class Stagehand integration to Crawlee Python by introducing a StagehandCrawler (built on PlaywrightCrawler) plus corresponding browser-pool plugin/controller, enabling AI-driven page actions (act, extract, observe, execute) while keeping Crawlee’s existing routing/sessions/proxy/navigation-hook features.

Changes:

Introduces StagehandCrawler + Stagehand-specific crawling contexts and exports them from crawlee.crawlers.
Adds StagehandBrowserPlugin/StagehandBrowserController, StagehandOptions, and StagehandPage, integrated with BrowserPool.
Adds Stagehand documentation + examples, updates architecture docs, and replaces the older “Playwright with Stagehand” guide; updates dependencies and adds unit tests.

Reviewed changes

Copilot reviewed 21 out of 23 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| `uv.lock` | Locks new optional Stagehand dependency set and adds `stagehand` extra resolution entries. |
| `pyproject.toml` | Adds `stagehand` optional dependency group and includes it in `all`. |
| `src/crawlee/browsers/__init__.py` | Exposes Stagehand browser plugin/controller and types via optional imports. |
| `src/crawlee/browsers/_stagehand_types.py` | Defines `StagehandOptions` and `StagehandPage` AI-method wrappers. |
| `src/crawlee/browsers/_stagehand_browser_plugin.py` | Implements `StagehandBrowserPlugin` lifecycle and Stagehand client initialization. |
| `src/crawlee/browsers/_stagehand_browser_controller.py` | Implements CDP connection + lazy session start, page creation, and header injection for Stagehand. |
| `src/crawlee/crawlers/__init__.py` | Exposes Stagehand crawler + contexts via optional imports. |
| `src/crawlee/crawlers/_stagehand/__init__.py` | Adds Stagehand crawler module exports with optional-deps handling. |
| `src/crawlee/crawlers/_stagehand/_stagehand_crawler.py` | Adds `StagehandCrawler` built on `PlaywrightCrawler` and auto-configures a Stagehand `BrowserPool`. |
| `src/crawlee/crawlers/_stagehand/_stagehand_crawling_context.py` | Adds Stagehand-specific crawling context dataclasses and type-narrowed `page`. |
| `src/crawlee/crawlers/_playwright/_playwright_crawler.py` | Refactors Playwright crawler to support overridable context classes and generic context typing via `_build_context`. |
| `tests/unit/browsers/test_stagehand_browser_plugin.py` | Adds unit tests for plugin activation and Stagehand client init parameter wiring. |
| `tests/unit/browsers/test_stagehand_browser_controller.py` | Adds unit tests for lazy session start, concurrency behavior, proxies, and header behavior. |
| `tests/unit/crawlers/_stagehand/test_stagehand_crawler.py` | Adds unit tests verifying context types, hook contexts, and StagehandPage AI-method delegation. |
| `docs/guides/stagehand_crawler.mdx` | New guide documenting `StagehandCrawler`, options, AI methods, and Browserbase usage. |
| `docs/guides/code_examples/stagehand_crawler/basic_example.py` | Example demonstrating `act()` + `extract()` with JSON schema. |
| `docs/guides/code_examples/stagehand_crawler/browserbase_example.py` | Example demonstrating Browserbase environment configuration. |
| `docs/guides/playwright_crawler_stagehand.mdx` | Removes old guide that described manual Stagehand integration with `PlaywrightCrawler`. |
| `docs/guides/code_examples/playwright_crawler_stagehand/support_classes.py` | Removes old example support classes for the manual Stagehand integration. |
| `docs/guides/code_examples/playwright_crawler_stagehand/browser_classes.py` | Removes old example browser plugin/controller classes for the manual Stagehand integration. |
| `docs/guides/code_examples/playwright_crawler_stagehand/stagehand_run.py` | Removes old “manual integration” runnable example. |
| `docs/guides/architecture_overview.mdx` | Updates architecture diagrams/text to include `StagehandCrawler` + contexts. |
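
Several entries above mention exposing names "via optional imports". The mechanism Crawlee actually uses is not shown in this thread, so the following is only a generic sketch of the idea: try to import a module and fall back gracefully when the extra is not installed. `optional_import` is a hypothetical helper name, and the missing package name is made up.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the module if it is installed, otherwise None (hypothetical helper)."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # The optional dependency is absent; the caller decides how to degrade.
        return None


json_module = optional_import('json')  # standard library, always present
missing = optional_import('surely_not_installed_pkg_12345')  # made-up name
```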

Mantisus · 2026-05-03T22:58:23Z

Docs check fails due to the current versioning logic. ApiLink resolves to /api/class/<ClassName> instead of /api/next/class/<ClassName>, and since these classes don't exist in the stable API docs yet, it causes a broken link error.

vdusek

Mostly doc-related / style things. Maybe you could also align the `.rules.md` file (about the double backticks and line width for docstrings).

vdusek

LGTM

janbuchar · 2026-05-06T12:01:04Z

+    """
+
+    def __init__(self, page: Page, session: AsyncSession) -> None:
+        # super().__init__() skipped - Page attribute access delegates to self._page via __getattr__.


This explains the intent, but not the fragility 🙂 In the case that _impl_obj on Page becomes a slot, for example, this proxy will stop working. I don't think this will ever happen, but perhaps we should add something like if hasattr(type(page), '_impl_obj'): raise uh-oh to actually verify our expectations.

Perhaps your comment refers to the previous implementation using super().__init__(page._impl_obj)? After @vdusek's feedback, I replaced it with delegation to self._page via __getattr__. Or am I missing something?
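
To make the delegation being discussed concrete, here is a minimal sketch of a subclass that skips `super().__init__()` and resolves missing attributes through `__getattr__`. `FakePage`, `ProxyPage`, and `_impl` are hypothetical names invented for this example. The real Playwright `Page` and the PR's `StagehandPage` are more involved, and this is one reading of the approach rather than the code under review.

```python
from typing import Any


class FakePage:
    """Hypothetical stand-in for Playwright's Page."""

    def __init__(self, impl: dict) -> None:
        self._impl = impl  # instance attribute, loosely analogous to `_impl_obj`

    def title(self) -> str:
        return self._impl['title']


class ProxyPage(FakePage):
    def __init__(self, page: FakePage) -> None:
        # Deliberately no super().__init__(): `_impl` is never set on the proxy.
        self._page = page

    def __getattr__(self, name: str) -> Any:
        # Runs only when normal lookup fails. Guarding `_page` avoids infinite
        # recursion if the proxy is inspected before __init__ has run.
        if name == '_page':
            raise AttributeError(name)
        return getattr(self._page, name)


inner = FakePage({'title': 'Example Domain'})
proxy = ProxyPage(inner)
page_title = proxy.title()  # inherited method; `_impl` is resolved via __getattr__
```

The sketch also shows where the fragility would come from: the inherited `title()` only works because `_impl` is missing from normal lookup on the proxy, so `__getattr__` gets a chance to run.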

janbuchar · 2026-05-06T12:25:05Z

+            self._browser_context = None
+
+    def _on_page_close(self, page: StagehandPage) -> None:
+        self._pages.remove(page)


This could throw in some edge cases - can you add a membership check in _pages to be sure?
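
A minimal sketch of the requested guard, assuming `_pages` is a plain list: `list.remove` raises `ValueError` when the item is absent, for example if a close callback fires twice. `PageRegistry` is a hypothetical stand-in for the controller's bookkeeping, not the class from this PR.

```python
class PageRegistry:
    """Hypothetical stand-in for the controller's page bookkeeping."""

    def __init__(self) -> None:
        self._pages: list = []

    def add(self, page: object) -> None:
        self._pages.append(page)

    def on_page_close(self, page: object) -> None:
        # Membership check makes the callback safe to call more than once.
        if page in self._pages:
            self._pages.remove(page)


registry = PageRegistry()
page = object()
registry.add(page)
registry.on_page_close(page)
registry.on_page_close(page)  # second call is a no-op instead of raising ValueError
```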

janbuchar · 2026-05-06T12:26:47Z

+                await self._browser.close()
+        finally:
+            self._session = None
+            self._browser_context = None


Nit, but shouldn't we set _browser to None here, too?

This is intentional, to keep the logic of `is_browser_connected` simple. The browser connection is established on the first `new_page()` call, so there are two cases:
`self._browser is None`: the connection is not yet established and the controller is ready to accept pages.
`self._browser.is_connected()`: returns False after `close()`, correctly signalling a disconnected state.
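
The reasoning above can be sketched as follows. The body of `is_browser_connected` is not shown in this thread, so the property here is inferred from the reply and should be read as an assumption. `FakeBrowser` and `LazyController` are hypothetical, synchronous stand-ins for the real async classes.

```python
from typing import Optional


class FakeBrowser:
    """Hypothetical stand-in for a browser connected over CDP."""

    def __init__(self) -> None:
        self._connected = True

    def is_connected(self) -> bool:
        return self._connected

    def close(self) -> None:
        self._connected = False


class LazyController:
    """Hypothetical controller that connects on the first new_page() call."""

    def __init__(self) -> None:
        self._browser: Optional[FakeBrowser] = None

    @property
    def is_browser_connected(self) -> bool:
        # None means "not connected yet", which still counts as usable.
        return self._browser is None or self._browser.is_connected()

    def new_page(self) -> None:
        if self._browser is None:
            self._browser = FakeBrowser()

    def close(self) -> None:
        if self._browser is not None:
            self._browser.close()
        # _browser is deliberately not reset to None, so the property
        # keeps reporting False after close().


controller = LazyController()
states = [controller.is_browser_connected]  # before any page is opened
controller.new_page()
states.append(controller.is_browser_connected)  # after the lazy connect
controller.close()
states.append(controller.is_browser_connected)  # after close()
```

Resetting `_browser` to `None` in `close()` would make a closed controller indistinguishable from a fresh one under this scheme.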

janbuchar · 2026-05-06T12:28:36Z

+class StagehandPreNavCrawlingContext(PlaywrightPreNavCrawlingContext):
+    """The pre navigation crawling context used by the `StagehandCrawler`."""
+
+    page: StagehandPage


The JS version also exposes the raw stagehand instance. Is this omission intentional?

This is related to the architectural differences in Stagehand between JS and Python. In JS, the raw stagehand object provides direct in-process calls to AI methods. In Python, this happens via a REST API exposed by AsyncSession.

However, we can expose the session as a public attribute on StagehandPage. This might be useful for some users.

janbuchar · 2026-05-06T12:32:38Z

+        """
+        return await self._session.extract(page=self._page, **kwargs)
+
+    async def execute(self, **kwargs: Unpack[SessionExecuteParamsNonStreaming]) -> SessionExecuteResponse:


This is called agent in the JS version. Maybe we should match that?

My motivation was to keep the method name consistent with stagehand-python, for users who will be referring to the Stagehand documentation. But I won’t insist on it too much.

janbuchar · 2026-05-06T12:40:58Z

+@pytest.fixture
+async def patched_crawler(stagehand_session_mock: MagicMock) -> AsyncGenerator[StagehandCrawler, None]:
+    """StagehandCrawler with real Playwright but Stagehand session mocked."""
+    stagehand_client = MagicMock(spec=AsyncStagehand)


Makes a lot of sense to mock out stagehand for unit testing purposes, but some kind of e2e test that would actually go through the whole setup would be very useful, too. The JS version has one 🙂

Yeah. However, it requires configuring secrets in the Apify CI environment (model API keys, and maybe Browserbase credentials). But I don't have the permissions for that.

janbuchar · 2026-05-06T12:50:40Z

+        context: TPostNavContext,
+    ) -> TCrawlingContext: ...
+
+    def _build_context(


I'm not sure about this refactor - it adds an overloaded method with one code path per call site, essentially. Why is it better than the previous state?

Honestly, I've long wanted to centralise context creation in one place - this was just a good occasion to do it. If you find this approach inconvenient, we can revert to the original or split into 3 methods (_build_pre_nav_context, _build_post_nav_context, _build_crawling_context).
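
For comparison, the three-method alternative mentioned above might look roughly like the sketch below. All class names, fields, and the class-attribute override mechanism are hypothetical and chosen for illustration. The real `_build_context` signature and context dataclasses are only partially visible in this thread.

```python
from dataclasses import dataclass


@dataclass
class PreNavContext:
    url: str


@dataclass
class PostNavContext(PreNavContext):
    status: int


@dataclass
class CrawlingContext(PostNavContext):
    html: str


@dataclass
class AiCrawlingContext(CrawlingContext):
    """Hypothetical narrowed context, standing in for a crawler-specific one."""


class ContextFactory:
    # Subclasses override these attributes instead of the build methods.
    pre_nav_cls = PreNavContext
    post_nav_cls = PostNavContext
    crawling_cls = CrawlingContext

    def build_pre_nav_context(self, url: str) -> PreNavContext:
        return self.pre_nav_cls(url=url)

    def build_post_nav_context(self, pre: PreNavContext, status: int) -> PostNavContext:
        return self.post_nav_cls(url=pre.url, status=status)

    def build_crawling_context(self, post: PostNavContext, html: str) -> CrawlingContext:
        return self.crawling_cls(url=post.url, status=post.status, html=html)


class AiContextFactory(ContextFactory):
    crawling_cls = AiCrawlingContext


factory = AiContextFactory()
pre = factory.build_pre_nav_context('https://example.com')
post = factory.build_post_nav_context(pre, 200)
ctx = factory.build_crawling_context(post, '<html></html>')
```

Each call site gets its own method with a plain signature, at the cost of three entry points instead of one.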

janbuchar · 2026-05-06T12:54:20Z

+    """Browserbase project ID, required when `env='BROWSERBASE'`. If not provided, read from
+    the `BROWSERBASE_PROJECT_ID` environment variable."""
+
+    model: str = 'openai/gpt-4.1-mini'


That's a fairly dated model, wouldn't 5.4-nano work better?

I used the same model as JS. But if we're ready to upgrade the model, then yes, I think the 5.4-nano would be better.

Mantisus and others added 10 commits April 22, 2026 22:44

add stagehand plugin

1febb65

update typing for stagehand

85f18de

update plugin

a47b836

Merge branch 'master' into crawlee-stagehand-crawler

edefde7

Co-authored-by: Copilot <copilot@github.com>

synchronize params between modules

62b0c66

Merge branch 'master' into crawlee-stagehand-crawler

15ad00a

fix docs

0424afc

add tests

bd45915

add docs and fingerprint headers

33503cb

fix docs

4115ae2

Mantisus marked this pull request as ready for review May 3, 2026 22:52

Mantisus self-assigned this May 3, 2026

Mantisus requested a review from Copilot May 3, 2026 22:52

Copilot started reviewing on behalf of Mantisus May 3, 2026 22:53

Copilot AI reviewed May 3, 2026

fixes

476aa4f

Mantisus requested review from janbuchar and vdusek May 4, 2026 00:59

Mantisus added 4 commits May 4, 2026 01:03

fix test

9645808

resolve conflict and update stagehand

fb34b74

fix docstring

a86c049

Merge branch 'master' into crawlee-stagehand-crawler

ed6ea6f

vdusek requested changes May 5, 2026

fix docs style

95cc773

Mantisus requested a review from vdusek May 5, 2026 22:53

vdusek approved these changes May 6, 2026

janbuchar reviewed May 6, 2026

fix

77feedd
