Skip to content

Conversation

@Pijukatel
Copy link
Collaborator

@Pijukatel Pijukatel commented Jan 12, 2026

Description

  • Introduces a new argument crawler_id for BasicCrawler. This argument controls the shared state.

Draft for discussion. The main drawback is that this is another unique way of sharing something between different crawlers. Similar, but different existing approaches:

  • Statistics can be shared by explicitly passing an existing instance to the crawler.
  • Storages in general can be shared by properly setting a combination of configuration and storage_client arguments to the crawler
  • Storages can be shared by relying on default values (will be reused by default)
  • RequestManager can be reused by explicitly passing an existing instance to the request_manager argument

What are the alternatives?

  1. Explicitly passing state_kvs instead of crawler_id, otherwise autoincrement state counter - this is more aligned with the existing approach of how Statistics can be re-used
  2. Bring default_kvs_id Configuration value from SDK level to Crawlee level. This would allow to share or not share state based on what would be in the Configuration(default_kvs_id=...) (if using the same storage client...)
  3. Ignore and just document current behavior
  4. ...

Issues

Closes: #1627

Testing

  • TODO

Checklist

  • CI passed

@github-actions github-actions bot added this to the 132nd sprint - Tooling team milestone Jan 12, 2026
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Jan 12, 2026
@Pijukatel Pijukatel changed the title Add crawler fix: Do not share state between different crawlers unless requested Jan 12, 2026
@codecov
Copy link

codecov bot commented Jan 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.38%. Comparing base (0a0995c) to head (31e16d2).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1669      +/-   ##
==========================================
- Coverage   92.41%   92.38%   -0.03%     
==========================================
  Files         157      157              
  Lines       10478    10486       +8     
==========================================
+ Hits         9683     9688       +5     
- Misses        795      798       +3     
Flag Coverage Δ
unit 92.38% <100.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Pijukatel
Copy link
Collaborator Author

Better discuss this. After implementing this draft, I am leaning towards alternative 1 (see description)

@janbuchar , @barjin, @B4nan. Could you please share your point of view?

You can see the usage in code in the updated and new test in this PR.

@barjin
Copy link
Member

barjin commented Jan 12, 2026

Explicitly passing state_kvs instead of crawler_id

The state is just one key in the KVS though, it feels weird to me to make the state this prominent in our API. If it's about the entire KVS (so e.g. get_key_value_store will also return this KVS), then it makes a bit more sense to me.

Maybe it's unclear that the "crawler state" is actually stored in KVS - this we should IMO communicate better in the docs.

Having thought about this a bit more, I see the "state sharing" as a bug again :) Different crawler instances IMO shouldn't influence each other just because they are touching the same storage implementation (if this is intentional, it should be explicit).

Copy link
Member

@B4nan B4nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I am getting lost in this, I thought the id is rather internal thing to ensure two crawler instances created in one app context won't share the state. We expose the id so people can opt-in to sharing the state explicitly, but the important bit to me is that those IDs will be unique automatically. I can't think of a use case where one would want to create multiple crawlers and share their stats. Similarly, I don't think sharing the state object is something people would want to, at least not by default.

status_message_logging_interval: timedelta = timedelta(seconds=10),
status_message_callback: Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]]
| None = None,
crawler_id: int | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the option should be called just id as in the JS version, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Pepa chose that name because id is a bultin function in Python. If that's the case, I think we can safely shadow it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@janbuchar exactly. But just id is fine for me as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point why I dislike it is, that "crawler.id" is very ambiguous, and it can be implemented to control:

  • only the state - people might get confused, why does id control only sharing of the state and not statistics? In that case it is not explicitly obvious from the argument and it would be better called something like state_id.
  • state and statistics - this would cause a situation where you can either share both or none and also Statistics can already be shared implicitly.

@Pijukatel
Copy link
Collaborator Author

Explicitly passing state_kvs instead of crawler_id

The state is just one key in the KVS though, it feels weird to me to make the state this prominent in our API. If it's about the entire KVS (so e.g. get_key_value_store will also return this KVS), then it makes a bit more sense to me.

Maybe it's unclear that the "crawler state" is actually stored in KVS - this we should IMO communicate better in the docs.

Having thought about this a bit more, I see the "state sharing" as a bug again :) Different crawler instances IMO shouldn't influence each other just because they are touching the same storage implementation (if this is intentional, it should be explicit).

What about having an optional argument use_state? The default will be a function that saves to the default kvs under an automatically incremented id and the user can pass whatever custom implementation if they want something custom, like sharing the same state between two crawlers.

This will be an easy and clear default without the need for extra arguments and maximum flexibility for custom use cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Port fix of multiple crawler instances sharing state from JS version

5 participants