feat(nimbus): Add enrollment alert data GCS reader and Celery fetch task #15035

Open
yashikakhurana wants to merge 3 commits into main from 15024

@yashikakhurana yashikakhurana commented Mar 25, 2026

Because

  • The enrollment count JSON is now available. We can use it to set the monitoring data for each experiment, which can then be used to alert users about enrollment thresholds and SRM (sample ratio mismatch).

This commit

  • Fetches the JSON data, parses it, and sets it on the experiment's monitoring_data field.

Fixes #15024 #15036

@yashikakhurana yashikakhurana marked this pull request as ready for review March 25, 2026 17:19
@yashikakhurana yashikakhurana linked an issue Mar 25, 2026 that may be closed by this pull request
"fetch_monitoring_data": {
    "task": "experimenter.jetstream.tasks.fetch_monitoring_data",
    "schedule": crontab(minute=0, hour=8),
    "options": {"expires": 3600},
Contributor

What does this expiration option do?

Contributor Author

@yashikakhurana yashikakhurana Mar 26, 2026

The timeout is 1 hour for this request.

Contributor

Is this useful here? We don't use it for any of the other tasks, and I can't really tell from the Celery docs what scenario this parameter is trying to handle.

If we do want this, I think a more appropriate timeout would be ~5min. Were you seeing really long execution times in testing this? Or is there another reason this should be 1hr?
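
For context on this thread: in Celery, `expires` is not an execution timeout. It sets a deadline after which a task message that has not started yet is discarded instead of run, and beat passes the `options` dict through to `apply_async`. The sketch below is a plain-Python model of that rule, not Celery's own code; `TaskMessage` and `should_execute` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class TaskMessage:
    """Hypothetical stand-in for a queued task message."""

    name: str
    published_at: datetime
    expires_seconds: Optional[int] = None  # mirrors options={"expires": 3600}


def should_execute(message: TaskMessage, picked_up_at: datetime) -> bool:
    """Return False when a worker picks the message up after its deadline."""
    if message.expires_seconds is None:
        return True  # no expiry set: run whenever a worker is free
    deadline = message.published_at + timedelta(seconds=message.expires_seconds)
    return picked_up_at <= deadline


published = datetime(2026, 3, 26, 8, 0, tzinfo=timezone.utc)
message = TaskMessage("fetch_monitoring_data", published, expires_seconds=3600)

# Picked up within the hour: the task runs.
print(should_execute(message, published + timedelta(minutes=30)))
# Picked up two hours late (for example after a worker outage): it is skipped.
print(should_execute(message, published + timedelta(hours=2)))
```

Under this reading, `expires` only matters when the queue is backed up or workers are down. If the goal is to bound how long the fetch itself may run, Celery's `soft_time_limit` and `time_limit` task options are the usual tools.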

},
"fetch_monitoring_data": {
    "task": "experimenter.jetstream.tasks.fetch_monitoring_data",
    "schedule": crontab(minute=0, hour=8),
Contributor

This is fine, but be aware that there will be times when the data isn't available by 8 (especially if this is UTC, which I think it is?). It might be nice to have a way to trigger this manually in those cases, but I think it's OK to defer that until it happens often enough to be worth fixing.

Contributor Author

Makes sense, let me note it down somewhere so that we don't miss it.

Contributor

It doesn't look like we're actually using these constants yet. Can you hold them back until the PR where they are actually needed?

Contributor Author

Okay, I can remove these. I only added them while prepping for the next task.

Comment on lines +108 to +110
except Exception as e:
    logger.error(f"Failed to load monitoring data from GCS: {e}")
    return {}
Contributor

I think we should just raise this exception and let the task handle it along with any logging. That fits with what we're doing in the other tasks, and it also doesn't seem useful to overwrite yesterday's data with blank data if there was an error getting the new data.
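
A minimal sketch of that suggestion, under stated assumptions: the reader lets load and parse errors propagate, and the task is the only place that logs. The `load` callable stands in for `load_data_from_gcs`, the `store` dict stands in for the experiment's `monitoring_data` field, and the re-raise in the task is assumed to match the other tasks rather than taken from them.

```python
import json
import logging

logger = logging.getLogger("monitoring_reader_sketch")


def get_monitoring_data(load):
    """Parse the payload; any load or parse error propagates to the caller."""
    return json.loads(load())


def fetch_monitoring_data(load, store):
    """Task-level handler: log the failure once and keep the previous data."""
    try:
        data = get_monitoring_data(load)
    except Exception:
        logger.exception("fetch_monitoring_data failed")
        raise  # assumption: the task re-raises after logging
    # Only reached on success, so an error never overwrites older data.
    store["monitoring_data"] = data
    return data
```

With this shape, a GCS outage leaves yesterday's value in `store` untouched, whereas returning `{}` from the reader would have replaced it with blank data.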

Comment on lines +141 to +142
logger.warning("No enrollment alert data found in GCS")
metrics.incr("fetch_monitoring_data.completed")
Contributor

Should this be an error log and a .failed metric status?
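
If the answer is yes, the empty-payload branch might look like the sketch below. `FakeMetrics` is a hypothetical stand-in for the `metrics` client and `report_outcome` is an illustrative helper, not a function from the PR; only the metric names and the log message come from the diff.

```python
import logging
from collections import Counter

logger = logging.getLogger("fetch_monitoring_data_sketch")


class FakeMetrics:
    """Hypothetical stand-in for the statsd-style `metrics` client."""

    def __init__(self):
        self.counts = Counter()

    def incr(self, name):
        self.counts[name] += 1


def report_outcome(data, metrics):
    """Count an empty payload as a failure instead of a completed run."""
    if not data:
        logger.error("No enrollment alert data found in GCS")
        metrics.incr("fetch_monitoring_data.failed")
        return False
    metrics.incr("fetch_monitoring_data.completed")
    return True
```

Keeping `.completed` for runs that actually wrote data means a dashboard built on that counter would not hide days when the file was missing.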


    self.assertEqual(result, {})

@patch("experimenter.jetstream.client.load_data_from_gcs")
Contributor

I think you can patch these for everything without annotating every function by creating a fixture like this, and having it take a parameter to set the return value dynamically.
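
One way to read this suggestion, sketched with the standard library only: start the patch once in `setUp` and expose a small helper that sets the return value per test. The `client` namespace is a hypothetical stand-in for `experimenter.jetstream.client`, and `set_gcs_payload` is an invented helper name; the fixture the reviewer linked to may look different.

```python
import types
import unittest
from unittest.mock import patch

# Hypothetical stand-in for the experimenter.jetstream.client module.
client = types.SimpleNamespace(load_data_from_gcs=lambda path: "real")


def get_monitoring_data():
    return client.load_data_from_gcs("monitoring.json")


class MonitoringDataTests(unittest.TestCase):
    def setUp(self):
        # Patch once for every test; addCleanup undoes it after each one.
        patcher = patch.object(client, "load_data_from_gcs")
        self.mock_load = patcher.start()
        self.addCleanup(patcher.stop)

    def set_gcs_payload(self, payload):
        """Per-test knob for the mocked return value."""
        self.mock_load.return_value = payload

    def test_returns_payload(self):
        self.set_gcs_payload({"exp": {"enrollments": 10}})
        self.assertEqual(get_monitoring_data(), {"exp": {"enrollments": 10}})

    def test_empty_payload(self):
        self.set_gcs_payload({})
        self.assertEqual(get_monitoring_data(), {})
```

This removes the `@patch` decorator and the extra mock argument from every test method while still letting each test choose its own payload.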

Comment on lines +3540 to +3542
result = get_monitoring_data()

self.assertEqual(result, {})
Contributor

Yeah, see my other comment on this, but I really think this should result in an exception from get_monitoring_data that is handled by the task.

):
    mock_get.return_value = experiment
    # Should not raise, should log and continue
    tasks.fetch_monitoring_data()
Contributor

I'm not totally sure this test is necessary, but I guess it ensures that a random exception doesn't break the task?

Can we test for something here, like that the status or log occurs?
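
A sketch of what such an assertion could look like, using `assertLogs` from the standard library. The task body here is a hypothetical simplification (a per-experiment loop that logs and continues), and the logger name `monitoring_sketch` is made up; the real test would target the logger used in `experimenter.jetstream.tasks`.

```python
import logging
import unittest

logger = logging.getLogger("monitoring_sketch")


def fetch_monitoring_data(update_experiment, slugs):
    """Hypothetical task body: log per-experiment failures and continue."""
    updated = []
    for slug in slugs:
        try:
            update_experiment(slug)
            updated.append(slug)
        except Exception as e:
            logger.error(f"Failed to update monitoring data for {slug}: {e}")
    return updated


class FetchMonitoringDataTests(unittest.TestCase):
    def test_logs_and_continues_on_unexpected_error(self):
        def update(slug):
            if slug == "bad":
                raise RuntimeError("boom")

        # assertLogs fails the test if no ERROR record is emitted.
        with self.assertLogs("monitoring_sketch", level="ERROR") as captured:
            updated = fetch_monitoring_data(update, ["bad", "good"])

        self.assertEqual(updated, ["good"])
        self.assertEqual(len(captured.output), 1)
        self.assertIn("bad", captured.output[0])
```

Asserting on both the log record and the surviving update checks the two halves of the "should log and continue" comment, rather than only that nothing was raised.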

try:
    experiment = NimbusExperiment.objects.get(
        slug=exp_slug,
        status=NimbusConstants.Status.LIVE,
Contributor

This should work for COMPLETE also, no?

@patch("experimenter.jetstream.tasks.get_monitoring_data")
def test_fetch_monitoring_data_updates_live_experiment(self, mock_get_data):
    experiment = NimbusExperimentFactory.create(
        status=NimbusExperiment.Status.LIVE,
Contributor

Maybe parametrize this so it takes both LIVE and COMPLETE.
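
A rough sketch of the parametrized shape, using `subTest` from the standard library so it runs on its own; a decorator such as `parameterized.expand` from the third-party `parameterized` package would be an alternative. `Status`, `MONITORED`, and `apply_monitoring_data` are hypothetical stand-ins, and experiments are plain dicts here. In the real task, the lookup would presumably become a `status__in=[...]` filter instead of a single `status=`.

```python
import unittest
from enum import Enum


class Status(Enum):
    """Hypothetical stand-in for NimbusExperiment.Status."""

    DRAFT = "Draft"
    LIVE = "Live"
    COMPLETE = "Complete"


MONITORED = (Status.LIVE, Status.COMPLETE)


def apply_monitoring_data(experiments, data):
    """Attach data to experiments whose status is LIVE or COMPLETE."""
    for exp in experiments:
        if exp["status"] in MONITORED and exp["slug"] in data:
            exp["monitoring_data"] = data[exp["slug"]]


class UpdateTests(unittest.TestCase):
    def test_updates_monitored_statuses(self):
        for status in MONITORED:
            # Each status is reported separately if it fails.
            with self.subTest(status=status):
                exp = {"slug": "exp", "status": status, "monitoring_data": {}}
                apply_monitoring_data([exp], {"exp": {"enrollments": 5}})
                self.assertEqual(exp["monitoring_data"], {"enrollments": 5})

    def test_skips_other_statuses(self):
        exp = {"slug": "exp", "status": Status.DRAFT, "monitoring_data": {}}
        apply_monitoring_data([exp], {"exp": {"enrollments": 5}})
        self.assertEqual(exp["monitoring_data"], {})
```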

Co-authored-by: Mike Williams <102263964+mikewilli@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

  • Experimenter - Add enrollment alert thresholds and message templates
  • Add enrollment alert data GCS reader and Celery fetch task