Skip to content

Conversation

@vossmjp
Copy link
Contributor

@vossmjp vossmjp commented May 1, 2025

This PR describes changes to the experimental dynamic selection design to improve the tools to create custom backends and custom policies for dynamic selection.

Major Changes:

  • addition of policy_base to handle much of the boilerplate code for new custom policies, and default_backend to do the same for new custom backends.
  • addition of ResourceAdapter support provides the ability to serve different flavors of resource with the same backend. (sycl::queue vs sycl::queue*)

@vossmjp vossmjp requested review from danhoeflinger and egfefey May 1, 2025 21:16
The `get_submission_group` is not easily implemented in a meaningful way. It must return a type that
defines a member function `wait`. But since there is no way to know how to wait on all previous submissions
for an arbitrary backend resource, this function will likely need to return a dummy type or
`get_submission_group` be undefined for the default backend.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we could provide a way to have the user give us the wait type, do we have a shot? Or maybe it's better to leave this out to make the default case simple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets consider if there's a meaningful default or not. I'm not sure that there is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a meaningful default is to call wait on all resources, if they have a wait function.

@akukanov akukanov added the RFC label May 6, 2025
Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks pretty good. Mostly just quite minor changes for now.

Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first pass at feedback

@vossmjp vossmjp changed the title WIP: Custom backends for Dynamic Selection Custom backends and policies for Dynamic Selection Aug 18, 2025
@vossmjp vossmjp changed the title Custom backends and policies for Dynamic Selection RFC: Custom backends and policies for Dynamic Selection Aug 18, 2025
@vossmjp vossmjp marked this pull request as ready for review August 18, 2025 21:37
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
- `b` an arbitrary identifier of type `T`
- `args` an arbitrary parameter pack of types `typename… Args`
- `s` is of type `S` and satisfies *Selection* and `is_same_v<resource_t<S>, resource_t<T>>` is `true`
- `f` a function object with signature `/*ret_type*/ fun(resource_t<T>, Args…);`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should "fun" be "f"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here and in another place, also fixed ret_type.


## Proposed Design to Enable Easier Customization of Backends

This proposal presents a flexible backend system based on a `backend_base` template class and a `core_resource_backend` template that can be used for most resource types. For resource backends supporting reporting requirements, explicit specialization of the `core_resource_backend` should exist for that resource. This is not possible to be done generically. For simple types serving policies without reporting requirements, use of the generically written `core_resource_backend` may be possible.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This proposal presents a flexible backend system based on a `backend_base` template class and a `core_resource_backend` template that can be used for most resource types. For resource backends supporting reporting requirements, explicit specialization of the `core_resource_backend` should exist for that resource. This is not possible to be done generically. For simple types serving policies without reporting requirements, use of the generically written `core_resource_backend` may be possible.
This proposal presents a flexible backend system that can be used for most resource types. It is based on two template classes: `backend_base` and `core_resource_backend`. For resource backends supporting reporting requirements, explicit specialization of the `core_resource_backend` is necessary since reporting cannot be done generically. For simple types serving policies without reporting requirements, use of the generically written `core_resource_backend` may be possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taken


### Key Components

1. **`backend_base<ResourceType>`**: A proposed base class template that implements core backend functionality to inherit from.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. **`backend_base<ResourceType>`**: A proposed base class template that implements core backend functionality to inherit from.
1. **`backend_base<ResourceType>`**: A proposed base class template that implements common backend functionality to inherit from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have a "core_resource_backend" lets not overload the meaning of "core".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed this and in other places for backends.


### Reporting Requirements and Scratch Space Contract

Backends must now explicitly accept a (possibly empty) variadic list of reporting requirements describing the execution information the Policy will need. These reporting requirements are the same `execution_info` tag types used elsewhere in the Dynamic Selection API (for example `execution_info::task_time_t`, `execution_info::task_submission_t`, `execution_info::task_completion_t`).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Backends must now explicitly accept a (possibly empty) variadic list of reporting requirements describing the execution information the Policy will need. These reporting requirements are the same `execution_info` tag types used elsewhere in the Dynamic Selection API (for example `execution_info::task_time_t`, `execution_info::task_submission_t`, `execution_info::task_completion_t`).
Backends must accept a (possibly empty) variadic list of reporting requirements describing the execution information the Policy will need. These reporting requirements are the same `execution_info` tag types used elsewhere in the Dynamic Selection API (for example `execution_info::task_time_t`, `execution_info::task_submission_t`, `execution_info::task_completion_t`).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"now" and "explicitly" seem unnecessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taken

- Compile-time checks: The backend implementation should `static_assert` if any type in `ReportingReqs...` is not a supported reporting requirement for that backend. This makes unsupported combinations a hard error at compile time.
- Resource filtering: Some reporting requirements imply properties of the underlying resource or device (for example, timing via `task_time_t` may require device support for profiling tags and queues created with profiling enabled). Backends must examine the provided resources (or query devices when default-initializing resources) and filter out any resources that do not support all requested reporting requirements. Any special resource properties required to implement a reporting requirement must be checked here (for instance, checking `device.has(sycl::aspect::ext_oneapi_queue_profiling_tag)` in addition to creating queues with `sycl::property::queue::enable_profiling()` when `task_time_t` is requested). If after filtering the set of candidate resources there are no resources left that satisfy all requested reporting requirements, the backend must throw a `std::runtime_error` documenting that the requested reporting requirements cannot be satisfied on the available resources.
Copy link
Contributor Author

@vossmjp vossmjp Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"may require device support for profiling tags and queues created with profiling enabled" this is very SYCL specific. Profiling tags or profiling enabled has no generic meaning. Can you be more generic, for example "may limit to only devices that support profiling".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rewritten to be more generic

};
```

When a policy tracks execution information (like task timing or completion), the backend needs to store temporary data with each selection. For example, the SYCL backend's `scratch_space_t<execution_info::task_time_t>` includes an extra `sycl::event` to store the "start" profiling tag needed to measure elapsed time. For policies without reporting requirements, `scratch_space_t<>` should be empty (or inherit from `no_scratch_t<>`), adding no overhead.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When a policy tracks execution information (like task timing or completion), the backend needs to store temporary data with each selection. For example, the SYCL backend's `scratch_space_t<execution_info::task_time_t>` includes an extra `sycl::event` to store the "start" profiling tag needed to measure elapsed time. For policies without reporting requirements, `scratch_space_t<>` should be empty (or inherit from `no_scratch_t<>`), adding no overhead.
When a policy tracks execution information (like task timing or completion), the backend may need to store temporary data with each selection. For example, the SYCL backend's `scratch_space_t<execution_info::task_time_t>` includes an extra `sycl::event` to store the "start" profiling tag needed to measure elapsed time. For policies without reporting requirements, `scratch_space_t<>` should be empty (or inherit from `no_scratch_t<>`), adding no overhead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taken

Returns the vector of resources stored during construction. Resources must be copyable/movable into a `std::vector` and remain valid throughout the backend's lifetime.
#### `get_submission_group()` Implementation
Returns a group object that waits on all resources. The `default_submission_group` attempts to call `wait()` on the `CoreResourceType` by applying `adapter` to the `ResourceType` object. The `CoreResourceType` must provide a `wait()` method that blocks until all work on that resource is complete. Note that the default implementation waits on each resource, not each submission (works for SYCL queues, oneTBB `task_group` objects, etc.). Adapters can enable types without a `wait()` method; for example, `[](auto pointer){ return *pointer; }` allows `sycl::queue*` to work by adapting to `sycl::queue`.
Copy link
Contributor Author

@vossmjp vossmjp Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not completely accurate. get_submission_group should wait for all outstanding submissions made via this backend instance. One way to accomplish that is to wait on all resources, such as all queues. But we shouldn't require that it waits on the resources, only that all submissions are done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore this previous comment, I was reacting to the 1st sentence as if it described the purpose of get_submission_group. But in context, this falls under the default implementation section and given the text at the end of the paragraph, it's ok.

Policies specify reporting requirements (`task_time_t`, `task_submission_t`, `task_completion_t` from `oneapi::dpl::experimental::execution_info`) via the `policy_base` constructor. These are passed to the backend constructor, which filters devices based on feature availability. Examples: `auto_tune_policy` requires `task_time_t`; `dynamic_load_policy` requires `task_submission_t` and `task_completion_t`.
Selection handles must include a `scratch_space` member of type `backend_traits<Backend>::template selection_scratch_t<reqs...>` to provide backend storage for instrumentation.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are selection handles described anywhere? For a policy that needs task_time_t, task_submission_t or task_completion_t reported its not clear how those values are received by the policy via the handle. I suppose the example demonstrates it though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a good point, I think its a term that is under described, its been on the edge of "implementation details" where "selection" is actually defined, but with the addition of selection_scratch_t, I think we need to define it more clearly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a selection handle section to custom_policies.md and then linked to it from custom_backends.md.

Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Signed-off-by: Dan Hoeflinger <[email protected]>
Copy link
Contributor Author

@vossmjp vossmjp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I can't click approve since I'm one of the authors, but I approve!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants