Adaptive RSS auto-scaling #624
📝 Walkthrough
This PR implements adaptive RSS queue autoscaling for DPDK network ports. Configuration is added via a new …
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modules/infra/control/worker.c (1)
401-446: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Division-by-zero risk when affinity resolves to zero datapath CPUs.
n_cpus is used as a modulo divisor before any non-zero guard. If affinity is empty, this crashes in queue distribution.
Suggested fix
size_t n_cpus = vec_len(cpus);
+ if (n_cpus == 0) {
+ LOG(ERR, "no datapath CPUs available in affinity mask");
+ ret = -EINVAL;
+ goto end;
+ }
size_t fwd_cursor = 0;
size_t rev_cursor = (n_cpus > 0) ? (n_cpus - 1) : 0;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modules/infra/control/worker.c` around lines 401 - 446, n_cpus can be zero causing a divide-by-zero in the RX queue assignment loops (uses n_cpus as modulo); add an early guard after size_t n_cpus = vec_len(cpus); that checks if (n_cpus == 0) and handles it (for example log an error via errno_log or processLogger, and return/goto end or otherwise skip queue-to-CPU distribution) so subsequent uses of n_cpus in the rxq loop and in computing off/idx do not perform modulo zero; update any related variables (rev_cursor/fwd_cursor) only after the non-zero check and keep references to cpus, affinity, CPU_COUNT and port->n_rxq to locate the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5c88068d-31aa-4d0f-abfc-f4afe9d11bde
📒 Files selected for processing (17)
- main/config.h
- main/main.c
- modules/infra/api/gr_infra.h
- modules/infra/api/meson.build
- modules/infra/api/rss_autoscale.c
- modules/infra/cli/meson.build
- modules/infra/cli/rss_autoscale.c
- modules/infra/control/meson.build
- modules/infra/control/port_scale.c
- modules/infra/control/port_scale.h
- modules/infra/control/rss_autoscale.c
- modules/infra/control/rss_autoscale.h
- modules/infra/control/worker.c
- modules/infra/control/worker.h
- modules/infra/control/worker_test.c
- modules/infra/datapath/main_loop.c
- modules/infra/datapath/port_rx.c
ret = rss_autoscale_set_cap(port_id, (int16_t)req->n);
return api_out(ret < 0 ? -ret : 0, 0, NULL);
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify type/range mismatch across API contract and handlers.
rg -n --type=c 'struct gr_rss_autoscale_set_req|uint16_t n;' modules/infra/api/gr_infra.h
rg -n --type=c '\(int16_t\)req->n|rss_autoscale_set_(cap|floor)\(' modules/infra/api/rss_autoscale.c
rg -n --type=c 'ec_node_uint\("N",\s*1,\s*UINT16_MAX' modules/infra/cli/rss_autoscale.c
Repository: DPDK/grout
Length of output: 490
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find where rss_autoscale_set_cap/floor are defined
rg -n --type=c "rss_autoscale_set_(cap|floor)\s*\(" modules/infra -S
# Show function bodies around the definitions (small window)
rg -n --type=c "static .*rss_autoscale_set_(cap|floor)\s*\(" modules/infra -S || true
# Also search entire repo if not found under modules/infra
rg -n --type=c "rss_autoscale_set_(cap|floor)\s*\(" -S --glob='*.c' --glob='*.h' || true
Repository: DPDK/grout
Length of output: 1610
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show function bodies for the relevant setters
nl -ba modules/infra/control/rss_autoscale.c | sed -n '140,240p'
Repository: DPDK/grout
Length of output: 97
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Dump the relevant function bodies with line numbers (no nl available)
cat -n modules/infra/control/rss_autoscale.c | sed -n '140,240p'
Repository: DPDK/grout
Length of output: 3828
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check whether rss_autoscale.c already references EINVAL / errno.h
rg -n "EINVAL|errno\.h" modules/infra/api/rss_autoscale.c
Repository: DPDK/grout
Length of output: 150
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find api_out definition and how it's used (error sign conventions)
rg -n --type=c "api_out\s*\(" modules/infra/api/rss_autoscale.c
rg -n --type=c "struct api_out|api_out\s*\{" -S modules/infra/api || true
rg -n --type=c "api_out\s*\(" modules/infra/api -S --glob='*.c'
rg -n --type=c "api_out" modules/infra/api -S --glob='*.h'
Repository: DPDK/grout
Length of output: 8293
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate api_out definition (function or macro)
rg -n --type=c "api_out\s*\(" modules/infra/api -S --glob='*.c'
rg -n --type=c "api_out" modules/infra/api -S --glob='*.h'
rg -n --type=c "struct api_out" modules/infra/api -S --glob='*.h' --glob='*.c'
# Try to find header where api_out/errno conventions are defined
rg -n --type=c "api_out" modules/infra -S --glob='*.h' --glob='*.c'
Repository: DPDK/grout
Length of output: 4613
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for definition of api_out (struct/macro/function)
rg -n --type=c "api_out\s*=" -S .
rg -n --type=c "api_out\s*\(" -S --glob='*.h' --glob='*.c'
rg -n --type=c "struct\s+api_out" -S .
rg -n --type=c "static\s+inline\s+.*api_out|inline\s+.*api_out|#define\s+api_out" -S .
Repository: DPDK/grout
Length of output: 36
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Broader search for api_out definition/usages
rg -n "api_out" -S --glob='*.h' --glob='*.c' modules/infra . | head -n 200
Repository: DPDK/grout
Length of output: 16042
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find the port state struct definition to confirm types of policy_cap/policy_floor
rg -n "struct rss_autoscale_port_state" -S modules/infra/control/rss_autoscale.c modules/infra/control/rss_autoscale.h
# Show a small window around the struct definition in the .c
rg -n "policy_(cap|floor)" modules/infra/control/rss_autoscale.c | head -n 50
Repository: DPDK/grout
Length of output: 2983
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show policy defaults + setter parts
cat -n modules/infra/control/rss_autoscale.c | sed -n '20,110p'
# Show clamp_to_policy and effective_* helpers showing negative == unset/cleared semantics
cat -n modules/infra/control/rss_autoscale.c | sed -n '140,220p'
cat -n modules/infra/control/rss_autoscale.c | sed -n '260,320p'
Repository: DPDK/grout
Length of output: 9323
🏁 Script executed:
#!/bin/bash
set -euo pipefail
cat -n modules/infra/api/rss_autoscale.c | sed -n '1,140p'
Repository: DPDK/grout
Length of output: 4056
Validate req->n range before casting to int16_t in rss_autoscale API handlers.
In modules/infra/api/rss_autoscale.c the uint16_t n request field is narrowed to int16_t in:
- rss_autoscale_cap_set_handler (lines 63-64)
- rss_autoscale_floor_set_handler (lines 83-84)
For values > INT16_MAX, the cast wraps negative; since policy_cap/policy_floor use negative values to mean “unset/cleared” (and policy application checks > 0), clients can silently clear bounds instead of applying the requested cap/floor.
🔧 Proposed fix
static struct api_out rss_autoscale_cap_set_handler(const void *request, struct api_ctx *) {
const struct gr_rss_autoscale_set_req *req = request;
uint16_t port_id;
int ret = port_id_from_iface_id(req->iface_id, &port_id);
if (ret < 0)
return api_out(-ret, 0, NULL);
+ if (req->n > INT16_MAX)
+ return api_out(EINVAL, 0, NULL);
ret = rss_autoscale_set_cap(port_id, (int16_t)req->n);
return api_out(ret < 0 ? -ret : 0, 0, NULL);
}
@@
static struct api_out rss_autoscale_floor_set_handler(const void *request, struct api_ctx *) {
const struct gr_rss_autoscale_set_req *req = request;
uint16_t port_id;
int ret = port_id_from_iface_id(req->iface_id, &port_id);
if (ret < 0)
return api_out(-ret, 0, NULL);
+ if (req->n > INT16_MAX)
+ return api_out(EINVAL, 0, NULL);
ret = rss_autoscale_set_floor(port_id, (int16_t)req->n);
return api_out(ret < 0 ? -ret : 0, 0, NULL);
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modules/infra/api/rss_autoscale.c` around lines 63 - 64, The handlers
rss_autoscale_cap_set_handler and rss_autoscale_floor_set_handler currently cast
req->n (uint16_t) to int16_t which can wrap for values > INT16_MAX; before
calling rss_autoscale_set_cap / rss_autoscale_set_floor validate that req->n <=
INT16_MAX and reject/return a suitable api_out error if it exceeds that limit
(do not perform the cast on out-of-range values), otherwise safely cast to
int16_t and proceed; reference policy_cap/policy_floor semantics when
constructing the error to avoid accidental “unset/cleared” behavior.
struct rss_autoscale_port_state *s = state_ensure(port_id);
if (s == NULL)
return -ENOMEM;
caps_ensure(s);
if (n_active)
*n_active = s->n_active;
if (n_load_recommended)
*n_load_recommended = s->n_load_recommended;
if (cap_effective)
*cap_effective = s->policy_cap;
if (floor_effective)
*floor_effective = s->policy_floor;
if (max_n)
*max_n = s->caps.max_n;
if (min_n)
*min_n = (s->caps.allowed_count > 0) ? s->caps.allowed_n[0] : 0;
return 0;
Ignored caps_ensure() failure in rss_autoscale_port_state_get.
Line 466 calls caps_ensure(s) but ignores its return value. If capability discovery fails (e.g., -ENODEV or -ENOTSUP), the function proceeds to return potentially uninitialized or stale caps fields (max_n, allowed_n[0]), which could mislead API consumers.
Consider either propagating the error or documenting that caps fields are only valid when supports_scale is true.
Proposed fix
int rss_autoscale_port_state_get(
uint16_t port_id,
uint16_t *n_active,
uint16_t *n_load_recommended,
int16_t *cap_effective,
int16_t *floor_effective,
uint16_t *max_n,
uint16_t *min_n
) {
struct rss_autoscale_port_state *s = state_ensure(port_id);
if (s == NULL)
return -ENOMEM;
- caps_ensure(s);
+ int ret = caps_ensure(s);
+ // caps fields valid only if ret == 0 or s->caps.supports_scale is true
if (n_active)
*n_active = s->n_active;
if (n_load_recommended)
*n_load_recommended = s->n_load_recommended;
if (cap_effective)
*cap_effective = s->policy_cap;
if (floor_effective)
*floor_effective = s->policy_floor;
- if (max_n)
- *max_n = s->caps.max_n;
- if (min_n)
- *min_n = (s->caps.allowed_count > 0) ? s->caps.allowed_n[0] : 0;
+ if (max_n)
+ *max_n = (ret == 0) ? s->caps.max_n : 0;
+ if (min_n)
+ *min_n = (ret == 0 && s->caps.allowed_count > 0) ? s->caps.allowed_n[0] : 0;
return 0;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modules/infra/control/rss_autoscale.c` around lines 463 - 479, The call to
caps_ensure(s) inside rss_autoscale_port_state_get is ignored; detect and handle
its return value instead of proceeding with possibly-uninitialized caps fields.
Update rss_autoscale_port_state_get to capture int rc = caps_ensure(s) and if rc
< 0 return rc (or, alternatively, clear/zero the caps-related out-params and
return 0 only when s->supports_scale is true) so that max_n and min_n
(allowed_n[0]) are not read/returned when capability discovery failed; reference
the function rss_autoscale_port_state_get, the caps_ensure(s) call, and the caps
field accesses (s->caps.max_n and s->caps.allowed_n[0]) when making the change.
// Combined per-rx_burst datapath bookkeeping: histogram + rss-autoscale
// saturation/idle tracking. Called from every rx_*_process variant.
static inline void rx_burst_done(uint16_t port_id, uint16_t queue_id, uint16_t n_pkts) {
rx_burst_histogram_inc(port_id, n_pkts);
rxq_health_update(port_id, queue_id, n_pkts);
}
Autoscale health tracking is not wired to all RX process paths.
rx_burst_done() is not used in rx_bond_offload_process() and rx_bond_process(), so those queues never update rxq_health. That can mislead the autoscale controller and cause incorrect park/scale actions on bonded traffic.
Suggested fix
uint16_t
rx_bond_offload_process(struct rte_graph *graph, struct rte_node *node, void **, uint16_t) {
@@
rx = rte_eth_rx_burst(ctx->rxq.port_id, ctx->rxq.queue_id, mbufs, ctx->burst_size);
+ rx_burst_done(ctx->rxq.port_id, ctx->rxq.queue_id, rx);
if (rx == 0)
return 0;
@@
uint16_t rx_bond_process(struct rte_graph *graph, struct rte_node *node, void **, uint16_t) {
@@
rx = rte_eth_rx_burst(ctx->rxq.port_id, ctx->rxq.queue_id, mbufs, ctx->burst_size);
+ rx_burst_done(ctx->rxq.port_id, ctx->rxq.queue_id, rx);
if (rx == 0)
return 0;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modules/infra/datapath/port_rx.c` around lines 65 - 70, rx_burst_done (which
updates histogram and rxq_health) isn't called from the bonded RX paths, so
rxq_health isn't updated for bonded queues; modify rx_bond_offload_process and
rx_bond_process to invoke rx_burst_done(port_id, queue_id, n_pkts) after their
per-burst packet handling (i.e., where they currently account for n_pkts or
finish processing a received burst) so bonded queues participate in histogram
and autoscale health tracking the same as other rx_*_process variants.
Add a thin abstraction layer (port_scale.{c,h}) that exposes a uniform
API to dynamically restrict the HW RSS distribution of a port to its
first N RX queues:
  port_scale_caps_get(port, &caps)    -> queries reta_size and the set of
                                         N values the PMD will accept
  port_scale_apply(port, n)           -> builds a uniform reta[i]=i%n and
                                         calls rte_eth_dev_rss_reta_update
  port_scale_caps_next/prev(caps, n)  -> walk the HW-allowed N list
The implementation is portable: it works on any PMD that advertises a
non-zero reta_size (Intel ixgbe/i40e/ice, Mellanox mlx5, Broadcom bnxt)
and on DPAA2 once the matching RETA emulation lands in the PMD. For
DPAA2 specifically, the caps query exposes the discrete dist_size list
{1, 2, 3, 4, 6, 7, 8, 12, ...} so the controller never asks the PMD for
a value it would reject with -ENOTSUP.
This commit only adds the abstraction. Default port behaviour is
unchanged: no RETA narrowing happens until the rss-autoscale controller
in a follow-up commit subscribes to GR_EVENT_IFACE_POST_ADD and pins
ports when --rss-autoscale is set.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
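The two core operations of this commit message, the uniform table and the walk over the allowed N list, can be sketched in plain C. This is an illustration under stated assumptions, not the code in port_scale.c: the names `reta_fill_uniform`, `allowed_next` and `allowed_prev` are hypothetical, the DPDK call to rte_eth_dev_rss_reta_update is left out, and staying put at either end of the list is an assumed behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Fill a RETA of reta_size entries so that only queues 0..n-1 receive traffic.
static void reta_fill_uniform(uint16_t *reta, size_t reta_size, uint16_t n) {
	assert(reta != NULL && n > 0);
	for (size_t i = 0; i < reta_size; i++)
		reta[i] = (uint16_t)(i % n);
}

// Smallest allowed value strictly greater than n, or n when already at the top.
// `allowed` must be sorted in ascending order.
static uint16_t allowed_next(const uint16_t *allowed, size_t count, uint16_t n) {
	for (size_t i = 0; i < count; i++)
		if (allowed[i] > n)
			return allowed[i];
	return n;
}

// Largest allowed value strictly smaller than n, or n when already at the bottom.
static uint16_t allowed_prev(const uint16_t *allowed, size_t count, uint16_t n) {
	for (size_t i = count; i > 0; i--)
		if (allowed[i - 1] < n)
			return allowed[i - 1];
	return n;
}
```

With the DPAA2-style list {1, 2, 3, 4, 6, 7, 8, 12}, stepping up from 4 yields 6 rather than 5, which is the point of walking the list instead of incrementing N.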
493e1e2 to ff12fe0
Introduce worker_park() / worker_unpark() so the control plane can
suspend a dataplane worker between scaling events. Parked workers
futex_wait() on a shared atomic word, leaving cpu-user time at zero and
letting cpuidle (cpu-pw15 on Cortex-A72) engage as the dominant state --
the prerequisite for load-proportional CPU consumption.
The mechanism is cooperative: the worker checks atomic_load(&paused) at
every housekeeping interval (every HOUSEKEEPING_INTERVAL graph walks),
and if set, drops out of the RCU online region and calls
futex_wait(..., FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 1, 1s_timeout). The 1s
timeout makes shutdown and graph reconfig events visible within one
second even while parked, without paying any wakeup cost during the
wait.
worker_unpark() stores 0 and issues a FUTEX_WAKE; the race with
futex_wait is handled by the kernel (EAGAIN when the value already
changed).
This is opt-in -- workers that are never park()ed see no change in
behavior. busy-poll and usleep semantics in the housekeeping branch are
preserved.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
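The park/unpark handshake described above can be sketched with the raw Linux futex syscall. This is a minimal, Linux-only illustration under assumptions: `struct park_word`, `park()`, `unpark()` and `wait_if_parked()` are hypothetical names, the timeout is a parameter rather than the fixed one second, and the RCU offline/online transitions around the wait are omitted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

// Hypothetical stand-in for the worker's park word.
struct park_word {
	_Atomic uint32_t paused; // 1 = parked, 0 = running
};

static long futex_call(_Atomic uint32_t *addr, int op, uint32_t val, const struct timespec *ts) {
	return syscall(SYS_futex, addr, op, val, ts, NULL, 0);
}

// Control plane: ask the worker to park at its next housekeeping check.
static void park(struct park_word *w) {
	atomic_store(&w->paused, 1);
}

// Control plane: clear the flag, then wake the (single) waiter if any.
static void unpark(struct park_word *w) {
	atomic_store(&w->paused, 0);
	futex_call(&w->paused, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1, NULL);
}

// Worker side, called from the housekeeping branch.
// Returns -ETIMEDOUT if the wait timed out while still parked, 0 otherwise
// (not parked, woken, or the value changed before the kernel could sleep).
static int wait_if_parked(struct park_word *w, long timeout_ms) {
	if (atomic_load(&w->paused) == 0)
		return 0;
	struct timespec ts = {
		.tv_sec = timeout_ms / 1000,
		.tv_nsec = (timeout_ms % 1000) * 1000000L,
	};
	// The kernel only sleeps while the word still equals 1; if unpark()
	// already stored 0 it fails with EAGAIN instead of blocking.
	long r = futex_call(&w->paused, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 1, &ts);
	if (r == -1 && errno == ETIMEDOUT)
		return -ETIMEDOUT;
	return 0;
}
```

A caller would loop on `wait_if_parked()` and re-check shutdown or reconfig flags after every timeout, which is what makes those events visible while parked.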
61cba06 to 779119c
Introduce rss-autoscale: an opt-in controller that dynamically
narrows the HW RSS distribution per port to fewer active queues
when the workers have headroom, and parks the workers that no
longer receive packets via the futex primitive added in the
previous commit.
CLI: --rss-autoscale=N at startup, where N is the cluster grouping
size used to align scaling steps:
0 = disabled (legacy: all workers always poll all configured
queues, no parking, no RETA scale).
1 = enabled, per-core scaling (no cluster constraint).
2 = enabled, scale by groups of 2 (x86 with hyperthreading or
any SoC where 2 cores share an L2 cache).
N = scale by groups of N for larger cache-sharing topologies.
Scale-up is event-driven. Each rxq publishes a saturated flag via
rxq_health_update in the rx datapath after RSS_AUTOSCALE_CONSEC_FULL
consecutive full bursts; the transition wakes the control thread,
which steps the port to the next allowed dist-size in microseconds.
Scale-down is timer-driven per port. A libevent EV_PERSIST timer
fires every SCALE_DOWN_PERIOD_S (10 minutes) per port and steps
the port one allowed dist-size down when the aggregated busy/total
cycle ratio of the workers servicing it has stayed below 40% since
the previous fire. The timer is rearmed on every do_apply so a
fresh 10-minute window always separates a scale-up from the next
scale-down candidate, providing intrinsic anti-flapping.
External knobs (rss_autoscale_set_cap / rss_autoscale_set_floor,
exposed in the follow-up commit) carry an upper bound (thermal
shed) and a lower bound (force load for characterisation or burst
pre-arm). Both accept zero to clear. Values are validated against
the PMD's allowed dist-size set at API entry. When both are set
and cap < floor, cap wins as a safety override.
On GR_EVENT_IFACE_POST_ADD for a port whose PMD supports RETA and
exposes at least two cluster_size-aligned dist-sizes, the controller
narrows the port to N = cluster_size active queues. Ports that only
expose a single usable N (or no RETA at all) are left unmanaged: the
PMD's default HW RSS still distributes their traffic and their
workers are never parked. The event-driven path scales up as soon as
load justifies it.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
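Two rules from this commit message lend themselves to a short sketch: the cap/floor clamp where the cap wins, and the 40% busy-ratio test for scale-down. The names `clamp_sketch` and `should_scale_down` are hypothetical, 0 is taken to mean "bound not set" as the message states, and the snapping of the result to the PMD's allowed dist-size list is left out.

```c
#include <stdint.h>

// Clamp a candidate active-queue count to the policy bounds.
// A bound of 0 means "not set".
static uint16_t clamp_sketch(uint16_t n, uint16_t floor, uint16_t cap) {
	if (floor > 0 && n < floor)
		n = floor;
	if (cap > 0 && n > cap)
		n = cap; // applied last, so the cap wins when cap < floor
	return n;
}

// True when busy/total is strictly below 40% over the observation window.
// Written as 5*busy < 2*total to stay in integer arithmetic.
// Assumption: an empty window (total == 0) never triggers a scale-down.
static int should_scale_down(uint64_t busy_cycles, uint64_t total_cycles) {
	if (total_cycles == 0)
		return 0;
	return 5 * busy_cycles < 2 * total_cycles;
}
```

For example, with floor = 6 and cap = 2 a candidate of 3 is first raised to 6 and then cut back to 2, which matches the "cap wins as a safety override" rule.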
Change worker_queue_distribute() to lay out per-port RX queues using a
"butterfly-reverse" pattern: even-indexed ports walk cpus[] forward from
a global fwd_cursor, odd-indexed ports walk it backward from a global
rev_cursor. Both cursors advance across ports so every worker still
receives some rxq even when n_workers exceeds n_rxq_per_port (the naive
"fill from each end" variant would otherwise leave middle CPUs idle).
Endpoint placement: port 0 q0 -> cpus[0], port 1 q0 -> cpus[n-1]. On x86
with hyperthreading (where sibling threads share L1 and L2), this puts
the two endpoint workers on different physical cores, giving each
exclusive use of its sibling thread and its caches. At rss-autoscale
dist_size=1 on every port, the two active workers can then run flat-out
while the rest of the dataplane CPUs idle. As load grows (dist_size 2,
4, ...) each port expands into a disjoint range of cpus until
n_active_total > n_workers, at which point cross-port workers reappear
-- at full load the cache exclusivity benefit no longer matters.
Direction is keyed on port_idx (position in grout's port vec). The vec
ordering is the algorithm's invariant: stable across iterations, owned
by grout, and enumerates every configured port exactly once.
Update worker_test.c expectations for both queue_distribute_reduce and
queue_distribute_increase to match the new layout. Both tests exercise
the global-cursor advancement.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
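The layout described above can be sketched as index arithmetic over cpus[]. This is a hypothetical reconstruction, not the code in worker.c: `struct cursors` and `distribute_port()` are illustrative names, and advancing each cursor by the port's queue count modulo n_cpus is an assumption. The early n_cpus check mirrors the division-by-zero guard requested in the review.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct cursors {
	size_t fwd; // next forward slot, initialised to 0
	size_t rev; // next backward slot, initialised to n_cpus - 1
};

// Write the cpus[] index chosen for each queue of one port into out[0..n_rxq-1].
static int distribute_port(size_t port_idx, uint16_t n_rxq, size_t n_cpus,
			   struct cursors *c, size_t *out) {
	if (n_cpus == 0)
		return -EINVAL; // nothing to distribute on, and avoids modulo by zero
	int forward = (port_idx % 2 == 0);
	for (uint16_t q = 0; q < n_rxq; q++) {
		size_t off = q % n_cpus;
		out[q] = forward ? (c->fwd + off) % n_cpus
				 : (c->rev + n_cpus - off) % n_cpus;
	}
	// Assumed step: move the cursor past the slots this port just used.
	size_t step = n_rxq % n_cpus;
	if (forward)
		c->fwd = (c->fwd + step) % n_cpus;
	else
		c->rev = (c->rev + n_cpus - step) % n_cpus;
	return 0;
}
```

With 4 CPUs and 2 queues per port, this places port 0 on cpus[0] and cpus[1], port 1 on cpus[3] and cpus[2], and port 2 on cpus[2] and cpus[3], so the endpoints match the "port 0 q0 -> cpus[0], port 1 q0 -> cpus[n-1]" rule.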
Let the operator bound the controller's RX queue scaling per port with
two attributes set through "grcli interface set", like the MTU: rss-cap
(upper bound on the active queue count) and rss-floor (lower bound); 0
leaves a bound unset. The cap sheds load during thermal events, the
floor keeps a minimum number of workers active even without traffic.
When both are set and cap < floor, the cap wins.
They live in the port configuration (iface_info_port) rather than in
the controller, so the surgical set_attrs update keeps them across an
unrelated reconfig. A value is validated against the controller's
usable distribution sizes -- the PMD's advertised sizes after the
cluster_size filtering, shared with caps_ensure -- and against the
target queue count when the same request also changes nb-rxqs. Both are
shown in "interface show" and "rss-autoscale show".
Reconfiguration is handled in place: since only the queue count changes
the usable grid, the controller rebuilds just its caps and re-clamps
the current active count, applying only when it moves. An unrelated
reconfig no longer resets n_active, rxq_health and the scale-down timer
to baseline.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
When rss-autoscale narrows a port's RETA to N queues, queues N..max no
longer receive any traffic, yet the worker that owns them keeps polling
them on every graph walk as long as it stays awake for its other, still
active queues. On DPAA2 an empty rte_eth_rx_burst is not free: the
qbman volatile-dequeue round-trip costs several hundred CPU cycles, so
these dead polls steal time from the busy queues sharing the core.
Give a deactivated queue a no-op process callback, rx_inactive_process,
so the graph walk visits the node but issues no poll. The callback is
chosen at graph build time from the port's current active count, and
swapped in/out at scale time on each worker's live graph by
worker_graph_rxq_set_active() without a graph reload -- the same
node->process swap DPDK uses for pcap. The store is an aligned pointer
write and the worker reads either the old or the new callback, both
valid.
Factor the rx process-callback selection into rx_process_for() so the
build path and the runtime swap cannot drift.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
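The callback swap described above can be illustrated without DPDK. In this sketch every name (`rx_node_sketch`, `process_for`, `node_set_active`, `node_walk`) is hypothetical, the callbacks are stand-ins that take only a queue id, and C11 atomics replace the plain aligned pointer store mentioned in the commit message as a conservative substitute.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint16_t (*rx_process_fn)(uint16_t queue_id);

// Stand-in for a real poll: pretends a burst of 32 packets was received.
static uint16_t rx_poll_sketch(uint16_t queue_id) {
	(void)queue_id;
	return 32;
}

// No-op callback for a deactivated queue: the node is visited, nothing is polled.
static uint16_t rx_inactive_sketch(uint16_t queue_id) {
	(void)queue_id;
	return 0;
}

struct rx_node_sketch {
	_Atomic(rx_process_fn) process;
	uint16_t queue_id;
};

// Single selection point shared by the build path and the runtime swap.
static rx_process_fn process_for(uint16_t queue_id, uint16_t n_active) {
	return queue_id < n_active ? rx_poll_sketch : rx_inactive_sketch;
}

// Control plane: swap the callback on a live node after a scale event.
static void node_set_active(struct rx_node_sketch *node, uint16_t n_active) {
	atomic_store_explicit(&node->process, process_for(node->queue_id, n_active),
			      memory_order_release);
}

// Worker: one graph-walk visit; it sees either the old or the new callback.
static uint16_t node_walk(struct rx_node_sketch *node) {
	rx_process_fn fn = atomic_load_explicit(&node->process, memory_order_acquire);
	return fn(node->queue_id);
}
```

Keeping the selection in one function is what lets the build path and the runtime swap agree on which queues are live.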
779119c to 1a7b775
The power management algorithm per lcore in the SmartShare appliance is relatively simple: As input for the algorithm for each lcore, we look at the maximum number of objects processed by any "node" every time the "graph" is traversed. We are primarily targeting low power consumption during low traffic hours, and low scheduling latency during peak traffic hours. For long durations, we use nanosleep(), where the kernel puts the lcore in a low power state. We don't use any of the fancy power management hooks in DPDK. It's all done by our application.
Why rx_usleep doesn't scale
Grout's existing rx-side usleep(us) micro-sleep saves CPU at modest loads but breaks
at line rate:
- […] cap drops to a few microseconds → effectively zero. The worker busy-polls
  anyway, just with extra syscall overhead.
- […] overflow its ring before the worker wakes.
- […] long enough to matter, which is exactly the regime where the energy savings
  are negligible (the box was already idle).
What rss-autoscale does
Opt-in controller (--rss-autoscale=N, N = cluster grouping) that:
- […] current load (via rte_eth_dev_rss_reta_update, with a portable emulation
  on DPAA2 that maps reta[i] = i % N to dpni_set_rx_hash_dist).
- […] HW burst-cap counts as "full"; 4 consecutive full → saturated. 1000
  consecutive empty (n_pkts == 0) → idle. Mid-load (1..cap-1) is steady,
  no transition. Counters are local, only edges trigger an atomic_store +
  control_queue push to wake the controller (~10 µs reaction).
- […] branch-predicted no-op at top of rxq_health_update.
- […] daemons and ops: cap = upper bound, floor = lower bound. Cap wins on
  conflict.
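The saturation/idle detection in the list above can be sketched as a small per-queue state machine. The thresholds 4 and 1000 come from the description; everything else is an assumption: the names `rxq_health_sketch` and `health_update` are hypothetical, a mid-load burst is assumed to reset both counters and clear both flags, and the atomic store plus control-queue push is reduced to a boolean "edge" return value.

```c
#include <stdbool.h>
#include <stdint.h>

#define CONSEC_FULL_SKETCH 4
#define CONSEC_EMPTY_SKETCH 1000

struct rxq_health_sketch {
	uint32_t consec_full;
	uint32_t consec_empty;
	bool saturated;
	bool idle;
};

// Account for one RX burst. Returns true only when a flag changed (an edge),
// which is the moment the real datapath would wake the controller.
static bool health_update(struct rxq_health_sketch *h, uint16_t n_pkts, uint16_t burst_cap) {
	if (n_pkts >= burst_cap) { // full burst
		h->consec_empty = 0;
		if (h->consec_full < CONSEC_FULL_SKETCH)
			h->consec_full++;
	} else if (n_pkts == 0) { // empty burst
		h->consec_full = 0;
		if (h->consec_empty < CONSEC_EMPTY_SKETCH)
			h->consec_empty++;
	} else { // mid-load: neither streak continues
		h->consec_full = 0;
		h->consec_empty = 0;
	}
	bool saturated = h->consec_full >= CONSEC_FULL_SKETCH;
	bool idle = h->consec_empty >= CONSEC_EMPTY_SKETCH;
	bool edge = (saturated != h->saturated) || (idle != h->idle);
	h->saturated = saturated;
	h->idle = idle;
	return edge;
}
```

Because the counters saturate at their thresholds, a queue that stays full or empty produces exactly one edge rather than one notification per burst.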
Known limitations
- […] scale-up cannot redistribute it. The controller will scale up uselessly
  until cap/max, leaving 11 unparked workers.
- […] so changing N reshuffles all flows. Scale events should be rare. No
  mitigation yet.
- […] parsing in libecoli (to debug and fix)
RSS Autoscale Controller
Implements an opt-in RSS autoscale controller (--rss-autoscale=N) that dynamically narrows per-port hardware RSS RETA to absorb current load and parks unused workers. Solves the limitation of rx-side usleep() which fails to scale at line-rate (max_sleep_us clamped by ring depth/line rate) and keeps all queues RSS-fed during idle sleep, risking ring overflow.
Port Scaling Abstraction (port_scale.{c,h})
New module for dynamic RETA reprogramming. Computes allowed queue counts by intersecting driver constraints with DPAA2's discrete set (via hardcoded lookup: 1,2,3,4,6,7,8,12,14,16,24,28,32,...) or falling back to [1..max]. Provides capability helpers (caps_get/free, caps_next/prev) and apply function that constructs uniform RETA (reta[i] = i % n across groups) and calls rte_eth_dev_rss_reta_update(). Query function reads back active queue count from current RETA or returns 0 if never programmed.
RXQ Health Tracking (rss_autoscale.{h,c})
Per-RXQ counters (consec_full, consec_empty) with atomic saturated/idle flags. Datapath rxq_health_update() hook (called per RX burst) increments counters: full bursts (HW burst-cap received) and empty bursts. State transitions trigger notifications: 4 consecutive full → saturated; 1000 consecutive empty → idle; transitions back to normal clear flags and wake controller. Notification pushes to control queue (~10 µs). Zero cost when disabled (rss_autoscale == 0).
Worker Parking (worker.{c,h})
Adds atomic_int paused field and futex-based park/unpark. worker_park() stores 1 and datapath futex_waits with 1s timeout at housekeeping. worker_unpark() stores 0 and issues FUTEX_WAKE. Datapath leaves RCU offline during wait to avoid blocking grace periods; wakes every second to check shutdown/reconfig events. Resets timing on unpark so park duration is not charged as busy cycles.
Butterfly-Reverse RXQ Allocation (worker.c)
Changes distribution logic from simple forward walk to butterfly-reverse with separate global forward/reverse cursors. Even-indexed ports walk cpus[] forward from fwd_cursor; odd-indexed ports walk backward from rev_cursor. After each port, corresponding cursor advances by step derived from port's RXQ count. Improves endpoint placement: port 0 q0 → cpus[0], port 1 q0 → cpus[n-1]. On hyperthreaded systems (e.g., x86), endpoint workers land on different physical cores with exclusive sibling-thread and cache access. Enables multiple workers idle when dist_size=1 during low load.
Control Loop (rss_autoscale.c)
Three-pass decision triggered on RXQ health transitions: (1) per-port load recommendation from atomic health flags, (2) global cluster-budget arbitration to constrain total cluster usage (filters allowed_n to keep multiples of cluster grouping size), (3) hardware apply. Exposes policy bounds: rss_autoscale_set_cap() / rss_autoscale_clear_cap() for upper bound (shedding) and rss_autoscale_set_floor() / rss_autoscale_clear_floor() for lower bound (force load). Cap overrides floor on conflict. Immediate apply on cap/floor changes with worker park/unpark sync. Subscribes to interface-add events to pin new ports to configured autoscale value.
Datapath Integration
port_rx.c calls rx_burst_done() on every RX burst (replaces direct histogram increment) to update both histogram and health tracking. main_loop.c worker futex_wait loop integrated at housekeeping.
CLI/API Infrastructure (modules/infra/api/rss_autoscale.c, modules/infra/cli/rss_autoscale.c)
New API handlers: GR_RSS_AUTOSCALE_LIST (streams per-port state: active queues, recommended load, cap/floor/min/max), GR_RSS_AUTOSCALE_CAP_SET/CLEAR, GR_RSS_AUTOSCALE_FLOOR_SET/CLEAR. CLI commands: infra rss-autoscale show, set NAME cap N, set NAME floor N, clear NAME cap, clear NAME floor.
Configuration
New --rss-autoscale=N option in main.c: N=0 (disabled/legacy), N=1 (per-core), N=2 (grouping-by-2), N>2 (grouping-by-N). Stored in gr_config.rss_autoscale. Tick=20ms, up_hold=20ms, dn_hold=1000ms. Boots with N=1; widens as load appears.