Adaptive RSS auto-scaling by maxime-leroy · Pull Request #624 · DPDK/grout

maxime-leroy · 2026-05-27T14:42:40Z

Why rx_usleep doesn't scale

Grout's existing rx-side usleep(us) micro-sleep saves CPU at modest loads but breaks
at line rate:

- max_sleep_us is auto-clamped by ring depth / line rate, so at 25-100G the
  cap drops to a few microseconds → effectively zero. The worker busy-polls
  anyway, just with extra syscall overhead.
- All queues remain RSS-fed during sleep, so a burst arriving on any queue can
  overflow its ring before the worker wakes.
- The "idle CPU" gain only materializes at very low rates where usleep is
  long enough to matter, which is exactly the regime where the energy savings
  are negligible (the box was already idle).
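A rough way to see why the cap collapses at high line rates is to compute how long a ring takes to fill with minimum-size frames. The helper below is a hypothetical back-of-envelope sketch, not grout's actual clamp formula; the function name and the 64-byte-frame assumption are illustrative only.

```c
#include <stdint.h>

// Hypothetical helper, not grout's actual clamp: the longest a worker could
// sleep before a ring of `ring_depth` descriptors overflows, assuming
// minimum-size Ethernet frames arriving back to back at line rate.
// A 64-byte frame plus 20 bytes of preamble and inter-frame gap occupies
// 84 bytes = 672 bits on the wire.
static uint64_t ring_fill_time_ns(uint32_t ring_depth, uint64_t link_gbps) {
	const uint64_t wire_bits_per_pkt = 672;
	// bits divided by Gbit/s yields nanoseconds
	return (uint64_t)ring_depth * wire_bits_per_pkt / link_gbps;
}
```

Under these assumptions a 1024-descriptor ring fills in roughly 6.9 µs at 100G and 27.5 µs at 25G, before any safety margin is subtracted.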

What rss-autoscale does

Opt-in controller (--rss-autoscale=N, N = cluster grouping) that:

- Narrows the HW RSS distribution per port to the smallest N that absorbs
  current load (via rte_eth_dev_rss_reta_update, with a portable emulation
  on DPAA2 that maps reta[i] = i % N to dpni_set_rx_hash_dist).
- Parks unused workers via futex_wait → real 0% CPU (vs busy-poll noise).
- Detection is edge-based in the datapath: each rx_burst that returns the
  HW burst-cap counts as "full"; 4 consecutive full → saturated. 1000
  consecutive empty (n_pkts == 0) → idle. Mid-load (1..cap-1) is steady,
  no transition. Counters are local, only edges trigger an atomic_store +
  control_queue push to wake the controller (~10 µs reaction).
- Zero datapath cost when disabled (gr_config.rss_autoscale == 0):
  branch-predicted no-op at top of rxq_health_update.
- Two policy knobs (grcli rss-autoscale set NAME cap|floor N) for thermal
  daemons and ops: cap = upper bound, floor = lower bound. Cap wins on
  conflict.

Known limitations

- Elephant flow: a single 5-tuple flow hashes to one queue, so RSS
  scale-up cannot redistribute it. The controller will scale up uselessly
  until cap/max, leaving 11 unparked workers.
- Reordering on scale events: DPAA2 dispatches via hash % dist_size,
  so changing N reshuffles all flows. Scale events should be rare. No
  mitigation yet.
- grcli bug: set NAME cap N and set NAME floor N patterns fail
  parsing in libecoli (to debug and fix).
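The reordering limitation can be illustrated with a small counting sketch. It assumes the queue is picked as hash % dist_size, as described for DPAA2, and uses consecutive integers as stand-in hash values; the function name is hypothetical.

```c
#include <stdint.h>

// Count how many of the first `n_hashes` hash values land on a different
// queue when the distribution size changes from `old_n` to `new_n`,
// assuming the queue is selected as hash % dist_size.
static uint32_t flows_moved(uint32_t n_hashes, uint32_t old_n, uint32_t new_n) {
	uint32_t moved = 0;
	for (uint32_t h = 0; h < n_hashes; h++) {
		if (h % old_n != h % new_n)
			moved++;
	}
	return moved;
}
```

With 12 stand-in hashes, going from 2 to 3 queues moves 8 of them, which is why most in-flight flows can be expected to change queue on a scale event.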

RSS Autoscale Controller

Implements an opt-in RSS autoscale controller (--rss-autoscale=N) that dynamically narrows per-port hardware RSS RETA to absorb current load and parks unused workers. Solves the limitation of rx-side usleep() which fails to scale at line-rate (max_sleep_us clamped by ring depth/line rate) and keeps all queues RSS-fed during idle sleep, risking ring overflow.

Port Scaling Abstraction (`port_scale.{c,h}`)

New module for dynamic RETA reprogramming. Computes allowed queue counts by intersecting driver constraints with DPAA2's discrete set (via hardcoded lookup: 1,2,3,4,6,7,8,12,14,16,24,28,32,...) or falling back to [1..max]. Provides capability helpers (caps_get/free, caps_next/prev) and apply function that constructs uniform RETA (reta[i] = i % n across groups) and calls rte_eth_dev_rss_reta_update(). Query function reads back active queue count from current RETA or returns 0 if never programmed.
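The capability walk and the uniform table fill can be sketched as follows. This is a minimal illustration rather than the module's code: the helper names and signatures are hypothetical (the real helpers take a caps structure), and the allowed list is just the subset of values quoted above.

```c
#include <stddef.h>
#include <stdint.h>

// Illustrative subset of the discrete DPAA2 dist_size values quoted above.
static const uint16_t allowed_n[] = {1, 2, 3, 4, 6, 7, 8, 12, 14, 16, 24, 28, 32};
#define ALLOWED_COUNT (sizeof(allowed_n) / sizeof(allowed_n[0]))

// Smallest allowed value strictly greater than n, or n if none exists.
static uint16_t caps_next_sketch(uint16_t n) {
	for (size_t i = 0; i < ALLOWED_COUNT; i++) {
		if (allowed_n[i] > n)
			return allowed_n[i];
	}
	return n;
}

// Largest allowed value strictly smaller than n, or n if none exists.
static uint16_t caps_prev_sketch(uint16_t n) {
	for (size_t i = ALLOWED_COUNT; i > 0; i--) {
		if (allowed_n[i - 1] < n)
			return allowed_n[i - 1];
	}
	return n;
}

// Fill a redirection table so that entry i points at queue i % n (n > 0).
static void reta_fill_uniform(uint16_t *reta, size_t reta_size, uint16_t n) {
	for (size_t i = 0; i < reta_size; i++)
		reta[i] = (uint16_t)(i % n);
}
```

In the real module the filled table would then be handed to rte_eth_dev_rss_reta_update(); that call is left out here so the sketch stays free of DPDK headers.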

RXQ Health Tracking (`rss_autoscale.{h,c}`)

Per-RXQ counters (consec_full, consec_empty) with atomic saturated/idle flags. Datapath rxq_health_update() hook (called per RX burst) increments counters: full bursts (HW burst-cap received) and empty bursts. State transitions trigger notifications: 4 consecutive full → saturated; 1000 consecutive empty → idle; transitions back to normal clear flags and wake controller. Notification pushes to control queue (~10 µs). Zero cost when disabled (rss_autoscale == 0).
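The edge detection described above can be sketched as a small state machine. The thresholds (4 and 1000) come from the description; the struct layout, the names, and the exact rule for returning to the steady state are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CONSEC_FULL_THRESHOLD 4
#define CONSEC_EMPTY_THRESHOLD 1000

enum rxq_state { RXQ_STEADY, RXQ_SATURATED, RXQ_IDLE };

struct rxq_health_sketch {
	uint32_t consec_full;
	uint32_t consec_empty;
	enum rxq_state state;
};

// Update the counters for one RX burst. Returns true only when the state
// changes (an edge), which is where the real datapath would notify the
// controller. Steady traffic returns false and costs two local counters.
static bool rxq_health_step(struct rxq_health_sketch *h, uint16_t n_pkts, uint16_t burst_cap) {
	enum rxq_state next;

	if (n_pkts >= burst_cap) {
		h->consec_empty = 0;
		if (h->consec_full < CONSEC_FULL_THRESHOLD)
			h->consec_full++;
		next = (h->consec_full >= CONSEC_FULL_THRESHOLD) ? RXQ_SATURATED : RXQ_STEADY;
	} else if (n_pkts == 0) {
		h->consec_full = 0;
		if (h->consec_empty < CONSEC_EMPTY_THRESHOLD)
			h->consec_empty++;
		next = (h->consec_empty >= CONSEC_EMPTY_THRESHOLD) ? RXQ_IDLE : RXQ_STEADY;
	} else {
		// Mid-load burst: neither full nor empty.
		h->consec_full = 0;
		h->consec_empty = 0;
		next = RXQ_STEADY;
	}

	bool edge = (next != h->state);
	h->state = next;
	return edge;
}
```

Capping the counters at their thresholds keeps them from overflowing during long saturated or idle periods.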

Worker Parking (`worker.{c,h}`)

Adds an atomic_int paused field and futex-based park/unpark. worker_park() stores 1; at the next housekeeping interval the datapath calls futex_wait with a 1 s timeout. worker_unpark() stores 0 and issues FUTEX_WAKE. The datapath goes RCU-offline for the duration of the wait to avoid blocking grace periods, and wakes every second to check for shutdown/reconfig events. Timing is reset on unpark so the park duration is not charged as busy cycles.
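A minimal Linux-only sketch of the park/unpark handshake follows. The struct and function names are hypothetical, the RCU handling is omitted, and it assumes atomic_int is a 32-bit word usable as a futex (true on common Linux targets).

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

struct worker_sketch {
	atomic_int paused; // assumed to be a 32-bit futex word
};

// Control plane: ask the worker to park at its next housekeeping check.
static void worker_park_sketch(struct worker_sketch *w) {
	atomic_store(&w->paused, 1);
}

// Control plane: clear the flag and wake the worker if it is waiting.
static void worker_unpark_sketch(struct worker_sketch *w) {
	atomic_store(&w->paused, 0);
	syscall(SYS_futex, &w->paused, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1, NULL, NULL, 0);
}

// Datapath housekeeping: block while paused == 1, for at most timeout_ms.
// Returns 0 if not parked or woken, otherwise the errno from futex
// (ETIMEDOUT on timeout, EAGAIN if the flag changed before the wait).
static int worker_wait_if_parked(struct worker_sketch *w, long timeout_ms) {
	struct timespec ts = {
		.tv_sec = timeout_ms / 1000,
		.tv_nsec = (timeout_ms % 1000) * 1000000L,
	};
	if (atomic_load(&w->paused) == 0)
		return 0;
	long ret = syscall(SYS_futex, &w->paused, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 1, &ts, NULL, 0);
	return ret == 0 ? 0 : errno;
}
```

The timeout is what lets a parked worker notice shutdown or reconfig events without an explicit wake; the description uses 1 s.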

Butterfly-Reverse RXQ Allocation (`worker.c`)

Changes distribution logic from a simple forward walk to butterfly-reverse with separate global forward/reverse cursors. Even-indexed ports walk cpus[] forward from fwd_cursor; odd-indexed ports walk backward from rev_cursor. After each port, the corresponding cursor advances by a step derived from the port's RXQ count. Improves endpoint placement: port 0 q0 → cpus[0], port 1 q0 → cpus[n-1]. On hyperthreaded systems (e.g., x86), endpoint workers land on different physical cores with exclusive sibling-thread and cache access. With dist_size=1 at low load, only the endpoint workers stay active and the remaining workers can idle.
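The placement rule can be sketched with two cursors. This is an illustration under stated assumptions, not the code in worker.c: the names are hypothetical, and the cursor step is taken to be the port's RXQ count, which the summary only describes as "derived from" it.

```c
#include <stddef.h>

struct cursors_sketch {
	size_t fwd; // next start index for even-indexed ports
	size_t rev; // next start index for odd-indexed ports
};

// Requires n_cpus > 0.
static void cursors_init(struct cursors_sketch *c, size_t n_cpus) {
	c->fwd = 0;
	c->rev = n_cpus - 1;
}

// Index into cpus[] for queue q of the port at position port_idx.
// Even ports walk forward from fwd, odd ports walk backward from rev.
static size_t rxq_cpu_index(const struct cursors_sketch *c, size_t port_idx, size_t q, size_t n_cpus) {
	size_t off = q % n_cpus;
	if (port_idx % 2 == 0)
		return (c->fwd + off) % n_cpus;
	return (c->rev + n_cpus - off) % n_cpus;
}

// Advance the cursor that the port just used. Using n_rxq as the step is
// an assumption of this sketch.
static void cursors_advance(struct cursors_sketch *c, size_t port_idx, size_t n_rxq, size_t n_cpus) {
	size_t step = n_rxq % n_cpus;
	if (port_idx % 2 == 0)
		c->fwd = (c->fwd + step) % n_cpus;
	else
		c->rev = (c->rev + n_cpus - step) % n_cpus;
}
```

With 8 CPUs and 2 queues per port, ports 0 and 1 start at cpus[0] and cpus[7], and ports 2 and 3 continue from cpus[2] and cpus[5].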

Control Loop (`rss_autoscale.c`)

Three-pass decision triggered on RXQ health transitions: (1) per-port load recommendation from atomic health flags, (2) global cluster-budget arbitration to constrain total cluster usage (filters allowed_n to keep multiples of cluster grouping size), (3) hardware apply. Exposes policy bounds: rss_autoscale_set_cap() / rss_autoscale_clear_cap() for upper bound (shedding) and rss_autoscale_set_floor() / rss_autoscale_clear_floor() for lower bound (force load). Cap overrides floor on conflict. Immediate apply on cap/floor changes with worker park/unpark sync. Subscribes to interface-add events to pin new ports to configured autoscale value.
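The "cap overrides floor" rule amounts to applying the floor first and the cap last. The sketch below assumes a bound of zero or below means "unset", in line with the review discussion of policy_cap/policy_floor; the function name is hypothetical.

```c
#include <stdint.h>

// Clamp a load-derived recommendation to operator policy.
// Assumption: a bound <= 0 means the bound is unset.
static uint16_t clamp_to_policy_sketch(uint16_t recommended, int16_t cap, int16_t floor) {
	uint16_t n = recommended;
	if (floor > 0 && n < (uint16_t)floor)
		n = (uint16_t)floor;
	// The cap is applied last, so it wins when cap < floor.
	if (cap > 0 && n > (uint16_t)cap)
		n = (uint16_t)cap;
	return n;
}
```

For example, a recommendation of 1 with cap 2 and floor 6 resolves to 2: the floor raises it to 6, then the cap brings it back down.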

Datapath Integration

port_rx.c calls rx_burst_done() on every RX burst (replacing the direct histogram increment) to update both the histogram and health tracking. main_loop.c integrates the worker futex_wait loop at the housekeeping interval.

CLI/API Infrastructure (`modules/infra/api/rss_autoscale.c`, `modules/infra/cli/rss_autoscale.c`)

New API handlers: GR_RSS_AUTOSCALE_LIST (streams per-port state: active queues, recommended load, cap/floor/min/max), GR_RSS_AUTOSCALE_CAP_SET/CLEAR, GR_RSS_AUTOSCALE_FLOOR_SET/CLEAR. CLI commands: infra rss-autoscale show, set NAME cap N, set NAME floor N, clear NAME cap, clear NAME floor.

Configuration

New --rss-autoscale=N option in main.c: N=0 (disabled/legacy), N=1 (per-core), N=2 (grouping-by-2), N>2 (grouping-by-N). Stored in gr_config.rss_autoscale. Tick=20ms, up_hold=20ms, dn_hold=1000ms. Boots with N=1; widens as load appears.

coderabbitai · 2026-05-27T14:52:37Z

Walkthrough

This PR implements adaptive RSS queue autoscaling for DPDK network ports. Configuration is added via a new --rss-autoscale=N command-line option. A port scaling module provides hardware RETA query/apply and capability discovery. Per-queue full/empty state is tracked in the RX datapath and signaled to a control loop. Workers can be parked via futex when queues are idle. The control plane recomputes target queue counts per port based on health flags, enforces operator policy (cap/floor), and arbitrates global cluster resource budgets before applying hardware changes. The RXQ-to-worker distribution algorithm is refined to alternate traversal direction across ports. Public APIs expose state queries and cap/floor configuration; CLI and REST-like API interfaces provide user control.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

modules/infra/control/worker.c (1)

401-446: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Division-by-zero risk when affinity resolves to zero datapath CPUs.

n_cpus is used as a modulo divisor before any non-zero guard. If affinity is empty, this crashes in queue distribution.

Suggested fix

 	size_t n_cpus = vec_len(cpus);
+	if (n_cpus == 0) {
+		LOG(ERR, "no datapath CPUs available in affinity mask");
+		ret = -EINVAL;
+		goto end;
+	}
 	size_t fwd_cursor = 0;
 	size_t rev_cursor = (n_cpus > 0) ? (n_cpus - 1) : 0;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/infra/control/worker.c` around lines 401 - 446, n_cpus can be zero
causing a divide-by-zero in the RX queue assignment loops (uses n_cpus as
modulo); add an early guard after size_t n_cpus = vec_len(cpus); that checks if
(n_cpus == 0) and handles it (for example log an error via errno_log or
processLogger, and return/goto end or otherwise skip queue-to-CPU distribution)
so subsequent uses of n_cpus in the rxq loop and in computing off/idx do not
perform modulo zero; update any related variables (rev_cursor/fwd_cursor) only
after the non-zero check and keep references to cpus, affinity, CPU_COUNT and
port->n_rxq to locate the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modules/infra/api/rss_autoscale.c`:
- Around line 63-64: The handlers rss_autoscale_cap_set_handler and
rss_autoscale_floor_set_handler currently cast req->n (uint16_t) to int16_t
which can wrap for values > INT16_MAX; before calling rss_autoscale_set_cap /
rss_autoscale_set_floor validate that req->n <= INT16_MAX and reject/return a
suitable api_out error if it exceeds that limit (do not perform the cast on
out-of-range values), otherwise safely cast to int16_t and proceed; reference
policy_cap/policy_floor semantics when constructing the error to avoid
accidental “unset/cleared” behavior.

In `@modules/infra/control/rss_autoscale.c`:
- Around line 463-479: The call to caps_ensure(s) inside
rss_autoscale_port_state_get is ignored; detect and handle its return value
instead of proceeding with possibly-uninitialized caps fields. Update
rss_autoscale_port_state_get to capture int rc = caps_ensure(s) and if rc < 0
return rc (or, alternatively, clear/zero the caps-related out-params and return
0 only when s->supports_scale is true) so that max_n and min_n (allowed_n[0])
are not read/returned when capability discovery failed; reference the function
rss_autoscale_port_state_get, the caps_ensure(s) call, and the caps field
accesses (s->caps.max_n and s->caps.allowed_n[0]) when making the change.

In `@modules/infra/datapath/port_rx.c`:
- Around line 65-70: rx_burst_done (which updates histogram and rxq_health)
isn't called from the bonded RX paths, so rxq_health isn't updated for bonded
queues; modify rx_bond_offload_process and rx_bond_process to invoke
rx_burst_done(port_id, queue_id, n_pkts) after their per-burst packet handling
(i.e., where they currently account for n_pkts or finish processing a received
burst) so bonded queues participate in histogram and autoscale health tracking
the same as other rx_*_process variants.

📥 Commits

Reviewing files that changed from the base of the PR and between fe0c7f9 and 493e1e2.

📒 Files selected for processing (17)

main/config.h
main/main.c
modules/infra/api/gr_infra.h
modules/infra/api/meson.build
modules/infra/api/rss_autoscale.c
modules/infra/cli/meson.build
modules/infra/cli/rss_autoscale.c
modules/infra/control/meson.build
modules/infra/control/port_scale.c
modules/infra/control/port_scale.h
modules/infra/control/rss_autoscale.c
modules/infra/control/rss_autoscale.h
modules/infra/control/worker.c
modules/infra/control/worker.h
modules/infra/control/worker_test.c
modules/infra/datapath/main_loop.c
modules/infra/datapath/port_rx.c

coderabbitai · 2026-05-27T14:52:40Z

+	ret = rss_autoscale_set_cap(port_id, (int16_t)req->n);
+	return api_out(ret < 0 ? -ret : 0, 0, NULL);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate req->n range before casting to int16_t in rss_autoscale API handlers.

In modules/infra/api/rss_autoscale.c the uint16_t n request field is narrowed to int16_t in:

rss_autoscale_cap_set_handler (lines 63-64)

rss_autoscale_floor_set_handler (lines 83-84)

For values > INT16_MAX, the cast wraps negative; since policy_cap/policy_floor use negative values to mean “unset/cleared” (and policy application checks > 0), clients can silently clear bounds instead of applying the requested cap/floor.

🔧 Proposed fix

 static struct api_out rss_autoscale_cap_set_handler(const void *request, struct api_ctx *) {
 	const struct gr_rss_autoscale_set_req *req = request;
 	uint16_t port_id;
 	int ret = port_id_from_iface_id(req->iface_id, &port_id);
 	if (ret < 0)
 		return api_out(-ret, 0, NULL);
+	if (req->n > INT16_MAX)
+		return api_out(EINVAL, 0, NULL);
 	ret = rss_autoscale_set_cap(port_id, (int16_t)req->n);
 	return api_out(ret < 0 ? -ret : 0, 0, NULL);
 }
@@
 static struct api_out rss_autoscale_floor_set_handler(const void *request, struct api_ctx *) {
 	const struct gr_rss_autoscale_set_req *req = request;
 	uint16_t port_id;
 	int ret = port_id_from_iface_id(req->iface_id, &port_id);
 	if (ret < 0)
 		return api_out(-ret, 0, NULL);
+	if (req->n > INT16_MAX)
+		return api_out(EINVAL, 0, NULL);
 	ret = rss_autoscale_set_floor(port_id, (int16_t)req->n);
 	return api_out(ret < 0 ? -ret : 0, 0, NULL);
 }
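The range check can also be isolated into a small helper, sketched here with a hypothetical name. It refuses to narrow a uint16_t that does not fit in int16_t instead of letting the cast wrap.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper mirroring the proposed check: returns false and
// leaves *out untouched when n does not fit in int16_t.
static bool narrow_to_i16(uint16_t n, int16_t *out) {
	if (n > INT16_MAX)
		return false;
	*out = (int16_t)n;
	return true;
}
```

Without the check, a request such as n = 40000 would typically become -25536 on two's-complement targets, which the controller would read as "unset".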

coderabbitai · 2026-05-27T14:52:41Z

+	struct rss_autoscale_port_state *s = state_ensure(port_id);
+	if (s == NULL)
+		return -ENOMEM;
+	caps_ensure(s);
+	if (n_active)
+		*n_active = s->n_active;
+	if (n_load_recommended)
+		*n_load_recommended = s->n_load_recommended;
+	if (cap_effective)
+		*cap_effective = s->policy_cap;
+	if (floor_effective)
+		*floor_effective = s->policy_floor;
+	if (max_n)
+		*max_n = s->caps.max_n;
+	if (min_n)
+		*min_n = (s->caps.allowed_count > 0) ? s->caps.allowed_n[0] : 0;
+	return 0;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ignored caps_ensure() failure in rss_autoscale_port_state_get.

Line 466 calls caps_ensure(s) but ignores its return value. If capability discovery fails (e.g., -ENODEV or -ENOTSUP), the function proceeds to return potentially uninitialized or stale caps fields (max_n, allowed_n[0]), which could mislead API consumers.

Consider either propagating the error or documenting that caps fields are only valid when supports_scale is true.

Proposed fix

 int rss_autoscale_port_state_get(
 	uint16_t port_id,
 	uint16_t *n_active,
 	uint16_t *n_load_recommended,
 	int16_t *cap_effective,
 	int16_t *floor_effective,
 	uint16_t *max_n,
 	uint16_t *min_n
 ) {
 	struct rss_autoscale_port_state *s = state_ensure(port_id);
 	if (s == NULL)
 		return -ENOMEM;
-	caps_ensure(s);
+	int ret = caps_ensure(s);
+	// caps fields valid only if ret == 0 or s->caps.supports_scale is true
 	if (n_active)
 		*n_active = s->n_active;
 	if (n_load_recommended)
 		*n_load_recommended = s->n_load_recommended;
 	if (cap_effective)
 		*cap_effective = s->policy_cap;
 	if (floor_effective)
 		*floor_effective = s->policy_floor;
-	if (max_n)
-		*max_n = s->caps.max_n;
-	if (min_n)
-		*min_n = (s->caps.allowed_count > 0) ? s->caps.allowed_n[0] : 0;
+	if (max_n)
+		*max_n = (ret == 0) ? s->caps.max_n : 0;
+	if (min_n)
+		*min_n = (ret == 0 && s->caps.allowed_count > 0) ? s->caps.allowed_n[0] : 0;
 	return 0;
 }

coderabbitai · 2026-05-27T14:52:41Z

+// Combined per-rx_burst datapath bookkeeping: histogram + rss-autoscale
+// saturation/idle tracking. Called from every rx_*_process variant.
+static inline void rx_burst_done(uint16_t port_id, uint16_t queue_id, uint16_t n_pkts) {
+	rx_burst_histogram_inc(port_id, n_pkts);
+	rxq_health_update(port_id, queue_id, n_pkts);
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Autoscale health tracking is not wired to all RX process paths.

rx_burst_done() is not used in rx_bond_offload_process() and rx_bond_process(), so those queues never update rxq_health. That can mislead the autoscale controller and cause incorrect park/scale actions on bonded traffic.

Suggested fix

 uint16_t rx_bond_offload_process(struct rte_graph *graph, struct rte_node *node, void **, uint16_t) {
@@
 	rx = rte_eth_rx_burst(ctx->rxq.port_id, ctx->rxq.queue_id, mbufs, ctx->burst_size);
+	rx_burst_done(ctx->rxq.port_id, ctx->rxq.queue_id, rx);
 	if (rx == 0)
 		return 0;
@@
 uint16_t rx_bond_process(struct rte_graph *graph, struct rte_node *node, void **, uint16_t) {
@@
 	rx = rte_eth_rx_burst(ctx->rxq.port_id, ctx->rxq.queue_id, mbufs, ctx->burst_size);
+	rx_burst_done(ctx->rxq.port_id, ctx->rxq.queue_id, rx);
 	if (rx == 0)
 		return 0;

Add a thin abstraction layer (port_scale.{c,h}) that exposes a uniform API to dynamically restrict the HW RSS distribution of a port to its first N RX queues:

  port_scale_caps_get(port, &caps) -> queries reta_size and the set of N values the PMD will accept
  port_scale_apply(port, n) -> builds a uniform reta[i]=i%n and calls rte_eth_dev_rss_reta_update
  port_scale_caps_next/prev(caps, n) -> walk the HW-allowed N list

The implementation is portable: it works on any PMD that advertises a non-zero reta_size (Intel ixgbe/i40e/ice, Mellanox mlx5, Broadcom bnxt) and on DPAA2 once the matching RETA emulation lands in the PMD.

For DPAA2 specifically, the caps query exposes the discrete dist_size list {1, 2, 3, 4, 6, 7, 8, 12, ...} so the controller never asks the PMD for a value it would reject with -ENOTSUP.

This commit only adds the abstraction. Default port behaviour is unchanged: no RETA narrowing happens until the rss-autoscale controller in a follow-up commit subscribes to GR_EVENT_IFACE_POST_ADD and pins ports when --rss-autoscale is set.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>

Introduce worker_park() / worker_unpark() so the control plane can suspend a dataplane worker between scaling events. Parked workers futex_wait() on a shared atomic word, leaving cpu-user time at zero and letting cpuidle (cpu-pw15 on Cortex-A72) engage as the dominant state -- the prerequisite for load-proportional CPU consumption.

The mechanism is cooperative: the worker checks atomic_load(&paused) at every housekeeping interval (every HOUSEKEEPING_INTERVAL graph walks), and if set, drops out of the RCU online region and calls futex_wait(..., FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 1, 1s_timeout). The 1s timeout makes shutdown and graph reconfig events visible within one second even while parked, without paying any wakeup cost during the wait.

worker_unpark() stores 0 and issues a FUTEX_WAKE; the race with futex_wait is handled by the kernel (EAGAIN when the value already changed).

This is opt-in -- workers that are never park()ed see no change in behavior. busy-poll and usleep semantics in the housekeeping branch are preserved.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>

Introduce rss-autoscale: an opt-in controller that dynamically narrows the HW RSS distribution per port to fewer active queues when the workers have headroom, and parks the workers that no longer receive packets via the futex primitive added in the previous commit.

CLI: --rss-autoscale=N at startup, where N is the cluster grouping size used to align scaling steps:

  0 = disabled (legacy: all workers always poll all configured queues, no parking, no RETA scale).
  1 = enabled, per-core scaling (no cluster constraint).
  2 = enabled, scale by groups of 2 (x86 with hyperthreading or any SoC where 2 cores share an L2 cache).
  N = scale by groups of N for larger cache-sharing topologies.

Scale-up is event-driven. Each rxq publishes a saturated flag via rxq_health_update in the rx datapath after RSS_AUTOSCALE_CONSEC_FULL consecutive full bursts; the transition wakes the control thread, which steps the port to the next allowed dist-size in microseconds.

Scale-down is timer-driven per port. A libevent EV_PERSIST timer fires every SCALE_DOWN_PERIOD_S (10 minutes) per port and steps the port one allowed dist-size down when the aggregated busy/total cycle ratio of the workers servicing it has stayed below 40% since the previous fire. The timer is rearmed on every do_apply so a fresh 10-minute window always separates a scale-up from the next scale-down candidate, providing intrinsic anti-flapping.

External knobs (rss_autoscale_set_cap / rss_autoscale_set_floor, exposed in the follow-up commit) carry an upper bound (thermal shed) and a lower bound (force load for characterisation or burst pre-arm). Both accept zero to clear. Values are validated against the PMD's allowed dist-size set at API entry. When both are set and cap < floor, cap wins as a safety override.

On GR_EVENT_IFACE_POST_ADD for a port whose PMD supports RETA and exposes at least two cluster_size-aligned dist-sizes, the controller narrows the port to N = cluster_size active queues. Ports that only expose a single usable N (or no RETA at all) are left unmanaged: the PMD's default HW RSS still distributes their traffic and their workers are never parked. The event-driven path scales up as soon as load justifies it.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
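The scale-down test described in this commit message reduces to a ratio comparison. The sketch below is illustrative only: the names are hypothetical, and treating a window with no recorded cycles as "do not scale down" is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCALE_DOWN_BUSY_PCT 40

// True when the aggregated busy/total cycle ratio accumulated since the
// previous timer fire is below the threshold. Integer arithmetic avoids
// floating point; a window with no samples never triggers a scale-down
// (assumption of this sketch).
static bool should_scale_down(uint64_t busy_cycles, uint64_t total_cycles) {
	if (total_cycles == 0)
		return false;
	return busy_cycles * 100 < total_cycles * SCALE_DOWN_BUSY_PCT;
}
```

The comparison is strict, so a ratio of exactly 40% keeps the current dist-size.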

Change worker_queue_distribute() to lay out per-port RX queues using a "butterfly-reverse" pattern: even-indexed ports walk cpus[] forward from a global fwd_cursor, odd-indexed ports walk it backward from a global rev_cursor. Both cursors advance across ports so every worker still receives some rxq even when n_workers exceeds n_rxq_per_port (the naive "fill from each end" variant would otherwise leave middle CPUs idle).

Endpoint placement: port 0 q0 -> cpus[0], port 1 q0 -> cpus[n-1]. On x86 with hyperthreading (where sibling threads share L1 and L2), this puts the two endpoint workers on different physical cores, giving each exclusive use of its sibling thread and its caches. At rss-autoscale dist_size=1 on every port, the two active workers can then run flat-out while the rest of the dataplane CPUs idle.

As load grows (dist_size 2, 4, ...) each port expands into a disjoint range of cpus until n_active_total > n_workers, at which point cross-port workers reappear -- at full load the cache exclusivity benefit no longer matters.

Direction is keyed on port_idx (position in grout's port vec). The vec ordering is the algorithm's invariant: stable across iterations, owned by grout, and enumerates every configured port exactly once.

Update worker_test.c expectations for both queue_distribute_reduce and queue_distribute_increase to match the new layout. Both tests exercise the global-cursor advancement.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>

Let the operator bound the controller's RX queue scaling per port with two attributes set through "grcli interface set", like the MTU: rss-cap (upper bound on the active queue count) and rss-floor (lower bound); 0 leaves a bound unset. The cap sheds load during thermal events, the floor keeps a minimum number of workers active even without traffic. When both are set and cap < floor, the cap wins.

They live in the port configuration (iface_info_port) rather than in the controller, so the surgical set_attrs update keeps them across an unrelated reconfig. A value is validated against the controller's usable distribution sizes -- the PMD's advertised sizes after the cluster_size filtering, shared with caps_ensure -- and against the target queue count when the same request also changes nb-rxqs. Both are shown in "interface show" and "rss-autoscale show".

Reconfiguration is handled in place: since only the queue count changes the usable grid, the controller rebuilds just its caps and re-clamps the current active count, applying only when it moves. An unrelated reconfig no longer resets n_active, rxq_health and the scale-down timer to baseline.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>

When rss-autoscale narrows a port's RETA to N queues, queues N..max no longer receive any traffic, yet the worker that owns them keeps polling them on every graph walk as long as it stays awake for its other, still active queues. On DPAA2 an empty rte_eth_rx_burst is not free: the qbman volatile-dequeue round-trip costs several hundred CPU cycles, so these dead polls steal time from the busy queues sharing the core.

Give a deactivated queue a no-op process callback, rx_inactive_process, so the graph walk visits the node but issues no poll. The callback is chosen at graph build time from the port's current active count, and swapped in/out at scale time on each worker's live graph by worker_graph_rxq_set_active() without a graph reload -- the same node->process swap DPDK uses for pcap. The store is an aligned pointer write and the worker reads either the old or the new callback, both valid.

Factor the rx process-callback selection into rx_process_for() so the build path and the runtime swap cannot drift.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
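
The callback swap can be pictured with the simplified model below. It does not use the real rte_graph node type or grout's actual signatures: struct sketch_node, the sketch_ function names and the polls counter are all invented for illustration, and the plain pointer assignment stands in for the aligned pointer write mentioned above.

```c
#include <stdint.h>

struct sketch_node;
typedef uint16_t (*sketch_process_fn)(struct sketch_node *);

/* Hypothetical stand-in for a per-queue rx graph node. */
struct sketch_node {
	sketch_process_fn process; /* callback invoked on each graph walk */
	unsigned polls;            /* counts simulated rx_burst calls */
};

/* Active queue: would call rte_eth_rx_burst(); here it only counts. */
static uint16_t sketch_rx_process(struct sketch_node *n) {
	n->polls++;
	return 0;
}

/* Deactivated queue: the node is visited but no poll is issued. */
static uint16_t sketch_rx_inactive_process(struct sketch_node *n) {
	(void)n;
	return 0;
}

/* Single selection point shared by graph build and runtime swap. */
static sketch_process_fn sketch_rx_process_for(uint16_t rxq, uint16_t n_active) {
	return rxq < n_active ? sketch_rx_process : sketch_rx_inactive_process;
}

/* Re-point every queue node of a port after the active count changes. */
static void sketch_rxq_set_active(struct sketch_node *nodes, uint16_t n_rxq, uint16_t n_active) {
	for (uint16_t q = 0; q < n_rxq; q++)
		nodes[q].process = sketch_rx_process_for(q, n_active);
}
```

Because both the build path and the swap go through one selection function, a queue cannot end up with a different callback depending on which path touched it last.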

MortenBroerup · 2026-05-29T21:44:25Z

The power management algorithm per lcore in the SmartShare appliance is relatively simple:
After a certain duration of idle or low workloads, the system uses "napping" (our internal term for that power state) with relatively short durations.
And after a certain duration of idle workloads, the system uses "sleeping", which means longer durations.

As input for the algorithm for each lcore, we look at the maximum number of objects processed by any "node" every time the "graph" is traversed.
0 objects is Idle, 1 object is Low, more than 1 object is Normal.
If any "node" reports that it has more objects ready for processing, we don't nap or sleep at all, but restart traversing the graph immediately.

We are primarily targeting low power consumption during low traffic hours, and low scheduling latency during peak traffic hours.

For long durations, we use nanosleep(), where the kernel puts the lcore in a low power state.
For short durations, we use rte_power_pause(), which provides modest power savings with nearly no jitter or wake-up delay.

We don't use any of the fancy power management hooks in DPDK. It's all done by our application.
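
As a rough illustration of the per-lcore decision described in this comment, the sketch below classifies each graph traversal and picks between polling, napping and sleeping. It is not the SmartShare code: the type and function names are invented, the two thresholds are placeholder values, and measuring "duration" as a count of consecutive traversals is an assumption.

```c
#include <stdbool.h>

enum sketch_action { SKETCH_POLL, SKETCH_NAP, SKETCH_SLEEP };

/* Placeholder thresholds, in consecutive graph traversals (assumed unit). */
#define SKETCH_NAP_AFTER 100u
#define SKETCH_SLEEP_AFTER 1000u

struct sketch_pm_state {
	unsigned idle_streak;  /* consecutive traversals with 0 objects */
	unsigned quiet_streak; /* consecutive traversals with 0 or 1 object */
};

/* Increment a streak without letting it wrap around. */
static unsigned sketch_bump(unsigned v) {
	return v < SKETCH_SLEEP_AFTER ? v + 1 : v;
}

/*
 * max_objs: largest object count handled by any node in this traversal
 *           (0 = Idle, 1 = Low, more than 1 = Normal).
 * more_pending: true if any node reported more work ready.
 */
static enum sketch_action
sketch_pm_step(struct sketch_pm_state *s, unsigned max_objs, bool more_pending) {
	s->idle_streak = (max_objs == 0) ? sketch_bump(s->idle_streak) : 0;
	s->quiet_streak = (max_objs <= 1) ? sketch_bump(s->quiet_streak) : 0;

	if (more_pending)
		return SKETCH_POLL; /* restart the traversal immediately */
	if (s->idle_streak >= SKETCH_SLEEP_AFTER)
		return SKETCH_SLEEP; /* long wait, e.g. nanosleep() */
	if (s->quiet_streak >= SKETCH_NAP_AFTER)
		return SKETCH_NAP; /* short wait, e.g. rte_power_pause() */
	return SKETCH_POLL;
}
```

A caller would run this after every traversal and map SKETCH_NAP and SKETCH_SLEEP to the short and long waits mentioned above.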

maxime-leroy changed the title ~~Rss autoscale~~ Adaptive RSS auto-scaling May 27, 2026

maxime-leroy marked this pull request as draft May 27, 2026 14:43

coderabbitai Bot reviewed May 27, 2026

maxime-leroy force-pushed the rss_autoscale branch from 493e1e2 to ff12fe0 Compare May 28, 2026 17:41

maxime-leroy force-pushed the rss_autoscale branch 2 times, most recently from 61cba06 to 779119c Compare May 28, 2026 21:44

maxime-leroy added 4 commits May 29, 2026 15:22

maxime-leroy force-pushed the rss_autoscale branch from 779119c to 1a7b775 Compare May 29, 2026 14:45

maxime-leroy wants to merge 6 commits into DPDK:main from maxime-leroy:rss_autoscale

		ret = rss_autoscale_set_cap(port_id, (int16_t)req->n);
		return api_out(ret < 0 ? -ret : 0, 0, NULL);

-	ret = rss_autoscale_set_cap(port_id, (int16_t)req->n);
-	return api_out(ret < 0 ? -ret : 0, 0, NULL);
+static struct api_out rss_autoscale_cap_set_handler(const void *request, struct api_ctx *) {
+	const struct gr_rss_autoscale_set_req *req = request;
+	uint16_t port_id;
+	int ret = port_id_from_iface_id(req->iface_id, &port_id);
+	if (ret < 0)
+		return api_out(-ret, 0, NULL);
+	if (req->n > INT16_MAX)
+		return api_out(EINVAL, 0, NULL);
+	ret = rss_autoscale_set_cap(port_id, (int16_t)req->n);
+	return api_out(ret < 0 ? -ret : 0, 0, NULL);
+}
+static struct api_out rss_autoscale_floor_set_handler(const void *request, struct api_ctx *) {
+	const struct gr_rss_autoscale_set_req *req = request;
+	uint16_t port_id;
+	int ret = port_id_from_iface_id(req->iface_id, &port_id);
+	if (ret < 0)
+		return api_out(-ret, 0, NULL);
+	if (req->n > INT16_MAX)
+		return api_out(EINVAL, 0, NULL);
+	ret = rss_autoscale_set_floor(port_id, (int16_t)req->n);
+	return api_out(ret < 0 ? -ret : 0, 0, NULL);
+}

RSS Autoscale Controller

Port Scaling Abstraction (port_scale.{c,h})

RXQ Health Tracking (rss_autoscale.{h,c})

Worker Parking (worker.{c,h})

Butterfly-Reverse RXQ Allocation (worker.c)

Control Loop (rss_autoscale.c)

Datapath Integration

CLI/API Infrastructure (modules/infra/api/rss_autoscale.c, modules/infra/cli/rss_autoscale.c)

Configuration
