Adaptive RSS auto-scaling #624

Draft
maxime-leroy wants to merge 6 commits into DPDK:main from maxime-leroy:rss_autoscale

Collaborator

@maxime-leroy maxime-leroy commented May 27, 2026

Why rx_usleep doesn't scale

Grout's existing rx-side usleep(us) micro-sleep saves CPU at modest loads but breaks
at line rate:

  • max_sleep_us is auto-clamped by ring depth / line rate, so at 25-100G the
    cap drops to a few microseconds → effectively zero. The worker busy-polls
    anyway, just with extra syscall overhead.
  • All queues remain RSS-fed during sleep, so a burst arriving on any queue can
    overflow its ring before the worker wakes.
  • The "idle CPU" gain only materializes at very low rates where usleep is
    long enough to matter, which is exactly the regime where the energy savings
    are negligible (the box was already idle).

What rss-autoscale does

Opt-in controller (--rss-autoscale=N, N = cluster grouping) that:

  • Narrows the HW RSS distribution per port to the smallest N that absorbs
    current load (via rte_eth_dev_rss_reta_update, with a portable emulation
    on DPAA2 that maps reta[i] = i % N to dpni_set_rx_hash_dist).
  • Parks unused workers via futex_wait → real 0% CPU (vs busy-poll noise).
  • Detection is edge-based in the datapath: each rx_burst that returns the
    HW burst-cap counts as "full"; 4 consecutive full → saturated. 1000
    consecutive empty (n_pkts == 0) → idle. Mid-load (1..cap-1) is steady,
    no transition. Counters are local, only edges trigger an atomic_store +
    control_queue push to wake the controller (~10 µs reaction).
  • Zero datapath cost when disabled (gr_config.rss_autoscale == 0):
    branch-predicted no-op at top of rxq_health_update.
  • Two policy knobs (grcli rss-autoscale set NAME cap|floor N) for thermal
    daemons and ops: cap = upper bound, floor = lower bound. Cap wins on
    conflict.

Known limitations

  • Elephant flow: a single 5-tuple flow hashes to one queue, so RSS
    scale-up cannot redistribute it. The controller will scale up uselessly
    until cap/max, leaving 11 unparked workers.
  • Reordering on scale events: DPAA2 dispatches via hash % dist_size,
    so changing N reshuffles all flows. Scale events should be rare. No
    mitigation yet.
  • grcli bug: the set NAME cap N and set NAME floor N patterns fail to
    parse in libecoli (still to be debugged and fixed).

RSS Autoscale Controller

Implements an opt-in RSS autoscale controller (--rss-autoscale=N) that dynamically narrows the per-port hardware RSS RETA to the queues needed for the current load and parks the unused workers. It addresses two limits of the rx-side usleep(): the sleep does not scale at line rate (max_sleep_us is clamped by ring depth / line rate), and all queues stay RSS-fed while the worker sleeps, which risks ring overflow.

Port Scaling Abstraction (port_scale.{c,h})

New module for dynamic RETA reprogramming. It computes the allowed queue counts by intersecting the driver constraints with DPAA2's discrete set (hardcoded lookup: 1,2,3,4,6,7,8,12,14,16,24,28,32,...) or by falling back to [1..max]. It provides capability helpers (caps_get/free, caps_next/prev) and an apply function that builds a uniform RETA (reta[i] = i % n across groups) and calls rte_eth_dev_rss_reta_update(). A query function reads the active queue count back from the current RETA, or returns 0 if the RETA was never programmed.
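
The uniform table can be sketched without any DPDK dependency. This is only an illustration: `reta_fill_uniform_sketch` and `reta_active_queues_sketch` are hypothetical names, and the table is a flat array rather than the `struct rte_eth_rss_reta_entry64` groups that the real apply path would hand to rte_eth_dev_rss_reta_update(). The read-back rule (highest queue id plus one) is an assumption about how an active count could be derived.

```c
#include <stddef.h>
#include <stdint.h>

// Fill a flat redirection table so that entry i points at queue i % n.
// Returns 0 on success, -1 if n is 0 (there is no queue to spread over).
int reta_fill_uniform_sketch(uint16_t *reta, size_t reta_size, uint16_t n) {
	if (n == 0)
		return -1;
	for (size_t i = 0; i < reta_size; i++)
		reta[i] = (uint16_t)(i % n);
	return 0;
}

// Read the table back: the active queue count is taken as the highest
// queue id referenced plus one. Returns 0 for an empty table.
uint16_t reta_active_queues_sketch(const uint16_t *reta, size_t reta_size) {
	uint16_t max_q = 0;

	if (reta_size == 0)
		return 0;
	for (size_t i = 0; i < reta_size; i++) {
		if (reta[i] > max_q)
			max_q = reta[i];
	}
	return (uint16_t)(max_q + 1);
}
```

With a 128-entry table and n = 3, entries cycle 0, 1, 2, 0, ... so queues 3 and above stop receiving hashed traffic.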

RXQ Health Tracking (rss_autoscale.{h,c})

Per-RXQ counters (consec_full, consec_empty) with atomic saturated/idle flags. The datapath hook rxq_health_update(), called on every RX burst, counts full bursts (the HW burst cap was received) and empty bursts. Only state transitions notify the controller: 4 consecutive full bursts → saturated; 1000 consecutive empty bursts → idle; a return to normal clears the flag and also wakes the controller. A notification is a push to the control queue (~10 µs). Zero cost when disabled (rss_autoscale == 0).
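
A minimal sketch of the edge detection described above, assuming the thresholds quoted in this description (4 and 1000). The struct layout and the name `rxq_health_sketch` are hypothetical; the real hook also pushes to the control queue, which is reduced here to a boolean "wake the controller" return value. Treating an empty burst as ending a saturated streak is an assumption.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CONSEC_FULL_SKETCH 4     // consecutive full bursts before "saturated"
#define CONSEC_EMPTY_SKETCH 1000 // consecutive empty bursts before "idle"

struct rxq_health_sketch_state {
	uint32_t consec_full;  // worker-local, plain integers
	uint32_t consec_empty; // worker-local, plain integers
	atomic_bool saturated; // read by the controller
	atomic_bool idle;      // read by the controller
};

// Update the counters for one rx_burst result. Returns true only when a
// flag changed, i.e. when the controller would need to be woken.
bool rxq_health_sketch(struct rxq_health_sketch_state *h, uint16_t n_pkts, uint16_t burst_cap) {
	bool sat = atomic_load_explicit(&h->saturated, memory_order_relaxed);
	bool idle = atomic_load_explicit(&h->idle, memory_order_relaxed);
	bool new_sat = false, new_idle = false, changed = false;

	if (n_pkts == 0) {
		h->consec_full = 0;
		if (h->consec_empty < CONSEC_EMPTY_SKETCH)
			h->consec_empty++;
		new_idle = h->consec_empty >= CONSEC_EMPTY_SKETCH;
	} else if (n_pkts == burst_cap) {
		h->consec_empty = 0;
		if (h->consec_full < CONSEC_FULL_SKETCH)
			h->consec_full++;
		new_sat = h->consec_full >= CONSEC_FULL_SKETCH;
	} else {
		// Mid-load burst: both streaks are broken.
		h->consec_full = 0;
		h->consec_empty = 0;
	}

	// Only edges touch the shared flags.
	if (new_sat != sat) {
		atomic_store_explicit(&h->saturated, new_sat, memory_order_release);
		changed = true;
	}
	if (new_idle != idle) {
		atomic_store_explicit(&h->idle, new_idle, memory_order_release);
		changed = true;
	}
	return changed;
}
```

In steady state (same kind of burst as before) the function performs two relaxed loads and no store, which is the property the description relies on for a cheap datapath hook.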

Worker Parking (worker.{c,h})

Adds an atomic_int paused field and futex-based park/unpark. worker_park() stores 1; at the next housekeeping pass the datapath calls futex_wait with a 1 s timeout. worker_unpark() stores 0 and issues FUTEX_WAKE. The datapath stays RCU-offline for the duration of the wait so that it does not block grace periods, and wakes every second to check for shutdown/reconfig events. Timing is reset on unpark so that the parked duration is not charged as busy cycles.
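
The park/unpark handshake can be sketched with the raw Linux futex syscall. This is a simplified illustration, not the PR's code: the `*_sketch` names are hypothetical, the RCU offline/online transitions and the timing reset are left out, and the caller supplies the timeout (the description uses 1 s).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

// glibc has no futex() wrapper, so go through syscall(2).
static long sys_futex_sketch(atomic_int *uaddr, int op, int val, const struct timespec *ts) {
	return syscall(SYS_futex, uaddr, op, val, ts, NULL, 0);
}

// Control plane: ask the worker to park at its next housekeeping pass.
void worker_park_sketch(atomic_int *paused) {
	atomic_store(paused, 1);
}

// Control plane: clear the flag, then wake the worker if it is sleeping.
void worker_unpark_sketch(atomic_int *paused) {
	atomic_store(paused, 0);
	sys_futex_sketch(paused, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1, NULL);
}

// Worker side: sleep while *paused == 1, for at most *timeout.
// Returns -1 when the timeout expired while still parked, 0 otherwise
// (woken, interrupted, or the value had already changed: EAGAIN).
int worker_park_wait_sketch(atomic_int *paused, const struct timespec *timeout) {
	if (atomic_load(paused) == 0)
		return 0;
	long r = sys_futex_sketch(paused, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 1, timeout);
	if (r == -1 && errno == ETIMEDOUT)
		return -1;
	return 0;
}
```

Because the kernel compares the futex word against the expected value 1 before sleeping, an unpark that lands between the worker's check and its futex call makes the wait return immediately instead of being lost.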

Butterfly-Reverse RXQ Allocation (worker.c)

Changes the distribution logic from a simple forward walk to butterfly-reverse with separate global forward and reverse cursors. Even-indexed ports walk cpus[] forward from fwd_cursor; odd-indexed ports walk backward from rev_cursor. After each port, the corresponding cursor advances by a step derived from that port's RXQ count. This improves endpoint placement: port 0 q0 → cpus[0], port 1 q0 → cpus[n-1]. On hyperthreaded systems (e.g., x86), the endpoint workers land on different physical cores, each with exclusive use of its sibling thread and caches. At low load with dist_size=1, the remaining workers can stay idle.
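
A sketch of the layout under simplifying assumptions: every port has the same number of RX queues, and the cursor step equals that queue count (the description only says the step is "derived from" it). `butterfly_assign_sketch` is a hypothetical name and the output is a CPU index per (port, queue), not a worker object.

```c
#include <stddef.h>

// out[p * n_rxq + q] receives the index into cpus[] for queue q of port p.
// Returns -1 when there is no CPU to assign to, 0 otherwise.
int butterfly_assign_sketch(size_t n_cpus, size_t n_ports, size_t n_rxq, size_t *out) {
	if (n_cpus == 0)
		return -1; // guard before n_cpus is used as a modulo divisor

	size_t fwd_cursor = 0;
	size_t rev_cursor = n_cpus - 1;
	size_t step = n_rxq % n_cpus;

	for (size_t p = 0; p < n_ports; p++) {
		for (size_t q = 0; q < n_rxq; q++) {
			size_t off = q % n_cpus;
			if (p % 2 == 0) // even port: walk forward
				out[p * n_rxq + q] = (fwd_cursor + off) % n_cpus;
			else // odd port: walk backward
				out[p * n_rxq + q] = (rev_cursor + n_cpus - off) % n_cpus;
		}
		// Advance only the cursor this port used.
		if (p % 2 == 0)
			fwd_cursor = (fwd_cursor + step) % n_cpus;
		else
			rev_cursor = (rev_cursor + n_cpus - step) % n_cpus;
	}
	return 0;
}
```

With 8 CPUs and 2 queues per port, port 0 takes cpus[0..1] and port 1 takes cpus[7..6], so the two q0 workers sit at opposite ends of the CPU list.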

Control Loop (rss_autoscale.c)

Three-pass decision triggered on RXQ health transitions: (1) per-port load recommendation from atomic health flags, (2) global cluster-budget arbitration to constrain total cluster usage (filters allowed_n to keep multiples of cluster grouping size), (3) hardware apply. Exposes policy bounds: rss_autoscale_set_cap() / rss_autoscale_clear_cap() for upper bound (shedding) and rss_autoscale_set_floor() / rss_autoscale_clear_floor() for lower bound (force load). Cap overrides floor on conflict. Immediate apply on cap/floor changes with worker park/unpark sync. Subscribes to interface-add events to pin new ports to configured autoscale value.
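
The "cap overrides floor" rule can be sketched as a clamp applied in a fixed order. This is an illustration under stated assumptions: a bound <= 0 means "unset", the function name is hypothetical, and snapping the result onto the PMD's allowed queue counts is left out.

```c
#include <stdint.h>

// Clamp a recommended queue count n to the operator policy.
// The floor is applied first and the cap last, so the cap wins
// whenever both are set and cap < floor.
uint16_t policy_clamp_sketch(uint16_t n, int16_t floor_n, int16_t cap_n) {
	if (floor_n > 0 && n < (uint16_t)floor_n)
		n = (uint16_t)floor_n;
	if (cap_n > 0 && n > (uint16_t)cap_n)
		n = (uint16_t)cap_n;
	return n;
}
```

For example, a load recommendation of 2 with floor 6 and cap 4 yields 4: the floor raises it to 6, then the cap brings it back down.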

Datapath Integration

port_rx.c calls rx_burst_done() on every RX burst (replacing the direct histogram increment) to update both the histogram and the health tracking. In main_loop.c, the worker's futex_wait loop is integrated at the housekeeping step.

CLI/API Infrastructure (modules/infra/api/rss_autoscale.c, modules/infra/cli/rss_autoscale.c)

New API handlers: GR_RSS_AUTOSCALE_LIST (streams per-port state: active queues, recommended load, cap/floor/min/max), GR_RSS_AUTOSCALE_CAP_SET/CLEAR, GR_RSS_AUTOSCALE_FLOOR_SET/CLEAR. CLI commands: infra rss-autoscale show, set NAME cap N, set NAME floor N, clear NAME cap, clear NAME floor.

Configuration

New --rss-autoscale=N option in main.c: N=0 (disabled/legacy), N=1 (per-core), N=2 (grouping-by-2), N>2 (grouping-by-N). Stored in gr_config.rss_autoscale. Tick=20ms, up_hold=20ms, dn_hold=1000ms. Boots with N=1; widens as load appears.
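
The grouping value acts as a filter on the allowed queue counts: the control loop keeps only multiples of the grouping size. A minimal sketch, where `filter_by_cluster_sketch` is a hypothetical helper and a grouping of 0 or 1 is assumed to keep every value:

```c
#include <stddef.h>
#include <stdint.h>

// Copy into out[] the entries of allowed[] that are multiples of
// cluster_size. out[] must have room for count entries.
// Returns the number of entries kept.
size_t filter_by_cluster_sketch(
	const uint16_t *allowed,
	size_t count,
	uint16_t cluster_size,
	uint16_t *out
) {
	size_t kept = 0;

	for (size_t i = 0; i < count; i++) {
		if (cluster_size <= 1 || allowed[i] % cluster_size == 0)
			out[kept++] = allowed[i];
	}
	return kept;
}
```

Applied to the first DPAA2 sizes {1, 2, 3, 4, 6, 7, 8, 12} with a grouping of 2, the usable set becomes {2, 4, 6, 8, 12}.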

@maxime-leroy maxime-leroy changed the title from "Rss autoscale" to "Adaptive RSS auto-scaling" May 27, 2026
@maxime-leroy maxime-leroy marked this pull request as draft May 27, 2026 14:43
@coderabbitai

coderabbitai Bot commented May 27, 2026

Walkthrough

This PR implements adaptive RSS queue autoscaling for DPDK network ports. Configuration is added via a new --rss-autoscale=N command-line option. A port scaling module provides hardware RETA query/apply and capability discovery. Per-queue full/empty state is tracked in the RX datapath and signaled to a control loop. Workers can be parked via futex when queues are idle. The control plane recomputes target queue counts per port based on health flags, enforces operator policy (cap/floor), and arbitrates global cluster resource budgets before applying hardware changes. The RXQ-to-worker distribution algorithm is refined to alternate traversal direction across ports. Public APIs expose state queries and cap/floor configuration; CLI and REST-like API interfaces provide user control.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modules/infra/control/worker.c (1)

401-446: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Division-by-zero risk when affinity resolves to zero datapath CPUs.

n_cpus is used as a modulo divisor before any non-zero guard. If affinity is empty, this crashes in queue distribution.

Suggested fix
 	size_t n_cpus = vec_len(cpus);
+	if (n_cpus == 0) {
+		LOG(ERR, "no datapath CPUs available in affinity mask");
+		ret = -EINVAL;
+		goto end;
+	}
 	size_t fwd_cursor = 0;
 	size_t rev_cursor = (n_cpus > 0) ? (n_cpus - 1) : 0;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/infra/control/worker.c` around lines 401 - 446, n_cpus can be zero
causing a divide-by-zero in the RX queue assignment loops (uses n_cpus as
modulo); add an early guard after size_t n_cpus = vec_len(cpus); that checks if
(n_cpus == 0) and handles it (for example log an error via errno_log or
processLogger, and return/goto end or otherwise skip queue-to-CPU distribution)
so subsequent uses of n_cpus in the rxq loop and in computing off/idx do not
perform modulo zero; update any related variables (rev_cursor/fwd_cursor) only
after the non-zero check and keep references to cpus, affinity, CPU_COUNT and
port->n_rxq to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modules/infra/api/rss_autoscale.c`:
- Around line 63-64: The handlers rss_autoscale_cap_set_handler and
rss_autoscale_floor_set_handler currently cast req->n (uint16_t) to int16_t
which can wrap for values > INT16_MAX; before calling rss_autoscale_set_cap /
rss_autoscale_set_floor validate that req->n <= INT16_MAX and reject/return a
suitable api_out error if it exceeds that limit (do not perform the cast on
out-of-range values), otherwise safely cast to int16_t and proceed; reference
policy_cap/policy_floor semantics when constructing the error to avoid
accidental “unset/cleared” behavior.

In `@modules/infra/control/rss_autoscale.c`:
- Around line 463-479: The call to caps_ensure(s) inside
rss_autoscale_port_state_get is ignored; detect and handle its return value
instead of proceeding with possibly-uninitialized caps fields. Update
rss_autoscale_port_state_get to capture int rc = caps_ensure(s) and if rc < 0
return rc (or, alternatively, clear/zero the caps-related out-params and return
0 only when s->supports_scale is true) so that max_n and min_n (allowed_n[0])
are not read/returned when capability discovery failed; reference the function
rss_autoscale_port_state_get, the caps_ensure(s) call, and the caps field
accesses (s->caps.max_n and s->caps.allowed_n[0]) when making the change.

In `@modules/infra/datapath/port_rx.c`:
- Around line 65-70: rx_burst_done (which updates histogram and rxq_health)
isn't called from the bonded RX paths, so rxq_health isn't updated for bonded
queues; modify rx_bond_offload_process and rx_bond_process to invoke
rx_burst_done(port_id, queue_id, n_pkts) after their per-burst packet handling
(i.e., where they currently account for n_pkts or finish processing a received
burst) so bonded queues participate in histogram and autoscale health tracking
the same as other rx_*_process variants.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5c88068d-31aa-4d0f-abfc-f4afe9d11bde

📥 Commits

Reviewing files that changed from the base of the PR and between fe0c7f9 and 493e1e2.

📒 Files selected for processing (17)
  • main/config.h
  • main/main.c
  • modules/infra/api/gr_infra.h
  • modules/infra/api/meson.build
  • modules/infra/api/rss_autoscale.c
  • modules/infra/cli/meson.build
  • modules/infra/cli/rss_autoscale.c
  • modules/infra/control/meson.build
  • modules/infra/control/port_scale.c
  • modules/infra/control/port_scale.h
  • modules/infra/control/rss_autoscale.c
  • modules/infra/control/rss_autoscale.h
  • modules/infra/control/worker.c
  • modules/infra/control/worker.h
  • modules/infra/control/worker_test.c
  • modules/infra/datapath/main_loop.c
  • modules/infra/datapath/port_rx.c

Comment thread modules/infra/api/rss_autoscale.c Outdated
Comment on lines +63 to +64
ret = rss_autoscale_set_cap(port_id, (int16_t)req->n);
return api_out(ret < 0 ? -ret : 0, 0, NULL);

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify type/range mismatch across API contract and handlers.
rg -n --type=c 'struct gr_rss_autoscale_set_req|uint16_t n;' modules/infra/api/gr_infra.h
rg -n --type=c '\(int16_t\)req->n|rss_autoscale_set_(cap|floor)\(' modules/infra/api/rss_autoscale.c
rg -n --type=c 'ec_node_uint\("N",\s*1,\s*UINT16_MAX' modules/infra/cli/rss_autoscale.c

Repository: DPDK/grout

Length of output: 490


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where rss_autoscale_set_cap/floor are defined
rg -n --type=c "rss_autoscale_set_(cap|floor)\s*\(" modules/infra -S
# Show function bodies around the definitions (small window)
rg -n --type=c "static .*rss_autoscale_set_(cap|floor)\s*\(" modules/infra -S || true
# Also search entire repo if not found under modules/infra
rg -n --type=c "rss_autoscale_set_(cap|floor)\s*\(" -S --glob='*.c' --glob='*.h' || true

Repository: DPDK/grout

Length of output: 1610


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show function bodies for the relevant setters
nl -ba modules/infra/control/rss_autoscale.c | sed -n '140,240p'

Repository: DPDK/grout

Length of output: 97


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Dump the relevant function bodies with line numbers (no nl available)
cat -n modules/infra/control/rss_autoscale.c | sed -n '140,240p'

Repository: DPDK/grout

Length of output: 3828


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check whether rss_autoscale.c already references EINVAL / errno.h
rg -n "EINVAL|errno\.h" modules/infra/api/rss_autoscale.c

Repository: DPDK/grout

Length of output: 150


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find api_out definition and how it's used (error sign conventions)
rg -n --type=c "api_out\s*\(" modules/infra/api/rss_autoscale.c
rg -n --type=c "struct api_out|api_out\s*\{" -S modules/infra/api || true
rg -n --type=c "api_out\s*\(" modules/infra/api -S --glob='*.c'
rg -n --type=c "api_out" modules/infra/api -S --glob='*.h'

Repository: DPDK/grout

Length of output: 8293


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate api_out definition (function or macro)
rg -n --type=c "api_out\s*\(" modules/infra/api -S --glob='*.c' 
rg -n --type=c "api_out" modules/infra/api -S --glob='*.h'
rg -n --type=c "struct api_out" modules/infra/api -S --glob='*.h' --glob='*.c' 

# Try to find header where api_out/errno conventions are defined
rg -n --type=c "api_out" modules/infra -S --glob='*.h' --glob='*.c'

Repository: DPDK/grout

Length of output: 4613


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for definition of api_out (struct/macro/function)
rg -n --type=c "api_out\s*=" -S .
rg -n --type=c "api_out\s*\(" -S --glob='*.h' --glob='*.c'
rg -n --type=c "struct\s+api_out" -S .
rg -n --type=c "static\s+inline\s+.*api_out|inline\s+.*api_out|#define\s+api_out" -S .

Repository: DPDK/grout

Length of output: 36


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Broader search for api_out definition/usages
rg -n "api_out" -S --glob='*.h' --glob='*.c' modules/infra . | head -n 200

Repository: DPDK/grout

Length of output: 16042


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find the port state struct definition to confirm types of policy_cap/policy_floor
rg -n "struct rss_autoscale_port_state" -S modules/infra/control/rss_autoscale.c modules/infra/control/rss_autoscale.h
# Show a small window around the struct definition in the .c
rg -n "policy_(cap|floor)" modules/infra/control/rss_autoscale.c | head -n 50

Repository: DPDK/grout

Length of output: 2983


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show policy defaults + setter parts
cat -n modules/infra/control/rss_autoscale.c | sed -n '20,110p'

# Show clamp_to_policy and effective_* helpers showing negative == unset/cleared semantics
cat -n modules/infra/control/rss_autoscale.c | sed -n '140,220p'
cat -n modules/infra/control/rss_autoscale.c | sed -n '260,320p'

Repository: DPDK/grout

Length of output: 9323


🏁 Script executed:

#!/bin/bash
set -euo pipefail

cat -n modules/infra/api/rss_autoscale.c | sed -n '1,140p'

Repository: DPDK/grout

Length of output: 4056


Validate req->n range before casting to int16_t in rss_autoscale API handlers.

In modules/infra/api/rss_autoscale.c the uint16_t n request field is narrowed to int16_t in:

  • rss_autoscale_cap_set_handler (lines 63-64)
  • rss_autoscale_floor_set_handler (lines 83-84)

For values > INT16_MAX, the cast wraps negative; since policy_cap/policy_floor use negative values to mean “unset/cleared” (and policy application checks > 0), clients can silently clear bounds instead of applying the requested cap/floor.

🔧 Proposed fix
 static struct api_out rss_autoscale_cap_set_handler(const void *request, struct api_ctx *) {
 	const struct gr_rss_autoscale_set_req *req = request;
 	uint16_t port_id;
 	int ret = port_id_from_iface_id(req->iface_id, &port_id);
 	if (ret < 0)
 		return api_out(-ret, 0, NULL);
+	if (req->n > INT16_MAX)
+		return api_out(EINVAL, 0, NULL);
 	ret = rss_autoscale_set_cap(port_id, (int16_t)req->n);
 	return api_out(ret < 0 ? -ret : 0, 0, NULL);
 }
@@
 static struct api_out rss_autoscale_floor_set_handler(const void *request, struct api_ctx *) {
 	const struct gr_rss_autoscale_set_req *req = request;
 	uint16_t port_id;
 	int ret = port_id_from_iface_id(req->iface_id, &port_id);
 	if (ret < 0)
 		return api_out(-ret, 0, NULL);
+	if (req->n > INT16_MAX)
+		return api_out(EINVAL, 0, NULL);
 	ret = rss_autoscale_set_floor(port_id, (int16_t)req->n);
 	return api_out(ret < 0 ? -ret : 0, 0, NULL);
 }

Comment thread modules/infra/control/rss_autoscale.c Outdated
Comment on lines +463 to +479
struct rss_autoscale_port_state *s = state_ensure(port_id);
if (s == NULL)
return -ENOMEM;
caps_ensure(s);
if (n_active)
*n_active = s->n_active;
if (n_load_recommended)
*n_load_recommended = s->n_load_recommended;
if (cap_effective)
*cap_effective = s->policy_cap;
if (floor_effective)
*floor_effective = s->policy_floor;
if (max_n)
*max_n = s->caps.max_n;
if (min_n)
*min_n = (s->caps.allowed_count > 0) ? s->caps.allowed_n[0] : 0;
return 0;

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ignored caps_ensure() failure in rss_autoscale_port_state_get.

Line 466 calls caps_ensure(s) but ignores its return value. If capability discovery fails (e.g., -ENODEV or -ENOTSUP), the function proceeds to return potentially uninitialized or stale caps fields (max_n, allowed_n[0]), which could mislead API consumers.

Consider either propagating the error or documenting that caps fields are only valid when supports_scale is true.

Proposed fix
 int rss_autoscale_port_state_get(
 	uint16_t port_id,
 	uint16_t *n_active,
 	uint16_t *n_load_recommended,
 	int16_t *cap_effective,
 	int16_t *floor_effective,
 	uint16_t *max_n,
 	uint16_t *min_n
 ) {
 	struct rss_autoscale_port_state *s = state_ensure(port_id);
 	if (s == NULL)
 		return -ENOMEM;
-	caps_ensure(s);
+	int ret = caps_ensure(s);
+	// caps fields valid only if ret == 0 or s->caps.supports_scale is true
 	if (n_active)
 		*n_active = s->n_active;
 	if (n_load_recommended)
 		*n_load_recommended = s->n_load_recommended;
 	if (cap_effective)
 		*cap_effective = s->policy_cap;
 	if (floor_effective)
 		*floor_effective = s->policy_floor;
-	if (max_n)
-		*max_n = s->caps.max_n;
-	if (min_n)
-		*min_n = (s->caps.allowed_count > 0) ? s->caps.allowed_n[0] : 0;
+	if (max_n)
+		*max_n = (ret == 0) ? s->caps.max_n : 0;
+	if (min_n)
+		*min_n = (ret == 0 && s->caps.allowed_count > 0) ? s->caps.allowed_n[0] : 0;
 	return 0;
 }

Comment on lines +65 to +70
// Combined per-rx_burst datapath bookkeeping: histogram + rss-autoscale
// saturation/idle tracking. Called from every rx_*_process variant.
static inline void rx_burst_done(uint16_t port_id, uint16_t queue_id, uint16_t n_pkts) {
rx_burst_histogram_inc(port_id, n_pkts);
rxq_health_update(port_id, queue_id, n_pkts);
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Autoscale health tracking is not wired to all RX process paths.

rx_burst_done() is not used in rx_bond_offload_process() and rx_bond_process(), so those queues never update rxq_health. That can mislead the autoscale controller and cause incorrect park/scale actions on bonded traffic.

Suggested fix
 uint16_t
 rx_bond_offload_process(struct rte_graph *graph, struct rte_node *node, void **, uint16_t) {
@@
 	rx = rte_eth_rx_burst(ctx->rxq.port_id, ctx->rxq.queue_id, mbufs, ctx->burst_size);
+	rx_burst_done(ctx->rxq.port_id, ctx->rxq.queue_id, rx);
 	if (rx == 0)
 		return 0;
@@
 uint16_t rx_bond_process(struct rte_graph *graph, struct rte_node *node, void **, uint16_t) {
@@
 	rx = rte_eth_rx_burst(ctx->rxq.port_id, ctx->rxq.queue_id, mbufs, ctx->burst_size);
+	rx_burst_done(ctx->rxq.port_id, ctx->rxq.queue_id, rx);
 	if (rx == 0)
 		return 0;

Add a thin abstraction layer (port_scale.{c,h}) that exposes a uniform
API to dynamically restrict the HW RSS distribution of a port to its
first N RX queues:

  port_scale_caps_get(port, &caps)   -> queries reta_size and the set of
                                        N values the PMD will accept
  port_scale_apply(port, n)          -> builds a uniform reta[i]=i%n and
                                        calls rte_eth_dev_rss_reta_update
  port_scale_caps_next/prev(caps, n) -> walk the HW-allowed N list

The implementation is portable: it works on any PMD that advertises a
non-zero reta_size (Intel ixgbe/i40e/ice, Mellanox mlx5, Broadcom bnxt)
and on DPAA2 once the matching RETA emulation lands in the PMD. For
DPAA2 specifically, the caps query exposes the discrete dist_size list
{1, 2, 3, 4, 6, 7, 8, 12, ...} so the controller never asks the PMD for
a value it would reject with -ENOTSUP.
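
Walking the allowed list can be sketched as a linear scan over a sorted array. The struct and function names below are hypothetical (the field names mirror allowed_n / allowed_count seen in the review excerpts), and returning n unchanged at either end of the list is an assumption about the edge behaviour.

```c
#include <stddef.h>
#include <stdint.h>

struct caps_sketch {
	const uint16_t *allowed_n; // sorted ascending
	size_t allowed_count;
};

// Smallest allowed value strictly greater than n, or n if there is none.
uint16_t caps_next_sketch(const struct caps_sketch *c, uint16_t n) {
	for (size_t i = 0; i < c->allowed_count; i++) {
		if (c->allowed_n[i] > n)
			return c->allowed_n[i];
	}
	return n;
}

// Largest allowed value strictly smaller than n, or n if there is none.
uint16_t caps_prev_sketch(const struct caps_sketch *c, uint16_t n) {
	for (size_t i = c->allowed_count; i > 0; i--) {
		if (c->allowed_n[i - 1] < n)
			return c->allowed_n[i - 1];
	}
	return n;
}
```

On the DPAA2 list this steps 4 → 6 on scale-up and 6 → 4 on scale-down, skipping the value 5 that the PMD would reject.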

This commit only adds the abstraction. Default port behaviour is
unchanged: no RETA narrowing happens until the rss-autoscale controller
in a follow-up commit subscribes to GR_EVENT_IFACE_POST_ADD and pins
ports when --rss-autoscale is set.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Introduce worker_park() / worker_unpark() so the control plane can
suspend a dataplane worker between scaling events. Parked workers
futex_wait() on a shared atomic word, leaving cpu-user time at zero
and letting cpuidle (cpu-pw15 on Cortex-A72) engage as the dominant
state -- the prerequisite for load-proportional CPU consumption.

The mechanism is cooperative: the worker checks atomic_load(&paused)
at every housekeeping interval (every HOUSEKEEPING_INTERVAL graph
walks), and if set, drops out of the RCU online region and calls
futex_wait(..., FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 1, 1s_timeout). The
1s timeout makes shutdown and graph reconfig events visible within
one second even while parked, without paying any wakeup cost during
the wait. worker_unpark() stores 0 and issues a FUTEX_WAKE; the race
with futex_wait is handled by the kernel (EAGAIN when the value
already changed).

This is opt-in -- workers that are never park()ed see no change in
behavior. busy-poll and usleep semantics in the housekeeping branch
are preserved.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
@maxime-leroy maxime-leroy force-pushed the rss_autoscale branch 2 times, most recently from 61cba06 to 779119c May 28, 2026 21:44
Introduce rss-autoscale: an opt-in controller that dynamically
narrows the HW RSS distribution per port to fewer active queues
when the workers have headroom, and parks the workers that no
longer receive packets via the futex primitive added in the
previous commit.

CLI: --rss-autoscale=N at startup, where N is the cluster grouping
size used to align scaling steps:
  0 = disabled (legacy: all workers always poll all configured
      queues, no parking, no RETA scale).
  1 = enabled, per-core scaling (no cluster constraint).
  2 = enabled, scale by groups of 2 (x86 with hyperthreading or
      any SoC where 2 cores share an L2 cache).
  N = scale by groups of N for larger cache-sharing topologies.

Scale-up is event-driven. Each rxq publishes a saturated flag via
rxq_health_update in the rx datapath after RSS_AUTOSCALE_CONSEC_FULL
consecutive full bursts; the transition wakes the control thread,
which steps the port to the next allowed dist-size in microseconds.

Scale-down is timer-driven per port. A libevent EV_PERSIST timer
fires every SCALE_DOWN_PERIOD_S (10 minutes) per port and steps
the port one allowed dist-size down when the aggregated busy/total
cycle ratio of the workers servicing it has stayed below 40% since
the previous fire. The timer is rearmed on every do_apply so a
fresh 10-minute window always separates a scale-up from the next
scale-down candidate, providing intrinsic anti-flapping.
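
The scale-down test itself reduces to a ratio check over the window. A minimal sketch using integer arithmetic; the function and macro names are hypothetical, and treating an empty window (no cycles sampled) as "do not scale down" is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCALE_DOWN_BUSY_PCT_SKETCH 40 // threshold quoted in the commit message

// busy_cycles and total_cycles are summed over the workers servicing the
// port since the previous timer fire. Returns true when the busy ratio
// stayed strictly below the threshold.
bool should_scale_down_sketch(uint64_t busy_cycles, uint64_t total_cycles) {
	if (total_cycles == 0)
		return false; // no sample in this window: keep the current size
	// busy / total < 40 / 100, cross-multiplied to avoid floating point.
	// Cycle counts over a 10-minute window stay far below 2^64 / 100.
	return busy_cycles * 100 < total_cycles * SCALE_DOWN_BUSY_PCT_SKETCH;
}
```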

External knobs (rss_autoscale_set_cap / rss_autoscale_set_floor,
exposed in the follow-up commit) carry an upper bound (thermal
shed) and a lower bound (force load for characterisation or burst
pre-arm). Both accept zero to clear. Values are validated against
the PMD's allowed dist-size set at API entry. When both are set
and cap < floor, cap wins as a safety override.

On GR_EVENT_IFACE_POST_ADD for a port whose PMD supports RETA and
exposes at least two cluster_size-aligned dist-sizes, the controller
narrows the port to N = cluster_size active queues. Ports that only
expose a single usable N (or no RETA at all) are left unmanaged: the
PMD's default HW RSS still distributes their traffic and their
workers are never parked. The event-driven path scales up as soon as
load justifies it.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Change worker_queue_distribute() to lay out per-port RX queues using
a "butterfly-reverse" pattern: even-indexed ports walk cpus[] forward
from a global fwd_cursor, odd-indexed ports walk it backward from a
global rev_cursor. Both cursors advance across ports so every worker
still receives some rxq even when n_workers exceeds n_rxq_per_port
(the naive "fill from each end" variant would otherwise leave middle
CPUs idle).

Endpoint placement: port 0 q0 -> cpus[0], port 1 q0 -> cpus[n-1]. On
x86 with hyperthreading (where sibling threads share L1 and L2), this
puts the two endpoint workers on different physical cores, giving
each exclusive use of its sibling thread and its caches. At
rss-autoscale dist_size=1 on every port, the two active workers can
then run flat-out while the rest of the dataplane CPUs idle.

As load grows (dist_size 2, 4, ...) each port expands into a disjoint
range of cpus until n_active_total > n_workers, at which point
cross-port workers reappear -- at full load the cache exclusivity
benefit no longer matters.

Direction is keyed on port_idx (position in grout's port vec). The
vec ordering is the algorithm's invariant: stable across iterations,
owned by grout, and enumerates every configured port exactly once.
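
The layout can be illustrated with a standalone sketch. This is not worker_queue_distribute() itself: the function name, the flat `out` array and the modulo wrap-around of both cursors are assumptions made for the example; only the even/odd direction rule and the two global cursors come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch of the butterfly-reverse layout.
 * out[p * n_rxq + q] receives the index into cpus[] that services
 * queue q of port p.
 */
static void distribute(size_t n_ports, size_t n_rxq, size_t n_workers, size_t *out) {
	size_t fwd_cursor; /* next slot for even-indexed ports */
	size_t rev_cursor; /* next slot for odd-indexed ports */

	assert(n_workers > 0);
	fwd_cursor = 0;
	rev_cursor = n_workers - 1;

	for (size_t p = 0; p < n_ports; p++) {
		for (size_t q = 0; q < n_rxq; q++) {
			if (p % 2 == 0) {
				out[p * n_rxq + q] = fwd_cursor;
				/* assumption: wrap when the end is reached */
				fwd_cursor = (fwd_cursor + 1) % n_workers;
			} else {
				out[p * n_rxq + q] = rev_cursor;
				/* assumption: wrap when the start is reached */
				rev_cursor = (rev_cursor + n_workers - 1) % n_workers;
			}
		}
	}
}
```

With 4 ports, 2 queues per port and 8 workers, this yields port 0 -> cpus[0], cpus[1]; port 1 -> cpus[7], cpus[6]; port 2 -> cpus[2], cpus[3]; port 3 -> cpus[5], cpus[4]. Every worker gets a queue because the cursors persist across ports instead of restarting from each end.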

Update worker_test.c expectations for both queue_distribute_reduce
and queue_distribute_increase to match the new layout. Both tests
exercise the global-cursor advancement.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Let the operator bound the controller's RX queue scaling per port with
two attributes set through "grcli interface set", like the MTU: rss-cap
(upper bound on the active queue count) and rss-floor (lower bound); 0
leaves a bound unset. The cap sheds load during thermal events, the floor
keeps a minimum number of workers active even without traffic. When both
are set and cap < floor, the cap wins.

They live in the port configuration (iface_info_port) rather than in the
controller, so the surgical set_attrs update keeps them across an
unrelated reconfig. A value is validated against the controller's usable
distribution sizes -- the PMD's advertised sizes after the cluster_size
filtering, shared with caps_ensure -- and against the target queue count
when the same request also changes nb-rxqs. Both are shown in "interface
show" and "rss-autoscale show".
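
The validation rule can be sketched as a single predicate. The name `bound_is_valid()` and its parameters are hypothetical; `usable` stands for the controller's distribution sizes after cluster_size filtering, and `nb_rxqs` for the target queue count of the same request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check for an rss-cap or rss-floor value.
 * 0 is always accepted because it clears the bound.
 */
static bool bound_is_valid(
	uint16_t value,
	const uint16_t *usable,
	size_t n_usable,
	uint16_t nb_rxqs
) {
	assert(nb_rxqs > 0);
	if (value == 0)
		return true;
	if (value > nb_rxqs)
		return false; /* cannot exceed the target queue count */

	/* The value must be one of the usable distribution sizes. */
	for (size_t i = 0; i < n_usable; i++) {
		if (usable[i] == value)
			return true;
	}
	return false;
}
```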

Reconfiguration is handled in place: since only the queue count changes
the usable grid, the controller rebuilds just its caps and re-clamps the
current active count, applying only when it moves. An unrelated reconfig
no longer resets n_active, rxq_health and the scale-down timer to
baseline.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
When rss-autoscale narrows a port's RETA to N queues, queues N..max no
longer receive any traffic, yet the worker that owns them keeps polling
them on every graph walk as long as it stays awake for its other, still
active queues. On DPAA2 an empty rte_eth_rx_burst is not free: the qbman
volatile-dequeue round-trip costs several hundred CPU cycles, so these
dead polls steal time from the busy queues sharing the core.

Give a deactivated queue a no-op process callback, rx_inactive_process,
so the graph walk visits the node but issues no poll. The callback is
chosen at graph build time from the port's current active count, and
swapped in/out at scale time on each worker's live graph by
worker_graph_rxq_set_active() without a graph reload -- the same
node->process swap DPDK uses for pcap. The store is an aligned pointer
write and the worker reads either the old or the new callback, both
valid.
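
The callback swap can be illustrated in isolation. The sketch below is not the grout code: it uses C11 atomics instead of the plain aligned pointer store described above, a simplified `struct rx_node` instead of the DPDK graph node, and made-up signatures for rx_process_for() and the set-active helper. It also assumes queue q is active when q < n_active.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t (*process_fn)(void *ctx);

/* Simplified stand-in for a graph node; only the callback matters here. */
struct rx_node {
	_Atomic(process_fn) process;
};

/* Stand-in for the real RX callback: pretends it polled 32 packets. */
static uint16_t rx_process(void *ctx) {
	(void)ctx;
	return 32;
}

/* No-op callback: the node is visited but no poll is issued. */
static uint16_t rx_inactive_process(void *ctx) {
	(void)ctx;
	return 0;
}

/* Single selection point shared by the build path and the runtime swap. */
static process_fn rx_process_for(uint16_t queue_id, uint16_t n_active) {
	return queue_id < n_active ? rx_process : rx_inactive_process;
}

/* Controller side: swap the callback on a live node. */
static void rxq_set_active(struct rx_node *node, uint16_t queue_id, uint16_t n_active) {
	atomic_store_explicit(
		&node->process, rx_process_for(queue_id, n_active), memory_order_release
	);
}

/* Worker side: load once per visit; old or new, both are valid. */
static uint16_t walk_node(struct rx_node *node, void *ctx) {
	process_fn fn = atomic_load_explicit(&node->process, memory_order_acquire);
	return fn(ctx);
}
```

Because the worker reads the pointer once per visit and both candidates are complete, self-contained callbacks, no lock or graph reload is needed around the swap.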

Factor the rx process-callback selection into rx_process_for() so the
build path and the runtime swap cannot drift.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
@MortenBroerup
Contributor

The power management algorithm per lcore in the SmartShare appliance is relatively simple:
After a certain duration of idle or low workloads, the system uses "napping" (our internal term for that power state) with relatively short durations.
And after a certain duration of idle workloads, the system uses "sleeping", which means longer durations.

As input for the algorithm for each lcore, we look at the maximum number of objects processed by any "node" every time the "graph" is traversed.
0 objects is Idle, 1 object is Low, more than 1 object is Normal.
If any "node" reports that it has more objects ready for processing, we don't nap or sleep at all, but restart traversing the graph immediately.

We are primarily targeting low power consumption during low traffic hours, and low scheduling latency during peak traffic hours.

For long durations, we use nanosleep(), where the kernel puts the lcore in a low power state.
For short durations, we use rte_power_pause(), which provides modest power savings with nearly no jitter or wake-up delay.

We don't use any of the fancy power management hooks in DPDK. It's all done by our application.
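
For comparison with the controller in this PR, the per-lcore policy described in this comment could look roughly like the sketch below. It is an interpretation, not SmartShare code: all names are hypothetical, "duration" is approximated by a count of consecutive graph walks, and the two thresholds are free parameters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum load { LOAD_IDLE, LOAD_LOW, LOAD_NORMAL };
enum action { ACT_CONTINUE, ACT_NAP, ACT_SLEEP };

/* Hypothetical per-lcore state: consecutive graph walks per load class. */
struct lcore_pm {
	uint32_t idle_or_low_walks;
	uint32_t idle_walks;
};

/* 0 objects is Idle, 1 object is Low, more than 1 is Normal. */
static enum load classify(uint16_t max_objs) {
	if (max_objs == 0)
		return LOAD_IDLE;
	return max_objs == 1 ? LOAD_LOW : LOAD_NORMAL;
}

/*
 * Called after each graph walk with the maximum number of objects
 * processed by any node. nap_after and sleep_after are assumed
 * thresholds expressed in walks.
 */
static enum action pm_decide(
	struct lcore_pm *pm,
	uint16_t max_objs,
	bool more_pending,
	uint32_t nap_after,
	uint32_t sleep_after
) {
	enum load l = classify(max_objs);

	assert(nap_after > 0 && sleep_after > 0);
	pm->idle_walks = (l == LOAD_IDLE) ? pm->idle_walks + 1 : 0;
	pm->idle_or_low_walks = (l != LOAD_NORMAL) ? pm->idle_or_low_walks + 1 : 0;

	if (more_pending)
		return ACT_CONTINUE; /* work is ready: walk again immediately */
	if (pm->idle_walks >= sleep_after)
		return ACT_SLEEP; /* e.g. nanosleep() */
	if (pm->idle_or_low_walks >= nap_after)
		return ACT_NAP; /* e.g. rte_power_pause() */
	return ACT_CONTINUE;
}
```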
