
cfiber Architecture Documentation

This document describes the internal architecture and implementation details of the cfiber library.

Overview

cfiber implements stackful coroutines (fibers) using low-level context switching. The library is split into:

  1. Architecture-agnostic C code - High-level fiber management (fiber.c, headers)
  2. Architecture-specific assembly - Low-level context switching (context_*.S, fiber_prologue_*.S)

This separation ensures portability while allowing optimal performance on each platform.

How It Works

This section provides a conceptual overview of how fibers work. For detailed implementation specifics, see the later sections on Context Switching Mechanism and Platform-Specific Details.

Context Switching

Fibers work by saving and restoring CPU register state (the "context"). When switching from fiber A to fiber B:

  1. Save Context: All callee-saved registers of fiber A are saved to memory
  2. Restore Context: All callee-saved registers of fiber B are loaded from memory
  3. Jump: The program counter/instruction pointer is updated to resume fiber B

The magic happens in architecture-specific assembly code that knows exactly which registers need to be preserved according to each platform's ABI (Application Binary Interface).
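The save/restore/jump sequence can be observed on a hosted POSIX system with the `ucontext` API, which performs the same three steps inside libc. This is only an illustration and not cfiber code: cfiber uses its own assembly, and the names `record`, `worker`, and `run_demo` are hypothetical.

```c
#include <stddef.h>
#include <ucontext.h>

/* Illustration only: POSIX ucontext stands in for cfiber's assembly. */
static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];
static char trace[16];            /* records the order of execution */
static size_t trace_len;

static void record(char c) {
    trace[trace_len++] = c;
    trace[trace_len] = '\0';
}

static void worker(void) {
    record('b');                         /* first run of the fiber */
    swapcontext(&fiber_ctx, &main_ctx);  /* save fiber, resume main */
    record('d');                         /* resumed exactly where it left off */
}                                        /* returning resumes uc_link (main_ctx) */

/* Alternates between main and the fiber; returns the recorded order. */
static const char *run_demo(void) {
    trace_len = 0;
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &main_ctx;
    makecontext(&fiber_ctx, worker, 0);

    record('a');
    swapcontext(&main_ctx, &fiber_ctx);  /* save main, start fiber */
    record('c');
    swapcontext(&main_ctx, &fiber_ctx);  /* resume fiber after its yield */
    record('e');
    return trace;
}
```

Calling `run_demo()` interleaves the two flows of control, so the trace alternates between main and the fiber. `swapcontext` also saves the signal mask, which is one reason a dedicated assembly routine that saves only callee-saved registers is faster.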

Fiber Initialization

When a fiber is initialized:

  1. Its stack pointer is set to the top of its allocated stack
  2. The fiber entry point address and the user data pointer are placed in callee-saved registers, so they survive the first context switch
  3. A special "prologue" function is set as the return address; the first switch to the fiber "returns" into it, and it calls the entry point with the user data
  4. An "epilogue" function handles cleanup when the fiber's entry function returns

Memory Layout

Each fiber has its own stack that grows downward:

High Address
┌─────────────────┐
│  Stack grows    │
│      down       │
│       ↓         │
├─────────────────┤ ← Stack pointer (initially at top)
│                 │
│   Unused stack  │
│     space       │
│                 │
├─────────────────┤
│  Stack base     │
└─────────────────┘
Low Address

Performance at a Glance

Context switch overhead (approximate):

  • x86_64: ~50-100 cycles
  • AArch64: ~40-80 cycles
  • ARM Cortex-M3/M4: ~30-60 cycles
  • ARM Cortex-M4 with FPU: ~60-100 cycles

Compare to typical OS thread switch: 1000-10000 cycles - fibers are 10-300x faster!

Directory Structure

cfiber/
├── include/cfiber/         # Public headers
│   ├── context.h           # Context structure definitions
│   └── fiber.h             # Fiber API
├── src/cfiber/             # Implementation
│   ├── fiber.c             # Platform-agnostic fiber initialization
│   ├── context_x86_64.S    # x86_64 context switching
│   ├── context_aarch64.S   # AArch64 context switching
│   ├── context_armv6-m.S   # ARMv6-M (Cortex-M0/M0+)
│   ├── context_armv7-m.S   # ARMv7-M (Cortex-M3/M4/M7)
│   └── fiber_prologue_*.S  # Architecture-specific prologues
├── sample/                 # Complete scheduler example
├── tests/                  # Unit tests per architecture
└── utils/                  # Build utilities and toolchains

Core Components

1. Context Structure (context_t)

The context structure holds the CPU state needed to resume a fiber. Each architecture defines its own context layout based on its Application Binary Interface (ABI).

Common elements:

  • Stack pointer
  • Callee-saved general purpose registers
  • Frame pointer (if used)
  • Link register (ARM) or return address (x86_64)
  • Callee-saved FPU registers (AArch64; optional on ARM Cortex-M)

Why only callee-saved registers?

  • From the compiler's point of view, switch_context() is an ordinary function call
  • The compiler assumes any call may clobber caller-saved registers, so the calling code has already saved the ones it still needs on its own stack
  • Only callee-saved registers must survive the call, so only they need to be stored in the context
  • This minimizes the size of the context and switching overhead

2. Context Switching (switch_context())

Implemented in assembly for each architecture. The function:

void switch_context(context_t* old, context_t* new);

Steps:

  1. Save all callee-saved registers to old
  2. Load all callee-saved registers from new
  3. Return (which jumps to the address in the restored link register/return address)

Flow diagram:

Fiber A running
     |
     v
switch_context(&A.ctx, &B.ctx)
     |
     +---> Save A's registers to A.ctx
     |
     +---> Load B's registers from B.ctx
     |
     +---> Jump to B's saved instruction pointer
     |
     v
Fiber B running

3. Fiber Initialization (init_fiber())

Sets up a new fiber's initial state. The function:

void init_fiber(fiber_t* fiber, fiber_fn func, void* user_data);

Initialization steps:

  1. Set stack pointer to top of allocated stack

    stackPtr = fiber->stack + fiber->stack_size
    
  2. Align stack according to ABI requirements

    • x86_64/AArch64: 16-byte alignment
    • ARM: 8-byte alignment
  3. Store function pointer and user data in callee-saved registers

    • These persist across the context switch
    • Available when fiber starts running
  4. Set link register/return address to fiber_prologue

    • When the fiber is first switched to, it "returns" to fiber_prologue
    • fiber_prologue sets up the call to the user's function
  5. Clear frame pointer for new stack

    • Signals to debuggers that this is the base of the stack
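Steps 1 and 2 amount to a small piece of pointer arithmetic. The sketch below assumes the alignment is a power of two; `initial_stack_pointer` is a hypothetical helper name, not part of the cfiber API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the highest address inside the stack buffer
 * that is a multiple of `alignment` (which must be a power of two). */
static uintptr_t initial_stack_pointer(void *stack, size_t stack_size,
                                       size_t alignment) {
    uintptr_t top = (uintptr_t)stack + stack_size;  /* one past the end */
    return top & ~((uintptr_t)alignment - 1);       /* round down */
}
```

Rounding down rather than up keeps the result inside the allocated buffer, at the cost of up to `alignment - 1` unused bytes at the top.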

4. Fiber Prologue

The fiber_prologue function is the actual entry point for a new fiber. It's implemented in assembly and:

  1. Retrieves the function pointer from a callee-saved register
  2. Retrieves the user data pointer from a callee-saved register
  3. Calls the user's fiber function with the user data as an argument
  4. When the user function returns, calls fiber_epilogue

5. Fiber Epilogue

The fiber_epilogue function handles fiber completion:

[[noreturn]] void fiber_epilogue() {
    scheduler_return_fiber();
    __builtin_unreachable();
}
  • Calls the user-provided scheduler_return_fiber() function
  • The scheduler decides what to do next (run another fiber, cleanup, etc.)
  • Marked [[noreturn]] because it never returns normally

Context Switching Mechanism

x86_64 Implementation

Registers saved (System V AMD64 ABI):

  • rsp - Stack pointer
  • r12-r15, rbx, rbp - Callee-saved general purpose registers

Assembly code structure:

switch_context:
    # Save current context
    mov [rdi + 0x00], rsp
    mov [rdi + 0x08], r15
    # ... save other registers
    
    # Load new context
    mov rsp, [rsi + 0x00]
    mov r15, [rsi + 0x08]
    # ... load other registers
    
    ret  # Pop the return address from the new stack and jump to it

Notes:

  • First argument (rdi) = old context
  • Second argument (rsi) = new context
  • Uses Intel syntax: mov destination, source
  • The ret instruction pops the return address from the stack and jumps to it
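A C view of a context that would match these offsets might look as follows. Only the `rsp` and `r15` offsets appear in the listing above; the order of the remaining fields and the type name `x86_64_context_sketch_t` are assumptions made for illustration. The total agrees with the 56-byte context size listed under Platform-Specific Details.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: field order after r15 is an assumption. */
typedef struct {
    uint64_t rsp;  /* offset 0x00, as in the listing */
    uint64_t r15;  /* offset 0x08, as in the listing */
    uint64_t r14;
    uint64_t r13;
    uint64_t r12;
    uint64_t rbx;
    uint64_t rbp;
} x86_64_context_sketch_t;

/* Compile-time checks keep the C struct and the assembly offsets in sync. */
_Static_assert(offsetof(x86_64_context_sketch_t, rsp) == 0x00, "rsp offset");
_Static_assert(offsetof(x86_64_context_sketch_t, r15) == 0x08, "r15 offset");
_Static_assert(sizeof(x86_64_context_sketch_t) == 56, "7 registers x 8 bytes");
```

Static assertions of this kind are a cheap way to catch a mismatch between a header and hand-written assembly at build time.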

AArch64 Implementation

Registers saved (AAPCS64):

  • sp - Stack pointer
  • x19-x30 - Callee-saved general purpose (x29=FP, x30=LR)
  • v8-v15 (d8-d15) - Callee-saved floating point registers

Assembly code structure:

switch_context:
    // Save current context
    mov x2, sp
    str x2, [x0], #8
    stp x19, x20, [x0], #16
    // ... save other registers
    
    // Load new context
    ldr x2, [x1], #8
    mov sp, x2
    ldp x19, x20, [x1], #16
    // ... load other registers
    
    ret  // Return to address in x30 (link register)

Notes:

  • First argument (x0) = old context
  • Second argument (x1) = new context
  • stp/ldp = store/load pair (more efficient)
  • FP registers are 128-bit, but we only save lower 64 bits (callee-saved portion)

ARM Cortex-M Implementation

Registers saved (AAPCS):

  • sp (r13) - Stack pointer
  • r4-r11 - Callee-saved general purpose
  • lr (r14) - Link register (return address)
  • s16-s31 - FPU registers (if FPU enabled/CFIBER_ARM_FPU defined)

Assembly code structure:

switch_context:
    @ Save current context
    mov r2, sp
    str r2, [r0], #4
    stmia r0!, {r4-r11}
    str lr, [r0], #4
    
    @ Optional (only when CFIBER_ARM_FPU is defined): save FPU registers
    vstmia r0!, {s16-s31}
    
    @ Load new context
    ldr r2, [r1], #4
    mov sp, r2
    ldmia r1!, {r4-r11}
    ldr lr, [r1], #4
    
    @ Optional (only when CFIBER_ARM_FPU is defined): load FPU registers
    vldmia r1!, {s16-s31}
    
    bx lr  @ Branch to address in link register

Notes:

  • Uses Thumb-2 instruction set
  • stmia/ldmia = store/load multiple, increment after
  • FPU instructions are conditionally assembled (M0/M0+/M3 have no FPU)
  • bx lr = branch and exchange to address in link register
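The store order in the listing implies a memory layout that can be written down as C structs. The type names below are hypothetical; the field order follows the listing (sp, r4-r11, lr, then s16-s31), and the sizes match the 40-byte and 104-byte figures given under Platform-Specific Details.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C view of the context written by the listing above. */
typedef struct {
    uint32_t sp;         /* stored first */
    uint32_t r4_r11[8];  /* stmia r0!, {r4-r11} */
    uint32_t lr;         /* stored after the general purpose registers */
} cortex_m_context_sketch_t;

/* Variant with the optional FPU block appended. */
typedef struct {
    cortex_m_context_sketch_t core;
    uint32_t s16_s31[16];  /* vstmia r0!, {s16-s31} */
} cortex_m_fpu_context_sketch_t;

_Static_assert(sizeof(cortex_m_context_sketch_t) == 40, "10 words");
_Static_assert(offsetof(cortex_m_context_sketch_t, lr) == 36, "lr is word 9");
_Static_assert(sizeof(cortex_m_fpu_context_sketch_t) == 104, "26 words");
```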

Fiber Lifecycle

┌─────────────────┐
│   Uninitialized │
└────────┬────────┘
         │
         │ init_fiber()
         │
         v
┌──────────────────┐
│   Ready to Run   │◄────────────┐
└────────┬─────────┘             │
         │                       │
         │ switch_context()      │ yield()
         │                       │
         v                       │
┌──────────────────┐             │
│     Running      │─────────────┘
└────────┬─────────┘
         │
         │ function returns
         │
         v
┌─────────────────┐
│    Completed    │
└─────────────────┘

State transitions:

  1. Uninitialized → Ready

    • User allocates stack
    • Calls init_fiber() with function and user data
    • Context is set up but fiber hasn't started yet
  2. Ready → Running

    • Scheduler calls switch_context() to the fiber
    • Fiber starts executing from fiber_prologue
    • User function begins running
  3. Running → Ready

    • Fiber explicitly yields by calling user-provided yield function
    • Yield function calls switch_context() to switch to another fiber/scheduler
    • Fiber's context is saved, can be resumed later
  4. Running → Completed

    • User function returns
    • fiber_epilogue calls scheduler_return_fiber()
    • Scheduler marks fiber as completed and selects next fiber
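A scheduler built on cfiber can track these states itself and reject illegal transitions. The following is a minimal sketch under that assumption: `fiber_state_t` and `fiber_transition_allowed` are hypothetical names, and treating Completed as terminal is a choice made here, since the text does not say whether a completed fiber may be re-initialized.

```c
#include <stdbool.h>

/* Hypothetical state tracking kept by the user's scheduler. */
typedef enum {
    FIBER_UNINITIALIZED,
    FIBER_READY,
    FIBER_RUNNING,
    FIBER_COMPLETED
} fiber_state_t;

/* Returns true only for the four transitions listed above. */
static bool fiber_transition_allowed(fiber_state_t from, fiber_state_t to) {
    switch (from) {
    case FIBER_UNINITIALIZED: return to == FIBER_READY;     /* init_fiber() */
    case FIBER_READY:         return to == FIBER_RUNNING;   /* switch_context() */
    case FIBER_RUNNING:       return to == FIBER_READY      /* yield */
                                  || to == FIBER_COMPLETED; /* function returns */
    case FIBER_COMPLETED:     return false;                 /* terminal here */
    }
    return false;
}
```

Asserting on this check in debug builds helps catch bugs such as switching to a fiber that has already finished.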

Platform-Specific Details

x86_64 (System V AMD64 ABI)

  • Stack alignment: 16 bytes (required by System V ABI)
  • Context switch overhead: ~50-100 CPU cycles
  • Context size: 56 bytes
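  • FPU handling: SSE registers are caller-saved, not saved in context. The ABI does treat the MXCSR control bits and the x87 control word as callee-saved; a 56-byte context holds only the seven integer registers, so a rounding mode changed in one fiber carries over to the next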

AArch64

  • Stack alignment: 16 bytes (required by AAPCS64)
  • Context switch overhead: ~40-80 CPU cycles
  • Context size: 208 bytes (with FP registers)
  • FPU handling: d8-d15 (lower 64 bits of v8-v15) saved

ARM Cortex-M

  • Stack alignment: 8 bytes (required by AAPCS)
  • Context switch overhead: ~30-60 cycles (M3/M4), ~60-100 cycles (M4 with FPU)
  • Context size:
    • Without FPU: 40 bytes
    • With FPU: 104 bytes
  • FPU handling: s16-s31 saved if CFIBER_ARM_FPU defined
  • Instruction set: Thumb/Thumb-2

Cortex-M   Architecture   FPU Support   Notes
M0/M0+     ARMv6-M        No            Thumb-1 only, limited instructions
M3         ARMv7-M        No            Full Thumb-2
M4         ARMv7E-M       Optional      DSP instructions, optional FPU
M7         ARMv7E-M       Optional      Faster, optional double-precision FPU

Memory Management

Stack Memory

Allocation:

  • User is responsible for allocating stack memory
  • Can use malloc(), static arrays, or custom allocators
  • Must remain valid for fiber's entire lifetime

Growth:

  • Stacks grow downward (from high addresses to low)
  • Stack pointer starts at stack + stack_size
  • As functions are called, SP decreases

Sizing considerations:

Stack Size = 
    (Max Call Depth × Average Frame Size) +
    (Largest Local Variable Buffer) +
    (Safety Margin)

Example calculation:

// Function chain: main → func1 → func2 → func3
// Each function has ~200 bytes of locals/saved registers
// func3 has a 1KB buffer
Stack Size = (4 × 200) + 1024 + 512 = 2336 bytes
// Round up to 4KB for safety
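The same estimate can be written as two small helpers. The function names are hypothetical, and the inputs are the caller's own estimates; nothing here measures real stack usage.

```c
#include <stddef.h>

/* Hypothetical helper implementing the sizing formula above. */
static size_t estimate_stack_size(size_t call_depth, size_t frame_size,
                                  size_t largest_buffer, size_t margin) {
    return call_depth * frame_size + largest_buffer + margin;
}

/* Rounds `value` up to the next multiple of `multiple` (must be non-zero). */
static size_t round_up_to(size_t value, size_t multiple) {
    return (value + multiple - 1) / multiple * multiple;
}
```

With the numbers from the example, `estimate_stack_size(4, 200, 1024, 512)` gives 2336 bytes, and `round_up_to(2336, 4096)` gives the 4KB allocation.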

Context Memory

  • Context is stored within the fiber_t structure
  • No dynamic allocation required
  • Size varies by architecture (40-208 bytes)

Performance Considerations

Context Switch Overhead

Compared to OS threads (1000-10000 cycles), fiber switches are extremely fast:

x86_64:           ~50-100 cycles
AArch64:          ~40-80 cycles
ARM Cortex-M3/M4: ~30-60 cycles
ARM Cortex-M4+FPU: ~60-100 cycles

Optimization Techniques

  1. Minimal register saves - Only callee-saved registers
  2. No system calls - Pure user-space operation
  3. Cache-friendly - Context spans only a few cache lines (a single 64-byte line for the 40-byte and 56-byte contexts)
  4. Branch prediction - Consistent control flow
  5. Alignment - Proper stack alignment avoids penalties

Scalability

  • Memory per fiber: Stack size + context size
  • Context switch time: O(1), constant regardless of number of fibers
  • Scheduler complexity: Depends on user implementation
  • No hard limit: Limited only by available memory

Example capacity:

Embedded system: 64KB RAM
Stack per fiber: 2KB
Context per fiber: 40 bytes
Max fibers: ~30 (accounting for other memory use)
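The capacity estimate is a single division once the memory reserved for everything else is subtracted. In this sketch `max_fibers` is a hypothetical helper, and the 2KB reserve used in the usage note is an assumed figure chosen to illustrate "other memory use".

```c
#include <stddef.h>

/* Hypothetical helper: how many fibers fit after reserving memory
 * for the rest of the application. */
static size_t max_fibers(size_t total_ram, size_t reserved,
                         size_t stack_size, size_t context_size) {
    size_t per_fiber = stack_size + context_size;
    if (per_fiber == 0 || reserved >= total_ram) {
        return 0;
    }
    return (total_ram - reserved) / per_fiber;
}
```

For 64KB of RAM, 2KB stacks, and 40-byte contexts, `max_fibers(65536, 2048, 2048, 40)` yields 30, while reserving nothing yields 31.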

Cache Considerations

  • L1 cache line: Typically 64 bytes
  • Context size: Fits in 1-4 cache lines (40-208 bytes at 64 bytes per line)
  • Memory access pattern: Sequential reads/writes to context
  • Stack locality: Active stack portion typically hot in cache
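The number of cache lines a context touches follows directly from its size. The sketch below assumes the context starts on a cache-line boundary; `cache_lines_spanned` is a hypothetical helper.

```c
#include <stddef.h>

/* Hypothetical helper: cache lines covered by a context that starts
 * on a line boundary (ceiling division). */
static size_t cache_lines_spanned(size_t context_size, size_t line_size) {
    return (context_size + line_size - 1) / line_size;
}
```

With 64-byte lines, the context sizes quoted in this document map to 1 line (40 and 56 bytes), 2 lines (104 bytes), and 4 lines (208 bytes). An unaligned context can touch one additional line.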

Debug Support

Stack Traces

  • Frame pointers can be preserved for debugging
  • On some platforms, can walk the stack to generate traces
  • Clear frame pointer on new stacks signals base of stack

Debugging Tips

  1. Use stack canaries - Detect overflow

    uint32_t* canary = (uint32_t*)(fiber->stack);
    *canary = 0xDEADBEEF;
    // Later check: assert(*canary == 0xDEADBEEF);
  2. Initialize stack memory - Detect usage

    memset(fiber->stack, 0xCC, fiber->stack_size);
    // Unused stack will still contain 0xCC pattern
  3. Validate alignment

    assert(((uintptr_t)fiber->stack & (DEFAULT_ALIGNMENT-1)) == 0);
  4. Check context sanity

    assert(fiber->ctx.sp >= (uintptr_t)fiber->stack);
    assert(fiber->ctx.sp <= (uintptr_t)fiber->stack + fiber->stack_size);
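Tip 2 can be extended into a high-water-mark measurement: paint the stack before the fiber starts, then count how many bytes at the low end still hold the fill pattern. The helper names below are hypothetical, and the sketch works on a plain byte buffer rather than on a `fiber_t`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xCC

/* Fill the whole stack with a known pattern before the fiber first runs. */
static void stack_paint(uint8_t *stack, size_t size) {
    memset(stack, STACK_FILL, size);
}

/* Count untouched bytes from the low end, which a downward-growing
 * stack reaches last. */
static size_t stack_unused_bytes(const uint8_t *stack, size_t size) {
    size_t n = 0;
    while (n < size && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

`size - stack_unused_bytes(stack, size)` approximates the peak usage so far. The result is an estimate: a fiber that happens to store the byte 0xCC at its deepest point makes the stack look slightly less used than it was.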

Usage Guidelines and Best Practices

This section provides practical guidance for using cfiber effectively in your applications.

Stack Size Considerations

Choosing the right stack size is critical for fiber performance and reliability.

Recommended sizes by platform:

  • x86_64/AArch64: 8KB-64KB typical for hosted environments
  • ARM Cortex-M: 2KB-8KB typical for embedded (depends on call depth and local variables)

Calculate required stack size based on:

  • Maximum call depth in the fiber
  • Size of local variables and buffers
  • Any library functions called (check their stack usage)

Important Note for x86_64: The System V ABI defines a 128-byte "red zone" below the stack pointer that leaf functions may use without adjusting rsp. A leaf function at the deepest point of the call chain can therefore write up to 128 bytes below its stack pointer, so include those bytes when sizing the stack. In practice, this is rarely an issue as long as the safety margin is larger than 128 bytes.

Always add margin for safety! Stack overflow is undefined behavior and can corrupt memory.

FPU Context on ARM Cortex-M

When using Cortex-M4F/M7F with hardware floating-point:

  • Set -DCFIBER_ARM_FLOAT_ABI=hard and appropriate FPU type during compilation
  • The library automatically saves/restores s16-s31 FP registers
  • Registers s0-s15 are caller-saved (not preserved across context switches)
  • FPU context adds 64 bytes (16 registers × 4 bytes) and approximately doubles context switch time

When to enable FPU context:

  • Enable if your fiber functions use floating-point operations
  • Disable if only using integer math (saves memory and cycles)
  • All fibers in a system must use the same FPU configuration

Integrating with RTOS

cfiber can complement Real-Time Operating Systems:

Integration pattern:

  • Each RTOS task can contain multiple fibers
  • Fibers provide lightweight cooperative multitasking within a preemptive task
  • Combine with RTOS for preemptive scheduling between tasks

Benefits:

  • Reduce number of RTOS tasks needed (saves memory)
  • Simplify communication between related operations (no locks needed within task)
  • Maintain deterministic RTOS scheduling where needed
  • Use fibers for I/O-bound operations, RTOS tasks for CPU-bound work

Example use case:

RTOS Task 1 (High Priority - Hard Real-Time):
  - Motor control fiber
  - Sensor reading fiber
  
RTOS Task 2 (Normal Priority):
  - Network communication fiber
  - Data logging fiber
  - UI update fiber

Performance Best Practices

Context Switch Performance:

As shown in the Performance Considerations section, a fiber context switch costs roughly 30-100 cycles depending on the platform, compared with 1000-10000 cycles for a typical OS thread switch.

Optimization tips:

  1. Minimize stack size - Use only what you need plus a safety margin
  2. Yield strategically - Balance responsiveness vs overhead
  3. Group related fibers - Improves cache locality
  4. Disable FPU if not needed - On ARM Cortex-M, FPU context doubles switch time
  5. Use static allocation - Especially in embedded systems, avoids fragmentation

Thread Safety

Important: cfiber itself is not thread-safe.

  • Fibers within a single OS thread can safely switch between each other
  • Multiple OS threads each maintaining their own fiber pool is safe
  • Sharing fibers between OS threads requires external synchronization

Multi-threading patterns:

  1. One fiber pool per thread:

    Thread 1: [Fiber A, Fiber B, Fiber C]
    Thread 2: [Fiber D, Fiber E, Fiber F]
    
  2. Work stealing: OS threads can steal fibers from each other with proper locking
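For the work-stealing pattern, "proper locking" can be as simple as a mutex around each thread's ready queue. The sketch below assumes POSIX threads; `ready_queue_t`, `ready_queue_push`, and `ready_queue_steal` are hypothetical names, and the queue stores opaque pointers where a real scheduler would store `fiber_t*`.

```c
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAPACITY 16

/* Hypothetical per-thread queue of fibers that are ready to run. */
typedef struct {
    pthread_mutex_t lock;
    void *items[QUEUE_CAPACITY];  /* would hold fiber_t* in real use */
    size_t count;
} ready_queue_t;

#define READY_QUEUE_INIT { PTHREAD_MUTEX_INITIALIZER, {0}, 0 }

/* Owner thread: publish a ready fiber. Returns 0 if the queue is full. */
static int ready_queue_push(ready_queue_t *q, void *fiber) {
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_CAPACITY) {
        q->items[q->count++] = fiber;
        ok = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Any thread: take the most recently pushed fiber, or NULL if empty. */
static void *ready_queue_steal(ready_queue_t *q) {
    void *fiber = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->count > 0) {
        fiber = q->items[--q->count];
    }
    pthread_mutex_unlock(&q->lock);
    return fiber;
}
```

Only fibers whose context has been fully saved should be placed in such a queue; a fiber that is still running on its owner thread must never be resumed by another thread.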