cfiber Architecture Documentation

This document describes the internal architecture and implementation details of the cfiber library.

Overview
How It Works
Directory Structure
Core Components
Context Switching Mechanism
Fiber Lifecycle
Platform-Specific Details
Memory Management
Performance Considerations
Debug Support
Usage Guidelines and Best Practices
Thread Safety
References

Overview

cfiber implements stackful coroutines (fibers) using low-level context switching. The library is split into:

Architecture-agnostic C code - High-level fiber management (fiber.c, headers)
Architecture-specific assembly - Low-level context switching (context_*.S, fiber_prologue_*.S)

This separation ensures portability while allowing optimal performance on each platform.

How It Works

This section provides a conceptual overview of how fibers work. For detailed implementation specifics, see the later sections on Context Switching Mechanism and Platform-Specific Details.

Context Switching

Fibers work by saving and restoring CPU register state (the "context"). When switching from fiber A to fiber B:

Save Context: All callee-saved registers of fiber A are saved to memory
Restore Context: All callee-saved registers of fiber B are loaded from memory
Jump: The program counter/instruction pointer is updated to resume fiber B

The magic happens in architecture-specific assembly code that knows exactly which registers need to be preserved according to each platform's ABI (Application Binary Interface).

Fiber Initialization

When a fiber is initialized:

Its stack pointer is set to the (aligned) top of its allocated stack
The fiber function pointer and the user data pointer are placed in callee-saved registers, so they survive the first context switch
A special "prologue" function is set as the return address, so the first switch to the fiber lands there and starts the fiber function
An "epilogue" function handles cleanup when the fiber function returns

Memory Layout

Each fiber has its own stack that grows downward:

High Address
┌─────────────────┐
│  Stack grows    │
│      down       │
│       ↓         │
├─────────────────┤ ← Stack pointer (initially at top)
│                 │
│   Unused stack  │
│     space       │
│                 │
├─────────────────┤
│  Stack base     │
└─────────────────┘
Low Address

Performance at a Glance

Context switch overhead (approximate):

x86_64: ~50-100 cycles
AArch64: ~40-80 cycles
ARM Cortex-M3/M4: ~30-60 cycles
ARM Cortex-M4 with FPU: ~60-100 cycles

Compare to typical OS thread switch: 1000-10000 cycles - fibers are 10-300x faster!

Directory Structure

cfiber/
├── include/cfiber/         # Public headers
│   ├── context.h           # Context structure definitions
│   └── fiber.h             # Fiber API
├── src/cfiber/             # Implementation
│   ├── fiber.c             # Platform-agnostic fiber initialization
│   ├── context_x86_64.S    # x86_64 context switching
│   ├── context_aarch64.S   # AArch64 context switching
│   ├── context_armv6-m.S   # ARMv6-M (Cortex-M0/M0+)
│   ├── context_armv7-m.S   # ARMv7-M (Cortex-M3/M4/M7)
│   └── fiber_prologue_*.S  # Architecture-specific prologues
├── sample/                 # Complete scheduler example
├── tests/                  # Unit tests per architecture
└── utils/                  # Build utilities and toolchains

Core Components

1. Context Structure (`context_t`)

The context structure holds the CPU state needed to resume a fiber. Each architecture defines its own context layout based on its Application Binary Interface (ABI).

Common elements:

Stack pointer
Callee-saved general purpose registers
Frame pointer (if used)
Link register (ARM) or return address (x86_64)
FPU registers (optional, ARM only)

Why only callee-saved registers?

switch_context() is an ordinary function call from the compiler's point of view
The calling code has already saved any caller-saved registers it still needs before making that call
Each fiber therefore recovers its own caller-saved state after the switch returns, without help from the context
This minimizes the size of the context and the switching overhead
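As an illustration of the list above, a context for x86_64 could be laid out as one 8-byte slot per saved register. This is a sketch, not the library's actual definition: the type name example_x86_64_context is hypothetical, and only the rsp (0x00) and r15 (0x08) offsets appear in this document; the order of the remaining fields is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical x86_64 context layout: one 8-byte slot per saved register.
 * Only the rsp and r15 offsets come from this document; the order of the
 * other fields is an assumption made for illustration. */
typedef struct example_x86_64_context {
    uint64_t rsp; /* stack pointer; the return address lives on this stack */
    uint64_t r15;
    uint64_t r14;
    uint64_t r13;
    uint64_t r12;
    uint64_t rbx;
    uint64_t rbp; /* frame pointer */
} example_x86_64_context;

/* Seven 8-byte registers add up to the 56-byte x86_64 context size. */
_Static_assert(sizeof(example_x86_64_context) == 56, "unexpected context size");
_Static_assert(offsetof(example_x86_64_context, rsp) == 0x00, "rsp comes first");
_Static_assert(offsetof(example_x86_64_context, r15) == 0x08, "r15 follows rsp");
```

Compile-time assertions like these keep a C struct and hand-written assembly offsets from drifting apart silently.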

2. Context Switching (`switch_context()`)

Implemented in assembly for each architecture. The function:

void switch_context(context_t* old, context_t* new);

Steps:

Save all callee-saved registers to old
Load all callee-saved registers from new
Return (which jumps to the address in the restored link register/return address)

Flow diagram:

Fiber A running
     |
     v
switch_context(&A.ctx, &B.ctx)
     |
     +---> Save A's registers to A.ctx
     |
     +---> Load B's registers from B.ctx
     |
     +---> Jump to B's saved instruction pointer
     |
     v
Fiber B running
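The same save/load/jump flow can be tried on a POSIX system without cfiber. The sketch below is an analogy rather than cfiber code: it uses the POSIX ucontext API, where swapcontext(&old, &new) plays the role of switch_context(&old, &new). The names run_switch_demo, fiber_body and trace are made up for illustration.

```c
#include <ucontext.h>

static ucontext_t main_ctx;          /* plays the role of fiber A (the caller) */
static ucontext_t fiber_ctx;         /* plays the role of fiber B */
static char fiber_stack[64 * 1024];  /* stack memory owned by fiber B */
static int trace[4];                 /* records the order in which code runs */
static int trace_len;

static void fiber_body(void) {
    trace[trace_len++] = 2;              /* B runs for the first time */
    swapcontext(&fiber_ctx, &main_ctx);  /* save B, resume A */
    trace[trace_len++] = 4;              /* B resumes exactly where it left off */
    /* Returning here continues at uc_link, which is main_ctx. */
}

/* Ping-pongs between the caller and one fiber; returns the number of steps. */
int run_switch_demo(void) {
    trace_len = 0;
    if (getcontext(&fiber_ctx) != 0) {
        return -1;
    }
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &main_ctx;
    makecontext(&fiber_ctx, fiber_body, 0);

    trace[trace_len++] = 1;
    swapcontext(&main_ctx, &fiber_ctx);  /* save A, start B */
    trace[trace_len++] = 3;
    swapcontext(&main_ctx, &fiber_ctx);  /* save A, resume B */
    return trace_len;
}
```

Unlike a hand-written switch that touches only callee-saved registers, swapcontext also saves and restores the signal mask, so it is noticeably slower; it is used here only because it is portable across POSIX platforms.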

3. Fiber Initialization (`init_fiber()`)

Sets up a new fiber's initial state. The function:

void init_fiber(fiber_t* fiber, fiber_fn func, void* user_data);

Initialization steps:

1. Set stack pointer to top of allocated stack

   stackPtr = fiber->stack + fiber->stack_size

2. Align stack according to ABI requirements
   - x86_64/AArch64: 16-byte alignment
   - ARM: 8-byte alignment
3. Store function pointer and user data in callee-saved registers
   - These persist across the context switch
   - Available when fiber starts running
4. Set link register/return address to fiber_prologue
   - When the fiber is first switched to, it "returns" to fiber_prologue
   - fiber_prologue sets up the call to the user's function
5. Clear frame pointer for new stack
   - Signals to debuggers that this is the base of the stack
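The first two steps (top of stack, then alignment) can be sketched in portable C. The helper names align_down and initial_stack_pointer are hypothetical and not part of the cfiber API; the alignment is assumed to be a power of two, which holds for the 8- and 16-byte ABI requirements listed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Round an address down to a multiple of `alignment` (must be a power of two). */
static uintptr_t align_down(uintptr_t addr, uintptr_t alignment) {
    return addr & ~(alignment - 1);
}

/* Hypothetical helper: initial stack pointer for a stack of `size` bytes that
 * starts at `stack`. Stacks grow downward, so the top address is rounded down
 * to stay inside the allocation. */
static uintptr_t initial_stack_pointer(void* stack, size_t size, uintptr_t alignment) {
    return align_down((uintptr_t)stack + size, alignment);
}
```

Rounding down rather than up matters: rounding up could place the stack pointer past the end of the allocation when stack + size is not already aligned.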

4. Fiber Prologue

The fiber_prologue function is the actual entry point for a new fiber. It's implemented in assembly and:

Retrieves the function pointer from a callee-saved register
Retrieves the user data pointer from a callee-saved register
Calls the user's fiber function with the user data as an argument
When the user function returns, calls fiber_epilogue
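The control flow of the prologue can be rendered in C for illustration. This is a sketch under assumptions: the real prologue is assembly and receives both values in callee-saved registers rather than as parameters, fiber_fn is assumed to return void and take the user data pointer, and example_prologue, example_epilogue and count_up are hypothetical names.

```c
#include <stddef.h>

/* Assumed shape of a fiber function: takes the user data pointer. */
typedef void (*fiber_fn)(void* user_data);

/* Stand-in for fiber_epilogue(): records the call instead of switching away. */
static int epilogue_calls;
static void example_epilogue(void) {
    epilogue_calls++;
}

/* C rendering of the prologue's job. In the real assembly, `func` and
 * `user_data` are read from callee-saved registers. */
static void example_prologue(fiber_fn func, void* user_data) {
    func(user_data);     /* run the user's fiber function */
    example_epilogue();  /* reached only after the user function returns */
}

/* Example fiber function: increments the int that user_data points to. */
static void count_up(void* user_data) {
    int* counter = user_data;
    (*counter)++;
}
```

Calling example_prologue(count_up, &n) runs the fiber function once and then reaches the epilogue stand-in, mirroring the order of the steps listed above.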

5. Fiber Epilogue

The fiber_epilogue function handles fiber completion:

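[[noreturn]] void fiber_epilogue(void) {
    // Hand control to the user-provided scheduler; it must not return here
    scheduler_return_fiber();
    __builtin_unreachable();  // Tells the compiler this point is never reached
}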

Calls the user-provided scheduler_return_fiber() function
The scheduler decides what to do next (run another fiber, cleanup, etc.)
Marked [[noreturn]] because it never returns normally

Context Switching Mechanism

x86_64 Implementation

Registers saved (System V AMD64 ABI):

rsp - Stack pointer
r12-r15, rbx, rbp - Callee-saved general purpose registers

Assembly code structure:

switch_context:
    # Save current context
    mov [rdi + 0x00], rsp
    mov [rdi + 0x08], r15
    # ... save other registers
    
    # Load new context
    mov rsp, [rsi + 0x00]
    mov r15, [rsi + 0x08]
    # ... load other registers
    
    ret  # Pop the return address from the restored stack and jump to it

Notes:

First argument (rdi) = old context
Second argument (rsi) = new context
Uses Intel syntax: mov destination, source
The ret instruction pops the return address from the stack and jumps to it

AArch64 Implementation

Registers saved (AAPCS64):

sp - Stack pointer
x19-x30 - Callee-saved general purpose (x29=FP, x30=LR)
v8-v15 (d8-d15) - Callee-saved floating point registers

Assembly code structure:

switch_context:
    // Save current context
    mov x2, sp
    str x2, [x0], #8
    stp x19, x20, [x0], #16
    // ... save other registers

    // Load new context
    ldr x2, [x1], #8
    mov sp, x2
    ldp x19, x20, [x1], #16
    // ... load other registers

    ret  // Return to address in x30 (link register)

Notes:

First argument (x0) = old context
Second argument (x1) = new context
stp/ldp = store/load pair (more efficient)
FP registers are 128-bit, but we only save lower 64 bits (callee-saved portion)

ARM Cortex-M Implementation

Registers saved (AAPCS):

sp (r13) - Stack pointer
r4-r11 - Callee-saved general purpose
lr (r14) - Link register (return address)
s16-s31 - FPU registers (if FPU enabled/CFIBER_ARM_FPU defined)

Assembly code structure:

switch_context:
    @ Save current context
    mov r2, sp
    str r2, [r0], #4
    stmia r0!, {r4-r11}
    str lr, [r0], #4

#ifdef CFIBER_ARM_FPU
    @ Optional: save FPU registers
    vstmia r0!, {s16-s31}
#endif

    @ Load new context
    ldr r2, [r1], #4
    mov sp, r2
    ldmia r1!, {r4-r11}
    ldr lr, [r1], #4

#ifdef CFIBER_ARM_FPU
    @ Optional: load FPU registers
    vldmia r1!, {s16-s31}
#endif

    bx lr  @ Branch to address in link register

Notes:

Uses Thumb-2 instruction set (the ARMv7-M variant is shown; ARMv6-M has its own file, context_armv6-m.S)
stmia/ldmia = store/load multiple, increment after
FPU instructions are assembled conditionally (not present on M0/M0+/M3)
bx lr = branch and exchange to address in link register

Fiber Lifecycle

┌─────────────────┐
│   Uninitialized │
└────────┬────────┘
         │
         │ init_fiber()
         │
         v
┌──────────────────┐
│   Ready to Run   │◄────────────┐
└────────┬─────────┘             │
         │                       │
         │ switch_context()      │ yield()
         │                       │
         v                       │
┌──────────────────┐             │
│     Running      │─────────────┘
└────────┬─────────┘
         │
         │ function returns
         │
         v
┌─────────────────┐
│    Completed    │
└─────────────────┘

State transitions:

Uninitialized → Ready
- User allocates stack
- Calls init_fiber() with function and user data
- Context is set up but fiber hasn't started yet
Ready → Running
- Scheduler calls switch_context() to the fiber
- Fiber starts executing from fiber_prologue
- User function begins running
Running → Ready
- Fiber explicitly yields by calling user-provided yield function
- Yield function calls switch_context() to switch to another fiber/scheduler
- Fiber's context is saved, can be resumed later
Running → Completed
- User function returns
- fiber_epilogue calls scheduler_return_fiber()
- Scheduler marks fiber as completed and selects next fiber
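A scheduler can track these states explicitly. The sketch below encodes the diagram as a small state machine; the enum and function names (fiber_state, fiber_event, fiber_transition) are hypothetical and not part of cfiber, which leaves state tracking to the user's scheduler.

```c
#include <stdbool.h>

typedef enum {
    FIBER_UNINITIALIZED,
    FIBER_READY,
    FIBER_RUNNING,
    FIBER_COMPLETED
} fiber_state;

typedef enum {
    EV_INIT,      /* init_fiber() was called */
    EV_SWITCH_IN, /* scheduler switched to the fiber */
    EV_YIELD,     /* fiber yielded */
    EV_RETURN     /* user function returned */
} fiber_event;

/* Applies `ev` to `*state`. Returns false and leaves the state unchanged
 * when the lifecycle diagram has no such transition. */
static bool fiber_transition(fiber_state* state, fiber_event ev) {
    switch (*state) {
    case FIBER_UNINITIALIZED:
        if (ev == EV_INIT) { *state = FIBER_READY; return true; }
        break;
    case FIBER_READY:
        if (ev == EV_SWITCH_IN) { *state = FIBER_RUNNING; return true; }
        break;
    case FIBER_RUNNING:
        if (ev == EV_YIELD) { *state = FIBER_READY; return true; }
        if (ev == EV_RETURN) { *state = FIBER_COMPLETED; return true; }
        break;
    case FIBER_COMPLETED:
        break; /* terminal state: no transitions out */
    }
    return false;
}
```

Rejecting invalid transitions (for example, switching to a completed fiber) is a cheap way to catch scheduler bugs in debug builds.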

Platform-Specific Details

x86_64 (System V AMD64 ABI)

Stack alignment: 16 bytes (required by System V ABI)
Context switch overhead: ~50-100 CPU cycles
Context size: 56 bytes
FPU handling: SSE registers are caller-saved, not saved in context

AArch64

Stack alignment: 16 bytes (required by AAPCS64)
Context switch overhead: ~40-80 CPU cycles
Context size: 208 bytes (with FP registers)
FPU handling: d8-d15 (lower 64 bits of v8-v15) saved

ARM Cortex-M

Stack alignment: 8 bytes (required by AAPCS)
Context switch overhead: ~30-60 cycles (M3/M4), ~60-100 cycles (M4 with FPU)
Context size:
- Without FPU: 40 bytes
- With FPU: 104 bytes
FPU handling: s16-s31 saved if CFIBER_ARM_FPU defined
Instruction set: Thumb/Thumb-2

| Cortex-M | Architecture | FPU Support | Notes |
|----------|--------------|-------------|-------|
| M0/M0+ | ARMv6-M | No | Thumb-1 only, limited instructions |
| M3 | ARMv7-M | No | Full Thumb-2 |
| M4 | ARMv7E-M | Optional | DSP instructions, optional FPU |
| M7 | ARMv7E-M | Optional | Faster, optional double-precision FPU |

Memory Management

Stack Memory

Allocation:

User is responsible for allocating stack memory
Can use malloc(), static arrays, or custom allocators
Must remain valid for fiber's entire lifetime

Growth:

Stacks grow downward (from high addresses to low)
Stack pointer starts at stack + stack_size
As functions are called, SP decreases

Sizing considerations:

Stack Size = 
    (Max Call Depth × Average Frame Size) +
    (Largest Local Variable Buffer) +
    (Safety Margin)

Example calculation:

// Function chain: main → func1 → func2 → func3
// Each function has ~200 bytes of locals/saved registers
// func3 has a 1KB buffer
Stack Size = (4 × 200) + 1024 + 512 = 2336 bytes
// Round up to 4KB for safety
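The sizing formula and the rounding step can be written as two small helpers. This is a sketch; the function names estimate_stack_size and round_up_to are hypothetical, and the inputs are estimates the user has to supply (for example from compiler stack-usage reports).

```c
#include <stddef.h>

/* Stack Size = (Max Call Depth x Average Frame Size)
 *            + (Largest Local Variable Buffer) + (Safety Margin) */
static size_t estimate_stack_size(size_t max_call_depth, size_t avg_frame_size,
                                  size_t largest_buffer, size_t safety_margin) {
    return max_call_depth * avg_frame_size + largest_buffer + safety_margin;
}

/* Round `n` up to the next multiple of `granule` (granule must be non-zero). */
static size_t round_up_to(size_t n, size_t granule) {
    return (n + granule - 1) / granule * granule;
}
```

With the numbers from the example above, estimate_stack_size(4, 200, 1024, 512) gives 2336, and round_up_to(2336, 4096) gives 4096.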

Context Memory

Context is stored within the fiber_t structure
No dynamic allocation required
Size varies by architecture (40-208 bytes)

Performance Considerations

Context Switch Overhead

Compared to OS threads (1000-10000 cycles), fiber switches are extremely fast:

x86_64:           ~50-100 cycles
AArch64:          ~40-80 cycles
ARM Cortex-M3/M4: ~30-60 cycles
ARM Cortex-M4+FPU: ~60-100 cycles

Optimization Techniques

Minimal register saves - Only callee-saved registers
No system calls - Pure user-space operation
Cache-friendly - Context is small (40-208 bytes) and accessed sequentially
Branch prediction - Consistent control flow
Alignment - Proper stack alignment avoids penalties

Scalability

Memory per fiber: Stack size + context size
Context switch time: O(1), constant regardless of number of fibers
Scheduler complexity: Depends on user implementation
No hard limit: Limited only by available memory

Example capacity:

Embedded system: 64KB RAM
Stack per fiber: 2KB
Context per fiber: 40 bytes
Max fibers: ~30 (accounting for other memory use)
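The capacity estimate is a single division. The helper below is a hypothetical sketch (max_fibers is not a cfiber function); reserved_bytes stands for whatever the rest of the firmware needs, which this document does not quantify.

```c
#include <stddef.h>

/* Upper bound on the number of fibers that fit in `ram_bytes` after setting
 * aside `reserved_bytes` for everything else. Each fiber costs one stack
 * plus one context. */
static size_t max_fibers(size_t ram_bytes, size_t reserved_bytes,
                         size_t stack_bytes, size_t context_bytes) {
    size_t per_fiber = stack_bytes + context_bytes;
    if (per_fiber == 0 || reserved_bytes >= ram_bytes) {
        return 0;
    }
    return (ram_bytes - reserved_bytes) / per_fiber;
}
```

For 64KB of RAM, 2KB stacks and 40-byte contexts, max_fibers(65536, 0, 2048, 40) yields 31; reserving 4KB for other data drops that to 29, which is where the figure of roughly 30 comes from.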

Cache Considerations

L1 cache line: Typically 64 bytes
Context size: Spans 1-4 cache lines (40-208 bytes with 64-byte lines)
Memory access pattern: Sequential reads/writes to context
Stack locality: Active stack portion typically hot in cache
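How many cache lines a context touches follows from its size. The helper below is a hypothetical sketch that assumes the context starts on a cache-line boundary; an unaligned context can touch one line more.

```c
#include <stddef.h>

/* Number of cache lines covered by a context of `context_bytes` that starts
 * on a cache-line boundary. `line_bytes` must be non-zero. */
static size_t cache_lines_spanned(size_t context_bytes, size_t line_bytes) {
    return (context_bytes + line_bytes - 1) / line_bytes;
}
```

With 64-byte lines, the context sizes listed under Platform-Specific Details give 1 line for 40 and 56 bytes, 2 lines for 104 bytes, and 4 lines for 208 bytes.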

Debug Support

Stack Traces

Frame pointers can be preserved for debugging
On some platforms, debuggers can walk the stack to generate traces
Clearing the frame pointer on a new stack marks the base of the stack

Debugging Tips

Use stack canaries - Detect overflow

// Place the canary at the lowest stack address, which an overflow reaches first
uint32_t* canary = (uint32_t*)(fiber->stack);
*canary = 0xDEADBEEF;
// Later check: assert(*canary == 0xDEADBEEF);

Initialize stack memory - Detect usage

memset(fiber->stack, 0xCC, fiber->stack_size);
// Unused stack will still contain 0xCC pattern

Validate alignment

assert(((uintptr_t)fiber->stack & (DEFAULT_ALIGNMENT-1)) == 0);

Check context sanity

assert(fiber->ctx.sp >= (uintptr_t)fiber->stack);
assert(fiber->ctx.sp <= (uintptr_t)fiber->stack + fiber->stack_size);
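The fill-pattern tip can be extended into a rough high-water-mark measurement. The sketch below is illustrative only: stack_fill and stack_unused_bytes are hypothetical helpers operating on a plain byte buffer, not cfiber functions, and the 0xCC pattern matches the memset shown above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL_BYTE 0xCC

/* Fill the whole stack with the pattern before the fiber first runs. */
static void stack_fill(uint8_t* stack, size_t size) {
    memset(stack, STACK_FILL_BYTE, size);
}

/* Count untouched bytes starting at the low end of the stack. Stacks grow
 * downward, so the first byte that no longer holds the pattern marks the
 * deepest point the fiber has reached so far. */
static size_t stack_unused_bytes(const uint8_t* stack, size_t size) {
    size_t unused = 0;
    while (unused < size && stack[unused] == STACK_FILL_BYTE) {
        unused++;
    }
    return unused;
}
```

The result is an estimate: a fiber that happens to write the byte 0xCC at its deepest point makes the stack look slightly less used than it was. Peak usage is the stack size minus the returned value.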

Usage Guidelines and Best Practices

This section provides practical guidance for using cfiber effectively in your applications.

Stack Size Considerations

Choosing the right stack size is critical for fiber performance and reliability.

Recommended sizes by platform:

x86_64/AArch64: 8KB-64KB typical for hosted environments
ARM Cortex-M: 2KB-8KB typical for embedded (depends on call depth and local variables)

Calculate required stack size based on:

Maximum call depth in the fiber
Size of local variables and buffers
Any library functions called (check their stack usage)

Important Note for x86_64: The System V ABI defines a 128-byte "red zone" below the stack pointer that leaf functions may use without adjusting rsp. The red zone lies inside the fiber's own stack, so it only matters when a fiber comes within 128 bytes of exhausting its stack. Include those 128 bytes in your safety margin.

Always add margin for safety! Stack overflow is undefined behavior and can corrupt memory.

FPU Context on ARM Cortex-M

When using Cortex-M4F/M7F with hardware floating-point:

Set -DCFIBER_ARM_FLOAT_ABI=hard and appropriate FPU type during compilation
The library automatically saves/restores s16-s31 FP registers
Registers s0-s15 are caller-saved (not preserved across context switches)
FPU context adds 64 bytes (16 registers × 4 bytes) and approximately doubles context switch time

When to enable FPU context:

Enable if your fiber functions use floating-point operations
Disable if only using integer math (saves memory and cycles)
All fibers in a system must use the same FPU configuration

Integrating with RTOS

cfiber can complement Real-Time Operating Systems:

Integration pattern:

Each RTOS task can contain multiple fibers
Fibers provide lightweight cooperative multitasking within a preemptive task
Combine with RTOS for preemptive scheduling between tasks

Benefits:

Reduce number of RTOS tasks needed (saves memory)
Simplify communication between related operations (no locks needed within task)
Maintain deterministic RTOS scheduling where needed
Use fibers for I/O-bound operations, RTOS tasks for CPU-bound work

Example use case:

RTOS Task 1 (High Priority - Hard Real-Time):
  - Motor control fiber
  - Sensor reading fiber
  
RTOS Task 2 (Normal Priority):
  - Network communication fiber
  - Data logging fiber
  - UI update fiber

Performance Best Practices

Context Switch Performance:

As shown in the Performance Considerations section, fiber context switches are extremely fast:

x86_64: ~50-100 cycles
AArch64: ~40-80 cycles
ARM Cortex-M3/M4: ~30-60 cycles
ARM Cortex-M4 with FPU: ~60-100 cycles

Compare to typical OS thread switch: 1000-10000 cycles - fibers are 10-300× faster!

Optimization tips:

Minimize stack size - Use only what you need plus a safety margin
Yield strategically - Balance responsiveness vs overhead
Group related fibers - Improves cache locality
Disable FPU if not needed - On ARM Cortex-M, FPU context doubles switch time
Use static allocation - Especially in embedded systems, avoids fragmentation

Thread Safety

Important: cfiber itself is not thread-safe.

Fibers within a single OS thread can safely switch between each other
Multiple OS threads each maintaining their own fiber pool is safe
Sharing fibers between OS threads requires external synchronization

Multi-threading patterns:

One fiber pool per thread:

Thread 1: [Fiber A, Fiber B, Fiber C]
Thread 2: [Fiber D, Fiber E, Fiber F]

Work stealing: OS threads can steal fibers from each other with proper locking
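A minimal sketch of the locking that work stealing needs is shown below, using POSIX mutexes. It is an assumption-laden illustration rather than part of cfiber: fiber_pool, pool_init, pool_push and pool_steal are hypothetical names, the pool has a fixed capacity, and fibers are stored as opaque pointers that would be fiber_t* in real code.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_CAPACITY 16

/* One pool per OS thread, holding only fibers that are ready to run. */
typedef struct fiber_pool {
    pthread_mutex_t lock;
    void* fibers[POOL_CAPACITY]; /* opaque handles; fiber_t* in real code */
    size_t count;
} fiber_pool;

static void pool_init(fiber_pool* pool) {
    pthread_mutex_init(&pool->lock, NULL);
    pool->count = 0;
}

/* Adds a ready fiber; returns false if the pool is full. */
static bool pool_push(fiber_pool* pool, void* fiber) {
    bool ok = false;
    pthread_mutex_lock(&pool->lock);
    if (pool->count < POOL_CAPACITY) {
        pool->fibers[pool->count++] = fiber;
        ok = true;
    }
    pthread_mutex_unlock(&pool->lock);
    return ok;
}

/* Removes one fiber from `victim`, or returns NULL if it has none. */
static void* pool_steal(fiber_pool* victim) {
    void* fiber = NULL;
    pthread_mutex_lock(&victim->lock);
    if (victim->count > 0) {
        fiber = victim->fibers[--victim->count];
    }
    pthread_mutex_unlock(&victim->lock);
    return fiber;
}
```

Only suspended fibers belong in such a pool: a fiber that is currently running on one thread has live register state that is not yet in its context, so it must never be resumed on another thread.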
