This document describes the internal architecture and implementation details of the cfiber library.
- Overview
- How It Works
- Directory Structure
- Core Components
- Context Switching Mechanism
- Fiber Lifecycle
- Platform-Specific Details
- Memory Management
- Performance Considerations
- Debug Support
- Usage Guidelines and Best Practices
- Thread Safety
- References
cfiber implements stackful coroutines (fibers) using low-level context switching. The library is split into:
- Architecture-agnostic C code - High-level fiber management (fiber.c, headers)
- Architecture-specific assembly - Low-level context switching (context_*.S, fiber_prologue_*.S)
This separation ensures portability while allowing optimal performance on each platform.
This section provides a conceptual overview of how fibers work. For detailed implementation specifics, see the later sections on Context Switching Mechanism and Platform-Specific Details.
Fibers work by saving and restoring CPU register state (the "context"). When switching from fiber A to fiber B:
- Save Context: All callee-saved registers of fiber A are saved to memory
- Restore Context: All callee-saved registers of fiber B are loaded from memory
- Jump: The program counter/instruction pointer is updated to resume fiber B
The magic happens in architecture-specific assembly code that knows exactly which registers need to be preserved according to each platform's ABI (Application Binary Interface).
When a fiber is initialized:
- Its stack pointer is set to the top of its allocated stack
- The fiber entry point address is stored in a callee-saved register, so it survives the first context switch
- The user data pointer is placed in a callee-saved register as well
- A special "prologue" function is set as the return address; the first switch to the fiber "returns" into it, and it handles fiber startup
- An "epilogue" function handles cleanup when the fiber returns
Each fiber has its own stack that grows downward:
High Address
┌─────────────────┐
│ Stack grows │
│ down │
│ ↓ │
├─────────────────┤ ← Stack pointer (initially at top)
│ │
│ Unused stack │
│ space │
│ │
├─────────────────┤
│ Stack base │
└─────────────────┘
Low Address
Context switch overhead (approximate):
- x86_64: ~50-100 cycles
- AArch64: ~40-80 cycles
- ARM Cortex-M3/M4: ~30-60 cycles
- ARM Cortex-M4 with FPU: ~60-100 cycles
Compared with a typical OS thread switch (1000-10000 cycles), fibers are roughly 10-300x faster.
cfiber/
├── include/cfiber/ # Public headers
│ ├── context.h # Context structure definitions
│ └── fiber.h # Fiber API
├── src/cfiber/ # Implementation
│ ├── fiber.c # Platform-agnostic fiber initialization
│ ├── context_x86_64.S # x86_64 context switching
│ ├── context_aarch64.S # AArch64 context switching
│ ├── context_armv6-m.S # ARMv6-M (Cortex-M0/M0+)
│ ├── context_armv7-m.S # ARMv7-M (Cortex-M3/M4/M7)
│ └── fiber_prologue_*.S # Architecture-specific prologues
├── sample/ # Complete scheduler example
├── tests/ # Unit tests per architecture
└── utils/ # Build utilities and toolchains
The context structure holds the CPU state needed to resume a fiber. Each architecture defines its own context layout based on its Application Binary Interface (ABI).
Common elements:
- Stack pointer
- Callee-saved general purpose registers
- Frame pointer (if used)
- Link register (ARM) or return address (x86_64)
- FPU registers (optional, ARM only)
Why only callee-saved registers?
- From the compiler's point of view, switch_context() is an ordinary function call
- The calling code already saves any caller-saved registers it still needs before making that call
- Each fiber therefore keeps its own caller-saved values on its own stack, and only the callee-saved registers have to go in the context
- This minimizes the size of the context and switching overhead
Implemented in assembly for each architecture. The function:
void switch_context(context_t* old, context_t* new);
Steps:
- Save all callee-saved registers to old
- Load all callee-saved registers from new
- Return (which jumps to the address in the restored link register/return address)
Flow diagram:
Fiber A running
|
v
switch_context(&A.ctx, &B.ctx)
|
+---> Save A's registers to A.ctx
|
+---> Load B's registers from B.ctx
|
+---> Jump to B's saved instruction pointer
|
v
Fiber B running
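The flow above can be tried out without any assembly. The sketch below is not cfiber code: it swaps in the POSIX ucontext API (getcontext, makecontext, swapcontext) as a stand-in for switch_context(), and the names run_switch_demo, fiber_body, record and trace are made up for illustration. It only shows the order in which control moves between a "main" context and one fiber.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ucontext.h>

/* Hypothetical demo: POSIX ucontext stands in for switch_context(). */
static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];
static int trace[4];
static int trace_len;

/* Append a step number so the execution order can be inspected later. */
static void record(int step) {
    assert(trace_len < 4);
    trace[trace_len++] = step;
}

static void fiber_body(void) {
    record(1);                           /* first run of the fiber */
    swapcontext(&fiber_ctx, &main_ctx);  /* yield back to main */
    record(3);                           /* resumed after the yield */
    /* Returning here resumes uc_link, which is main_ctx. */
}

int run_switch_demo(void) {
    trace_len = 0;
    if (getcontext(&fiber_ctx) != 0) {
        return -1;
    }
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &main_ctx;
    makecontext(&fiber_ctx, fiber_body, 0);

    swapcontext(&main_ctx, &fiber_ctx);  /* save main, start the fiber */
    record(2);                           /* fiber yielded */
    swapcontext(&main_ctx, &fiber_ctx);  /* save main, resume the fiber */
    record(4);                           /* fiber function returned */
    return trace_len;
}
```

Calling run_switch_demo() leaves the steps in trace in the order they ran, alternating between the two contexts. ucontext saves more state than a hand-written switch (including the signal mask), so it does not reflect the cycle counts quoted in this document.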
Sets up a new fiber's initial state. The function:
void init_fiber(fiber_t* fiber, fiber_fn func, void* user_data);
Initialization steps:
- Set stack pointer to top of allocated stack
  stackPtr = fiber->stack + fiber->stack_size
- Align stack according to ABI requirements
  - x86_64/AArch64: 16-byte alignment
  - ARM: 8-byte alignment
- Store function pointer and user data in callee-saved registers
  - These persist across the context switch
  - Available when fiber starts running
- Set link register/return address to fiber_prologue
  - When the fiber is first switched to, it "returns" to fiber_prologue
  - fiber_prologue sets up the call to the user's function
- Clear frame pointer for new stack
  - Signals to debuggers that this is the base of the stack
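The first two initialization steps (take the top of the stack, then align it) come down to a little pointer arithmetic. The helper below is a minimal sketch with a made-up name, initial_stack_pointer; it assumes the alignment is a power of two and rounds the top address down, so the result stays inside the allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the cfiber API.
 * Returns the highest address not above stack + size that is a
 * multiple of `alignment` (which must be a power of two). */
static uintptr_t initial_stack_pointer(void *stack, size_t size, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    uintptr_t top = (uintptr_t)stack + size;   /* one past the last byte */
    return top & ~((uintptr_t)alignment - 1);  /* round down to alignment */
}
```

With 16-byte alignment (x86_64/AArch64) at most 15 bytes at the top of the allocation go unused; with 8-byte alignment (ARM) at most 7.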
The fiber_prologue function is the actual entry point for a new fiber. It's implemented in assembly and:
- Retrieves the function pointer from a callee-saved register
- Retrieves the user data pointer from a callee-saved register
- Calls the user's fiber function with the user data as an argument
- When the user function returns, calls fiber_epilogue
The fiber_epilogue function handles fiber completion:
[[noreturn]] void fiber_epilogue() {
    scheduler_return_fiber();
    __builtin_unreachable();
}
- Calls the user-provided scheduler_return_fiber() function
- The scheduler decides what to do next (run another fiber, cleanup, etc.)
- Marked [[noreturn]] because it never returns normally
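In plain C terms, the prologue and epilogue behave roughly like the trampoline below. This is only a model: the real fiber_prologue is assembly that reads its inputs from callee-saved registers, and the real epilogue never returns. The fiber_fn signature shown is an assumption, and model_prologue, model_epilogue, epilogue_calls and increment are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed signature; the source names fiber_fn but does not define it. */
typedef void (*fiber_fn)(void *user_data);

static int epilogue_calls;

/* Stand-in for fiber_epilogue(): the real one hands control to the
 * scheduler and never returns; this model only counts the call. */
static void model_epilogue(void) {
    epilogue_calls++;
}

/* Stand-in for fiber_prologue(): call the user function with its
 * user data, then run the epilogue once it returns. */
static void model_prologue(fiber_fn func, void *user_data) {
    assert(func != NULL);
    func(user_data);
    model_epilogue();
}

/* Example fiber function: treats user_data as an int counter. */
static void increment(void *user_data) {
    *(int *)user_data += 1;
}
```

Calling model_prologue(increment, &value) runs the user function first and the epilogue exactly once afterwards, which is the ordering the assembly version provides.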
Registers saved (System V AMD64 ABI):
- rsp - Stack pointer
- r12-r15, rbx, rbp - Callee-saved general purpose registers
Assembly code structure:
switch_context:
# Save current context
mov [rdi + 0x00], rsp
mov [rdi + 0x08], r15
# ... save other registers
# Load new context
mov rsp, [rsi + 0x00]
mov r15, [rsi + 0x08]
# ... load other registers
ret # Pop the return address from the new stack and jump to it
Notes:
- First argument (rdi) = old context
- Second argument (rsi) = new context
- Uses Intel syntax: mov destination, source
- The ret instruction pops the return address from the stack and jumps to it
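A context layout consistent with the listing above might look like the following. Only the offsets of rsp (0x00) and r15 (0x08) come from the listing; the order of the remaining fields and the type name sketch_x86_64_context_t are assumptions made for illustration. Compile-time checks like these help keep a C struct and hand-written assembly offsets in agreement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, not the library's actual context_t. */
typedef struct {
    uint64_t rsp; /* 0x00, matches the listing */
    uint64_t r15; /* 0x08, matches the listing */
    uint64_t r14; /* order from here on is assumed */
    uint64_t r13;
    uint64_t r12;
    uint64_t rbx;
    uint64_t rbp;
} sketch_x86_64_context_t;

/* Seven 8-byte registers: 56 bytes in total. */
static_assert(sizeof(sketch_x86_64_context_t) == 56, "unexpected context size");
static_assert(offsetof(sketch_x86_64_context_t, rsp) == 0x00, "rsp offset");
static_assert(offsetof(sketch_x86_64_context_t, r15) == 0x08, "r15 offset");
```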
Registers saved (AAPCS64):
- sp - Stack pointer
- x19-x30 - Callee-saved general purpose (x29=FP, x30=LR)
- v8-v15 (d8-d15) - Callee-saved floating point registers
Assembly code structure:
switch_context:
# Save current context
mov x2, sp
str x2, [x0], #8
stp x19, x20, [x0], #16
# ... save other registers
# Load new context
ldr x2, [x1], #8
mov sp, x2
ldp x19, x20, [x1], #16
# ... load other registers
ret # Return to address in x30 (link register)
Notes:
- First argument (x0) = old context
- Second argument (x1) = new context
- stp/ldp = store/load pair (more efficient)
- FP registers are 128-bit, but we only save lower 64 bits (callee-saved portion)
Registers saved (AAPCS):
- sp (r13) - Stack pointer
- r4-r11 - Callee-saved general purpose
- lr (r14) - Link register (return address)
- s16-s31 - FPU registers (if FPU enabled/CFIBER_ARM_FPU defined)
Assembly code structure:
switch_context:
# Save current context
mov r2, sp
str r2, [r0], #4
stmia r0!, {r4-r11}
str lr, [r0], #4
# Optional: save FPU registers
vstmia r0!, {s16-s31}
# Load new context
ldr r2, [r1], #4
mov sp, r2
ldmia r1!, {r4-r11}
ldr lr, [r1], #4
# Optional: load FPU registers
vldmia r1!, {s16-s31}
bx lr # Branch to address in link register
Notes:
- Uses Thumb-2 instruction set
- stmia/ldmia = store/load multiple, increment after
- FPU instructions are assembled only when FPU support is enabled (M0/M0+/M3 have no FPU)
- bx lr = branch and exchange to address in link register
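The save order in the ARM listing (sp, r4-r11, lr, then optionally s16-s31) suggests a layout like the one sketched here. The type names are hypothetical and the library's real structure may differ; the point is only that the field sizes add up to the 40-byte and 104-byte context sizes quoted for Cortex-M.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layouts following the store order in the listing. */
typedef struct {
    uint32_t sp;        /* r13 */
    uint32_t r4_r11[8]; /* callee-saved general purpose registers */
    uint32_t lr;        /* r14, return address */
} sketch_arm_context_t;

typedef struct {
    uint32_t sp;
    uint32_t r4_r11[8];
    uint32_t lr;
    uint32_t s16_s31[16]; /* callee-saved single-precision FPU registers */
} sketch_arm_fpu_context_t;

/* 4 + 32 + 4 = 40 bytes; the FPU block adds 16 * 4 = 64 bytes. */
static_assert(sizeof(sketch_arm_context_t) == 40, "unexpected size");
static_assert(sizeof(sketch_arm_fpu_context_t) == 104, "unexpected size");
```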
┌─────────────────┐
│ Uninitialized │
└────────┬────────┘
│
│ init_fiber()
│
v
┌──────────────────┐
│ Ready to Run │◄────────────┐
└────────┬─────────┘ │
│ │
│ switch_context() │ yield()
│ │
v │
┌──────────────────┐ │
│ Running │─────────────┘
└────────┬─────────┘
│
│ function returns
│
v
┌─────────────────┐
│ Completed │
└─────────────────┘
State transitions:
- Uninitialized → Ready
  - User allocates stack
  - Calls init_fiber() with function and user data
  - Context is set up but fiber hasn't started yet
- Ready → Running
  - Scheduler calls switch_context() to the fiber
  - Fiber starts executing from fiber_prologue
  - User function begins running
- Running → Ready
  - Fiber explicitly yields by calling user-provided yield function
  - Yield function calls switch_context() to switch to another fiber/scheduler
  - Fiber's context is saved, can be resumed later
- Running → Completed
  - User function returns
  - fiber_epilogue calls scheduler_return_fiber()
  - Scheduler marks fiber as completed and selects next fiber
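Since the library leaves scheduling to the user, a scheduler may want to track these states itself. The sketch below encodes the four transitions listed above; the enum and function names are hypothetical and not part of cfiber.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state tracking for a user-written scheduler. */
typedef enum {
    SKETCH_FIBER_UNINITIALIZED,
    SKETCH_FIBER_READY,
    SKETCH_FIBER_RUNNING,
    SKETCH_FIBER_COMPLETED
} sketch_fiber_state_t;

/* True only for the transitions in the lifecycle diagram. */
static bool sketch_transition_allowed(sketch_fiber_state_t from,
                                      sketch_fiber_state_t to) {
    switch (from) {
    case SKETCH_FIBER_UNINITIALIZED:
        return to == SKETCH_FIBER_READY;      /* init_fiber() */
    case SKETCH_FIBER_READY:
        return to == SKETCH_FIBER_RUNNING;    /* switch_context() */
    case SKETCH_FIBER_RUNNING:
        return to == SKETCH_FIBER_READY       /* yield */
            || to == SKETCH_FIBER_COMPLETED;  /* function returns */
    case SKETCH_FIBER_COMPLETED:
        return false;                         /* terminal state */
    }
    return false;
}

/* Applies the transition if it is allowed; leaves the state alone otherwise. */
static bool sketch_set_state(sketch_fiber_state_t *state,
                             sketch_fiber_state_t to) {
    assert(state != NULL);
    if (!sketch_transition_allowed(*state, to)) {
        return false;
    }
    *state = to;
    return true;
}
```

A debug build of a scheduler could assert on the return value of sketch_set_state() to catch mistakes such as resuming a completed fiber.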
x86_64:
- Stack alignment: 16 bytes (required by System V ABI)
- Call overhead: ~50-100 CPU cycles
- Context size: 56 bytes
- FPU handling: SSE registers are caller-saved, not saved in context
AArch64:
- Stack alignment: 16 bytes (required by AAPCS64)
- Call overhead: ~40-80 CPU cycles
- Context size: 208 bytes (with FP registers)
- FPU handling: d8-d15 (lower 64 bits of v8-v15) saved
ARM Cortex-M:
- Stack alignment: 8 bytes (required by AAPCS)
- Call overhead: ~30-60 cycles (M3/M4), ~60-100 cycles (M4 with FPU)
- Context size:
  - Without FPU: 40 bytes
  - With FPU: 104 bytes
- FPU handling: s16-s31 saved if CFIBER_ARM_FPU defined
- Instruction set: Thumb/Thumb-2
| Cortex-M | Architecture | FPU Support | Notes |
|---|---|---|---|
| M0/M0+ | ARMv6-M | No | Thumb-1 only, limited instructions |
| M3 | ARMv7-M | No | Full Thumb-2 |
| M4 | ARMv7E-M | Optional | DSP instructions, optional FPU |
| M7 | ARMv7E-M | Optional | Faster, optional double-precision FPU |
Allocation:
- User is responsible for allocating stack memory
- Can use malloc(), static arrays, or custom allocators
- Must remain valid for fiber's entire lifetime
Growth:
- Stacks grow downward (from high addresses to low)
- Stack pointer starts at stack + stack_size
- As functions are called, SP decreases
Sizing considerations:
Stack Size =
(Max Call Depth × Average Frame Size) +
(Largest Local Variable Buffer) +
(Safety Margin)
Example calculation:
// Function chain: main → func1 → func2 → func3
// Each function has ~200 bytes of locals/saved registers
// func3 has a 1KB buffer
Stack Size = (4 × 200) + 1024 + 512 = 2336 bytes
// Round up to 4KB for safety
- Context is stored within the fiber_t structure
- No dynamic allocation required
- Size varies by architecture (40-208 bytes)
Compared to OS threads (1000-10000 cycles), fiber switches are extremely fast:
x86_64: ~50-100 cycles
AArch64: ~40-80 cycles
ARM Cortex-M3/M4: ~30-60 cycles
ARM Cortex-M4+FPU: ~60-100 cycles
- Minimal register saves - Only callee-saved registers
- No system calls - Pure user-space operation
- Cache-friendly - Context spans only one to a few cache lines
- Branch prediction - Consistent control flow
- Alignment - Proper stack alignment avoids penalties
- Memory per fiber: Stack size + context size
- Context switch time: O(1), constant regardless of number of fibers
- Scheduler complexity: Depends on user implementation
- No hard limit: Limited only by available memory
Example capacity:
Embedded system: 64KB RAM
Stack per fiber: 2KB
Context per fiber: 40 bytes
Max fibers: ~30 (accounting for other memory use)
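The capacity estimate is a single division once the memory reserved for everything else is subtracted. The function below is an illustrative sketch; sketch_max_fibers and its reserved parameter are assumptions, and the 2KB reserve in the usage note is just one value that reproduces the "~30" figure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many fibers fit in RAM once `reserved`
 * bytes are set aside for everything that is not a fiber. */
static size_t sketch_max_fibers(size_t total_ram, size_t reserved,
                                size_t stack_size, size_t context_size) {
    assert(stack_size + context_size > 0);
    if (reserved >= total_ram) {
        return 0;
    }
    return (total_ram - reserved) / (stack_size + context_size);
}
```

With 64KB of RAM, 2KB stacks and 40-byte contexts, sketch_max_fibers(65536, 2048, 2048, 40) gives 30, and 31 if nothing is reserved.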
- L1 cache line: Typically 64 bytes
- Context size: Usually fits in 1-3 cache lines
- Memory access pattern: Sequential reads/writes to context
- Stack locality: Active stack portion typically hot in cache
- Frame pointers can be preserved for debugging
- On some platforms, can walk the stack to generate traces
- Clear frame pointer on new stacks signals base of stack
- Use stack canaries - Detect overflow
  uint32_t* canary = (uint32_t*)(fiber->stack);
  *canary = 0xDEADBEEF;
  // Later check:
  assert(*canary == 0xDEADBEEF);
- Initialize stack memory - Detect usage
  memset(fiber->stack, 0xCC, fiber->stack_size);
  // Unused stack will still contain 0xCC pattern
- Validate alignment
  assert(((uintptr_t)fiber->stack & (DEFAULT_ALIGNMENT-1)) == 0);
- Check context sanity
  assert(fiber->ctx.sp >= (uintptr_t)fiber->stack);
  assert(fiber->ctx.sp <= (uintptr_t)fiber->stack + fiber->stack_size);
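The fill-pattern idea can be extended into a simple high-water-mark measurement. The sketch below works on a plain byte buffer rather than a fiber_t, and the names sketch_paint_stack and sketch_stack_high_water are made up. Because stacks grow downward, it scans from the lowest address until it meets the first byte that no longer holds the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_FILL 0xCC /* same pattern as in the memset example */

/* Fill the whole stack with the pattern before the fiber first runs. */
static void sketch_paint_stack(unsigned char *stack, size_t size) {
    assert(stack != NULL);
    memset(stack, SKETCH_FILL, size);
}

/* Estimate of the bytes used so far, counted from the top of the stack.
 * It can undercount if the deepest bytes written happen to equal the
 * fill pattern. */
static size_t sketch_stack_high_water(const unsigned char *stack, size_t size) {
    assert(stack != NULL);
    size_t untouched = 0;
    while (untouched < size && stack[untouched] == SKETCH_FILL) {
        untouched++;
    }
    return size - untouched;
}
```

Checking this value periodically during testing gives an empirical basis for choosing a stack size.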
This section provides practical guidance for using cfiber effectively in your applications.
Choosing the right stack size is critical for fiber performance and reliability.
Recommended sizes by platform:
- x86_64/AArch64: 8KB-64KB typical for hosted environments
- ARM Cortex-M: 2KB-8KB typical for embedded (depends on call depth and local variables)
Calculate required stack size based on:
- Maximum call depth in the fiber
- Size of local variables and buffers
- Any library functions called (check their stack usage)
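These inputs can be combined into a rough first estimate using the formula from the Memory Management section. The helpers below are a sketch with hypothetical names; the result is a starting point and should still be checked against measured usage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper implementing:
 * (max call depth x average frame size) + largest buffer + safety margin */
static size_t sketch_estimate_stack(size_t max_depth, size_t avg_frame,
                                    size_t largest_buffer, size_t margin) {
    return max_depth * avg_frame + largest_buffer + margin;
}

/* Round value up to the next multiple of `multiple`. */
static size_t sketch_round_up(size_t value, size_t multiple) {
    assert(multiple != 0);
    return (value + multiple - 1) / multiple * multiple;
}
```

For four nested calls of about 200 bytes each, a 1KB buffer and a 512-byte margin, sketch_estimate_stack(4, 200, 1024, 512) gives 2336, and sketch_round_up(2336, 4096) rounds that to 4096.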
Important Note for x86_64: The System V ABI defines a 128-byte "red zone" below the stack pointer that leaf functions may use without adjusting rsp. A function that calls switch_context() is not a leaf function, so its red zone is never live across a switch. The practical consequence is for sizing: the deepest leaf call can touch up to 128 bytes below the lowest stack pointer value, so include that in your safety margin.
Always add margin for safety! Stack overflow is undefined behavior and can corrupt memory.
When using Cortex-M4F/M7F with hardware floating-point:
- Set -DCFIBER_ARM_FLOAT_ABI=hard and appropriate FPU type during compilation
- The library automatically saves/restores s16-s31 FP registers
- Registers s0-s15 are caller-saved (not preserved across context switches)
- FPU context adds 64 bytes (16 registers × 4 bytes) and approximately doubles context switch time
When to enable FPU context:
- Enable if your fiber functions use floating-point operations
- Disable if only using integer math (saves memory and cycles)
- All fibers in a system must use the same FPU configuration
cfiber can complement Real-Time Operating Systems:
Integration pattern:
- Each RTOS task can contain multiple fibers
- Fibers provide lightweight cooperative multitasking within a preemptive task
- Combine with RTOS for preemptive scheduling between tasks
Benefits:
- Reduce number of RTOS tasks needed (saves memory)
- Simplify communication between related operations (no locks needed within task)
- Maintain deterministic RTOS scheduling where needed
- Use fibers for I/O-bound operations, RTOS tasks for CPU-bound work
Example use case:
RTOS Task 1 (High Priority - Hard Real-Time):
- Motor control fiber
- Sensor reading fiber
RTOS Task 2 (Normal Priority):
- Network communication fiber
- Data logging fiber
- UI update fiber
Context Switch Performance:
As shown in the Performance Considerations section, fiber context switches are extremely fast:
- x86_64: ~50-100 cycles
- AArch64: ~40-80 cycles
- ARM Cortex-M3/M4: ~30-60 cycles
- ARM Cortex-M4 with FPU: ~60-100 cycles
Compared with a typical OS thread switch (1000-10000 cycles), fibers are roughly 10-300× faster.
Optimization tips:
- Minimize stack size - Use only what you need plus a safety margin
- Yield strategically - Balance responsiveness vs overhead
- Group related fibers - Improves cache locality
- Disable FPU if not needed - On ARM Cortex-M, FPU context doubles switch time
- Use static allocation - Especially in embedded systems, avoids fragmentation
Important: cfiber itself is not thread-safe.
- Fibers within a single OS thread can safely switch between each other
- Multiple OS threads each maintaining their own fiber pool is safe
- Sharing fibers between OS threads requires external synchronization
Multi-threading patterns:
- One fiber pool per thread:
  Thread 1: [Fiber A, Fiber B, Fiber C]
  Thread 2: [Fiber D, Fiber E, Fiber F]
- Work stealing: OS threads can steal fibers from each other with proper locking
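The first pattern needs no locking if each thread's ready queue is thread-local. The sketch below assumes C11 _Thread_local and uses plain integer fiber IDs in a fixed-size ring buffer; all names are hypothetical, and a real scheduler would store pointers to its own fiber records instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_POOL_CAPACITY 8

/* Hypothetical per-thread FIFO of runnable fiber IDs. */
typedef struct {
    int ids[SKETCH_POOL_CAPACITY];
    size_t head;  /* index of the oldest entry */
    size_t count; /* number of queued entries */
} sketch_ready_queue_t;

/* One queue per OS thread, so no synchronization is needed. */
static _Thread_local sketch_ready_queue_t sketch_pool;

/* Append a fiber to the calling thread's queue; false if it is full. */
static bool sketch_push(int fiber_id) {
    if (sketch_pool.count == SKETCH_POOL_CAPACITY) {
        return false;
    }
    size_t tail = (sketch_pool.head + sketch_pool.count) % SKETCH_POOL_CAPACITY;
    sketch_pool.ids[tail] = fiber_id;
    sketch_pool.count++;
    return true;
}

/* Remove the oldest fiber from the calling thread's queue; false if empty. */
static bool sketch_pop(int *fiber_id) {
    assert(fiber_id != NULL);
    if (sketch_pool.count == 0) {
        return false;
    }
    *fiber_id = sketch_pool.ids[sketch_pool.head];
    sketch_pool.head = (sketch_pool.head + 1) % SKETCH_POOL_CAPACITY;
    sketch_pool.count--;
    return true;
}
```

Work stealing would replace the thread-local queue with shared queues protected by locks or atomics, which is outside the scope of this sketch.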