**Added in:** v2.1.0
**Use case:** Test client rate limit handling, backoff strategies, concurrent requests, and timeout scenarios
MockLLMApi now supports rate limiting simulation and n-completions batching to help you test how your applications handle rate-limited APIs, slow responses, and multiple completion variants.
- 🎯 Per-endpoint statistics tracking with moving averages
- ⏱️ Configurable delay ranges for simulating rate limits
- 🔄 Multiple batching strategies (Auto, Sequential, Parallel, Streaming)
- 📊 Detailed response headers with timing information
- 🚀 N-completions support for generating multiple response variants
- 🔧 Fully backward compatible - opt-in feature
- Quick Start
- Configuration
- Usage Examples
- Batching Strategies
- Response Headers
- Response Format
- Use Cases
- API Reference
```json
{
  "MockLlmApi": {
    "EnableRateLimiting": true,
    "RateLimitDelayRange": "500-2000",
    "RateLimitStrategy": "Auto",
    "EnableRateLimitStatistics": true
  }
}
```

```http
GET /api/mock/users?n=3&rateLimit=500-2000
```

```
X-RateLimit-Limit: 25
X-RateLimit-Remaining: 24
X-LLMApi-Avg-Time: 2100
X-LLMApi-Delay-Applied: 1500
```
Add these settings to your appsettings.json under the MockLlmApi section:
```jsonc
{
  "MockLlmApi": {
    // Enable rate limiting simulation (default: false)
    "EnableRateLimiting": false,

    // Delay range in milliseconds (default: null = disabled)
    // Format: "min-max" (e.g., "500-4000"), "max", "avg", or a fixed number
    "RateLimitDelayRange": null,

    // Batching strategy for n-completions (default: Auto)
    // Options: Auto, Sequential, Parallel, Streaming
    "RateLimitStrategy": "Auto",

    // Enable per-endpoint statistics tracking (default: true)
    "EnableRateLimitStatistics": true,

    // Window size for moving average (default: 10)
    "RateLimitStatsWindowSize": 10
  }
}
```

**`EnableRateLimiting`**
- Type: `boolean`
- Default: `false`
- Description: Master switch for rate limiting features. When disabled, no delays are applied regardless of other settings.

**`RateLimitDelayRange`**
- Type: `string` or `null`
- Default: `null`
- Description: Controls how delays are calculated:
  - `"500-4000"` - Random delay between 500ms and 4000ms
  - `"max"` - Delay matches the actual LLM response time (doubles the total response time)
  - `"avg"` - Delay matches the endpoint's moving average response time
  - `"1500"` - Fixed delay of 1500ms
  - `null` - No delay (same as `EnableRateLimiting: false`)
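The delay formats above can be illustrated with a small sketch. This is not the library's actual code; `resolveDelayMs` and its parameters are hypothetical names mirroring the documented behavior:

```typescript
// Illustrative resolver for the documented RateLimitDelayRange formats.
function resolveDelayMs(
  spec: string | null,
  lastResponseMs: number, // actual LLM response time, used by "max"
  avgResponseMs: number   // endpoint moving average, used by "avg"
): number {
  if (spec === null) return 0;               // null => no delay
  if (spec === "max") return lastResponseMs; // mirror processing time
  if (spec === "avg") return avgResponseMs;  // mirror moving average
  const range = spec.match(/^(\d+)-(\d+)$/); // "min-max" => random in range
  if (range) {
    const min = Number(range[1]);
    const max = Number(range[2]);
    return min + Math.floor(Math.random() * (max - min + 1));
  }
  return Number(spec);                       // fixed delay, e.g. "1500"
}
```

For example, `resolveDelayMs("500-2000", 2100, 1800)` returns a random value between 500 and 2000, while `"avg"` would return 1800.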
**`RateLimitStrategy`**
- Type: `enum` (`Auto`, `Sequential`, `Parallel`, `Streaming`)
- Default: `Auto`
- Description: Strategy for executing n-completions (see Batching Strategies)

**`EnableRateLimitStatistics`**
- Type: `boolean`
- Default: `true`
- Description: Tracks per-endpoint response times for calculating realistic rate limits

**`RateLimitStatsWindowSize`**
- Type: `int`
- Default: `10`
- Description: Number of recent requests to include in the moving average calculation
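A fixed-size moving average like the one `RateLimitStatsWindowSize` configures can be sketched as follows (illustrative only, not the library's implementation):

```typescript
// Keeps only the most recent `windowSize` samples and averages them.
class MovingAverage {
  private samples: number[] = [];

  constructor(private windowSize: number = 10) {}

  add(responseTimeMs: number): void {
    this.samples.push(responseTimeMs);
    if (this.samples.length > this.windowSize) {
      this.samples.shift(); // drop the oldest sample
    }
  }

  get average(): number {
    if (this.samples.length === 0) return 0;
    return this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
  }
}
```

With the default window of 10, the twelfth request pushes out the first two samples, so the average always reflects recent endpoint behavior.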
Request 3 variations of the same response:
```http
GET /api/mock/users?n=3
```

Response:

```json
{
  "completions": [
    {
      "index": 0,
      "content": [{"id": 1, "name": "Alice Johnson"}],
      "timing": {"requestTimeMs": 2340, "delayAppliedMs": null}
    },
    {
      "index": 1,
      "content": [{"id": 2, "name": "Bob Smith"}],
      "timing": {"requestTimeMs": 2280, "delayAppliedMs": null}
    },
    {
      "index": 2,
      "content": [{"id": 3, "name": "Charlie Davis"}],
      "timing": {"requestTimeMs": 2410, "delayAppliedMs": null}
    }
  ],
  "meta": {
    "strategy": "Parallel",
    "totalRequestTimeMs": 7030,
    "totalDelayMs": 0,
    "totalElapsedMs": 2410,
    "averageRequestTimeMs": 2343
  }
}
```

Add delays between completions:
```http
GET /api/mock/users?n=5&rateLimit=500-2000&strategy=sequential
```

This will:
- Generate first completion
- Wait 500-2000ms (random)
- Generate second completion
- Wait 500-2000ms
- Repeat for all 5 completions
Use "max" to simulate APIs that rate limit based on processing time:
```http
GET /api/mock/products?n=3&rateLimit=max
```

If each LLM request takes ~2000ms, the delay will also be ~2000ms, effectively doubling the response time.
Override config via headers:
```http
GET /api/mock/orders?n=4
X-Rate-Limit-Delay: 1000-3000
X-Rate-Limit-Strategy: parallel
```

Precedence order (highest to lowest):

1. Query parameters (`?rateLimit=`, `?strategy=`)
2. HTTP headers (`X-Rate-Limit-Delay`, `X-Rate-Limit-Strategy`)
3. Global configuration (`appsettings.json`)
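The precedence rule reduces to a simple null-coalescing chain. A minimal sketch (hypothetical helper, not the library's code):

```typescript
// Query parameter wins over header, which wins over global config.
function resolveSetting(
  query: string | null,
  header: string | null,
  config: string | null
): string | null {
  return query ?? header ?? config ?? null;
}
```

So a request carrying both `?rateLimit=500-2000` and `X-Rate-Limit-Delay: 1000-3000` would be governed by `500-2000`.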
Enable rate limiting for a single request even if globally disabled:
```http
GET /api/mock/users?n=2&rateLimit=1000-2000
```

Disable rate limiting for a single request even if globally enabled:

```http
GET /api/mock/users?n=2&rateLimit=0
```

Choose how multiple completions are executed:
System automatically selects the best strategy based on n:
- n = 1: No batching (single request)
- n = 2-5: `Parallel` (fastest for small batches)
- n > 5: `Streaming` (most efficient for large batches)
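The selection rule above can be sketched in a few lines (illustrative, not the library's code):

```typescript
// Auto strategy selection based on the documented thresholds.
type Strategy = "None" | "Parallel" | "Streaming";

function pickStrategy(n: number): Strategy {
  if (n <= 1) return "None";     // single request: no batching
  if (n <= 5) return "Parallel"; // small batches: fastest
  return "Streaming";            // large batches: most efficient
}
```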
Example:
```http
GET /api/mock/users?n=10&strategy=auto
# Automatically uses Streaming strategy
```

Execute requests one at a time with delays between each.
Pattern: Request 1 → Delay → Request 2 → Delay → Request 3
Best for:
- Testing backoff strategies
- Predictable timing requirements
- Sequential dependency validation
Example:
```http
GET /api/mock/users?n=3&rateLimit=1000-2000&strategy=sequential
```

Timeline:

```
0ms:    Start Request 1
2340ms: Complete Request 1
2340ms: Apply delay (1500ms)
3840ms: Start Request 2
6120ms: Complete Request 2
6120ms: Apply delay (1200ms)
7320ms: Start Request 3
9730ms: Complete Request 3
Total:  9730ms
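The Sequential pattern is essentially an awaited loop with a pause between iterations. A minimal sketch, where `generate` stands in for one LLM completion (assumed helper names, not the library's code):

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Request 1 -> delay -> Request 2 -> delay -> ... -> Request n
async function runSequential<T>(
  generate: () => Promise<T>,
  n: number,
  delayMs: number
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < n; i++) {
    results.push(await generate());      // run one completion to the end
    if (i < n - 1) await sleep(delayMs); // delay between completions only
  }
  return results;
}
```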
Start all requests simultaneously, stagger response delivery with delays.
Pattern: Start all → Complete independently → Apply delays → Return in order
Best for:
- Fast completion with simulated rate limiting
- Testing concurrent request handling
- Maximum throughput testing
Example:
```http
GET /api/mock/users?n=4&rateLimit=500-1500&strategy=parallel
```

Timeline:

```
0ms:    Start all 4 requests in parallel
2340ms: Complete Request 1, delay 1000ms, deliver at 3340ms
2380ms: Complete Request 2, delay 1200ms, deliver at 3580ms
2410ms: Complete Request 3, delay 1100ms, deliver at 3510ms
2360ms: Complete Request 4, delay 1300ms, deliver at 3660ms
Total:  ~3660ms (vs 9490ms sequential)
```
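The Parallel pattern starts everything at once and holds each result back by its own delay. An illustrative sketch with assumed helper names (`generate`, `delayMs`), not the library's code:

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Start all n completions concurrently; each waits out its own
// simulated rate-limit delay before delivery.
async function runParallel<T>(
  generate: () => Promise<T>,
  n: number,
  delayMs: () => number // per-result delay, e.g. random in 500-1500
): Promise<T[]> {
  return Promise.all(
    Array.from({ length: n }, async () => {
      const result = await generate();
      await sleep(delayMs()); // staggered delivery
      return result;
    })
  );
}
```

Total wall-clock time is roughly the slowest completion plus its delay, rather than the sum of all of them.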
Return results as they complete with rate-limited delays between each.
Pattern: Complete Request 1 → Deliver → Delay → Complete Request 2 → Deliver → Delay
Best for:
- Real-time UIs with SSE
- Large batch processing (n > 5)
- Progressive result delivery
Example:
```http
GET /api/mock/users?n=6&rateLimit=500-1000&strategy=streaming
```

Timeline:

```
0ms:    All requests start via native n-completions API
2400ms: First result available, deliver immediately
3400ms: Second result (1000ms delay applied)
4300ms: Third result (900ms delay applied)
5200ms: Fourth result (900ms delay applied)
...
```
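Progressive delivery maps naturally onto an async generator. A sketch under assumed helpers (not the library's code): results are yielded in order as their promises resolve, with a rate-limited gap between deliveries.

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Yield each completion as soon as its promise resolves, pausing
// between deliveries to simulate rate limiting.
async function* streamCompletions<T>(
  pending: Promise<T>[],
  delayMs: number
): AsyncGenerator<T> {
  for (let i = 0; i < pending.length; i++) {
    yield await pending[i];
    if (i < pending.length - 1) await sleep(delayMs);
  }
}
```

A consumer can `for await` over this to update a UI progressively instead of blocking on the full batch.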
All responses include detailed timing and rate limit information:
```
X-RateLimit-Limit: 25
X-RateLimit-Remaining: 24
X-RateLimit-Reset: 1704067200
```
- X-RateLimit-Limit: Maximum requests allowed per minute (calculated from avg response time)
- X-RateLimit-Remaining: Requests remaining in current window (simulated)
- X-RateLimit-Reset: Unix timestamp when rate limit resets
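A client can use these headers to decide how long to pause once the window is exhausted. A sketch of that logic (the `HeaderBag` shape and helper name are assumptions for illustration):

```typescript
// Anything with a case-insensitive get(), e.g. the fetch Headers object.
interface HeaderBag {
  get(name: string): string | null;
}

// Returns 0 while budget remains; otherwise ms until X-RateLimit-Reset.
function msUntilReset(headers: HeaderBag, nowMs: number = Date.now()): number {
  const remaining = Number(headers.get("X-RateLimit-Remaining") ?? "1");
  if (remaining > 0) return 0; // still have request budget
  const resetUnix = Number(headers.get("X-RateLimit-Reset") ?? "0");
  return Math.max(0, resetUnix * 1000 - nowMs); // wait until the window resets
}
```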
```
X-LLMApi-Request-Time: 2340
X-LLMApi-Avg-Time: 2100
X-LLMApi-Total-Elapsed: 11520
X-LLMApi-Delay-Applied: 1500
```
- X-LLMApi-Request-Time: This request's LLM response time in milliseconds
- X-LLMApi-Avg-Time: Moving average response time for this endpoint
- X-LLMApi-Total-Elapsed: Total time including LLM + delays
- X-LLMApi-Delay-Applied: Total artificial delay applied (for n-completions, sum of all delays)
```http
HTTP/1.1 200 OK
Content-Type: application/json
X-RateLimit-Limit: 25
X-RateLimit-Remaining: 24
X-RateLimit-Reset: 1704067200
X-LLMApi-Request-Time: 2340
X-LLMApi-Avg-Time: 2100
X-LLMApi-Total-Elapsed: 11520
X-LLMApi-Delay-Applied: 4500

{...response body...}
```

Standard JSON response without wrapper:
```json
[
  {"id": 1, "name": "Alice Johnson"},
  {"id": 2, "name": "Bob Smith"}
]
```

Structured response with timing metadata:
```json
{
  "completions": [
    {
      "index": 0,
      "content": [
        {"id": 1, "name": "Alice Johnson"}
      ],
      "timing": {
        "requestTimeMs": 2340,
        "delayAppliedMs": 1500
      }
    },
    {
      "index": 1,
      "content": [
        {"id": 2, "name": "Bob Smith"}
      ],
      "timing": {
        "requestTimeMs": 2280,
        "delayAppliedMs": 1200
      }
    },
    {
      "index": 2,
      "content": [
        {"id": 3, "name": "Charlie Davis"}
      ],
      "timing": {
        "requestTimeMs": 2410,
        "delayAppliedMs": 1300
      }
    }
  ],
  "meta": {
    "strategy": "Sequential",
    "totalRequestTimeMs": 7030,
    "totalDelayMs": 4000,
    "totalElapsedMs": 11030,
    "averageRequestTimeMs": 2343
  }
}
```

`completions[]`:
- `index`: Zero-based index of the completion
- `content`: The generated JSON content (parsed object)
- `timing.requestTimeMs`: LLM processing time for this specific completion
- `timing.delayAppliedMs`: Artificial delay applied after this completion (`null` if none)

`meta`:
- `strategy`: The batching strategy used (`Auto` resolves to the actual strategy)
- `totalRequestTimeMs`: Sum of all LLM processing times
- `totalDelayMs`: Sum of all artificial delays
- `totalElapsedMs`: Total wall-clock time for the request
- `averageRequestTimeMs`: Average LLM processing time per completion
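The `meta` fields are simple aggregates over the per-completion timings. A sketch showing how the documented example numbers add up (illustrative, not the library's code):

```typescript
interface Timing {
  requestTimeMs: number;
  delayAppliedMs: number | null;
}

// Aggregate per-completion timings into the documented meta fields.
function buildMeta(timings: Timing[], totalElapsedMs: number) {
  const totalRequestTimeMs = timings.reduce((s, t) => s + t.requestTimeMs, 0);
  const totalDelayMs = timings.reduce((s, t) => s + (t.delayAppliedMs ?? 0), 0);
  return {
    totalRequestTimeMs,
    totalDelayMs,
    totalElapsedMs,
    averageRequestTimeMs: Math.round(totalRequestTimeMs / timings.length),
  };
}
```

Feeding in the example's timings (2340+1500, 2280+1200, 2410+1300) reproduces `totalRequestTimeMs: 7030`, `totalDelayMs: 4000`, and `averageRequestTimeMs: 2343`.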
Simulate 429 responses and verify your backoff logic:
```http
GET /api/mock/users?n=10&rateLimit=100-500&strategy=sequential
```

Monitor response times to ensure your client handles delays appropriately.
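The backoff logic you would exercise against this endpoint might look like the following sketch: exponential backoff with full jitter around an assumed `attempt` callback (hypothetical names, not part of MockLLMApi):

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Retry a failing call with exponentially growing, jittered pauses.
async function withBackoff<T>(
  attempt: () => Promise<T>,
  maxRetries: number = 5,
  baseDelayMs: number = 100
): Promise<T> {
  for (let retry = 0; ; retry++) {
    try {
      return await attempt();
    } catch (err) {
      if (retry >= maxRetries) throw err;
      // Full jitter: random wait in [0, base * 2^retry)
      const cap = baseDelayMs * 2 ** retry;
      await sleep(Math.floor(Math.random() * cap));
    }
  }
}
```

Pointing `attempt` at a rate-limited mock endpoint lets you verify the retry count and total elapsed time grow as expected.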
Test how your app handles slow APIs:
```http
GET /api/mock/products?n=1&rateLimit=5000
# Adds a 5 second delay to test timeout behavior
```

Generate multiple completions in parallel:
```http
GET /api/mock/orders?n=5&strategy=parallel
```

Verify your app can handle multiple simultaneous responses.
Generate diverse mock data for the same request:
```http
GET /api/mock/users?shape={"name":"string","age":"number"}&n=5
```

Each completion will have different names and ages, simulating real API variance.
Measure how rate limiting impacts your app's performance:
```bash
# No rate limiting
curl "/api/mock/users?n=10" -w "Time: %{time_total}s\n"

# With rate limiting
curl "/api/mock/users?n=10&rateLimit=500-1000" -w "Time: %{time_total}s\n"
```

Test exponential backoff implementations:
```http
# First request - fast
GET /api/mock/users?rateLimit=100-200

# Subsequent requests - simulate increasing delays
GET /api/mock/users?rateLimit=500-1000
GET /api/mock/users?rateLimit=2000-4000
```

| Parameter | Type | Description | Example |
|---|---|---|---|
| `n` | integer | Number of completions to generate | `?n=5` |
| `rateLimit` | string | Delay range override | `?rateLimit=500-2000` |
| `strategy` | enum | Batching strategy override | `?strategy=parallel` |
| Header | Type | Description | Example |
|---|---|---|---|
| `X-Rate-Limit-Delay` | string | Delay range override | `X-Rate-Limit-Delay: 1000-3000` |
| `X-Rate-Limit-Strategy` | enum | Strategy override | `X-Rate-Limit-Strategy: sequential` |
| Header | Type | Description |
|---|---|---|
| `X-RateLimit-Limit` | integer | Max requests per minute |
| `X-RateLimit-Remaining` | integer | Requests remaining |
| `X-RateLimit-Reset` | integer | Unix timestamp for reset |
| `X-LLMApi-Request-Time` | integer | LLM processing time (ms) |
| `X-LLMApi-Avg-Time` | integer | Endpoint average time (ms) |
| `X-LLMApi-Total-Elapsed` | integer | Total response time (ms) |
| `X-LLMApi-Delay-Applied` | integer | Artificial delay applied (ms) |
```typescript
interface RateLimitConfig {
  EnableRateLimiting: boolean;          // default: false
  RateLimitDelayRange?: string | null;  // "min-max", "max", "avg", or null
  RateLimitStrategy: "Auto" | "Sequential" | "Parallel" | "Streaming";
  EnableRateLimitStatistics: boolean;   // default: true
  RateLimitStatsWindowSize: number;     // default: 10
}
```

Test how your app adapts to changing rate limits:
```bash
# Start with generous limits
curl "/api/mock/users?n=5&rateLimit=100-500"

# Gradually increase pressure
curl "/api/mock/users?n=5&rateLimit=500-1500"
curl "/api/mock/users?n=5&rateLimit=1500-3000"

# Simulate rate limit exhaustion
curl "/api/mock/users?n=5&rateLimit=5000-10000"
```

Combine with tools like Apache Bench:
```bash
# Test sustained load with rate limiting
ab -n 100 -c 10 "http://localhost:5116/api/mock/users?n=2&rateLimit=500-1000"
```

Different strategies for different endpoints:
```bash
# Fast completions for critical data
curl "/api/mock/orders?n=3&strategy=parallel"

# Sequential for less critical data
curl "/api/mock/analytics?n=10&strategy=sequential&rateLimit=1000-2000"
```

- Statistics tracking: Each endpoint maintains a queue of recent response times (default: 10 entries)
- N-completions: Memory scales linearly with `n` (each completion is held in memory)
- Recommendation: For `n > 20`, consider using the `Streaming` strategy
- Sequential: Low CPU, one request at a time
- Parallel: High CPU burst during concurrent execution
- Streaming: Moderate CPU, balanced approach
- Auto: Automatically optimizes based on batch size
All batching strategies use the same total bandwidth (n requests to LLM), but differ in timing:
- Sequential: Bandwidth spread over longer time period
- Parallel: Bandwidth burst at start, staggered delivery
- Streaming: Balanced bandwidth usage over time
Problem: Delays not being applied
Checklist:
- ✅ `EnableRateLimiting: true` in config
- ✅ `RateLimitDelayRange` is not `null`
- ✅ `n > 1` in the request (single requests don't apply delays by default)
- ✅ Check logs for any errors
Problem: Getting wrapped response when expecting single object
Solution: N-completions (`n > 1`) always return the wrapped format. For single completions, omit `n` or use `n=1`.
Problem: X-LLMApi-Avg-Time header missing
Checklist:
- ✅ `EnableRateLimitStatistics: true`
- ✅ Make multiple requests to the same endpoint (stats need data)
- ✅ Check that endpoint path is consistent (query params don't affect path tracking)
Problem: Slow responses with large n values
Solutions:
- Use `strategy=parallel` for faster completion
- Reduce `n` to a manageable size (`n ≤ 10` recommended)
- Increase `MaxContextWindow` and `TimeoutSeconds` in config
- Consider using a faster LLM model
No breaking changes! Rate limiting is disabled by default.
To enable:
```json
{
  "MockLlmApi": {
    "EnableRateLimiting": true,
    "RateLimitDelayRange": "500-2000"
  }
}
```

- All existing endpoints work unchanged
- New headers added to all responses (minimal overhead)
- Configuration is purely additive
See the LLMApi/LLMApi.http file for complete examples of all rate limiting and batching scenarios.
- Main README - Project overview
- Multiple LLM Backends - Backend configuration
- Error Simulation - Error handling features
Found a bug or have a feature request? Open an issue: https://github.com/scottgal/mostlylucid.mockllmapi/issues
This feature is part of MockLLMApi and is released under the Unlicense.