Patent Pending — USPTO Provisional Application on file. See PATENT_PENDING.md.
SIS (Speculative Intent System) is a two-tier architecture that fires a speculative intent on partial input, before the user finishes speaking or typing, and reconciles it against a slower, authoritative model running in parallel. The result: median time-to-first-action on text input falls from 2,128ms to 8.924ms (239× faster), with zero false-positive fires on ambiguous inputs across four consecutive validation runs.
This repository contains the open-source test harness used to validate the architecture across three input modalities (text-paste, text-typed, audio) and three hardware tiers (CPU, RTX 3070 CUDA, Jetson Orin Nano).
flowchart LR
subgraph Input["Input Stream (partial)"]
A[Typed / Spoken / Pasted]
end
subgraph Tier1["Tier-1 — Fast Classifier"]
B[Sentence-transformer\nMiniLM-L6-v2\nMedian: 8.9ms]
end
subgraph Tier2["Tier-2 — Authoritative Model"]
C[LLM\nLlama 3.1 8B / Qwen\nMedian: ~2s]
end
subgraph Reconciler["Reconciler — 8 Internal Outcomes"]
D{Match?}
end
subgraph UIState["4 User-Visible States"]
E[MATCH — action stands]
F[REFINEMENT — action updated]
G[MISMATCH — action rolled back]
H[CONFUSION — surfaces to user]
end
A -->|partial stream| B
A -->|complete input| C
B -->|speculative fire| D
C -->|authoritative prediction| D
D --> E
D --> F
D --> G
D --> H
- Tier-1 watches the partial input stream. When it recognizes an intent with sufficient confidence, it fires a speculative action immediately — in ~9ms on text, ~2.6s on audio with CUDA ASR.
- Tier-2 runs concurrently on the same input. By the time it finishes, the user may have already seen their action begin executing.
- The Reconciler classifies the relationship between the Tier-1 and Tier-2 predictions into one of eight internal outcomes and surfaces one of four UI states. This is the inventive core: the UI states present LLM latency as character expression rather than a loading spinner.
- Rollback is available for mismatched speculative actions. The architecture requires that Tier-1 only fires on rollback-feasible action classes in the current world state — preserving safety without sacrificing speed.
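The flow described in the bullets above can be sketched with asyncio. This is a minimal illustration under stated assumptions, not the harness's orchestrator: the keyword-based tier functions, `CONFIDENCE_THRESHOLD`, `ROLLBACK_FEASIBLE`, and the `NO_FIRE` label are all hypothetical stand-ins, and the reconciliation is reduced to a simple intent comparison.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical names and values throughout; the harness's real API may differ.
CONFIDENCE_THRESHOLD = 0.80                      # assumed value, not from thresholds.json
ROLLBACK_FEASIBLE = {"lights_on", "play_music"}  # assumed action classes


@dataclass
class Prediction:
    intent: str
    confidence: float


async def tier1_classify(partial: str) -> Prediction:
    """Stand-in for the fast classifier: keyword match, millisecond-scale."""
    await asyncio.sleep(0.001)
    if "light" in partial:
        return Prediction("lights_on", 0.95)
    return Prediction("unknown", 0.10)


async def tier2_classify(full_text: str) -> Prediction:
    """Stand-in for the authoritative LLM: slower, sees the complete input."""
    await asyncio.sleep(0.05)
    intent = "lights_on" if "light" in full_text else "unknown"
    return Prediction(intent, 1.0)


def may_fire(pred: Prediction) -> bool:
    """Fire speculatively only when confident and the action can be undone."""
    return pred.confidence >= CONFIDENCE_THRESHOLD and pred.intent in ROLLBACK_FEASIBLE


async def run_trial(partials: list[str], full_text: str) -> dict:
    # Tier-2 starts immediately and runs concurrently with Tier-1.
    tier2_task = asyncio.create_task(tier2_classify(full_text))
    fired = None
    for partial in partials:
        pred = await tier1_classify(partial)
        if may_fire(pred):
            fired = pred.intent  # the speculative action would begin here
            break
    authoritative = await tier2_task
    if fired is None:
        state = "NO_FIRE"    # hypothetical label: Tier-1 never fired
    elif fired == authoritative.intent:
        state = "MATCH"
    else:
        state = "MISMATCH"   # the speculative action would be rolled back
    return {"fired": fired, "tier2": authoritative.intent, "state": state}


result = asyncio.run(
    run_trial(["turn", "turn on the li", "turn on the light"], "turn on the lights")
)
print(result)
```

In this toy run Tier-1 stays silent on the first two partials and fires on the third, while the Tier-2 task has been running since the start of the trial.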
The architecture is modality-agnostic: the same orchestrator code drove text-typed, text-paste, audio-CPU, and audio-CUDA batteries with zero behavioral branching. Only the input streamer changes.
All thresholds were locked before any data was collected (see sis_test_harness/config/thresholds.json, timestamped 2026-05-21). This is the anti-bias provision: thresholds cannot be loosened after seeing data.
Run ID 96a2827e | 600 trials | Hardware: RTX 3070 (CPU-only Tier-2)
| Criterion | Threshold | Result | Status |
|---|---|---|---|
| Easy-band Tier-1 accuracy | ≥ 85% | 100.0% (120/120) | ✅ |
| Medium-band Tier-1 accuracy | ≥ 70% | 100.0% (90/90) | ✅ |
| Hard-band Tier-1 accuracy | ≥ 50% | 66.7% (58/87) | ✅ |
| Hard-band false-positive rate | < 25% | 0.00% (0/29) | ✅ |
| Tier-1 median latency | ≤ 200ms | 8.924 ms | ✅ |
Headline: Median time-to-first-action drops from 2,128ms (baseline) to 8.924ms (SIS) — 239× faster, with zero false fires on ambiguous input.
Text-paste mode is where the T1 latency claim lives: the input is emitted as a single partial, so t1_ms reflects pure classifier compute with no streamer-pacing artifact.
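The pass/fail logic behind the text-paste results table can be reproduced in a few lines. The sketch below copies the thresholds and counts from that table (run 96a2827e); the tuple layout and criterion names are illustrative and are not the schema of thresholds.json or of the harness's verdict script.

```python
import operator

# Thresholds and observed values copied from the text-paste table (run 96a2827e).
# The tuple layout and names are illustrative, not the harness's schema.
CRITERIA = [
    # (name, comparison, threshold, observed)
    ("easy_accuracy", operator.ge, 0.85, 120 / 120),
    ("medium_accuracy", operator.ge, 0.70, 90 / 90),
    ("hard_accuracy", operator.ge, 0.50, 58 / 87),
    ("hard_false_positive_rate", operator.lt, 0.25, 0 / 29),
    ("tier1_median_latency_ms", operator.le, 200.0, 8.924),
]


def evaluate(criteria):
    """Return {name: passed} by applying each comparison to (observed, threshold)."""
    return {
        name: compare(observed, threshold)
        for name, compare, threshold, observed in criteria
    }


verdict = evaluate(CRITERIA)
for name, passed in verdict.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
print("overall:", "PASS" if all(verdict.values()) else "FAIL")
```

Note that the false-positive criterion uses a strict comparison (`<`), while the accuracy and latency criteria are inclusive, matching the operators shown in the Threshold column.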
Run ID e0916c62 | 600 trials | Hardware: RTX 3070 (faster-whisper small.en CUDA + CPU Tier-2)
| Criterion | Threshold | Result | Status |
|---|---|---|---|
| Medium-band Tier-1 accuracy | ≥ 70% | 100.0% (90/90) | ✅ |
| Hard-band Tier-1 accuracy (behavioral) | ≥ 50% | 66.7% (58/87) | ✅ |
| Hard-band false-positive rate | < 25% | 0.00% (0/29) | ✅ |
| Tier-1 latency vs CPU baseline | — | 16–21× speedup | ✅ |
Behavioral accuracy is at parity with the GREEN text-paste run. The 200ms T1 latency threshold does not apply to audio mode, because wall-clock latency there is bounded by utterance duration, not classifier compute. Full analysis in patent_anchor_data/README.md.
Zero hard-band false-positive fires across:
- text-typed CPU (948c4160)
- text-paste CPU (96a2827e)
- audio CPU (624194eb)
- audio CUDA (e0916c62)
The safety invariant holds across all four runs, covering three input modalities (text-typed, text-paste, audio) and two hardware configurations (CPU and RTX 3070 CUDA).
sis-public-repo/
├── README.md ← you are here
├── PATENT_PENDING.md
├── sis_test_harness/ ← full benchmark harness (see below)
│ ├── src/ ← orchestrator, reconciler, Tier-1, Tier-2, streamers
│ ├── config/ ← locked thresholds + corpus (pre-data-collection)
│ ├── audio/ ← corpus WAV generation
│ ├── analysis/ ← stats, verdict, plotting scripts
│ ├── tests/ ← pytest suite
│ ├── docs/ ← architecture decisions, reconciler state diagram
│ └── results/ ← CSV output (gitignored; patent_anchor_data is the archive)
└── patent_anchor_data/ ← locked, immutable validation datasets
├── README.md ← full per-dataset analysis
├── 2026-05-22_text-paste_battery_96a2827e.csv
├── 2026-05-22_text-paste_stats_96a2827e.json
├── 2026-05-22_text-typed_battery_948c4160.csv
├── 2026-05-22_text-typed_stats_948c4160.json
├── 2026-05-23_audio_battery_624194eb.csv
├── 2026-05-23_audio_stats_624194eb.json
├── 2026-05-23_audio-rtx3070-cuda_battery_e0916c62.csv
└── 2026-05-23_audio-rtx3070-cuda_stats_e0916c62.json
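The archive filenames in the listing above appear to follow a `<date>_<label>_<kind>_<run_id>.<ext>` pattern. Assuming that pattern holds (it is inferred from the listing, not documented), a small helper can recover the run ID and modality label when pairing a battery CSV with its stats JSON. `AnchorFile` and `parse_anchor_name` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AnchorFile:
    date: str    # e.g. "2026-05-22"
    label: str   # modality / hardware label, e.g. "text-paste"
    kind: str    # "battery" (CSV) or "stats" (JSON)
    run_id: str  # 8-character run ID, e.g. "96a2827e"


def parse_anchor_name(filename: str) -> AnchorFile:
    """Split '<date>_<label>_<kind>_<run_id>.<ext>' into its four fields."""
    stem = PurePosixPath(filename).stem
    # Underscores separate fields; hyphens occur only inside a field.
    date, label, kind, run_id = stem.split("_")
    return AnchorFile(date, label, kind, run_id)


info = parse_anchor_name("2026-05-23_audio-rtx3070-cuda_stats_e0916c62.json")
print(info.label, info.kind, info.run_id)
```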
Requirements: Python 3.11+, a GGUF-capable LLM backend (llama-cpp-python), sentence-transformers. Full setup in sis_test_harness/README.md.
# Clone
git clone https://github.com/entelyx/speculative-intent-system
cd speculative-intent-system/sis_test_harness
# Install
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Smoke test (no models needed — uses mocks)
python -m src.runner smoke --mock
python -m src.runner smoke-exercise
# Full battery (requires Llama 3.1 8B GGUF at models/)
python -m src.runner battery --seed 42 --variant embedding --modality text-paste

- Tier-1 median latency: 8.924ms on CPU (text-paste, consumer hardware)
- Mechanism correctness across three input modalities (text-typed, text-paste, audio)
- Safety invariant — 0% hard-band false-positive rate across four consecutive runs
- Consumer GPU (RTX 3070) reduces audio T1 latency by 16–21× vs CPU baseline
- Modality-agnostic architecture — same orchestrator code, different streamers
- Jetson Orin Nano battery — validates the same harness on constrained edge hardware (8GB unified memory). Pending model swap (Llama 3.1 8B → Llama 3.2 3B) to resolve OOM at the Tier-2 layer.
- RTX 3070 CUDA full battery — Tier-1 GPU acceleration validation. Expected to further reduce text-mode latency below the CPU baseline.
- Tier-1 LoRA distillation — replacing the embedding similarity classifier with a fine-tuned small model (SmolLM2-360M or Qwen2.5-0.5B base). First training run on DGX Spark GB10.
- SIS SDK — a licensable API surface for third-party edge AI developers. Spec in progress.
- CascGang embodiment — SIS as the command intent layer for robotic swarm orchestration.
- HoloAssist embodiment — SIS driving a local-LLM holographic assistant interface.
| Tier | Hardware | ASR | Tier-2 | Status |
|---|---|---|---|---|
| 0 | RTX 3070 (CPU-only) | pywhispercpp CPU | Llama 3.1 8B CPU | ✅ Locked |
| 1 | RTX 3070 (CUDA) | faster-whisper CUDA | Llama 3.1 8B CPU | ✅ Locked |
| 2 | Jetson Orin Nano 8GB | faster-whisper CPU | Llama 3.2 3B (pending OOM fix) | 🔄 In progress |
| 3 | DGX Spark GB10 | faster-whisper CUDA | Llama 3.1 70B or LoRA | ⚪ Planned |
The reconciler classifies every trial into one of eight internal outcomes:
| Outcome | Tier-1 fires? | Tier-2 agrees? | UI State | Action |
|---|---|---|---|---|
| Match | ✅ | ✅ same intent + slots | MATCH | Stand |
| Refinement | ✅ | ✅ same intent, adds slots | REFINEMENT | Update in place |
| Contraction | ✅ | ✅ same intent, removes slots | REFINEMENT | Update in place |
| Slot Conflict | ✅ | ✅ same intent, changes slot value | REFINEMENT | Update to T2 value |
| Mismatch | ✅ | ❌ different intent | MISMATCH | Roll back, execute T2 |
| Tier-2 Abstain | ✅ | — abstained | MISMATCH | Roll back |
| Confusion | ✅ | malformed output | CONFUSION | Surface to user |
| Timeout | ✅ | — timed out | MISMATCH | Treat as abstain |
If you use this harness or reference this architecture in your work:
Speculative Intent System — Validation Harness
Entelyx LLC, 2026
USPTO Provisional Patent Application on file
https://github.com/entelyx/speculative-intent-system
The test harness source code is released under the MIT License.
The patent anchor datasets (patent_anchor_data/) are released as evidence of reduction-to-practice and are provided for reference and reproducibility verification. They may not be used to challenge the novelty or priority of the associated provisional patent application.
Built by Entelyx LLC — physical AI, edge inference, and embedded intent systems.