PRISM-KB

Fork of PRISM exploring the dormant knowledge mechanism — culminating in a measured holographic breakthrough.

🏆 The breakthrough — PRISM-Holo

👉 PRISM-HOLO.md — full writeup + reproduction.

Replacing the soft-attention memory tape with an algebraic holographic (VSA) tape yields:

Metric	Neural attention tape	Holographic tape
Specificity correlation	+0.006 (random)	+0.355 ✅
200 facts retrieved	n/a	100%
Training required	full Prism	NONE

The holographic tape stores and retrieves facts algebraically, with zero training, at 60× the specificity of the neural tape. This is the path out of the "6·N·D" Chinchilla trap (which only applies to monolithic Transformers, not to PRISM's externalized-memory architecture).

The two axes — be precise about what "no retraining" means

PRISM-Holo provides two independent capabilities, and it is important not to conflate them:

Axis	What it changes	Requires training?	How
1. Scaling the model (350M → 1B)	Model capacity	✅ Yes (but ~40-50% cheaper via PCS)	`--progressive` flag: train at small scale, grow weights, fine-tune
2. Adding knowledge (facts, datasets)	What the model knows	❌ NO — zero gradient	`tape.bind(key, value)`: pure algebra, instant

Axis 1 (PCS) makes training a 1B model affordable: stages 350M → 700M → 1B with grow_model() transferring weights at each step. The bulk of tokens train on the small model. Real gradient descent, but ~40-50% less wall-clock than from-scratch 1B.
Axis 2 (Holo) is the true "no retraining" path: once a model exists at any size, you bind new facts into the tape algebraically. Zero backward, zero GPU. This is the +0.355 specificity result.

Run python -m prism.two_axes_demo to see both axes composed end-to-end (CPU, ~seconds).

Together: scale to 1B cheaply (PCS), then customize per-client or per-task by binding facts (Holo). Adding new knowledge to an already-trained model requires no retraining.

Training cost & savings — why PRISM-Holo is cheaper than a Transformer 1B

The Chinchilla scaling law (C ≈ 6·N·D) is the empirical cost rule for Transformer monoliths where all knowledge lives in trained weights. PRISM-Holo violates the law's premise in two ways: the memory path has zero trained weights (algebra), and PCS trains most tokens at smaller capacity. Four independent factors compound:

Factor	Saving	Mechanism
1. Memory expert = free algebra	~25% fewer trained params	The HoloHead replaces `q_proj`/`read_out`/`write_gate`/`erase_gate` with fixed VSA bind/unbind. Zero gradient on the memory path.
2. Progressive Capacity Stacking	~40-50% less wall-clock	Train 350M → 700M → 1B; `grow_model()` inherits weights at each stage. The bulk of tokens train at the cheaper small scale.
3. Modular parallelism	up to ~3× wall-clock	The neural / symbolic / memory experts train on separate GPU pools simultaneously (`--modular-phase`).
4. Free one-shot retrieval	eliminates RAG infra	No separate retriever model, no retrieval fine-tuning, no context-window inflation at inference. `tape.bind()` is instant.

Compounded wall-clock estimate (8×A100, bf16, 50B tokens):

Config	Hours	vs Transformer 1B baseline
Transformer 1B from-scratch (industry baseline)	8.7	1.0×
PRISM-Holo 1B (holo_mode only)	6.5	1.3× faster
PRISM-Holo 1B + PCS	3.5	2.5× faster
PRISM-Holo 1B + PCS + modular parallel (24 GPUs)	~1.5	5.8× faster

Caveat (honest): these are architectural-accounting extrapolations from the parameter counts and the PCS smoke validation, not direct 1B-scale benchmarks. Factor 1 (25% fewer params) is exact; factors 2-3 are well-motivated estimates; factor 4 (RAG savings) is an infrastructure cost that depends on deployment. The 1B measurement is the remaining validation.

What you save per deployment:

No retraining for new knowledge. Adding a client's knowledge base is a tape.bind() call — minutes of CPU, not days of GPU. A 1000-client deployment serves one trained model + 1000 cheap tapes, not 1000 fine-tunes.
No separate RAG pipeline. The retriever is the model's own read head (algebra). No embeddings index to maintain, no reranker, no chunking.
Shorter training. A team with 8×A100 reaches a usable 1B model in an afternoon (~3.5h with PCS), not a week.

Layers of work in this repo

PRISM-KB.md — seed the tape for one-shot learning. Honest baseline: +0.006 specificity (random) without training.
COGLOOP.md — PERCEIVE → REFLECT → RESPOND → CONSOLIDATE cognitive loop. Persistent memory, multi-pass reflection, double-layer consolidation. Phase 3 probe confirmed the neural read head doesn't generalize from 40 seeds.
PRISM-HOLO.md — the answer. Algebraic VSA tape. +0.355 specificity, zero training, 200 facts 100% retrieved. This is the real breakthrough.
PCS (Progressive Capacity Stacking) — scale 350M → 700M → 1B with weight inheritance. ~40-50% wall-clock reduction vs from-scratch 1B. --progressive flag in run_holo_train.py.

Modules:

Layer	Files
KB seeding	`encoder/kb/incontext/generate/ingest`
COGLOOP	`capture/reflect/cogmemory/cogloop`
Holo memory	`holo.py` (HoloTape, HoloEncoder, HoloHead)
Holo training	`holo_data.py`, `holo_loss.py`, `run_holo_train.py`
PCS (scaling)	`pcs.py` (grow_model, stage schedule)
Two-axes demo	`two_axes_demo.py` (scaling + knowledge composed)
Phase 3 probe	`tasks/retrieval.py`, `train_retrieval.py`

100 tests (52 PRISM + 48 KB/COGLOOP/Holo/PCS), all passing.

The base PRISM architecture below is unchanged — see the upstream repo for full details.

PRISM

Polymorphic Recurrent Intelligence with Shared Memory.

A novel language-model architecture that unifies four paradigms — sub-quadratic recurrent mixing, heterogeneous Mixture-of-Experts, differentiable external memory, and differentiable symbolic reasoning — under a single abstraction.

📺 8-minute explainer video: renders/prism_explainer.mp4 covers the architecture and how PRISM differs from standard LLMs.

The idea in one sentence

A router picks, for each token, which kind of computation it needs: transform it (neural), store or retrieve it (memory), or reason over it (symbolic) — all sharing one differentiable memory tape.

Why this is different from anything on the market

Family	Core mechanism	PRISM's distinction
Transformer	O(n²) self-attention	replaced by Multi-Rate Bus (O(n))
Mamba / RWKV	single-rate recurrence	generalized to structured multi-rate with a per-token scale gate
MoE Transformers	routed homogeneous MLPs	generalized to heterogeneous experts (neural / memory / symbolic)
Memory nets (RETRO, NTM)	external retrieval, orthogonal to MoE	unified: memory is the communication channel between expert kinds
Neuro-symbolic	neural + external solver (not differentiable)	symbolic primitives live in the weights, trained end-to-end

No model combines all four. PRISM does.

Architecture

                ┌─────────────────────────────────────────┐
   token ──►    │  Multi-Rate Bus (MRB)                   │   K recurrent filters,
                │  + learned per-token scale gate         │   log-spaced decay rates
                └────────────────┬────────────────────────┘
                                 ▼
                        ┌────────────────┐
                        │  Polymorphic   │   picks ONE per token:
                        │    Router      │     neural / memory / symbolic
                        └───┬────┬───┬───┘
                            ▼    ▼   ▼
                       ┌────┐┌────┐┌─────────┐
                       │ N  ││ M  ││    S    │
                       │(MLP)││(rw)││(primitives)
                       └────┘└─┬──┘└─────────┘
                              ▼
                    ╔═══════════════════╗
                    ║  Shared Memory    ║   one tape flows through
                    ║      Bus          ║   ALL layers and time steps
                    ╚═══════════════════╝

Multi-Rate Bus — sub-quadratic backbone, O(n), interpretable scale gates.
Polymorphic Router — real top-k routing (k=2 for 300m+), epsilon-soft straight-through, load-balancing loss.
Shared Memory Bus — one S × d_mem tape, NTM-style read/write, shared across layers.
Symbolic Expert — 6 differentiable primitives (compare, gate, select, threshold, count, shift).

Full spec: docs/ARCHITECTURE.md.

Results (honest benchmark)

Same training loop, optimizer, hyperparameters, seeds. CPU-only, toy scale.

Task	Metric	PRISM	Transformer	SSM	Winner
Induction (lookup)	cross-entropy	1.78	1.95	2.04	PRISM 🏆
Copy (k=8)	cross-entropy	2.11	1.64	1.45	SSM
Mini-LM (char)	cross-entropy	0.89	1.33	0.31	SSM

PRISM wins on induction — the associative-lookup task its heterogeneous MoE was designed for. Routing inspection confirms genuine specialization: on a copy task the first layer routes ~53% to memory (storing the sequence); on lookup the last layer routes ~73% to the symbolic expert (reasoning). The architecture does what it claims.

On dense language modelling a state-space model wins — and that's documented honestly. PRISM's edge is compositional reasoning, not universal replacement.

Full analysis: RESULTS.md.

Training innovation levers (reduce compute, not physics)

Three independent training techniques — each a CLI flag, each measurable in isolation. They reduce the useful token count; FLOPs = 6·N·D still holds. Honest toy-scale results in RESULTS_LEVERS.md.

Lever	Flag	Toy-scale	Production
Modular pretraining	`--modular-phase neural\|memory\|symbolic\|assemble`	✅ Validated	Low risk, run it
Curriculum	`--curriculum`	✅ Validated	Low risk, run it
Token recycling	`--token-recycling`	⚠️ Negative (wrong regime)	Enable ≥1B tokens

Modular pretraining trains each expert kind separately on its optimal data (neural→text, memory→retrieval, symbolic→code+math), then assembles them with a short router+MRB fine-tune. 2–3× speedup via specialization + parallelism.

# Train experts in parallel (different GPU pools), then assemble
torchrun --nproc_per_node=8 -m prism.run_scale --preset 1b --modular-phase neural --out-dir runs/neural
torchrun --nproc_per_node=8 -m prism.run_scale --preset 1b --modular-phase symbolic --out-dir runs/symbolic
torchrun --nproc_per_node=8 -m prism.run_scale --preset 1b --modular-phase assemble \
    --modular-neural-ckpt runs/neural/ckpt-N --modular-symbolic-ckpt runs/symbolic/ckpt-M \
    --out-dir runs/prism-assembled

Curriculum re-weights the dataset mix across training: neural-heavy early (fluency) → memory (retrieval) → symbolic (reasoning). The dataloader rebuilds at each phase transition.

Quick start

git clone https://github.com/AFKmoney/prism.git
cd prism
pip install -e ".[dev]"          # torch, numpy, pytest

# Train PRISM on induction (its strongest task)
python -m prism.train --model prism --task induction --steps 400

# Compare all three models on a task
python -m prism.train --compare --task induction --steps 400

# Run the test suite (52 tests)
pytest tests/

Scaling to 1B params (real pretraining, needs GPUs)

A complete scaled training pipeline — multi-GPU DDP, bf16, gradient checkpointing, HuggingFace dataset streaming — is in TRAINING.md. Validate it on a laptop in <1 minute:

python -m prism.run_scale --smoke --phase pretrain   # tiny, CPU, no big downloads

Then train PRISM 1B for real (8×A100, ~1–2 weeks):

torchrun --nproc_per_node=8 -m prism.run_scale \
    --preset 1b --phase pretrain --steps 100000 \
    --seq-len 4096 --global-batch-size 256 --lr 3e-4 --dtype bf16 \
    --out-dir runs/prism-1b-pretrain --wandb

Project layout

prism/
├── prism/
│   ├── config.py        # all hyperparameters as dataclasses
│   ├── mrb.py           # Multi-Rate Bus (sub-quadratic temporal mixing)
│   ├── memory.py        # Shared Memory Bus + read/write head
│   ├── symbolic.py      # differentiable primitive library
│   ├── experts.py       # Neural / Memory / Symbolic experts
│   ├── router.py        # Polymorphic Router (real top-k, epsilon-soft, load-balance)
│   ├── block.py         # PRISM block (MRB + routing + residual)
│   ├── model.py         # full PRISM model
│   ├── baselines.py     # Transformer + SSM baselines (fair comparison)
│   ├── train.py         # toy training harness (CLI)
│   ├── train_scale.py   # 1b/350m/300m/tiny presets + dataset mixes
│   ├── data_scale.py    # streaming multi-dataset dataloader
│   ├── modular.py       # lever 1: modular pretraining + assemble_experts
│   ├── curriculum.py    # lever 2: curriculum schedule + token recycler
│   └── run_scale.py     # scaled trainer (DDP, bf16, checkpointing, levers)
├── tasks/
│   ├── copy.py          # working-memory probe
│   ├── induction.py     # associative lookup (PRISM's target)
│   ├── reasoning.py     # chained arithmetic (lever-3 validation target)
│   └── mini_lm.py       # char-level LM on bundled corpus
├── tests/               # 52 unit tests (33 core + 19 levers)
├── renders/
│   └── prism_explainer.mp4   # 8-min architecture explainer video
├── docs/ARCHITECTURE.md
├── TRAINING.md          # 1B-scale training guide (datasets, DDP, hyperparams)
├── RESULTS.md           # toy benchmark: PRISM vs Transformer vs SSM
├── RESULTS_LEVERS.md    # honest toy-scale results for the 3 training levers
└── pyproject.toml

Design principles

One abstraction, many implementations. The Expert interface is uniform; implementations differ in kind — this is what makes the routing meaningful.
Every parameter must learn. No dead weights. The last block drops its memory expert (writes would never be read) so nothing is wasted.
Interpretability by construction. The MRB gate and router selection are directly readable — you can see which scale and which strategy each token uses.
Honest baselines. Same loop, optimizer, seeds. Results reported whether or not PRISM wins.

Scaling to 1B+ parameters

The toy config (~186k params) runs on CPU. Four production presets are available via --preset:

Preset	Params	Target hardware
`tiny`	~7M	CPU (pipeline validation only)
`300m`	~319M	1×A100 80GB — the "small brain, big reasoning" thesis
`350m`	~371M	1×A100 80GB (validation runs)
`1b`	~1.8B	8×A100 80GB

The architecture scales: d_model, num_layers, num_rates, and num_slots are all in PrismConfig, and the code runs on GPU unchanged (.to('cuda')). The sub-quadratic backbone and constant-memory-size tape make long-context scaling particularly favourable. The 300m preset specifically compensates for fewer parameters with more rate groups (8), wider memory (64 slots), and top-2 routing — see RESULTS_LEVERS.md for the honest status of the "300M beats 1B" thesis.

License

MIT.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PRISM-KB

🏆 The breakthrough — PRISM-Holo

The two axes — be precise about what "no retraining" means

Training cost & savings — why PRISM-Holo is cheaper than a Transformer 1B

Layers of work in this repo

PRISM

The idea in one sentence

Why this is different from anything on the market

Architecture

Results (honest benchmark)

Training innovation levers (reduce compute, not physics)

Quick start

Scaling to 1B params (real pretraining, needs GPUs)

Project layout

Design principles

Scaling to 1B+ parameters

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
docs		docs
paper		paper
prism		prism
renders		renders
tasks		tasks
tests		tests
.gitignore		.gitignore
COGLOOP.md		COGLOOP.md
Maintenance.md		Maintenance.md
PRISM-HOLO.md		PRISM-HOLO.md
PRISM-KB.md		PRISM-KB.md
README.md		README.md
RESULTS.md		RESULTS.md
RESULTS_LEVERS.md		RESULTS_LEVERS.md
TRAINING.md		TRAINING.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

PRISM-KB

🏆 The breakthrough — PRISM-Holo

The two axes — be precise about what "no retraining" means

Training cost & savings — why PRISM-Holo is cheaper than a Transformer 1B

Layers of work in this repo

PRISM

The idea in one sentence

Why this is different from anything on the market

Architecture

Results (honest benchmark)

Training innovation levers (reduce compute, not physics)

Quick start

Scaling to 1B params (real pretraining, needs GPUs)

Project layout

Design principles

Scaling to 1B+ parameters

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages