perf: Reduce prove peak memory and switch to jemalloc #277
Open
Collaborator
@Paradox Can you please rebase it with main? PR makes quite a few changes which break compatibility, such as -
- Destructure WhirR1CSCommitment to drop masked/random polynomials before WHIR prove_batch/prove, saving ~256 MB in dual-witness path
- Defer public input weight vector allocation until after alphas are consumed
- Drop program and witness_generator before prove call (~60 MB)
- Add feature-gated jemalloc as default allocator (RSS: 2.39 GB -> 1.90 GB)
- Add release-fast build profile (30s vs 2.5min)

Profiling peak: 2.24 GB -> 1.92 GB
RSS with jemalloc: 1.90 GB (complete_age_check, 1.1M constraints)
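The destructuring step in the first bullet can be illustrated with a minimal sketch. The `Commitment` struct, its fields, and `prove` below are hypothetical stand-ins, not the actual `WhirR1CSCommitment` or WHIR API; the point is only that moving fields out of a struct makes each buffer independently droppable before a memory-heavy call.

```rust
/// Hypothetical stand-in for a commitment that bundles several large buffers.
struct Commitment {
    committed: Vec<u64>,
    masked: Vec<u64>,
    random: Vec<u64>,
}

/// Hypothetical stand-in for an expensive prove step that only needs `committed`.
fn prove(committed: &[u64]) -> u64 {
    committed.iter().fold(0u64, |acc, x| acc.wrapping_add(*x))
}

fn prove_with_early_drop(commitment: Commitment) -> u64 {
    // Destructure by value so each field becomes its own owned binding.
    let Commitment { committed, masked, random } = commitment;

    // Anything still needed from `masked` / `random` must be read before this point.
    // Dropping them here frees their heap buffers before the heavy phase starts,
    // instead of at the end of the function.
    drop(masked);
    drop(random);

    prove(&committed)
}
```

Without the destructuring, the whole struct would stay alive until the function returns, so the buffers would still be allocated while `prove` is at its peak.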
…commits Move drop(self.program) and drop(self.witness_generator) immediately after extracting public input indices, before the NTT-heavy commit phase. Also drop acir_witness_idx_to_value_map right after its last use in each branch rather than after both branches.
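A small sketch of the drop ordering this commit describes, assuming a prover that owns its inputs by value. `Prover`, `Tracked`, and the `"commit"` marker are hypothetical; the real types are not shown in this PR page. The log only exists to make the order of frees observable.

```rust
use std::cell::RefCell;
use std::rc::Rc;

type Log = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical large resource that records when it is freed.
struct Tracked {
    name: &'static str,
    log: Log,
}

impl Drop for Tracked {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Hypothetical prover; it has no `Drop` impl, so fields can be moved out of `self`.
struct Prover {
    program: Tracked,
    witness_generator: Tracked,
    public_input_indices: Vec<usize>,
}

impl Prover {
    fn prove(self, log: &Log) -> usize {
        // Last use of the data that is only needed for witness generation.
        let indices = self.public_input_indices;

        // Free both inputs immediately, before the memory-heavy phase.
        drop(self.program);
        drop(self.witness_generator);

        // Stand-in for the NTT-heavy commit phase.
        log.borrow_mut().push("commit");
        indices.len()
    }
}

fn drop_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let prover = Prover {
        program: Tracked { name: "program", log: Rc::clone(&log) },
        witness_generator: Tracked { name: "witness_generator", log: Rc::clone(&log) },
        public_input_indices: vec![0, 1],
    };
    prover.prove(&log);
    let events = log.borrow().clone();
    events
}
```

Calling `drop_order()` shows both inputs being released before the commit marker is recorded.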
b914aad to 65b12b0
65b12b0 to 099ac78
ashpect
requested changes
Feb 10, 2026
}
}

impl R1CSSolver for LazyR1CS {
Collaborator
The R1CSSolver implementation for LazyR1CS is the same as the one for R1CS. Instead of duplicating the code, consider extracting it into a common function, a macro, etc.
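One possible shape for that refactor, sketched with hypothetical names: a small accessor trait captures the only thing that differs between the eager and lazy types, and a blanket impl supplies the shared logic once. `HasRows`, `Solver`, `Eager`, and `Lazy` are illustrative stand-ins, not the crate's `R1CSSolver`, `R1CS`, or `LazyR1CS`.

```rust
use std::cell::OnceCell;

/// Hypothetical accessor: the only behaviour that differs between the two types.
trait HasRows {
    fn rows(&self) -> &[Vec<i64>];
}

/// Hypothetical solver trait whose logic is identical for both types.
trait Solver {
    fn row_sums(&self) -> Vec<i64>;
}

/// Blanket impl: the shared logic is written once for every `HasRows` type.
impl<T: HasRows> Solver for T {
    fn row_sums(&self) -> Vec<i64> {
        self.rows().iter().map(|row| row.iter().sum::<i64>()).collect()
    }
}

/// Eager variant: data is always materialized.
struct Eager {
    rows: Vec<Vec<i64>>,
}

impl HasRows for Eager {
    fn rows(&self) -> &[Vec<i64>] {
        &self.rows
    }
}

/// Lazy variant: data is materialized on first access and then cached.
struct Lazy {
    source: Vec<Vec<i64>>,
    cached: OnceCell<Vec<Vec<i64>>>,
}

impl HasRows for Lazy {
    fn rows(&self) -> &[Vec<i64>] {
        // Cloning stands in for the real decompression step.
        self.cached.get_or_init(|| self.source.clone())
    }
}
```

A generic free function or a macro would work as well; the blanket impl is just one option that keeps call sites unchanged.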
}
}

fn ensure_decompressed(&self) -> &(Interner, SparseMatrix, SparseMatrix, SparseMatrix) {
Collaborator
Consider returning Result<&(...)> for better error logging.
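A minimal sketch of what a fallible lazy accessor could look like, using only the standard library. `LazyValues`, `DecodeError`, and the little-endian decoding are hypothetical stand-ins for the real decompression and deserialization; the structure (check the cache, decode fallibly, then populate the cache) is the part being illustrated.

```rust
use std::cell::OnceCell;
use std::fmt;

/// Hypothetical error type for a failed decode.
#[derive(Debug, PartialEq)]
enum DecodeError {
    TruncatedBlob { len: usize },
}

impl fmt::Display for DecodeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DecodeError::TruncatedBlob { len } => {
                write!(f, "blob length {len} is not a multiple of 4")
            }
        }
    }
}

/// Hypothetical lazily decoded container.
struct LazyValues {
    compressed: Vec<u8>,
    cached: OnceCell<Vec<u32>>,
}

impl LazyValues {
    fn ensure_decoded(&self) -> Result<&Vec<u32>, DecodeError> {
        // Fast path: already decoded.
        if let Some(values) = self.cached.get() {
            return Ok(values);
        }
        // Fallible decode; the caller decides how to log or propagate the error.
        if self.compressed.len() % 4 != 0 {
            return Err(DecodeError::TruncatedBlob { len: self.compressed.len() });
        }
        let values: Vec<u32> = self
            .compressed
            .chunks_exact(4)
            .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        // Populate the cache only after decoding succeeded.
        Ok(self.cached.get_or_init(|| values))
    }
}
```

The decode runs outside `get_or_init` because the stable `OnceCell` API has no fallible initializer; a failed decode leaves the cache empty so a later call can retry.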
postcard::to_allocvec(&matrices).expect("Failed to serialize R1CS matrices");
let mut compressed = Vec::new();
{
let mut encoder = XzEncoder::new(&mut compressed, 6);
Collaborator
In file/bin.rs, the encoding used was xz level 9. It would be better to have a global const set to 9 and use it here as well.
zeroize = "1.8.1"
xz2 = "0.1.7"

/// After the first access the decompressed matrices live in `cached`,
/// so the compressed blob is dead weight. Call this after the first
/// access to reclaim ~10 MB for a typical circuit.
pub fn free_compressed(&mut self) {
Collaborator
Consider adding an assertion in free_compressed() to verify cache is populated: assert!(self.cached.get().is_some(), "Must access matrices before freeing");
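A sketch of how the suggested guard might look in context. `LazyBlob` and its XOR-based "decoding" are hypothetical stand-ins for the real lazily decompressed matrices; only the assertion message is taken from the comment above.

```rust
use std::cell::OnceCell;

/// Hypothetical container holding a compressed blob and its lazily decoded form.
struct LazyBlob {
    compressed: Vec<u8>,
    cached: OnceCell<Vec<u8>>,
}

impl LazyBlob {
    fn new(compressed: Vec<u8>) -> Self {
        Self { compressed, cached: OnceCell::new() }
    }

    /// Decode on first access; the XOR is a stand-in for real decompression.
    fn decoded(&self) -> &[u8] {
        self.cached
            .get_or_init(|| self.compressed.iter().map(|b| b ^ 0xFF).collect())
    }

    /// Release the compressed bytes once the decoded form is cached.
    fn free_compressed(&mut self) {
        // Guard against freeing the only copy of the data.
        assert!(
            self.cached.get().is_some(),
            "Must access matrices before freeing"
        );
        // Replacing the Vec releases its allocation; `clear()` would keep the capacity.
        self.compressed = Vec::new();
    }
}
```

With the guard in place, calling `free_compressed()` before any access panics instead of silently discarding data that can no longer be decoded.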
Reduce Peak Memory During Prove Step
Problem
After adding public inputs, the prove step for complete_age_check regressed from 1.84 GB to 2.24 GB peak memory.
Changes
Memory optimization (2.24 GB → 1.92 GB, −320 MB)
- Destructure WhirR1CSCommitment in both single and dual witness paths to take ownership of masked/random polynomials, enabling explicit drop() before entering WHIR's prove_batch/prove
- … (add_public_inputs_to_transcript + build_public_weights) to defer the 64 MB allocation until after alphas are consumed
- Drop program and witness_generator before the prove call since they are only needed during witness generation
Switch default allocator to jemalloc (RSS: 2.39 GB → 1.90 GB, −490 MB)
- … ProfilingAllocator, enabled by default
Add release-fast build profile
- cargo build --profile release-fast
- codegen-units = 16
- lto = "thin"
Benchmark (complete_age_check, 1.1M constraints)
Allocator Comparison
jemalloc was chosen as default for best RSS.
mimalloc was evaluated and rejected (worst RSS despite best wall-clock time).
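For readers unfamiliar with how a profiling allocator can report a peak figure, here is a minimal sketch of a peak-tracking wrapper around any `GlobalAlloc`. `PeakTracking` is a hypothetical illustration and is not the PR's `ProfilingAllocator`, whose implementation is not shown here. Note that it counts requested bytes, which is not the same thing as RSS.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper that tracks live and peak requested bytes.
struct PeakTracking<A> {
    inner: A,
    current: AtomicUsize,
    peak: AtomicUsize,
}

impl<A> PeakTracking<A> {
    const fn new(inner: A) -> Self {
        Self { inner, current: AtomicUsize::new(0), peak: AtomicUsize::new(0) }
    }

    fn current_bytes(&self) -> usize {
        self.current.load(Ordering::Relaxed)
    }

    fn peak_bytes(&self) -> usize {
        self.peak.load(Ordering::Relaxed)
    }
}

unsafe impl<A: GlobalAlloc> GlobalAlloc for PeakTracking<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { self.inner.alloc(layout) };
        if !ptr.is_null() {
            // Update the live byte count, then raise the peak if it was exceeded.
            let now = self.current.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            self.peak.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { self.inner.dealloc(ptr, layout) };
        self.current.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

// To install it process-wide, a binary would declare at crate root:
//   #[global_allocator]
//   static GLOBAL: PeakTracking<System> = PeakTracking::new(System);

/// Exercise the wrapper directly, without installing it globally.
fn demo_peak() -> (usize, usize) {
    let tracker = PeakTracking::new(System);
    let layout = Layout::from_size_align(1024, 8).unwrap();
    unsafe {
        let a = tracker.alloc(layout);
        let b = tracker.alloc(layout);
        assert!(!a.is_null() && !b.is_null());
        tracker.dealloc(a, layout);
        tracker.dealloc(b, layout);
    }
    (tracker.current_bytes(), tracker.peak_bytes())
}
```

Because the wrapper is generic over the inner allocator, swapping `System` for a jemalloc-backed allocator behind a Cargo feature would only change the type parameter; which crate the PR uses for that is not stated in this page.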
Root Cause Analysis
The remaining ~80 MB gap from the original 1.84 GB is fully accounted for by the public inputs weight vector:
… prove_batch (line 279, read-only external crate)
Before public inputs, there were 6 weights; after, 7.
This overhead is inherent to the protocol and cannot be reduced without modifying the WHIR crate or changing the proof transcript structure.
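As a rough sanity check of the 64 MB figure quoted earlier, the arithmetic can be sketched under assumptions that are not confirmed by this PR: that each weight vector is dense, holds one 32-byte field element per entry, and is padded to the next power of two above the constraint count. `FIELD_ELEMENT_BYTES` and `weight_vector_bytes` are hypothetical names.

```rust
/// Assumed size of one field element in bytes (a 256-bit field); not confirmed by the PR.
const FIELD_ELEMENT_BYTES: usize = 32;

/// Bytes for one dense weight vector over a domain padded to a power of two.
fn weight_vector_bytes(constraints: usize) -> usize {
    constraints.next_power_of_two() * FIELD_ELEMENT_BYTES
}
```

Under these assumptions, 1.1M constraints round up to 2^21 entries, and 2^21 × 32 bytes is 64 MiB, so going from 6 to 7 weights would add one such vector.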