Fix: Overhaul NSRL workflow: streaming build, SHA1 verification, optional filter, download on install #166
Merged
steffenfritz merged 6 commits into steffenfritz:main on Apr 11, 2026
The Bloom filter was always sized for 40M items (hardcoded default), regardless of actual input size. When the NSRL hash file contained significantly more entries (e.g. 160M+), the undersized filter produced ~40-45% false positives instead of the target 0.01%. CreateNSRLBloom now pre-scans the input file to count actual hashes when estimatedItems is 0 (the new default), ensuring the filter is always correctly dimensioned for the target FPR. Fixes steffenfritz#165
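To see why a filter sized for 40M items degrades so badly at 160M entries, the textbook Bloom filter formulas can be evaluated directly. The following is a minimal sketch, not ftrove's code: `optimalParams` and `falsePositiveRate` are hypothetical helpers built on the standard approximations, and the bloom library ftrove actually uses may round its parameters differently.

```go
package main

import (
	"fmt"
	"math"
)

// optimalParams returns the textbook bit count m and hash count k for
// n items at target false-positive rate p.
// Hypothetical helper for illustration, not part of ftrove.
func optimalParams(n uint64, p float64) (m uint64, k uint64) {
	mf := math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2))
	kf := math.Round(mf / float64(n) * math.Ln2)
	return uint64(mf), uint64(kf)
}

// falsePositiveRate estimates the false-positive rate of a filter with
// m bits and k hash functions after n insertions: (1 - e^(-kn/m))^k.
func falsePositiveRate(m, k, n uint64) float64 {
	return math.Pow(1-math.Exp(-float64(k)*float64(n)/float64(m)), float64(k))
}

func main() {
	// Filter dimensioned for 40M items at a 0.01% target.
	m, k := optimalParams(40_000_000, 0.0001)
	fmt.Printf("bits=%d hashes=%d\n", m, k)
	fmt.Printf("FPR with 40M items:  %.6f\n", falsePositiveRate(m, k, 40_000_000))
	fmt.Printf("FPR with 160M items: %.3f\n", falsePositiveRate(m, k, 160_000_000))
}
```

Under these approximations the overloaded filter lands at roughly 0.4, in line with the ~40-45% reported above. The same sizing formula makes the bit count proportional to ln(1/p), which is why relaxing the target from 0.01% to 1% halves the filter size.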
Use -readonly mode for all sqlite3 SELECT queries in Taskfile.nsrl.yml to prevent accidental writes to the NSRL source databases. Add project guidance file for Claude Code.
…filter, download on install
- install.go: replace single NSRLBloomURL with per-variant constants (NSRLBloomURLModern/Mobile/All), add NSRLVariants map, make InstallFT accept nsrlVariant param, DownloadNSRLBloom takes explicit URL
- cmd/ftrove/main.go: add --nsrl-variant flag (default: all)
- README.md: update installation and NSRL sections to reflect that nsrl.bloom is no longer bundled; document --nsrl-variant flag and variants
- BUILDING.md: update dist:bundle description, remove nsrl.bloom prerequisite, update bloom sizes to 1% FPR values, add update workflow for publishing new release assets
steffenfritz
approved these changes
Apr 11, 2026
This PR fixes #165 and overhauls the entire NSRL workflow: memory/IO/disk
usage during bloom filter builds, file verification, making NSRL optional
at scan time, and distributing the bloom filter via GitHub Releases instead
of bundling it in the repository.
Changes
Remove `db/nsrl.bloom` from the repository
The pre-built bloom filter is too large to bundle in the repo. It is now distributed as a GitHub Release asset and downloaded automatically during `ftrove --install`.
`Taskfile.nsrl.yml`: streaming pipeline rewrite
- Hashes are streamed from the SQLite databases into `admftrove` via stdin, eliminating all temporary `.txt` files and cutting peak disk and memory usage to near zero.
- `SELECT COUNT(*) FROM DISTINCT_HASH` is run first so the bloom filter is correctly sized before ingestion begins.
- SHA1 verification: each NSRL input file is verified against its corresponding NIST-provided `.sha` file using `sha1sum -c`.
- `bash -eo pipefail`: Multi-database pipe commands are wrapped in an explicit `bash -eo pipefail -c '...'` invocation so that a failing `sqlite3` process propagates an error through the pipeline (Debian `/bin/sh` is `dash` and does not support `pipefail`).
- The previous 0.01% false-positive target was unnecessarily strict and produced a ~474 MB file. 1% halves the size with no practical impact on classification accuracy.
- Each `build-*` task ends with `task nsrl:test` (`go test -run TestBloomWithRealNSRL`) to verify the filter contains known NSRL hashes before the file is used.
- `check` task: Compares the version embedded in `db/nsrl.bloom` against `NSRL_VERSION` so operators can see at a glance whether an update is needed.
`nsrl.go`: stdin support and auto-count
- `CreateNSRLBloom` accepts `"-"` as the source path to read from `os.Stdin`; `--nsrl-estimate` is required in that mode.
- When `estimatedItems == 0` for a regular file, the file is pre-scanned to count non-empty lines (two-pass), guaranteeing correct filter sizing without the caller needing to supply an estimate.
`install.go`: configurable download on install
- Per-variant URL constants: `NSRLBloomURLModern`, `NSRLBloomURLMobile`, `NSRLBloomURLAll`.
- `NSRLVariants` map for validation.
- `InstallFT` accepts an `nsrlVariant` parameter (`"modern"`, `"mobile"`, `"all"`); defaults to `"all"`.
- `InstallFT` tries to copy a local `nsrl.bloom` first, falls back to downloading the selected variant, and continues gracefully if neither is available.
- `DownloadNSRLBloom(dst, url string)` takes an explicit URL so callers can target any variant.
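One way the variant validation in `install.go` could look is sketched below. Everything here is an assumption for illustration: `variantURLs` and `variantURL` are hypothetical names, the URLs are placeholders (the real release asset URLs are not given in this description), and treating an empty string as `"all"` is only one possible way to implement the documented default.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Placeholder URLs only; the real constants live in install.go.
var variantURLs = map[string]string{
	"modern": "https://example.invalid/nsrl-modern.bloom",
	"mobile": "https://example.invalid/nsrl-mobile.bloom",
	"all":    "https://example.invalid/nsrl-all.bloom",
}

// variantURL validates the requested variant and returns its download URL.
// An empty variant is treated as "all" (assumed default handling).
func variantURL(variant string) (string, error) {
	if variant == "" {
		variant = "all"
	}
	url, ok := variantURLs[variant]
	if !ok {
		names := make([]string, 0, len(variantURLs))
		for name := range variantURLs {
			names = append(names, name)
		}
		sort.Strings(names) // stable error message regardless of map order
		return "", fmt.Errorf("unknown NSRL variant %q (valid: %s)",
			variant, strings.Join(names, ", "))
	}
	return url, nil
}

func main() {
	for _, v := range []string{"", "modern", "legacy"} {
		url, err := variantURL(v)
		fmt.Printf("%q -> %q, err=%v\n", v, url, err)
	}
}
```

Validating the variant before any network access lets a typo in `--nsrl-variant` fail fast instead of surfacing as a download error.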
`cmd/ftrove/main.go`: `--nsrl-variant` flag + optional NSRL
- `--nsrl-variant` flag (default: `all`) selects which bloom filter variant is downloaded during `--install`.
- A missing bloom filter now produces a `Warn` log instead of a fatal `Error` + `os.Exit(1)`. The scan continues with `nsrlFilter = nil`.
- `Nsrlversion` in the session record is set to `"none"` when no filter is loaded.
- NSRL lookups run only when `nsrlFilter != nil`; all files get `Filensrl = "FALSE"` when no filter is present.
`nsrl_test.go`: new tests
- `TestBloomAutoCount`: verifies that `estimatedItems=0` triggers two-pass auto-count and all inserted hashes are found.
- `TestBloomStdinRequiresEstimate`: verifies that reading from stdin without an estimate returns an error.
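The optional-filter behaviour in `cmd/ftrove/main.go` comes down to a nil guard around the lookup. The sketch below is a hedged illustration rather than the project's code: `hashFilter`, `setFilter`, and `nsrlFlag` are hypothetical names, the exact-match set merely stands in for the bloom filter, and `"TRUE"` as the value for a hit is an assumption (only `"FALSE"` is stated above).

```go
package main

import "fmt"

// hashFilter is a hypothetical stand-in for the loaded bloom filter type.
type hashFilter interface {
	Test(hash []byte) bool
}

// setFilter is a toy exact-match filter used only for this illustration.
type setFilter map[string]bool

func (s setFilter) Test(hash []byte) bool { return s[string(hash)] }

// nsrlFlag returns the per-file NSRL value. With no filter loaded the
// lookup is skipped and the result is "FALSE".
func nsrlFlag(filter hashFilter, sha1hex string) string {
	if filter != nil && filter.Test([]byte(sha1hex)) {
		return "TRUE" // assumed value for a filter hit
	}
	return "FALSE"
}

func main() {
	const h = "da39a3ee5e6b4b0d3255bfef95601890afd80709" // arbitrary sample hash
	known := setFilter{h: true}

	fmt.Println(nsrlFlag(known, h)) // filter loaded
	fmt.Println(nsrlFlag(nil, h))   // no filter: lookup skipped
}
```

One Go caveat applies to this pattern: a nil pointer stored in an interface value does not compare equal to `nil`, so the guard is only reliable when the variable itself is left unset, as with `nsrlFilter = nil`.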
`testdata/nsrl_known_hashes.txt`
Replaced placeholder hashes with 10 real SHA1 hashes extracted from NSRL RDS 2026.03.1 modern minimal, verified present in the built bloom filter.
Creating the GitHub Release assets
After merging, build all three variants and upload them to a release:
The URL constants in `install.go` must be updated whenever a new NSRL build is published.
Usage
Test plan
- `go test -v ./...` passes
- `task nsrl:build-all` completes: verifies SHA1, builds bloom, runs `TestBloomWithRealNSRL`
- `ftrove --install <dir>` downloads the `all` variant by default
- `ftrove --install <dir> --nsrl-variant modern` downloads the modern variant
- `ftrove` scan completes without `db/nsrl.bloom` present (NSRL checks skipped, no exit)
- [...] `nsrl-2026.03.1` after building

Closes #165