datasetfoundry

datasetfoundry is a browser-based application for building, organizing, and exporting datasets. It runs entirely client-side, so your files never leave your machine.

Features

Multi-Source Ingestion: Drag and drop files, upload archives (.zip, .tar, .gz), or fetch an entire GitHub repository by URL.
Auto-Categorization: Sorts imported files into groups such as Code, Data, Documents, and Images.
Token Counting: Uses gpt-tokenizer to count BPE tokens for imported text files, making it easy to size a dataset.
Virtual File System (VFS): Navigate and manipulate the dataset from a modern UI, with summary statistics for total size, file count, and token count.
Advanced Export Pipeline: Export in .jsonl, .csv, .md, or raw file formats. Apply token caps, chunking rules, stop-word removal, and more.
Local Persistence: Data is stored in the browser's IndexedDB through localForage, so datasets persist between sessions.
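As an illustration of the auto-categorization feature, here is a minimal sketch that groups file names by extension. The extension table, the "Other" fallback, and the function names `categorize` and `groupByCategory` are assumptions made for this example; the application's actual rules may differ.

```typescript
// Category names follow the feature list; "Other" is an assumed fallback.
type Category = "Code" | "Data" | "Documents" | "Images" | "Other";

// Assumed extension table, for illustration only.
const EXTENSION_CATEGORIES: Record<string, Category> = {
  ts: "Code", tsx: "Code", js: "Code", py: "Code",
  json: "Data", jsonl: "Data", csv: "Data",
  md: "Documents", txt: "Documents", pdf: "Documents",
  png: "Images", jpg: "Images", jpeg: "Images", gif: "Images", svg: "Images",
};

// Map a single path or file name to a category using its extension.
function categorize(fileName: string): Category {
  const base = fileName.split("/").pop() ?? fileName;
  const dot = base.lastIndexOf(".");
  // No extension, or a dotfile such as ".gitignore": fall back to "Other".
  if (dot <= 0) return "Other";
  const ext = base.slice(dot + 1).toLowerCase();
  return EXTENSION_CATEGORIES[ext] ?? "Other";
}

// Group a list of file names into buckets keyed by category.
function groupByCategory(fileNames: string[]): Map<Category, string[]> {
  const groups = new Map<Category, string[]>();
  for (const name of fileNames) {
    const category = categorize(name);
    const bucket = groups.get(category) ?? [];
    bucket.push(name);
    groups.set(category, bucket);
  }
  return groups;
}
```

For example, `groupByCategory(["src/App.tsx", "photo.JPG", "notes.md"])` yields one entry each under Code, Images, and Documents. Matching on the lowercased extension keeps the lookup cheap, at the cost of misclassifying files whose extension does not reflect their content.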

Tech Stack

React + TypeScript
Tailwind CSS & Framer Motion
JSZip / libarchive.js (client-side archive handling)
gpt-tokenizer (token counting)
localForage (IndexedDB storage)
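The token caps and chunking rules mentioned under Features can be pictured with a small sketch: split a text file into line-aligned chunks that stay under a token budget, then serialize the chunks as .jsonl. Everything here is hypothetical (the `Chunk` shape and the names `chunkByTokenCap` and `toJsonl`) and is not taken from the datasetfoundry source. To keep the example dependency-free it uses a whitespace tokenizer; a BPE tokenizer such as the `encode` function from gpt-tokenizer could be passed in instead.

```typescript
// Any function that turns text into a list of tokens.
// gpt-tokenizer's `encode` fits this shape; the whitespace splitter below
// is a stand-in so the sketch runs without dependencies.
type Tokenize = (text: string) => unknown[];

const whitespaceTokenize: Tokenize = (text) =>
  text.split(/\s+/).filter((t) => t.length > 0);

// Hypothetical record shape for one exported chunk.
interface Chunk {
  source: string;
  index: number;
  text: string;
  tokens: number;
}

// Greedily pack whole lines into chunks of at most `maxTokens` tokens.
// A single line longer than the cap becomes its own oversized chunk.
// With a BPE tokenizer, per-line counts only approximate the count of the
// joined text, so the cap is a close estimate rather than a hard limit.
function chunkByTokenCap(
  source: string,
  text: string,
  maxTokens: number,
  tokenize: Tokenize = whitespaceTokenize,
): Chunk[] {
  if (maxTokens <= 0) throw new RangeError("maxTokens must be positive");
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  const flush = (): void => {
    // Skip empty or whitespace-only chunks.
    if (currentTokens > 0) {
      const chunkText = current.join("\n");
      chunks.push({
        source,
        index: chunks.length,
        text: chunkText,
        tokens: tokenize(chunkText).length,
      });
    }
    current = [];
    currentTokens = 0;
  };

  for (const line of text.split("\n")) {
    const lineTokens = tokenize(line).length;
    if (current.length > 0 && currentTokens + lineTokens > maxTokens) flush();
    current.push(line);
    currentTokens += lineTokens;
  }
  flush();
  return chunks;
}

// One JSON object per line, as the .jsonl format expects.
function toJsonl(chunks: Chunk[]): string {
  return chunks.map((chunk) => JSON.stringify(chunk)).join("\n");
}
```

Calling `toJsonl(chunkByTokenCap("notes.txt", fileText, 512))` would produce one JSON line per chunk. Packing whole lines keeps chunk boundaries readable; a stricter cap would require splitting inside long lines as well.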

Setup

npm install
npm run dev

Open http://localhost:3000 to start using datasetfoundry.
