A lightweight command-line package to generate and manage CellProfiler analysis jobs for HPC clusters. cptools2 builds image lists, splits work into jobs, creates space-optimized plate batches, generates submission scripts, and can join and optionally transfer result CSVs after analysis.
- Packaging consolidated under `pyproject.toml` with modern metadata and direct references to supporting parser utilities.
- Scratch-space batching defaults increased to 75% utilisation with a 30% per-plate overhead buffer.
- Installation instructions updated for both `pip` and `uv` workflows.
- NEW: Automatic LoadData metadata enrichment integrated into the join workflow: output CSVs now include complete well, site, and plate metadata.
- Release highlights (v1.0.0)
- Installation
- Quick usage
- Metadata enrichment
- YAML configuration
- Behavior notes
- Developer quickstart & testing
- HPC validation checklist
- Contributing
- License
```shell
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e .[dev]
```

Notes:

- Runtime dependencies (`pandas`, `pyyaml`, `parserix`, `scissorhands`) are resolved automatically via `pyproject.toml`.
- Use `pip install .` for a pure runtime install without developer extras.

With `uv`:

```shell
uv venv
uv pip install .[dev]   # or: uv pip install .
```

To run commands without activating the environment explicitly, prefix with `uv run`, e.g. `uv run pytest`.
Generate a full workflow (creates loaddata, per-plate command files and a master submit script):

```shell
cptools2 generate config.yml
```

Join chunked CSV outputs after analysis (one or more patterns):

```shell
cptools2 join --location /path/to/location --patterns Image.csv Cells.csv
```

The join command automatically enriches output files with LoadData metadata before concatenation, ensuring complete metadata columns in the final output.
When you run `cptools2 join`, the tool automatically enriches all output CSV files with metadata from the corresponding LoadData files. This happens before concatenation to preserve metadata integrity.
Why this matters:
Each CellProfiler chunk has sequential ImageNumber values (1, 2, 3, ...). If chunks are concatenated first, these ImageNumbers become duplicated and lose meaning. By enriching each chunk individually before concatenation, every row retains its correct metadata (well, site, plate, etc.).
Example workflow:

```text
Input structure:
  loaddata/plate_0.csv        <- Contains Metadata_well, Metadata_site, etc.
  loaddata/plate_1.csv
  raw_data/plate_0/Image.csv  <- No metadata, just measurements
  raw_data/plate_1/Image.csv

Processing:
  1. Enrich chunk 0: join Image.csv with loaddata/plate_0.csv on ImageNumber
  2. Enrich chunk 1: join Image.csv with loaddata/plate_1.csv on ImageNumber
  3. Concatenate enriched chunks into joined_files/plate_Image.csv

Output:
  joined_files/plate_Image.csv  <- Complete metadata preserved!
```
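The per-chunk enrichment described above boils down to a pandas left join on `ImageNumber`, performed before any concatenation. A minimal sketch (the toy DataFrames and the `Cells_Count` column are illustrative stand-ins, not cptools2 internals):

```python
import pandas as pd

# Toy stand-ins for one chunk's measurement CSV and its LoadData file.
measurements = pd.DataFrame({"ImageNumber": [1, 2], "Cells_Count": [40, 55]})
loaddata = pd.DataFrame({
    "ImageNumber": [1, 2],
    "Metadata_well": ["A01", "A02"],
    "Metadata_site": [1, 1],
})

def enrich_chunk(measurements, loaddata):
    """Left-join metadata onto measurements; rows without a matching
    LoadData entry end up with null metadata."""
    return measurements.merge(loaddata, on="ImageNumber", how="left")

# Enrich each chunk *before* pd.concat, so the per-chunk ImageNumbers
# (which restart at 1 in every chunk) keep their correct metadata.
enriched = enrich_chunk(measurements, loaddata)
```

Enriching first and only then concatenating (with `pd.concat(..., ignore_index=True)`) is what keeps the duplicate per-chunk ImageNumbers from colliding.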
To disable metadata enrichment (for debugging), use `--no-enrich-metadata`:

```shell
cptools2 join --location /path/to/location --patterns Image.csv --no-enrich-metadata
```

Troubleshooting:

- If metadata enrichment fails, check that LoadData CSV files exist in `location/loaddata/` with filenames matching the chunk pattern (e.g., `plate_0.csv`).
- Verify that both the LoadData and output CSVs have an `ImageNumber` column.
- Check for ImageNumber mismatches: output CSVs with more rows than their corresponding LoadData CSV will have null metadata for the unmatched rows.
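The last point can be checked quickly with pandas before joining. A diagnostic sketch (not a cptools2 command; the inline CSVs are toy data):

```python
import io
import pandas as pd

def unmatched_image_numbers(measurements, loaddata):
    """ImageNumbers present in the output CSV but absent from LoadData;
    these rows receive null metadata after enrichment."""
    return sorted(set(measurements["ImageNumber"]) - set(loaddata["ImageNumber"]))

# Toy example: the output CSV has one more row than its LoadData file.
measurements = pd.read_csv(io.StringIO("ImageNumber,Cells_Count\n1,40\n2,55\n3,12\n"))
loaddata = pd.read_csv(io.StringIO("ImageNumber,Metadata_well\n1,A01\n2,A02\n"))
print(unmatched_image_numbers(measurements, loaddata))  # -> [3]
```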
cptools2 accepts a YAML configuration file describing the experiment, pipeline, and optional features such as batching and transfer. A canonical example is included at `tests/new_config.yaml`.

Sanitized example (matches `tests/new_config.yaml`):

```yaml
chunk: 96
join_files:
  - Image.csv
location: /path/to/scratch/$USER/project/outputs
commands location: /path/to/scratch/$USER/project/commands
pipeline: /path/to/pipeline.cppipe
add plate:
  - experiment: /path/to/imagexpress/experiment
    plates:
      - plate_1
      - plate_2
data_destination: /path/to/datastore/project/data
```

Common fields:

- `experiment`/`add plate`: where to find image data and which plates to include
- `chunk`: desired images-per-job (integer)
- `pipeline`: path to the `.cppipe` CellProfiler pipeline
- `location`: base location for image outputs
- `commands location`: directory to write command files
- `join_files`: list of CSV filenames to join after analysis
- `data_destination`: optional path for post-join transfer
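Since `pyyaml` is already a runtime dependency, a config can be sanity-checked before running `generate`. A sketch (the required-key list is an assumption based on the fields above, not cptools2's actual validation logic):

```python
import yaml

# Assumed minimum keys, taken from the "Common fields" list above.
REQUIRED = ("pipeline", "location", "commands location", "chunk")

def validate_config(text):
    """Parse a cptools2-style YAML config and check the required keys."""
    config = yaml.safe_load(text)
    missing = [key for key in REQUIRED if key not in config]
    if missing:
        raise ValueError(f"config is missing keys: {missing}")
    return config

example = """
chunk: 96
pipeline: /path/to/pipeline.cppipe
location: /path/to/outputs
commands location: /path/to/commands
join_files:
  - Image.csv
"""
config = validate_config(example)
```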
Advanced sections:

- `batching`: overrides for automatic batching (if supported)
- `transfer`: transfer provider configuration (S3 or other); cptools2 will write transfer metadata/commands, but actual transfer depends on runner hooks
- `generate` will discover plates under the `experiment`, create image lists, split jobs according to `chunk`, apply batching overrides (if present), and write command files into `commands location`.
- `join` concatenates the chunked CSV outputs after analysis; provide filename patterns to target.
- Transfer entries in the config are optional. `generate` records transfer commands/metadata; running transfers typically requires cluster-side hooks or CI steps that read the produced metadata.
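Conceptually, `join` gathers each pattern's per-chunk files and concatenates them (with enrichment applied per chunk first). A simplified sketch, assuming the `raw_data/plate_*/` layout shown earlier and skipping the enrichment step:

```python
import tempfile
from pathlib import Path
import pandas as pd

def join_pattern(location, pattern):
    """Concatenate every per-chunk CSV matching `pattern` under location/raw_data."""
    files = sorted(Path(location, "raw_data").glob(f"*/{pattern}"))
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Demo on a throwaway directory tree mimicking the layout above.
root = Path(tempfile.mkdtemp())
for i in range(2):
    chunk_dir = root / "raw_data" / f"plate_{i}"
    chunk_dir.mkdir(parents=True)
    (chunk_dir / "Image.csv").write_text(f"ImageNumber,Cells_Count\n1,{10 + i}\n")

joined = join_pattern(root, "Image.csv")
print(len(joined))  # -> 2, one row per chunk
```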
Run tests:
```shell
uv run pytest   # or: pytest
```

Run linters/formatters (configured via pre-commit):

```shell
pre-commit run --all-files
```

These steps mirror the checks typically performed on the Eddie HPC cluster:
- Scratch quota assessment – run `cptools2 generate` with real experiment configs to confirm 75% utilisation and overhead handling fits within assigned scratch space.
- Batch command generation – verify per-batch command files (`staging_batch_*.txt`, `cp_commands_batch_*.txt`) are produced and reference the expected plates.
- CellProfiler dry run – execute one batch via the HPC queue to confirm staging, CellProfiler invocation, and cleanup complete without exceeding scratch limits.
- Packaging install test – from a clean node, run `pip install git+https://github.com/CarragherLab/cptools2@v1.0.0` (or sync via `uv`) to ensure dependencies resolve correctly.
- Post-analysis join – validate `cptools2 join` against batch outputs for consistency with historical runs.
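The 75% utilisation / 30% overhead budget in the first checklist item can be estimated with back-of-the-envelope arithmetic. An illustrative sketch of that estimate (it mirrors the documented defaults, not the actual batching code):

```python
def plates_per_batch(scratch_bytes, plate_bytes,
                     utilisation=0.75, overhead=0.30):
    """How many plates fit in one batch: spend at most `utilisation` of
    scratch, budgeting each plate at (1 + overhead) x its raw size."""
    budget = scratch_bytes * utilisation
    per_plate = plate_bytes * (1 + overhead)
    return int(budget // per_plate)

# e.g. 2 TB of scratch and 150 GB plates
print(plates_per_batch(2e12, 150e9))  # -> 7
```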
Document outcomes for each release to maintain an audit trail.
- Open issues or PRs describing bugs or enhancements.
- Keep changes small and focused; tests should accompany functional changes.
This project is distributed under the MIT License. See LICENSE.