AFAIU, current build logic is as follows:
- Jobset gets evaluated, produces one job (
Build) for every attribute
- Build job gets executed on a machine, and that machine builds the derivation itself as well as all its dependencies
The problem with that is that the low granularity causes the queue runner to make poor use of its available resources. If there are many agents available and the queue runner needs to perform a build that involves lots of unrealised derivations, it will schedule the entire build on one agent. That one agent will spend a long time building derivations that could have been built elsewhere in the meantime, and all the other agents will remain idle.
The new build logic I'm proposing is:
- Evaluate the jobset, recursively discovering all the derivations in the graph, not just the top-level ones
- Walk the derivation graph from the top, and for each derivation: check if we need to build this derivation (e.g. not realised locally, no substitute available), and if not, eliminate it and its dependencies from the graph. (This is so we don't end up realising and caching things like
bootstrap-tools)
- For each derivation in the pruned graph: enqeue a
Build for this particular derivation, and make it depend on the Builds for all its dependencies
- Schedule
Builds like before
Now we have build parallelism across all of our agents.
Design note: There is already a concept of "build steps", but it seems to be unused. In 550 builds, I have not found a single build that has more than one step in my database. Build steps is how Hydra does things. I'm not sure if it supports cross-machine parallelism, but I imagine it does. My proposal is closer to what Hercules CI does, which I personally think is Clean.er.
circus=# select count(id), build_id from build_steps group by build_id having count(id) > 1;
count | build_id
-------+----------
(0 rows)
circus=# select count(id) from build_steps;
count
-------
550
(1 row)
AFAIU, current build logic is as follows:
Build) for every attributeThe problem with that is that the low granularity causes the queue runner to make poor use of its available resources. If there are many agents available and the queue runner needs to perform a build that involves lots of unrealised derivations, it will schedule the entire build on one agent. That one agent will spend a long time building derivations that could have been built elsewhere in the meantime, and all the other agents will remain idle.
The new build logic I'm proposing is:
bootstrap-tools)Buildfor this particular derivation, and make it depend on theBuilds for all its dependenciesBuilds like beforeNow we have build parallelism across all of our agents.
Design note: There is already a concept of "build steps", but it seems to be unused. In 550 builds, I have not found a single build that has more than one step in my database. Build steps is how Hydra does things. I'm not sure if it supports cross-machine parallelism, but I imagine it does. My proposal is closer to what Hercules CI does, which I personally think is
Clean.er.