Extension of #3280.
Where the per-voxel operation is light, using just one axis in the inner loop is sub-optimal, as the overhead of multi-threading management becomes non-trivial relative to the work done per voxel. However, for expensive operations it's preferable to have only one such axis, so that thread utilisation remains high toward the end of processing.
For each use of ThreadedLoop, manually set the number of inner axes based on knowledge of the complexity of the underlying operation.
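
To illustrate the granularity trade-off, here is a minimal sketch, not the ThreadedLoop implementation. It assumes that the first `num_inner_axes` axes of the image are iterated inside a single work item, and that each combination of the remaining axes is dispatched to the threads as one item. Both helper functions are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not MRtrix3 API): number of voxels processed inside
// one work item, assuming the first `num_inner_axes` axes form the inner loop.
std::size_t voxels_per_item(const std::vector<std::size_t>& dims, std::size_t num_inner_axes) {
  assert(num_inner_axes <= dims.size());
  std::size_t voxels = 1;
  for (std::size_t axis = 0; axis < num_inner_axes; ++axis)
    voxels *= dims[axis];
  return voxels;
}

// Hypothetical helper (not MRtrix3 API): number of work items shared between
// threads, i.e. the product of the sizes of all axes outside the inner loop.
std::size_t work_items(const std::vector<std::size_t>& dims, std::size_t num_inner_axes) {
  assert(num_inner_axes <= dims.size());
  std::size_t items = 1;
  for (std::size_t axis = num_inner_axes; axis < dims.size(); ++axis)
    items *= dims[axis];
  return items;
}
```

Under these assumptions, a 96 x 96 x 60 image with one inner axis yields 5760 work items of 96 voxels each, which suits expensive operations because threads can keep picking up small items until the very end. With two inner axes it yields 60 items of 9216 voxels each, which suits light operations because far fewer hand-offs between threads are needed.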