Introduce ecoli.library.xarray_emitter#414
Conversation
Thanks for the PR! I'll do a more thorough review later, but at a glance, one thing I'd like to see is a demo for how to read/work with the Zarr output. Some ad hoc code in a Marimo notebook would suffice. I just want to get a sense for what that would look like and maybe run some benchmarks.
thalassemia
left a comment
Some preliminary thoughts. I really appreciate the detailed docstrings! Ultimately, I ended up focusing on those and just skimming the code, which feels way beyond my paygrade. I had to look so much stuff up and learn a bunch of unfamiliar Python syntax but still couldn't really follow along haha.
My biggest questions are:
- What will analyzing this output look like?
- How will this perform at scale in the cloud (read/write)?
```python
"""
assert engine.emitter is self
if emit_paths:
    state = self.ecoli_experiment.state
```
This should be engine.state.
```python
        cls, "Invalid argument", f"\"predicate\": [{config}]"))
    return cls(list(map(AtomicEmitPredicate.build, config)))
```
```python
@abstractmethod
```
Claude says the @abstractmethod decorator should be removed here and for ConjunctiveEmitPredicate.__call__.
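For context, the behavior being flagged can be reproduced with a small stdlib sketch (class names invented, not the PR's actual classes): a method that already has a concrete body but keeps the `@abstractmethod` decorator makes its class uninstantiable again.

```python
from abc import ABC, abstractmethod

class Predicate(ABC):
    @abstractmethod
    def __call__(self, value) -> bool: ...

# Hypothetical stand-in for the concrete subclass: the body is concrete,
# so the leftover @abstractmethod decorator only re-marks the class abstract.
class LeftoverDecorator(Predicate):
    @abstractmethod
    def __call__(self, value) -> bool:
        return True

class DecoratorRemoved(Predicate):
    def __call__(self, value) -> bool:
        return True

try:
    LeftoverDecorator()  # TypeError: the class is still considered abstract
    raised = False
except TypeError:
    raised = True
assert raised
assert DecoratorRemoved()(42) is True
```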
```python
@cached_property
def _warnings_eval_effect(self) -> list[WarningFilter]:
    return self.warnings_make_effect()
```
Claude also caught this minor error: should be self.warnings_eval_effect().
```rst
`conjunctive normal form`_ (CNF), where literals are atomic predicates
parametrized by the JSON configuration.

.. _conjunctive normal form: https://en.wikipedia.org/wiki/Conjunctive_normal_form
```
The whole emit predicate system is a really cool idea.
Thanks! This will enable more complex statistical aggregations over simulation trajectories.
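To make the idea concrete, here is a minimal stdlib sketch of a CNF emit predicate (all names and the atom schema are invented for illustration, not the PR's actual API):

```python
# Sketch: a predicate is a conjunction of clauses; each clause is a
# disjunction of atomic predicates built from a JSON-like configuration.
from typing import Any, Callable

Atom = Callable[[dict[str, Any]], bool]

def atom(config: dict[str, Any]) -> Atom:
    # One illustrative atom type: a threshold comparison on a state variable.
    key, op, ref = config["key"], config["op"], config["value"]
    ops = {">=": lambda a, b: a >= b, "<": lambda a, b: a < b, "==": lambda a, b: a == b}
    return lambda state: ops[op](state[key], ref)

def cnf(clauses: list[list[dict[str, Any]]]) -> Atom:
    atoms = [[atom(c) for c in clause] for clause in clauses]
    # CNF: every clause must hold; a clause holds if any of its literals holds.
    return lambda state: all(any(lit(state) for lit in clause) for clause in atoms)

# (time >= 10 OR mass < 1.0) AND (phase == "exp")
pred = cnf([
    [{"key": "time", "op": ">=", "value": 10}, {"key": "mass", "op": "<", "value": 1.0}],
    [{"key": "phase", "op": "==", "value": "exp"}],
])
assert pred({"time": 12, "mass": 2.0, "phase": "exp"}) is True
assert pred({"time": 5, "mass": 2.0, "phase": "exp"}) is False
```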
```python
dtype: str
#: Unit string to store as an attribute inside the node, provided either
#: as a string, as a :py:class:`pint.Unit`, or as a :py:class:`unum.Unum`.
unit: str | None = None
```
This is cool. Units have long been a pain point for us. I know Vivarium 2.0 addresses this somehow, but I also wanted to point out that it is possible to add custom key-value metadata to Parquet files (the `metadata` option in Polars' `write_parquet` and DuckDB's `parquet_kv_metadata`).
This is merely a pragmatic user-facing solution, configured in the emitter on a per-variable basis. The fundamental solution will be a globally consistent type system in the simulator framework that includes all variable units.
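For illustration, a per-variable configuration entry might look roughly like the following JSON fragment (the `variable_layout` key and the paths are hypothetical; only the `dtype` and `unit` fields appear in the quoted docstring):

```json
{
  "variable_layout": {
    "listeners/mass/cell_mass": {
      "path": "mass/cell",
      "dtype": "float32",
      "unit": "fg"
    }
  }
}
```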
```rst
.. hint::
    - The numbers of threads and concurrent connections in the transport layer
      have backend-specific configuration options (see
```
In my limited testing, runtime write performance seems quite good locally. The output size from one generation run with configs/test_configs/test_xarray_emitter.json was also 50% smaller than that of an equivalent sim run with the Parquet emitter (20 MB vs 30 MB)!
Looking forward to testing write performance on HPC/cloud, as well as read and aggregation performance over large datasets.
Stay tuned for an example aggregation script.
I encourage you to also experiment with variable-specific compression codecs on some variables for which you possess strong prior knowledge.
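A stdlib-only analogue of why such prior knowledge pays off (the data and codec choice are invented; in Zarr this would be per-array compressor/filter configuration): a near-monotonic counter compresses far better after a delta filter, at identical codec settings.

```python
# Compare zlib on raw int64 samples vs. the same samples after delta encoding.
import struct
import zlib

values = [1000 + i // 3 for i in range(10_000)]  # near-monotonic time series

raw = struct.pack(f"<{len(values)}q", *values)
deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
delta_raw = struct.pack(f"<{len(deltas)}q", *deltas)

plain = len(zlib.compress(raw, 6))
delta = len(zlib.compress(delta_raw, 6))
# The delta-filtered stream is mostly zero bytes, so it compresses much better.
assert delta < plain
print(f"plain: {plain} B, delta-filtered: {delta} B")
```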
```rst
libraries, the hierarchy of the output `DataTree`_ is decoupled from the
hierarchy of simulation :py:class:`~vivarium.core.store.Store`\ s, using an
output :ref:`variable layout <variable_layout>` specified in the :ref:`simulator
configuration <sim_config>`.
```
The decoupling of simulation store hierarchy and output hierarchy is interesting. The flexibility is very nice, especially for stores very deeply nested in the sim hierarchy. I'm a bit worried that it will complicate analysis scripts, which up to now have operated under the assumption that data will be located in a fixed location (tied to the commit used to run the workflow and verified by our CI workflow via ecoli/analysis/multivariant/dummy.py).
This design really reflects a difference in intended use cases. The decoupling of external from internal hierarchies may provide additional flexibility for "in-house" uses; however, it is primarily motivated as a way:
- to reduce the overhead for "external" downstream users of simulation outputs, who should not need to be familiar with `vivarium` store hierarchies they are not concerned with,
- and to facilitate future design efforts towards standardized high-performance emission formats.
Meanwhile, there is no a priori reason to port existing analysis scripts from ParquetEmitter to XarrayEmitter, unless somebody specifically decides that the potential increase in throughput is worth their time. In that scenario, tying the analysis script to the config JSON should still ensure reproducibility.
That makes sense to me. Another thing that would help is for our eventual Xarray analyses to include in their docstring what emitter_arg has to look like to work.
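As a rough sketch of the decoupling (the paths and the layout schema are invented; the PR's actual mechanism may differ), the emitter can consult an explicit mapping instead of mirroring the simulation hierarchy:

```python
# Hypothetical layout: internal store paths -> output tree paths.
layout = {
    ("agents", "0", "listeners", "mass", "cell_mass"): ("mass", "cell"),
    ("agents", "0", "bulk"): ("counts", "bulk"),
}

def relocate(emit: dict, layout: dict) -> dict:
    """Flatten emitted data, then rebuild it under the output hierarchy."""
    def walk(node, prefix=()):
        if isinstance(node, dict):
            for key, child in node.items():
                yield from walk(child, prefix + (key,))
        else:
            yield prefix, node

    out: dict = {}
    for path, value in walk(emit):
        target = layout.get(path)
        if target is None:
            continue  # variables outside the layout are simply not emitted
        cursor = out
        for part in target[:-1]:
            cursor = cursor.setdefault(part, {})
        cursor[target[-1]] = value
    return out

emit = {"agents": {"0": {"listeners": {"mass": {"cell_mass": 1234.5}}, "bulk": [1, 2]}}}
assert relocate(emit, layout) == {"mass": {"cell": 1234.5}, "counts": {"bulk": [1, 2]}}
```

An analysis script then depends only on the output paths declared in the config JSON, not on the commit-specific store hierarchy.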
```rst
- Supports *renaming and rearranging* of output variables.
- Requires the configuration of *output data types*.
- Supports the configuration of backend-specific *compression codecs*.
- Supports :ref:`log_updates`, i.e., the emission of individual
```
I played around with this to remind myself why log_updates are not supported by the Parquet emitter.
One issue is that log_updates can contain Python objects not supported by Parquet. This should be solvable with an extra serialization step.
Another issue is that columns in the output of the Parquet emitter can only be included or removed at the level of stores. Each process has its own store under log_updates that holds its entire update dictionary, so, for each process, you can only choose to include every column in that update dictionary or none. I think I could adapt your system for specifying emit paths to fix this as well as the currently broken emit_paths option (mentioned in my comment on ecoli_master_sim.py).
Indeed! You should be able to directly reuse the logic in:
- `EmitPath.emitting_path`
- `ForestView.matches_emitted_prefix_path()`
- `XarrayBuffer.write()`
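A rough stdlib approximation of the prefix matching involved (the name and semantics are assumed, not the PR's code) shows how per-column selection under a process's log_updates store could work:

```python
def matches_emitted_prefix(path: tuple[str, ...], emit_paths: set[tuple[str, ...]]) -> bool:
    """True if `path` is, or lies under, any configured emit path.

    Rough stand-in for matching like ForestView.matches_emitted_prefix_path();
    the actual semantics in the PR may differ.
    """
    return any(path[: len(p)] == p for p in emit_paths)

# Select only the "bulk" sub-dictionary of one process's update, rather than
# the all-or-nothing choice per process described above.
emit_paths = {("log_update", "ecoli-metabolism", "bulk")}
assert matches_emitted_prefix(("log_update", "ecoli-metabolism", "bulk", "ATP"), emit_paths)
assert not matches_emitted_prefix(("log_update", "ecoli-metabolism", "listeners"), emit_paths)
```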
```rst
`DataArray`_\ s within an output buffer.
- Decouples the in-memory buffer size from the persistent chunk size, in order
  to simplify performance tuning of large-scale simulations (see
  :py:class:`.XarrayTransducer` and :py:class:`.AsyncBufferWriter` for details).
```
Do you know how Zarr implements incremental writes under the hood? I'm especially curious about performance on high-latency cloud object storage.
Design rationale
The architectural design of concurrency control in the AsyncBufferWriter is documented here. The overall system will perform only as well as its assumptions match the actual conditions of the compute environment. In particular, if we treat the behavior of the Xarray and Zarr APIs as given, then the most important factors we can control are the following:
- How much time a buffer write cycle has available to complete concurrently, before it starts blocking the next buffer (buffer size * sampling interval / compute time per simulation step).
- How well we can reduce the amount of latency incurred by reducing the number of file system calls required (adequate chunk sizes, caching file handles and metadata).
- How well we can mask per-array latency by writing asynchronously to multiple arrays (number of arrays per buffer, number of async Zarr operations).
- How well we can leverage available total bandwidth by writing with multiple threads (number of Zarr worker threads).
- Whether we avoid congesting the available network or file system bandwidth with too many connections from many parallel simulation processes.
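Interpreting the first factor with invented numbers, a minimal sketch of the available write window:

```python
# All numbers are made up for illustration. The wall-clock window a buffer
# write cycle has before it would block the next buffer is the wall-clock
# time the simulation needs to fill one buffer.
buffer_size = 100          # emits per buffer
sampling_interval = 1.0    # simulated seconds per emit
sim_step = 1.0             # simulated seconds per simulation step
compute_per_step = 0.05    # wall-clock seconds per simulation step

steps_per_buffer = buffer_size * sampling_interval / sim_step
write_window = steps_per_buffer * compute_per_step
# A write cycle that takes longer than ~5 s of wall time starts to block.
assert abs(write_window - 5.0) < 1e-9
```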
Internals
And here is what Xarray and Zarr do under the hood, in their current implementations.
B. async setitem()
This is the simplified call stack, at the point in time when the low-level write operation for an individual array slice is sent to the Zarr API:
1. `.emitter.XarrayEmitter.flush()`
2. `.writer.AsyncBufferWriter.write()`
3. `.writer.AsyncBufferWriter._write()`
4. `.writer.AsyncBufferWriter.eval_effect()`
5. `.writer.AsyncArrayWriter.sync()`
6. `.zarr_writer.AsyncZarrArrayWriter._async()`
7. `zarr.AsyncArray.setitem`
A. set_variables()
Before being sent to (B.7), the low-level write operations are first generated by the Xarray API, at the following simplified call stack:
1. (as above)
2. (as above)
3. (as above)
4. `.zarr_writer.AsyncZarrBufferWriter.make_effect()`
5. `.zarr_writer._datatree_to_zarr()`
6. `xarray.backends.writers.dump_to_store()`
7. `xarray.backends.zarr.ZarrStore.store()`
8. `xarray.backends.zarr.ZarrStore.set_variables()`
9. `.writer.AsyncArrayWriter.add()`
In particular, (A.8) contains the Xarray logic for computing the indexing range for an appending write to an existing Zarr array. For a comparison, see Zarr's own "convenience function" zarr.core.array._append().
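The latency-masking factor from the design rationale can be modeled with a stdlib asyncio toy (function names invented; `zarr.AsyncArray.setitem` is stood in by a sleep): per-array writes issued concurrently complete in roughly the maximum, not the sum, of their latencies.

```python
import asyncio
import time

async def write_array(name: str, latency: float) -> str:
    await asyncio.sleep(latency)  # stand-in for one async array-slice write
    return name

async def flush_buffer(arrays: dict[str, float]) -> list[str]:
    # All per-array writes run concurrently, so the buffer flush takes about
    # max(latencies) rather than sum(latencies).
    return await asyncio.gather(*(write_array(n, t) for n, t in arrays.items()))

arrays = {"mass": 0.05, "counts": 0.05, "fluxes": 0.05}
start = time.perf_counter()
done = asyncio.run(flush_buffer(arrays))
elapsed = time.perf_counter() - start
assert sorted(done) == ["counts", "fluxes", "mass"]
assert elapsed < 0.14  # well under the 0.15 s a sequential flush would need
```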
This is all useful information. However, my question was more specifically about how objects in most cloud storage providers are immutable and do not support operations like append. Usually, you have to rewrite the entire object to update it. That could be inefficient if you are writing multiple buffers per chunk.
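To quantify that concern with invented numbers: if every object PUT rewrites the whole chunk, flushing several buffers per chunk multiplies the bytes uploaded.

```python
# Invented sizes illustrating rewrite amplification on object storage
# without append: each buffer flush that lands inside an existing chunk
# must re-upload the entire chunk object.
chunk_len = 1000      # samples per persisted chunk
buffer_len = 100      # samples flushed per buffer
bytes_per_sample = 8

buffers_per_chunk = chunk_len // buffer_len
# Uploads while one chunk fills: 100, 200, ..., 1000 samples re-uploaded.
uploaded = sum(i * buffer_len * bytes_per_sample for i in range(1, buffers_per_chunk + 1))
final = chunk_len * bytes_per_sample
amplification = uploaded / final
assert amplification == 5.5  # ~(n + 1) / 2 with n = 10 buffers per chunk
```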
Thank you for your initial review! It is already prompting several important fixes, and I am glad to hear that you see value in the new design draft.
After I address your most urgent comments, I will be happy to go over implementation details with you. Answering any questions you might have -- be it via code changes, comments, docs or a traced conversation -- is likely to reduce very real future maintenance costs.
This PR is obviously focused on the write operations from the emitter. However, I expect to attach an example analysis script for a small Nextflow workflow sometime in the next 1-2 days, and we can start iterating on the more subtle questions from there. For now,
But fundamentally, both APIs are isomorphic to a
Write operations
The most performance-critical configuration parameters that you may want to investigate for benchmarking are:
In general, the optimal resource parameters will depend on the workload and the compute infrastructure. The main goal of this PR is to enable declarative fine-grained control over these resource parameters, while subordinating the emitter architecture to explicit assumptions about the workflow-level communication pattern, as described in the top-level documentation. In particular, these explicit assumptions are intended to closely reflect the current practice, but in a way that can be easily generalized in subsequent work -- this concerns your questions about the current restrictions to single lineages, 1-dimensional arrays etc. Hopefully, you will come to agree that with this architecture, such generalizations will require only small code modifications. For the purpose of reviewing this PR, I suggest the following strategy:
Read operations
In practice, with either the Zarr API or the Xarray API, achieving good read performance for large workflows will require careful resource management of the compute graphs, analogously to how achieving good performance with DuckDB requires careful SQL design for nontrivial computations. This includes:
I will try to showcase several of the relevant API parameters as a starting point for your benchmarking and for future downstream applications. But fundamentally, this is a question of application-specific performance engineering, and hence, providing a general solution for analysis workloads is not in scope for this PR.
I finally got around to fixing

Do you know if Zarr supports data like the above? If not, instead of using
Rationale
From the main module documentation:
Notes for reviewers
`zarr-emitter` and `xarray-emitter`.