# Store large analysis outputs with artifact sidecars

This how-to is for contributors whose analysis produces arrays, per-frame profiles, event tables, or other data that should not be stored directly in an artifact JSON payload. Use sidecars when you need durable data for aggregation, comparison, or plots, but the data is too large or too structured for the small JSON summary in a `ReplicateArtifact` or `ConditionArtifact`.

## Choose payload fields vs artifact sidecars

Keep the artifact payload small and JSON-compatible. Put enough information in the payload to identify and summarize the result; put bulky data in sidecars and register those sidecars on the artifact.

### Put scalar metrics in the artifact payload

Store finite scalar values used by default aggregation in `payload["metrics"]`. This lets aggregation combine replicate-level values directly without opening a large sidecar file.

### Put compact labels and dimensions in the payload or metadata

Counts, labels, selections, dimensions, and compact summaries can live in the artifact `payload` or `metadata`. These fields make the artifact readable and auditable without loading arrays or tables.

### Put arrays in NPZ sidecars

Per-frame arrays, residue-by-frame matrices, distance matrices, profiles, and other bulky numeric outputs belong in NPZ sidecars. NPZ keeps arrays typed and compact, and the artifact store records validation metadata for the file.

### Put event tables in CSV or table sidecars

Event rows, contact tables, and per-frame tabular records belong in registered CSV or other table sidecars. Tables stay streamable and do not bloat the JSON payload.

### Never store raw MDAnalysis Results objects

Raw MDAnalysis `Results` objects do not belong in the payload, metadata, or sidecars. They are runtime containers, not durable artifacts. Map their contents to JSON primitives, NPZ arrays, and registered table sidecars before constructing the artifact.

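
The split described above can be sketched with the standard library alone. This is a minimal illustration, not part of the framework: `split_results` is a hypothetical helper, and plain Python lists stand in for the NumPy arrays a real collector would hand to an NPZ sidecar.

```python
import json
import math
from collections.abc import Mapping, Sequence


def split_results(results: Mapping[str, object]) -> tuple[dict, dict]:
    """Split a results-like mapping into a small payload and bulky data.

    Hypothetical helper: finite floats go to payload["metrics"], compact
    labels and counts go to the payload, and sequences are returned
    separately so the caller can write them to a sidecar.
    """
    payload: dict = {"metrics": {}}
    bulky: dict = {}
    for key, value in results.items():
        if isinstance(value, (bool, int, str)):
            payload[key] = value  # compact label, flag, or count
        elif isinstance(value, float):
            if not math.isfinite(value):
                raise ValueError(f"{key!r} is not finite: {value!r}")
            payload["metrics"][key] = value
        elif isinstance(value, Sequence):
            bulky[key] = list(value)  # belongs in a sidecar, not in JSON
            payload[f"n_{key}"] = len(value)  # keep only the dimension
        else:
            raise TypeError(f"{key!r} has unsupported type {type(value).__name__}")
    # Fail early if anything non-JSON slipped into the payload.
    json.dumps(payload, allow_nan=False)
    return payload, bulky


example_results = {
    "mean_distance_nm": 0.42,
    "selection": "protein",
    "distances_nm": [0.40, 0.41, 0.45],
}
payload, bulky = split_results(example_results)
```

Here `payload` keeps the scalar metric, the label, and the array length, while `bulky` holds the values that would be written to a sidecar.
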
## Use store-relative sidecar paths

Sidecar paths are artifact-store-relative paths such as `sidecars/pair_distances.npz`. Do not store absolute paths in artifact payloads. Use the `ArtifactSidecarRef` returned by `ArtifactStore`; it records the relative path, SHA-256 digest, size, media type, and metadata.

Store-relative paths make artifacts portable across machines and build directories. Later aggregation, comparison, and plotting code should resolve and validate sidecars through `ArtifactStore`, not by joining absolute paths saved in payload fields.

## Write an NPZ array sidecar from a collector

In a collector, use the `artifact_store` provided by `MDACollectorContext`. The store writes the file under the replicate artifact directory and returns a registered sidecar reference.

```python
import numpy as np

from polyzymd.analyses.mda import (
    MDACollectorContext,
    MDAJobResult,
    ReplicateArtifact,
    frame_selection_payload,
    strict_json_payload,
)


class DistanceMatrixCollector:
    """Collect one completed distance job into a sidecar-backed artifact."""

    def __call__(
        self,
        ctx: MDACollectorContext,
        completed_jobs: list[MDAJobResult],
    ) -> ReplicateArtifact:
        if len(completed_jobs) != 1:
            raise ValueError(f"Expected one job, got {len(completed_jobs)}")

        job = completed_jobs[0]
        distances = np.asarray(job.results["distances_nm"], dtype=np.float64)
        frames = np.asarray(job.results["frames"], dtype=np.int64)

        metadata = {"result_kind": "distance_matrix_replicate"}
        if ctx.settings_fingerprint is not None:
            metadata["settings_fingerprint"] = ctx.settings_fingerprint

        sidecar = ctx.artifact_store.write_npz_sidecar(
            "sidecars/distance_matrix.npz",
            distances_nm=distances,
            frames=frames,
            metadata={
                "kind": "distance_matrix",
                "layout": "pair_x_frame",
                "shape": list(distances.shape),
            },
        )

        return ReplicateArtifact(
            analysis_name=ctx.analysis_name,
            condition_label=ctx.condition_label,
            replicate=ctx.replicate,
            payload={
                "n_pairs": int(distances.shape[0]),
                "n_frames": int(distances.shape[1]),
                "mean_distance_nm": float(np.nanmean(distances)),
                "metrics": {"mean_distance_nm": float(np.nanmean(distances))},
            },
            sidecars=[sidecar],
            provenance={
                "source": "distance_matrix_job",
                "frame_selection": frame_selection_payload(ctx.frame_selection),
                "universe_policy": strict_json_payload(
                    ctx.universe_policy.as_dict(),
                    analysis_name=ctx.analysis_name,
                ),
            },
            metadata=metadata,
            warnings=list(ctx.warnings),
        )
```

The payload contains dimensions, a scalar summary, and optional default aggregation metrics. The complete per-pair, per-frame array stays in the NPZ sidecar.

## Write and register a CSV table sidecar

For tabular outputs, write the file under a store-relative `sidecars/` path and then call `register_sidecar()`. The example below uses Python's standard `csv` module, but the same pattern works for any writer that creates a file under the artifact store root.

```python
import csv

from polyzymd.analyses.mda import MDACollectorContext, ReplicateArtifact


def write_contact_events(
    ctx: MDACollectorContext,
    rows: list[dict[str, int | str | float]],
) -> ReplicateArtifact:
    sidecar_path = "sidecars/contact_events.csv"
    csv_path = ctx.artifact_store.resolve_sidecar(sidecar_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)

    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=["frame", "protein_resid", "polymer_atom", "distance_nm"],
        )
        writer.writeheader()
        writer.writerows(rows)

    sidecar = ctx.artifact_store.register_sidecar(
        sidecar_path,
        media_type="text/csv",
        metadata={
            "kind": "contact_events",
            "n_rows": len(rows),
            "columns": ["frame", "protein_resid", "polymer_atom", "distance_nm"],
        },
    )

    metadata = {"result_kind": "contact_events_replicate"}
    if ctx.settings_fingerprint is not None:
        metadata["settings_fingerprint"] = ctx.settings_fingerprint

    return ReplicateArtifact(
        analysis_name=ctx.analysis_name,
        condition_label=ctx.condition_label,
        replicate=ctx.replicate,
        payload={
            "n_contact_events": len(rows),
            "event_table_sidecar": sidecar.path,
        },
        sidecars=[sidecar],
        provenance={"source": "contact_event_collector"},
        metadata=metadata,
        warnings=list(ctx.warnings),
    )
```

Registering the sidecar is the important step. It makes the CSV part of the artifact contract instead of an arbitrary file that later code has to discover.

## Load sidecars through the artifact reference

Aggregation, comparison, and plotting code should load sidecars from the artifact record. Do not scan directories for files that "look right".

```python
import csv

from polyzymd.analyses.mda import ArtifactSidecarRef, ArtifactStore, ReplicateArtifact


def sidecar_by_kind(artifact: ReplicateArtifact, kind: str) -> ArtifactSidecarRef:
    matches = [ref for ref in artifact.sidecars if ref.metadata.get("kind") == kind]
    if len(matches) != 1:
        raise ValueError(f"Expected one {kind!r} sidecar, found {len(matches)}")
    return matches[0]


def load_distance_matrix(run_dir):
    store = ArtifactStore(run_dir)
    artifact = store.read_replicate_result()
    sidecar = sidecar_by_kind(artifact, "distance_matrix")
    with store.load_npz_sidecar(sidecar) as data:
        distances_nm = data["distances_nm"].copy()
        frames = data["frames"].copy()
    return artifact, distances_nm, frames


def load_contact_events(run_dir):
    store = ArtifactStore(run_dir)
    artifact = store.read_replicate_result()
    sidecar = sidecar_by_kind(artifact, "contact_events")
    csv_path = store.validate_sidecar(sidecar)
    with csv_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    return artifact, rows
```

`load_npz_sidecar()` and `validate_sidecar()` check the stored path, file size, and SHA-256 digest before returning data. If the sidecar is missing or stale, the store raises an artifact-store error instead of silently loading the wrong file.

## Use sidecars for artifact-only plotting

`plot()` must read cached artifacts and registered sidecars only. It should not reload trajectories, re-run MDAnalysis jobs, or infer outputs by walking the filesystem.

A typical plot method should:

1. Locate each condition's artifact directory from the `PlotContext` data the orchestrator provides.
2. Read `result.json` with `ArtifactStore.read_replicate_result()`, `read_condition_result()`, or a plugin helper built on those methods.
3. Select the needed `ArtifactSidecarRef` from `artifact.sidecars` by metadata or a documented payload key.
4. Validate and load the sidecar with `ArtifactStore`.
5. Render figures from those cached values.

This keeps plotting reproducible: the same artifact and sidecar bytes that were aggregated or compared are the bytes used for the figure.

## Avoid these anti-patterns

Keep these failure modes out of contributor plugins.

### Do not put large arrays in JSON payloads

```python
# Wrong: huge JSON payloads make artifacts slow to read, diff, and validate.
ReplicateArtifact(
    analysis_name="my_analysis",
    condition_label="Polymer",
    replicate=1,
    payload={"all_frame_distances": distances.tolist()},
)
```

### Do not store raw MDAnalysis results

```python
# Wrong: raw MDAnalysis Results objects are runtime containers, not artifacts.
ReplicateArtifact(
    analysis_name="my_analysis",
    condition_label="Polymer",
    replicate=1,
    payload={"results": job.results},
)
```

### Do not discover plugin-specific cache files by filename

```python
# Wrong: arbitrary file discovery bypasses artifact validation and provenance.
for path in run_dir.glob("*.csv"):
    rows = path.read_text()
```

Use a compact payload plus registered sidecars instead: the artifact says which files belong to it, and `ArtifactStore` validates those files before loading.

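
To make concrete what validating a registered file before loading can involve, here is a standard-library sketch of a relative-path, size, and SHA-256 check. It is an illustration only and not the `ArtifactStore` implementation; `check_sidecar` and its parameters are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path, PurePosixPath


def check_sidecar(root: Path, relative_path: str, sha256: str, size: int) -> Path:
    """Resolve a store-relative path and verify its recorded size and digest.

    Hypothetical helper for illustration; real plugins should rely on the
    store's own validation.
    """
    rel = PurePosixPath(relative_path)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"not a store-relative path: {relative_path!r}")
    path = root.joinpath(*rel.parts)
    if not path.is_file():
        raise FileNotFoundError(path)
    data = path.read_bytes()
    if len(data) != size:
        raise ValueError(f"size mismatch for {relative_path!r}")
    if hashlib.sha256(data).hexdigest() != sha256:
        raise ValueError(f"digest mismatch for {relative_path!r}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "sidecars" / "contact_events.csv"
    target.parent.mkdir(parents=True)
    target.write_bytes(b"frame,distance_nm\n0,0.42\n")

    # Record what a registration step would record.
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    size = target.stat().st_size
    verified_ok = check_sidecar(root, "sidecars/contact_events.csv", digest, size) == target

    # Overwrite with same-length but different bytes: the digest check catches it.
    target.write_bytes(b"frame,distance_nm\n0,0.99\n")
    try:
        check_sidecar(root, "sidecars/contact_events.csv", digest, size)
        stale_detected = False
    except ValueError:
        stale_detected = True
```

A filename glob would have accepted the overwritten file without complaint, which is the failure the recorded digest is there to prevent.
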
## Success state

You have implemented sidecar handling when:

- bulky arrays or tables are written under a store-relative `sidecars/` path;
- every sidecar is registered and included in `artifact.sidecars`;
- the JSON payload contains summaries, dimensions, labels, and sidecar keys, not large frame-by-frame data;
- aggregation and plotting load sidecars through `ArtifactStore` and the artifact's sidecar references; and
- no contributor code imports private framework modules.
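
Some of these criteria can be checked mechanically in a plugin's own tests. The sketch below is one possible payload audit using only the standard library. `audit_payload` and the `max_items` threshold of 64 are assumptions chosen for illustration, not framework API or policy.

```python
import json
from pathlib import PurePosixPath, PureWindowsPath


def audit_payload(payload: dict, max_items: int = 64) -> list[str]:
    """Return human-readable problems found in an artifact payload.

    Hypothetical test helper: flags payloads that are not strict JSON,
    contain long lists, or embed absolute filesystem paths.
    """
    problems: list[str] = []
    try:
        json.dumps(payload, allow_nan=False)
    except (TypeError, ValueError) as exc:
        problems.append(f"payload is not strict JSON: {exc}")

    def walk(value: object, where: str) -> None:
        if isinstance(value, dict):
            for key, item in value.items():
                walk(item, f"{where}.{key}")
        elif isinstance(value, (list, tuple)):
            if len(value) > max_items:
                problems.append(f"{where}: {len(value)} items, move to a sidecar")
            else:
                for index, item in enumerate(value):
                    walk(item, f"{where}[{index}]")
        elif isinstance(value, str):
            if PurePosixPath(value).is_absolute() or PureWindowsPath(value).is_absolute():
                problems.append(f"{where}: absolute path {value!r}")

    walk(payload, "payload")
    return problems


good_payload = {
    "n_pairs": 3,
    "metrics": {"mean_distance_nm": 0.42},
    "event_table_sidecar": "sidecars/contact_events.csv",
}
bad_payload = {
    "all_frame_distances": [0.1] * 1000,
    "event_table": "/scratch/run/contact_events.csv",
    "metrics": {"mean_distance_nm": float("nan")},
}
good_problems = audit_payload(good_payload)
bad_problems = audit_payload(bad_payload)
```

With these inputs, `good_problems` is empty and `bad_problems` reports the non-finite metric, the oversized list, and the absolute path.
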