Use Case 05 · Bundle, Deposit & Query

Crate-out

Finished results become a self-describing bundle

Use case 04 pulled crates in to assemble inputs. This is the symmetric step: the integrative model, its refined components, the maps, the SAXS data, and the validation reports get packaged back out as one RO-Crate. The crate carries a LAMBDA-BER Dataset describing everything inside, so the bundle is not a folder of mystery files but a typed, checksummed, schema-validatable record.

▸ chd1-ncp-integrative.crate/ — what the agent assembles

chd1-ncp-integrative.crate/
├── ro-crate-metadata.json      # RO-Crate descriptor (the manifest)
├── dataset.yaml                # LAMBDA-BER Dataset — validates against the schema
├── models/
│   └── integrative_model.cif    # IMP output, the deliverable
├── maps/
│   └── complex_sharpened.map    # cryo-EM, 3.2 Å
├── components/
│   └── ncp_refined.cif          # phenix.refine output (X-ray)
├── saxs/
│   └── chd1.dat                 # I(q), P(r)
└── reports/
    ├── molprobity.json          # validation
    └── mtriage_fsc.json         # map resolution

Why the schema matters here: the same classes that described the inputs describe the outputs. A reviewer, a collaborator, or an agent six months from now reads dataset.yaml and knows exactly which file is which, how it was produced, and what it was derived from — no tribal knowledge required.
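As a rough illustration of the "typed, checksummed" part, the sketch below walks a crate directory and emits a minimal RO-Crate 1.1 descriptor with one File entity per payload file. The `build_descriptor` helper and the `sha256` property name are assumptions made for this example, not part of LAMBDA-BER or of any RO-Crate tooling; a real pipeline would more likely use an RO-Crate library and the schema's own checksum field.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large maps never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_descriptor(crate_dir: Path) -> dict:
    """Return a minimal RO-Crate 1.1 descriptor listing every payload file."""
    files = sorted(
        p
        for p in crate_dir.rglob("*")
        if p.is_file() and p.name != "ro-crate-metadata.json"
    )
    file_entities = [
        {
            "@id": p.relative_to(crate_dir).as_posix(),
            "@type": "File",
            "contentSize": str(p.stat().st_size),
            "sha256": sha256_of(p),  # hypothetical property name for the checksum
        }
        for p in files
    ]
    graph = [
        {
            # The metadata descriptor entity that points at the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # The root dataset, linking to every file in the crate.
            "@id": "./",
            "@type": "Dataset",
            "name": crate_dir.name,
            "hasPart": [{"@id": entity["@id"]} for entity in file_entities],
        },
        *file_entities,
    ]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}
```

Serializing the returned dict with `json.dumps` into `ro-crate-metadata.json` would complete the manifest; `dataset.yaml` is treated here as just another listed file.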

Two destinations

The same crate goes two places

A finished crate serves two very different audiences. Deposition is about publishing a single result for the world; the lakehouse is about accumulating many results for your own future questions. One bundle, both jobs.

publish

Deposit to the archives

The crate is the staging ground for canonical submissions.

PDB / PDB-Dev — atomic & integrative models
EMDB — cryo-EM maps + half-maps + FSC
SASBDB — SAXS curves, P(r), envelopes
Zenodo / institutional repo — the whole crate, with a DOI

accumulate

Land metadata in the lakehouse

Normalized LAMBDA-BER rows, versioned, queryable across every study.

Metadata → tables — Sample, ExperimentRun, WorkflowRun, DataFile
Data files → pointers — URIs + checksums, bytes stay federated
Iceberg / Delta — time-travel, schema evolution
One model, many studies — cross-facility analytics

Architecture

Strata: metadata in the warehouse, bytes in the lake

The lakehouse split is the whole point. Heavy data — maps, movies, MTZs — never moves into a warehouse; it stays in object storage at the facility, referenced by checksummed pointers. Only the LAMBDA-BER metadata is normalized into query tables. You get warehouse-grade SQL over lake-scale data.

The query payoff

Every past study, one SQL away

Because each crate lands the same LAMBDA-BER classes, the lakehouse holds a uniform table across facilities and years. Questions that used to mean emailing three labs become a single query over a flattened analytics view.

▸ "Sub-3 Å cryo-EM with a matching SAXS run"

SELECT s.sample_code, em.resolution_ang, sx.rg_ang
FROM   experiment_run em
JOIN   experiment_sample_assoc a  USING (experiment_id)
JOIN   sample s                   ON s.id = a.sample_id
JOIN   experiment_sample_assoc b  ON b.sample_id = s.id
JOIN   experiment_run sx          ON sx.experiment_id = b.experiment_id
WHERE  em.technique = 'cryo_em'
  AND  em.resolution_ang < 3.0
  AND  sx.technique = 'saxs';
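To make the join path concrete, here is a small self-contained sketch that runs this question against an in-memory SQLite database. The table layout (an `experiment_id` key on `experiment_run`, both runs reached through `experiment_sample_assoc`) is an assumption inferred from the column names used here, and the rows are invented toy values, not results from any study.

```python
import sqlite3

# Hypothetical minimal schema; the real lakehouse tables will have more columns.
SCHEMA = """
CREATE TABLE sample (id INTEGER PRIMARY KEY, sample_code TEXT);
CREATE TABLE experiment_run (
    experiment_id  INTEGER PRIMARY KEY,
    technique      TEXT,
    resolution_ang REAL,
    rg_ang         REAL
);
CREATE TABLE experiment_sample_assoc (experiment_id INTEGER, sample_id INTEGER);
"""

# Both the cryo-EM run and the SAXS run reach the sample via the association table.
QUERY = """
SELECT s.sample_code, em.resolution_ang, sx.rg_ang
FROM   experiment_run em
JOIN   experiment_sample_assoc a ON a.experiment_id = em.experiment_id
JOIN   sample s                  ON s.id = a.sample_id
JOIN   experiment_sample_assoc b ON b.sample_id = s.id
JOIN   experiment_run sx         ON sx.experiment_id = b.experiment_id
WHERE  em.technique = 'cryo_em'
  AND  em.resolution_ang < 3.0
  AND  sx.technique = 'saxs'
"""


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database holding two toy samples with two runs each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?)", [(1, "toy-A"), (2, "toy-B")]
    )
    conn.executemany(
        "INSERT INTO experiment_run VALUES (?, ?, ?, ?)",
        [
            (10, "cryo_em", 2.8, None),  # toy-A: below 3 Å, should match
            (11, "saxs", None, 41.5),
            (12, "cryo_em", 3.2, None),  # toy-B: above the cutoff, filtered out
            (13, "saxs", None, 38.0),
        ],
    )
    conn.executemany(
        "INSERT INTO experiment_sample_assoc VALUES (?, ?)",
        [(10, 1), (11, 1), (12, 2), (13, 2)],
    )
    return conn


def find_matches(conn: sqlite3.Connection) -> list[tuple]:
    """Return (sample_code, resolution, Rg) for samples with both kinds of run."""
    return conn.execute(QUERY).fetchall()
```

With the toy rows above, only `toy-A` comes back, because `toy-B` has a SAXS run but its cryo-EM map misses the resolution cutoff.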

▸ "Re-runnable provenance for any deposited model"

-- list every file an IMP run produced, with the checksum that pins it
SELECT w.workflow_code, w.software_name,
       d.file_name, d.checksum
FROM   workflow_run w
JOIN   workflow_output_assoc o USING (workflow_id)
JOIN   data_file d             ON d.id = o.data_file_id
WHERE  w.software_name = 'IMP';
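The query above covers one hop of provenance. Walking a model all the way back through upstream jobs means following the output and input associations repeatedly; a minimal sketch of that traversal is below. The `lineage` helper and the tuple layout are hypothetical, and it assumes each file is produced by at most one workflow run.

```python
from collections import defaultdict


def lineage(
    file_id: str,
    outputs: list[tuple[str, str]],
    inputs: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Trace a file back through the workflow runs that led to it.

    `outputs` and `inputs` are (workflow_id, data_file_id) pairs, i.e. the rows
    of the workflow_output and workflow_input association tables.
    Returns (workflow_id, produced_file_id) steps, nearest producer first.
    """
    produced_by = {data_file: workflow for workflow, data_file in outputs}
    consumed = defaultdict(list)
    for workflow, data_file in inputs:
        consumed[workflow].append(data_file)

    steps: list[tuple[str, str]] = []
    seen: set[str] = set()
    stack = [file_id]
    while stack:
        current = stack.pop()
        workflow = produced_by.get(current)
        if workflow is None or workflow in seen:
            continue  # raw input with no recorded producer, or already visited
        seen.add(workflow)
        steps.append((workflow, current))
        stack.extend(consumed[workflow])  # keep walking upstream
    return steps
```

Files with no recorded producer (raw detector output, for example) end the walk, which is where a join to the experiment tables would take over.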

The bytes behind each data_file row are still out in object storage: the query returns only the pointer and its checksum, and the agent streams the file only if the next step needs it. Same federation principle as use case 04, now running in reverse.
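A minimal sketch of that lazy step, assuming the pointer is a URI that `urllib` can open and the checksum is a SHA-256 hex digest (both are assumptions; the `fetch_verified` helper and `ChecksumMismatch` exception are invented for this example):

```python
import hashlib
from urllib.request import urlopen


class ChecksumMismatch(Exception):
    """Raised when the fetched bytes do not match the recorded checksum."""


def fetch_verified(uri: str, expected_sha256: str, chunk_size: int = 1 << 20) -> bytes:
    """Stream the bytes behind a pointer and verify them before returning."""
    digest = hashlib.sha256()
    chunks = []
    with urlopen(uri) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            chunks.append(chunk)
    if digest.hexdigest() != expected_sha256:
        raise ChecksumMismatch(f"{uri}: checksum does not match the catalog row")
    return b"".join(chunks)
```

Nothing is downloaded until `fetch_verified` is called, so a query that only inspects metadata never touches object storage. For very large maps a real implementation would write chunks to disk rather than collect them in memory.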

What lands in the lakehouse

The crate's metadata, normalized

Depositing is publishing one crate; the lakehouse ingest shreds that crate's dataset.yaml into the relational tables it already maps to.

Entity tables

Class	Rows from this crate
`Sample`	complex + components
`ExperimentRun`	cryo-EM, SAXS, X-ray
`WorkflowRun`	refine, real_space, IMP
`DataFile`	models, maps, reports + checksums

The edges that make it queryable

Association	Enables
`workflow_output`	model → its provenance
`workflow_input`	which files fed which job
`experiment_sample`	join techniques on a sample
`study_experiment`	group a whole project
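The shredding step can be pictured as a plain transformation from one nested record to flat rows. The sketch below assumes `dataset.yaml` has already been parsed into a dict; every field name in it (`samples`, `experiment_runs`, `inputs`, `outputs`, and so on) is a placeholder chosen for illustration rather than the actual LAMBDA-BER slot names.

```python
def shred(dataset: dict) -> dict[str, list[dict]]:
    """Flatten one parsed dataset record into entity and association rows."""
    tables: dict[str, list[dict]] = {
        name: []
        for name in (
            "sample", "experiment_run", "workflow_run", "data_file",
            "experiment_sample_assoc", "study_experiment_assoc",
            "workflow_input_assoc", "workflow_output_assoc",
        )
    }
    study_id = dataset["study_id"]

    for sample in dataset.get("samples", []):
        tables["sample"].append(
            {"id": sample["id"], "sample_code": sample["sample_code"]}
        )

    for run in dataset.get("experiment_runs", []):
        tables["experiment_run"].append(
            {"experiment_id": run["id"], "technique": run["technique"]}
        )
        tables["study_experiment_assoc"].append(
            {"study_id": study_id, "experiment_id": run["id"]}
        )
        for sample_id in run.get("samples", []):
            tables["experiment_sample_assoc"].append(
                {"experiment_id": run["id"], "sample_id": sample_id}
            )

    for data_file in dataset.get("data_files", []):
        # Only the pointer is stored: URI plus checksum, never the bytes.
        tables["data_file"].append(
            {key: data_file[key] for key in ("id", "file_name", "uri", "checksum")}
        )

    for workflow in dataset.get("workflow_runs", []):
        tables["workflow_run"].append(
            {"workflow_id": workflow["id"], "software_name": workflow["software_name"]}
        )
        for file_id in workflow.get("inputs", []):
            tables["workflow_input_assoc"].append(
                {"workflow_id": workflow["id"], "data_file_id": file_id}
            )
        for file_id in workflow.get("outputs", []):
            tables["workflow_output_assoc"].append(
                {"workflow_id": workflow["id"], "data_file_id": file_id}
            )
    return tables
```

Each key of the returned dict corresponds to one table to append to; because the association rows are derived from the same record as the entity rows, the edges cannot drift out of sync with the entities they connect.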

The agent at work

Package, deposit, ingest

One pass: the agent validates the crate against the schema, files the public depositions, and ingests the metadata into the lakehouse — leaving the heavy data where it sits.

beril · claude-code · ~/chd1-nucleosome

> package the finished model and deposit it

⏺ build crate → chd1-ncp-integrative.crate/
write dataset.yaml · 4 WorkflowRun, 6 DataFile, output associations wired
✓ validates against lambda_ber_schema

⏺ deposit (public archives)
EMDB ← complex_sharpened.map + half-maps + FSC
PDB-Dev ← integrative_model.cif (+ component models)
SASBDB ← chd1.dat
→ accessions stored back on the DataFile rows

⏺ ingest to lakehouse (Iceberg)
shred dataset.yaml → 4 tables + 4 association tables
data files registered as pointers (URI + checksum) — 0 bytes copied
✓ snapshot v37 · now joinable with 212 prior studies

Payoff

What bundling + lakehousing buys

[ ✓ ] Trustworthy deliverables

A deposited crate is self-describing and checksummed — reviewers and reusers see exactly how each file was produced.

[ ⛁ ] Cross-study memory

Every project you finish makes the next query richer. "Have we ever seen this complex by SAXS?" becomes answerable.

[ ⇉ ] Bytes stay put

The warehouse holds metadata and pointers; terabytes of maps and movies never migrate. Lake-scale, warehouse-queryable.

[ ↺ ] Versioned & re-runnable

Iceberg/Delta snapshots plus typed provenance mean any past result can be traced, compared, or recomputed on demand.