The flip side of integration. Once the structures are solved and the model is built, the work has to be packaged so others — and future you — can trust and reuse it. A structural computational biologist, with an AI agent doing the assembly, turns finished results into a self-describing LAMBDA-BER crate, deposits it to the public archives, and lands its metadata in a lakehouse where every past study becomes queryable.
Crate-out
Use case 04 pulled crates in to assemble inputs. This is the symmetric
step: the integrative model, its refined components, the maps, and the validation
reports get packaged back out as one RO-Crate. The crate carries a
LAMBDA-BER Dataset describing everything inside — so the bundle is not
a folder of mystery files, it is a typed, checksummed, schema-validatable record.
chd1-ncp-integrative.crate/
├── ro-crate-metadata.json     # RO-Crate descriptor (the manifest)
├── dataset.yaml               # LAMBDA-BER Dataset, validates against the schema
├── models/
│   └── integrative_model.cif  # IMP output, the deliverable
├── maps/
│   └── complex_sharpened.map  # cryo-EM, 3.2 Å
├── components/
│   └── ncp_refined.cif        # phenix.refine output (X-ray)
├── saxs/
│   └── chd1.dat               # I(q), P(r)
└── reports/
    ├── molprobity.json        # validation
    └── mtriage_fsc.json       # map resolution
A recipient, human or agent, opens dataset.yaml and knows exactly which
file is which, how it was produced, and what it was derived from, with no tribal
knowledge required.
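To make "checksummed" concrete, here is a minimal sketch of how the per-file records behind a crate could be generated. The function names, the record keys, and the choice of SHA-256 are assumptions made for illustration; the source does not say which hash or field names LAMBDA-BER uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large maps never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(crate_root: Path) -> list[dict]:
    """Return one record per payload file, with paths relative to the crate root.

    The two descriptor files at the top level are skipped because they describe
    the payload rather than belong to it (an assumption of this sketch).
    """
    descriptors = {"ro-crate-metadata.json", "dataset.yaml"}
    records = []
    for path in sorted(crate_root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(crate_root).as_posix()
        if relative in descriptors:
            continue
        records.append({
            "file_name": relative,
            "size_bytes": path.stat().st_size,
            "checksum": "sha256:" + sha256_of(path),
        })
    return records
```

Calling `build_manifest(Path("chd1-ncp-integrative.crate"))` would yield one record per model, map, component, SAXS file, and report, ready to be written into the dataset description.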
Two destinations
A finished crate serves two very different audiences. Deposition is about publishing a single result for the world; the lakehouse is about accumulating many results for your own future questions. One bundle, both jobs.
Deposition: the crate is the staging ground for canonical submissions.
Lakehouse: normalized LAMBDA-BER rows, versioned and queryable across every study.
Architecture
The lakehouse split is the whole point. Heavy data — maps, movies, MTZs — never moves into a warehouse; it stays in object storage at the facility, referenced by checksummed pointers. Only the LAMBDA-BER metadata is normalized into query tables. You get warehouse-grade SQL over lake-scale data.
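One way to picture a checksummed pointer is a small record that stores where the bytes live and what they must hash to, so a file can be verified whenever it is eventually streamed. The sketch below is illustrative only: `DataPointer`, its fields, the example URI, and the use of SHA-256 are all hypothetical, and no real object-storage client is involved.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DataPointer:
    """Metadata row kept in the query tables; the bytes stay in object storage."""
    uri: str          # hypothetical location, e.g. "s3://facility-bucket/maps/x.map"
    sha256: str       # hex digest recorded when the crate was built
    size_bytes: int


def verify_stream(pointer: DataPointer, chunks: Iterable[bytes]) -> bool:
    """Hash bytes as they arrive and compare against the recorded pointer."""
    digest = hashlib.sha256()
    total = 0
    for chunk in chunks:
        digest.update(chunk)
        total += len(chunk)
    return total == pointer.size_bytes and digest.hexdigest() == pointer.sha256
```

The design choice this illustrates is that the tables hold only a few hundred bytes per file, while integrity is still checkable at the moment the heavy data is actually read.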
The query payoff
Because each crate lands the same LAMBDA-BER classes, the lakehouse holds a uniform table across facilities and years. Questions that used to mean emailing three labs become a single query over a flattened analytics view.
SELECT s.sample_code, em.resolution_ang, sx.rg_ang
FROM experiment_run em
JOIN experiment_sample_assoc a USING (experiment_id)
JOIN sample s ON s.id = a.sample_id
JOIN experiment_run sx ON sx.sample_id = s.id
WHERE em.technique = 'cryo_em'
  AND em.resolution_ang <= 3.0
  AND sx.technique = 'saxs';
-- walk outputs back to the beamtime that produced them
SELECT w.workflow_code, w.software_name, d.file_name, d.checksum
FROM workflow_run w
JOIN workflow_output_assoc o USING (workflow_id)
JOIN data_file d ON d.id = o.data_file_id
WHERE w.software_name = 'IMP';
The bytes referenced by data_file.checksum are still out in object
storage — the query returns the pointer, and the agent streams the file only if the
next step needs it. Same federation principle as use case 04, now running in reverse.
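The provenance query can be tried end to end against a toy in-memory database. The sketch below uses SQLite as a stand-in for the lakehouse engine; the table and column names come from the query above, while the column types, the rows, and the placeholder checksums are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workflow_run (
        workflow_id   INTEGER PRIMARY KEY,
        workflow_code TEXT,
        software_name TEXT
    );
    CREATE TABLE data_file (
        id        INTEGER PRIMARY KEY,
        file_name TEXT,
        checksum  TEXT
    );
    CREATE TABLE workflow_output_assoc (
        workflow_id  INTEGER REFERENCES workflow_run (workflow_id),
        data_file_id INTEGER REFERENCES data_file (id)
    );

    -- Toy rows; codes and checksums are placeholders, not real values.
    INSERT INTO workflow_run VALUES (1, 'wf-imp-001', 'IMP');
    INSERT INTO workflow_run VALUES (2, 'wf-refine-001', 'phenix.refine');
    INSERT INTO data_file VALUES (10, 'models/integrative_model.cif', 'sha256:placeholder-a');
    INSERT INTO data_file VALUES (11, 'components/ncp_refined.cif', 'sha256:placeholder-b');
    INSERT INTO workflow_output_assoc VALUES (1, 10);
    INSERT INTO workflow_output_assoc VALUES (2, 11);
""")

# Same query as in the text: only metadata and pointers are returned.
rows = conn.execute("""
    SELECT w.workflow_code, w.software_name, d.file_name, d.checksum
    FROM workflow_run w
    JOIN workflow_output_assoc o USING (workflow_id)
    JOIN data_file d ON d.id = o.data_file_id
    WHERE w.software_name = 'IMP'
""").fetchall()

for workflow_code, software_name, file_name, checksum in rows:
    print(workflow_code, software_name, file_name, checksum)
```

Only the IMP run and its output file come back; the refinement run is filtered out, and no file bytes are touched at any point.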
What lands in the lakehouse
Depositing is publishing one crate; the lakehouse ingest shreds that crate's
dataset.yaml into the relational tables it already maps to.
| Class | Rows from this crate |
|---|---|
| Sample | complex + components |
| ExperimentRun | cryo-EM, SAXS, X-ray |
| WorkflowRun | refine, real_space, IMP |
| DataFile | models, maps, reports + checksums |
| Association | Enables |
|---|---|
| workflow_output | model → its provenance |
| workflow_input | which files fed which job |
| experiment_sample | join techniques on a sample |
| study_experiment | group a whole project |
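The "shredding" step can be sketched as a function that walks a parsed dataset description and emits rows per table. The layout of `dataset.yaml` is not shown in the source, so the dictionary shape below (`samples`, `data_files`, `workflow_runs` with `inputs` and `outputs`) is purely an assumption, as are the row keys; a real ingest would follow the LAMBDA-BER schema instead.

```python
from collections import defaultdict


def shred(dataset: dict) -> dict[str, list[dict]]:
    """Flatten a nested dataset description into per-table row lists."""
    tables: dict[str, list[dict]] = defaultdict(list)

    for sample in dataset.get("samples", []):
        tables["sample"].append({"sample_code": sample["sample_code"]})

    for data_file in dataset.get("data_files", []):
        tables["data_file"].append({
            "file_name": data_file["file_name"],
            "checksum": data_file["checksum"],
        })

    for run in dataset.get("workflow_runs", []):
        code = run["workflow_code"]
        tables["workflow_run"].append({
            "workflow_code": code,
            "software_name": run["software_name"],
        })
        # Association rows are what make provenance joins possible later.
        for name in run.get("inputs", []):
            tables["workflow_input_assoc"].append(
                {"workflow_code": code, "file_name": name})
        for name in run.get("outputs", []):
            tables["workflow_output_assoc"].append(
                {"workflow_code": code, "file_name": name})

    return dict(tables)


# Hypothetical parsed dataset; values are placeholders for illustration.
example = {
    "samples": [{"sample_code": "complex-demo"}],
    "data_files": [
        {"file_name": "maps/complex_sharpened.map", "checksum": "sha256:placeholder-a"},
        {"file_name": "models/integrative_model.cif", "checksum": "sha256:placeholder-b"},
    ],
    "workflow_runs": [{
        "workflow_code": "wf-imp-001",
        "software_name": "IMP",
        "inputs": ["maps/complex_sharpened.map"],
        "outputs": ["models/integrative_model.cif"],
    }],
}

rows = shred(example)
```

Each key of `rows` corresponds to one target table, so loading reduces to one bulk insert per key.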
The agent at work
One pass: the agent validates the crate against the schema, files the public depositions, and ingests the metadata into the lakehouse — leaving the heavy data where it sits.
Payoff
A deposited crate is self-describing and checksummed — reviewers and reusers see exactly how each file was produced.
Every project you finish makes the next query richer. "Have we ever seen this complex by SAXS?" becomes answerable.
The warehouse holds metadata and pointers; terabytes of maps and movies never migrate. Lake-scale, warehouse-queryable.
Iceberg/Delta snapshots plus typed provenance mean any past result can be traced, compared, or recomputed on demand.