05 · LAMBDA-BER Use Case

Bundle, deposit & query: results crates into a lakehouse

The flip side of integration. Once the structures are solved and the model is built, the work has to be packaged so others — and future you — can trust and reuse it. A structural computational biologist, with an AI agent doing the assembly, turns finished results into a self-describing LAMBDA-BER crate, deposits it to the public archives, and lands its metadata in a lakehouse where every past study becomes queryable.

Lifecycle: produce → deposit → query
Out: RO-Crate + public deposition
Store: Iceberg / Delta lakehouse
Data: stays federated

Crate-out

Finished results become a self-describing bundle

Use case 04 pulled crates in to assemble inputs. This is the symmetric step: the integrative model, its refined components, the maps, the SAXS curves, and the validation reports get packaged back out as one RO-Crate. The crate carries a LAMBDA-BER Dataset describing everything inside, so the bundle is not a folder of mystery files but a typed, checksummed, schema-validatable record.

chd1-ncp-integrative.crate/ — what the agent assembles
chd1-ncp-integrative.crate/
├── ro-crate-metadata.json      # RO-Crate descriptor (the manifest)
├── dataset.yaml                # LAMBDA-BER Dataset — validates against the schema
├── models/
│   └── integrative_model.cif    # IMP output, the deliverable
├── maps/
│   └── complex_sharpened.map    # cryo-EM, 3.2 Å
├── components/
│   └── ncp_refined.cif          # phenix.refine output (X-ray)
├── saxs/
│   └── chd1.dat                 # I(q), P(r)
└── reports/
    ├── molprobity.json          # validation
    └── mtriage_fsc.json         # map resolution
Why the schema matters here: the same classes that described the inputs describe the outputs. A reviewer, a collaborator, or an agent six months from now reads dataset.yaml and knows exactly which file is which, how it was produced, and what it was derived from — no tribal knowledge required.
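
A minimal sketch of the "typed, checksummed" part, assuming SHA-256 as the checksum and a bare-bones RO-Crate 1.1 descriptor. The function names (sha256_of, list_payload, ro_crate_descriptor) are illustrative and not part of any LAMBDA-BER tooling; a real crate's root Dataset would also carry description, datePublished and license.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large maps never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_payload(crate_dir: Path) -> list:
    """One record per payload file: relative path, size, checksum."""
    records = []
    for path in sorted(crate_dir.rglob("*")):
        # The descriptor describes the crate, so it is not listed as payload.
        if path.is_file() and path.name != "ro-crate-metadata.json":
            records.append({
                "path": path.relative_to(crate_dir).as_posix(),
                "size_bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            })
    return records


def ro_crate_descriptor(records: list, name: str) -> dict:
    """Minimal RO-Crate 1.1 graph: metadata descriptor, root Dataset, File entities.

    Checksums stay in the records (and in dataset.yaml); this sketch does not
    assume any particular RO-Crate property for them.
    """
    files = [
        {"@id": r["path"], "@type": "File", "contentSize": str(r["size_bytes"])}
        for r in records
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "hasPart": [{"@id": f["@id"]} for f in files],
            },
            *files,
        ],
    }
```

Calling list_payload(Path("chd1-ncp-integrative.crate")) and passing the result to ro_crate_descriptor would give a dictionary that can be written out with json.dump as ro-crate-metadata.json.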

Two destinations

The same crate goes two places

A finished crate serves two very different audiences. Deposition is about publishing a single result for the world; the lakehouse is about accumulating many results for your own future questions. One bundle, both jobs.

publish

Deposit to the archives

The crate is the staging ground for canonical submissions.

  • PDB / PDB-Dev — atomic & integrative models
  • EMDB — cryo-EM maps + half-maps + FSC
  • SASBDB — SAXS curves, P(r), envelopes
  • Zenodo / institutional repo — the whole crate, with a DOI
accumulate

Land metadata in the lakehouse

Normalized LAMBDA-BER rows, versioned, queryable across every study.

  • Metadata → tables — Sample, ExperimentRun, WorkflowRun, DataFile
  • Data files → pointers — URIs + checksums, bytes stay federated
  • Iceberg / Delta — time-travel, schema evolution
  • One model, many studies — cross-facility analytics

Architecture

Strata: metadata in the warehouse, bytes in the lake

The lakehouse split is the whole point. Heavy data — maps, movies, MTZs — never moves into a warehouse; it stays in object storage at the facility, referenced by checksummed pointers. Only the LAMBDA-BER metadata is normalized into query tables. You get warehouse-grade SQL over lake-scale data.
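
The pointer idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: DataFilePointer, make_pointer, resolve and ChecksumMismatch are hypothetical names, SHA-256 is assumed as the checksum, and fetch stands in for whatever client actually reads from the facility's object storage.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DataFilePointer:
    """What the lakehouse stores instead of the bytes themselves."""
    uri: str
    sha256: str
    size_bytes: int


class ChecksumMismatch(Exception):
    """Raised when fetched bytes do not match the registered checksum."""


def make_pointer(uri: str, payload: bytes) -> DataFilePointer:
    # Computed once, next to the data (for example at the facility);
    # only the resulting pointer travels to the lakehouse.
    return DataFilePointer(uri, hashlib.sha256(payload).hexdigest(), len(payload))


def resolve(pointer: DataFilePointer, fetch: Callable[[str], bytes]) -> bytes:
    """Fetch the bytes on demand and verify them against the pointer."""
    data = fetch(pointer.uri)
    actual = hashlib.sha256(data).hexdigest()
    if actual != pointer.sha256:
        raise ChecksumMismatch(f"{pointer.uri}: expected {pointer.sha256}, got {actual}")
    return data
```

Because resolve takes the fetch function as an argument, the same pointer row works whether the bytes sit in S3-compatible storage, on a facility file system, or behind an HTTP endpoint.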

[Diagram: a results crate (Dataset + checksummed files) is deposited to the public archives (PDB · EMDB · SASBDB, Zenodo with a DOI), and its metadata lands in the query layer (metadata tables in Iceberg / Delta, plus URI + checksum pointers). The query layer resolves files on demand from federated object storage (maps · movies · MTZ, kept at the facility and never copied in) and feeds analytics & ML: cross-study cohorts, training tables, QC dashboards.]

The query payoff

Every past study, one SQL away

Because each crate lands the same LAMBDA-BER classes, the lakehouse holds one uniform set of tables across facilities and years. Questions that used to mean emailing three labs become a single query over a flattened analytics view.

"Sub-3 Å cryo-EM with a matching SAXS run"
-- both runs reach the sample through the experiment_sample association
SELECT s.sample_code, em.resolution_ang, sx.rg_ang
FROM   experiment_run em
JOIN   experiment_sample_assoc a_em ON a_em.experiment_id = em.experiment_id
JOIN   sample s                     ON s.id = a_em.sample_id
JOIN   experiment_sample_assoc a_sx ON a_sx.sample_id = s.id
JOIN   experiment_run sx            ON sx.experiment_id = a_sx.experiment_id
WHERE  em.technique = 'cryo_em'
  AND  em.resolution_ang <= 3.0
  AND  sx.technique = 'saxs';

"Re-runnable provenance for any deposited model"
-- list the files each IMP run produced, with checksums for verification
SELECT w.workflow_code, w.software_name,
       d.file_name, d.checksum
FROM   workflow_run w
JOIN   workflow_output_assoc o USING (workflow_id)
JOIN   data_file d             ON d.id = o.data_file_id
WHERE  w.software_name = 'IMP';

The bytes behind each data_file row are still out in object storage: the query returns only pointer metadata (here the file name and checksum), and the agent streams the file only if the next step needs it. This is the same federation principle as use case 04, now running in reverse.
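
To try the cryo-EM + SAXS question without a lakehouse, the sketch below runs it against an in-memory SQLite database. The table layout is an assumption reconstructed from the column names in the queries (the real lakehouse tables may differ), both runs are joined through the experiment_sample association, and every row is made-up demo data.

```python
import sqlite3

# Hypothetical, minimal layout: only the columns the query touches.
SCHEMA = """
CREATE TABLE sample (id INTEGER PRIMARY KEY, sample_code TEXT);
CREATE TABLE experiment_run (
    experiment_id  INTEGER PRIMARY KEY,
    technique      TEXT,
    resolution_ang REAL,
    rg_ang         REAL
);
CREATE TABLE experiment_sample_assoc (experiment_id INTEGER, sample_id INTEGER);
"""

QUERY = """
SELECT s.sample_code, em.resolution_ang, sx.rg_ang
FROM   experiment_run em
JOIN   experiment_sample_assoc a_em ON a_em.experiment_id = em.experiment_id
JOIN   sample s                     ON s.id = a_em.sample_id
JOIN   experiment_sample_assoc a_sx ON a_sx.sample_id = s.id
JOIN   experiment_run sx            ON sx.experiment_id = a_sx.experiment_id
WHERE  em.technique = 'cryo_em'
  AND  em.resolution_ang <= 3.0
  AND  sx.technique = 'saxs'
ORDER  BY s.sample_code;
"""


def build_demo_db() -> sqlite3.Connection:
    """Three samples: only the first has both a sub-3 A map and a SAXS run."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany(
        "INSERT INTO sample VALUES (?, ?)",
        [(1, "CHD1-NCP"), (2, "NCP-ONLY"), (3, "CHD1-APO")],
    )
    con.executemany(
        "INSERT INTO experiment_run VALUES (?, ?, ?, ?)",
        [
            (10, "cryo_em", 2.8, None),
            (11, "saxs", None, 41.5),
            (20, "cryo_em", 3.4, None),   # too coarse, filtered out
            (21, "saxs", None, 38.0),
            (30, "cryo_em", 2.5, None),   # no matching SAXS run
        ],
    )
    con.executemany(
        "INSERT INTO experiment_sample_assoc VALUES (?, ?)",
        [(10, 1), (11, 1), (20, 2), (21, 2), (30, 3)],
    )
    return con


def cryoem_with_saxs(con: sqlite3.Connection) -> list:
    return con.execute(QUERY).fetchall()
```

With the demo rows above, cryoem_with_saxs(build_demo_db()) returns the single CHD1-NCP row; the same SQL should carry over to an engine that reads Iceberg or Delta tables, provided the column names match.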

What lands in the lakehouse

The crate's metadata, normalized

Depositing is publishing one crate; the lakehouse ingest shreds that crate's dataset.yaml into the relational tables it already maps to.

Entity tables

Class          Rows from this crate
Sample         complex + components
ExperimentRun  cryo-EM, SAXS, X-ray
WorkflowRun    refine, real_space, IMP
DataFile       models, maps, reports + checksums

The edges that make it queryable

Association        Enables
workflow_output    model → its provenance
workflow_input     which files fed which job
experiment_sample  join techniques on a sample
study_experiment   group a whole project
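
The shredding step might look roughly like the following: a nested record (standing in for a parsed dataset.yaml) is flattened into entity rows plus association rows. All field names here (samples, experiment_runs, sample_ids, inputs, outputs, and so on) are assumptions made for illustration; the actual LAMBDA-BER schema defines its own slots.

```python
from collections import defaultdict


def shred(dataset: dict) -> dict:
    """Flatten one nested dataset record into entity and association rows."""
    tables = defaultdict(list)
    study_id = dataset["id"]

    for sample in dataset.get("samples", []):
        tables["sample"].append(
            {"id": sample["id"], "sample_code": sample["sample_code"]}
        )

    for run in dataset.get("experiment_runs", []):
        tables["experiment_run"].append(
            {"experiment_id": run["id"], "technique": run["technique"]}
        )
        tables["study_experiment_assoc"].append(
            {"study_id": study_id, "experiment_id": run["id"]}
        )
        for sample_id in run.get("sample_ids", []):
            tables["experiment_sample_assoc"].append(
                {"experiment_id": run["id"], "sample_id": sample_id}
            )

    for f in dataset.get("data_files", []):
        # Only the pointer (URI + checksum) is stored, never the file content.
        tables["data_file"].append(
            {"id": f["id"], "file_name": f["file_name"],
             "uri": f["uri"], "checksum": f["checksum"]}
        )

    for wf in dataset.get("workflow_runs", []):
        tables["workflow_run"].append(
            {"workflow_id": wf["id"], "software_name": wf["software_name"]}
        )
        for file_id in wf.get("inputs", []):
            tables["workflow_input_assoc"].append(
                {"workflow_id": wf["id"], "data_file_id": file_id}
            )
        for file_id in wf.get("outputs", []):
            tables["workflow_output_assoc"].append(
                {"workflow_id": wf["id"], "data_file_id": file_id}
            )

    return dict(tables)


# Made-up example record; URIs and checksums are placeholders.
DEMO = {
    "id": "study-demo",
    "samples": [{"id": "s1", "sample_code": "CHD1-NCP"}],
    "experiment_runs": [
        {"id": "e1", "technique": "cryo_em", "sample_ids": ["s1"]},
        {"id": "e2", "technique": "saxs", "sample_ids": ["s1"]},
    ],
    "data_files": [
        {"id": "f1", "file_name": "complex_sharpened.map",
         "uri": "s3://example-facility/maps/complex_sharpened.map",
         "checksum": "sha256:placeholder-1"},
        {"id": "f2", "file_name": "integrative_model.cif",
         "uri": "s3://example-facility/models/integrative_model.cif",
         "checksum": "sha256:placeholder-2"},
    ],
    "workflow_runs": [
        {"id": "w1", "software_name": "IMP", "inputs": ["f1"], "outputs": ["f2"]},
    ],
}
```

Running shred(DEMO) yields four entity tables and four association tables, mirroring the layout listed above; each list of rows could then be appended to the corresponding lakehouse table.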

The agent at work

Package, deposit, ingest

One pass: the agent validates the crate against the schema, files the public depositions, and ingests the metadata into the lakehouse — leaving the heavy data where it sits.

beril · claude-code · ~/chd1-nucleosome
> package the finished model and deposit it

⏺ build crate chd1-ncp-integrative.crate/
write dataset.yaml · 4 WorkflowRun, 6 DataFile, output associations wired
✓ validates against lambda_ber_schema

⏺ deposit (public archives)
EMDB ← complex_sharpened.map + half-maps + FSC
PDB-Dev ← integrative_model.cif (+ component models)
SASBDB ← chd1.dat
→ accessions stored back on the DataFile rows

⏺ ingest to lakehouse (Iceberg)
shred dataset.yaml → 4 tables + 4 association tables
data files registered as pointers (URI + checksum) — 0 bytes copied
✓ snapshot v37 · now joinable with 212 prior studies

Payoff

What bundling + lakehousing buys

[ ✓ ] Trustworthy deliverables

A deposited crate is self-describing and checksummed — reviewers and reusers see exactly how each file was produced.

[ ⛁ ] Cross-study memory

Every project you finish makes the next query richer. "Have we ever seen this complex by SAXS?" becomes answerable.

[ ⇉ ] Bytes stay put

The warehouse holds metadata and pointers; terabytes of maps and movies never migrate. Lake-scale, warehouse-queryable.

[ ↺ ] Versioned & re-runnable

Iceberg/Delta snapshots plus typed provenance mean any past result can be traced, compared, or recomputed on demand.