LAMBDA data organization

RO-Crates as the federation contract

A semantic exchange layer for structural biology data, with lambda-ber-schema as a typed projection for validation, analysis, and ingest.

Keywords: facility-neutral graph, small core profile, MX pilot extension, projection-tested LinkML
Diagram labels: RO-Crate canonical graph; facility systems; metadata contexts; workflow provenance; LinkML projection

Position

The contract is the graph, not the directory layout

1. RO-Crate is the exchange contract

The crate graph carries identities, relationships, provenance, file integrity, and contextual metadata across facilities.

2. Facilities keep their native storage

A crate can describe files where they already live, then project to attached packages, Fuze views, relational tables, or graph stores.

3. Profiles stay small and composable

LAMBDA Core should define shared semantics. Technique and facility extensions should appear only where the shape or validation actually differs.

RO-Crate background

RO-Crate is a practical JSON-LD package for research data

The RO-Crate 1.2 specification describes a dataset and its context using a flat linked-data graph in ro-crate-metadata.json. It can describe local files, remote URIs, physical things, people, organizations, equipment, software, licenses, and provenance.

Plain JSON shape

A crate is an @graph of entities. Objects refer to each other by @id, so consumers do not have to infer relationships from folders.
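The flat shape can be illustrated with a minimal Python sketch. The two-entity crate and the helper names `index_by_id` and `resolve` are assumptions made for illustration; they are not part of RO-Crate or of any LAMBDA tooling.

```python
import json

# Hypothetical two-entity crate, used only to illustrate the flat @graph shape.
crate = json.loads("""
{
  "@context": "https://w3id.org/ro/crate/1.2/context",
  "@graph": [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "images/scan_0001.cbf"}]},
    {"@id": "images/scan_0001.cbf", "@type": "File",
     "contentSize": "18492173"}
  ]
}
""")


def index_by_id(doc):
    """Map each entity's @id to the entity itself."""
    return {entity["@id"]: entity for entity in doc["@graph"]}


def resolve(index, value):
    """Follow {"@id": ...} references; return any other value unchanged."""
    if isinstance(value, list):
        return [resolve(index, item) for item in value]
    if isinstance(value, dict) and set(value) == {"@id"}:
        # Unresolved references are returned as-is rather than raising.
        return index.get(value["@id"], value)
    return value


entities = index_by_id(crate)
parts = resolve(entities, entities["./"]["hasPart"])
print([part["@id"] for part in parts])
```

The relationship between the dataset and the file is read from the graph alone; no directory listing is consulted.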

Linked semantics

Most terms come from schema.org; profiles add domain requirements without forking the base model.

Packaging neutral

A crate can sit beside files, point to files already in facility storage, or represent remote resources in repositories and databases.

Source: RO-Crate 1.2 introduction and RO-Crate specification page.

External precedent

Bioimaging is already moving in this direction

NFDI4BIOIMAGE and German BioImaging are a useful precedent: they face high-volume instrument data, heterogeneous local systems, cloud-oriented storage, and a need to preserve contextual metadata and workflow provenance.

  • NFDI4BIOIMAGE: FAIR biological imaging metadata, tools, repositories
  • OME-Zarr + RO-Crate: top-level metadata, attached and detached crates
  • Repository export: IDR, BIA, SSBD pipelines, high-quality RDF views

Why LAMBDA should pay attention

The pattern is not microscopy-specific: instrument data + contextual metadata + workflow provenance + repository/lakehouse projection.

Sources: NFDI4BIOIMAGE consortium goals; Lubiana, Kunis, Moore, RO-Crates for BioImaging, SWAT4HCLS 2026 abstract.

Why it matters

Structural biology workflows already cross boundaries

The same experiment touches beamline control systems, sample tracking, file stores, workflow engines, archives, and downstream analysis. A graph contract gives those systems a shared reference model.

  • beamline log: collection parameters
  • sample tracker: specimen identity
  • file store: images and products
  • workflow engine: processing provenance
  • LAMBDA RO-Crate graph: stable IDs + typed relationships

Architecture

Three layers, one contract

  • Facility-native layer: EPICS logs, LIMS and sample sheets, detector file systems, workflow databases, facility archives (storage stays local)
  • LAMBDA RO-Crate: canonical graph and profile rules covering data, sample, run, and result (identity, relationships, provenance, checksums)
  • Projection layer: LinkML data package, relational ingest, graph database, Fuze drive layout, archive package (same semantic IDs)

Flow: facility-native layer, extract, LAMBDA RO-Crate, project, projection layer.

Core graph

Current schema shape: flat entities plus association tables

This view follows the actual lambda-ber-schema model: a Dataset contains flat entity collections, and many-to-many relationships are carried by explicit association classes.

RawUnit is not a current schema class or standard RO-Crate term. If we need that boundary, it is a LAMBDA profile/future-schema decision.

Dataset: top-level container

Entity collections

  • Study: logical grouping
  • Sample: biological material
  • SamplePreparation: technique prep
  • Instrument: equipment
  • ExperimentRun: collection session
  • WorkflowRun: processing
  • DataFile: raw or derived file
  • Image: 2D, 3D, movie...

Association tables carry relationships

  • StudySampleAssociation, StudyExperimentAssociation, ExperimentSampleAssociation, ExperimentInstrumentAssociation
  • WorkflowExperimentAssociation, WorkflowInputAssociation, WorkflowOutputAssociation

Projection

lambda-ber-schema is the typed projection

The RO-Crate graph stays linked and web-native. LinkML gives LAMBDA a strict, testable shape for data packages, relational ingest, schema docs, and downstream APIs.

RO-Crate entity | Current / target class | Projection role
Crate root | Dataset | Container plus collection sessions.
Acquisition boundary | profile decision, not current class | Usually projects through Sample, ExperimentRun, and file associations.
Files and file sets | DataFile | Validate formats, sizes, checksums, paths, and data types.
CreateAction / processing | WorkflowRun | Capture software, parameters, inputs, outputs, and status.
People, orgs, instruments, software | Instrument + fields / targets | Normalize identifiers and local facility metadata.

Diagram labels: graph IDs, files, actions, projection; Dataset, DataFile, Workflow, Associations

Concrete JSON-LD

The crate starts with a descriptor and a root dataset

A LAMBDA crate should be readable as ordinary RO-Crate first, then checked against LAMBDA Core and optional technique profiles through conformsTo.

Minimal metadata document

{
  "@context": [
    "https://w3id.org/ro/crate/1.2/context",
    {
      "lambda": "http://w3id.org/lambda/",
      "mx": "http://w3id.org/lambda/mx/"
    }
  ],
  "@graph": [
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": { "@id": "https://w3id.org/ro/crate/1.2" },
      "about": { "@id": "./" }
    },
    {
      "@id": "./",
      "@type": ["Dataset", "lambda:Dataset"],
      "name": "ALS 8.3.1 MX collection 2026-05-21",
      "conformsTo": [
        { "@id": "https://lambda.berkeley.edu/profiles/core/0.1" },
        { "@id": "https://lambda.berkeley.edu/profiles/mx/0.1" }
      ],
      "hasPart": [
        { "@id": "#acquisition-crystal-a1" },
        { "@id": "images/scan_0001.cbf" }
      ]
    }
  ]
}

What this buys us

  • The descriptor says which RO-Crate version is being used and which root dataset it describes.
  • The root dataset is the package-level object LAMBDA can validate and project.
  • All entities live in one graph and are linked by stable @id values.
  • Profiles add LAMBDA constraints without breaking base RO-Crate consumers.

Schema slot overlay

Some JSON-LD properties can be actual LAMBDA slots

The crate can stay valid RO-Crate while adding lambda: properties that project directly into lambda-ber-schema. Other RO-Crate fields map during ingest.

ExperimentRun and DataFile fields

{
  "@id": "#run-a1",
  "@type": ["Action", "lambda:ExperimentRun"],
  "lambda:experiment_code": "ALS831-20260521-A1",
  "lambda:technique": "xray_crystallography",
  "lambda:beamline": "8.3.1",
  "lambda:number_of_images": { "value": 180 },
  "object": { "@id": "#sample-lysozyme-42" },
  "result": { "@id": "images/scan_0001.cbf" }
},
{
  "@id": "images/scan_0001.cbf",
  "@type": "File",
  "lambda:file_name": "scan_0001.cbf",
  "lambda:file_format": "cbf",
  "lambda:data_type": "diffraction",
  "lambda:checksum": "7d5d...e0a9",
  "encodingFormat": "image/cbf",
  "contentSize": "18492173"
}
JSON-LD field | Projects to | Status
lambda:experiment_code | ExperimentRun slot | Direct schema slot
lambda:technique, lambda:beamline | ExperimentRun slots | Direct schema slots
lambda:file_name, lambda:file_format | DataFile slots | Direct schema slots
encodingFormat, contentSize, sha256 | file_format, file_size_bytes, checksum | RO-Crate fields mapped at ingest
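The split between direct slots and ingest-time mapping can be sketched in Python. The mapping table restates the rows above; the function name `project_file` and the precedence rule (a direct `lambda:` slot wins over a mapped RO-Crate field) are assumptions, not decisions recorded in lambda-ber-schema.

```python
# RO-Crate field -> DataFile slot, as listed in the table above.
INGEST_MAP = {
    "encodingFormat": "file_format",
    "contentSize": "file_size_bytes",
    "sha256": "checksum",
}

PREFIX = "lambda:"


def project_file(entity):
    """Project a File entity to a flat DataFile-like row (hypothetical shape)."""
    row = {}
    for key, value in entity.items():
        if key.startswith(PREFIX):
            # Direct schema slot: strip the prefix and copy the value.
            row[key[len(PREFIX):]] = value
        elif key in INGEST_MAP:
            # Mapped at ingest; assumed not to override a direct slot.
            row.setdefault(INGEST_MAP[key], value)
    if "file_size_bytes" in row:
        row["file_size_bytes"] = int(row["file_size_bytes"])
    return row


file_entity = {
    "@id": "images/scan_0001.cbf",
    "@type": "File",
    "lambda:file_name": "scan_0001.cbf",
    "lambda:file_format": "cbf",
    "lambda:data_type": "diffraction",
    "encodingFormat": "image/cbf",
    "contentSize": "18492173",
    "sha256": "7d5d...e0a9",  # elided in the source example; kept as-is
}

row = project_file(file_entity)
print(row)
```

Because the direct slot takes precedence in this sketch, `file_format` ends up as `cbf` rather than the media type. Whether that is the right rule is a projection-profile decision.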

Sample metadata

Yes, sample metadata can be part of a crate

In an experiment crate, the sample is contextual metadata linked to acquisitions and files. In a sample-only handoff, the crate can simply describe samples and their provenance before any data has been collected.

Sample entity with actual Sample slots

{
  "@id": "./",
  "@type": "Dataset",
  "name": "Sample metadata package: LYZ-42",
  "about": { "@id": "#sample-lysozyme-42" },
  "hasPart": [{ "@id": "#sample-lysozyme-42" }]
},
{
  "@id": "#sample-lysozyme-42",
  "@type": ["BioChemEntity", "lambda:Sample"],
  "lambda:sample_code": "LYZ-42",
  "lambda:sample_type": "protein",
  "lambda:protein_name": "Hen egg white lysozyme",
  "lambda:organism": { "@id": "NCBITaxon:9031" },
  "lambda:concentration": { "value": 38, "unitText": "mg/mL" },
  "lambda:buffer_composition": {
    "lambda:ph": { "value": 4.6 },
    "lambda:components": ["0.1 M sodium acetate"]
  }
}

How LAMBDA would use it

  • lambda:sample_code and lambda:sample_type satisfy required Sample slots.
  • Optional biological context, buffer, storage, construct, ligand, mutation, and concentration slots can travel before any beamline run exists.
  • When an experiment crate arrives later, it links back to the same sample @id.
  • BRIDGE can ingest sample-only crates as reference metadata and later join them to runs and files.
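The later join can be sketched as follows. This assumes the sample `@id` is identical in both crates; across facilities that would likely require absolute IRIs rather than `#` fragments. The function names and the row shape are hypothetical and do not describe BRIDGE's actual ingest.

```python
def as_list(value):
    """Treat a missing value as [], a single value as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def samples_by_id(doc):
    """Index lambda:Sample entities from a (sample-only) crate by @id."""
    return {
        entity["@id"]: entity
        for entity in doc["@graph"]
        if "lambda:Sample" in as_list(entity.get("@type"))
    }


def join_runs_to_samples(experiment_doc, samples):
    """Yield (run id, sample id, sample_code or None) for each run/sample link."""
    rows = []
    for entity in experiment_doc["@graph"]:
        if "lambda:ExperimentRun" not in as_list(entity.get("@type")):
            continue
        for ref in as_list(entity.get("object")):
            sample = samples.get(ref["@id"])
            code = sample["lambda:sample_code"] if sample else None
            rows.append((entity["@id"], ref["@id"], code))
    return rows


# Sample-only crate received first.
sample_crate = {"@graph": [
    {"@id": "#sample-lysozyme-42",
     "@type": ["BioChemEntity", "lambda:Sample"],
     "lambda:sample_code": "LYZ-42",
     "lambda:sample_type": "protein"},
]}

# Experiment crate received later; it only references the sample.
experiment_crate = {"@graph": [
    {"@id": "#run-a1",
     "@type": ["Action", "lambda:ExperimentRun"],
     "object": {"@id": "#sample-lysozyme-42"}},
]}

rows = join_runs_to_samples(experiment_crate, samples_by_id(sample_crate))
print(rows)
```

A run that references an unknown sample still produces a row, with `None` as the code, so missing reference metadata is visible instead of silently dropped.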

Concrete JSON-LD

An MX raw collection as RO-Crate entities

The point is not to force an MX directory convention. It is to make the specimen unit, files, instrument context, and identifiers explicit enough to survive projection.

Acquisition dataset plus file and instrument context

{
  "@id": "#acquisition-crystal-a1",
  "@type": ["Dataset", "mx:CrystalMount"],
  "name": "Crystal A1 mounted for MX collection",
  "about": { "@id": "#sample-lysozyme-42" },
  "lambda:instrument": { "@id": "#als-831-detector" },
  "hasPart": [
    { "@id": "images/scan_0001.cbf" },
    { "@id": "images/scan_0002.cbf" }
  ]
},
{
  "@id": "images/scan_0001.cbf",
  "@type": "File",
  "encodingFormat": "image/cbf",
  "contentSize": "18492173",
  "sha256": "7d5d...e0a9"
},
{
  "@id": "#als-831-detector",
  "@type": ["IndividualProduct", "lambda:Instrument"],
  "name": "ALS beamline 8.3.1 detector"
}

Projection to LinkML

The acquisition boundary is a LAMBDA profile decision, not a built-in RO-Crate term and not a current LinkML class. Today it projects through Sample, ExperimentRun, DataFile, and association tables.

Concrete JSON-LD

Workflow provenance is a first-class part of the crate

Derived products should be linked to the action that made them, the inputs it used, and the software context needed to interpret them.

CreateAction for integration

{
  "@id": "#workflow-xds-integration-001",
  "@type": ["CreateAction", "lambda:WorkflowRun"],
  "name": "XDS integration for Crystal A1",
  "lambda:workflow_code": "XDS-INTEGRATE-001",
  "lambda:workflow_type": "integration",
  "lambda:software_name": "XDS",
  "lambda:software_version": "2024.1",
  "object": [
    { "@id": "#acquisition-crystal-a1" },
    { "@id": "images/scan_0001.cbf" }
  ],
  "instrument": [
    { "@id": "#als-831-detector" },
    { "@id": "#xds-2024.1" }
  ],
  "result": [
    { "@id": "process/xds/INTEGRATE.HKL" },
    { "@id": "process/xds/CORRECT.LP" }
  ],
  "lambda:started_at": "2026-05-21T18:23:12Z",
  "lambda:completed_at": "2026-05-21T18:31:44Z"
}

Software and product entities

{
  "@id": "#xds-2024.1",
  "@type": "SoftwareApplication",
  "name": "XDS",
  "softwareVersion": "2024.1"
},
{
  "@id": "process/xds/INTEGRATE.HKL",
  "@type": "File",
  "encodingFormat": "chemical/x-hkl",
  "contentSize": "3224981",
  "sha256": "ee31...9a7c",
  "isBasedOn": { "@id": "#acquisition-crystal-a1" }
}
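Given entities like these, a consumer can answer "what produced this file, and from what?" directly from the graph. The sketch below is illustrative; `lineage` and `ref_ids` are hypothetical helper names, and the graph holds only the action from the example above.

```python
def as_list(value):
    """Treat a missing value as [], a single value as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def ref_ids(value):
    """Extract the @id strings from a reference or list of references."""
    return [item["@id"] for item in as_list(value) if isinstance(item, dict)]


def lineage(graph, product_id):
    """Find the CreateAction whose result includes product_id, if any."""
    for entity in graph:
        if "CreateAction" not in as_list(entity.get("@type")):
            continue
        if product_id in ref_ids(entity.get("result")):
            return {
                "action": entity["@id"],
                "inputs": ref_ids(entity.get("object")),
                "tools": ref_ids(entity.get("instrument")),
            }
    return None


graph = [
    {
        "@id": "#workflow-xds-integration-001",
        "@type": ["CreateAction", "lambda:WorkflowRun"],
        "object": [
            {"@id": "#acquisition-crystal-a1"},
            {"@id": "images/scan_0001.cbf"},
        ],
        "instrument": [
            {"@id": "#als-831-detector"},
            {"@id": "#xds-2024.1"},
        ],
        "result": [
            {"@id": "process/xds/INTEGRATE.HKL"},
            {"@id": "process/xds/CORRECT.LP"},
        ],
    },
]

trace = lineage(graph, "process/xds/INTEGRATE.HKL")
print(trace)
```

Raw files have no producing action in this graph, so `lineage` returns `None` for them.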

Profile strategy

Small core, targeted extensions

Profiles should express validation commitments. They should not mirror every technique, beamline, or directory convention by default.

  • RO-Crate + schema.org + JSON-LD: base interoperability and linked-data semantics
  • LAMBDA Core Profile: Experiment, acquisition datasets, files, products, workflows, instruments, software
  • Projection Profile: rules for LinkML, relational, graph, and file-package views
  • MX: first pilot
  • CryoEM: later extension
  • SAXS / SANS: later extension
  • Facility overlays: ALS, NSLS-II, BioCAT, local beamline terms

Boundary rule

Do not make a profile just because a dataset has a name

A profile is worth maintaining when it changes validation, identity, graph shape, or round-trip behavior. Otherwise it belongs in controlled terms, facility overlays, or examples.

Create a profile when

  • required entities or relationships differ
  • quality rules or cardinalities are technique-specific
  • external standards must be referenced directly
  • converters need stable round-trip guarantees

Keep it out when

  • only file names, folder names, or beamline labels differ
  • a local acronym can be modeled as a term
  • the same core graph already validates correctly
  • the extension would duplicate facility metadata

Holder and logistics vocabulary belongs in MX or facility overlays. LAMBDA Core should capture the stable relationship: specimen unit, acquired data, workflow outputs, and provenance.

MX first

Macromolecular crystallography is a good pilot

  • Acquisition has a crisp specimen unit: a mounted crystal or related holder context.
  • Diffraction image sets, integration outputs, scaling outputs, models, and reports form a clear product chain.
  • Existing domain standards give us names to align with, including PDBx/mmCIF concepts.
  • It tests whether LAMBDA Core can stay neutral while still being useful.

Product chain diagram:

  • mounted crystal: context
  • diffraction: image set
  • beamline: detector, energy
  • integration: workflow run
  • products: reflections, model, report

MX profile sketch

Technique extension on top of LAMBDA Core

  • MXExperiment: LAMBDA Experiment + MX constraints
  • CrystalMount Dataset: loop, puck, cassette, holder terms
  • Diffraction Image Set: CBF, HDF5, Eiger, detector metadata
  • Specimen Context: sample, crystallization, cryoprotection
  • Integration / Scaling Run: CreateAction, software, parameters
  • MX Products: MTZ, mmCIF, map, validation report
  • Deposition / Archive Projection

Relationship labels: about, hasPart, describes, result, uses

Conformance

Validation should be layered, not monolithic

Each layer answers a different question. That makes failures actionable and keeps LAMBDA Core from absorbing every local rule.

1. RO-Crate validity: JSON-LD structure, root dataset, entities resolve
2. LAMBDA Core: required entities, checksums, identifiers, provenance shape
3. Technique profile: MX-specific acquisition and product constraints
4. Projection profile: round trip to LinkML, SQL, graph, file package
5. Facility overlay: beamline terms, storage policy, local identifiers
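A layered validator can be sketched as an ordered list of independent checks, each reporting under its own layer name. Only the first two layers are shown, and both rules are simplified stand-ins: the reference check treats every `{"@id": ...}` value as needing an entity in the graph, and the "every File has `sha256`" rule is a hypothetical Core requirement, not one taken from a published profile.

```python
def as_list(value):
    """Treat a missing value as [], a single value as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def check_rocrate(graph):
    """Layer 1 (simplified): root dataset exists and references resolve."""
    ids = {entity["@id"] for entity in graph}
    errors = [] if "./" in ids else ["no root dataset './'"]
    for entity in graph:
        for key, value in entity.items():
            for item in as_list(value):
                is_ref = isinstance(item, dict) and set(item) == {"@id"}
                if is_ref and item["@id"] not in ids:
                    errors.append(
                        f"{entity['@id']} {key}: unresolved {item['@id']}"
                    )
    return errors


def check_core(graph):
    """Layer 2 (hypothetical rule): every File entity carries sha256."""
    return [
        f"{entity['@id']}: missing sha256"
        for entity in graph
        if "File" in as_list(entity.get("@type")) and "sha256" not in entity
    ]


LAYERS = [
    ("RO-Crate validity", check_rocrate),
    ("LAMBDA Core", check_core),
]


def validate(graph):
    """Run every layer and keep the findings separated by layer name."""
    return {name: check(graph) for name, check in LAYERS}


graph = [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "images/scan_0001.cbf"}, {"@id": "#missing"}]},
    {"@id": "images/scan_0001.cbf", "@type": "File"},
]

report = validate(graph)
for layer, errors in report.items():
    print(layer, errors)
```

Keeping the findings keyed by layer makes the failure actionable: an unresolved reference is a crate-building bug, while a missing checksum is a Core conformance gap.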

Projection targets

One graph, several operational views

LAMBDA RO-Crate graph: stable identities and relationships

  • Attached crate: metadata package with file refs
  • Fuze view: canonical browse layout
  • Relational ingest: LinkML tables and associations
  • Graph database: query across facilities
  • Archive package: preservation and deposit
  • Analysis API: typed objects for tools

Use case

Facility data into BRIDGE / lakehouse without losing meaning

The crate is the handoff object. It can travel with files, point at facility storage, or be regenerated from BRIDGE when a user needs a portable package.

  • Facility: beamline logs, sample sheet
  • Extract: build crate graph, validate profile
  • BRIDGE: metadata tables, object refs
  • Search / Join: facility + sample, workflow + product
  • Object store: raw files, derived products
  • Projection: LinkML package, SQL or graph
  • Get it back: RO-Crate export, analysis-ready view

Use case walkthrough

Three ways the same contract gets used

1. Direct from facility

Beamline extraction creates ro-crate-metadata.json next to native files or URLs. The crate records checksums, instrument context, sample context, and workflow outputs before anything moves.
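Recording integrity metadata in place can be sketched with the Python standard library. `describe_file` is a hypothetical helper; the property names follow the File entities shown earlier in this deck, and the temporary file holds stand-in bytes rather than a real CBF image.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_file(crate_root: Path, relative_path: str) -> dict:
    """Build a File entity with size and SHA-256 without moving the file."""
    path = crate_root / relative_path
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large detector files are not loaded whole.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "@id": relative_path,
        "@type": "File",
        "contentSize": str(path.stat().st_size),
        "sha256": digest.hexdigest(),
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    # Stand-in content for illustration only.
    (root / "images" / "scan_0001.cbf").write_bytes(b"abc")
    entity = describe_file(root, "images/scan_0001.cbf")

print(entity["@id"], entity["contentSize"], entity["sha256"])
```

The entity's `@id` stays relative to the crate root, so the same description remains valid whether the crate sits beside the files or is later packaged with them.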

2. Deposit to BRIDGE

BRIDGE ingests the crate graph, stores profile conformance, projects entities into lakehouse tables, and preserves original @id links to raw and derived objects.

3. Retrieve from BRIDGE

A user queries by experiment, sample, acquisition dataset, data product, or workflow. BRIDGE can return a regenerated RO-Crate, a LinkML package, or an analysis view with the same relationship graph.

The user experience changes by path, but validation and relationships stay anchored to the same RO-Crate graph.

Roadmap

Build the contract as fixtures plus validators

The next artifact should be executable: a crate fixture, a LinkML projection, and validation reports that prove the round trip.

1. Publish Core draft

Define required entities, identifiers, file integrity rules, and workflow provenance.

2. Fill projection gaps

Decide whether acquisition-boundary, product, person, organization, and software concepts stay profile-only or become lambda-ber-schema classes.

3. Make MX fixture

Use one realistic collection with images, integration, scaling, model, and validation products.

4. Round-trip tests

Validate crate to LinkML and LinkML back to graph without losing identities or relationships.
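The shape of such a test can be sketched without any LinkML tooling. Here the projection is a stand-in: entities become attribute rows and `@id` references become (subject, property, object) rows, loosely mirroring flat entities plus association tables. `to_tables` and `to_graph` are hypothetical; a real test would go through lambda-ber-schema classes instead.

```python
def as_list(value):
    """Treat a single value as a one-item list."""
    return value if isinstance(value, list) else [value]


def is_ref(item):
    return isinstance(item, dict) and set(item) == {"@id"}


def to_tables(graph):
    """Split a crate graph into entity rows and association rows."""
    entities, edges = {}, []
    for entity in graph:
        attrs = {}
        for key, value in entity.items():
            if key == "@id":
                continue
            items = as_list(value)
            if items and all(is_ref(item) for item in items):
                edges.extend((entity["@id"], key, item["@id"]) for item in items)
            else:
                attrs[key] = value
        entities[entity["@id"]] = attrs
    return entities, edges


def to_graph(entities, edges):
    """Rebuild a crate graph from entity rows and association rows."""
    graph = {eid: {"@id": eid, **attrs} for eid, attrs in entities.items()}
    for subject, prop, obj in edges:
        graph[subject].setdefault(prop, []).append({"@id": obj})
    return list(graph.values())


fixture = [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "images/scan_0001.cbf"}]},
    {"@id": "images/scan_0001.cbf", "@type": "File",
     "contentSize": "18492173"},
    {"@id": "#run-a1", "@type": ["Action", "lambda:ExperimentRun"],
     "result": {"@id": "images/scan_0001.cbf"}},
]

entities, edges = to_tables(fixture)
rebuilt_entities, rebuilt_edges = to_tables(to_graph(entities, edges))

# Identities and relationships must survive the round trip.
assert rebuilt_entities == entities
assert sorted(rebuilt_edges) == sorted(edges)
print(sorted(edges))
```

The comparison is made on the normalized tables rather than on raw JSON, because a single reference and a one-item list are equivalent in the graph but not byte-identical.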

5. Facility overlays

Add beamline-specific terms only after the shared contract passes validation.

Decision slide

The working vision

  • Use RO-Crate as LAMBDA's canonical semantic exchange layer.
  • Keep LAMBDA Core technique-neutral and limited to shared commitments.
  • Start with MX as the first technique extension, not as the shape of the whole system.
  • Keep mounted-crystal holder and beamline logistics terms in MX or facility overlays.
  • Treat lambda-ber-schema as a conformance-tested projection, not the source of all graph semantics.

Next working artifact: one realistic MX RO-Crate fixture, one LinkML projection, and one validator report.