LAMBDA data organization

RO-Crates as the federation contract

A semantic exchange layer for structural biology data, with lambda-ber-schema as a typed projection for validation, analysis, and ingest.

facility-neutral graph, small core profile, MX pilot extension, projection-tested LinkML

Position

The contract is the graph, not the directory layout

1. RO-Crate is the exchange contract

The crate graph carries identities, relationships, provenance, file integrity, and contextual metadata across facilities.

2. Facilities keep their native storage

A crate can describe files where they already live, then project to attached packages, Fuze views, relational tables, or graph stores.

3. Profiles stay small and composable

LAMBDA Core should define shared semantics. Technique and facility extensions should appear only where the shape or validation actually differs.

RO-Crate background

RO-Crate is a practical JSON-LD package for research data

The RO-Crate 1.2 specification describes a dataset and its context using a flat linked-data graph in ro-crate-metadata.json. It can describe local files, remote URIs, physical things, people, organizations, equipment, software, licenses, and provenance.

Plain JSON shape

A crate is an @graph of entities. Objects refer to each other by @id, so consumers do not have to infer relationships from folders.
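As a minimal sketch of what "refer to each other by @id" means for a consumer, the Python below indexes a tiny graph and follows two references. The three entities and the helper names `index_graph` and `resolve` are illustrative assumptions, not part of RO-Crate or any LAMBDA tooling.

```python
import json

# Hypothetical three-entity crate; the @id values are illustrative only.
CRATE = json.loads("""
{
  "@graph": [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "images/scan_0001.cbf"}]},
    {"@id": "images/scan_0001.cbf", "@type": "File",
     "about": {"@id": "#sample-1"}},
    {"@id": "#sample-1", "@type": "BioChemEntity", "name": "Example sample"}
  ]
}
""")


def index_graph(crate):
    """Map each entity's @id to the entity itself."""
    return {entity["@id"]: entity for entity in crate["@graph"]}


def resolve(index, entity, prop):
    """Follow `prop` from `entity` and return the referenced entities."""
    value = entity.get(prop, [])
    refs = value if isinstance(value, list) else [value]
    return [
        index[ref["@id"]]
        for ref in refs
        if isinstance(ref, dict) and ref.get("@id") in index
    ]


index = index_graph(CRATE)
files = resolve(index, index["./"], "hasPart")
samples = resolve(index, files[0], "about")
print(samples[0]["name"])
```

Nothing in the lookup depends on where `images/scan_0001.cbf` sits on disk; the relationship is carried entirely by the graph.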

Linked semantics

Terms mostly map through schema.org, while profiles add domain requirements without forking the base model.

Packaging neutral

A crate can sit beside files, point to files already in facility storage, or represent remote resources in repositories and databases.

Source: RO-Crate 1.2 introduction and RO-Crate specification page.

External precedent

Bioimaging is already moving in this direction

NFDI4BIOIMAGE and German BioImaging are a useful precedent: they face high-volume instrument data, heterogeneous local systems, cloud-oriented storage, and a need to preserve contextual metadata and workflow provenance.

Sources: NFDI4BIOIMAGE consortium goals; Lubiana, Kunis, Moore, RO-Crates for BioImaging, SWAT4HCLS 2026 abstract.

Why it matters

Structural biology workflows already cross boundaries

The same experiment touches beamline control systems, sample tracking, file stores, workflow engines, archives, and downstream analysis. A graph contract gives those systems a shared reference model.

Architecture

Three layers, one contract

Core graph

Current schema shape: flat entities plus association tables

This view follows the actual lambda-ber-schema model: a Dataset contains flat entity collections, and many-to-many relationships are carried by explicit association classes.

Dataset Study Sample SamplePreparation Instrument ExperimentRun WorkflowRun DataFile Image

RawUnit is not a current schema class or standard RO-Crate term. If we need that boundary, it is a LAMBDA profile/future-schema decision.

Projection

lambda-ber-schema is the typed projection

The RO-Crate graph stays linked and web-native. LinkML gives LAMBDA a strict, testable shape for data packages, relational ingest, schema docs, and downstream APIs.

RO-Crate graph | lambda-ber-schema projection | Role
Crate root Dataset | Dataset, ExperimentRun | Container plus collection sessions.
Acquisition boundary | profile decision, not current class | Usually projects through Sample, ExperimentRun, and file associations.
Files and file sets | DataFile, Image | Validate formats, sizes, checksums, paths, and data types.
CreateAction / processing | WorkflowRun | Capture software, parameters, inputs, outputs, and status.
People, orgs, instruments, software | Instrument + fields / targets | Normalize identifiers and local facility metadata.

Concrete JSON-LD

The crate starts with a descriptor and a root dataset

A LAMBDA crate should be readable as ordinary RO-Crate first, then checked against LAMBDA Core and optional technique profiles through conformsTo.

Minimal metadata document

{
  "@context": [
    "https://w3id.org/ro/crate/1.2/context",
    {
      "lambda": "http://w3id.org/lambda/",
      "mx": "http://w3id.org/lambda/mx/"
    }
  ],
  "@graph": [
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": { "@id": "https://w3id.org/ro/crate/1.2" },
      "about": { "@id": "./" }
    },
    {
      "@id": "./",
      "@type": ["Dataset", "lambda:Dataset"],
      "name": "ALS 8.3.1 MX collection 2026-05-21",
      "conformsTo": [
        { "@id": "https://lambda.berkeley.edu/profiles/core/0.1" },
        { "@id": "https://lambda.berkeley.edu/profiles/mx/0.1" }
      ],
      "hasPart": [
        { "@id": "#acquisition-crystal-a1" },
        { "@id": "images/scan_0001.cbf" }
      ]
    }
  ]
}

What this buys us

The descriptor says which RO-Crate version is being used and which root dataset it describes.
The root dataset is the package-level object LAMBDA can validate and project.
All entities live in one graph and are linked by stable @id values.
Profiles add LAMBDA constraints without breaking base RO-Crate consumers.
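A consumer can act on these points with very little code. The sketch below assumes an attached crate whose descriptor has the @id `ro-crate-metadata.json`, as in the minimal document above; `find_root` and `profile_ids` are hypothetical helper names.

```python
# Descriptor and root dataset copied from the minimal metadata document above.
GRAPH = [
    {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.2"},
        "about": {"@id": "./"},
    },
    {
        "@id": "./",
        "@type": ["Dataset", "lambda:Dataset"],
        "name": "ALS 8.3.1 MX collection 2026-05-21",
        "conformsTo": [
            {"@id": "https://lambda.berkeley.edu/profiles/core/0.1"},
            {"@id": "https://lambda.berkeley.edu/profiles/mx/0.1"},
        ],
    },
]


def find_root(graph):
    """Return the root dataset named by the metadata descriptor's `about`."""
    by_id = {entity["@id"]: entity for entity in graph}
    descriptor = by_id["ro-crate-metadata.json"]
    return by_id[descriptor["about"]["@id"]]


def profile_ids(entity):
    """List conformsTo targets, accepting a single reference or a list."""
    value = entity.get("conformsTo", [])
    refs = value if isinstance(value, list) else [value]
    return [ref["@id"] for ref in refs]


root = find_root(GRAPH)
print(root["name"])
print(profile_ids(root))
```

A base RO-Crate consumer can stop after `find_root`; a LAMBDA-aware one reads `profile_ids(root)` to decide which profile checks to run.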

Schema slot overlay

Some JSON-LD properties can be actual LAMBDA slots

The crate can stay valid RO-Crate while adding lambda: properties that project directly into lambda-ber-schema. Other RO-Crate fields map during ingest.

ExperimentRun and DataFile fields

{
  "@id": "#run-a1",
  "@type": ["Action", "lambda:ExperimentRun"],
  "lambda:experiment_code": "ALS831-20260521-A1",
  "lambda:technique": "xray_crystallography",
  "lambda:beamline": "8.3.1",
  "lambda:number_of_images": { "value": 180 },
  "object": { "@id": "#sample-lysozyme-42" },
  "result": { "@id": "images/scan_0001.cbf" }
},
{
  "@id": "images/scan_0001.cbf",
  "@type": "File",
  "lambda:file_name": "scan_0001.cbf",
  "lambda:file_format": "cbf",
  "lambda:data_type": "diffraction",
  "lambda:checksum": "7d5d...e0a9",
  "encodingFormat": "image/cbf",
  "contentSize": "18492173"
}

JSON-LD property | lambda-ber-schema target | Mapping
lambda:experiment_code | ExperimentRun.experiment_code | Direct schema slot
lambda:technique, lambda:beamline | ExperimentRun slots | Direct schema slots
lambda:file_name, lambda:file_format | DataFile slots | Direct schema slots
encodingFormat, contentSize, sha256 | file_format, file_size_bytes, checksum | RO-Crate fields mapped at ingest
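The two kinds of mapping above could be combined at ingest roughly as follows. This is a sketch, not the lambda-ber-schema ingest code: the precedence rule (a direct lambda: slot wins over an ingest-mapped RO-Crate field) and the integer type for file_size_bytes are assumptions, and `project_data_file` is a hypothetical name.

```python
# RO-Crate property -> DataFile slot, as listed in the mapping above.
INGEST_MAP = {
    "encodingFormat": "file_format",
    "contentSize": "file_size_bytes",
    "sha256": "checksum",
}
LAMBDA_PREFIX = "lambda:"


def project_data_file(entity):
    """Project a File entity into a flat DataFile-style record."""
    record = {}
    # Ingest-time mappings go first so direct lambda: slots can override them
    # (assumed precedence, not confirmed by the schema).
    for ro_key, slot in INGEST_MAP.items():
        if ro_key in entity:
            record[slot] = entity[ro_key]
    for key, value in entity.items():
        if key.startswith(LAMBDA_PREFIX):
            record[key[len(LAMBDA_PREFIX):]] = value
    # contentSize is a string in the crate; an integer slot is assumed here.
    if "file_size_bytes" in record:
        record["file_size_bytes"] = int(record["file_size_bytes"])
    return record


# File entity from the example above; the elided checksum is kept as written.
file_entity = {
    "@id": "images/scan_0001.cbf",
    "@type": "File",
    "lambda:file_name": "scan_0001.cbf",
    "lambda:file_format": "cbf",
    "lambda:data_type": "diffraction",
    "lambda:checksum": "7d5d...e0a9",
    "encodingFormat": "image/cbf",
    "contentSize": "18492173",
}

record = project_data_file(file_entity)
print(record)
```

With this precedence, file_format ends up as the direct slot value rather than the media type, which is one of the choices a Core draft would need to pin down.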

Sample metadata

Yes, sample metadata can be part of a crate

In an experiment crate, the sample is contextual metadata linked to acquisitions and files. In a sample-only handoff, the crate can simply describe samples and their provenance before data collection exists.

Sample entity with actual Sample slots

{
  "@id": "./",
  "@type": "Dataset",
  "name": "Sample metadata package: LYZ-42",
  "about": { "@id": "#sample-lysozyme-42" },
  "hasPart": [{ "@id": "#sample-lysozyme-42" }]
},
{
  "@id": "#sample-lysozyme-42",
  "@type": ["BioChemEntity", "lambda:Sample"],
  "lambda:sample_code": "LYZ-42",
  "lambda:sample_type": "protein",
  "lambda:protein_name": "Hen egg white lysozyme",
  "lambda:organism": { "@id": "NCBITaxon:9031" },
  "lambda:concentration": { "value": 38, "unitText": "mg/mL" },
  "lambda:buffer_composition": {
    "lambda:ph": { "value": 4.6 },
    "lambda:components": ["0.1 M sodium acetate"]
  }
}

How LAMBDA would use it

lambda:sample_code and lambda:sample_type satisfy required Sample slots.
Optional biological context, buffer, storage, construct, ligand, mutation, and concentration slots can travel before any beamline run exists.
When an experiment crate arrives later, it links back to the same sample @id.
BRIDGE can ingest sample-only crates as reference metadata and later join them to runs and files.
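A minimal sketch of the first and third points: check the two required Sample slots, then join a later run back to the sample by @id. The helper names are hypothetical, and the entities are trimmed copies of the examples in this deck.

```python
# The two slots named above as required for Sample.
REQUIRED_SAMPLE_SLOTS = ("lambda:sample_code", "lambda:sample_type")


def missing_sample_slots(entity):
    """Return the required Sample slots that the entity does not carry."""
    return [slot for slot in REQUIRED_SAMPLE_SLOTS if slot not in entity]


def runs_for_sample(sample_id, run_entities):
    """Return the @id of every run whose `object` points at the sample."""
    matches = []
    for run in run_entities:
        value = run.get("object", [])
        refs = value if isinstance(value, list) else [value]
        if any(isinstance(r, dict) and r.get("@id") == sample_id for r in refs):
            matches.append(run["@id"])
    return matches


# Sample from a sample-only crate, received before any data collection.
sample = {
    "@id": "#sample-lysozyme-42",
    "@type": ["BioChemEntity", "lambda:Sample"],
    "lambda:sample_code": "LYZ-42",
    "lambda:sample_type": "protein",
}

# Run from an experiment crate that arrives later.
runs = [
    {
        "@id": "#run-a1",
        "@type": ["Action", "lambda:ExperimentRun"],
        "object": {"@id": "#sample-lysozyme-42"},
    }
]

print(missing_sample_slots(sample))
print(runs_for_sample(sample["@id"], runs))
```

The join only works if both crates use the same sample @id, so identifier stability across crates is the part that needs agreement between facilities.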

Concrete JSON-LD

An MX raw collection as RO-Crate entities

The point is not to force an MX directory convention. It is to make the specimen unit, files, instrument context, and identifiers explicit enough to survive projection.

Acquisition dataset plus file and instrument context

{
  "@id": "#acquisition-crystal-a1",
  "@type": ["Dataset", "mx:CrystalMount"],
  "name": "Crystal A1 mounted for MX collection",
  "about": { "@id": "#sample-lysozyme-42" },
  "lambda:instrument": { "@id": "#als-831-detector" },
  "hasPart": [
    { "@id": "images/scan_0001.cbf" },
    { "@id": "images/scan_0002.cbf" }
  ]
},
{
  "@id": "images/scan_0001.cbf",
  "@type": "File",
  "encodingFormat": "image/cbf",
  "contentSize": "18492173",
  "sha256": "7d5d...e0a9"
},
{
  "@id": "#als-831-detector",
  "@type": ["IndividualProduct", "lambda:Instrument"],
  "name": "ALS beamline 8.3.1 detector"
}

Projection to LinkML

The acquisition boundary is a LAMBDA profile decision, not a built-in RO-Crate term and not a current LinkML class. Today it projects through Sample, ExperimentRun, DataFile, and association tables.

Sample ExperimentRun DataFile Image Instrument
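One way to picture that projection is to flatten the acquisition dataset into link rows, one per file. The row shape below (`sample_id`, `instrument_id`, `data_file_id`) is a hypothetical stand-in; the actual association classes in lambda-ber-schema are not shown in this deck.

```python
# Acquisition dataset from the example above.
acquisition = {
    "@id": "#acquisition-crystal-a1",
    "@type": ["Dataset", "mx:CrystalMount"],
    "about": {"@id": "#sample-lysozyme-42"},
    "lambda:instrument": {"@id": "#als-831-detector"},
    "hasPart": [
        {"@id": "images/scan_0001.cbf"},
        {"@id": "images/scan_0002.cbf"},
    ],
}


def association_rows(dataset):
    """Emit one hypothetical link row per file in the acquisition dataset."""
    sample_id = dataset["about"]["@id"]
    instrument_id = dataset["lambda:instrument"]["@id"]
    return [
        {
            "sample_id": sample_id,
            "instrument_id": instrument_id,
            "data_file_id": part["@id"],
        }
        for part in dataset["hasPart"]
    ]


rows = association_rows(acquisition)
for row in rows:
    print(row)
```

The acquisition @id itself does not appear in the rows, which illustrates the gap: without a class for the acquisition boundary, that identity has to be carried some other way.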

Concrete JSON-LD

Workflow provenance is a first-class part of the crate

Derived products should be linked to the action that made them, the inputs it used, and the software context needed to interpret them.

CreateAction for integration

{
  "@id": "#workflow-xds-integration-001",
  "@type": ["CreateAction", "lambda:WorkflowRun"],
  "name": "XDS integration for Crystal A1",
  "lambda:workflow_code": "XDS-INTEGRATE-001",
  "lambda:workflow_type": "integration",
  "lambda:software_name": "XDS",
  "lambda:software_version": "2024.1",
  "object": [
    { "@id": "#acquisition-crystal-a1" },
    { "@id": "images/scan_0001.cbf" }
  ],
  "instrument": [
    { "@id": "#als-831-detector" },
    { "@id": "#xds-2024.1" }
  ],
  "result": [
    { "@id": "process/xds/INTEGRATE.HKL" },
    { "@id": "process/xds/CORRECT.LP" }
  ],
  "lambda:started_at": "2026-05-21T18:23:12Z",
  "lambda:completed_at": "2026-05-21T18:31:44Z"
}

Software and product entities

{
  "@id": "#xds-2024.1",
  "@type": "SoftwareApplication",
  "name": "XDS",
  "softwareVersion": "2024.1"
},
{
  "@id": "process/xds/INTEGRATE.HKL",
  "@type": "File",
  "encodingFormat": "chemical/x-hkl",
  "contentSize": "3224981",
  "sha256": "ee31...9a7c",
  "isBasedOn": { "@id": "#acquisition-crystal-a1" }
}
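Given entities like these, a consumer can answer "what made this file?" by walking the graph. The sketch below is illustrative; `provenance_of` and the small helpers are hypothetical names, and the graph is a trimmed copy of the two examples above.

```python
GRAPH = [
    {
        "@id": "#workflow-xds-integration-001",
        "@type": ["CreateAction", "lambda:WorkflowRun"],
        "object": [
            {"@id": "#acquisition-crystal-a1"},
            {"@id": "images/scan_0001.cbf"},
        ],
        "instrument": [{"@id": "#als-831-detector"}, {"@id": "#xds-2024.1"}],
        "result": [
            {"@id": "process/xds/INTEGRATE.HKL"},
            {"@id": "process/xds/CORRECT.LP"},
        ],
    },
    {
        "@id": "#xds-2024.1",
        "@type": "SoftwareApplication",
        "name": "XDS",
        "softwareVersion": "2024.1",
    },
    {"@id": "process/xds/INTEGRATE.HKL", "@type": "File"},
]


def ids(value):
    """Return the @id of every reference in a single value or a list."""
    items = value if isinstance(value, list) else [value]
    return [i["@id"] for i in items if isinstance(i, dict) and "@id" in i]


def has_type(entity, type_name):
    """True if the entity's @type (string or list) includes type_name."""
    value = entity.get("@type", [])
    return type_name in (value if isinstance(value, list) else [value])


def provenance_of(file_id, graph):
    """Find the CreateAction that produced file_id, with inputs and software."""
    by_id = {entity["@id"]: entity for entity in graph}
    for entity in graph:
        if has_type(entity, "CreateAction") and file_id in ids(entity.get("result")):
            software = [
                (by_id[i]["name"], by_id[i].get("softwareVersion"))
                for i in ids(entity.get("instrument"))
                if i in by_id and has_type(by_id[i], "SoftwareApplication")
            ]
            return {
                "action": entity["@id"],
                "inputs": ids(entity.get("object")),
                "software": software,
            }
    return None


prov = provenance_of("process/xds/INTEGRATE.HKL", GRAPH)
print(prov)
```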

Profile strategy

Small core, targeted extensions

Profiles should express validation commitments. They should not mirror every technique, beamline, or directory convention by default.

Boundary rule

Do not make a profile just because a dataset has a name

A profile is worth maintaining when it changes validation, identity, graph shape, or round-trip behavior. Otherwise it belongs in controlled terms, facility overlays, or examples.

Yes: create a profile when

required entities or relationships differ
quality rules or cardinalities are technique-specific
external standards must be referenced directly
converters need stable round-trip guarantees

No: keep it out when

only file names, folder names, or beamline labels differ
a local acronym can be modeled as a term
the same core graph already validates correctly
the extension would duplicate facility metadata

Holder and logistics vocabulary belongs in MX or facility overlays. LAMBDA Core should capture the stable relationship: specimen unit, acquired data, workflow outputs, and provenance.

MX first

Macromolecular crystallography is a good pilot

Acquisition has a crisp specimen unit: a mounted crystal or related holder context.
Diffraction image sets, integration outputs, scaling outputs, models, and reports form a clear product chain.
Existing domain standards give us names to align with, including PDBx/mmCIF concepts.
It tests whether LAMBDA Core can stay neutral while still being useful.

MX profile sketch

Technique extension on top of LAMBDA Core

Conformance

Validation should be layered, not monolithic

Each layer answers a different question. That makes failures actionable and keeps LAMBDA Core from absorbing every local rule.
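A rough sketch of what layering could look like in code, assuming three layers (base RO-Crate, LAMBDA Core, MX). The individual rules are placeholders invented for illustration; the real Core and MX rules are still to be defined.

```python
def types(entity):
    """Return @type as a list, whether it was a string or a list."""
    value = entity.get("@type", [])
    return value if isinstance(value, list) else [value]


def check_base(graph):
    """Layer 1: is there a descriptor that points at a root dataset?"""
    by_id = {e["@id"]: e for e in graph}
    descriptor = by_id.get("ro-crate-metadata.json")
    if descriptor is None:
        return ["missing metadata descriptor"]
    if descriptor.get("about", {}).get("@id") not in by_id:
        return ["descriptor does not point at a root dataset"]
    return []


def check_core(graph):
    """Layer 2 (placeholder rule): every File carries a sha256 checksum."""
    return [
        f"{e['@id']}: File has no sha256"
        for e in graph
        if "File" in types(e) and "sha256" not in e
    ]


def check_mx(graph):
    """Layer 3 (placeholder rule): at least one mx:CrystalMount exists."""
    if any("mx:CrystalMount" in types(e) for e in graph):
        return []
    return ["no mx:CrystalMount entity found"]


LAYERS = [("ro-crate", check_base), ("lambda-core", check_core), ("lambda-mx", check_mx)]


def validate(graph):
    """Run each layer and report failures under the layer that found them."""
    report = {}
    for name, check in LAYERS:
        report[name] = check(graph)
        if name == "ro-crate" and report[name]:
            break  # later layers assume a readable crate
    return report


GRAPH = [
    {"@id": "ro-crate-metadata.json", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset"},
    {"@id": "#acquisition-crystal-a1", "@type": ["Dataset", "mx:CrystalMount"]},
    {"@id": "images/scan_0001.cbf", "@type": "File"},
]

report = validate(GRAPH)
print(report)
```

Because each failure is filed under its layer, a facility can tell whether a crate is unreadable, fails a shared commitment, or only misses a technique-specific rule.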

Projection targets

One graph, several operational views

Use case

Facility data into BRIDGE / lakehouse without losing meaning

The crate is the handoff object. It can travel with files, point at facility storage, or be regenerated from BRIDGE when a user needs a portable package.

Use case walkthrough

Three ways the same contract gets used

1. Direct from facility

Beamline extraction creates ro-crate-metadata.json next to native files or URLs. The crate records checksums, instrument context, sample context, and workflow outputs before anything moves.

2. Deposit to BRIDGE

BRIDGE ingests the crate graph, stores profile conformance, projects entities into lakehouse tables, and preserves original @id links to raw and derived objects.

3. Retrieve from BRIDGE

A user queries by experiment, sample, acquisition dataset, data product, or workflow. BRIDGE can return a regenerated RO-Crate, a LinkML package, or an analysis view with the same relationship graph.

The user experience changes by path, but validation and relationships stay anchored to the same RO-Crate graph.
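For the "Direct from facility" path, recording file integrity before anything moves could look like the sketch below. It uses only the Python standard library; `file_entity` is a hypothetical helper, and the three-byte temporary file stands in for a real diffraction image.

```python
import hashlib
import os
import tempfile


def file_entity(path, crate_id, encoding_format):
    """Describe a native file as a File entity with size and SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large images are not loaded into memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return {
        "@id": crate_id,
        "@type": "File",
        "encodingFormat": encoding_format,
        "contentSize": str(os.path.getsize(path)),
        "sha256": digest.hexdigest(),
    }


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for a file that already lives in facility storage.
    native_path = os.path.join(workdir, "scan_0001.cbf")
    with open(native_path, "wb") as handle:
        handle.write(b"abc")
    entity = file_entity(native_path, "images/scan_0001.cbf", "image/cbf")

print(entity["contentSize"], entity["sha256"][:8])
```

The @id is passed in separately from the path, so the same helper works whether the crate sits beside the file or refers to it by URL.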

Roadmap

Build the contract as fixtures plus validators

The next artifact should be executable: a crate fixture, a LinkML projection, and validation reports that prove the round trip.

1. Publish Core draft

Define required entities, identifiers, file integrity rules, and workflow provenance.

2. Fill projection gaps

Decide whether acquisition-boundary, product, person, organization, and software concepts stay profile-only or become lambda-ber-schema classes.

3. Make MX fixture

Use one realistic collection with images, integration, scaling, model, and validation products.

4. Round-trip tests

Validate crate to LinkML and LinkML back to graph without losing identities or relationships.
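The shape of such a test can be sketched without the real projection. Below, `to_tables` and `to_graph` are stand-ins (node rows plus link rows) for the LinkML projection and its inverse; the round-trip check itself compares the set of @id values and the set of reference triples.

```python
def edges(graph):
    """Collect (subject, property, target) for every @id reference."""
    found = set()
    for entity in graph:
        for prop, value in entity.items():
            if prop.startswith("@"):
                continue
            for item in value if isinstance(value, list) else [value]:
                if isinstance(item, dict) and "@id" in item:
                    found.add((entity["@id"], prop, item["@id"]))
    return found


def to_tables(graph):
    """Stand-in projection: literal values per node, plus link rows."""
    nodes, links = {}, []
    for entity in graph:
        literals = {}
        for prop, value in entity.items():
            if prop == "@id":
                continue
            items = value if isinstance(value, list) else [value]
            if items and all(isinstance(i, dict) and "@id" in i for i in items):
                links.extend((entity["@id"], prop, i["@id"]) for i in items)
            else:
                literals[prop] = value
        nodes[entity["@id"]] = literals
    return nodes, links


def to_graph(nodes, links):
    """Rebuild a graph from node rows and link rows."""
    rebuilt = {nid: {"@id": nid, **literals} for nid, literals in nodes.items()}
    for subject, prop, target in links:
        rebuilt[subject].setdefault(prop, []).append({"@id": target})
    return list(rebuilt.values())


GRAPH = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "images/scan_0001.cbf"}]},
    {"@id": "#run-a1", "@type": "Action",
     "object": {"@id": "#sample-lysozyme-42"},
     "result": {"@id": "images/scan_0001.cbf"}},
    {"@id": "images/scan_0001.cbf", "@type": "File", "contentSize": "18492173"},
    {"@id": "#sample-lysozyme-42", "@type": "BioChemEntity"},
]

round_tripped = to_graph(*to_tables(GRAPH))
same_ids = {e["@id"] for e in GRAPH} == {e["@id"] for e in round_tripped}
same_edges = edges(GRAPH) == edges(round_tripped)
print(same_ids, same_edges)
```

Comparing triples rather than raw JSON means harmless differences, such as a single reference coming back as a one-element list, do not count as a failed round trip.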

5. Facility overlays

Add beamline-specific terms only after the shared contract is passing.

Decision slide

The working vision

Use RO-Crate as LAMBDA's canonical semantic exchange layer.
Keep LAMBDA Core technique-neutral and limited to shared commitments.
Start with MX as the first technique extension, not as the shape of the whole system.
Keep mounted-crystal holder and beamline logistics terms in MX or facility overlays.
Treat lambda-ber-schema as a conformance-tested projection, not the source of all graph semantics.

Next working artifact: one realistic MX RO-Crate fixture, one LinkML projection, and one validator report.
