LAMBDA data organization
RO-Crates as the federation contract
A semantic exchange layer for structural biology data, with lambda-ber-schema as a typed projection for validation, analysis, and ingest.
Position
The contract is the graph, not the directory layout
RO-Crate is the exchange contract
The crate graph carries identities, relationships, provenance, file integrity, and contextual metadata across facilities.
Facilities keep their native storage
A crate can describe files where they already live, then project to attached packages, Fuze views, relational tables, or graph stores.
Profiles stay small and composable
LAMBDA Core should define shared semantics. Technique and facility extensions should appear only where the shape or validation actually differs.
RO-Crate background
RO-Crate is a practical JSON-LD package for research data
The RO-Crate 1.2 specification describes a dataset and its context using a flat linked-data graph in ro-crate-metadata.json. It can describe local files, remote URIs, physical things, people, organizations, equipment, software, licenses, and provenance.
Plain JSON shape
A crate is an @graph of entities. Objects refer to each other by @id, so consumers do not have to infer relationships from folders.
Linked semantics
Terms mostly map through schema.org, while profiles add domain requirements without forking the base model.
Packaging neutral
A crate can sit beside files, point to files already in facility storage, or represent remote resources in repositories and databases.
Source: RO-Crate 1.2 introduction and RO-Crate specification page.
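The plain-JSON shape above can be illustrated with a short Python sketch. It assumes a tiny made-up crate (the entity ids are illustrative, not a LAMBDA fixture) and shows how a consumer might index the @graph by @id and follow references instead of inferring structure from folders. The helper names index_graph and resolve are hypothetical.

```python
import json

# Tiny illustrative crate; ids and values are made up for this sketch.
CRATE_JSON = """
{
  "@context": "https://w3id.org/ro/crate/1.2/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "images/scan_0001.cbf"}]},
    {"@id": "images/scan_0001.cbf", "@type": "File",
     "encodingFormat": "image/cbf"}
  ]
}
"""


def index_graph(crate):
    """Map each entity's @id to the entity itself."""
    return {entity["@id"]: entity for entity in crate["@graph"]}


def resolve(index, value):
    """Follow {"@id": ...} references; accepts one reference or a list."""
    refs = value if isinstance(value, list) else [value]
    return [index[ref["@id"]] for ref in refs if ref.get("@id") in index]


crate = json.loads(CRATE_JSON)
index = index_graph(crate)

# The descriptor's "about" points at the root dataset.
root = resolve(index, index["ro-crate-metadata.json"]["about"])[0]
parts = resolve(index, root["hasPart"])
print([part["@id"] for part in parts])
```

Nothing here depends on where the file physically lives, which is the point of the packaging-neutral design.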
External precedent
Bioimaging is already moving in this direction
NFDI4BIOIMAGE and German BioImaging are a useful precedent: they face high-volume instrument data, heterogeneous local systems, cloud-oriented storage, and a need to preserve contextual metadata and workflow provenance.
Sources: NFDI4BIOIMAGE consortium goals; Lubiana, Kunis, Moore, RO-Crates for BioImaging, SWAT4HCLS 2026 abstract.
Why it matters
Structural biology workflows already cross boundaries
The same experiment touches beamline control systems, sample tracking, file stores, workflow engines, archives, and downstream analysis. A graph contract gives those systems a shared reference model.
Architecture
Three layers, one contract
Core graph
Current schema shape: flat entities plus association tables
This view follows the actual lambda-ber-schema model: a Dataset contains flat entity collections, and many-to-many relationships are carried by explicit association classes.
RawUnit is not a current schema class or standard RO-Crate term. If we need that boundary, it is a LAMBDA profile/future-schema decision.
Projection
lambda-ber-schema is the typed projection
The RO-Crate graph stays linked and web-native. LinkML gives LAMBDA a strict, testable shape for data packages, relational ingest, schema docs, and downstream APIs.
[Diagram: projection to Sample, ExperimentRun, and file associations; Instrument + fields / targets]
Concrete JSON-LD
The crate starts with a descriptor and a root dataset
A LAMBDA crate should be readable as ordinary RO-Crate first, then checked against LAMBDA Core and optional technique profiles through conformsTo.
Minimal metadata document
{
"@context": [
"https://w3id.org/ro/crate/1.2/context",
{
"lambda": "http://w3id.org/lambda/",
"mx": "http://w3id.org/lambda/mx/"
}
],
"@graph": [
{
"@id": "ro-crate-metadata.json",
"@type": "CreativeWork",
"conformsTo": { "@id": "https://w3id.org/ro/crate/1.2" },
"about": { "@id": "./" }
},
{
"@id": "./",
"@type": ["Dataset", "lambda:Dataset"],
"name": "ALS 8.3.1 MX collection 2026-05-21",
"conformsTo": [
{ "@id": "https://lambda.berkeley.edu/profiles/core/0.1" },
{ "@id": "https://lambda.berkeley.edu/profiles/mx/0.1" }
],
"hasPart": [
{ "@id": "#acquisition-crystal-a1" },
{ "@id": "images/scan_0001.cbf" }
]
}
]
}
What this buys us
- The descriptor says which RO-Crate version is being used and which root dataset it describes.
- The root dataset is the package-level object LAMBDA can validate and project.
- All entities live in one graph and are linked by stable @id values.
- Profiles add LAMBDA constraints without breaking base RO-Crate consumers.
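As a rough illustration of how a consumer might use conformsTo, the Python sketch below reads the declared profiles from the root dataset and picks which checks to run. The profile URLs are the ones in the metadata document above; the registry and its validator names are hypothetical placeholders, not an existing LAMBDA API.

```python
PROFILE_CORE = "https://lambda.berkeley.edu/profiles/core/0.1"
PROFILE_MX = "https://lambda.berkeley.edu/profiles/mx/0.1"

# Descriptor and root dataset, trimmed from the minimal metadata document.
graph = [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "conformsTo": {"@id": "https://w3id.org/ro/crate/1.2"},
     "about": {"@id": "./"}},
    {"@id": "./", "@type": ["Dataset", "lambda:Dataset"],
     "conformsTo": [{"@id": PROFILE_CORE}, {"@id": PROFILE_MX}]},
]


def as_list(value):
    """JSON-LD values may be absent, single, or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def root_dataset(graph):
    """Find the root dataset through the descriptor's "about" link."""
    index = {entity["@id"]: entity for entity in graph}
    return index[index["ro-crate-metadata.json"]["about"]["@id"]]


def declared_profiles(graph):
    """Profile ids the root dataset claims to conform to."""
    return [ref["@id"] for ref in as_list(root_dataset(graph).get("conformsTo"))]


def select_validators(graph, registry):
    """Pick validators for known profiles; unknown profiles are skipped."""
    return [registry[p] for p in declared_profiles(graph) if p in registry]


# Hypothetical registry: a consumer that only knows LAMBDA Core.
print(select_validators(graph, {PROFILE_CORE: "core-checks"}))
```

A consumer that knows only the core profile can still process the crate, which is what lets profiles add constraints without breaking base RO-Crate readers.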
Schema slot overlay
Some JSON-LD properties can be actual LAMBDA slots
The crate can stay valid RO-Crate while adding lambda: properties that project directly into lambda-ber-schema. Other RO-Crate fields map during ingest.
ExperimentRun and DataFile fields
{
"@id": "#run-a1",
"@type": ["Action", "lambda:ExperimentRun"],
"lambda:experiment_code": "ALS831-20260521-A1",
"lambda:technique": "xray_crystallography",
"lambda:beamline": "8.3.1",
"lambda:number_of_images": { "value": 180 },
"object": { "@id": "#sample-lysozyme-42" },
"result": { "@id": "images/scan_0001.cbf" }
},
{
"@id": "images/scan_0001.cbf",
"@type": "File",
"lambda:file_name": "scan_0001.cbf",
"lambda:file_format": "cbf",
"lambda:data_type": "diffraction",
"lambda:checksum": "7d5d...e0a9",
"encodingFormat": "image/cbf",
"contentSize": "18492173"
}
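One way the ingest mapping for file fields could work is sketched below in Python. It assumes that encodingFormat, contentSize, and sha256 pair with file_format, file_size_bytes, and checksum in that order, and that an explicit lambda: property wins over a mapped RO-Crate field. Both rules and the function name project_file are assumptions for illustration, not confirmed lambda-ber-schema behavior.

```python
# Assumed pairing of RO-Crate file properties to DataFile slots.
FIELD_MAP = {
    "encodingFormat": "file_format",
    "contentSize": "file_size_bytes",
    "sha256": "checksum",
}


def project_file(entity):
    """Project a crate File entity into a flat DataFile-like row."""
    # Default the file name from the last path segment of the @id.
    row = {"file_name": entity["@id"].rsplit("/", 1)[-1]}

    # Step 1: map generic RO-Crate fields during ingest.
    for crate_key, slot in FIELD_MAP.items():
        if crate_key in entity:
            row[slot] = entity[crate_key]

    # Step 2: explicit lambda: slots override mapped values (assumption).
    for key, value in entity.items():
        if key.startswith("lambda:"):
            row[key[len("lambda:"):]] = value

    # contentSize is a string in the crate; the row stores an integer.
    if "file_size_bytes" in row:
        row["file_size_bytes"] = int(row["file_size_bytes"])
    return row


# The File entity from the fragment above, copied as written.
file_entity = {
    "@id": "images/scan_0001.cbf",
    "@type": "File",
    "lambda:file_name": "scan_0001.cbf",
    "lambda:file_format": "cbf",
    "lambda:data_type": "diffraction",
    "lambda:checksum": "7d5d...e0a9",
    "encodingFormat": "image/cbf",
    "contentSize": "18492173",
}
row = project_file(file_entity)
print(row["file_format"], row["file_size_bytes"])
```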
RO-Crate fields encodingFormat, contentSize, and sha256 map to file_format, file_size_bytes, and checksum during ingest.
Sample metadata
Yes, sample metadata can be part of a crate
In an experiment crate, the sample is contextual metadata linked to acquisitions and files. In a sample-only handoff, the crate can simply describe samples and their provenance before data collection exists.
Sample entity with actual Sample slots
{
"@id": "./",
"@type": "Dataset",
"name": "Sample metadata package: LYZ-42",
"about": { "@id": "#sample-lysozyme-42" },
"hasPart": [{ "@id": "#sample-lysozyme-42" }]
},
{
"@id": "#sample-lysozyme-42",
"@type": ["BioChemEntity", "lambda:Sample"],
"lambda:sample_code": "LYZ-42",
"lambda:sample_type": "protein",
"lambda:protein_name": "Hen egg white lysozyme",
"lambda:organism": { "@id": "NCBITaxon:9031" },
"lambda:concentration": { "value": 38, "unitText": "mg/mL" },
"lambda:buffer_composition": {
"lambda:ph": { "value": 4.6 },
"lambda:components": ["0.1 M sodium acetate"]
}
}
How LAMBDA would use it
- lambda:sample_code and lambda:sample_type satisfy required Sample slots.
- Optional biological context, buffer, storage, construct, ligand, mutation, and concentration slots can travel before any beamline run exists.
- When an experiment crate arrives later, it links back to the same sample @id.
- BRIDGE can ingest sample-only crates as reference metadata and later join them to runs and files.
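The late join described above might look like the following Python sketch. It assumes the sample @id is kept stable across the sample-only crate and the later experiment crate, and that a run points at its sample through object, as in the ExperimentRun fragment earlier. The function names are hypothetical and do not describe BRIDGE internals.

```python
def entities_of_type(graph, type_name):
    """Return entities whose @type (string or list) includes type_name."""
    found = []
    for entity in graph:
        types = entity.get("@type", [])
        types = types if isinstance(types, list) else [types]
        if type_name in types:
            found.append(entity)
    return found


def join_runs_to_samples(sample_graph, experiment_graph):
    """Pair each run with the sample code it references, or None if unknown."""
    samples = {s["@id"]: s
               for s in entities_of_type(sample_graph, "lambda:Sample")}
    joined = []
    for run in entities_of_type(experiment_graph, "lambda:ExperimentRun"):
        sample = samples.get(run.get("object", {}).get("@id"))
        code = sample["lambda:sample_code"] if sample else None
        joined.append((run["@id"], code))
    return joined


# Sample-only crate ingested first (trimmed from the example above).
sample_graph = [
    {"@id": "#sample-lysozyme-42",
     "@type": ["BioChemEntity", "lambda:Sample"],
     "lambda:sample_code": "LYZ-42",
     "lambda:sample_type": "protein"},
]

# Experiment crate that arrives later and refers to the same sample @id.
experiment_graph = [
    {"@id": "#run-a1",
     "@type": ["Action", "lambda:ExperimentRun"],
     "object": {"@id": "#sample-lysozyme-42"},
     "result": {"@id": "images/scan_0001.cbf"}},
]

joined = join_runs_to_samples(sample_graph, experiment_graph)
print(joined)
```

A run that references an unknown sample yields None instead of failing, so ingest order does not have to be strict.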
Concrete JSON-LD
An MX raw collection as RO-Crate entities
The point is not to force an MX directory convention. It is to make the specimen unit, files, instrument context, and identifiers explicit enough to survive projection.
Acquisition dataset plus file and instrument context
{
"@id": "#acquisition-crystal-a1",
"@type": ["Dataset", "mx:CrystalMount"],
"name": "Crystal A1 mounted for MX collection",
"about": { "@id": "#sample-lysozyme-42" },
"lambda:instrument": { "@id": "#als-831-detector" },
"hasPart": [
{ "@id": "images/scan_0001.cbf" },
{ "@id": "images/scan_0002.cbf" }
]
},
{
"@id": "images/scan_0001.cbf",
"@type": "File",
"encodingFormat": "image/cbf",
"contentSize": "18492173",
"sha256": "7d5d...e0a9"
},
{
"@id": "#als-831-detector",
"@type": ["IndividualProduct", "lambda:Instrument"],
"name": "ALS beamline 8.3.1 detector"
}
Projection to LinkML
The acquisition boundary is a LAMBDA profile decision, not a built-in RO-Crate term and not a current LinkML class. Today it projects through Sample, ExperimentRun, DataFile, and association tables.
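A minimal Python sketch of that projection follows. It flattens the acquisition dataset's about and hasPart links into association rows. The row shape (acquisition_id, sample_id, file_id) is a hypothetical stand-in for whatever association classes lambda-ber-schema actually defines.

```python
def as_list(value):
    """Normalize an absent, single, or list value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def project_acquisition(acquisition):
    """Turn one acquisition dataset into flat sample-to-file association rows."""
    sample_ids = [ref["@id"] for ref in as_list(acquisition.get("about"))]
    file_ids = [ref["@id"] for ref in as_list(acquisition.get("hasPart"))]
    return [
        {"acquisition_id": acquisition["@id"],
         "sample_id": sample_id,
         "file_id": file_id}
        for sample_id in sample_ids
        for file_id in file_ids
    ]


# The acquisition dataset from the fragment above.
acquisition = {
    "@id": "#acquisition-crystal-a1",
    "@type": ["Dataset", "mx:CrystalMount"],
    "about": {"@id": "#sample-lysozyme-42"},
    "hasPart": [
        {"@id": "images/scan_0001.cbf"},
        {"@id": "images/scan_0002.cbf"},
    ],
}

rows = project_acquisition(acquisition)
for row in rows:
    print(row)
```

Keeping the acquisition @id as a column preserves the boundary in the relational view even though no dedicated class exists for it yet.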
Concrete JSON-LD
Workflow provenance is a first-class part of the crate
Derived products should be linked to the action that made them, the inputs it used, and the software context needed to interpret them.
CreateAction for integration
{
"@id": "#workflow-xds-integration-001",
"@type": ["CreateAction", "lambda:WorkflowRun"],
"name": "XDS integration for Crystal A1",
"lambda:workflow_code": "XDS-INTEGRATE-001",
"lambda:workflow_type": "integration",
"lambda:software_name": "XDS",
"lambda:software_version": "2024.1",
"object": [
{ "@id": "#acquisition-crystal-a1" },
{ "@id": "images/scan_0001.cbf" }
],
"instrument": [
{ "@id": "#als-831-detector" },
{ "@id": "#xds-2024.1" }
],
"result": [
{ "@id": "process/xds/INTEGRATE.HKL" },
{ "@id": "process/xds/CORRECT.LP" }
],
"lambda:started_at": "2026-05-21T18:23:12Z",
"lambda:completed_at": "2026-05-21T18:31:44Z"
}
Software and product entities
{
"@id": "#xds-2024.1",
"@type": "SoftwareApplication",
"name": "XDS",
"softwareVersion": "2024.1"
},
{
"@id": "process/xds/INTEGRATE.HKL",
"@type": "File",
"encodingFormat": "chemical/x-hkl",
"contentSize": "3224981",
"sha256": "ee31...9a7c",
"isBasedOn": { "@id": "#acquisition-crystal-a1" }
}
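To show what the provenance links make possible, the Python sketch below walks from a derived file back to the action that produced it, the inputs it used, and the software involved. It uses the entities from the two fragments above; the lineage function and its record layout are assumptions for illustration.

```python
def as_list(value):
    """Normalize an absent, single, or list value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def ref_ids(value):
    """Extract @id strings from one reference or a list of references."""
    return [item["@id"] for item in as_list(value)
            if isinstance(item, dict) and "@id" in item]


def lineage(graph, file_id):
    """Find CreateActions whose result includes file_id and summarize them."""
    index = {entity["@id"]: entity for entity in graph}
    records = []
    for entity in graph:
        if "CreateAction" not in as_list(entity.get("@type")):
            continue
        if file_id not in ref_ids(entity.get("result")):
            continue
        # Only instrument entries typed SoftwareApplication count as software.
        software = [
            f'{index[i]["name"]} {index[i]["softwareVersion"]}'
            for i in ref_ids(entity.get("instrument"))
            if i in index
            and "SoftwareApplication" in as_list(index[i].get("@type"))
        ]
        records.append({
            "action": entity["@id"],
            "inputs": ref_ids(entity.get("object")),
            "software": software,
        })
    return records


graph = [
    {"@id": "#workflow-xds-integration-001",
     "@type": ["CreateAction", "lambda:WorkflowRun"],
     "object": [{"@id": "#acquisition-crystal-a1"},
                {"@id": "images/scan_0001.cbf"}],
     "instrument": [{"@id": "#als-831-detector"}, {"@id": "#xds-2024.1"}],
     "result": [{"@id": "process/xds/INTEGRATE.HKL"},
                {"@id": "process/xds/CORRECT.LP"}]},
    {"@id": "#xds-2024.1", "@type": "SoftwareApplication",
     "name": "XDS", "softwareVersion": "2024.1"},
    {"@id": "#als-831-detector",
     "@type": ["IndividualProduct", "lambda:Instrument"],
     "name": "ALS beamline 8.3.1 detector"},
    {"@id": "process/xds/INTEGRATE.HKL", "@type": "File"},
]

records = lineage(graph, "process/xds/INTEGRATE.HKL")
print(records)
```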
Profile strategy
Small core, targeted extensions
Profiles should express validation commitments. They should not mirror every technique, beamline, or directory convention by default.
Boundary rule
Do not make a profile just because a dataset has a name
A profile is worth maintaining when it changes validation, identity, graph shape, or round-trip behavior. Otherwise it belongs in controlled terms, facility overlays, or examples.
Yes: Create a profile when
- required entities or relationships differ
- quality rules or cardinalities are technique-specific
- external standards must be referenced directly
- converters need stable round-trip guarantees
No: Keep it out when
- only file names, folder names, or beamline labels differ
- a local acronym can be modeled as a term
- the same core graph already validates correctly
- the extension would duplicate facility metadata
Holder and logistics vocabulary belongs in MX or facility overlays. LAMBDA Core should capture the stable relationship: specimen unit, acquired data, workflow outputs, and provenance.
MX first
Macromolecular crystallography is a good pilot
- Acquisition has a crisp specimen unit: a mounted crystal or related holder context.
- Diffraction image sets, integration outputs, scaling outputs, models, and reports form a clear product chain.
- Existing domain standards give us names to align with, including PDBx/mmCIF concepts.
- It tests whether LAMBDA Core can stay neutral while still being useful.
MX profile sketch
Technique extension on top of LAMBDA Core
Conformance
Validation should be layered, not monolithic
Each layer answers a different question. That makes failures actionable and keeps LAMBDA Core from absorbing every local rule.
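A layered validator could be organized roughly like the Python sketch below, where each layer is a separate function and the report keeps failures attributed to the layer that raised them. The three layer names and the specific rules (a File needs sha256, an MX crate needs an mx:CrystalMount) are hypothetical examples, not published LAMBDA requirements.

```python
def as_list(value):
    """Normalize an absent, single, or list value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def has_type(entity, type_name):
    return type_name in as_list(entity.get("@type"))


def check_base(graph):
    """Layer 1: is this a readable RO-Crate at all?"""
    index = {entity["@id"]: entity for entity in graph}
    descriptor = index.get("ro-crate-metadata.json")
    if descriptor is None:
        return ["missing metadata descriptor"]
    root = index.get(descriptor.get("about", {}).get("@id"))
    if root is None or not has_type(root, "Dataset"):
        return ["descriptor does not point at a root Dataset"]
    return []


def check_core(graph):
    """Layer 2 (assumed rule): every File carries a checksum."""
    return [f'{e["@id"]}: File has no sha256'
            for e in graph if has_type(e, "File") and "sha256" not in e]


def check_mx(graph):
    """Layer 3 (assumed rule): an MX crate has a mounted-crystal acquisition."""
    if any(has_type(e, "mx:CrystalMount") for e in graph):
        return []
    return ["no mx:CrystalMount acquisition found"]


LAYERS = [("ro-crate", check_base),
          ("lambda-core", check_core),
          ("lambda-mx", check_mx)]


def validate(graph):
    """Run every layer and keep the errors grouped by layer name."""
    return {name: check(graph) for name, check in LAYERS}


graph = [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "about": {"@id": "./"}},
    {"@id": "./", "@type": ["Dataset", "lambda:Dataset"]},
    {"@id": "#acquisition-crystal-a1", "@type": ["Dataset", "mx:CrystalMount"]},
    {"@id": "images/scan_0001.cbf", "@type": "File"},  # checksum omitted
]

report = validate(graph)
print(report)
```

Because the report is keyed by layer, a missing checksum shows up as a core failure without implying that the crate is unreadable or that the MX shape is wrong.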
Projection targets
One graph, several operational views
Use case
Facility data into BRIDGE / lakehouse without losing meaning
The crate is the handoff object. It can travel with files, point at facility storage, or be regenerated from BRIDGE when a user needs a portable package.
Use case walkthrough
Three ways the same contract gets used
1. Direct from facility
Beamline extraction creates ro-crate-metadata.json next to native files or URLs. The crate records checksums, instrument context, sample context, and workflow outputs before anything moves.
2. Deposit to BRIDGE
BRIDGE ingests the crate graph, stores profile conformance, projects entities into lakehouse tables, and preserves original @id links to raw and derived objects.
3. Retrieve from BRIDGE
A user queries by experiment, sample, acquisition dataset, data product, or workflow. BRIDGE can return a regenerated RO-Crate, a LinkML package, or an analysis view with the same relationship graph.
The user experience changes by path, but validation and relationships stay anchored to the same RO-Crate graph.
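For the direct-from-facility path, recording integrity data before anything moves might look like this Python sketch. It uses only the standard library to hash a file in place and emit a File entity with the same contentSize and sha256 keys used in the examples earlier. The describe_file helper and the temporary stand-in file are illustrative assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_file(path, crate_root):
    """Build a crate File entity for a file that stays where it is."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large detector images are not loaded whole.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "@id": Path(path).relative_to(crate_root).as_posix(),
        "@type": "File",
        "contentSize": str(Path(path).stat().st_size),
        "sha256": digest.hexdigest(),
    }


# Demo with a small stand-in file; real use would point at native storage.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    image = root / "images" / "scan_0001.cbf"
    image.parent.mkdir()
    image.write_bytes(b"abc")
    entity = describe_file(image, root)

print(entity["@id"], entity["contentSize"])
```

The @id is relative to the crate root, so the same entity stays valid whether the crate sits beside the files or is regenerated later.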
Roadmap
Build the contract as fixtures plus validators
The next artifact should be executable: a crate fixture, a LinkML projection, and validation reports that prove the round trip.
Publish Core draft
Define required entities, identifiers, file integrity rules, and workflow provenance.
Fill projection gaps
Decide whether acquisition-boundary, product, person, organization, and software concepts stay profile-only or become lambda-ber-schema classes.
Make MX fixture
Use one realistic collection with images, integration, scaling, model, and validation products.
Round-trip tests
Validate crate to LinkML and LinkML back to graph without losing identities or relationships.
Facility overlays
Add beamline-specific terms only after the shared contract is passing.
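The round-trip step in this roadmap could be checked with something like the Python sketch below. It compares the set of identities and the set of @id-to-@id relationships before and after a projection. The projection itself is out of scope here; the restored graph is a hand-made stand-in that deliberately drops one link, and the function names are hypothetical.

```python
def ids(graph):
    """All entity identities in the graph."""
    return {entity["@id"] for entity in graph}


def edges(graph):
    """All (subject, property, object) links expressed as @id references."""
    found = set()
    for entity in graph:
        for key, value in entity.items():
            if key.startswith("@"):
                continue
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, dict) and "@id" in item:
                    found.add((entity["@id"], key, item["@id"]))
    return found


def round_trip_report(original, restored):
    """List identities and relationships that did not survive the round trip."""
    return {
        "lost_ids": sorted(ids(original) - ids(restored)),
        "lost_edges": sorted(edges(original) - edges(restored)),
    }


original = [
    {"@id": "#run-a1",
     "object": {"@id": "#sample-lysozyme-42"},
     "result": {"@id": "images/scan_0001.cbf"}},
    {"@id": "#sample-lysozyme-42"},
    {"@id": "images/scan_0001.cbf"},
]

# Stand-in for "crate to LinkML and back": the run lost its sample link.
restored = [
    {"@id": "#run-a1", "result": {"@id": "images/scan_0001.cbf"}},
    {"@id": "#sample-lysozyme-42"},
    {"@id": "images/scan_0001.cbf"},
]

report = round_trip_report(original, restored)
print(report)
```

An empty report on both keys is the passing condition a fixture test would assert.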
Decision slide
The working vision
- Use RO-Crate as LAMBDA's canonical semantic exchange layer.
- Keep LAMBDA Core technique-neutral and limited to shared commitments.
- Start with MX as the first technique extension, not as the shape of the whole system.
- Keep mounted-crystal holder and beamline logistics terms in MX or facility overlays.
- Treat lambda-ber-schema as a conformance-tested projection, not the source of all graph semantics.
Next working artifact: one realistic MX RO-Crate fixture, one LinkML projection, and one validator report.