A shift at the beamline leaves terabytes of frames sitting on the facility's storage. You don't want them on your laptop — you want them known: searchable, linked to the sample and instrument, ready to act on. The agent pulls a LAMBDA-BER RO-Crate over the lambda API and registers the run as rows in your local tables — pointers and checksums, not bytes.
The scenario
A SAXS session at SIBYLS or a cryo-EM collection on a Krios produces hundreds of gigabytes to terabytes of raw frames. Copying everything to local disk is slow, expensive, and usually pointless — you need most of it only when a processing step actually reads it. What you need immediately is the record: what was run, on what sample, with what instrument, and where the bytes live.
When a file is eventually needed, it is fetched by its storage_uri, verifying integrity with the checksum. Heavy files are streamed later, on demand, only if a workflow needs them.
How it works
The crate the facility emits validates against the same schema you use everywhere else, so ingest is a row-insert, not a parsing project. A single session populates the instrument, the experiment, and the file index together.
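The row-insert path can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual ingest code: the table layout, the `register_session` helper, the shape of the `session` dict, and the instrument values are all hypothetical. Only the experiment and file field values are taken from the example record on this page, and the crate is assumed to have been fetched and validated already.

```python
import sqlite3

# Hypothetical local tables; the real schema comes from the LAMBDA-BER model.
SCHEMA = """
CREATE TABLE instrument (code TEXT PRIMARY KEY, model TEXT, current_status TEXT);
CREATE TABLE experiment_run (id TEXT PRIMARY KEY, experiment_code TEXT,
                             technique TEXT, processing_status TEXT);
CREATE TABLE experiment_instrument (experiment_id TEXT, instrument_code TEXT);
CREATE TABLE data_file (file_name TEXT, file_size_bytes INTEGER,
                        storage_uri TEXT PRIMARY KEY, checksum TEXT,
                        related_entity TEXT);
"""

# Assumed shape of one parsed session; instrument values are placeholders.
EXAMPLE_SESSION = {
    "instrument": {"code": "INST-EXAMPLE", "model": "example model",
                   "current_status": "operational"},
    "experiment_run": {"id": "lambdaber:exp_chd1_saxs_001",
                       "experiment_code": "EXP-CHD1-SAXS-001",
                       "technique": "saxs",
                       "processing_status": "collected"},
    "data_files": [{
        "file_name": "saxs_session_2024-05-16_buffer-subtracted.dat",
        "file_size_bytes": 41229312,
        "storage_uri": "sibyls://als/12.3.1/2024-05-16/run047.h5",
        "checksum": "sha256:9f2c…a71b",  # elided in the source, kept as-is
    }],
}


def register_session(conn, session):
    """Insert one validated session as rows; no file bytes are read."""
    inst, run = session["instrument"], session["experiment_run"]
    with conn:  # single transaction: the session lands whole or not at all
        conn.execute("INSERT OR IGNORE INTO instrument VALUES (?, ?, ?)",
                     (inst["code"], inst["model"], inst["current_status"]))
        conn.execute("INSERT INTO experiment_run VALUES (?, ?, ?, ?)",
                     (run["id"], run["experiment_code"], run["technique"],
                      run["processing_status"]))
        conn.execute("INSERT INTO experiment_instrument VALUES (?, ?)",
                     (run["id"], inst["code"]))
        conn.executemany(
            "INSERT INTO data_file VALUES (?, ?, ?, ?, ?)",
            [(f["file_name"], f["file_size_bytes"], f["storage_uri"],
              f["checksum"], run["id"]) for f in session["data_files"]])
    return len(session["data_files"])
```

Calling `register_session(conn, EXAMPLE_SESSION)` on a connection where `SCHEMA` has been applied with `conn.executescript(SCHEMA)` fills the instrument, experiment, association, and file tables in one transaction, which is the sense in which a single session populates everything together.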
What gets registered
Each raw or derived file becomes a DataFile with enough to find it, verify
it, and reason about it — without a single byte leaving the facility.

    data_files:
      - file_name: saxs_session_2024-05-16_buffer-subtracted.dat
        file_format: hdf5
        data_type: processed
        file_size_bytes: 41229312
        storage_uri: sibyls://als/12.3.1/2024-05-16/run047.h5  # lives at the facility
        checksum: sha256:9f2c…a71b  # integrity on fetch
        related_entity: lambdaber:exp_chd1_saxs_001
    experiment_runs:
      - id: lambdaber:exp_chd1_saxs_001
        experiment_code: EXP-CHD1-SAXS-001
        technique: saxs
        processing_status: collected  # raw on day one; updated as it advances

| Class | Carries |
|---|---|
| Instrument | code, model, current_status |
| ExperimentRun | technique, conditions, quality_metrics |
| DataFile | storage_uri, checksum, size, role |
| assoc tables | experiment↔instrument, experiment↔sample |

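To make the DataFile row concrete, here is one way such a pointer record might be modelled in Python. It is a sketch only: the `DataFile` dataclass below, its validation rules, and the `checksum_algorithm` property are assumptions for illustration, with field names borrowed from the example record on this page.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class DataFile:
    """A pointer to a file that stays at the facility (hypothetical model)."""
    file_name: str
    file_size_bytes: int
    storage_uri: str
    checksum: str        # assumed form: "<algorithm>:<hex digest>"
    related_entity: str  # id of the experiment run this file belongs to

    def __post_init__(self):
        # Reject records that could not be located or verified later.
        if self.file_size_bytes < 0:
            raise ValueError("file_size_bytes must be non-negative")
        if not urlsplit(self.storage_uri).scheme:
            raise ValueError("storage_uri must include a scheme")
        algorithm, sep, digest = self.checksum.partition(":")
        if not (sep and algorithm and digest):
            raise ValueError("checksum must look like '<algorithm>:<digest>'")

    @property
    def checksum_algorithm(self) -> str:
        return self.checksum.partition(":")[0]


# Values copied from the example record; the checksum is elided in the source.
example = DataFile(
    file_name="saxs_session_2024-05-16_buffer-subtracted.dat",
    file_size_bytes=41229312,
    storage_uri="sibyls://als/12.3.1/2024-05-16/run047.h5",
    checksum="sha256:9f2c…a71b",
    related_entity="lambdaber:exp_chd1_saxs_001",
)
```

Validating at construction time means a pointer that cannot be located or verified is rejected at registration rather than when a processing step tries to fetch it.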
| Asset | Why |
|---|---|
| raw movie stacks | TB-scale, rarely re-read whole |
| detector frames | needed only by processing |
| intermediate maps | regenerable from inputs |
| → fetched by storage_uri | when a step demands it |
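The on-demand fetch can be illustrated with a streaming SHA-256 check. This is a hedged sketch, not the facility's transfer mechanism: `fetch_verified` and the `open_stream` callable are hypothetical names, and how a `sibyls://` URI is actually opened is facility-specific and left to the caller. The checksum is assumed to be stored as `sha256:<hex digest>`.

```python
import hashlib
import io

CHUNK = 1 << 20  # read 1 MiB at a time so memory use stays flat


def fetch_verified(open_stream, storage_uri, checksum, sink):
    """Stream one file into `sink`; raise if its SHA-256 does not match.

    `open_stream` is a hypothetical callable that returns a binary
    file-like object for a storage_uri.
    """
    algorithm, _, expected = checksum.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported checksum algorithm: {algorithm!r}")
    digest = hashlib.sha256()
    total = 0
    with open_stream(storage_uri) as src:
        while chunk := src.read(CHUNK):
            digest.update(chunk)  # hash and copy in the same pass
            sink.write(chunk)
            total += len(chunk)
    if digest.hexdigest() != expected.lower():
        raise IOError(f"checksum mismatch for {storage_uri}")
    return total
```

Because the digest is only known once the last chunk has arrived, bytes reach `sink` before verification completes. A caller would typically point `sink` at a temporary file and discard it if the function raises.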
The agent at work
Payoff
A session is catalogued the moment the shift ends — no overnight transfer before you can even see what you have.
You track terabytes without storing them. Bytes stay at the facility and stream only when a step reads them.
Every pointer carries a SHA-256, so any later fetch is verified against what was collected.
Once registered, a run feeds the query, processing, integration, and deposition use cases — same rows, no re-import.