Use Case 01 · Federated Data Pull

The scenario

The data is huge, remote, and yours to track

A SAXS session at SIBYLS or a cryo-EM collection on a Krios produces hundreds of gigabytes to terabytes of raw frames. Copying everything to local disk is slow, expensive, and usually pointless — you need most of it only when a processing step actually reads it. What you need immediately is the record: what was run, on what sample, with what instrument, and where the bytes live.

The federation principle: the facility stays the source of truth for raw data. The agent ingests a lightweight crate that describes the run, points at the bytes by storage_uri, and records a checksum for each file. Heavy files are streamed later, on demand, only if a workflow needs them, and each fetch is verified against the recorded checksum.
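The verify-on-fetch half of this principle can be sketched in a few lines of Python. This is a minimal illustration, not the agent's actual code: the helper name `verify_stream` is hypothetical, the chunks are made-up bytes standing in for a remote stream, and the recorded value is assumed to use the `algorithm:hexdigest` form seen in the checksum field.

```python
import hashlib
from typing import Iterable


def verify_stream(chunks: Iterable[bytes], recorded: str) -> int:
    """Hash chunks as they arrive and compare with a recorded 'algo:hexdigest'.

    Returns the number of bytes read; raises ValueError on a mismatch.
    """
    algo, _, expected = recorded.partition(":")
    digest = hashlib.new(algo)  # e.g. "sha256"
    total = 0
    for chunk in chunks:
        digest.update(chunk)  # never holds the whole file in memory
        total += len(chunk)
    if digest.hexdigest() != expected.lower():
        raise ValueError(f"checksum mismatch after {total} bytes")
    return total


# Stand-in for a remote stream: two chunks of made-up bytes.
payload = [b"frame-0001", b"frame-0002"]
# In the catalog this value would come from the DataFile row, not be recomputed.
recorded = "sha256:" + hashlib.sha256(b"".join(payload)).hexdigest()
bytes_read = verify_stream(payload, recorded)
print(bytes_read)
```

Hashing chunk by chunk matters here because the files are far larger than memory; the comparison happens once, after the last chunk.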

What gets registered

A DataFile row is a pointer, not a payload

Each raw or derived file becomes a DataFile with enough to find it, verify it, and reason about it, without a single payload byte leaving the facility.

▸ ingested rows (excerpt) — pointers + provenance, zero bytes copied

data_files:
- file_name: saxs_session_2024-05-16_buffer-subtracted.dat
  file_format: hdf5
  data_type: processed
  file_size_bytes: 41229312
  storage_uri: sibyls://als/12.3.1/2024-05-16/run047.h5   # lives at the facility
  checksum: sha256:9f2c…a71b                          # integrity on fetch
  related_entity: lambdaber:exp_chd1_saxs_001

experiment_runs:
- id: lambdaber:exp_chd1_saxs_001
  experiment_code: EXP-CHD1-SAXS-001
  technique: saxs
  processing_status: collected          # raw on day one; updated as it advances
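To make the pointer idea concrete, the data_files entry above could be held in memory as a small immutable record. The sketch below is an assumption about shape only: the field names are copied from the excerpt, but the `DataFile` dataclass, `from_row`, and `checksum_algorithm` are hypothetical and not taken from lambda_ber_schema. The checksum is kept elided exactly as the excerpt shows it.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DataFile:
    """A pointer to remote bytes; holds no payload."""
    file_name: str
    file_format: str
    data_type: str
    file_size_bytes: int
    storage_uri: str
    checksum: str  # "<algorithm>:<hex digest>"
    related_entity: str

    @property
    def checksum_algorithm(self) -> str:
        return self.checksum.split(":", 1)[0]


def from_row(row: dict) -> DataFile:
    """Build a DataFile from a parsed row, rejecting rows with missing fields."""
    names = [f.name for f in fields(DataFile)]
    missing = [n for n in names if n not in row]
    if missing:
        raise ValueError(f"row is missing fields: {missing}")
    return DataFile(**{n: row[n] for n in names})


row = {
    "file_name": "saxs_session_2024-05-16_buffer-subtracted.dat",
    "file_format": "hdf5",
    "data_type": "processed",
    "file_size_bytes": 41229312,
    "storage_uri": "sibyls://als/12.3.1/2024-05-16/run047.h5",
    "checksum": "sha256:9f2c…a71b",  # elided in the excerpt, kept as-is
    "related_entity": "lambdaber:exp_chd1_saxs_001",
}
data_file = from_row(row)
print(data_file.storage_uri, data_file.checksum_algorithm)
```

Freezing the record reflects the intent that a pointer is a fact about the facility's copy: it is replaced by a new row, not edited in place.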

Lands on ingest

Class	Carries
`Instrument`	code, model, current_status
`ExperimentRun`	technique, conditions, quality_metrics
`DataFile`	storage_uri, checksum, size, role
assoc tables	experiment↔instrument, experiment↔sample

Stays remote

Asset	Why
raw movie stacks	TB-scale, rarely re-read whole
detector frames	needed only by processing
intermediate maps	regenerable from inputs
→ fetched by `storage_uri`	when a step demands it
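One way to picture "fetched by storage_uri when a step demands it" is a handle that resolves its URI only on first read. The sketch below is illustrative: `LazyFile`, the fetcher registry, and `fake_sibyls_fetch` are hypothetical names, and the fetcher returns placeholder bytes because the real transport behind the `sibyls://` scheme is not described here.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlsplit

Fetcher = Callable[[str], bytes]


class LazyFile:
    """Holds a storage_uri and fetches the bytes only on the first read()."""

    def __init__(self, storage_uri: str, fetchers: Dict[str, Fetcher]):
        self.storage_uri = storage_uri
        self._fetchers = fetchers
        self._cache: Optional[bytes] = None

    def read(self) -> bytes:
        if self._cache is None:
            # Pick a fetcher by URI scheme, e.g. "sibyls".
            scheme = urlsplit(self.storage_uri).scheme
            if scheme not in self._fetchers:
                raise LookupError(f"no fetcher registered for {scheme!r}")
            self._cache = self._fetchers[scheme](self.storage_uri)
        return self._cache


calls = []  # records every remote fetch so the laziness is observable


def fake_sibyls_fetch(uri: str) -> bytes:
    """Placeholder transport: returns made-up bytes instead of detector data."""
    calls.append(uri)
    return b"placeholder bytes"


handle = LazyFile(
    "sibyls://als/12.3.1/2024-05-16/run047.h5",
    {"sibyls": fake_sibyls_fetch},
)
fetches_before_read = len(calls)  # registering the pointer fetched nothing
handle.read()
handle.read()  # second read is served from the local cache
print(fetches_before_read, len(calls))
```

Dispatching on the URI scheme keeps the catalog rows transport-agnostic: a second facility would add a fetcher, not a new row format.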

The agent at work

Ingest a session in one pass

beril · claude-code · ~/lab-catalog

> pull yesterday's SIBYLS session into the catalog

⏺ list available crates (lambda API · facility=sibyls, date=2024-05-16)
→ 1 session · EXP-CHD1-SAXS-001 · 47 runs · 612 GB raw (remote)

⏺ fetch metadata crate
GET sibyls://als/12.3.1/2024-05-16/ro-crate-metadata.json (318 KB)
✓ validates against lambda_ber_schema

⏺ register rows (DuckDB)
+ 1 Instrument · + 1 ExperimentRun · + 49 DataFile (pointers)
+ experiment↔instrument, experiment↔sample associations
✓ 612 GB now searchable · 0 bytes copied · checksums recorded

> how much of my catalog is still un-processed?
⏺ query WHERE processing_status IN ('collected','raw')
→ 1 run ready to hand to use case 03 (auto-orchestrate)
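The register and query steps in the session above can be approximated as follows. This is a sketch under stated assumptions: Python's built-in sqlite3 stands in for DuckDB so the example runs on the standard library alone, the two table layouts are guessed from the ingested-rows excerpt rather than taken from lambda_ber_schema, and only one of the pointer rows is inserted.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the local DuckDB catalog
con.executescript(
    """
    CREATE TABLE experiment_runs (
        id TEXT PRIMARY KEY,
        experiment_code TEXT,
        technique TEXT,
        processing_status TEXT
    );
    CREATE TABLE data_files (
        file_name TEXT,
        storage_uri TEXT,
        checksum TEXT,
        file_size_bytes INTEGER,
        related_entity TEXT REFERENCES experiment_runs(id)
    );
    """
)

# Register the run and one pointer row; no payload bytes are involved.
con.execute(
    "INSERT INTO experiment_runs VALUES (?, ?, ?, ?)",
    ("lambdaber:exp_chd1_saxs_001", "EXP-CHD1-SAXS-001", "saxs", "collected"),
)
con.execute(
    "INSERT INTO data_files VALUES (?, ?, ?, ?, ?)",
    (
        "saxs_session_2024-05-16_buffer-subtracted.dat",
        "sibyls://als/12.3.1/2024-05-16/run047.h5",
        "sha256:9f2c…a71b",  # elided in the excerpt, kept as-is
        41229312,
        "lambdaber:exp_chd1_saxs_001",
    ),
)

# Which runs are still unprocessed, and how much remote data do they point at?
unprocessed = con.execute(
    """
    SELECT r.experiment_code,
           COUNT(f.file_name),
           COALESCE(SUM(f.file_size_bytes), 0)
    FROM experiment_runs AS r
    LEFT JOIN data_files AS f ON f.related_entity = r.id
    WHERE r.processing_status IN ('collected', 'raw')
    GROUP BY r.experiment_code
    """
).fetchall()
print(unprocessed)
```

The size total comes entirely from the recorded file_size_bytes values, which is why the catalog can answer volume questions without touching the facility.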

Payoff

What pulling metadata-first buys

[ ⇣ ] Seconds, not hours

A session is catalogued the moment the shift ends — no overnight transfer before you can even see what you have.

[ ⛃ ] No local data lake

You track terabytes without storing them. Bytes stay at the facility and stream only when a step reads them.

[ ✓ ] Integrity built in

Every pointer carries a SHA-256 checksum, so any later fetch can be verified against what was recorded at ingest.

[ ↦ ] Ready for everything else

Once registered, a run feeds the query, processing, integration, and deposition use cases — same rows, no re-import.

Federated data pull:
register a beamtime without hauling it home

One pull, three things land locally
