PDB Deposition and OneDep

Overview

This document describes the practical workflow for depositing structures to the Protein Data Bank (PDB) and how lambda-ber-schema relates to the deposition process.

How Structures Get Deposited

The Reality: Software-Generated Files

Researchers almost never manually author mmCIF files. The typical workflow:

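Structure determination software generates the files: refinement packages such as PHENIX and CCP4 write refined coordinates as mmCIF, while cryo-EM packages such as RELION and cryoSPARC mainly produce the reconstructed maps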
OneDep web portal: Researchers upload files and complete metadata via web forms
Validation and review: wwPDB staff validate and may request corrections
Release: Structure becomes publicly available

OneDep Deposition System

The wwPDB provides OneDep - the unified deposition portal where researchers:

Upload coordinate files (mmCIF format)
Upload experimental data (structure factors, EM maps)
Fill in metadata through web forms
Answer questions about sample, methods, funding, authors

OneDep validates the upload and flags errors for interactive correction. It extracts whatever experimental metadata the coordinate file already contains, then asks researchers to confirm it and supply the rest.

The Metadata Gap

The disconnect in current practice:

Beamline software records data collection parameters
Processing software tracks refinement statistics
Lab notebooks contain sample preparation details
None of these systems communicate automatically

Researchers manually transcribe from multiple sources into OneDep forms - an error-prone and tedious process. This is exactly the gap lambda-ber-schema aims to fill.

Legacy PDB Format vs mmCIF

The PDB historically used a fixed-column text format with specific record types. While mmCIF is now the primary format, many facilities and researchers still reference the legacy record names.

Legacy PDB Records to mmCIF Mapping

PDB Record	Contents	mmCIF Equivalent
`HEADER`	Classification, date, PDB ID	`_entry`, `_struct_keywords`
`TITLE`	Structure title	`_struct.title`
`COMPND`	Molecule names, chains	`_entity.pdbx_description`
`SOURCE`	Organism, expression system	`_entity_src_gen`, `_entity_src_nat`
`KEYWDS`	Search keywords	`_struct_keywords.text`
`EXPDTA`	Experimental method	`_exptl.method`
`AUTHOR`	Depositor names	`_audit_author`
`REVDAT`	Revision history	`_pdbx_audit_revision_history`
`JRNL`	Publication citation	`_citation`
`REMARK 200`	Crystallographic data collection	`_diffrn`, `_reflns`
`REMARK 280`	Crystal/crystallization conditions	`_exptl_crystal_grow`
`REMARK 350`	Biological assembly	`_pdbx_struct_assembly`
`REMARK 465`	Missing residues	`_pdbx_unobs_or_zero_occ_residues`
`CRYST1`	Unit cell, space group	`_cell`, `_symmetry`
`ORIGX1/2/3`	Original coordinate transforms	`_database_PDB_matrix`
`SCALE1/2/3`	Fractional coordinate transforms	`_atom_sites.fract_transf_*`
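
The mapping above can be held as a small lookup table. The following is a minimal sketch: the dictionary is transcribed from the table, while the function name `mmcif_equivalents` and the decision to fold `ORIGX1/2/3` and `SCALE1/2/3` into single keys are assumptions made for illustration.

```python
# Lookup table transcribed from the mapping table: legacy record -> mmCIF names.
LEGACY_TO_MMCIF = {
    "HEADER": ["_entry", "_struct_keywords"],
    "TITLE": ["_struct.title"],
    "COMPND": ["_entity.pdbx_description"],
    "SOURCE": ["_entity_src_gen", "_entity_src_nat"],
    "KEYWDS": ["_struct_keywords.text"],
    "EXPDTA": ["_exptl.method"],
    "AUTHOR": ["_audit_author"],
    "REVDAT": ["_pdbx_audit_revision_history"],
    "JRNL": ["_citation"],
    "REMARK 200": ["_diffrn", "_reflns"],
    "REMARK 280": ["_exptl_crystal_grow"],
    "REMARK 350": ["_pdbx_struct_assembly"],
    "REMARK 465": ["_pdbx_unobs_or_zero_occ_residues"],
    "CRYST1": ["_cell", "_symmetry"],
    "ORIGX": ["_database_PDB_matrix"],
    "SCALE": ["_atom_sites.fract_transf_*"],
}


def mmcif_equivalents(pdb_line: str) -> list[str]:
    """Return the mmCIF names for the record that starts a legacy PDB line."""
    record = pdb_line[:6].strip()
    if record == "REMARK":
        # REMARK sections are distinguished by the number in columns 8-10.
        record = f"REMARK {pdb_line[7:10].strip()}"
    elif record[:5] in ("ORIGX", "SCALE"):
        # Fold ORIGX1/2/3 and SCALE1/2/3 into one key each.
        record = record[:5]
    # Records outside the table (ATOM, HETATM, ...) yield an empty list.
    return LEGACY_TO_MMCIF.get(record, [])
```

A lookup like this is mainly useful when a facility checklist still names legacy records and the corresponding mmCIF category has to be found quickly.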

REMARK 200: Data Collection Details

REMARK 200 is particularly important as it contains extensive data collection metadata:

Wavelength
Temperature
Detector type and distance
Resolution limits
Beamline/source information
Data collection date

These fields map to lambda-ber-schema's ExperimentRun class and the X-ray-specific slots on XRayInstrument; resolution limits and merging statistics are recorded on WorkflowRun.
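
As a rough illustration of how the REMARK 200 fields listed above could be pulled out of a legacy file, the sketch below reads `LABEL (UNIT) : VALUE` lines. The label strings follow the conventional REMARK 200 template, but files written by different software may deviate from it. The `REMARK_200_TO_SLOT` mapping and the function name are hypothetical, and values are kept as strings because a wavelength entry may be a range.

```python
# Hypothetical mapping from REMARK 200 labels to ExperimentRun slot names.
REMARK_200_TO_SLOT = {
    "WAVELENGTH OR RANGE": "wavelength",
    "TEMPERATURE": "temperature_k",
    "BEAMLINE": "beamline",
}


def parse_remark_200(lines):
    """Collect mapped 'LABEL (UNIT) : VALUE' pairs from REMARK 200 lines."""
    slots = {}
    for line in lines:
        # Skip other records and REMARK 200 lines that carry no key/value pair.
        if not line.startswith("REMARK 200") or ":" not in line:
            continue
        label, value = line[10:].split(":", 1)
        # Drop a trailing unit such as "(KELVIN)" or "(A)" from the label.
        label = label.split("(")[0].strip()
        value = value.strip()
        slot = REMARK_200_TO_SLOT.get(label)
        # "NULL" marks a field the depositor left empty.
        if slot and value != "NULL":
            slots[slot] = value
    return slots
```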

Facility Metadata Requirements

Different synchrotron facilities track similar metadata for PDB deposition. A typical checklist includes:

Category	Property	Notes
PDB_Headers	HEADER	Entry classification
PDB_Headers	TITLE	Structure title
PDB_Headers	COMPND	Compound description
PDB_Headers	SOURCE	Biological source
PDB_Headers	KEYWDS	Keywords
PDB_Headers	EXPDTA	Experimental method
PDB_Headers	AUTHOR	Author list
PDB_Headers	REVDAT	Revision dates
PDB_Headers	JRNL	Journal reference
PDB_Headers	REMARK_200	Data collection parameters
PDB_Headers	REMARK_280	Crystallization conditions
PDB_Headers	REMARK_350	Biological assembly
PDB_Headers	REMARK_465	Missing residues
PDB_Headers	CRYST1	Unit cell parameters
PDB_Headers	ORIGX1/2/3	Origin transforms
PDB_Headers	SCALE1/2/3	Scale transforms

PDB Headers to Schema Mapping

The following tables show how each PDB header record maps to specific classes and slots in lambda-ber-schema.

HEADER - Entry Information

PDB Field	Schema Class	Schema Slot(s)	Notes
Classification	`Dataset`	`keywords`	Entry classification terms
Deposition date	`WorkflowRun`	`deposited_to_pdb`, `end_date`	Tracked as workflow completion
PDB ID	`WorkflowRun`	`pdb_id`	Assigned accession code

TITLE - Structure Title

PDB Field	Schema Class	Schema Slot(s)	Notes
Title	`Study`	`title`, `description`	Human-readable structure title

COMPND - Compound/Molecule

PDB Field	Schema Class	Schema Slot(s)	Notes
Molecule name	`Sample`	`protein_name`	Protein/molecule name
Chain IDs	`Sample`	`molecular_composition`	Chain assignments
EC number	`Sample`	(via ontology terms)	Enzyme classification
Engineered	`Sample`	`construct`, `tag`, `mutations`	Construct details

SOURCE - Biological Source

PDB Field	Schema Class	Schema Slot(s)	Notes
Organism scientific	`Sample`	`organism`	Source organism (OntologyTerm)
Organism taxid	`Sample`	`ncbi_taxid`	NCBI taxonomy ID
Expression system	`Sample`	`expression_system`	Recombinant expression host
Expression system taxid	`SamplePreparation`	`expression_system`	Host organism details
Gene	`Sample`	`gene_synthesis_vendor`, `codon_optimization_organism`	Gene information
Strain	`SamplePreparation`	`cell_line`	Strain/cell line used

KEYWDS - Keywords

PDB Field	Schema Class	Schema Slot(s)	Notes
Keywords	`Dataset`	`keywords`	Searchable terms (multivalued)

EXPDTA - Experimental Method

PDB Field	Schema Class	Schema Slot(s)	Notes
Method	`ExperimentRun`	`technique`	TechniqueEnum value

AUTHOR - Depositor Information

PDB Field	Schema Class	Schema Slot(s)	Notes
Authors	`Study`	`contributors`	Author list

JRNL - Citation

PDB Field	Schema Class	Schema Slot(s)	Notes
Citation	`Study`	`references`	Publication references

REMARK 200 - Data Collection

PDB Field	Schema Class	Schema Slot(s)	Notes
Wavelength	`ExperimentRun`	`wavelength`	X-ray wavelength in Å
Temperature	`ExperimentRun`	`temperature_k`	Data collection temperature
Detector type	`XRayInstrument`	`detector_technology`	DetectorTechnologyEnum
Detector manufacturer	`XRayInstrument`	`detector_manufacturer`	e.g., Dectris, Rayonix
Detector model	`XRayInstrument`	`detector_model`	e.g., EIGER2 X 16M
Detector distance	`ExperimentRun`	`detector_distance`	Sample-detector distance (mm)
Beam center X/Y	`ExperimentRun`	`beam_center_x`, `beam_center_y`	Beam position on detector
Oscillation range	`ExperimentRun`	`oscillation_angle`	Rotation per image
Number of images	`ExperimentRun`	`number_of_images`	Total frames collected
Synchrotron	`XRayInstrument`	`facility_id`	FacilityEnum value
Beamline	`ExperimentRun`	`beamline`	Beamline identifier
Resolution range	`WorkflowRun`	`resolution_high`, `resolution_low`	Resolution limits
Completeness	`WorkflowRun`	`completeness_percent`	Data completeness
Rmerge	`WorkflowRun`	`rmerge`	Merging R-factor
I/sigmaI	`WorkflowRun`	`i_over_sigma`	Signal-to-noise ratio
Redundancy	`WorkflowRun`	`multiplicity`	Data redundancy

REMARK 280 - Crystallization

PDB Field	Schema Class	Schema Slot(s)	Notes
Crystallization method	`CrystallizationConditions`	`method`	Vapor diffusion, batch, etc.
pH	`CrystallizationConditions`	`ph`	Crystallization pH
Temperature	`CrystallizationConditions`	`temperature_celsius`	Growth temperature
Precipitant	`CrystallizationConditions`	`precipitant_type`, `precipitant_concentration`	Main precipitant
Buffer	`CrystallizationConditions`	`buffer`, `buffer_concentration`	Buffer system
Salt	`CrystallizationConditions`	`salt`, `salt_concentration`	Added salts
Additives	`CrystallizationConditions`	`additives`	Other components
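
One unit detail is worth making explicit: the slot above is named `temperature_celsius`, whereas mmCIF records the growth temperature (`_exptl_crystal_grow.temp`) in kelvin. A minimal conversion sketch follows; the function name and the two-decimal rounding are assumptions, not part of the schema.

```python
def kelvin_to_celsius(temp_k: float) -> float:
    """Convert a growth temperature in kelvin to degrees Celsius."""
    if temp_k < 0:
        raise ValueError("temperature in kelvin cannot be negative")
    # Round to hide floating-point noise from the subtraction.
    return round(temp_k - 273.15, 2)
```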

REMARK 350 - Biological Assembly

PDB Field	Schema Class	Schema Slot(s)	Notes
Biological unit	`Sample`	`oligomeric_state`	Assembly state
Symmetry operators	-	Not directly modeled	Assembly transforms

CRYST1 - Unit Cell

PDB Field	Schema Class	Schema Slot(s)	Notes
a, b, c	`WorkflowRun`	`unit_cell_a`, `unit_cell_b`, `unit_cell_c`	Cell dimensions (Å)
alpha, beta, gamma	`WorkflowRun`	`unit_cell_alpha`, `unit_cell_beta`, `unit_cell_gamma`	Cell angles (°)
Space group	`WorkflowRun`	`space_group`	Hermann-Mauguin symbol
Z value	-	Not directly modeled	Molecules per unit cell
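
Because CRYST1 is a fixed-column record, the slots above can be filled by slicing the line at the column positions given in the legacy format documentation. The sketch below assumes a well-formed record; the function name is hypothetical and the cell values in the usage line are illustrative only.

```python
def parse_cryst1(line: str) -> dict:
    """Slice a fixed-column CRYST1 record into unit-cell slots."""
    if not line.startswith("CRYST1"):
        raise ValueError("not a CRYST1 record")
    return {
        "unit_cell_a": float(line[6:15]),       # columns 7-15
        "unit_cell_b": float(line[15:24]),      # columns 16-24
        "unit_cell_c": float(line[24:33]),      # columns 25-33
        "unit_cell_alpha": float(line[33:40]),  # columns 34-40
        "unit_cell_beta": float(line[40:47]),   # columns 41-47
        "unit_cell_gamma": float(line[47:54]),  # columns 48-54
        "space_group": line[55:66].strip(),     # columns 56-66
    }


# Illustrative record; the Z value in columns 67-70 is ignored, as in the table.
record = "CRYST1   78.540   78.540   37.770  90.00  90.00  90.00 P 43 21 2     8"
cell = parse_cryst1(record)
```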

ORIGX / SCALE - Coordinate Transforms

PDB Field	Schema Class	Schema Slot(s)	Notes
Transform matrices	-	Not modeled	Coordinate transformations stored in coordinate files

REMARK 3 - Refinement Statistics

PDB Field	Schema Class	Schema Slot(s)	Notes
R-work	`WorkflowRun`	`rwork`	Working set R-factor
R-free	`WorkflowRun`	`rfree`	Free set R-factor
RMSD bonds	`WorkflowRun`	`rmsd_bonds`	Bond length deviation
RMSD angles	`WorkflowRun`	`rmsd_angles`	Bond angle deviation
Ramachandran favored	`WorkflowRun`	`ramachandran_favored`	% in favored regions
Ramachandran outliers	`WorkflowRun`	`ramachandran_outliers`	% outliers
Clashscore	`WorkflowRun`	`clashscore`	MolProbity clashscore
Wilson B	`WorkflowRun`	`wilson_b_factor`	Wilson B-factor estimate

Phasing Information

PDB Field	Schema Class	Schema Slot(s)	Notes
Phasing method	`WorkflowRun`	`phasing_method`	PhasingMethodEnum (SAD, MAD, MR, etc.)
Search model	`WorkflowRun`	`search_model_pdb_id`	MR template PDB ID

Deposition Status

PDB Field	Schema Class	Schema Slot(s)	Notes
Deposited	`WorkflowRun`	`deposited_to_pdb`	Boolean flag
PDB ID	`WorkflowRun`	`pdb_id`	Assigned accession
Validation report	`WorkflowRun`	`validation_report_path`	Path to validation PDF

Coverage Summary

The schema provides structured capture for the majority of PDB deposition metadata:

PDB Section	Coverage	Notes
HEADER/TITLE/KEYWDS	✅ Full	Dataset and Study metadata
COMPND/SOURCE	✅ Full	Sample class with NSLS2 extensions
EXPDTA/AUTHOR	✅ Full	ExperimentRun and Study
REMARK 200 (data collection)	✅ Full	ExperimentRun + XRayInstrument
REMARK 280 (crystallization)	✅ Full	CrystallizationConditions class
REMARK 3 (refinement)	✅ Full	WorkflowRun refinement slots
CRYST1 (unit cell)	✅ Full	WorkflowRun crystallographic slots
REMARK 350 (assembly)	⚠️ Partial	oligomeric_state only
ORIGX/SCALE (transforms)	❌ None	Stored in coordinate files
ATOM/HETATM (coordinates)	❌ None	Schema tracks files, not coordinates

Integration Strategy

Pre-Deposition: Use lambda-ber-schema

Capture sample metadata at preparation time
Record instrument and experimental parameters automatically from beamline
Track processing workflows with software versions and parameters
Maintain complete provenance from sample to structure

Deposition: Export to OneDep

Generate metadata summary from lambda-ber-schema records
Use as reference when completing OneDep forms
Future: automated export to OneDep-compatible format
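
The metadata summary mentioned above could be as simple as a plain-text listing that a researcher keeps open while filling in the forms. The sketch below assumes the metadata has already been collected into a dictionary, possibly with one level of nesting; the function name and the `MISSING` marker are illustrative choices.

```python
def format_summary(metadata: dict) -> str:
    """Render a metadata dictionary as an indented plain-text checklist."""
    lines = []
    for key, value in metadata.items():
        if isinstance(value, dict):
            # One level of nesting, e.g. a unit_cell sub-dictionary.
            lines.append(f"{key}:")
            lines.extend(f"  {k}: {v}" for k, v in value.items())
        else:
            # Make gaps visible instead of printing "None".
            shown = "MISSING" if value is None else value
            lines.append(f"{key}: {shown}")
    return "\n".join(lines)
```

Marking absent values explicitly makes it easier to spot what still has to be looked up before the deposition session starts.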

Post-Deposition: Link Records

Update WorkflowRun.pdb_id with assigned accession
Set WorkflowRun.deposited_to_pdb = true
Maintain bidirectional links between local records and PDB entry
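
The two updates above belong together, so it can help to apply them in one step. The sketch below uses a stand-in dataclass rather than the real generated schema classes; `WorkflowRunRecord`, `link_to_pdb`, and the accession `1ABC` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkflowRunRecord:
    """Stand-in for the deposition-related slots of WorkflowRun."""
    workflow_code: str
    deposited_to_pdb: bool = False
    pdb_id: Optional[str] = None


def link_to_pdb(run: WorkflowRunRecord, accession: str) -> WorkflowRunRecord:
    """Store the assigned accession and set the deposition flag together."""
    accession = accession.strip()
    if not accession:
        raise ValueError("accession must not be empty")
    run.pdb_id = accession
    run.deposited_to_pdb = True
    return run


# Placeholder accession, not a real deposition.
run = link_to_pdb(WorkflowRunRecord("lysozyme-refinement-v1"), "1ABC")
```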

OneDep Validation Categories

OneDep performs extensive validation before accepting a deposition. Understanding these checks helps ensure lambda-ber-schema captures the right metadata.

Geometry Validation

Check	What It Validates	Schema Support
Bond lengths	Deviations from ideal geometry	`WorkflowRun.rmsd_bonds`
Bond angles	Deviations from ideal angles	`WorkflowRun.rmsd_angles`
Ramachandran	Backbone dihedral angles	`WorkflowRun.ramachandran_favored`, `ramachandran_outliers`
Rotamers	Side chain conformations	(in coordinate file)
Clashes	Steric overlaps	`WorkflowRun.clashscore`

Data Quality Validation

Check	What It Validates	Schema Support
Resolution	Claimed vs actual resolution	`WorkflowRun.resolution_high`
Completeness	Data completeness	`WorkflowRun.completeness_percent`
R-factors	Rwork/Rfree gap	`WorkflowRun.rwork`, `rfree`
B-factors	Temperature factor distribution	`WorkflowRun.wilson_b_factor`
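
The Rwork/Rfree gap check in the table can be approximated locally before deposition. In the sketch below, the default `max_gap` of 0.05 is an assumed lab-level cutoff, not a wwPDB rule, and the function name is hypothetical.

```python
def check_rfactor_gap(rwork: float, rfree: float, max_gap: float = 0.05) -> list[str]:
    """Return warnings about an Rwork/Rfree pair; max_gap is an assumed cutoff."""
    warnings = []
    gap = rfree - rwork
    if gap < 0:
        # Rfree below Rwork is unusual and worth a second look.
        warnings.append("rfree is below rwork; check the free-set assignment")
    elif gap > max_gap:
        warnings.append(f"rfree - rwork = {gap:.3f} exceeds {max_gap:.3f}")
    return warnings
```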

Metadata Validation

Check	What It Validates	Schema Support
Sequence match	Coordinates match deposited sequence	`Sample.molecular_composition`
Ligand identity	Chemical component dictionary match	`Sample` ligand fields
Source organism	Taxonomy ID validity	`Sample.ncbi_taxid`, `organism`
Expression system	Host organism consistency	`Sample.expression_system`
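
A simple local check can catch disagreement between the organism term and the taxonomy ID before OneDep does. The sketch assumes `organism` is stored as an `NCBITaxon:` CURIE and that `ncbi_taxid` may be either an integer or a string; both the function name and those storage assumptions are illustrative.

```python
def check_organism(organism: str, ncbi_taxid) -> list[str]:
    """Report problems with an organism CURIE and its companion taxonomy ID."""
    problems = []
    prefix, _, local_id = organism.partition(":")
    if prefix != "NCBITaxon" or not local_id.isdigit():
        problems.append(f"organism {organism!r} is not an NCBITaxon CURIE")
    elif local_id != str(ncbi_taxid):
        # Compare as strings so integer and string taxids are both accepted.
        problems.append(f"organism {organism} disagrees with ncbi_taxid {ncbi_taxid}")
    return problems
```

This only checks internal consistency; whether the ID exists in NCBI Taxonomy would need a lookup against the taxonomy itself.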

Common Deposition Issues

The following problems come up frequently during deposition; each is paired with the way lambda-ber-schema helps prevent it:

1. Missing or Inconsistent Metadata

Problem: Data collection parameters don't match between coordinate file and deposition form.

Schema Solution: Single source of truth in ExperimentRun:

ExperimentRun:
  wavelength: 0.9792        # Captured at beamline
  temperature_k: 100        # Recorded automatically
  detector_distance: 250.0  # From beamline metadata
  beamline: "FMX"           # Facility identifier
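
With a single recorded source, values typed into a form can be compared against it mechanically. The sketch below treats both sides as plain dictionaries keyed by slot name; the function name and the relative tolerance are assumptions.

```python
import math


def find_mismatches(record: dict, form: dict, rel_tol: float = 1e-6) -> dict:
    """Return {slot: (recorded, entered)} for values that disagree."""
    mismatches = {}
    for key, recorded in record.items():
        if key not in form:
            continue  # not entered in the form yet, so nothing to compare
        entered = form[key]
        numeric = (int, float)
        if isinstance(recorded, numeric) and isinstance(entered, numeric):
            # Tolerate representation differences such as 100 vs 100.0.
            same = math.isclose(recorded, entered, rel_tol=rel_tol)
        else:
            same = recorded == entered
        if not same:
            mismatches[key] = (recorded, entered)
    return mismatches
```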

2. Lost Sample Provenance

Problem: Can't remember expression system or crystallization conditions months after data collection.

Schema Solution: Captured at preparation time in Sample and CrystallizationConditions:

Sample:
  protein_name: "Lysozyme"
  organism: NCBITaxon:9031  # Gallus gallus
  expression_system: "E. coli BL21(DE3)"

CrystallizationConditions:
  method: "vapor_diffusion_hanging_drop"
  precipitant_type: "NaCl"
  precipitant_concentration: "1.0 M"
  ph: 4.5

3. Processing Statistics Mismatch

Problem: Reported statistics come from different processing runs.

Schema Solution: WorkflowRun links statistics to specific processing:

WorkflowRun:
  workflow_code: "lysozyme-processing-v2"
  software_name: "XDS"
  software_version: "March 2024"
  resolution_high: 1.64
  rmerge: 0.082
  completeness_percent: 99.8
  # All stats from same processing run

4. Incorrect Phasing Attribution

Problem: Phasing method or search model not properly documented.

Schema Solution: Explicit phasing tracking:

WorkflowRun:
  phasing_method: sad  # PhasingMethodEnum
  # or, for molecular replacement:
  # phasing_method: molecular_replacement
  # search_model_pdb_id: "1LYZ"
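
Explicit slots also make the phasing record easy to check. The sketch below assumes the enum values shown above and validates only the classic four-character PDB ID form (a digit followed by three alphanumerics); the function name and the dictionary representation are hypothetical.

```python
import re

# Classic four-character PDB IDs only; longer ID formats are not covered here.
CLASSIC_PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def check_phasing(run: dict) -> list[str]:
    """Report missing or malformed phasing information on a workflow record."""
    problems = []
    method = run.get("phasing_method")
    model = run.get("search_model_pdb_id")
    if not method:
        problems.append("phasing_method is missing")
    elif method == "molecular_replacement":
        # Molecular replacement is only reproducible if the template is named.
        if not model:
            problems.append("molecular_replacement requires search_model_pdb_id")
        elif not CLASSIC_PDB_ID.match(model):
            problems.append(f"search_model_pdb_id {model!r} is not a four-character PDB ID")
    return problems
```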

Example: Schema to OneDep Export

Conceptual example of generating OneDep-ready metadata from lambda-ber-schema:

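def generate_onedep_metadata(study: Study) -> dict:
    """Extract OneDep form data from a lambda-ber-schema Study.

    Conceptual only: assumes the study holds exactly one sample, one
    experiment run, and one workflow run carrying all statistics.
    """

    # Take the first record of each kind; real studies may hold several.
    sample = study.samples[0]
    experiment = study.experiment_runs[0]
    workflow = study.workflow_runs[0]
    # get_instrument() is a placeholder for whatever lookup resolves an
    # instrument ID to its XRayInstrument record.
    instrument = get_instrument(experiment.instrument_id)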

    return {
        # TITLE
        "structure_title": study.title,

        # SOURCE
        "source_organism": sample.organism,
        "source_taxid": sample.ncbi_taxid,
        "expression_system": sample.expression_system,

        # COMPND
        "molecule_name": sample.protein_name,

        # REMARK 200 - Data Collection
        "wavelength": experiment.wavelength,
        "temperature": experiment.temperature_k,
        "detector": instrument.detector_model,
        "beamline": experiment.beamline,
        "synchrotron": instrument.facility_id,

        # CRYST1
        "space_group": workflow.space_group,
        "unit_cell": {
            "a": workflow.unit_cell_a,
            "b": workflow.unit_cell_b,
            "c": workflow.unit_cell_c,
            "alpha": workflow.unit_cell_alpha,
            "beta": workflow.unit_cell_beta,
            "gamma": workflow.unit_cell_gamma,
        },

        # Data quality
        "resolution": workflow.resolution_high,
        "rmerge": workflow.rmerge,
        "completeness": workflow.completeness_percent,

        # Refinement
        "rwork": workflow.rwork,
        "rfree": workflow.rfree,

        # Phasing
        "phasing_method": workflow.phasing_method,
        "search_model": workflow.search_model_pdb_id,
    }

Workflow Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                        STRUCTURE DETERMINATION WORKFLOW                      │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   SAMPLE    │───▶│   DATA      │───▶│ PROCESSING  │───▶│ REFINEMENT  │
│ PREPARATION │    │ COLLECTION  │    │             │    │             │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
      │                  │                  │                  │
      ▼                  ▼                  ▼                  ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Sample    │    │ Experiment  │    │ WorkflowRun │    │ WorkflowRun │
│   + Cryst   │    │    Run      │    │ (indexing)  │    │(refinement) │
│ Conditions  │    │             │    │             │    │             │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
      │                  │                  │                  │
      │                  │                  │                  │
      └──────────────────┴──────────────────┴──────────────────┘
                                   │
                                   ▼
                    ┌──────────────────────────────┐
                    │    lambda-ber-schema         │
                    │    (unified metadata)        │
                    └──────────────────────────────┘
                                   │
                    ┌──────────────┴──────────────┐
                    │                             │
                    ▼                             ▼
         ┌─────────────────┐           ┌─────────────────┐
         │  OneDep Forms   │           │  Local Archive  │
         │  (deposition)   │           │  (provenance)   │
         └─────────────────┘           └─────────────────┘
                    │
                    ▼
         ┌─────────────────┐
         │      PDB        │
         │   (public)      │
         └─────────────────┘

Related Standards

Standard	Scope	Relationship to Schema
mmCIF/PDBx	Atomic coordinates, structure metadata	Schema maps to mmCIF categories via `exact_mappings`
IHMCIF	Integrative/hybrid methods	Future alignment for multi-technique studies
SIFTS	Sequence-structure mapping	`Sample.molecular_composition` alignment
UniProt	Protein sequences	`Sample` protein identifiers
NCBI Taxonomy	Organism classification	`Sample.organism`, `ncbi_taxid`

References

wwPDB OneDep - Unified deposition portal
mmCIF Dictionary - Official mmCIF/PDBx dictionary
PDB File Format - Legacy format documentation
PDB Validation - Standalone validation server
wwPDB Validation Reports - Understanding validation output
PDB Data Harvesting - Automatic metadata extraction