LinkML Schema Development and Repository Architecture
Linked Data Modeling Language - A framework for defining data models:
Think of it as "schema definition as code"
Every LinkML schema is itself an instance of the LinkML metamodel:
```yaml
classes:
  Sample:
    description: A biological sample
    attributes:
      sample_code:
        range: string
        required: true
      sample_type:
        range: SampleTypeEnum
        required: true
```
Metamodel concepts: classes, slots, types, enums
Located at: src/lambda_ber_schema/schema/lambda-ber-schema.yaml
```yaml
id: https://w3id.org/lambda-ber/lambda-ber-schema
name: lambda-ber-schema
description: Schema for structural biology imaging data
prefixes:
  linkml: https://w3id.org/linkml/
  schema: http://schema.org/
imports:
  - linkml:types
classes:
  # Core classes defined here...
```
Container Classes:
Dataset
Study
Entity Classes:
Sample
SamplePreparation
Instrument
ExperimentRun
WorkflowRun
DataFile
Image
Supporting Classes:
MolecularComposition
BufferComposition
StorageConditions
ExperimentalConditions
Slots are reusable attributes defined separately:
```yaml
slots:
  sample_code:
    description: Unique identifier for the sample
    range: string
    required: true
    identifier: true
  temperature:
    description: Temperature in Kelvin
    range: float
    unit:
      ucum_code: K
```
Slots can be inherited and reused across classes.
```yaml
enums:
  TechniqueEnum:
    permissible_values:
      cryoem:
        description: Cryo-electron microscopy
      xray:
        description: X-ray crystallography
      saxs:
        description: Small angle X-ray scattering
      waxs:
        description: Wide angle X-ray scattering
      sans:
        description: Small angle neutron scattering
```
Provides controlled vocabularies throughout the schema.
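The generated Python code exposes each enum as a class; here is a hand-written stdlib sketch (not the generated code) showing how such a controlled vocabulary can be enforced, with the member names taken from the YAML above:

```python
from enum import Enum

class TechniqueEnum(str, Enum):
    """Hand-written mirror of the schema's TechniqueEnum (illustrative only)."""
    cryoem = "cryoem"
    xray = "xray"
    saxs = "saxs"
    waxs = "waxs"
    sans = "sans"

def validate_technique(value: str) -> TechniqueEnum:
    """Reject any value outside the controlled vocabulary."""
    try:
        return TechniqueEnum(value)
    except ValueError:
        raise ValueError(f"{value!r} is not a permitted technique") from None
```

Any typo or out-of-vocabulary string fails fast instead of silently propagating into stored metadata.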
```yaml
classes:
  Study:
    attributes:
      samples:
        range: Sample
        multivalued: true
        inlined: true
        inlined_as_list: true
```
```yaml
classes:
  Sample:
    attributes:
      sample_code:
        identifier: true  # Primary key
  ExperimentRun:
    attributes:
      sample_id:
        range: Sample  # Foreign key reference
        required: true
```
Supports both embedding and referencing.
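To make the distinction concrete, here is a minimal Python sketch (plain dicts, hypothetical instance data) of resolving a referenced `sample_id` against a study's inlined `samples` list:

```python
# Hypothetical instance data: the Study inlines its samples as a list,
# while the ExperimentRun references one sample by its identifier slot.
study = {
    "samples": [
        {"sample_code": "sample-001", "sample_type": "protein"},
        {"sample_code": "sample-002", "sample_type": "protein_complex"},
    ]
}
run = {"sample_id": "sample-002", "technique": "cryoem"}

def resolve_sample(run: dict, study: dict) -> dict:
    """Look up the referenced sample by its identifier slot (sample_code)."""
    index = {s["sample_code"]: s for s in study["samples"]}
    return index[run["sample_id"]]

print(resolve_sample(run, study)["sample_type"])  # protein_complex
```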
Problem: Different systems use different datetime formats
Solution: Use the `string` type with an ISO 8601 recommendation
```yaml
slots:
  collection_date:
    range: string  # Not 'date' type
    description: Date in ISO 8601 format (YYYY-MM-DD)
```
More forgiving for real-world data integration.
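Downstream code can still recover real date objects from the string slot; a sketch using only the standard library (slot name from the example above):

```python
from datetime import date

def parse_collection_date(record: dict) -> date:
    """Parse the ISO 8601 string stored in collection_date (YYYY-MM-DD)."""
    return date.fromisoformat(record["collection_date"])

record = {"collection_date": "2024-03-15"}
d = parse_collection_date(record)
print(d.year, d.month, d.day)  # 2024 3 15
```

Non-conforming strings raise `ValueError` at parse time, so the stricter check happens where it is actually needed rather than blocking ingestion.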
Problem: YAML scientific notation (2.0e12) causes JSON Schema issues
Solution: Document that numbers should be written out fully
```yaml
# Avoid:
particle_count: 2.0e12

# Use instead:
particle_count: 2000000000000
```
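One concrete way this bites, reproducible with PyYAML (assuming it is installed): YAML 1.1 requires a signed exponent, so `2.0e12` loads as a *string*, not a number:

```python
import yaml

# YAML 1.1 float syntax requires a signed exponent (e.g. 2.0e+12),
# so PyYAML resolves the unsigned form to a plain string.
doc = yaml.safe_load("particle_count: 2.0e12")
print(type(doc["particle_count"]).__name__)  # str

doc = yaml.safe_load("particle_count: 2000000000000")
print(type(doc["particle_count"]).__name__)  # int
```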
Core compilation command:
```shell
uv run gen-project \
  --config-file config.yaml \
  src/lambda_ber_schema/schema/lambda-ber-schema.yaml \
  -d assets
```
Generates all downstream artifacts from the schema.
From a single YAML schema:
assets/lambda-ber-schema.py
assets/jsonschema/
assets/docs/
assets/graphql/
assets/owl/
assets/shacl/
assets/jsonld/
assets/sqlschema/
`config.yaml` controls what gets generated:
```yaml
generators:
  - python
  - json_schema
  - owl
  - graphql
  - shacl
  - markdown_docs
generator_args:
  python:
    package: lambda_ber_schema
  markdown_docs:
    directory: docs
```
Generated from schema:
```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Sample:
    sample_code: str
    sample_type: str
    sample_name: Optional[str] = None
    molecular_composition: Optional[MolecularComposition] = None
    # ... more fields
```
Ready to use in Python code!
```python
from dataclasses import asdict
from lambda_ber_schema import Sample, MolecularComposition

# Create a sample
sample = Sample(
    sample_code="sample-001",
    sample_type="protein",
    sample_name="My Protein",
    molecular_composition=MolecularComposition(
        proteins=["UniProt:P12345"]
    )
)

# Serialize to dict
sample_dict = asdict(sample)

# Write to YAML/JSON
import yaml
with open('sample.yaml', 'w') as f:
    yaml.dump(sample_dict, f)
```
Generated JSON Schema can validate data:
```shell
# Using linkml-validate
uv run linkml-validate \
  -s src/lambda_ber_schema/schema/lambda-ber-schema.yaml \
  tests/data/valid/Sample-protein.yaml

# Or using JSON Schema directly
jsonschema -i data.json assets/jsonschema/lambda-ber-schema.json
```
Schema docs generated at assets/docs/:
Formatted as markdown and/or HTML.
```
lambda-ber-schema/
├── src/lambda_ber_schema/
│   ├── schema/
│   │   └── lambda-ber-schema.yaml  # Source schema
│   └── datamodel/
├── assets/                         # Generated outputs
├── tests/data/valid/               # Example data files
├── docs/                           # Documentation
│   ├── slides/                     # Presentation slides
│   ├── spec.md                     # Specification
│   └── background/                 # Research docs
├── config.yaml                     # Generation config
├── pyproject.toml                  # Python project config
└── justfile / project.justfile     # Build automation
```
Primary source of truth: src/lambda_ber_schema/schema/lambda-ber-schema.yaml
All other files are generated from this schema.
Never edit generated files directly!
Edit the schema, then regenerate.
Auto-generated, do not edit:
```
assets/
├── lambda-ber-schema.py            # Python dataclasses
├── jsonschema/
│   └── lambda-ber-schema.json
├── docs/                           # Generated documentation
├── graphql/
│   └── lambda-ber-schema.graphql
├── owl/
│   └── lambda-ber-schema.owl.ttl
└── ... more formats
```
```
tests/data/valid/
├── Sample-protein.yaml
├── Sample-hetBGL.yaml
├── ExperimentRun-cryoet.yaml
├── WorkflowRun-3dclass.yaml
├── Dataset-berkeley-tfiid.yaml
└── ... more examples
```
Each file is a valid instance of one schema class, named by class and variant.
```
docs/
├── spec.md          # Complete specification
├── background/      # Research and context
│   ├── nexus.md
│   ├── mmcif.md
│   ├── empiar.md
│   └── ... more
├── slides/          # Presentations
│   ├── overview.md
│   └── technical-overview.md
└── examples/        # Analyzed examples
```
```shell
# Clone the repository
git clone https://github.com/lambda-ber/lambda-ber-schema.git
cd lambda-ber-schema

# Install dependencies using uv
just install
# or directly: uv sync --group dev

# Verify installation
uv run linkml-lint --version
```
Requires: Python 3.9+, uv, just (optional but recommended)
- Regenerate artifacts: `just gen-project`
- Add example data under `tests/data/valid/`
- Validate the examples: `just test-examples`
- Run the full suite: `just test`
```shell
# Install dependencies
just install

# Generate all artifacts from schema
just gen-project

# Validate all example files
just test-examples

# Run full test suite
just test

# Generate documentation
just gendoc

# Serve docs locally
just serve
```
Task: Add a new field to Sample class
```yaml
classes:
  Sample:
    attributes:
      # Existing fields...
      ph_value:  # NEW FIELD
        range: float
        description: pH of the sample buffer
        minimum_value: 0
        maximum_value: 14
```
Then run `just gen-project` to regenerate all artifacts.
```yaml
classes:
  CryoGridPreparation:  # NEW CLASS
    is_a: SamplePreparation
    description: Details of cryo-EM grid preparation
    attributes:
      grid_type:
        range: GridTypeEnum
      blot_time:
        range: float
        unit:
          ucum_code: s
      blot_force:
        range: integer
```
Inheritance via `is_a` reuses parent class attributes.
```yaml
enums:
  GridTypeEnum:  # NEW ENUM
    permissible_values:
      quantifoil_r1.2_1.3:
        description: Quantifoil R1.2/1.3 holey carbon grids
      c_flat_1.2_1.3:
        description: C-flat 1.2/1.3 holey carbon grids
      ultrathin_carbon:
        description: Ultrathin continuous carbon grids
      graphene_oxide:
        description: Graphene oxide grids
```
LinkML supports sophisticated validation:
```yaml
classes:
  Sample:
    attributes:
      concentration_value:
        range: float
        minimum_value: 0  # Must be non-negative
    rules:
      - preconditions:
          slot_conditions:
            concentration_value:
              required: true
        postconditions:
          slot_conditions:
            concentration_unit:
              required: true
        description: If concentration_value is provided, unit is required
```
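The rule above ("unit required when value present") is straightforward to mirror in application code. A hedged sketch over plain dicts (function name is mine, not generated):

```python
def check_concentration_rule(sample: dict) -> list[str]:
    """Mirror of the schema rule: concentration_value implies concentration_unit."""
    errors = []
    value = sample.get("concentration_value")
    if value is not None:
        if value < 0:
            errors.append("concentration_value must be non-negative")
        if sample.get("concentration_unit") is None:
            errors.append("concentration_unit is required when concentration_value is set")
    return errors

print(check_concentration_rule({"concentration_value": 50}))
# ['concentration_unit is required when concentration_value is set']
```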
Start with minimal required fields:
```yaml
# tests/data/valid/Sample-minimal.yaml
sample_code: "sample-min-001"
sample_type: "protein"
```
Then expand with optional fields for richer examples:
```yaml
sample_code: "sample-full-001"
sample_type: "protein_complex"
sample_name: "TFIID Complex"
molecular_composition:
  proteins:
    - "UniProt:P12345"
    - "UniProt:P67890"
buffer_composition:
  components:
    - name: "Tris-HCl"
      concentration_value: 50
      concentration_unit: "mM"
```
```shell
git checkout -b add-feature-x
```
Schema Enhancements:
Examples:
Documentation:
- Classes: `PascalCase` (e.g., `SamplePreparation`)
- Slots: `snake_case` (e.g., `sample_code`)
- Enums: `PascalCaseEnum` (e.g., `TechniqueEnum`)
- Enum values: `snake_case` (e.g., `protein_complex`)
- Files: `kebab-case` (e.g., `lambda-ber-schema.yaml`)
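These conventions are regular enough to lint mechanically. A sketch — the regexes are my approximation of the rules above, not an official checker:

```python
import re

# Approximate patterns for each naming convention (illustrative only).
PATTERNS = {
    "class": re.compile(r"^[A-Z][A-Za-z0-9]*$"),             # PascalCase
    "slot": re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$"),    # snake_case
    "enum": re.compile(r"^[A-Z][A-Za-z0-9]*Enum$"),          # PascalCaseEnum
    "file": re.compile(r"^[a-z0-9]+(-[a-z0-9.]+)*\.yaml$"),  # kebab-case
}

def check_name(kind: str, name: str) -> bool:
    """Return True if the name matches the convention for its element kind."""
    return bool(PATTERNS[kind].fullmatch(name))

print(check_name("class", "SamplePreparation"))  # True
print(check_name("slot", "SampleCode"))          # False
```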
Pull requests are reviewed for:
Schema versions follow semver (MAJOR.MINOR.PATCH):
Incremented in the schema's `version` field:

```yaml
version: "1.2.0"
```
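Version strings in this form can be compared programmatically; a minimal sketch of semver ordering (helper name is mine):

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into comparable integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

# Tuples compare element-wise, which matches semver precedence
# (and avoids the string-comparison trap where "1.10.0" < "1.2.0").
assert parse_semver("1.2.0") < parse_semver("1.10.0") < parse_semver("2.0.0")
print(parse_semver("1.2.0"))  # (1, 2, 0)
```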
Multiple layers of testing:
All run via: `just test`
Validate the schema against LinkML metamodel:
```shell
uv run linkml-lint \
  src/lambda_ber_schema/schema/lambda-ber-schema.yaml
```
Checks for:
`linkml-run-examples` validates and converts example data:
```shell
uv run linkml-run-examples \
  -t yaml -t json -t ttl \
  -s src/lambda_ber_schema/schema/lambda-ber-schema.yaml \
  -e tests/data/valid \
  -d examples
```
For each example, the tool validates it against the schema and writes the converted output to `examples/`.
Validate individual files:
```shell
uv run linkml-validate \
  -s src/lambda_ber_schema/schema/lambda-ber-schema.yaml \
  tests/data/valid/Sample-protein.yaml
```
Useful for debugging specific instances.
pytest tests live in the `tests/` directory:
```python
import pytest
from lambda_ber_schema import Sample

def test_sample_creation():
    """Test creating a Sample instance."""
    sample = Sample(
        sample_code="test-001",
        sample_type="protein"
    )
    assert sample.sample_code == "test-001"
    assert sample.sample_type == "protein"
```
Run with: `just pytest`
mypy ensures type safety in Python code:
```shell
just mypy
# or directly:
uv run mypy src tests
```
Catches type errors before runtime.
GitHub Actions runs on every PR:
```yaml
# .github/workflows/test.yml
- name: Run tests
  run: |
    uv sync --group dev
    just gen-project
    just test-examples
    just test
```
Ensures all contributions pass tests.
```yaml
classes:
  MolecularComposition:
    attributes:
      proteins:
        range: uriorcurie
        description: UniProt identifiers
        pattern: "^UniProt:\\w+$"
        multivalued: true
```
Enables semantic integration with external resources.
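The `pattern` constraint above is an ordinary regular expression, so identifiers can also be checked outside LinkML; a sketch (helper name is mine):

```python
import re

# Same pattern as the schema's `pattern:` constraint on `proteins`.
UNIPROT_CURIE = re.compile(r"^UniProt:\w+$")

def valid_protein_ids(ids: list[str]) -> bool:
    """Return True only if every identifier is a UniProt CURIE."""
    return all(UNIPROT_CURIE.fullmatch(i) for i in ids)

print(valid_protein_ids(["UniProt:P12345", "UniProt:P67890"]))  # True
print(valid_protein_ids(["P12345"]))                            # False
```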
Using UCUM (Unified Code for Units of Measure):
```yaml
slots:
  temperature:
    range: float
    unit:
      ucum_code: K
  wavelength:
    range: float
    unit:
      ucum_code: nm
```
```yaml
classes:
  Timestamped:  # Mixin
    mixin: true
    attributes:
      created_at:
        range: string
      updated_at:
        range: string
  Sample:
    mixins:
      - Timestamped  # Inherits timestamp attributes
```
Define once, use many times:
```yaml
slots:
  sample_id:  # Slot definition
    range: Sample

classes:
  ExperimentRun:
    slot_usage:
      sample_id:  # Customize for this class
        required: true
        description: Sample used in this experiment
```
```yaml
classes:
  ExperimentRun:
    rules:
      - preconditions:
          slot_conditions:
            technique:
              equals_string: "cryoem"
        postconditions:
          slot_conditions:
            instrument_id:
              range: CryoEMInstrument
```
Technique-specific validation.
For large datasets:
```shell
# Generate RDF from YAML instance
uv run linkml-convert \
  -s schema.yaml \
  -t rdf \
  input.yaml -o output.ttl

# Query with SPARQL
# Load into triple store (e.g., Apache Jena)
```
Bridge to semantic web ecosystem.
Adding fields: Always backwards compatible
Renaming fields: Use `deprecated` and `aliases`

```yaml
slots:
  sample_id:
    aliases:
      - sample_identifier  # Old name
    deprecated: "Use sample_id instead"
```
Removing fields: Deprecate first, remove in major version
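During the deprecation window, readers can normalize old field names on load. A hedged helper sketch — the alias map mirrors the `aliases` example above, and the function name is mine:

```python
# Map of deprecated slot names to their replacements.
ALIASES = {"sample_identifier": "sample_id"}

def normalize_record(record: dict) -> dict:
    """Rewrite deprecated keys to current slot names, leaving others untouched."""
    out = {}
    for key, value in record.items():
        out[ALIASES.get(key, key)] = value
    return out

print(normalize_record({"sample_identifier": "sample-001", "technique": "cryoem"}))
# {'sample_id': 'sample-001', 'technique': 'cryoem'}
```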
Extend LinkML with custom generators:
```python
from linkml.generators import Generator

class CustomGenerator(Generator):
    def serialize(self):
        # Your custom output format
        pass
```
Or use Python scripts to post-process generated artifacts.
API Integration:
Database Integration:
Data Lake Integration:
Every class and slot should have a `description`, plus `comments` and `see_also` where helpful:
```yaml
classes:
  Sample:
    description: >
      A biological sample that is the subject of study.
      This could be a purified protein, protein complex,
      nucleic acid, or other biological material.
    comments:
      - Samples should have unique identifiers within a study
    see_also:
      - https://example.org/sample-preparation-guide
```
Upcoming capabilities:
LinkML Resources:
This Project:
Schema Questions:
LinkML Questions:
For Users:
For Contributors:
lambda-ber-schema: Structured metadata for structural biology
Questions? Open a GitHub issue or discussion
Want to contribute? Fork and submit a PR
Let's build better data standards together!