.. Copyright 2026 Entalpic

Architecture
============

This page describes the current Atompack design as it exists in this repository.

The short version is that Atompack is an append-only molecule store with a
Python API and a Rust storage engine. It is built around a simple unit of
storage, the molecule, and around predictable read/write modes for dataset
pipelines.

System View
-----------

Atompack sits between dataset producers and dataset consumers. The Python layer
is the ergonomic surface, while the Rust layer owns the file format, indexing,
and read/write paths.

.. code-block:: text

   ASE / numpy / Python producers
              |
              v
   +---------------------+
   | atompack Python     |
   | - Molecule          |
   | - Database          |
   | - add_ase_batch     |
   | - hub helpers       |
   +---------------------+
              |
              v
   +---------------------+
   | Rust core           |
   | - AtomDatabase      |
   | - SOA record build  |
   | - trailing index    |
   | - mmap read mode    |
   +---------------------+
              |
              v
   +---------------------+
   | .atp file / shards  |
   +---------------------+
              |
              v
   training loops / evaluation / Hub distribution

Repository Layout
-----------------

- ``atompack/``: Rust core crate with the storage engine, file format, and core
  data model
- ``atompack-py/``: PyO3 bindings plus the Python package
- ``docs/``: Sphinx documentation
- ``scripts/``: helper scripts such as stub generation

Core Data Model
---------------

The main domain type is ``Molecule``. A molecule stores:

- ``positions`` and ``atomic_numbers``
- builtin optional fields such as ``energy``, ``forces``, ``charges``,
  ``velocities``, ``cell``, ``stress``, and ``pbc``
- custom per-atom and per-molecule properties

``Atom`` exists as a lightweight convenience type, but the stored
representation is already structure-of-arrays oriented. In practice, Atompack
is optimized for moving full molecule records between disk, Rust, numpy, and
ASE rather than for manipulating atom-by-atom objects in storage.

Custom Properties
-----------------

Custom properties are dataset-specific values keyed by name.
They are separate from builtin fields: ``energy``, ``forces``, ``charges``,
``velocities``, ``cell``, ``stress``, ``pbc``, ``name``, ``positions``, and
``atomic_numbers`` keep their dedicated storage and API paths.

Each custom property key has one owner scope:

- molecule properties store one value for the whole molecule
- atom properties store one value per atom

The same custom key cannot exist in both scopes on one molecule, so property
reads do not need a scope argument. New custom keys default to molecule scope;
new atom properties must be written explicitly as atom properties. Overwriting
an existing atom property keeps atom scope and validates that the new value
still has one leading entry per atom.

Custom values can be scalars, strings, ``None``, numeric arrays, or
tensor-shaped numeric arrays. For tensor-shaped values, Atompack preserves the
dtype and the shape of each stored value. Tensor shape is value-level metadata,
not a global schema constraint: the same key may have shape ``(128,)`` on one
molecule and ``(4, 32)`` on another. Atom-scoped tensor values must still have
``n_atoms`` as their first dimension; trailing dimensions are arbitrary.

This flexibility applies to per-molecule storage and retrieval. APIs that build
one concatenated array are necessarily stricter:

- ``Database.add_arrays_batch(...)`` accepts tensor custom properties as
  stacked ndarrays, not lists or tuples of differently shaped arrays. Existing
  ``list[str]`` molecule properties remain valid for batched string columns.
- ``Database.get_molecules_flat(...)`` can concatenate tensor properties only
  when the selected records have compatible shapes for that key.

If shapes differ, the dataset is still valid, but callers should retrieve
molecule records with ``db[i]`` or ``db.get_molecules(...)`` instead of asking
for a flat representation.

ASE ingestion follows the same ownership rule.
``from_ase(...)`` copies supported custom ndarray values as molecule properties
and does not infer atom-property scope from ``atoms.arrays``, ``atoms.info``,
calculator results, or ndarray shape.

Component Overview
------------------

.. grid:: 2
   :gutter: 2

   .. grid-item-card:: Python API

      User-facing entry points such as ``Database(...)``,
      ``Database.open(...)``, ``Molecule.from_arrays(...)``,
      ``add_ase_batch(...)``, and ``atompack.hub``.

   .. grid-item-card:: Rust Storage Engine

      Owns the file format, crash-safe header handling, indexing, append paths,
      and mmap-backed read mode.

   .. grid-item-card:: SOA Records

      Molecules are stored as geometry plus builtin/custom property payloads in
      an array-oriented representation that matches numpy-heavy workloads.

   .. grid-item-card:: Distribution Layer

      Local files, shard directories, and Hugging Face dataset snapshots are all
      exposed through the same high-level reading model.

Python API
----------

The Python package is intentionally small and centered around a few workflows:

- ``atompack.Database(path, ...)`` creates a new file
- ``atompack.Database.open(path, mmap=True, populate=False)`` opens an existing
  file
- ``Molecule.from_arrays(...)`` builds a molecule directly from numpy arrays
- ``Database.add_arrays_batch(...)`` writes stacked numpy batches without
  creating one Python molecule per record
- ``Database.get_molecules_flat(indices)`` returns stacked, already-batched
  arrays suited to training loops
- ``atompack.from_ase(...)`` and ``Molecule.to_ase()`` integrate with ASE
- ``atompack.hub`` uploads, downloads, and opens local or remote shard layouts
  through one reader interface, which allows sharing through the Hugging Face
  Hub

Two open modes matter:

- writable mode: create a file, or reopen with ``mmap=False`` when appending
- read-only mmap mode: the default for ``Database.open(...)`` and the preferred
  mode for serving static datasets

Storage Layout
--------------

The on-disk format lives in ``atompack/src/storage/`` and currently
uses:

1. two 4 KiB header slots
2. a data region containing molecule records
3. a trailing index written on ``flush()``

Each header slot stores the format version, generation number, index location,
molecule count, record format, codec metadata, and a checksum. On open,
Atompack reads both slots and chooses the newest valid one.

This design gives Atompack its main operational properties:

- appends stay simple because new records are written sequentially
- ``flush()`` publishes a new index snapshot, and an interrupted commit leaves
  the last committed state recoverable
- molecule lookup is O(1) through the trailing index

.. code-block:: text

   +---------------------------------------------------------------+
   | Header slot A (4 KiB)                                         |
   | - magic + version                                             |
   | - generation                                                  |
   | - index offset / length                                       |
   | - molecule count                                              |
   | - record / codec metadata                                     |
   | - checksum                                                    |
   +---------------------------------------------------------------+
   | Header slot B (4 KiB)                                         |
   | - same fields, alternate commit target                        |
   +---------------------------------------------------------------+
   | Data region                                                   |
   | - record 0: positions, atomic_numbers, builtin/custom fields  |
   | - record 1: ...                                               |
   | - ...                                                         |
   | - record N-1                                                  |
   +---------------------------------------------------------------+
   | Trailing index                                                |
   | - count                                                       |
   | - per-record offset                                           |
   | - compressed size                                             |
   | - uncompressed size                                           |
   | - atom count                                                  |
   +---------------------------------------------------------------+

At commit time, Atompack writes the index first and then updates the newer
valid header slot. On open, it reads both header slots and chooses the highest
valid generation.

Record Shape
------------

At a conceptual level, one stored molecule looks like this:

.. code-block:: text

   Molecule record
   |
   +-- positions: (n_atoms, 3) float32
   +-- atomic_numbers: (n_atoms,) uint8
   +-- builtin fields:
   |   +-- energy
   |   +-- forces
   |   +-- charges
   |   +-- velocities
   |   +-- cell
   |   +-- stress
   |   +-- pbc
   |   +-- name
   |
   +-- custom atom properties
   |
   +-- custom molecule properties

Read Path
---------

When a file is opened read-only with mmap, Atompack:

- validates the file header
- loads the index in memory-mapped mode
- optionally prefaults mapped pages on Linux when ``populate=True``
- fetches molecules by index without reopening or rescanning the file

For Python users, this means ``db[i]``, ``db.get_molecules(...)``, and
``db.get_molecules_flat(...)`` are all built on direct indexed access to the
underlying file.

Write Path
----------

When a file is opened writable:

- new molecules are appended to the end of the file
- batch ingestion paths can serialize records from numpy arrays directly
- ``flush()`` rewrites the trailing index and advances the committed header
  generation

If the file contains an uncommitted tail after a crash or interrupted write,
writable open truncates the file back to the last committed state before
continuing.

Current Tradeoffs
-----------------

- The storage unit is the whole molecule, not a partial field projection.
- Writable mode and mmap-backed read mode are distinct operational modes.
- Updates and deletes require rewriting the dataset.
- The file format is explicit and simple, but it is specialized for atomistic
  ML datasets rather than for general-purpose tabular workloads.

Reference Points
----------------

For public APIs, the generated docs are usually the best entry point:

- :doc:`Python package API `: ``Database``, ``Molecule``, top-level helpers
- :doc:`ASE helpers `: ``from_ase(...)``, ``to_ase(...)``, ``add_ase_batch(...)``
- :doc:`Hub helpers `: local and Hugging Face dataset access
- :doc:`Rust API `: rustdoc for the core crate and bindings crate
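
Illustrative Sketch: Header Slot Selection
------------------------------------------

The open-time rule described under Storage Layout (read both header slots and
keep the valid one with the highest generation) can be sketched in a few lines
of Python. This is only an illustration: the byte layout, the ``ATPK`` magic
value, the CRC32 checksum, and the names ``Header``, ``encode_slot``,
``decode_slot``, and ``pick_header`` are assumptions made for this example. The
actual format is defined by the Rust code in ``atompack/src/storage/``.

.. code-block:: python

   import struct
   import zlib
   from typing import NamedTuple, Optional

   SLOT_SIZE = 4096  # two 4 KiB header slots, as described in Storage Layout
   MAGIC = b"ATPK"   # hypothetical magic bytes, not the real Atompack value

   # Hypothetical little-endian layout: magic, version, generation,
   # index offset, index length, molecule count.
   _FIELDS = struct.Struct("<4sIQQQQ")
   _CRC = struct.Struct("<I")


   class Header(NamedTuple):
       version: int
       generation: int
       index_offset: int
       index_length: int
       molecule_count: int


   def encode_slot(header: Header) -> bytes:
       """Pack a header into one zero-padded slot with a trailing CRC32."""
       body = _FIELDS.pack(MAGIC, *header)
       return (body + _CRC.pack(zlib.crc32(body))).ljust(SLOT_SIZE, b"\x00")


   def decode_slot(slot: bytes) -> Optional[Header]:
       """Return the header, or None if the magic or checksum does not match."""
       if len(slot) < _FIELDS.size + _CRC.size:
           return None
       body = slot[:_FIELDS.size]
       (stored_crc,) = _CRC.unpack_from(slot, _FIELDS.size)
       magic, *fields = _FIELDS.unpack(body)
       if magic != MAGIC or stored_crc != zlib.crc32(body):
           return None
       return Header(*fields)


   def pick_header(file_prefix: bytes) -> Header:
       """Choose the valid slot with the highest generation."""
       slots = (file_prefix[:SLOT_SIZE], file_prefix[SLOT_SIZE:2 * SLOT_SIZE])
       valid = [h for h in map(decode_slot, slots) if h is not None]
       if not valid:
           raise ValueError("no valid header slot")
       return max(valid, key=lambda h: h.generation)

In this sketch a slot whose checksum fails is simply ignored, so a torn write
to one slot leaves the other slot, and the generation it records, as the state
that gets opened.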