.. Copyright 2026 Entalpic

Getting Started
===============

Atompack is a Python package and Rust-backed file format for atomistic datasets. It stores whole
molecules together with builtin fields such as forces, energies, charges, velocities, stress,
PBC, and arbitrary custom properties for additional properties. The main workflow is:

- write molecules or stacked numpy batches to an append-only ``.atp`` file
- or use the Hugging Face Hub integration to download and open remote shard layouts
- reopen the file in read-only mmap mode for random access
- convert structures to or from ASE when needed
- publish a single file or shard directory through the Hugging Face Hub when distribution matters

Installation
------------

Install the Python package from Pypi: 

.. code-block:: bash

   pip install atompack-db

or from the source repository:

.. code-block:: bash

   uv pip install "git+https://github.com/LeMaterial/atompack.git@main#subdirectory=atompack-py"

Quickstart
----------

Create a molecule, attach properties, write it to disk, and read it back:

.. code-block:: python

   import atompack
   import numpy as np

   positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], dtype=np.float32)
   atomic_numbers = np.array([6, 8], dtype=np.uint8)
   mol = atompack.Molecule.from_arrays(positions, atomic_numbers)

   mol.energy = -123.456
   mol.forces = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype=np.float32)

   db = atompack.Database("data.atp", overwrite=True)
   db.add_molecule(mol)
   db.flush()

   db = atompack.Database.open("data.atp")
   mol = db[0]
   print(mol.energy)

Writable vs read-only opens:

- ``atompack.Database(path, ...)`` creates a new file
- ``atompack.Database.open(path)`` opens read-only with mmap by default
- ``atompack.Database.open(path, mmap=False)`` reopens the file for appends

Batch Writing, Simple Reading
-----------------------------

For ingestion, the batched write path is the most efficient. For reads, a simple indexed loop is
often already fast enough:

.. code-block:: python

   import atompack
   import numpy as np

   positions = np.random.rand(32, 64, 3).astype(np.float32)
   atomic_numbers = np.full((32, 64), 6, dtype=np.uint8)

   db = atompack.Database("batch.atp", overwrite=True)
   db.add_arrays_batch(positions, atomic_numbers)
   db.flush()

   db = atompack.Database.open("batch.atp")
   for i in range(4):
       mol = db[i]
       print(i, len(mol), mol.positions.shape)

Read-Only Mode (Memory-Mapped Index)
------------------------------------

Read-only mmap mode is the default for ``Database.open(...)`` and is the right choice for
read-mostly datasets:

.. code-block:: python

   db = atompack.Database.open("data.atp")  # mmap=True by default
   mol = db[0]

On Linux you can also prefault mapped pages:

.. code-block:: python

   db = atompack.Database.open("data.atp", mmap=True, populate=True)

ASE Integration
---------------

If you use ASE, you can convert individual structures with ``from_ase(...)`` and write many
structures efficiently with ``add_ase_batch(...)``:

.. code-block:: python

   import atompack
   from ase import Atoms

   ase_atoms = Atoms("H2O", positions=[[0, 0, 0], [1, 0, 0], [0, 1, 0]])
   mol = atompack.from_ase(ase_atoms)
   restored = mol.to_ase()

   structures = [
       Atoms("H2O", positions=[[0, 0, 0], [1, 0, 0], [0, 1, 0]]),
       Atoms("CO2", positions=[[0, 0, 0], [1.16, 0, 0], [-1.16, 0, 0]]),
   ]

   db = atompack.Database("ase_data.atp", overwrite=True)
   atompack.add_ase_batch(db, structures, batch_size=256)
   db.flush()

``add_ase_batch(...)`` is the preferred path when you already have an iterator or list of
``ase.Atoms`` objects and want to ingest them directly into a database.

Custom ASE values are copied as molecule properties during ingestion. ``atoms.info``,
custom ``atoms.arrays``, and non-builtin calculator results support ``None``, strings,
numeric scalars, and numeric arrays with dtype ``float32``, ``float64``, ``int32``, or
``int64``. Higher-rank arrays are stored as tensor properties; Atompack does not infer
atom-property scope from ASE array shape.

When Atompack Is A Good Fit
---------------------------

- dataset creation pipelines that append many molecules and publish immutable snapshots
- training workloads that repeatedly sample whole molecules at random
- pipelines that want both a Python API and a low-level Rust storage engine
- projects that need ASE conversion or Hub distribution without introducing a full database service

Tradeoffs
---------

- Append-only: updates and deletes require rewriting the file, but thanks to the efficient storage format, this is very fast to do.
- Read/write mode and read-only mmap mode are separate on purpose.
- The storage unit is a whole molecule; Atompack is not a query engine or column store.
- Compression is optional, but it is not the main abstraction. The main abstraction is a durable
  molecule record with direct indexing.

Next Steps
----------

- Python API reference: :doc:`autoapi/index`
- Hub upload/download helpers: :doc:`huggingface`
- Storage format and internals: :doc:`architecture`
- Benchmarks and reproducibility: :doc:`performance`
- Rust crate API docs: :doc:`rust-api`