Getting Started¶
Atompack is a Python package and Rust-backed file format for atomistic datasets. It stores whole molecules together with builtin fields such as forces, energies, charges, velocities, stress, and periodic boundary conditions (PBC), plus arbitrary custom properties. The main workflow is:
write molecules or stacked NumPy batches to an append-only .atp file, or use the Hugging Face Hub integration to download and open remote shard layouts
reopen the file in read-only mmap mode for random access
convert structures to or from ASE when needed
publish a single file or shard directory through the Hugging Face Hub when distribution matters
Installation¶
Install the Python package from PyPI:
pip install atompack-db
or from the source repository:
uv pip install "git+https://github.com/LeMaterial/atompack.git@main#subdirectory=atompack-py"
Quickstart¶
Create a molecule, attach properties, write it to disk, and read it back:
import atompack
import numpy as np
positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], dtype=np.float32)
atomic_numbers = np.array([6, 8], dtype=np.uint8)
mol = atompack.Molecule.from_arrays(positions, atomic_numbers)
mol.energy = -123.456
mol.forces = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype=np.float32)
db = atompack.Database("data.atp", overwrite=True)
db.add_molecule(mol)
db.flush()
db = atompack.Database.open("data.atp")
mol = db[0]
print(mol.energy)
Writable vs read-only opens:
atompack.Database(path, ...) creates a new file
atompack.Database.open(path) opens read-only with mmap by default
atompack.Database.open(path, mmap=False) reopens the file for appends
Batch Writing, Simple Reading¶
For ingestion, the batched write path is the most efficient. For reads, a simple indexed loop is often already fast enough:
import atompack
import numpy as np
positions = np.random.rand(32, 64, 3).astype(np.float32)
atomic_numbers = np.full((32, 64), 6, dtype=np.uint8)
db = atompack.Database("batch.atp", overwrite=True)
db.add_arrays_batch(positions, atomic_numbers)
db.flush()
db = atompack.Database.open("batch.atp")
for i in range(4):
    mol = db[i]
    print(i, len(mol), mol.positions.shape)
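The batch example above uses stacked arrays in which every molecule has the same atom count. When molecules vary in size, one possible preparation step is to group them by atom count before stacking. The sketch below assumes that each batch must share one atom count, as the shapes above suggest. `stack_by_atom_count` is a hypothetical helper, not part of Atompack.

```python
from collections import defaultdict

import numpy as np


def stack_by_atom_count(molecules):
    """Group (positions, atomic_numbers) pairs by atom count and stack each group.

    Hypothetical helper for illustration; not part of the Atompack API.
    Returns {n_atoms: (positions[n_mol, n_atoms, 3], atomic_numbers[n_mol, n_atoms])}.
    """
    groups = defaultdict(list)
    for positions, atomic_numbers in molecules:
        groups[len(atomic_numbers)].append((positions, atomic_numbers))

    stacked = {}
    for n_atoms, members in groups.items():
        # Same dtypes as the batch example: float32 positions, uint8 atomic numbers.
        pos = np.stack([np.asarray(p, dtype=np.float32) for p, _ in members])
        num = np.stack([np.asarray(z, dtype=np.uint8) for _, z in members])
        stacked[n_atoms] = (pos, num)
    return stacked


# Five toy carbon-only "molecules" with 3 or 5 atoms each.
rng = np.random.default_rng(0)
molecules = [(rng.random((n, 3)), np.full(n, 6)) for n in (3, 5, 3, 5, 5)]

batches = stack_by_atom_count(molecules)
for n_atoms, (pos, num) in sorted(batches.items()):
    # Each (pos, num) pair has the stacked layout passed to add_arrays_batch above.
    print(n_atoms, pos.shape, num.shape)
```

Each resulting pair could then be written with its own `db.add_arrays_batch(pos, num)` call, under the same assumption.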
Read-Only Mode (Memory-Mapped Index)¶
Read-only mmap mode is the default for Database.open(...) and is the right choice for
read-mostly datasets:
db = atompack.Database.open("data.atp") # mmap=True by default
mol = db[0]
On Linux you can also prefault mapped pages:
db = atompack.Database.open("data.atp", mmap=True, populate=True)
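To see why a memory-mapped, read-only open suits random access, here is a generic sketch built only on Python's standard `mmap` and `struct` modules. It does not use Atompack's on-disk format. The fixed-size record layout (`RECORD`) and the `read_record` helper are assumptions made purely for illustration.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: int32 id followed by float64 energy (12 bytes).
RECORD = struct.Struct("<id")

# Write 1000 records sequentially, as an append-only writer would.
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(RECORD.pack(i, -0.5 * i))


def read_record(mapped, index):
    """Decode record `index` straight from the mapped bytes."""
    return RECORD.unpack_from(mapped, index * RECORD.size)


# Map the file read-only; only the pages that are touched get loaded.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        last = read_record(mapped, 999)
        first = read_record(mapped, 0)

print(first, last)
```

Reading record 999 does not require scanning the 999 records before it, which is the property that makes mapped files attractive for random sampling.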
ASE Integration¶
If you use ASE, you can convert individual structures with from_ase(...) and write many
structures efficiently with add_ase_batch(...):
import atompack
from ase import Atoms
ase_atoms = Atoms("H2O", positions=[[0, 0, 0], [1, 0, 0], [0, 1, 0]])
mol = atompack.from_ase(ase_atoms)
restored = mol.to_ase()
structures = [
    Atoms("H2O", positions=[[0, 0, 0], [1, 0, 0], [0, 1, 0]]),
    Atoms("CO2", positions=[[0, 0, 0], [1.16, 0, 0], [-1.16, 0, 0]]),
]
db = atompack.Database("ase_data.atp", overwrite=True)
atompack.add_ase_batch(db, structures, batch_size=256)
db.flush()
add_ase_batch(...) is the preferred path when you already have an iterator or list of
ase.Atoms objects and want to ingest them directly into a database.
Custom ASE values are copied as molecule properties during ingestion. Values from
atoms.info, custom atoms.arrays, and non-builtin calculator results may be None, strings,
numeric scalars, or numeric arrays with dtype float32, float64, int32, or
int64. Higher-rank arrays are stored as tensor properties; Atompack does not infer
atom-property scope from ASE array shape.
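Before ingesting ASE structures, it can help to check which custom values fall inside the supported set listed above. The sketch below is a hypothetical pre-check, not an Atompack function. `is_supported_custom_value` and `SUPPORTED_DTYPES` are illustrative names, and the handling of Python `bool` is an assumption because the text above does not mention booleans.

```python
import numbers

import numpy as np

# Array dtypes listed above as supported for custom values.
SUPPORTED_DTYPES = {
    np.dtype("float32"),
    np.dtype("float64"),
    np.dtype("int32"),
    np.dtype("int64"),
}


def is_supported_custom_value(value):
    """Return True if `value` matches the supported kinds listed above.

    Hypothetical helper for illustration; not part of the Atompack API.
    """
    if value is None or isinstance(value, str):
        return True
    if isinstance(value, np.ndarray):
        return value.dtype in SUPPORTED_DTYPES
    # Assumption: bool is rejected even though Python treats it as an integer.
    if isinstance(value, bool):
        return False
    # Real numbers only; this excludes complex values.
    return isinstance(value, numbers.Real)


# Example atoms.info-style dictionary with two values outside the supported set.
info = {
    "source": "dft",
    "temperature": 300.0,
    "mask": np.zeros(3, dtype=bool),
    "tags": ["a", "b"],
}
unsupported = [key for key, value in info.items() if not is_supported_custom_value(value)]
print(unsupported)
```

Keys reported as unsupported could be converted (for example, a boolean mask cast to int32) or dropped before ingestion.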
When Atompack Is A Good Fit¶
dataset creation pipelines that append many molecules and publish immutable snapshots
training workloads that repeatedly sample whole molecules at random
pipelines that want both a Python API and a low-level Rust storage engine
projects that need ASE conversion or Hub distribution without introducing a full database service
Tradeoffs¶
Append-only: updates and deletes require rewriting the whole file, though the storage format is designed to keep that rewrite fast.
Read/write mode and read-only mmap mode are separate on purpose.
The storage unit is a whole molecule; Atompack is not a query engine or column store.
Compression is optional, but it is not the main abstraction. The main abstraction is a durable molecule record with direct indexing.
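The append-only and direct-indexing tradeoffs above can be illustrated with a toy record log. This is a minimal in-memory sketch using only the standard library. It is not Atompack's storage format, and `AppendOnlyLog` and `rewrite_without` are hypothetical names.

```python
import io
import struct

LENGTH = struct.Struct("<I")  # 4-byte little-endian length prefix (illustrative)


class AppendOnlyLog:
    """Toy in-memory record log: append-only payloads plus an offset index."""

    def __init__(self):
        self._buf = io.BytesIO()
        self._offsets = []  # byte offset of each record, in insertion order

    def append(self, payload):
        """Write one length-prefixed record at the end and return its index."""
        self._buf.seek(0, io.SEEK_END)
        self._offsets.append(self._buf.tell())
        self._buf.write(LENGTH.pack(len(payload)))
        self._buf.write(payload)
        return len(self._offsets) - 1

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, index):
        # Direct indexing: one seek to the stored offset, then one read.
        self._buf.seek(self._offsets[index])
        (length,) = LENGTH.unpack(self._buf.read(LENGTH.size))
        return self._buf.read(length)


def rewrite_without(log, dropped):
    """'Delete' records by copying every other record into a fresh log."""
    fresh = AppendOnlyLog()
    for i in range(len(log)):
        if i not in dropped:
            fresh.append(log[i])
    return fresh


log = AppendOnlyLog()
for payload in (b"H2O", b"CO2", b"CH4"):
    log.append(payload)

# Removing record 1 means building a new log from the remaining records.
compacted = rewrite_without(log, dropped={1})
print(len(compacted), compacted[0], compacted[1])
```

Existing bytes are never modified in place, which keeps the writer simple and lets readers treat a finished file as an immutable snapshot. The cost is that any delete or update is a full copy.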
Next Steps¶
Python API reference: API Reference
Hub upload/download helpers: Hugging Face
Storage format and internals: Architecture
Benchmarks and reproducibility: Performance
Rust crate API docs: Rust API (Internal)