Improve Sampler Parameter Documentation #178

@selmanozleyen

Description

Disclaimer: prettified with AI.

The sampler documentation should be cleaned up so it is easier to read and better aligned with the public API. In ChunkSampler, the parameter section should be reordered to follow the constructor flow more naturally, and several descriptions should be rewritten to better explain chunking, batching, masking, and RNG behavior. In DistributedRandomSampler, long parameter descriptions should be wrapped and tightened so the generated docs are easier to scan.

Proposed Changes

  • Reorder the ChunkSampler parameter docs to match the way users read and configure the sampler.
  • Clarify chunk_size, preload_nchunks, and batch_size, especially the relationship between them.
  • Rewrite shuffle, drop_last, mask, and rng descriptions to be more explicit and user-facing.
  • Reformat long DistributedRandomSampler parameter descriptions for readability.
  • Keep this as a documentation-only change with no functional behavior change.
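The relationship between `chunk_size`, `preload_nchunks`, and `batch_size` can be sketched as a small check. This is only an illustration of the constraints stated in the proposed docstrings; `check_sampler_sizes` is a hypothetical helper, not part of annbatch, and the actual validation in `ChunkSampler` may differ.

```python
def check_sampler_sizes(chunk_size: int, preload_nchunks: int, batch_size: int) -> int:
    """Hypothetical helper: validate the size parameters and return the
    number of batches produced from one preloaded group of chunks."""
    # Observations loaded by one I/O request.
    preload_size = chunk_size * preload_nchunks
    if batch_size > preload_size:
        raise ValueError(
            f"batch_size ({batch_size}) must not exceed "
            f"chunk_size * preload_nchunks ({preload_size})"
        )
    if preload_size % batch_size != 0:
        raise ValueError(
            f"chunk_size * preload_nchunks ({preload_size}) must be "
            f"divisible by batch_size ({batch_size})"
        )
    return preload_size // batch_size


# 256 * 4 = 1024 observations per request, split into 8 batches of 128.
print(check_sampler_sizes(chunk_size=256, preload_nchunks=4, batch_size=128))
```

Divisibility already implies that `batch_size` cannot exceed the preloaded size, so the first check mainly exists to give a clearer error message.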

Focused Diff

diff --git a/src/annbatch/samplers/_chunk_sampler.py b/src/annbatch/samplers/_chunk_sampler.py
@@
-    batch_size
-        Number of observations per batch.
     chunk_size
-        Size of each chunk i.e. the range of each chunk yielded.
-    mask
-        A slice defining the observation range to sample from (start:stop).
-    shuffle
-        Whether to shuffle chunk and index order.
+        Number of contiguous observations per on-disk chunk.
     preload_nchunks
-        Number of chunks to load per iteration.
-    drop_last
-        Whether to drop the last incomplete batch.
-    rng
-        Random number generator for shuffling. Note that :func:`torch.manual_seed`
-        has no effect on reproducibility here; pass a seeded
-        :class:`numpy.random.Generator` to control randomness.
+        Number of chunks to group into each I/O request.
+        ``chunk_size * preload_nchunks`` must be divisible by
+        ``batch_size``.
+    batch_size
+        Number of observations per batch. Must not exceed
+        ``chunk_size * preload_nchunks``.
@@
+    shuffle
+        If ``True``, shuffle chunk order within each epoch.
+    drop_last
+        If ``True``, drop the final batch when it contains fewer than
+        ``batch_size`` observations.
+    mask
+        A ``slice`` restricting sampling to a sub-range of observations.
+        For example, ``slice(100, 500)`` limits sampling to observations
+        100 through 499.
+    rng
+        A :class:`numpy.random.Generator` used for shuffling and
+        replacement draws. When ``None``, a new default generator is
+        created.
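
The `slice(100, 500)` example in the proposed `mask` text follows ordinary Python slice semantics, where the stop index is exclusive. A minimal sketch using only the standard library (`masked_range` is a hypothetical name used for illustration, not an annbatch function):

```python
def masked_range(n_obs: int, mask: slice) -> range:
    """Hypothetical helper: resolve a mask against a dataset of n_obs
    observations and return the observation indices it selects."""
    # slice.indices clamps start/stop to [0, n_obs] and fills in defaults.
    start, stop, step = mask.indices(n_obs)
    return range(start, stop, step)


obs = masked_range(1000, slice(100, 500))
print(obs[0], obs[-1], len(obs))  # 100 499 400
```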

diff --git a/src/annbatch/samplers/_distributed_random_sampler.py b/src/annbatch/samplers/_distributed_random_sampler.py
@@
-        Either a string naming a distributed backend (``"torch"`` or ``"jax"``),
-        or a callable that returns ``(rank, world_size)``.
+        Either a string naming a distributed backend (``"torch"`` or
+        ``"jax"``), or a callable that returns ``(rank, world_size)``.
@@
-        If *True*, round each rank's observation count down to a multiple of ``batch_size`` so that all workers (ranks) yield the same number of batches.
-        Set to *False* to use the raw ``n_obs // world_size`` split, which may result in an uneven number of batches per worker.
+        If *True*, round each rank's observation count down to a
+        multiple of ``batch_size`` so that all workers (ranks) yield
+        the same number of batches.
+        Set to *False* to use the raw ``n_obs // world_size`` split,
+        which may result in an uneven number of batches per worker.
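
The rounding described in the last hunk can be sketched as plain arithmetic. This is an assumption-laden illustration, not the `DistributedRandomSampler` implementation: `obs_per_rank` and its `round_down` flag are hypothetical names, since the diff does not show the actual parameter name.

```python
def obs_per_rank(n_obs: int, world_size: int, batch_size: int, round_down: bool = True) -> int:
    """Hypothetical helper: number of observations assigned to each rank."""
    # Raw split described in the docstring: n_obs // world_size.
    per_rank = n_obs // world_size
    if round_down:
        # Round down to a multiple of batch_size so each rank holds
        # only full batches.
        per_rank -= per_rank % batch_size
    return per_rank


# 1000 observations over 3 ranks: the raw split is 333 per rank;
# rounding down to a multiple of 64 gives 320, i.e. 5 full batches.
print(obs_per_rank(1000, 3, 64, round_down=False))  # 333
print(obs_per_rank(1000, 3, 64, round_down=True))   # 320
```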

Labels

enhancement: New feature or request
