Commit 4954a2f
Return ragged data (#36)

This is a major breaking change for GVL. Users should see the "What's a gvl.Dataset?" page in the documentation for details, but the major breaks include:

- Removed the `length` argument from `gvl.write()`. Regions/BED files are now used as-is. If you want uniform-length regions centered on inputs/peaks as before, preprocess your BED file with `gvl.with_length`.
- Changed `Dataset.output_length` from a property to a dynamic setting, with behavior described in the "What's a gvl.Dataset?" page.
- Changed the track output shape to have a track axis.
- Datasets are now deterministic by default.

As a result of these changes, GVL seamlessly supports ragged-length output and also paves the way for on-the-fly splicing. Since many changes were made, I wouldn't be surprised if a few bugs crop up despite my best efforts -- please leave issues if so!

* feat: option to return ragged data from gvl.Dataset
* fix: disable shifts for ragged output. fix(wip): may need to skip some variants if they are outside the ragged region.
* feat(wip): ragged output
* fix: incorrect mask from get_keep_mask_for_length (#37)
* bump: version 0.8.0 → 0.8.1
* feat!: output_length is set dynamically. fix: hap reconstruction matches bcftools
* feat!: change default for Dataset.deterministic from False to True. change track output from a list of arrays to having a track dimension, i.e. from shape (b [p] l) to (b t [p] l). docs: add dataset.md, faq.md, and overhaul geuvadis.ipynb to be simpler and reflect changes in API.

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 67661d1 commit 4954a2f
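Since `gvl.write()` no longer accepts `length`, uniform-length windows must be produced before writing. A minimal sketch of the centering behavior the commit message attributes to `gvl.with_length` (this standalone helper is written for illustration only and is not GVL's actual implementation, which operates on whole BED files):

```python
def center_to_length(start: int, end: int, length: int) -> tuple[int, int]:
    """Return a window of `length` bp centered on the midpoint of [start, end).

    Real preprocessing would also clamp the window to contig bounds.
    """
    center = (start + end) // 2
    new_start = center - length // 2
    new_end = new_start + length
    return new_start, new_end

# A 10 bp peak expanded to a uniform 1,000 bp window
print(center_to_length(1000, 1010, 1000))  # -> (505, 1505)
```

Applied to every interval in a BED file, a helper like this reproduces the old uniform-length, center-anchored behavior before calling `gvl.write()`.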

174 files changed: +6805 −1911 lines


.gitignore

Lines changed: 2 additions & 0 deletions

```diff
@@ -14,6 +14,8 @@ scripts/
 .custom_benchmarks/
 benchmarking/
 genvarloader*.tar.gz
+.nfs*
+commit_msg.txt

 # Byte-compiled / optimized / DLL files
 __pycache__/
```

README.md

Lines changed: 20 additions & 67 deletions

````diff
@@ -1,34 +1,35 @@
+<img src=docs/source/_static/gvl_logo.png width="200">
+
 [![PyPI version](https://badge.fury.io/py/genvarloader.svg)](https://pypi.org/project/genvarloader/)
 [![Documentation Status](https://readthedocs.org/projects/genvarloader/badge/?version=latest)](https://genvarloader.readthedocs.io)
 [![Downloads](https://static.pepy.tech/badge/genvarloader)](https://pepy.tech/project/genvarloader)
 [![PyPI - Downloads](https://img.shields.io/pypi/dm/genvarloader)](https://img.shields.io/pypi/dm/genvarloader)
 [![GitHub stars](https://badgen.net/github/stars/mcvickerlab/GenVarLoader)](https://GitHub.com/Naereen/mcvickerlab/GenVarLoader)
+[![bioRxiv](https://img.shields.io/badge/bioRxiv-2025.01.15.633240-b31b1b.svg)](https://www.biorxiv.org/content/10.1101/2025.01.15.633240)

-<img src=docs/source/_static/gvl_logo.png width="200">
-
-[[Documentation](https://genvarloader.readthedocs.io/en/latest/)][[Preprint](https://www.biorxiv.org/content/10.1101/2025.01.15.633240v1)]
+## Features

-GenVarLoader provides a fast, memory efficient data loader for training sequence models on genetic variation. For example, this can be used to train a DNA language model on human genetic variation (e.g. [Dalla-Torre et al.](https://www.biorxiv.org/content/10.1101/2023.01.11.523679)) or train sequence to function models with genetic variation (e.g. [Celaj et al.](https://www.biorxiv.org/content/10.1101/2023.09.20.558508v1), [Drusinsky et al.](https://www.biorxiv.org/content/10.1101/2024.07.27.605449v1), [He et al.](https://www.biorxiv.org/content/10.1101/2024.10.15.618510v1), and [Rastogi et al.](https://www.biorxiv.org/content/10.1101/2024.09.23.614632v1)).
+GenVarLoader provides a fast, memory efficient data structure for training sequence models on genetic variation. For example, this can be used to train a DNA language model on human genetic variation (e.g. [Dalla-Torre et al.](https://www.biorxiv.org/content/10.1101/2023.01.11.523679)) or train sequence to function models with genetic variation (e.g. [Celaj et al.](https://www.biorxiv.org/content/10.1101/2023.09.20.558508v1), [Drusinsky et al.](https://www.biorxiv.org/content/10.1101/2024.07.27.605449v1), [He et al.](https://www.biorxiv.org/content/10.1101/2024.10.15.618510v1), and [Rastogi et al.](https://www.biorxiv.org/content/10.1101/2024.09.23.614632v1)).

-## Features
-- Avoids writing any sequences to disk
-- Works with datasets that are larger than RAM
-- Generates haplotypes up to 1,000 times faster than reading a FASTA file
-- Generates tracks up to 450 times faster than reading a BigWig
-- Supports indels and re-aligns tracks to haplotypes that have them
+- Avoid writing any sequences to disk (can save >2,000x storage vs. writing personalized genomes with bcftools consensus)
+- Generate haplotypes up to 1,000 times faster than reading a FASTA file
+- Generate tracks up to 450 times faster than reading a BigWig
+- **Supports indels** and re-aligns tracks to haplotypes that have them
 - Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig

+See our [preprint](https://www.biorxiv.org/content/10.1101/2025.01.15.633240) for benchmarking and implementation details.
+
 ## Installation

 ```bash
 pip install genvarloader
 ```

-A PyTorch dependency is not included since it may require [special instructions](https://pytorch.org/get-started/locally/).
+A PyTorch dependency is **not** included since it may require [special instructions](https://pytorch.org/get-started/locally/).

 ## Quick Start

-### Write a `gvl.Dataset`
+### Write a [`gvl.Dataset`](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Dataset)

 GenVarLoader has both a CLI and Python API for writing datasets. The Python API provides some extra flexibility, for example for a multi-task objective.

@@ -53,15 +54,14 @@ gvl.write(
 )
 ```

-### Open a `gvl.Dataset` and get a PyTorch DataLoader
+### Open a [`gvl.Dataset`](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Dataset) and get a PyTorch DataLoader

 ```python
 import genvarloader as gvl

 dataset = gvl.Dataset.open(path="cool_dataset.gvl", reference="hg38.fa")
 train_samples = ["David", "Aaron"]
-# use first 10 regions for training
-train_dataset = dataset.subset_to(regions=slice(10), samples=train_samples)
+train_dataset = dataset.subset_to(regions="train_regions.bed", samples=train_samples)
 train_dataloader = train_dataset.to_dataloader(batch_size=32, shuffle=True, num_workers=1)

 # use it in your training loop
@@ -72,11 +72,9 @@ for haplotypes, tracks in train_dataloader:
 ### Inspect specific instances

 ```python
-dataset[99] # 100-th instance of the raveled dataset
 dataset[0, 9] # first region, 10th sample
-dataset.isel(regions=0, samples=9)
-dataset[:10] # first 10 instances
-dataset[:10, :5] # first 10 regions and 5 samples
+dataset[:10, 4] # first 10 regions, 5th sample
+dataset[:10, :5] # first 10 regions and first 5 samples
 ```

 ### Transform the data on-the-fly
@@ -87,57 +85,12 @@ from einops import rearrange

 def transform(haplotypes, tracks):
     ohe = sp.DNA.ohe(haplotypes)
-    ohe = rearrange(ohe, "batch length alphabet -> batch alphabet length")
+    ohe = rearrange(ohe, "... length alphabet -> ... alphabet length")
     return ohe, tracks

 transformed_dataset = dataset.with_settings(transform=transform)
 ```

-### Pre-computing transformed tracks
-
-Suppose we want to return tracks that are the z-scored, log(CPM + 1) version of the original. Sometimes it is better to write this to disk to avoid having to recompute it during training or inference.
-
-```python
-import numpy as np
-
-# We'll assume we already have an array of total counts for each sample.
-# This usually can't be derived from a gvl.Dataset since it only has data for specific regions.
-total_counts = np.load('total_counts.npy') # shape: (samples) float32
-
-# We'll compute the mean and std log(CPM + 1) using the training split
-means = np.empty((train_dataset.n_regions, train_dataset.region_length), np.float32)
-stds = np.empty_like(means)
-just_tracks = train_dataset.with_settings(return_sequences=False, jitter=0)
-for region in range(len(means)):
-    cpm = np.log1p(just_tracks[region, :] / total_counts[:, None] * 1e6)
-    means[region] = cpm.mean(0)
-    stds[region] = cpm.std(0)
-
-# Define our transformation
-def z_log_cpm(dataset_indices, region_indices, sample_indices, tracks: gvl.Ragged[np.float32]):
-    # In the event that the dataset only has SNPs, the full length tracks will all be the same length.
-    # So, we can reshape the ragged data into a regular array.
-    _tracks = tracks.data.reshape(-1, dataset.region_length)
-
-    # Otherwise, we would have to leave `tracks` as a gvl.Ragged array to accommodate different lengths.
-    # In that case, we could do the transformation with a Numba compiled function instead.
-
-    # original tracks -> log(CPM + 1) -> z-score
-    _tracks = np.log1p(_tracks / total_counts[sample_indices, None] * 1e6)
-    _tracks = (_tracks - means[region_indices]) / stds[region_indices]
-
-    return gvl.Ragged.from_offsets(_tracks.ravel(), tracks.shape, tracks.offsets)
-
-# This can take about as long as writing the original tracks or longer, depending on the transformation.
-dataset_with_zlogcpm = dataset.write_transformed_track("z-log-cpm", "bigwig", transform=z_log_cpm)
-
-# The dataset now has both tracks available, "bigwig" and "z-log-cpm", and we can choose to return either one or both.
-haps_and_zlogcpm = dataset_with_zlogcpm.with_settings(return_tracks="z-log-cpm")
-
-# If we re-opened the dataset after running this then we could write...
-dataset = gvl.Dataset.open("cool_dataset.gvl", "hg38.fa", return_tracks="z-log-cpm")
-```
-
 ## Performance tips
-- GenVarLoader uses multithreading extensively, so it's best to use 0 or 1 workers with your PyTorch `DataLoader`.
-- A GenVarLoader `Dataset` is most efficient when given batches of indices, rather than one at a time. PyTorch `DataLoader` by default uses one index at a time, so if you want to use a ***custom*** PyTorch `Sampler` you should wrap it with a PyTorch `BatchSampler` before passing it to `Dataset.to_dataloader()`.
+- GenVarLoader uses multithreading extensively, so it's best to use `0` or `1` workers with your [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).
+- A GenVarLoader [`Dataset`](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Dataset) is most efficient when given batches of indices, rather than one at a time. By default, [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)s use one index at a time, so if you want to use a ***custom*** [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler) you should wrap it with a [`BatchSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.BatchSampler) before passing it to [`Dataset.to_dataloader()`](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Dataset.to_dataloader).
````
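The removed example above passed ragged track data around as a `gvl.Ragged` value with `data` and `offsets` fields. As a rough sketch of what that offsets-based layout implies, in plain NumPy (the field names follow the removed example; everything else here is illustrative, not GVL's actual class):

```python
import numpy as np

# Three variable-length tracks packed into one flat buffer.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dtype=np.float32)
# Row i occupies data[offsets[i]:offsets[i + 1]].
offsets = np.array([0, 2, 3, 6], dtype=np.int64)

rows = [data[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
print([r.tolist() for r in rows])  # -> [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
```

This layout is why the removed transform could `ravel()` a regular array and rebuild a ragged value from the original `shape` and `offsets`: the flat buffer is the only storage, and the offsets recover the row boundaries.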

docs/requirements.txt

Lines changed: 2 additions & 1 deletion

```diff
@@ -27,4 +27,5 @@ nbsphinx
 torch
 torchvision
 torchaudio
-ipython
+ipython
+awkward
```
