This is a major breaking change for GVL. See the "What's a gvl.Dataset?" page in the documentation for details; the major breaking changes include:
- removed the `length` argument from gvl.write(). Regions/BED files are now used as-is. If you want uniform length regions centered on inputs/peaks as before, preprocess your BED file with `gvl.with_length`.
- changed `Dataset.output_length` from a property to a dynamic setting with behavior described in the "What's a gvl.Dataset?" page.
- changed track output shape to have a track axis.
- Datasets are now deterministic by default.
As a result of these changes, GVL seamlessly supports ragged-length output and also paves the way for on-the-fly splicing. Since many changes were made, a few bugs may have slipped through despite my best efforts -- please open an issue if so!
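The exact signature of `gvl.with_length` isn't shown here, but the operation it replaces, expanding or contracting each BED interval to a uniform length centered on the original region, can be sketched in plain Python. The function below is an illustration of that arithmetic, not the gvl API:

```python
def center_to_length(start: int, end: int, length: int) -> tuple[int, int]:
    """Expand/contract the interval [start, end) to `length` bp centered on its midpoint.

    Hypothetical helper for illustration only -- use gvl.with_length to
    preprocess an actual BED file.
    """
    center = (start + end) // 2
    new_start = center - length // 2
    return new_start, new_start + length

# A 10 bp peak at [100, 110) becomes a 50 bp window centered on 105.
print(center_to_length(100, 110, 50))  # (80, 130)
```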
* feat: option to return ragged data from gvl.Dataset
* fix: disable shifts for ragged output. fix(wip): may need to skip some variants if they are outside ragged region.
* feat(wip): ragged output
* fix: incorrect mask from get_keep_mask_for_length (#37)
* bump: version 0.8.0 → 0.8.1
* feat!: output_length is set dynamically. fix: hap reconstruction matches bcftools
* feat!: change default for Dataset.deterministic from False to True. change track output from a list of arrays to having a track dimension i.e. from shape (b [p] l) to (b t [p] l). docs: add dataset.md, faq.md and overhaul geuvadis.ipynb to be simpler and reflect changes in API.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
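The track-output change noted above, from a list of per-track arrays of shape `(b [p] l)` to a single array of shape `(b t [p] l)`, is equivalent to stacking along a new track axis. A minimal NumPy illustration with hypothetical shapes:

```python
import numpy as np

# Two hypothetical tracks, each with the old per-track shape (batch, length).
track_a = np.zeros((4, 8))
track_b = np.ones((4, 8))

# New-style output: one array with a track axis, shape (batch, tracks, length).
tracks = np.stack([track_a, track_b], axis=1)
print(tracks.shape)  # (4, 2, 8)
```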
GenVarLoader provides a fast, memory efficient data structure for training sequence models on genetic variation. For example, this can be used to train a DNA language model on human genetic variation (e.g. [Dalla-Torre et al.](https://www.biorxiv.org/content/10.1101/2023.01.11.523679)) or train sequence to function models with genetic variation (e.g. [Celaj et al.](https://www.biorxiv.org/content/10.1101/2023.09.20.558508v1), [Drusinsky et al.](https://www.biorxiv.org/content/10.1101/2024.07.27.605449v1), [He et al.](https://www.biorxiv.org/content/10.1101/2024.10.15.618510v1), and [Rastogi et al.](https://www.biorxiv.org/content/10.1101/2024.09.23.614632v1)).
## Features
- Avoid writing any sequences to disk (can save >2,000x storage vs. writing personalized genomes with bcftools consensus)
- Generate haplotypes up to 1,000 times faster than reading a FASTA file
- Generate tracks up to 450 times faster than reading a BigWig
- **Supports indels** and re-aligns tracks to haplotypes that have them
- Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig
See our [preprint](https://www.biorxiv.org/content/10.1101/2025.01.15.633240) for benchmarking and implementation details.
## Installation
```bash
pip install genvarloader
```
A PyTorch dependency is **not** included since it may require [special instructions](https://pytorch.org/get-started/locally/).
## Quick Start
### Write a [`gvl.Dataset`](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Dataset)
GenVarLoader has both a CLI and a Python API for writing datasets. The Python API provides extra flexibility, for example for multi-task objectives.
### Open a [`gvl.Dataset`](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Dataset) and get a PyTorch DataLoader
Suppose we want to return tracks that are the z-scored, log(CPM + 1) version of the original. Sometimes it is better to write this to disk to avoid having to recompute it during training or inference.
```python
import numpy as np

# We'll assume we already have an array of total counts for each sample
# (`total_counts`, one entry per sample) alongside the raw `tracks` array
# (first axis = samples; adjust the broadcasting below to your track shape).
# Total counts usually can't be derived from a gvl.Dataset since it only has
# data for specific regions.
log_cpm = np.log1p(tracks / total_counts[:, None] * 1e6)  # log(CPM + 1)
z_scored = (log_cpm - log_cpm.mean()) / log_cpm.std()  # z-score
```
- GenVarLoader uses multithreading extensively, so it's best to use `0` or `1` workers with your [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).
- A GenVarLoader [`Dataset`](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Dataset) is most efficient when given batches of indices, rather than one at a time. By default, [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)s use one index at a time, so if you want to use a ***custom*** [`Sampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler) you should wrap it with a [`BatchSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.BatchSampler) before passing it to [`Dataset.to_dataloader()`](https://genvarloader.readthedocs.io/en/latest/api.html#genvarloader.Dataset.to_dataloader).
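The `BatchSampler` wrapping described above can be sketched with standard PyTorch utilities. Here `SequentialSampler` stands in for your hypothetical custom sampler; any `Sampler` works the same way:

```python
from torch.utils.data import BatchSampler, SequentialSampler

# SequentialSampler is a stand-in for a custom Sampler over 10 indices.
sampler = SequentialSampler(range(10))

# Wrapping it yields lists of indices, so the Dataset receives whole batches.
batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)
batches = list(batch_sampler)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```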