| 1 | +# Dataset Discovery |
| 2 | + |
| 3 | +Automatically discover `.fast_llm_dataset` files and generate a blended config with token-proportional weights. |
| 4 | + |
| 5 | +## Quick Start |
| 6 | + |
| 7 | +Using the tools wrapper: |
| 8 | +```bash |
| 9 | +python tools/discover_datasets.py <directory> -o <output.yaml> |
| 10 | +``` |
| 11 | + |
| 12 | +Using the Fast-LLM CLI with a config file: |
| 13 | +```yaml |
| 14 | +type: prepare_dataset_discovery |
| 15 | +directory: /path/to/datasets |
| 16 | +output: blended_dataset.yaml |
| 17 | +ignore_paths: [test_data, checkpoints] # Optional |
| 18 | +``` |
| 19 | + |
| 20 | +```bash |
| 21 | +python -m fast_llm.cli --config config.yaml |
| 22 | +``` |
| 23 | + |
| 24 | +## What It Does |
| 25 | + |
| 26 | +1. Scans directory tree for `.fast_llm_dataset` files |
| 27 | +2. Reads token counts from binary file headers |
| 28 | +3. Generates hierarchical blended config with automatic weights |
| 29 | +4. Preserves directory structure |
| 30 | + |
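
The scan in step 1 can be sketched with the standard library alone. This is an illustrative sketch, not the tool's actual implementation: `find_dataset_files` is a hypothetical name, and resolving relative `ignore_paths` entries against the scan root is an assumption. Reading token counts from the file header (step 2) is left out because the header layout is not described here.

```python
from pathlib import Path


def find_dataset_files(root, ignore_paths=()):
    """Return all .fast_llm_dataset files under root, sorted, skipping ignored paths.

    Hypothetical helper for illustration only.
    """
    root = Path(root).resolve()
    # Assumption: relative ignore paths are relative to the scan root.
    # Joining an absolute path onto root yields the absolute path unchanged.
    ignored = [(root / p).resolve() for p in ignore_paths]

    found = []
    for path in sorted(root.rglob("*.fast_llm_dataset")):
        # Skip a file if it is an ignored path or lives below one.
        if any(path == ig or ig in path.parents for ig in ignored):
            continue
        found.append(path)
    return found
```

Calling `find_dataset_files("/path/to/datasets", ignore_paths=["test_data", "checkpoints"])` would mirror the config shown in the Quick Start.
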
| 31 | +## Example |
| 32 | + |
| 33 | +Input directory structure: |
| 34 | +``` |
| 35 | +datasets/ |
| 36 | +├── domain_a/ |
| 37 | +│   ├── shard_0.fast_llm_dataset (1B tokens) |
| 38 | +│   └── shard_1.fast_llm_dataset (1B tokens) |
| 39 | +└── domain_b/ |
| 40 | +    └── shard_0.fast_llm_dataset (4B tokens) |
| 41 | +``` |
| 42 | + |
| 43 | +Generated config (`blended.yaml`): |
| 44 | +```yaml |
| 45 | +type: blended |
| 46 | +name: datasets |
| 47 | +datasets: |
| 48 | +  - type: blended |
| 49 | +    name: domain_a |
| 50 | +    datasets: |
| 51 | +      - type: memmap |
| 52 | +        path: datasets/domain_a/shard_0.fast_llm_dataset |
| 53 | +      - type: memmap |
| 54 | +        path: datasets/domain_a/shard_1.fast_llm_dataset |
| 55 | +    weights: [1.0, 1.0] |
| 56 | +  - type: memmap |
| 57 | +    path: datasets/domain_b/shard_0.fast_llm_dataset |
| 58 | +weights: [2.0, 4.0] # In billions |
| 59 | +``` |
| 60 | + |
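
To see why these weights amount to token-proportional sampling, the nested weights can be multiplied out. The sketch below assumes each blended group normalizes its weights into sampling probabilities. `effective_probabilities` is a hypothetical helper written for this illustration, not part of Fast-LLM, and the paths are shortened for readability.

```python
def effective_probabilities(config, scale=1.0, out=None):
    """Flatten a nested blended config into per-file sampling probabilities."""
    out = {} if out is None else out
    if config["type"] != "blended":
        # Leaf dataset: record the probability accumulated along the way down.
        out[config["path"]] = scale
        return out
    total = sum(config["weights"])
    for child, weight in zip(config["datasets"], config["weights"]):
        # Assumption: weights are normalized within each blended group.
        effective_probabilities(child, scale * weight / total, out)
    return out


config = {
    "type": "blended",
    "datasets": [
        {
            "type": "blended",
            "datasets": [
                {"type": "memmap", "path": "domain_a/shard_0"},
                {"type": "memmap", "path": "domain_a/shard_1"},
            ],
            "weights": [1.0, 1.0],
        },
        {"type": "memmap", "path": "domain_b/shard_0"},
    ],
    "weights": [2.0, 4.0],
}

probs = effective_probabilities(config)
for path, p in probs.items():
    print(f"{path}: {p:.3f}")
```

Under that assumption each `domain_a` shard is drawn with probability 2/6 × 1/2 = 1/6 and the `domain_b` shard with probability 4/6, which matches their 1B, 1B, and 4B shares of the 6B total tokens.
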
| 61 | +Use in training: |
| 62 | +```yaml |
| 63 | +data: |
| 64 | +  datasets: |
| 65 | +    training: |
| 66 | +      type: file |
| 67 | +      path: blended.yaml |
| 68 | +``` |
| 69 | + |
| 70 | +## Options |
| 71 | + |
| 72 | +- **directory**: Root directory to scan (required) |
| 73 | +- **output**: Output YAML file path (required) |
| 74 | +- **ignore_paths**: Paths to exclude, relative or absolute (optional) |
| 75 | + |
| 76 | +## Key Features |
| 77 | + |
| 78 | +- **Token-proportional sampling**: Datasets are sampled in proportion to their token counts, so larger datasets are sampled more often |
| 79 | +- **Hierarchical grouping**: Directory structure preserved in config |
| 80 | +- **Automatic weights**: Calculated from binary file metadata |
| 81 | +- **Error handling**: Skips unreadable files with warnings |
| 82 | + |
| 83 | +## Notes |
| 84 | + |
| 85 | +- A single dataset is returned directly (not wrapped in a blended config) |
| 86 | +- Files with 0 tokens are skipped with a warning |
| 87 | +- Empty directories raise an error |
| 88 | +- Datasets are sorted alphabetically |
| 89 | + |
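
The rules in the notes above can be combined into a small recursive builder. This is a sketch under stated assumptions, not the tool's code: `build_config` is a hypothetical function, and its input is a plain nested dict mapping names to either a token count (a file) or another dict (a subdirectory), standing in for the real directory scan and header reads. It treats any directory without usable datasets as an error, which may be stricter than the tool, and expresses weights in billions of tokens as in the example.

```python
import warnings


def build_config(name, tree, prefix=""):
    """Return (config, total_tokens) for a file or directory node.

    Hypothetical helper: `tree` is an int token count for a file,
    or a dict of child nodes for a directory.
    """
    path = f"{prefix}/{name}" if prefix else name
    if not isinstance(tree, dict):
        # Leaf: a dataset file with a known token count.
        return {"type": "memmap", "path": path}, tree

    configs, tokens = [], []
    for child in sorted(tree):  # datasets sorted alphabetically
        value = tree[child]
        if not isinstance(value, dict) and value == 0:
            warnings.warn(f"Skipping {path}/{child}: 0 tokens")
            continue
        config, count = build_config(child, value, path)
        configs.append(config)
        tokens.append(count)

    if not configs:
        raise ValueError(f"No datasets found under {path}")
    if len(configs) == 1:
        # A single dataset is returned directly, not wrapped.
        return configs[0], tokens[0]
    blended = {
        "type": "blended",
        "name": name,
        "datasets": configs,
        "weights": [t / 1e9 for t in tokens],  # billions of tokens
    }
    return blended, sum(tokens)


tree = {
    "domain_b": {"shard_0.fast_llm_dataset": 4_000_000_000},
    "domain_a": {
        "shard_1.fast_llm_dataset": 1_000_000_000,
        "shard_0.fast_llm_dataset": 1_000_000_000,
    },
}
config, total = build_config("datasets", tree)
```

For this input the sketch reproduces the shape of the generated config in the example: `domain_a` becomes a nested blended group with weights `[1.0, 1.0]`, `domain_b` collapses to its single memmap dataset, and the top-level weights are `[2.0, 4.0]`.
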
| 90 | +## Testing |
| 91 | + |
| 92 | +```bash |
| 93 | +pytest tests/data/test_dataset_discovery.py |
| 94 | +``` |