| 1 | +# 🎨 NeMo Data Designer |
| 2 | + |
| 3 | +[CI](https://github.com/NVIDIA-NeMo/DataDesigner/actions/workflows/ci.yml) |
| 4 | +[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) |
| 5 | +[Python](https://www.python.org/downloads/) [NeMo Microservices](https://docs.nvidia.com/nemo/microservices/latest/index.html) [Docs](https://nvidia-nemo.github.io/DataDesigner/) |
| 6 | + |
| 7 | +**Generate high-quality synthetic datasets from scratch or using your own seed data.** |
| 8 | + |
| 9 | +--- |
| 10 | + |
| 11 | +## Welcome! |
| 12 | + |
| 13 | +Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data. |
| 14 | + |
| 15 | +## What can you do with Data Designer? |
| 16 | + |
| 17 | +- **Generate diverse data** using statistical samplers, LLMs, or existing seed datasets |
| 18 | +- **Control relationships** between fields with dependency-aware generation |
| 19 | +- **Validate quality** with built-in Python, SQL, and custom local and remote validators |
| 20 | +- **Score outputs** using LLM-as-a-judge for quality assessment |
| 21 | +- **Iterate quickly** with preview mode before full-scale generation |
| 22 | + |
| 23 | +--- |
| 24 | + |
| 25 | +### ⚠️ Security Notice: LiteLLM Supply-Chain Incident (2026-03-24) |
| 26 | + |
| 27 | +On March 24, 2026, malicious versions of `litellm` ([1.82.7 and 1.82.8](https://github.com/BerriAI/litellm/issues/24518)) were published to PyPI containing a credential stealer. The compromised packages were available for [approximately five hours](https://www.okta.com/blog/threat-intelligence/litellm-supply-chain-attack--an-explainer-for-identity-pros/) (10:39 – 16:00 UTC) before being removed. |
| 28 | + |
| 29 | +The only Data Designer releases that could resolve to these versions are **v0.2.2** (Dec 2025) and **v0.2.3** (Jan 2026), which carried a looser `litellm<2` upper bound. Both are nearly three months old, have been superseded by eight subsequent releases, and have been yanked from PyPI as a precaution. All other releases (v0.3.0 – v0.5.3) pinned `litellm` to `>=1.73.6,<1.80.12` and were never compatible with 1.82.x. Starting with v0.5.4, `litellm` is no longer a dependency. |
| 30 | + |
| 31 | +To have been impacted through Data Designer, you would need to have had one of these two old versions explicitly pinned *and* run a fresh `pip install` or dependency-cache update that resolved `litellm` during the five-hour window on March 24. If you believe you may be affected, see [BerriAI's incident report](https://github.com/BerriAI/litellm/issues/24518) for remediation steps. |
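
As a quick local check, the sketch below compares the `litellm` version installed in the current environment against the two versions named in this notice. It uses only the standard library; the helper names are illustrative and not part of Data Designer. It can only report what is installed now, not whether a compromised version was installed at some earlier point.

```python
from importlib import metadata
from typing import Optional

# Versions named in the notice above as containing the credential stealer.
COMPROMISED_LITELLM_VERSIONS = {"1.82.7", "1.82.8"}


def is_compromised(version: Optional[str]) -> bool:
    """Return True if the version string is one of the malicious releases."""
    return version in COMPROMISED_LITELLM_VERSIONS


def installed_litellm_version() -> Optional[str]:
    """Return the installed litellm version, or None if it is not installed."""
    try:
        return metadata.version("litellm")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_litellm_version()
    if version is None:
        print("litellm is not installed in this environment.")
    elif is_compromised(version):
        print(f"litellm {version} is a compromised release; see BerriAI's incident report.")
    else:
        print(f"litellm {version} is not one of the compromised releases.")
```
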
| 32 | + |
| 33 | +--- |
| 34 | + |
| 35 | +## Quick Start |
| 36 | + |
| 37 | +### 1. Install |
| 38 | + |
| 39 | +```bash |
| 40 | +pip install data-designer |
| 41 | +``` |
| 42 | + |
| 43 | +Or install from source: |
| 44 | + |
| 45 | +```bash |
| 46 | +git clone https://github.com/NVIDIA-NeMo/DataDesigner.git |
| 47 | +cd DataDesigner |
| 48 | +make install |
| 49 | +``` |
| 50 | + |
| 51 | +### 2. Set your API key |
| 52 | + |
| 53 | +Start with one of our default model providers: |
| 54 | + |
| 55 | +- [NVIDIA Build API](https://build.nvidia.com) |
| 56 | +- [OpenAI](https://platform.openai.com/api-keys) |
| 57 | +- [OpenRouter](https://openrouter.ai) |
| 58 | + |
| 59 | +Grab your API key(s) using the above links and set one or more of the following environment variables: |
| 60 | +```bash |
| 61 | +export NVIDIA_API_KEY="your-api-key-here" |
| 62 | + |
| 63 | +export OPENAI_API_KEY="your-openai-api-key-here" |
| 64 | + |
| 65 | +export OPENROUTER_API_KEY="your-openrouter-api-key-here" |
| 66 | +``` |
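
To confirm the keys are visible to Python before generating data, a small check like the following may help. It is a sketch using only the standard library; the provider labels and the `configured_providers` helper are hypothetical names for this example, not Data Designer APIs.

```python
from __future__ import annotations

import os
from typing import Mapping

# Environment variable names from the step above; the dict keys are
# labels local to this sketch.
PROVIDER_ENV_VARS = {
    "nvidia": "NVIDIA_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def configured_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the labels of providers whose API key variable is set and non-blank."""
    return [
        label
        for label, var in PROVIDER_ENV_VARS.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    found = configured_providers()
    print("Configured providers:", ", ".join(found) if found else "none")
```
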
| 67 | + |
| 68 | +### 3. Start generating data! |
| 69 | +```python |
| 70 | +import data_designer.config as dd |
| 71 | +from data_designer.interface import DataDesigner |
| 72 | + |
| 73 | +# Initialize with default settings |
| 74 | +data_designer = DataDesigner() |
| 75 | +config_builder = dd.DataDesignerConfigBuilder() |
| 76 | + |
| 77 | +# Add a product category |
| 78 | +config_builder.add_column( |
| 79 | +    dd.SamplerColumnConfig( |
| 80 | +        name="product_category", |
| 81 | +        sampler_type=dd.SamplerType.CATEGORY, |
| 82 | +        params=dd.CategorySamplerParams( |
| 83 | +            values=["Electronics", "Clothing", "Home & Kitchen", "Books"], |
| 84 | +        ), |
| 85 | +    ) |
| 86 | +) |
| 87 | + |
| 88 | +# Generate personalized customer reviews |
| 89 | +config_builder.add_column( |
| 90 | +    dd.LLMTextColumnConfig( |
| 91 | +        name="review", |
| 92 | +        model_alias="nvidia-text", |
| 93 | +        prompt="Write a brief product review for a {{ product_category }} item you recently purchased.", |
| 94 | +    ) |
| 95 | +) |
| 96 | + |
| 97 | +# Preview your dataset |
| 98 | +preview = data_designer.preview(config_builder=config_builder) |
| 99 | +preview.display_sample_record() |
| 100 | +``` |
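
The `{{ product_category }}` placeholder in the prompt above ties the `review` column to the sampled category. The standard-library sketch below illustrates that idea conceptually: it finds placeholder references, orders columns so dependencies come first, and fills the prompt from a sampled record. It is an illustration under simplifying assumptions (plain `{{ name }}` placeholders, a hypothetical dict-based column spec), not Data Designer's actual implementation.

```python
import random
import re
from graphlib import TopologicalSorter

# Matches simple placeholders such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

# Hypothetical, simplified column specs mirroring the quick-start example.
COLUMNS = {
    "review": {
        "prompt": "Write a brief product review for a {{ product_category }} item.",
    },
    "product_category": {
        "values": ["Electronics", "Clothing", "Home & Kitchen", "Books"],
    },
}


def referenced_columns(prompt: str) -> set:
    """Return the column names referenced by placeholders in a prompt."""
    return set(PLACEHOLDER.findall(prompt))


def generation_order(columns: dict) -> list:
    """Order columns so each one comes after the columns its prompt references."""
    graph = {
        name: referenced_columns(spec.get("prompt", ""))
        for name, spec in columns.items()
    }
    return list(TopologicalSorter(graph).static_order())


def render(prompt: str, record: dict) -> str:
    """Substitute already-generated column values into a prompt."""
    return PLACEHOLDER.sub(lambda match: str(record[match.group(1)]), prompt)


if __name__ == "__main__":
    record = {}
    for name in generation_order(COLUMNS):
        spec = COLUMNS[name]
        if "values" in spec:
            record[name] = random.choice(spec["values"])
        else:
            # A real pipeline would send this prompt to an LLM; here we only show it.
            record[name] = render(spec["prompt"], record)
    print(record)
```
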
| 101 | + |
| 102 | +--- |
| 103 | + |
| 104 | +## What's next? |
| 105 | + |
| 106 | +### 📚 Learn more |
| 107 | + |
| 108 | +- **[Getting Started](https://nvidia-nemo.github.io/DataDesigner/latest/)** – Install, configure, and generate your first dataset |
| 109 | +- **[Tutorial Notebooks](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/)** – Step-by-step interactive tutorials |
| 110 | +- **[Column Types](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/columns/)** – Explore samplers, LLM columns, validators, and more |
| 111 | +- **[Validators](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/validators/)** – Learn how to validate generated data with Python, SQL, and remote validators |
| 112 | +- **[Model Configuration](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/model-configs/)** – Configure custom models and providers |
| 113 | +- **[Person Sampling](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/person_sampling/)** – Learn how to sample realistic person data with demographic attributes |
| 114 | + |
| 115 | +### 🔧 Configure models via CLI |
| 116 | + |
| 117 | +```bash |
| 118 | +data-designer config providers # Configure model providers |
| 119 | +data-designer config models # Set up your model configurations |
| 120 | +data-designer config list # View current settings |
| 121 | +``` |
| 122 | + |
| 123 | +### 🤖 Agent Skill |
| 124 | + |
| 125 | +Data Designer has a [skill](https://nvidia-nemo.github.io/DataDesigner/latest/devnotes/data-designer-got-skills/) for coding agents. Just describe the dataset you want, and your agent handles schema design, validation, and generation. While the skill should work with other coding agents that support skills, our development and testing have focused on [Claude Code](https://code.claude.com) at this stage. |
| 126 | + |
| 127 | +**Install via [skills.sh](https://skills.sh)** (be sure to select Claude Code as an additional agent): |
| 128 | + |
| 129 | +```bash |
| 130 | +npx skills add NVIDIA-NeMo/DataDesigner |
| 131 | +``` |
| 132 | + |
| 133 | +After installation, type `/data-designer` or describe the dataset you want and the skill will kick in. |
| 134 | + |
| 135 | +### 🤝 Get involved |
| 136 | + |
| 137 | +This repository supports agent-assisted development — see [CONTRIBUTING.md](CONTRIBUTING.md) for the recommended workflow. |
| 138 | + |
| 139 | +- **[Contributing Guide](CONTRIBUTING.md)** – How to contribute, including agent-assisted workflows |
| 140 | +- **[GitHub Issues](https://github.com/NVIDIA-NeMo/DataDesigner/issues)** – Report bugs or make a feature request |
| 141 | + |
| 142 | +--- |
| 143 | + |
| 144 | +## Telemetry |
| 145 | + |
| 146 | +Data Designer collects telemetry to help us improve the library for developers. We collect: |
| 147 | + |
| 148 | +* The names of models used |
| 149 | +* The count of input tokens |
| 150 | +* The count of output tokens |
| 151 | + |
| 152 | +**No user or device information is collected.** This data is not used to track individual user behavior; it is aggregated to show which models are most popular for synthetic data generation (SDG). We will share this usage data with the community. |
| 153 | + |
| 154 | +Specifically, the model name defined in a `ModelConfig` object is what is collected. In the example config below: |
| 155 | + |
| 156 | +```python |
| 157 | +ModelConfig( |
| 158 | +    alias="nv-reasoning", |
| 159 | +    model="nvidia/nemotron-3-super-120b-a12b", |
| 160 | +    provider="nvidia", |
| 161 | +    inference_parameters=ChatCompletionInferenceParams( |
| 162 | +        temperature=1.0, |
| 163 | +        top_p=0.95, |
| 164 | +        max_tokens=4096, |
| 165 | +    ), |
| 166 | +) |
| 167 | +``` |
| 168 | + |
| 169 | +The value `nvidia/nemotron-3-super-120b-a12b` would be collected. |
| 170 | + |
| 171 | +To disable telemetry capture, set `NEMO_TELEMETRY_ENABLED=false`. |
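
If you prefer to set this from Python rather than the shell, a minimal sketch follows. It assumes the variable needs to be in place before Data Designer is imported; this README does not say when the library reads it, so setting it first is the conservative choice.

```python
import os

# Assumption: set the variable before importing Data Designer so it is
# already in the environment whenever the library checks it.
os.environ["NEMO_TELEMETRY_ENABLED"] = "false"

# Import Data Designer only after the variable is set, for example:
# from data_designer.interface import DataDesigner
```
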
| 172 | + |
| 173 | +### Top Models |
| 174 | + |
| 175 | +This chart represents the breakdown of models used for Data Designer across all synthetic data generation jobs from 2/23/2026 to 3/23/2026. |
| 176 | + |
| 177 | + |
| 178 | + |
| 179 | +_Last updated on 3/23/2026_ |
| 180 | + |
| 181 | +--- |
| 182 | + |
| 183 | +## License |
| 184 | + |
| 185 | +Apache License 2.0 – see [LICENSE](LICENSE) for details. |
| 186 | + |
| 187 | +--- |
| 188 | + |
| 189 | +## Citation |
| 190 | + |
| 191 | +If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry: |
| 192 | + |
| 193 | +```bibtex |
| 194 | +@misc{nemo-data-designer, |
| 195 | +  author = {The NeMo Data Designer Team, NVIDIA}, |
| 196 | +  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data}, |
| 197 | +  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}}, |
| 198 | +  year = {2025}, |
| 199 | +  note = {GitHub Repository}, |
| 200 | +} |
| 201 | +``` |