
Quickstart Guide

Get up and running with Synthetic Data Solution in minutes.

Prerequisites

  • Python 3.11 or higher
  • An API key for OpenAI or Anthropic (or both)
  • uv package manager (recommended) or pip

Installation

Using uv (Recommended)

git clone https://github.com/your-org/synthetic-data-solution.git
cd synthetic-data-solution
uv sync

Using pip

git clone https://github.com/your-org/synthetic-data-solution.git
cd synthetic-data-solution
pip install -e ".[dev]"

Configuration

Create a .env file in the project root:

# At least one API key is required
ANTHROPIC_API_KEY=sk-ant-your-key-here
OPENAI_API_KEY=sk-your-key-here

# Optional: Set your preferred provider
DEFAULT_LLM_PROVIDER=anthropic
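If you want to see what loading this file involves, a minimal loader for the simple KEY=VALUE format above might look like the following. This is an illustrative sketch, not part of the tool (in practice a library such as python-dotenv usually handles this):

```python
import os

def load_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """
# At least one API key is required
ANTHROPIC_API_KEY=sk-ant-your-key-here
DEFAULT_LLM_PROVIDER=anthropic
"""
config = load_env(sample)
os.environ.update(config)  # make the keys visible to the process
```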

Your First Generation

Option 1: CLI (Simplest)

# Generate healthcare patient data
synth generate -c "Healthcare clinic with 100 patients, need patient demographics and appointment records"

This will:

  1. Analyze your context to understand requirements
  2. Infer appropriate schemas
  3. Generate a small sample for review
  4. Ask for your approval
  5. Generate the full dataset
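The five steps above can be sketched as a simple pipeline. Every function here is a hypothetical stand-in for the tool's internals, shown only to make the control flow concrete:

```python
from typing import Callable

def generation_pipeline(
    context: str,
    analyze: Callable[[str], dict],
    infer_schemas: Callable[[dict], list],
    generate: Callable[[list, int], list],
    approve: Callable[[list], bool],
    sample_size: int = 10,
    corpus_size: int = 100,
) -> list:
    """Analyze -> infer -> sample -> approve -> full generation."""
    requirements = analyze(context)          # 1. understand requirements
    schemas = infer_schemas(requirements)    # 2. infer schemas
    sample = generate(schemas, sample_size)  # 3. small sample for review
    if not approve(sample):                  # 4. ask for approval
        return []                            #    rejected: stop here
    return generate(schemas, corpus_size)    # 5. full dataset

# Stub implementations, just to illustrate the flow:
rows = generation_pipeline(
    "Healthcare clinic with 100 patients",
    analyze=lambda c: {"domain": "healthcare"},
    infer_schemas=lambda r: ["patients", "appointments"],
    generate=lambda schemas, n: [{"row": i} for i in range(n)],
    approve=lambda sample: True,
    corpus_size=100,
)
```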

Option 2: Interactive Mode

For more control over the generation process:

synth generate -c "Legal firm needs case management data with clients, cases, and documents" --interactive

Interactive mode lets you:

  • Review and modify inferred schemas
  • Approve or reject sample data
  • Adjust field types and relationships
  • Provide feedback before full generation

Option 3: API

Start the server:

synth serve --reload

Then make API calls:

# Step 1: Analyze context
curl -X POST "http://localhost:8000/api/v1/context/analyze" \
  -H "Content-Type: application/json" \
  -d '{"context": "Financial services firm with customer accounts and transactions"}'

# Step 2: Generate samples
curl -X POST "http://localhost:8000/api/v1/generate/sample" \
  -H "Content-Type: application/json" \
  -d '{"context": "Financial services firm with customer accounts and transactions", "sample_size": 5}'

# Step 3: Generate full corpus
curl -X POST "http://localhost:8000/api/v1/generate/corpus" \
  -H "Content-Type: application/json" \
  -d '{"context": "Financial services firm with customer accounts and transactions", "corpus_size": 500}'
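The same three calls can be driven from Python. In this sketch the `post` function is injected so the example runs without a live server; the endpoint paths and payload fields are copied from the curl examples above, and everything else is illustrative:

```python
from typing import Callable

BASE = "http://localhost:8000/api/v1"
CONTEXT = "Financial services firm with customer accounts and transactions"

def run_workflow(post: Callable[[str, dict], dict]) -> dict:
    """Step through analyze -> sample -> corpus with one context."""
    analysis = post(f"{BASE}/context/analyze", {"context": CONTEXT})
    sample = post(f"{BASE}/generate/sample",
                  {"context": CONTEXT, "sample_size": 5})
    corpus = post(f"{BASE}/generate/corpus",
                  {"context": CONTEXT, "corpus_size": 500})
    return {"analysis": analysis, "sample": sample, "corpus": corpus}

# Against a running server you might pass:
#   post = lambda url, body: requests.post(url, json=body).json()
# Here a fake transport echoes the payload so the flow is visible:
result = run_workflow(lambda url, body: {"url": url, "echo": body})
```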

Example Use Cases

Consulting Project Data

synth generate -c "Management consulting firm tracking client engagements. Need clients, projects with phases, consultant assignments, timesheets, and deliverables. Projects range from 3 to 12 months."

Healthcare Records

synth generate -c "Medical practice with 500 patients. Need patient demographics, medical history, appointments, prescriptions, and billing records. HIPAA-compliant format required."

Legal Case Files

synth generate -c "Law firm specializing in corporate litigation. Need case files, client information, legal documents, court filings, and billing records. Track attorney assignments and case outcomes."

Financial Portfolio Data

synth generate -c "Investment advisory firm managing client portfolios. Need client accounts, holdings, transactions, market data, and performance reports. Include multiple asset classes."

Output Formats

Specify your preferred format:

# CSV (default) - one file per schema
synth generate -c "..." --format csv

# JSON - nested structures with relationships
synth generate -c "..." --format json

# Excel - all schemas in one workbook
synth generate -c "..." --format xlsx

# SQL - DDL and INSERT statements
synth generate -c "..." --format sql

Controlling Output Size

# Small sample for testing (default: 10)
synth generate -c "..." --sample-size 5

# Large corpus for development
synth generate -c "..." --size 10000

# Batch processing for very large datasets
synth generate -c "..." --size 100000 --batch-size 500
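With `--size 100000 --batch-size 500`, generation presumably proceeds in ceil(size / batch_size) batches. A quick sketch of how the batches partition the total (illustrative arithmetic, not the tool's actual code):

```python
import math

def plan_batches(total: int, batch_size: int) -> list[int]:
    """Split `total` rows into batches of at most `batch_size` each."""
    n_batches = math.ceil(total / batch_size)
    return [min(batch_size, total - i * batch_size) for i in range(n_batches)]

batches = plan_batches(100_000, 500)  # 200 equal batches of 500
uneven = plan_batches(1_050, 500)     # last batch carries the remainder
```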

Output Location

# Default: ./output directory
synth generate -c "..."

# Custom output directory
synth generate -c "..." --output ./data/healthcare/

Skipping the Review Process

For automated pipelines:

# Skip sample review and generate directly
synth generate -c "..." --auto-approve

# Or skip sample generation entirely
synth generate -c "..." --skip-sample

Validating Without Generating

Preview inferred schemas without generating data:

synth validate -c "Healthcare clinic needs patient records"

Output (table format):

Schema: patients
┌─────────────┬──────────┬──────────┬─────────────────────┐
│ Field       │ Type     │ Required │ Constraints         │
├─────────────┼──────────┼──────────┼─────────────────────┤
│ id          │ UUID     │ Yes      │ primary_key         │
│ first_name  │ STRING   │ Yes      │                     │
│ last_name   │ STRING   │ Yes      │                     │
│ dob         │ DATE     │ Yes      │                     │
│ email       │ EMAIL    │ No       │                     │
│ phone       │ PHONE    │ No       │                     │
└─────────────┴──────────┴──────────┴─────────────────────┘

Or get JSON output:

synth validate -c "..." --format json

Troubleshooting

"No API key configured"

Make sure you have a .env file with at least one valid API key:

ANTHROPIC_API_KEY=sk-ant-...
# or
OPENAI_API_KEY=sk-...

"Rate limit exceeded"

The API has built-in rate limiting (60 requests/minute). For high-volume generation:

  • Use larger batch sizes
  • Use corpus generation instead of multiple sample requests
  • Consider running your own deployment
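To stay under the 60 requests/minute limit from the client side, spacing calls at least one interval apart is often enough. This is a generic throttle sketch, not a feature of this tool; the clock and sleep functions are injectable so the example runs instantly:

```python
import time

class Throttle:
    """Allow at most `rate` calls per `per` seconds by spacing them evenly."""

    def __init__(self, rate: int = 60, per: float = 60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.interval = per / rate  # minimum gap between calls, in seconds
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self) -> None:
        """Block (via sleep) until the next call is permitted."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval

# Usage with a fake clock so the example does not actually sleep:
slept = []
t = Throttle(rate=60, per=60.0, clock=lambda: 0.0, sleep=slept.append)
t.wait()  # first call passes immediately
t.wait()  # second call waits one full interval (1 second at 60/min)
```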

"Schema validation failed"

Review the validation errors and adjust your context to be more specific:

# Too vague
synth generate -c "company data"

# Better - specific requirements
synth generate -c "Software company with employee records including name, department, hire date, and salary range"

"Generation timeout"

For large datasets, use background corpus generation:

# CLI: Split into batches
synth generate -c "..." --size 50000 --batch-size 1000

# API: Use async endpoint and poll status
curl -X POST ".../generate/corpus" -d '{"corpus_size": 50000}'
# Returns job_id, then poll /jobs/{job_id}/status
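A polling loop for the async corpus endpoint might look like the following. The `get_status` function is injected so the sketch runs standalone; the `/jobs/{job_id}/status` path comes from the example above, while the `status` field name and its values are assumptions for illustration:

```python
from typing import Callable

def poll_job(job_id: str, get_status: Callable[[str], dict],
             max_polls: int = 60) -> dict:
    """Poll the job status endpoint until it reports a terminal state."""
    for _ in range(max_polls):
        status = get_status(f"/jobs/{job_id}/status")
        if status.get("status") in ("completed", "failed"):
            return status
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

# Fake backend: pending twice, then completed.
responses = iter([
    {"status": "pending"},
    {"status": "pending"},
    {"status": "completed", "rows": 50000},
])
result = poll_job("job-123", lambda url: next(responses))
```

In real use you would add a delay (for example `time.sleep(2)`) between polls to avoid hammering the rate-limited API.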