Get up and running with Synthetic Data Solution in minutes.
- Python 3.11 or higher
- An API key for OpenAI or Anthropic (or both)
- uv package manager (recommended) or pip
# Option 1: install with uv (recommended)
git clone https://github.com/your-org/synthetic-data-solution.git
cd synthetic-data-solution
uv sync
# Option 2: install with pip
git clone https://github.com/your-org/synthetic-data-solution.git
cd synthetic-data-solution
pip install -e ".[dev]"
Create a .env file in the project root:
# At least one API key is required
ANTHROPIC_API_KEY=sk-ant-your-key-here
OPENAI_API_KEY=sk-your-key-here
# Optional: Set your preferred provider
DEFAULT_LLM_PROVIDER=anthropic
# Generate healthcare patient data
synth generate -c "Healthcare clinic with 100 patients, need patient demographics and appointment records"
This will:
- Analyze your context to understand requirements
- Infer appropriate schemas
- Generate a small sample for review
- Ask for your approval
- Generate the full dataset
For more control over the generation process:
synth generate -c "Legal firm needs case management data with clients, cases, and documents" --interactive
Interactive mode lets you:
- Review and modify inferred schemas
- Approve or reject sample data
- Adjust field types and relationships
- Provide feedback before full generation
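For scripted use, the same command line can be assembled from Python. The sketch below only builds the argument list from flags that appear in this guide (`-c`, `--interactive`, `--format`, `--size`, `--output`, `--auto-approve`); the helper name `build_generate_command` is hypothetical and not part of the tool.

```python
import shlex

# Output formats listed in this guide.
SUPPORTED_FORMATS = {"csv", "json", "xlsx", "sql"}


def build_generate_command(context, *, interactive=False, fmt=None,
                           size=None, output=None, auto_approve=False):
    """Assemble the argument list for a `synth generate` call.

    Hypothetical helper: it only uses flags shown in this guide and
    does not run anything itself.
    """
    if not context.strip():
        raise ValueError("context must not be empty")
    cmd = ["synth", "generate", "-c", context]
    if interactive:
        cmd.append("--interactive")
    if fmt is not None:
        if fmt not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format: {fmt!r}")
        cmd += ["--format", fmt]
    if size is not None:
        cmd += ["--size", str(size)]
    if output is not None:
        cmd += ["--output", str(output)]
    if auto_approve:
        cmd.append("--auto-approve")
    return cmd


if __name__ == "__main__":
    cmd = build_generate_command(
        "Legal firm needs case management data", interactive=True
    )
    # Print a shell-quoted form for inspection.
    print(shlex.join(cmd))
    # To execute it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with long context strings.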
Start the server:
synth serve --reload
Then make API calls:
# Step 1: Analyze context
curl -X POST "http://localhost:8000/api/v1/context/analyze" \
-H "Content-Type: application/json" \
-d '{"context": "Financial services firm with customer accounts and transactions"}'
# Step 2: Generate samples
curl -X POST "http://localhost:8000/api/v1/generate/sample" \
-H "Content-Type: application/json" \
-d '{"context": "Financial services firm with customer accounts and transactions", "sample_size": 5}'
# Step 3: Generate full corpus
curl -X POST "http://localhost:8000/api/v1/generate/corpus" \
-H "Content-Type: application/json" \
-d '{"context": "Financial services firm with customer accounts and transactions", "corpus_size": 500}'
synth generate -c "Management consulting firm tracking client engagements. Need clients, projects with phases, consultant assignments, timesheets, and deliverables. Projects range from 3-12 months."
synth generate -c "Medical practice with 500 patients. Need patient demographics, medical history, appointments, prescriptions, and billing records. HIPAA-compliant format required."
synth generate -c "Law firm specializing in corporate litigation. Need case files, client information, legal documents, court filings, and billing records. Track attorney assignments and case outcomes."
synth generate -c "Investment advisory firm managing client portfolios. Need client accounts, holdings, transactions, market data, and performance reports. Include multiple asset classes."
Specify your preferred format:
# CSV (default) - one file per schema
synth generate -c "..." --format csv
# JSON - nested structures with relationships
synth generate -c "..." --format json
# Excel - all schemas in one workbook
synth generate -c "..." --format xlsx
# SQL - DDL and INSERT statements
synth generate -c "..." --format sql
# Small sample for testing (default: 10)
synth generate -c "..." --sample-size 5
# Large corpus for development
synth generate -c "..." --size 10000
# Batch processing for very large datasets
synth generate -c "..." --size 100000 --batch-size 500
# Default: ./output directory
synth generate -c "..."
# Custom output directory
synth generate -c "..." --output ./data/healthcare/
For automated pipelines:
# Skip sample review and generate directly
synth generate -c "..." --auto-approve
# Or skip sample generation entirely
synth generate -c "..." --skip-sample
Preview inferred schemas without generating data:
synth validate -c "Healthcare clinic needs patient records"
Output (table format):
Schema: patients
┌─────────────┬──────────┬──────────┬─────────────────────┐
│ Field       │ Type     │ Required │ Constraints         │
├─────────────┼──────────┼──────────┼─────────────────────┤
│ id          │ UUID     │ Yes      │ primary_key         │
│ first_name  │ STRING   │ Yes      │                     │
│ last_name   │ STRING   │ Yes      │                     │
│ dob         │ DATE     │ Yes      │                     │
│ email       │ EMAIL    │ No       │                     │
│ phone       │ PHONE    │ No       │                     │
└─────────────┴──────────┴──────────┴─────────────────────┘
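The table above can be mirrored in a small Python structure, for example to check generated rows against the required fields. This is a minimal sketch: `FieldSpec`, `PATIENTS`, and `missing_required` are hypothetical names for illustration and do not describe the tool's internal schema model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One row of the schema table (hypothetical representation)."""
    name: str
    type: str
    required: bool
    constraints: tuple = ()


# The `patients` schema exactly as shown in the table output.
PATIENTS = [
    FieldSpec("id", "UUID", True, ("primary_key",)),
    FieldSpec("first_name", "STRING", True),
    FieldSpec("last_name", "STRING", True),
    FieldSpec("dob", "DATE", True),
    FieldSpec("email", "EMAIL", False),
    FieldSpec("phone", "PHONE", False),
]


def missing_required(row, fields):
    """Return names of required fields that are absent or empty in `row`."""
    return [
        f.name
        for f in fields
        if f.required and row.get(f.name) in (None, "")
    ]


if __name__ == "__main__":
    row = {"id": "1", "first_name": "Ada", "last_name": "Lovelace"}
    print(missing_required(row, PATIENTS))
```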
Or get JSON output:
synth validate -c "..." --format json
- API Reference - Complete API documentation
- Deployment Guide - Deploy to production
- Configuration - All configuration options
Make sure you have a .env file with at least one valid API key:
ANTHROPIC_API_KEY=sk-ant-...
# or
OPENAI_API_KEY=sk-...
The API has built-in rate limiting (60 requests/minute). For high-volume generation:
- Use larger batch sizes
- Use corpus generation instead of multiple sample requests
- Consider running your own deployment
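A client can also pace its own requests to stay under the limit. The sketch below spaces calls evenly (one per second for a limit of 60 per minute); the `Throttle` class is a hypothetical helper, and whether even spacing matches how the server counts requests is an assumption, so a lower `max_per_minute` leaves some headroom.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_minute` start per minute.

    Hypothetical client-side helper; the clock and sleep functions are
    injectable so the pacing can be tested without real waiting.
    """

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        if max_per_minute <= 0:
            raise ValueError("max_per_minute must be positive")
        self._interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request is allowed to start."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self._interval


if __name__ == "__main__":
    throttle = Throttle(max_per_minute=60)
    for i in range(3):
        throttle.wait()
        print("request", i)  # send the API request here
```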
Review the validation errors and adjust your context to be more specific:
# Too vague
synth generate -c "company data"
# Better - specific requirements
synth generate -c "Software company with employee records including name, department, hire date, and salary range"
For large datasets, use background corpus generation:
# CLI: Split into batches
synth generate -c "..." --size 50000 --batch-size 1000
# API: Use async endpoint and poll status
curl -X POST ".../generate/corpus" -d '{"corpus_size": 50000}'
# Returns job_id, then poll /jobs/{job_id}/status
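The polling step can be sketched as a small loop. Everything here is an assumption for illustration: `wait_for_job` is a hypothetical helper, `fetch_status` stands in for whatever call reads the job status endpoint, and the status strings `"completed"` and `"failed"` are placeholders since this guide does not list the actual values.

```python
import time


def wait_for_job(fetch_status, job_id, *, interval=2.0, timeout=600.0,
                 done_states=("completed",), failed_states=("failed",),
                 clock=time.monotonic, sleep=time.sleep):
    """Poll `fetch_status(job_id)` until the job finishes or time runs out.

    `fetch_status` is supplied by the caller and returns a status string.
    The default state names are placeholders, not documented values.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status(job_id)
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"job {job_id} ended with status {status!r}")
        if clock() + interval > deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
        sleep(interval)


if __name__ == "__main__":
    # Stand-in status source for demonstration only.
    statuses = iter(["pending", "running", "completed"])
    result = wait_for_job(lambda _id: next(statuses), "demo-job", interval=0.01)
    print(result)
```

Injecting `fetch_status` keeps the loop independent of any HTTP library and of the exact endpoint path.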