
Quickstart Guide

Get up and running with Synthetic Data Solution in minutes.

Prerequisites

  • Python 3.11 or higher
  • An API key for OpenAI or Anthropic (or both)
  • uv package manager (recommended) or pip

Installation

Using uv (Recommended)

git clone https://github.com/your-org/synthetic-data-solution.git
cd synthetic-data-solution
uv sync

Using pip

git clone https://github.com/your-org/synthetic-data-solution.git
cd synthetic-data-solution
pip install -e ".[dev]"

Configuration

Create a .env file in the project root:

# At least one API key is required
ANTHROPIC_API_KEY=sk-ant-your-key-here
OPENAI_API_KEY=sk-your-key-here

# Optional: Set your preferred provider
DEFAULT_LLM_PROVIDER=anthropic
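If you want to see what loading this file involves, a minimal loader for the simple KEY=VALUE format above might look like the following. This is an illustrative sketch, not part of the tool (in practice a library such as python-dotenv usually handles this):

```python
import os

def load_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """
# At least one API key is required
ANTHROPIC_API_KEY=sk-ant-your-key-here
DEFAULT_LLM_PROVIDER=anthropic
"""
config = load_env(sample)
os.environ.update(config)  # make the keys visible to the process
```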

Your First Generation

Option 1: CLI (Simplest)

# Generate healthcare patient data
synth generate -c "Healthcare clinic with 100 patients, need patient demographics and appointment records"

This will:

  1. Analyze your context to understand requirements
  2. Infer appropriate schemas
  3. Generate a small sample for review
  4. Ask for your approval
  5. Generate the full dataset
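The five steps above can be sketched as a simple pipeline. Every function here is a hypothetical stand-in for the tool's internals, shown only to make the control flow concrete:

```python
from typing import Callable

def generation_pipeline(
    context: str,
    analyze: Callable[[str], dict],
    infer_schemas: Callable[[dict], list],
    generate: Callable[[list, int], list],
    approve: Callable[[list], bool],
    sample_size: int = 10,
    corpus_size: int = 100,
) -> list:
    """Analyze -> infer -> sample -> approve -> full generation."""
    requirements = analyze(context)          # 1. understand requirements
    schemas = infer_schemas(requirements)    # 2. infer schemas
    sample = generate(schemas, sample_size)  # 3. small sample for review
    if not approve(sample):                  # 4. ask for approval
        return []                            #    rejected: stop here
    return generate(schemas, corpus_size)    # 5. full dataset

# Stub implementations, just to illustrate the flow:
rows = generation_pipeline(
    "Healthcare clinic with 100 patients",
    analyze=lambda c: {"domain": "healthcare"},
    infer_schemas=lambda r: ["patients", "appointments"],
    generate=lambda schemas, n: [{"row": i} for i in range(n)],
    approve=lambda sample: True,
    corpus_size=100,
)
```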

Option 2: Interactive Mode

For more control over the generation process:

synth generate -c "Legal firm needs case management data with clients, cases, and documents" --interactive

Interactive mode lets you:

  • Review and modify inferred schemas
  • Approve or reject sample data
  • Adjust field types and relationships
  • Provide feedback before full generation

Option 3: API

Start the server:

synth serve --reload

Then make API calls:

# Step 1: Analyze context
curl -X POST "http://localhost:8000/api/v1/context/analyze" \
  -H "Content-Type: application/json" \
  -d '{"context": "Financial services firm with customer accounts and transactions"}'

# Step 2: Generate samples
curl -X POST "http://localhost:8000/api/v1/generate/sample" \
  -H "Content-Type: application/json" \
  -d '{"context": "Financial services firm with customer accounts and transactions", "sample_size": 5}'

# Step 3: Generate full corpus
curl -X POST "http://localhost:8000/api/v1/generate/corpus" \
  -H "Content-Type: application/json" \
  -d '{"context": "Financial services firm with customer accounts and transactions", "corpus_size": 500}'
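The same three calls can be driven from Python. In this sketch the `post` function is injected so the example runs without a live server; the endpoint paths and payload fields are copied from the curl examples above, and everything else is illustrative:

```python
from typing import Callable

BASE = "http://localhost:8000/api/v1"
CONTEXT = "Financial services firm with customer accounts and transactions"

def run_workflow(post: Callable[[str, dict], dict]) -> dict:
    """Step through analyze -> sample -> corpus with one context."""
    analysis = post(f"{BASE}/context/analyze", {"context": CONTEXT})
    sample = post(f"{BASE}/generate/sample",
                  {"context": CONTEXT, "sample_size": 5})
    corpus = post(f"{BASE}/generate/corpus",
                  {"context": CONTEXT, "corpus_size": 500})
    return {"analysis": analysis, "sample": sample, "corpus": corpus}

# Against a running server you might pass:
#   post = lambda url, body: requests.post(url, json=body).json()
# Here a fake transport echoes the payload so the flow is visible:
result = run_workflow(lambda url, body: {"url": url, "echo": body})
```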

Example Use Cases

Consulting Project Data

synth generate -c "Management consulting firm tracking client engagements. Need clients, projects with phases, consultant assignments, timesheets, and deliverables. Projects range from 3 to 12 months."

Healthcare Records

synth generate -c "Medical practice with 500 patients. Need patient demographics, medical history, appointments, prescriptions, and billing records. HIPAA-compliant format required."

Legal Case Files

synth generate -c "Law firm specializing in corporate litigation. Need case files, client information, legal documents, court filings, and billing records. Track attorney assignments and case outcomes."

Financial Portfolio Data

synth generate -c "Investment advisory firm managing client portfolios. Need client accounts, holdings, transactions, market data, and performance reports. Include multiple asset classes."

Output Formats

Specify your preferred format:

# CSV (default) - one file per schema
synth generate -c "..." --format csv

# JSON - nested structures with relationships
synth generate -c "..." --format json

# Excel - all schemas in one workbook
synth generate -c "..." --format xlsx

# SQL - DDL and INSERT statements
synth generate -c "..." --format sql

Controlling Output Size

# Small sample for testing (default: 10)
synth generate -c "..." --sample-size 5

# Large corpus for development
synth generate -c "..." --size 10000

# Batch processing for very large datasets
synth generate -c "..." --size 100000 --batch-size 500
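With `--size 100000 --batch-size 500`, generation presumably proceeds in ceil(size / batch_size) batches. A quick sketch of how the batches partition the total (illustrative arithmetic, not the tool's actual code):

```python
import math

def plan_batches(total: int, batch_size: int) -> list[int]:
    """Split `total` rows into batches of at most `batch_size` each."""
    n_batches = math.ceil(total / batch_size)
    return [min(batch_size, total - i * batch_size) for i in range(n_batches)]

batches = plan_batches(100_000, 500)  # 200 equal batches of 500
uneven = plan_batches(1_050, 500)     # last batch carries the remainder
```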

Output Location

# Default: ./output directory
synth generate -c "..."

# Custom output directory
synth generate -c "..." --output ./data/healthcare/

Skipping the Review Process

For automated pipelines:

# Skip sample review and generate directly
synth generate -c "..." --auto-approve

# Or skip sample generation entirely
synth generate -c "..." --skip-sample

Validating Without Generating

Preview inferred schemas without generating data:

synth validate -c "Healthcare clinic needs patient records"

Output (table format):

Schema: patients
┌─────────────┬──────────┬──────────┬─────────────────────┐
│ Field       │ Type     │ Required │ Constraints         │
├─────────────┼──────────┼──────────┼─────────────────────┤
│ id          │ UUID     │ Yes      │ primary_key         │
│ first_name  │ STRING   │ Yes      │                     │
│ last_name   │ STRING   │ Yes      │                     │
│ dob         │ DATE     │ Yes      │                     │
│ email       │ EMAIL    │ No       │                     │
│ phone       │ PHONE    │ No       │                     │
└─────────────┴──────────┴──────────┴─────────────────────┘

Or get JSON output:

synth validate -c "..." --format json

Troubleshooting

"No API key configured"

Make sure you have a .env file with at least one valid API key:

ANTHROPIC_API_KEY=sk-ant-...
# or
OPENAI_API_KEY=sk-...

"Rate limit exceeded"

The API has built-in rate limiting (60 requests/minute). For high-volume generation:

  • Use larger batch sizes
  • Use corpus generation instead of multiple sample requests
  • Consider running your own deployment
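To stay under the 60 requests/minute limit from the client side, spacing calls at least one interval apart is often enough. This is a generic throttle sketch, not a feature of this tool; the clock and sleep functions are injectable so the example runs instantly:

```python
import time

class Throttle:
    """Allow at most `rate` calls per `per` seconds by spacing them evenly."""

    def __init__(self, rate: int = 60, per: float = 60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.interval = per / rate  # minimum gap between calls, in seconds
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self) -> None:
        """Block (via sleep) until the next call is permitted."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval

# Usage with a fake clock so the example does not actually sleep:
slept = []
t = Throttle(rate=60, per=60.0, clock=lambda: 0.0, sleep=slept.append)
t.wait()  # first call passes immediately
t.wait()  # second call waits one full interval (1 second at 60/min)
```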

"Schema validation failed"

Review the validation errors and adjust your context to be more specific:

# Too vague
synth generate -c "company data"

# Better - specific requirements
synth generate -c "Software company with employee records including name, department, hire date, and salary range"

"Generation timeout"

For large datasets, use background corpus generation:

# CLI: Split into batches
synth generate -c "..." --size 50000 --batch-size 1000

# API: Use async endpoint and poll status
curl -X POST ".../generate/corpus" -d '{"corpus_size": 50000}'
# Returns job_id, then poll /jobs/{job_id}/status
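A polling loop for the async corpus endpoint might look like the following. The `get_status` function is injected so the sketch runs standalone; the `/jobs/{job_id}/status` path comes from the example above, while the `status` field name and its values are assumptions for illustration:

```python
from typing import Callable

def poll_job(job_id: str, get_status: Callable[[str], dict],
             max_polls: int = 60) -> dict:
    """Poll the job status endpoint until it reports a terminal state."""
    for _ in range(max_polls):
        status = get_status(f"/jobs/{job_id}/status")
        if status.get("status") in ("completed", "failed"):
            return status
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")

# Fake backend: pending twice, then completed.
responses = iter([
    {"status": "pending"},
    {"status": "pending"},
    {"status": "completed", "rows": 50000},
])
result = poll_job("job-123", lambda url: next(responses))
```

In real use you would add a delay (for example `time.sleep(2)`) between polls to avoid hammering the rate-limited API.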