
Specification driven datagen #372

Closed
anupkalburgi wants to merge 32 commits into master from ak/spec

Conversation

@anupkalburgi
Collaborator

Changes

Linked issues

Resolves #..

Requirements

  • manually tested
  • updated documentation
  • updated demos
  • updated tests

@codecov

codecov bot commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 95.01558% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.28%. Comparing base (4687b8c) to head (d527300).
⚠️ Report is 1 commit behind head on master.

Files with missing lines                  Patch %   Lines
dbldatagen/spec/generator_spec_impl.py    91.89%    3 Missing and 6 partials ⚠️
dbldatagen/spec/generator_spec.py         95.41%    3 Missing and 2 partials ⚠️
dbldatagen/spec/compat.py                 60.00%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #372      +/-   ##
==========================================
+ Coverage   92.12%   92.28%   +0.16%     
==========================================
  Files          47       55       +8     
  Lines        4217     4538     +321     
  Branches      766      836      +70     
==========================================
+ Hits         3885     4188     +303     
- Misses        186      195       +9     
- Partials      146      155       +9     

☔ View full report in Codecov by Sentry.

@anupkalburgi anupkalburgi marked this pull request as ready for review December 2, 2025 14:31
@anupkalburgi anupkalburgi requested review from a team as code owners December 2, 2025 14:31
@anupkalburgi anupkalburgi requested review from nfx and renardeinside and removed request for a team December 2, 2025 14:31
@ghanse ghanse requested a review from Copilot December 4, 2025 18:04
Copilot AI left a comment

Pull request overview

This PR introduces a new Pydantic-based specification API for dbldatagen, providing a declarative, type-safe approach to synthetic data generation. The changes add comprehensive validation, test coverage, and example specifications while updating documentation and build configuration to support both Pydantic V1 and V2.

Key Changes:

  • New spec-based API with Pydantic models for defining data generation configurations
  • Comprehensive validation framework with error collection and reporting
  • Pydantic V1/V2 compatibility layer for broad environment support

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/test_specs.py: Comprehensive test suite for ValidationResult, ColumnDefinition, DatagenSpec validation, and target configurations
tests/test_datasets_with_specs.py: Tests for Pydantic model validation with BasicUser and BasicStockTicker specifications
tests/test_datagen_specs.py: Tests for DatagenSpec creation, validation, and generator options
pyproject.toml: Added ipython dependency, test matrix for Pydantic versions, and disabled warn_unused_ignores
makefile: Updated to use Pydantic version-specific test environments and removed .venv target
examples/datagen_from_specs/basic_user_datagen_spec.py: Example DatagenSpec factory for generating basic user data with pre-configured specs
examples/datagen_from_specs/basic_stock_ticker_datagen_spec.py: Complex example with OHLC stock data generation including time-series and volatility modeling
examples/datagen_from_specs/README.md: Documentation for Pydantic-based dataset specifications with usage examples
dbldatagen/spec/validation.py: ValidationResult class for collecting and reporting validation errors and warnings
dbldatagen/spec/output_targets.py: Pydantic models for UCSchemaTarget and FilePathTarget output destinations
dbldatagen/spec/generator_spec_impl.py: Generator class implementing the spec-to-DataFrame conversion logic
dbldatagen/spec/generator_spec.py: Core DatagenSpec and TableDefinition models with comprehensive validation
dbldatagen/spec/compat.py: Pydantic V1/V2 compatibility layer enabling cross-version support
dbldatagen/spec/column_spec.py: ColumnDefinition model with validation for primary keys and constraints
dbldatagen/spec/__init__.py: Module initialization with lazy imports to avoid heavy dependencies
README.md: Updated feature list and formatting to mention new Pydantic-based API
CHANGELOG.md: Added entry for Pydantic-based specification API feature


Comment on lines +263 to +272
# Write data based on destination type
if isinstance(output_destination, FilePathTarget):
    output_path = posixpath.join(output_destination.base_path, table_name)
    df.write.format(output_destination.output_format).mode("overwrite").save(output_path)
    logger.info(f"Wrote table '{table_name}' to file path: {output_path}")

elif isinstance(output_destination, UCSchemaTarget):
    output_table = f"{output_destination.catalog}.{output_destination.schema_}.{table_name}"
    df.write.mode("overwrite").saveAsTable(output_table)
    logger.info(f"Wrote table '{table_name}' to Unity Catalog: {output_table}")
Collaborator

We should use utils.write_data_to_output for this.

Collaborator Author

The method in utils relies on OutputDataset. See the other comments above.

Collaborator

We have OutputDataset in config.py. I think we can reuse it here instead of creating new classes?

Collaborator Author

IMO, the output target in spec is better suited here, as it validates UC identifiers and file/volume paths. config.OutputDataset is generic, and getting specific errors would mean essentially copying it over to the new spec folder, or the other way round. Is there something I am missing here?

Collaborator

OutputDataset could extend BaseModel and perform the same validations.
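To make the trade-off concrete, the kind of validation at stake looks roughly like the following. This is sketched with a plain dataclass for self-containment where the PR uses pydantic models; the field names mirror the PR (`catalog`, `schema_`), but the checks themselves are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UCSchemaTarget:
    """Illustrative stand-in for the spec's pydantic UCSchemaTarget."""
    catalog: str
    schema_: str

    def __post_init__(self):
        # Roughly the checks pydantic validators would run on construction.
        for label, value in (("catalog", self.catalog), ("schema", self.schema_)):
            if not value.strip():
                raise ValueError(f"{label} must be non-empty")
            if not value.isidentifier():
                raise ValueError(f"'{value}' is not a valid {label} identifier")

    def full_table_name(self, table: str) -> str:
        return f"{self.catalog}.{self.schema_}.{table}"
```

Whether these checks live on a spec-specific target class or on a BaseModel-backed OutputDataset, the validation logic itself is the same; the disagreement is only about where it should live.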

ghanse and others added 2 commits December 10, 2025 14:36
* added use of ABC to mark TextGenerator as abstract

* Lint text generators module

* Add persistence methods

* Add tests and docs; Update PR template

* Update hatch installation for push action

* Refactor

* Update method names and signatures

---------

Co-authored-by: ronanstokes-db <[email protected]>
Co-authored-by: Ronan Stokes <[email protected]>
@anupkalburgi anupkalburgi changed the title Ak/spec Specification driven datagneration Dec 18, 2025
@anupkalburgi anupkalburgi changed the title Specification driven datagneration Specification driven datagen Dec 18, 2025
@alexott alexott requested a review from Copilot December 23, 2025 13:54
Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.



known_options = ["random", "randomSeed", "randomSeedMethod", "verbose", "debug", "seedColumnName"]
for key in self.generator_options:
    if key not in known_options:
        result.add_warning(f"Unknown generator option: '{key}'. " "This may be ignored during generation.")
Copilot AI Dec 23, 2025

The warning message contains two separate string literals that should be combined into a single string for better readability.

Suggested change
result.add_warning(f"Unknown generator option: '{key}'. " "This may be ignored during generation.")
result.add_warning(f"Unknown generator option: '{key}'. This may be ignored during generation.")

Collaborator

Agree

@alexott
Contributor

alexott commented Dec 23, 2025

@anupkalburgi how hard would it be to extend this implementation to load specs from YAML files? I.e., with pydantic-yaml.

I'm looking into PRs like #376 and thinking that it could be easier to provide these generators as a "standard" generation notebook that will receive a file name as a parameter and generate data...

@anupkalburgi
Collaborator Author

@anupkalburgi how hard would it be to extend this implementation to load specs from YAML files? I.e., with pydantic-yaml.

I'm looking into PRs like #376 and thinking that it could be easier to provide these generators as a "standard" generation notebook that will receive a file name as a parameter and generate data...

That is the core idea behind this PR: as long as we can get the input format (YAML/JSON/Python dict) into a pydantic model, we can pass it as the config to the generator, which takes the pydantic object. Will be adding examples and helper methods in subsequent PRs.
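The flow described here, where any input format is normalized to a dict before it reaches the pydantic model, can be sketched as follows. `load_spec_dict` is a hypothetical helper name; with PyYAML or pydantic-yaml, the `json.loads` step would become `yaml.safe_load`, and the resulting dict would then feed `DatagenSpec` via `parse_obj` / `model_validate`:

```python
import json


def load_spec_dict(text: str) -> dict:
    """Parse a JSON spec document into a plain dict.

    Hypothetical helper: a YAML front end would call yaml.safe_load(text)
    here instead; either way, the dict is what the pydantic spec model
    consumes.
    """
    spec = json.loads(text)
    if "tables" not in spec:
        raise ValueError("spec must define 'tables'")
    return spec


doc = '{"tables": {"users": {"number_of_rows": 100, "columns": []}}}'
spec = load_spec_dict(doc)
```

The spec structure shown in `doc` is illustrative, not the PR's exact schema; the point is that the generator only ever sees the validated pydantic object, regardless of the on-disk format.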

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 20 comments.



@alexott alexott requested a review from GeekSheikh January 13, 2026 16:11
Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.



Collaborator

@ghanse ghanse left a comment

Left a few comments. We also need to add documentation.

Comment on lines +25 to +31
:param number_of_rows: Total number of data rows to generate for this table.
Must be a positive integer
:param partitions: Number of Spark partitions to use when generating data.
If None, defaults to Spark's default parallelism setting.
More partitions can improve generation speed for large datasets
:param columns: List of ColumnDefinition objects specifying the columns to generate
in this table. At least one column must be specified
Collaborator

Would a user need to set the same property for every columnspec? For example for random seed value?

known_options = ["random", "randomSeed", "randomSeedMethod", "verbose", "debug", "seedColumnName"]
for key in self.generator_options:
    if key not in known_options:
        result.add_warning(f"Unknown generator option: '{key}'. " "This may be ignored during generation.")
Collaborator

Agree

    if key not in known_options:
        result.add_warning(f"Unknown generator option: '{key}'. " "This may be ignored during generation.")

def validate(self, strict: bool = True) -> ValidationResult:  # type: ignore[override]
Collaborator

Can we modify this slightly to return something usable when validation passes? Maybe add an is_valid field to ValidationResult?

Collaborator Author

ValidationResult has an is_valid method.

Collaborator

@ghanse ghanse Feb 4, 2026

Why should we need a method for this? Is it not a property of the validation result? Is a single validation result intended to be re-used?
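One way to resolve this question, sketched under the assumption that ValidationResult simply collects error and warning lists, is to expose `is_valid` as a read-only property rather than a method:

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    """Minimal sketch of a validation-result collector; the PR's actual
    class has more behavior, but the property pattern is the point here."""
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    def add_error(self, msg: str) -> None:
        self.errors.append(msg)

    def add_warning(self, msg: str) -> None:
        self.warnings.append(msg)

    @property
    def is_valid(self) -> bool:
        # Valid when no errors were collected; warnings alone do not
        # invalidate the spec.
        return not self.errors
```

A property fits a derived, side-effect-free state like this: callers write `result.is_valid` instead of `result.is_valid()`, and the result object stays a plain value that is not intended to be reused across validation runs.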

)
display(HTML(message))

df = pd.DataFrame([col.dict() for col in table_def.columns])
Collaborator

Can we use Spark here to avoid the Pandas dependency?

Collaborator Author

@anupkalburgi anupkalburgi Jan 21, 2026

This is used for displaying the columns using display()
This is done before we invoke the generator, and access to the Spark object is not verified.

Collaborator

There are ways to display DataFrames with Spark, right?


@anupkalburgi anupkalburgi requested a review from ghanse January 23, 2026 21:28
Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 9 comments.

Comments suppressed due to low confidence (1)

tests/test_output.py:110

  • This test unconditionally sleeps for the full time_limit (40s) before stopping the streaming query, so it will always add ~40s per parametrized case even if data is available earlier. Consider polling for output/progress and breaking early (with a max timeout) to keep CI runtime bounded while still being robust.
        start_time = time.time()
        elapsed_time = 0
        time_limit = 40.0

        while elapsed_time < time_limit:
            time.sleep(1)
            elapsed_time = time.time() - start_time
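The early-break polling suggested here could look like the helper below. `wait_until` and its parameters are hypothetical names; in the streaming test, the condition callback would inspect the query's progress or count the rows written so far:

```python
import time


def wait_until(condition, timeout: float = 40.0, poll_interval: float = 1.0) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the condition holds, so the caller only pays
    the full timeout in the failure case.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    # One final check in case the condition became true during the last sleep.
    return condition()
```

In the test this would replace the fixed-duration loop, keeping the 40s as an upper bound instead of a guaranteed cost per parametrized case.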



Comment on lines +48 to +52
if not v.strip():
    raise ValueError("Identifier must be non-empty.")
if not v.isidentifier():
    logger.warning(f"'{v}' is not a basic Python identifier. Ensure validity for Unity Catalog.")
return v.strip()
Copilot AI Feb 10, 2026

validate_identifiers checks v.isidentifier() before stripping whitespace, which can generate spurious warnings for values like ' main '. Strip first (and ideally validate/log using the stripped value) so warnings reflect the actual persisted identifier.

Suggested change
if not v.strip():
    raise ValueError("Identifier must be non-empty.")
if not v.isidentifier():
    logger.warning(f"'{v}' is not a basic Python identifier. Ensure validity for Unity Catalog.")
return v.strip()
value = v.strip()
if not value:
    raise ValueError("Identifier must be non-empty.")
if not value.isidentifier():
    logger.warning(
        f"'{value}' is not a basic Python identifier. Ensure validity for Unity Catalog."
    )
return value

Comment on lines +78 to +81
if is_primary:
    if "min" in options or "max" in options:
        raise ValueError(f"Primary key column '{name}' cannot have min/max options.")

Copilot AI Feb 10, 2026

Primary-key validation only rejects option keys min/max, but the rest of the codebase (and dbldatagen APIs) predominantly use minValue/maxValue (see existing dataset providers). As a result, primary keys can still be configured with minValue/maxValue without error, contradicting the docstring constraint. Consider rejecting minValue/maxValue as well (and/or normalizing legacy keys).

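The widened check could be sketched as a standalone helper (hypothetical name; in the PR, this logic lives inside a pydantic validator on ColumnDefinition):

```python
def check_primary_key_options(name: str, options: dict) -> None:
    """Reject range options on a primary key column, covering both the
    short (min/max) and long (minValue/maxValue) spellings that appear
    across dbldatagen APIs."""
    forbidden = {"min", "max", "minValue", "maxValue"}
    bad = sorted(forbidden & options.keys())
    if bad:
        raise ValueError(
            f"Primary key column '{name}' cannot have range options: {', '.join(bad)}"
        )
```

Normalizing the legacy keys at parse time would be an alternative, but rejecting both spellings keeps the documented constraint enforceable without silently rewriting user input.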
Comment on lines +265 to +266
logger.warning("No output destination specified, skipping data write")
return
Copilot AI Feb 10, 2026

In _writePreparedData, the else branch logs a warning and returns when output_destination is None or an unexpected type. This exits the method on the first table, so later tables are never even built, which conflicts with the docstring (“data is generated but not persisted anywhere”) and also with the documented ValueError behavior for unknown destination types. Consider: (1) if destination is None, still build all tables and just skip write (use continue), and (2) if destination is an unrecognized type, raise ValueError instead of silently returning.

Suggested change
logger.warning("No output destination specified, skipping data write")
return
if output_destination is None:
    logger.warning(
        f"No output destination specified for table '{table_name}', skipping data write"
    )
    continue
raise ValueError(
    f"Unrecognized output destination type: {type(output_destination).__name__}"
)

Comment on lines +134 to +145
current = col.name

while current:
    if current in visited:
        # Found a cycle
        cycle_path = " -> ".join([*list(visited), current])
        errors.append(
            f"Table '{table_name}': Circular dependency detected in column '{col.name}': {cycle_path}"
        )
        break

    visited.add(current)
Copilot AI Feb 10, 2026

Circular dependency error messages build cycle_path from a set (visited), which produces non-deterministic ordering in the reported path. Using an ordered structure (e.g., list for the traversal path plus a separate set for membership checks) would produce stable, readable cycle reports.

Suggested change
current = col.name
while current:
    if current in visited:
        # Found a cycle
        cycle_path = " -> ".join([*list(visited), current])
        errors.append(
            f"Table '{table_name}': Circular dependency detected in column '{col.name}': {cycle_path}"
        )
        break
    visited.add(current)
path: list[str] = []
current = col.name
while current:
    if current in visited:
        # Found a cycle; build a deterministic, ordered cycle path
        if current in path:
            start_index = path.index(current)
            cycle_nodes = path[start_index:] + [current]
        else:
            cycle_nodes = path + [current]
        cycle_path = " -> ".join(cycle_nodes)
        errors.append(
            f"Table '{table_name}': Circular dependency detected in column '{col.name}': {cycle_path}"
        )
        break
    visited.add(current)
    path.append(current)


## Requirements

- Python 3.8+
Copilot AI Feb 10, 2026

The README states a Python 3.8+ requirement, but the project metadata requires Python >=3.10 (pyproject.toml requires-python). Please align this requirement text with the actual supported minimum version to avoid confusing users.

Suggested change
- Python 3.8+
- Python 3.10+

Comment on lines +6 to +8
import pandas as pd
from IPython.display import HTML, display

Copilot AI Feb 10, 2026

generator_spec.py unconditionally imports pandas and IPython.display at module import time, but display_all_tables()’s docstring claims it will fall back in non-notebook environments. With the current imports, environments without IPython/pandas will fail at import-time before any fallback can occur. Consider moving these imports inside display_all_tables() (or guarding them with try/except) so validation/serialization features can be used without notebook-only deps.

Suggested change
import pandas as pd
from IPython.display import HTML, display
try:
    import pandas as pd
except ImportError:  # Optional dependency; used only for display utilities.
    pd = None  # type: ignore[assignment]
try:
    from IPython.display import HTML, display
except ImportError:  # Optional dependency; used only in notebook environments.
    HTML = None  # type: ignore[assignment]
    display = None  # type: ignore[assignment]

:param result: ValidationResult to collect errors/warnings
"""
column_names = [col.name for col in table_def.columns]
duplicates = [name for name in set(column_names) if column_names.count(name) > 1]
Copilot AI Feb 10, 2026

Duplicate column detection builds duplicates by iterating a set(column_names), which makes the resulting error message order non-deterministic across runs. Sorting the duplicates (or preserving first-seen order) will make validation output stable and easier to test/debug.

Suggested change
duplicates = [name for name in set(column_names) if column_names.count(name) > 1]
seen: set[str] = set()
duplicates: list[str] = []
duplicate_seen: set[str] = set()
for name in column_names:
    if name in seen and name not in duplicate_seen:
        duplicates.append(name)
        duplicate_seen.add(name)
    else:
        seen.add(name)

Comment on lines +72 to +83
else:
    kwargs = col_def.options.copy() if col_def.options is not None else {}

    if col_type:
        kwargs["colType"] = col_type
    if col_def.baseColumn:
        kwargs["baseColumn"] = col_def.baseColumn
    if col_def.baseColumnType:
        kwargs["baseColumnType"] = col_def.baseColumnType
    if col_def.omit is not None:
        kwargs["omit"] = col_def.omit

Copilot AI Feb 10, 2026

ColumnDefinition.nullable is defined and documented, but _columnSpecToDatagenColumnSpec never forwards it to DataGenerator.withColumn(...) (which supports a nullable kwarg). This means nullable settings in specs are silently ignored. Consider adding nullable=col_def.nullable (or omitting it when None) in the non-primary (and possibly primary) branch to match the spec model.

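The fix for the dropped `nullable` setting could be a one-line addition to the kwargs assembly quoted above. Sketched here as a standalone helper with a hypothetical name, forwarding `nullable` only when the spec sets it explicitly:

```python
def build_column_kwargs(options=None, base_column=None, nullable=None):
    """Assemble kwargs for DataGenerator.withColumn, forwarding nullable
    only when it is explicitly set (None means 'use the generator's
    default'). Illustrative helper, not the PR's actual code."""
    kwargs = dict(options or {})
    if base_column:
        kwargs["baseColumn"] = base_column
    if nullable is not None:
        kwargs["nullable"] = nullable
    return kwargs
```

Guarding on `is not None` matters: passing `nullable=None` through unconditionally would override the generator's default, while omitting the key preserves it.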
max_val: int

@root_validator()
def validate_range(cls, values):
Copilot AI Feb 10, 2026

Normal methods should have 'self', rather than 'cls', as their first parameter.

Suggested change
def validate_range(cls, values):
def validate_range(self, values):

@anupkalburgi
Collaborator Author

This is now part of the new PR #381

@anupkalburgi
Collaborator Author

Moved all the changes to PR 381
