Conversation
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #372      +/-   ##
==========================================
+ Coverage   92.12%   92.28%   +0.16%
==========================================
  Files          47       55       +8
  Lines        4217     4538     +321
  Branches      766      836      +70
==========================================
+ Hits         3885     4188     +303
- Misses        186      195       +9
- Partials     146      155       +9
```
Pull request overview
This PR introduces a new Pydantic-based specification API for dbldatagen, providing a declarative, type-safe approach to synthetic data generation. The changes add comprehensive validation, test coverage, and example specifications while updating documentation and build configuration to support both Pydantic V1 and V2.
Key Changes:
- New spec-based API with Pydantic models for defining data generation configurations
- Comprehensive validation framework with error collection and reporting
- Pydantic V1/V2 compatibility layer for broad environment support
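Concretely, a declarative table spec in this style might look like the following sketch; the class and field names are illustrative stand-ins, not the PR's actual Pydantic API:

```python
from dataclasses import dataclass, field

# Stand-ins for the PR's Pydantic models (illustrative names only);
# the real classes would subclass pydantic.BaseModel and gain
# parsing and validation for free.
@dataclass
class ColumnDefinition:
    name: str
    type: str
    options: dict = field(default_factory=dict)
    primary_key: bool = False

@dataclass
class TableDefinition:
    number_of_rows: int
    columns: list

spec = TableDefinition(
    number_of_rows=1_000,
    columns=[
        ColumnDefinition(name="user_id", type="long", primary_key=True),
        ColumnDefinition(name="age", type="int", options={"minValue": 18, "maxValue": 90}),
    ],
)
print(len(spec.columns))  # -> 2
```

Because the spec is plain data, it can be validated and serialized before any Spark session exists.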
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| tests/test_specs.py | Comprehensive test suite for ValidationResult, ColumnDefinition, DatagenSpec validation, and target configurations |
| tests/test_datasets_with_specs.py | Tests for Pydantic model validation with BasicUser and BasicStockTicker specifications |
| tests/test_datagen_specs.py | Tests for DatagenSpec creation, validation, and generator options |
| pyproject.toml | Added ipython dependency, test matrix for Pydantic versions, and disabled warn_unused_ignores |
| makefile | Updated to use Pydantic version-specific test environments and removed .venv target |
| examples/datagen_from_specs/basic_user_datagen_spec.py | Example DatagenSpec factory for generating basic user data with pre-configured specs |
| examples/datagen_from_specs/basic_stock_ticker_datagen_spec.py | Complex example with OHLC stock data generation including time-series and volatility modeling |
| examples/datagen_from_specs/README.md | Documentation for Pydantic-based dataset specifications with usage examples |
| dbldatagen/spec/validation.py | ValidationResult class for collecting and reporting validation errors and warnings |
| dbldatagen/spec/output_targets.py | Pydantic models for UCSchemaTarget and FilePathTarget output destinations |
| dbldatagen/spec/generator_spec_impl.py | Generator class implementing the spec-to-DataFrame conversion logic |
| dbldatagen/spec/generator_spec.py | Core DatagenSpec and TableDefinition models with comprehensive validation |
| dbldatagen/spec/compat.py | Pydantic V1/V2 compatibility layer enabling cross-version support |
| dbldatagen/spec/column_spec.py | ColumnDefinition model with validation for primary keys and constraints |
| dbldatagen/spec/__init__.py | Module initialization with lazy imports to avoid heavy dependencies |
| README.md | Updated feature list and formatting to mention new Pydantic-based API |
| CHANGELOG.md | Added entry for Pydantic-based specification API feature |
```python
# Write data based on destination type
if isinstance(output_destination, FilePathTarget):
    output_path = posixpath.join(output_destination.base_path, table_name)
    df.write.format(output_destination.output_format).mode("overwrite").save(output_path)
    logger.info(f"Wrote table '{table_name}' to file path: {output_path}")
elif isinstance(output_destination, UCSchemaTarget):
    output_table = f"{output_destination.catalog}.{output_destination.schema_}.{table_name}"
    df.write.mode("overwrite").saveAsTable(output_table)
    logger.info(f"Wrote table '{table_name}' to Unity Catalog: {output_table}")
```
We should use utils.write_data_to_output for this.

The method in utils relies on OutputDataset. See the other comments above.

We have OutputDataset in config.py. I think we can reuse it here instead of creating new classes?

IMO, the output target in the spec is better suited here, since it performs validations for UC identifiers and file paths/volumes. config.OutputDataset is generic, and getting specific errors would essentially mean copying this over to the new spec folder, or the other way around. Is there something I am missing here?

OutputDataset could extend BaseModel and perform the same validations.
* Added use of ABC to mark TextGenerator as abstract
* Lint text generators module
* Add persistence methods
* Add tests and docs; Update PR template
* Update hatch installation for push action
* Refactor
* Update method names and signatures

Co-authored-by: ronanstokes-db <[email protected]>
Co-authored-by: Ronan Stokes <[email protected]>
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.
dbldatagen/spec/generator_spec.py (outdated)

```python
known_options = ["random", "randomSeed", "randomSeedMethod", "verbose", "debug", "seedColumnName"]
for key in self.generator_options:
    if key not in known_options:
        result.add_warning(f"Unknown generator option: '{key}'. " "This may be ignored during generation.")
```
The warning message contains two separate string literals that should be combined into a single string for better readability.

```diff
- result.add_warning(f"Unknown generator option: '{key}'. " "This may be ignored during generation.")
+ result.add_warning(f"Unknown generator option: '{key}'. This may be ignored during generation.")
```
@anupkalburgi how hard would it be to extend this implementation to load specs from YAML files, e.g. with pydantic-yaml? I'm looking at PRs like #376 and thinking that it could be easier to provide these generators as a "standard" generation notebook that receives a file name as a parameter and generates the data...

That is the core idea behind this PR: as long as we can get the input format (YAML/JSON/Python dict) into a Pydantic model, we can pass it as the config to the generator that takes the Pydantic object. Will be adding examples and helper methods in subsequent PRs.
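The round-trip described here — any serialized format into a dict, then into the model — can be sketched with stdlib JSON and stand-in dataclasses (the real code would parse into the Pydantic DatagenSpec; all names below are hypothetical):

```python
import json
from dataclasses import dataclass

# Hypothetical stand-ins for the Pydantic spec models.
@dataclass
class ColumnSpec:
    name: str
    type: str

@dataclass
class TableSpec:
    number_of_rows: int
    columns: list

def spec_from_dict(raw: dict) -> TableSpec:
    # With Pydantic this step would be parse_obj(raw) (V1)
    # or model_validate(raw) (V2) on the spec model.
    return TableSpec(
        number_of_rows=raw["number_of_rows"],
        columns=[ColumnSpec(**c) for c in raw["columns"]],
    )

# Any front-end format works as long as it yields this dict shape;
# yaml.safe_load(...) on an equivalent YAML file would produce the same input.
raw = json.loads('{"number_of_rows": 100, "columns": [{"name": "id", "type": "long"}]}')
spec = spec_from_dict(raw)
print(spec.number_of_rows)  # -> 100
```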
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 20 comments.
Co-authored-by: Copilot <[email protected]>
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.
ghanse left a comment:
Left a few comments. We also need to add documentation.
```python
:param number_of_rows: Total number of data rows to generate for this table.
    Must be a positive integer
:param partitions: Number of Spark partitions to use when generating data.
    If None, defaults to Spark's default parallelism setting.
    More partitions can improve generation speed for large datasets
:param columns: List of ColumnDefinition objects specifying the columns to generate
    in this table. At least one column must be specified
```
Would a user need to set the same property for every columnspec? For example for random seed value?
dbldatagen/spec/generator_spec.py (outdated)

```python
known_options = ["random", "randomSeed", "randomSeedMethod", "verbose", "debug", "seedColumnName"]
for key in self.generator_options:
    if key not in known_options:
        result.add_warning(f"Unknown generator option: '{key}'. " "This may be ignored during generation.")

def validate(self, strict: bool = True) -> ValidationResult:  # type: ignore[override]
```
Can we modify this slightly to return something usable when validation passes? Maybe add an is_valid field to ValidationResult?

ValidationResult has an is_valid method.

Why should we need a method for this? Is it not a property of the validation result? Is a single validation result intended to be re-used?
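The property variant the reviewer suggests could look roughly like this; the class shape is an assumption about the PR's ValidationResult, not its actual code:

```python
from dataclasses import dataclass, field

# Sketch only: is_valid exposed as a read-only property rather than a
# method, so callers write `result.is_valid` with no call parentheses.
@dataclass
class ValidationResult:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    def add_error(self, msg: str) -> None:
        self.errors.append(msg)

    def add_warning(self, msg: str) -> None:
        self.warnings.append(msg)

    @property
    def is_valid(self) -> bool:
        # Warnings do not invalidate the result; only errors do.
        return not self.errors

result = ValidationResult()
result.add_warning("Unknown generator option: 'foo'.")
print(result.is_valid)  # -> True
result.add_error("At least one column must be specified.")
print(result.is_valid)  # -> False
```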
```python
    )
    display(HTML(message))

df = pd.DataFrame([col.dict() for col in table_def.columns])
```
Can we use Spark here to avoid the Pandas dependency?

This is used for displaying the columns using display(). This is done before we invoke the generator, and access to the Spark object is not verified.

There are ways to display DataFrames with Spark, right?
Co-authored-by: Copilot <[email protected]>
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (1)
tests/test_output.py:110
- This test unconditionally sleeps for the full `time_limit` (40s) before stopping the streaming query, so it will always add ~40s per parametrized case even if data is available earlier. Consider polling for output/progress and breaking early (with a max timeout) to keep CI runtime bounded while still being robust.

```python
start_time = time.time()
elapsed_time = 0
time_limit = 40.0
while elapsed_time < time_limit:
    time.sleep(1)
    elapsed_time = time.time() - start_time
```
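A poll-and-break-early loop of the kind suggested might look like this sketch, where `output_ready` is a hypothetical stand-in for "the streaming query has produced rows":

```python
import time

def wait_until(output_ready, time_limit: float = 40.0, poll_interval: float = 1.0) -> bool:
    # Check the readiness condition each interval and stop as soon as it
    # holds, instead of always sleeping out the full time limit.
    start_time = time.time()
    while time.time() - start_time < time_limit:
        if output_ready():
            return True  # Data arrived early; no need to wait further.
        time.sleep(poll_interval)
    return output_ready()  # Final check once the timeout is reached.

# Returns almost immediately when the condition is already true:
print(wait_until(lambda: True, time_limit=5.0))  # -> True
```

The full 40s becomes a worst-case bound rather than a fixed cost per parametrized case.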
```python
if not v.strip():
    raise ValueError("Identifier must be non-empty.")
if not v.isidentifier():
    logger.warning(f"'{v}' is not a basic Python identifier. Ensure validity for Unity Catalog.")
return v.strip()
```

validate_identifiers checks v.isidentifier() before stripping whitespace, which can generate spurious warnings for values like ' main '. Strip first (and ideally validate/log using the stripped value) so warnings reflect the actual persisted identifier.

```python
value = v.strip()
if not value:
    raise ValueError("Identifier must be non-empty.")
if not value.isidentifier():
    logger.warning(
        f"'{value}' is not a basic Python identifier. Ensure validity for Unity Catalog."
    )
return value
```
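Strip-first ordering can be verified standalone; `normalize_identifier` below is a hypothetical free-function version of the validator, with `logger.warning` replaced by `print` for self-containment:

```python
def normalize_identifier(v: str) -> str:
    # Strip first, then validate, so the check sees the persisted value.
    value = v.strip()
    if not value:
        raise ValueError("Identifier must be non-empty.")
    if not value.isidentifier():
        print(f"warning: '{value}' is not a basic Python identifier")
    return value

print(normalize_identifier("  main  "))  # -> main (and no spurious warning)
```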
```python
if is_primary:
    if "min" in options or "max" in options:
        raise ValueError(f"Primary key column '{name}' cannot have min/max options.")
```

Primary-key validation only rejects the option keys min/max, but the rest of the codebase (and dbldatagen APIs) predominantly use minValue/maxValue (see existing dataset providers). As a result, primary keys can still be configured with minValue/maxValue without error, contradicting the docstring constraint. Consider rejecting minValue/maxValue as well (and/or normalizing legacy keys).
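A sketch of the widened check, treating the exact key set and surrounding names (`is_primary`, `options`, `name`) as assumptions about the PR's code:

```python
# Reject the legacy min/max keys *and* the minValue/maxValue spellings
# used elsewhere in dbldatagen.
FORBIDDEN_PK_KEYS = {"min", "max", "minValue", "maxValue"}

def check_primary_key_options(name: str, is_primary: bool, options: dict) -> None:
    if is_primary:
        bad = sorted(FORBIDDEN_PK_KEYS & options.keys())
        if bad:
            raise ValueError(
                f"Primary key column '{name}' cannot have range options: {', '.join(bad)}."
            )

# minValue now fails just like min would:
try:
    check_primary_key_options("id", True, {"minValue": 0})
except ValueError:
    print("rejected")  # -> rejected
```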
```python
logger.warning("No output destination specified, skipping data write")
return
```

In _writePreparedData, the else branch logs a warning and returns when output_destination is None or an unexpected type. This exits the method on the first table, so later tables are never even built, which conflicts with the docstring ("data is generated but not persisted anywhere") and also with the documented ValueError behavior for unknown destination types. Consider: (1) if the destination is None, still build all tables and just skip the write (use continue), and (2) if the destination is an unrecognized type, raise ValueError instead of silently returning.

```python
if output_destination is None:
    logger.warning(
        f"No output destination specified for table '{table_name}', skipping data write"
    )
    continue
raise ValueError(
    f"Unrecognized output destination type: {type(output_destination).__name__}"
)
```
```python
current = col.name

while current:
    if current in visited:
        # Found a cycle
        cycle_path = " -> ".join([*list(visited), current])
        errors.append(
            f"Table '{table_name}': Circular dependency detected in column '{col.name}': {cycle_path}"
        )
        break

    visited.add(current)
```

Circular dependency error messages build cycle_path from a set (visited), which produces non-deterministic ordering in the reported path. Using an ordered structure (e.g., a list for the traversal path plus a separate set for membership checks) would produce stable, readable cycle reports.

```python
path: list[str] = []
current = col.name
while current:
    if current in visited:
        # Found a cycle; build a deterministic, ordered cycle path
        if current in path:
            start_index = path.index(current)
            cycle_nodes = path[start_index:] + [current]
        else:
            cycle_nodes = path + [current]
        cycle_path = " -> ".join(cycle_nodes)
        errors.append(
            f"Table '{table_name}': Circular dependency detected in column '{col.name}': {cycle_path}"
        )
        break
    visited.add(current)
    path.append(current)
```
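The ordered-path approach can be exercised on its own; `find_cycle` below is a hypothetical helper that wraps the suggested loop, with `base_column` standing in for the spec's dependency lookup:

```python
def find_cycle(start, base_column):
    """Walk baseColumn links from start; return the ordered cycle, or None."""
    visited = set()
    path = []
    current = start
    while current:
        if current in visited:
            # Report the cycle from its first occurrence, in traversal order.
            start_index = path.index(current) if current in path else 0
            return path[start_index:] + [current]
        visited.add(current)
        path.append(current)
        current = base_column.get(current)
    return None

# a -> b -> c -> a is reported deterministically, in traversal order:
deps = {"a": "b", "b": "c", "c": "a"}
print(" -> ".join(find_cycle("a", deps)))  # -> a -> b -> c -> a
```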
```markdown
## Requirements

- Python 3.8+
```

The README states a Python 3.8+ requirement, but the project metadata requires Python >=3.10 (pyproject.toml requires-python). Please align this requirement text with the actual supported minimum version to avoid confusing users.

```markdown
- Python 3.10+
```
```python
import pandas as pd
from IPython.display import HTML, display
```

generator_spec.py unconditionally imports pandas and IPython.display at module import time, but display_all_tables()'s docstring claims it will fall back in non-notebook environments. With the current imports, environments without IPython/pandas will fail at import time before any fallback can occur. Consider moving these imports inside display_all_tables() (or guarding them with try/except) so validation/serialization features can be used without notebook-only deps.

```python
try:
    import pandas as pd
except ImportError:  # Optional dependency; used only for display utilities.
    pd = None  # type: ignore[assignment]
try:
    from IPython.display import HTML, display
except ImportError:  # Optional dependency; used only in notebook environments.
    HTML = None  # type: ignore[assignment]
    display = None  # type: ignore[assignment]
```
```python
    :param result: ValidationResult to collect errors/warnings
    """
    column_names = [col.name for col in table_def.columns]
    duplicates = [name for name in set(column_names) if column_names.count(name) > 1]
```

Duplicate column detection builds duplicates by iterating set(column_names), which makes the resulting error message order non-deterministic across runs. Sorting the duplicates (or preserving first-seen order) will make validation output stable and easier to test/debug.

```python
seen: set[str] = set()
duplicates: list[str] = []
duplicate_seen: set[str] = set()
for name in column_names:
    if name in seen and name not in duplicate_seen:
        duplicates.append(name)
        duplicate_seen.add(name)
    else:
        seen.add(name)
```
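The first-seen-order variant can be demonstrated standalone (`ordered_duplicates` is a hypothetical helper wrapping the suggested loop):

```python
def ordered_duplicates(column_names):
    # Report each duplicated name once, in the order duplicates are first seen.
    seen, duplicate_seen, duplicates = set(), set(), []
    for name in column_names:
        if name in seen and name not in duplicate_seen:
            duplicates.append(name)
            duplicate_seen.add(name)
        else:
            seen.add(name)
    return duplicates

print(ordered_duplicates(["id", "name", "id", "ts", "name", "id"]))  # -> ['id', 'name']
```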
```python
else:
    kwargs = col_def.options.copy() if col_def.options is not None else {}

if col_type:
    kwargs["colType"] = col_type
if col_def.baseColumn:
    kwargs["baseColumn"] = col_def.baseColumn
if col_def.baseColumnType:
    kwargs["baseColumnType"] = col_def.baseColumnType
if col_def.omit is not None:
    kwargs["omit"] = col_def.omit
```

ColumnDefinition.nullable is defined and documented, but _columnSpecToDatagenColumnSpec never forwards it to DataGenerator.withColumn(...) (which supports a nullable kwarg). This means nullable settings in specs are silently ignored. Consider adding nullable=col_def.nullable (or omitting it when None) in the non-primary (and possibly primary) branch to match the spec model.
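A sketch of the suggested forwarding, with `build_kwargs` as a hypothetical extraction of the option-assembly logic above (the `col_def` attribute names follow the snippet, but the function itself is not the PR's code):

```python
from types import SimpleNamespace

def build_kwargs(col_def, col_type=None):
    # Mirrors the option assembly above, plus the missing nullable forwarding.
    kwargs = dict(col_def.options or {})
    if col_type:
        kwargs["colType"] = col_type
    if col_def.baseColumn:
        kwargs["baseColumn"] = col_def.baseColumn
    if col_def.baseColumnType:
        kwargs["baseColumnType"] = col_def.baseColumnType
    if col_def.omit is not None:
        kwargs["omit"] = col_def.omit
    if col_def.nullable is not None:  # the forwarding step the review asks for
        kwargs["nullable"] = col_def.nullable
    return kwargs

# nullable=False now reaches the generator kwargs instead of being dropped:
col = SimpleNamespace(options=None, baseColumn=None, baseColumnType=None, omit=None, nullable=False)
print(build_kwargs(col))  # -> {'nullable': False}
```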
```python
max_val: int

@root_validator()
def validate_range(cls, values):
```

Normal methods should have 'self', rather than 'cls', as their first parameter.

```diff
- def validate_range(cls, values):
+ def validate_range(self, values):
```
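One thing worth checking when triaging this comment: Pydantic's root_validator registers the decorated function as a classmethod, so `cls` is the conventional first parameter there and the suggestion above is likely a linter false positive. A dependency-free sketch of that mechanic (`root_validator_sketch` is illustrative, not Pydantic's implementation):

```python
# Illustrative decorator only -- NOT Pydantic's implementation; it shows
# the mechanic by which @root_validator makes `cls` the right first
# parameter: the function is wrapped in a classmethod.
def root_validator_sketch(func):
    return classmethod(func)

class RangeModel:
    min_val = 1
    max_val = 10

    @root_validator_sketch
    def validate_range(cls, values):
        # Invoked on the class, not an instance, hence `cls`.
        if values["min_val"] > values["max_val"]:
            raise ValueError("min_val must not exceed max_val")
        return values

print(RangeModel.validate_range({"min_val": 1, "max_val": 10}))  # -> {'min_val': 1, 'max_val': 10}
```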
This is now part of the new PR #381.

Moved all the changes to PR #381.
Changes
Linked issues
Resolves #..
Requirements