Fix weighted values for boolean and numeric types by anupkalburgi · Pull Request #378 · databrickslabs/dbldatagen

anupkalburgi · 2026-01-27T16:05:27Z

This commit fixes three issues with weighted values generation and raises a flaky test timeout:

1. ELSE clause type mismatch (function_builder.py):
   The generated CASE expression's ELSE clause used a string literal
   instead of the properly typed value. This caused a DATATYPE_MISMATCH
   error when using boolean or numeric values with weights.
2. Divide-by-zero with normalized weights (column_generation_spec.py):
   When weights sum to 1.0 or less, the modulo operation x % 1 always
   returns 0, and the normalization divisor becomes 0.
   Fix: Scale up small weights and use try_divide for safe division.
3. Single quote escaping (function_builder.py):
   Values containing single quotes (e.g., "O'Brien") would cause SQL
   syntax errors. They are now escaped with a backslash for Spark SQL.
4. Increase streaming test timeout (test_streaming.py):
   Increased time_limit from 10s to 50s to fix intermittent test failures.
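
The two function_builder.py fixes described above (a typed ELSE value and single quote escaping) can be sketched roughly as follows. This is an illustration only, not the code from this PR: `to_sql_literal` and `build_case_expr` are hypothetical names, the threshold-based CASE layout is an assumption, and escaping backslashes in addition to single quotes goes beyond what the description states.

```python
def to_sql_literal(value):
    """Render a Python value as a Spark SQL literal (hypothetical helper)."""
    # bool must be checked before int/float because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Assumption: escape backslashes first, then single quotes, so the
    # backslashes added for the quotes are not escaped a second time.
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_case_expr(selector, bounded_values):
    """Build a CASE expression from (upper_bound, value) pairs.

    The last pair supplies the ELSE value, rendered with the same typed
    literal logic as the WHEN branches rather than as a quoted string.
    """
    if len(bounded_values) < 2:
        raise ValueError("need at least two (bound, value) pairs")
    *head, (_, last_value) = bounded_values
    whens = " ".join(
        f"WHEN {selector} <= {bound} THEN {to_sql_literal(value)}"
        for bound, value in head
    )
    return f"CASE {whens} ELSE {to_sql_literal(last_value)} END"
```

With boolean values, every branch of the generated expression then has the same type: `build_case_expr("sel", [(7, True), (10, False)])` yields `CASE WHEN sel <= 7 THEN true ELSE false END`, where a quoted `'false'` in the ELSE branch would mix BOOLEAN and STRING.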

Requirements

manually tested
updated documentation
updated demos
updated tests


codecov · 2026-01-27T16:40:22Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.12%. Comparing base (e21826a) to head (093ce30).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #378      +/-   ##
==========================================
+ Coverage   92.10%   92.12%   +0.02%     
==========================================
  Files          47       47              
  Lines        4206     4217      +11     
  Branches      763      766       +3     
==========================================
+ Hits         3874     3885      +11     
  Misses        186      186              
  Partials      146      146

Add TestWeightedNonRandom class with two tests that cover the scale_factor branches in column_generation_spec.py:
- test_weighted_string_non_random_normalized_weights: covers the scale_factor = 1000000 branch when weights_sum <= 1.0
- test_weighted_string_non_random_integer_weights: covers the scale_factor = weights_sum branch when weights_sum > 1.0

These tests exercise the non-random code path (without random=True) to ensure the weighted values functionality works correctly for both normalized and integer weights.
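
The two scale_factor branches named in this commit message can be illustrated with a small pure-Python sketch. It is not the code in column_generation_spec.py: `pick_scale_factor` and `scaled_selector` are hypothetical names, and mapping the seed into [0, 1) is an assumption about how the scaled value is used.

```python
def pick_scale_factor(weights):
    """Choose the modulus used to turn a seed value into a weight bucket."""
    weights_sum = sum(weights)
    # Normalized weights (sum <= 1.0) would give a modulus of at most 1, and
    # x % 1 is always 0 for integers, so a large fixed scale is used instead.
    if weights_sum <= 1.0:
        return 1_000_000
    # Integer-style weights: the sum itself is a usable modulus.
    return weights_sum


def scaled_selector(seed, weights):
    """Map an integer seed to a value in [0, 1) (assumed usage)."""
    scale = pick_scale_factor(weights)
    return (seed % scale) / scale


# The failure mode from the PR description: every integer modulo 1 is 0.
assert all(seed % 1 == 0 for seed in range(100))
```

Under these assumptions, weights such as `[0.7, 0.3]` take the 1000000 branch and weights such as `[7, 3]` take the weights_sum branch, matching the two cases the new tests cover.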

ghanse

Looking good. Left some small comments.

dbldatagen/column_generation_spec.py

tests/test_weighted_boolean.py

- Convert test_weighted_boolean.py from unittest to pytest style
- Add defensive ValueError when scale=1 with normalize=True in _getScaledIntSQLExpression to prevent silent division by zero
- Replace try_divide with explicit validation and regular division

Co-Authored-By: Claude (databricks-claude-opus-4-5) <noreply@anthropic.com>
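
The switch from try_divide to explicit validation might look something like the sketch below. The real signature and body of _getScaledIntSQLExpression are not shown in this conversation, so `scaled_int_sql_expression` is a hypothetical stand-in, and the divisor `scale - 1` is an assumption chosen only because it is zero when scale is 1.

```python
def scaled_int_sql_expression(col_name, scale, normalize=False):
    """Build a SQL fragment that scales a column into a fixed range.

    Hypothetical stand-in for the library method; the divisor below is
    an assumption made for illustration.
    """
    # Fail loudly up front instead of letting the generated SQL divide by zero
    # (or silently return NULL, which is what try_divide would do).
    if normalize and scale == 1:
        raise ValueError(
            "scale must be greater than 1 when normalize=True; "
            "the normalization divisor would be zero"
        )
    expr = f"({col_name} % {scale})"
    if normalize:
        # Regular division is safe here because scale - 1 is known to be non-zero.
        expr = f"({expr} / {scale - 1})"
    return expr
```

Raising at expression-build time surfaces a bad weight configuration where the spec is defined, rather than as NULL values in the generated data.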

ghanse

LGTM

anupkalburgi requested review from a team as code owners January 27, 2026 16:05

anupkalburgi requested review from nfx and suryasaitura-db and removed request for a team January 27, 2026 16:05

anupkalburgi requested a review from ghanse January 27, 2026 16:40

ghanse requested changes Jan 28, 2026


anupkalburgi and others added 2 commits January 28, 2026 09:39

cleaning up the comment

093ce30

anupkalburgi requested a review from ghanse January 28, 2026 15:58

ghanse approved these changes Jan 28, 2026

ghanse merged commit 4687b8c into master Jan 28, 2026
5 checks passed

ghanse deleted the ak/fix-bool-data branch January 28, 2026 21:29

ghanse merged 5 commits into master from ak/fix-bool-data
