Fix weighted values for boolean and numeric types #378
Merged
This commit fixes two issues with weighted values generation:

1. ELSE clause type mismatch (function_builder.py): The generated CASE expression ELSE clause used a string literal instead of the properly typed value. This caused a DATATYPE_MISMATCH error when using boolean or numeric values with weights.
2. Divide-by-zero with normalized weights (column_generation_spec.py): When weights sum to 1.0 or less, the modulo operation x % 1 always returns 0, and the normalization divisor becomes 0. Fix: Scale up small weights and use try_divide for safe division.
This commit fixes three issues with weighted values generation and raises a streaming test timeout:

1. ELSE clause type mismatch (function_builder.py): The generated CASE expression ELSE clause used a string literal instead of the properly typed value. This caused a DATATYPE_MISMATCH error when using boolean or numeric values with weights.
2. Divide-by-zero with normalized weights (column_generation_spec.py): When weights sum to 1.0 or less, the modulo operation x % 1 always returns 0, and the normalization divisor becomes 0. Fix: Scale up small weights and use try_divide for safe division.
3. Single quote escaping (function_builder.py): Values containing single quotes (e.g., "O'Brien") would cause SQL syntax errors. They are now escaped with a backslash for Spark SQL.
4. Increase streaming test timeout (test_streaming.py): Increased time_limit from 10s to 50s to fix intermittent test failures.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master     #378      +/-   ##
==========================================
+ Coverage   92.10%   92.12%   +0.02%
==========================================
  Files          47       47
  Lines        4206     4217      +11
  Branches      763      766       +3
==========================================
+ Hits         3874     3885      +11
  Misses        186      186
  Partials      146      146
Add TestWeightedNonRandom class with two tests that cover the scale_factor branches in column_generation_spec.py:

- test_weighted_string_non_random_normalized_weights: covers the scale_factor = 1000000 branch when weights_sum <= 1.0
- test_weighted_string_non_random_integer_weights: covers the scale_factor = weights_sum branch when weights_sum > 1.0

These tests exercise the non-random code path (without random=True) to ensure the weighted values functionality works correctly for both normalized and integer weights.
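The two branches these tests cover can be sketched roughly as follows. This is a minimal illustration, not the code in column_generation_spec.py: `choose_scale_factor` and `scale_weights` are hypothetical names, and only the 1000000 constant and the `weights_sum <= 1.0` condition are taken from the commit message above.

```python
SMALL_WEIGHT_SCALE = 1000000  # constant named in the commit message


def choose_scale_factor(weights):
    """Return the scale factor for a list of weights (hypothetical helper)."""
    weights_sum = sum(weights)
    if weights_sum <= 1.0:
        # Normalized weights: a modulo by 1 would always give 0, so scale up.
        return SMALL_WEIGHT_SCALE
    # Integer-style weights: their sum is already a usable modulus.
    return weights_sum


def scale_weights(weights):
    """Scale normalized weights to integers (assumed approach, for illustration)."""
    if sum(weights) <= 1.0:
        return [round(w * SMALL_WEIGHT_SCALE) for w in weights]
    return list(weights)


# For any integer x, x % 1 is 0, which is the root of the original bug.
assert all(x % 1 == 0 for x in range(100))
```

With normalized weights such as `[0.25, 0.25, 0.5]` the first branch applies; with integer weights such as `[1, 2, 7]` the second one does.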
ghanse (Collaborator) requested changes on Jan 28, 2026:
Looking good. Left some small comments.
- Convert test_weighted_boolean.py from unittest to pytest style
- Add defensive ValueError when scale=1 with normalize=True in _getScaledIntSQLExpression to prevent silent division by zero
- Replace try_divide with explicit validation and regular division

Co-Authored-By: Claude (databricks-claude-opus-4-5) <noreply@anthropic.com>
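The defensive check described in this commit might look something like the sketch below. The function name `scaled_int_sql_expression`, its parameters, and in particular the `(x % scale) / (scale - 1)` normalization formula are assumptions made for illustration; the actual body of `_getScaledIntSQLExpression` is not shown in this conversation.

```python
def scaled_int_sql_expression(base_expr, scale, normalize=False):
    """Build a SQL fragment that maps base_expr into [0, scale) (hypothetical sketch)."""
    if normalize and scale == 1:
        # Assumed normalization divides by (scale - 1), which would be 0 here.
        raise ValueError("scale must be greater than 1 when normalize=True")

    modulo_expr = f"({base_expr} % {scale})"
    if not normalize:
        return modulo_expr
    # Explicit validation above makes a plain division safe; no try_divide needed.
    return f"({modulo_expr} / {scale - 1})"
```

Raising early turns a silent divide-by-zero in the generated SQL into an immediate, readable error at spec-building time.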
This commit fixes three issues with weighted values generation and raises a streaming test timeout:

1. ELSE clause type mismatch (function_builder.py):
   The generated CASE expression ELSE clause used a string literal instead of the properly typed value. This caused a DATATYPE_MISMATCH error when using boolean or numeric values with weights.
2. Divide-by-zero with normalized weights (column_generation_spec.py):
   When weights sum to 1.0 or less, the modulo operation x % 1 always returns 0, and the normalization divisor becomes 0.
   Fix: Scale up small weights and use try_divide for safe division.
3. Single quote escaping (function_builder.py):
   Values containing single quotes (e.g., "O'Brien") would cause SQL syntax errors. They are now escaped with a backslash for Spark SQL.
4. Increase streaming test timeout (test_streaming.py):
   Increased time_limit from 10s to 50s to fix intermittent test failures.
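To illustrate the ELSE clause and quote-escaping fixes, here is a small sketch of rendering Python values as typed SQL literals. `to_sql_literal` and `weighted_case_expr` are hypothetical helpers, not functions from function_builder.py, and escaping backslashes in addition to single quotes is an assumption on top of what the description states.

```python
def to_sql_literal(value):
    """Render a Python value as a typed SQL literal (hypothetical helper)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Strings: escape backslashes first (assumption), then single quotes.
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def weighted_case_expr(selector, values, thresholds):
    """Build a CASE expression; the last value goes in the ELSE branch."""
    if len(values) != len(thresholds) + 1:
        raise ValueError("need exactly one more value than thresholds")
    branches = [
        f"WHEN {selector} < {limit} THEN {to_sql_literal(value)}"
        for value, limit in zip(values, thresholds)
    ]
    # The ELSE branch uses the same typed rendering as the WHEN branches,
    # so every branch of the CASE expression has one consistent type.
    return f"CASE {' '.join(branches)} ELSE {to_sql_literal(values[-1])} END"
```

For example, `weighted_case_expr("x", [True, False], [3])` yields a CASE expression whose branches are all boolean literals rather than a mix of booleans and a quoted string.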
Requirements