Fix weighted values for boolean and numeric types #378

Merged: ghanse merged 5 commits into master from ak/fix-bool-data on Jan 28, 2026

@anupkalburgi (Collaborator) commented Jan 27, 2026

This PR fixes three issues with weighted values generation and increases a streaming test timeout:

  1. ELSE clause type mismatch (function_builder.py):
    The generated CASE expression ELSE clause used a string literal
    instead of the properly typed value. This caused a DATATYPE_MISMATCH
    error when using boolean or numeric values with weights.

  2. Divide-by-zero with normalized weights (column_generation_spec.py):
    When weights sum to 1.0 or less, the modulo operation x % 1 always
    returns 0, and the normalization divisor becomes 0.

    Fix: Scale up small weights and use try_divide for safe division.

  3. Single quote escaping (function_builder.py):
    Values containing single quotes (e.g., "O'Brien") caused SQL syntax
    errors. They are now escaped with a backslash, as Spark SQL expects.

  4. Increase streaming test timeout (test_streaming.py):
    Increased time_limit from 10s to 50s to fix intermittent test failures.
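
Items 1 and 3 both concern how a Python value is rendered into the generated CASE expression. The following is a minimal sketch of the idea, not the actual code in function_builder.py: `to_sql_literal` and `build_case_expr` are hypothetical names, and escaping backslashes before single quotes is an added assumption.

```python
def to_sql_literal(value):
    """Render a Python value as a Spark SQL literal (hypothetical helper)."""
    if value is None:
        return "NULL"
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Strings: escape backslashes first, then single quotes, then quote.
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_case_expr(selector, branches, default):
    """Build CASE WHEN selector <= bound THEN value ... ELSE default END.

    Every THEN value and the ELSE value go through the same literal
    renderer, so all branches of the CASE share one SQL type.
    """
    whens = " ".join(
        f"WHEN {selector} <= {bound} THEN {to_sql_literal(value)}"
        for bound, value in branches
    )
    return f"CASE {whens} ELSE {to_sql_literal(default)} END"


print(build_case_expr("x", [(0.7, True)], False))
print(to_sql_literal("O'Brien"))
```

Rendering the ELSE value as `'false'` (a string) instead of `false` is the kind of mismatch that would make Spark reject the expression when the THEN branches are boolean.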

Requirements

  • manually tested
  • updated documentation
  • updated demos
  • updated tests

@anupkalburgi requested review from a team as code owners on January 27, 2026 16:05
@anupkalburgi requested review from nfx and suryasaitura-db and removed the request for a team on January 27, 2026 16:05

codecov bot commented Jan 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.12%. Comparing base (e21826a) to head (093ce30).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #378      +/-   ##
==========================================
+ Coverage   92.10%   92.12%   +0.02%     
==========================================
  Files          47       47              
  Lines        4206     4217      +11     
  Branches      763      766       +3     
==========================================
+ Hits         3874     3885      +11     
  Misses        186      186              
  Partials      146      146              

@anupkalburgi requested a review from ghanse on January 27, 2026 16:40
Add TestWeightedNonRandom class with two tests that cover the
scale_factor branches in column_generation_spec.py:
- test_weighted_string_non_random_normalized_weights: covers the
  scale_factor = 1000000 branch when weights_sum <= 1.0
- test_weighted_string_non_random_integer_weights: covers the
  scale_factor = weights_sum branch when weights_sum > 1.0

These tests exercise the non-random code path (without random=True)
to ensure the weighted values functionality works correctly for both
normalized and integer weights.
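
The two branches these tests target can be illustrated in plain Python. This is a sketch based only on the branch conditions named above, not the code in column_generation_spec.py; `choose_scale_factor` is a hypothetical name.

```python
def choose_scale_factor(weights):
    """Pick the modulus used to spread a seed across the weights.

    Assumed rule, taken from the branch conditions described above:
    normalized weights (sum <= 1.0) get a large fixed scale, while
    larger weights use their own sum.
    """
    weights_sum = sum(weights)
    if weights_sum <= 1.0:
        return 1_000_000
    return weights_sum


# Why a scale of 1 is useless: any integer modulo 1 is 0, so every
# row would land in the same bucket.
assert all(seed % 1 == 0 for seed in range(100))

print(choose_scale_factor([0.7, 0.3]))  # normalized weights
print(choose_scale_factor([7, 3]))      # integer weights
```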

@ghanse (Collaborator) left a comment

Looking good. Left some small comments.

anupkalburgi and others added 2 commits January 28, 2026 09:39
- Convert test_weighted_boolean.py from unittest to pytest style
- Add defensive ValueError when scale=1 with normalize=True in
  _getScaledIntSQLExpression to prevent silent division by zero
- Replace try_divide with explicit validation and regular division

Co-Authored-By: Claude (databricks-claude-opus-4-5) <noreply@anthropic.com>
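
The defensive check described in this commit might look roughly like the sketch below. It is not the body of `_getScaledIntSQLExpression`: the function name `scaled_int_sql_expression`, its signature, and the assumption that normalization divides by `scale - 1` are all illustrative.

```python
def scaled_int_sql_expression(col_name, scale, normalize=False):
    """Return a SQL expression mapping col_name into [0, scale).

    With normalize=True the result is divided by (scale - 1), which is
    an assumption made for this sketch. That divisor is 0 when scale
    is 1, so the combination is rejected up front instead of producing
    a division by zero inside the generated SQL.
    """
    if normalize and scale == 1:
        raise ValueError("scale must be greater than 1 when normalize=True")
    expr = f"({col_name} % {scale})"
    if normalize:
        expr = f"({expr} / {scale - 1})"
    return expr


print(scaled_int_sql_expression("id", 10, normalize=True))
```

Failing at expression-build time surfaces the problem as a Python exception rather than as NULLs or runtime errors in the generated data, which is the trade-off between explicit validation and `try_divide`.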
@anupkalburgi requested a review from ghanse on January 28, 2026 15:58

@ghanse (Collaborator) left a comment

LGTM

@ghanse merged commit 4687b8c into master on Jan 28, 2026 (5 checks passed)
@ghanse deleted the ak/fix-bool-data branch on January 28, 2026 21:29