Skip to content

table: fix CanSkip incorrectly omitting null column ID in row encoding#66603

Open
Benjamin2037 wants to merge 5 commits intopingcap:masterfrom
Benjamin2037:fix/canskip-null-column-encoding-61709
Open

table: fix CanSkip incorrectly omitting null column ID in row encoding#66603
Benjamin2037 wants to merge 5 commits intopingcap:masterfrom
Benjamin2037:fix/canskip-null-column-encoding-61709

Conversation

@Benjamin2037
Copy link
Collaborator

@Benjamin2037 Benjamin2037 commented Feb 28, 2026

What problem does this PR solve?

Issue Number: close #61709

Problem Summary:

CanSkip() in pkg/table/tables/tables.go incorrectly omits null column IDs from the Row Format v2 encoded row data. When a column has DefaultValue=nil, OriginDefaultValue=nil, and the value is NULL, CanSkip returns true, causing the column ID to be absent from both the not-null and null column ID arrays.

This violates the Row Format v2 spec which explicitly states:

"we can not omit the null column ID because if the column ID is not found in the row, we will use the default value in the schema which may not be null"

Root Cause

The Bug

In CanSkip() (pkg/table/tables/tables.go), the following condition was introduced in PR #12634 (merged 2020-01-02) as a storage optimization:

if col.GetDefaultValue() == nil && value.IsNull() && col.GetOriginDefaultValue() == nil {
    return true  // BUG: omits null column ID from encoded row
}

This optimization assumed: "if both defaults are nil and the value is NULL, we can skip encoding this column because the decoder will produce NULL anyway." This assumption holds within TiDB but breaks across components that have different decode-fallback logic.

Trigger Condition

The bug requires a specific DDL sequence:

  1. Table has a NOT NULL DEFAULT <value> column (e.g., c INT NOT NULL DEFAULT 42)
  2. ALTER TABLE changes the column to NULL (NOT NULL → NULL DDL)
  3. After DDL, both DefaultValue and OriginDefaultValue are reset to nil in TiDB's schema
  4. INSERT with NULL value → CanSkip returns true → null column ID not encoded
  5. Downstream component decodes the row → column ID missing → falls back to origin_default_value from its own schema cache → may use the pre-DDL default value (e.g., 42)

Cross-Component Behavior Difference

When a column ID is missing from the encoded row, different components handle it differently:

Component Behavior when column ID is missing Result
TiDB findColIDnotFound=truedefDatum callback → nil → NULL ✅ Correct (by coincidence)
TiFlash decodeRowV2 → column not found → falls back to origin_default_value from schema ❌ May return old default value (e.g., 42)
cloud-storage-engine Same as TiFlash ❌ May return old default value
TiCDC Uses TiDB's rowcodec ✅ Correct

This is the core of the data inconsistency: TiDB happens to produce the correct result through its defDatum callback returning nil, but TiFlash's decode path uses origin_default_value which may still hold the pre-DDL non-NULL default.

Why It Stayed Hidden for ~5 Years

The optimization was "locally correct but globally wrong":

  • Within TiDB's own decode path, the result is always correct (NULL) regardless of whether the column ID is encoded or not
  • The bug only manifests when: (1) a specific DDL sequence occurs, AND (2) a cross-component read happens (TiFlash/cloud-storage-engine)
  • Code review in PR *: use new row-format in tidb #12634 focused on TiDB correctness, not cross-component encoding contract compliance

History

What changed and how does it work?

  1. Removed the null-skip condition from CanSkip() (3 lines deleted):

    // REMOVED:
    if col.GetDefaultValue() == nil && value.IsNull() && col.GetOriginDefaultValue() == nil {
        return true
    }
  2. Updated comments to document why this condition was removed, referencing the Row Format v2 spec and issue NULL_COL_IDS is not filled when encode row to bytes #61709.

  3. Added regression tests in canskip_test.go (7 test cases):

    • TestCanSkipNullColumnAfterNotNullToNullDDL — core regression test for the NOT NULL→NULL DDL scenario
    • TestCanSkipNullColumnAlwaysNullable — nullable column with nil defaults should not skip
    • TestCanSkipPKColumn — PK columns still skip (unchanged behavior)
    • TestCanSkipVirtualGeneratedColumn — virtual generated columns still skip (unchanged behavior)
    • TestCanSkipNonNullValueWithDefault — non-null values don't skip (unchanged behavior)
    • TestCanSkipNullColumnWithNonNilDefault — null value with non-nil default doesn't skip (unchanged behavior)
    • TestCanSkipEndToEndNotNullToNull — end-to-end DDL chain test

After this fix, null column IDs are always encoded in the row data, ensuring all downstream components can correctly distinguish between "value is NULL" and "column not found (use default)".

Impact on TiDB itself: None. All three TiDB decoders (DatumMapDecoder, ChunkDecoder, BytesDecoder) produce identical results before and after the fix. The decode path changes from notFound=true → defDatum callback → NULL to isNil=true → NULL (more direct, same result).

Performance impact: Negligible. Each affected column adds 1-4 bytes to the encoded row. The trigger scenario is narrow (nullable column with nil defaults and NULL value).

Backward compatibility: Safe. Old rows (encoded before fix) continue to decode correctly. New rows are more correct. Mixed reads are safe. Rollback is safe.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Fix a bug that after changing a column from `NOT NULL` to `NULL` via `ALTER TABLE`, inserting `NULL` values could cause data inconsistency between TiKV and TiFlash. The `CanSkip` function incorrectly omitted null column IDs from the Row Format v2 encoded row, violating the row format specification. Downstream components (TiFlash) could misinterpret the missing column ID as "use default value" instead of "value is NULL". Note: this fix only applies to newly written rows. Rows written before the upgrade may still contain the incorrect encoding (missing null column ID). To fix historical data, users should UPDATE the affected rows or recreate the affected tables after upgrading.

Summary by CodeRabbit

  • Tests

    • Added comprehensive unit and end-to-end tests validating encoding and query behavior for NULL vs non-NULL values, primary-key and virtual-generated columns, and NOT NULL→NULL schema changes.
  • Bug Fixes

    • Adjusted skip logic for column encoding so NULL values (including columns altered from NOT NULL to NULL) are encoded correctly, preventing incorrect omission and ensuring accurate query results.

pingcap#61709)

Remove the null-skip condition in CanSkip that violates Row Format v2 spec. When DefaultValue=nil, OriginDefaultValue=nil, and value is NULL, the column ID was omitted from the encoded row. After NOT NULL→NULL DDL changes, this caused TiFlash to misinterpret missing null column IDs as 'use default' instead of 'value is NULL', leading to data inconsistency between TiKV and TiFlash.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Feb 28, 2026
@pantheon-ai
Copy link

pantheon-ai bot commented Feb 28, 2026

Review Complete

Findings: 3 issues
Posted: 2
Duplicates/Skipped: 1

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 28, 2026
@ti-chi-bot
Copy link

ti-chi-bot bot commented Feb 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gmhdbjd for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link

coderabbitai bot commented Feb 28, 2026

📝 Walkthrough

Walkthrough

Simplifies CanSkip logic in table row encoding to avoid omitting NULL column IDs after NOT NULL→NULL DDL, and adds a new comprehensive test suite for CanSkip; also updates BUILD to include the test and tweaked test shard_count.

Changes

Cohort / File(s) Summary
CanSkip Test Coverage
pkg/table/tables/canskip_test.go
Adds comprehensive unit and integration tests for CanSkip covering NOT NULL→NULL DDL, always-nullable with nil defaults, PK handle skipping, virtual generated columns, non-null values with defaults, null with non-nil defaults, and end-to-end encoding/decoding validation.
Row Encoding Skip Logic
pkg/table/tables/tables.go
Removes prior optimization that skipped encoding NULL columns when both DefaultValue and OriginDefaultValue were nil; simplifies skip criteria to only primary-key handle, common-handle prefix, and virtual generated columns; adds explanatory comments about Row Format v2 and NOT NULL→NULL effects.
Test Target Build
pkg/table/tables/BUILD.bazel
Registers the new test file in the go_test srcs and increases the go_test shard_count from 43 to 50.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I hop through rows with whiskers bright,
No hidden NULLs in moonlit byte,
Tests patter soft to check each case,
Schemas settle back in place,
A joyful thump — encoding right.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed The PR implementation aligns with both linked issues: removes null-skip condition from CanSkip() per #61709 to ensure NULL_COL_IDS are populated, and includes regression tests validating the NOT NULL→NULL DDL fix addressed by #10682.
Out of Scope Changes check ✅ Passed All changes are in scope: CanSkip() fix in tables.go, comprehensive regression tests in canskip_test.go, and BUILD.bazel updates for test integration align with the stated objectives.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title accurately describes the main change: fixing a bug in CanSkip that incorrectly omits null column IDs in row encoding.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 28, 2026
@Benjamin2037 Benjamin2037 added the component/ddl This issue is related to DDL of TiDB. label Feb 28, 2026
@Benjamin2037 Benjamin2037 changed the title table: fix CanSkip incorrectly omitting null column ID in row encoding table: fix CanSkip incorrectly omitting null column ID in row encoding(WiP) Feb 28, 2026
@Benjamin2037 Benjamin2037 changed the title table: fix CanSkip incorrectly omitting null column ID in row encoding(WiP) table: fix CanSkip incorrectly omitting null column ID in row encoding Feb 28, 2026
@Benjamin2037 Benjamin2037 changed the title table: fix CanSkip incorrectly omitting null column ID in row encoding table: fix CanSkip incorrectly omitting null column ID in row encoding(WIP) Feb 28, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
pkg/table/tables/canskip_test.go (1)

229-252: ⚠️ Potential issue | 🟠 Major

End-to-end regression still misses the original trigger and encoding signal.

Line 230 creates c with DEFAULT 0, so this flow does not reflect the nil/nil-default trigger state from the original bug. Also, Lines 239-252 only verify SQL readback, which cannot prove null-column-ID presence in row-format encoding.

Please switch to a schema that can hit the nil/nil-default path and add an assertion on encoded row/null-column-ID behavior (not only SELECT results).

Suggested minimal SQL setup adjustment
- tk.MustExec("CREATE TABLE t_canskip (id INT PRIMARY KEY, c INT NOT NULL DEFAULT 0)")
+ tk.MustExec("CREATE TABLE t_canskip (id INT PRIMARY KEY, c INT NOT NULL)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/table/tables/canskip_test.go` around lines 229 - 252, The test currently
creates column c with DEFAULT 0 so it never exercises the original
nil/nil-default trigger; change the CREATE TABLE in this test (t_canskip /
column c) to declare c INT NOT NULL with no DEFAULT (so the NOT NULL/no-default
-> ALTER TO NULL path is hit), then modify the inserts to include one row that
omits c (so the storage-level default nil is produced) and one that sets c
explicitly; finally add an assertion that inspects the encoded storage
representation (use the existing test harness helpers around
testkit.MustExec/testkit.MustQuery or the project helper that decodes raw row
bytes) to assert the presence of the null-column-ID in the encoded row (instead
of relying solely on SQL SELECT results).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/table/tables/canskip_test.go`:
- Around line 229-252: The test currently creates column c with DEFAULT 0 so it
never exercises the original nil/nil-default trigger; change the CREATE TABLE in
this test (t_canskip / column c) to declare c INT NOT NULL with no DEFAULT (so
the NOT NULL/no-default -> ALTER TO NULL path is hit), then modify the inserts
to include one row that omits c (so the storage-level default nil is produced)
and one that sets c explicitly; finally add an assertion that inspects the
encoded storage representation (use the existing test harness helpers around
testkit.MustExec/testkit.MustQuery or the project helper that decodes raw row
bytes) to assert the presence of the null-column-ID in the encoded row (instead
of relying solely on SQL SELECT results).

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a4d4c2 and 6764dfe.

📒 Files selected for processing (2)
  • pkg/table/tables/BUILD.bazel
  • pkg/table/tables/canskip_test.go

The previous e2e test used CREATE TABLE (c INT NOT NULL DEFAULT 0) then
ALTER TABLE MODIFY COLUMN c INT NULL. After this DDL, OriginDefaultValue
remains '0' (not nil) because noReorgDataStrict returns true for same-type
INT changes, so generateOriginDefaultValue is never called.

This means the test never hit the actual bug trigger condition
(DefaultValue=nil AND OriginDefaultValue=nil AND value=NULL).

Fix: Use scenarios that actually produce nil/nil state:
- CREATE TABLE (c INT NULL) without explicit DEFAULT
- ALTER TABLE ADD COLUMN d INT NULL without explicit DEFAULT

Also add InfoSchema assertions to verify the schema state after DDL,
ensuring both DefaultValue and OriginDefaultValue are nil.
@codecov
Copy link

codecov bot commented Mar 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.5148%. Comparing base (fba827a) to head (78c9343).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #66603        +/-   ##
================================================
- Coverage   77.6883%   77.5148%   -0.1735%     
================================================
  Files          2007       1930        -77     
  Lines        549650     537831     -11819     
================================================
- Hits         427014     416899     -10115     
+ Misses       120975     120926        -49     
+ Partials       1661          6      -1655     
Flag Coverage Δ
integration 41.3762% <ø> (-6.8079%) ⬇️
unit 76.6459% <ø> (+0.2982%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 56.7974% <ø> (ø)
parser ∅ <ø> (∅)
br 48.8096% <ø> (-12.0806%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Benjamin2037
Copy link
Collaborator Author

/retest

@Benjamin2037 Benjamin2037 changed the title table: fix CanSkip incorrectly omitting null column ID in row encoding(WIP) table: fix CanSkip incorrectly omitting null column ID in row encoding Mar 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/ddl This issue is related to DDL of TiDB. do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. severity/major size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NULL_COL_IDS is not filled when encode row to bytes

1 participant