table: fix CanSkip incorrectly omitting null column ID in row encoding by Benjamin2037 · Pull Request #66603 · pingcap/tidb

Benjamin2037 · 2026-02-28T17:15:58Z

What problem does this PR solve?

Issue Number: close #61709

Problem Summary:

CanSkip() in pkg/table/tables/tables.go incorrectly omits null column IDs from the Row Format v2 encoded row data. When a column has DefaultValue=nil, OriginDefaultValue=nil, and the value is NULL, CanSkip returns true, causing the column ID to be absent from both the not-null and null column ID arrays.

This violates the Row Format v2 spec which explicitly states:

"we can not omit the null column ID because if the column ID is not found in the row, we will use the default value in the schema which may not be null"

Root Cause

The Bug

In CanSkip() (pkg/table/tables/tables.go), the following condition was introduced in PR #12634 (merged 2020-01-02) as a storage optimization:

if col.GetDefaultValue() == nil && value.IsNull() && col.GetOriginDefaultValue() == nil {
    return true  // BUG: omits null column ID from encoded row
}

This optimization assumed: "if both defaults are nil and the value is NULL, we can skip encoding this column because the decoder will produce NULL anyway." This assumption holds within TiDB but breaks across components that have different decode-fallback logic.

Trigger Condition

The bug requires a specific DDL sequence:

Table has a NOT NULL DEFAULT <value> column (e.g., c INT NOT NULL DEFAULT 42)
ALTER TABLE changes the column to NULL (NOT NULL → NULL DDL)
After DDL, both DefaultValue and OriginDefaultValue are reset to nil in TiDB's schema
INSERT with NULL value → CanSkip returns true → null column ID not encoded
Downstream component decodes the row → column ID missing → falls back to origin_default_value from its own schema cache → may use the pre-DDL default value (e.g., 42)

Cross-Component Behavior Difference

When a column ID is missing from the encoded row, different components handle it differently:

Component	Behavior when column ID is missing	Result
TiDB	`findColID` → `notFound=true` → `defDatum` callback → nil → NULL	✅ Correct (by coincidence)
TiFlash	`decodeRowV2` → column not found → falls back to `origin_default_value` from schema	❌ May return old default value (e.g., 42)
cloud-storage-engine	Same as TiFlash	❌ May return old default value
TiCDC	Uses TiDB's rowcodec	✅ Correct

This is the core of the data inconsistency: TiDB happens to produce the correct result through its defDatum callback returning nil, but TiFlash's decode path uses origin_default_value which may still hold the pre-DDL non-NULL default.

Why It Stayed Hidden for ~5 Years

The optimization was "locally correct but globally wrong":

Within TiDB's own decode path, the result is always correct (NULL) regardless of whether the column ID is encoded or not
The bug only manifests when: (1) a specific DDL sequence occurs, AND (2) a cross-component read happens (TiFlash/cloud-storage-engine)
Code review in PR *: use new row-format in tidb #12634 focused on TiDB correctness, not cross-component encoding contract compliance

History

2018-07-19: Row Format v2 design doc , explicitly warning against omitting null column IDs
2020-01-02: PR *: use new row-format in tidb #12634 merged, introducing the null-skip optimization
~5 years latent: Bug required specific DDL sequence + cross-component reads to manifest
2025-06-12: Issue NULL_COL_IDS is not filled when encode row to bytes #61709 filed
TiFlash defensive fix: ddl: Fix default value filling with finer granularity tiflash#10682

What changed and how does it work?

Removed the null-skip condition from CanSkip() (3 lines deleted):

// REMOVED:
if col.GetDefaultValue() == nil && value.IsNull() && col.GetOriginDefaultValue() == nil {
    return true
}

Updated comments to document why this condition was removed, referencing the Row Format v2 spec and issue NULL_COL_IDS is not filled when encode row to bytes #61709.
Added regression tests in canskip_test.go (7 test cases):
- TestCanSkipNullColumnAfterNotNullToNullDDL — core regression test for the NOT NULL→NULL DDL scenario
- TestCanSkipNullColumnAlwaysNullable — nullable column with nil defaults should not skip
- TestCanSkipPKColumn — PK columns still skip (unchanged behavior)
- TestCanSkipVirtualGeneratedColumn — virtual generated columns still skip (unchanged behavior)
- TestCanSkipNonNullValueWithDefault — non-null values don't skip (unchanged behavior)
- TestCanSkipNullColumnWithNonNilDefault — null value with non-nil default doesn't skip (unchanged behavior)
- TestCanSkipEndToEndNotNullToNull — end-to-end DDL chain test

After this fix, null column IDs are always encoded in the row data, ensuring all downstream components can correctly distinguish between "value is NULL" and "column not found (use default)".

Impact on TiDB itself: None. All three TiDB decoders (DatumMapDecoder, ChunkDecoder, BytesDecoder) produce identical results before and after the fix. The decode path changes from notFound=true → defDatum callback → NULL to isNil=true → NULL (more direct, same result).

Performance impact: Negligible. Each affected column adds 1-4 bytes to the encoded row. The trigger scenario is narrow (nullable column with nil defaults and NULL value).

Backward compatibility: Safe. Old rows (encoded before fix) continue to decode correctly. New rows are more correct. Mixed reads are safe. Rollback is safe.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Fix a bug that after changing a column from `NOT NULL` to `NULL` via `ALTER TABLE`, inserting `NULL` values could cause data inconsistency between TiKV and TiFlash. The `CanSkip` function incorrectly omitted null column IDs from the Row Format v2 encoded row, violating the row format specification. Downstream components (TiFlash) could misinterpret the missing column ID as "use default value" instead of "value is NULL". Note: this fix only applies to newly written rows. Rows written before the upgrade may still contain the incorrect encoding (missing null column ID). To fix historical data, users should UPDATE the affected rows or recreate the affected tables after upgrading.

Summary by CodeRabbit

Tests
- Added comprehensive unit and end-to-end tests validating encoding and query behavior for NULL vs non-NULL values, primary-key and virtual-generated columns, and NOT NULL→NULL schema changes.
Bug Fixes
- Adjusted skip logic for column encoding so NULL values (including columns altered from NOT NULL to NULL) are encoded correctly, preventing incorrect omission and ensuring accurate query results.

pingcap#61709) Remove the null-skip condition in CanSkip that violates Row Format v2 spec. When DefaultValue=nil, OriginDefaultValue=nil, and value is NULL, the column ID was omitted from the encoded row. After NOT NULL→NULL DDL changes, this caused TiFlash to misinterpret missing null column IDs as 'use default' instead of 'value is NULL', leading to data inconsistency between TiKV and TiFlash. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

pantheon-ai · 2026-02-28T17:16:01Z

Review Complete

Findings: 3 issues
Posted: 2
Duplicates/Skipped: 1

ti-chi-bot · 2026-02-28T17:16:09Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gmhdbjd for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/table/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-02-28T17:16:35Z

📝 Walkthrough

Walkthrough

Simplifies CanSkip logic in table row encoding to avoid omitting NULL column IDs after NOT NULL→NULL DDL, and adds a new comprehensive test suite for CanSkip; also updates BUILD to include the test and tweaked test shard_count.

Changes

Cohort / File(s)	Summary
CanSkip Test Coverage `pkg/table/tables/canskip_test.go`	Adds comprehensive unit and integration tests for CanSkip covering NOT NULL→NULL DDL, always-nullable with nil defaults, PK handle skipping, virtual generated columns, non-null values with defaults, null with non-nil defaults, and end-to-end encoding/decoding validation.
Row Encoding Skip Logic `pkg/table/tables/tables.go`	Removes prior optimization that skipped encoding NULL columns when both DefaultValue and OriginDefaultValue were nil; simplifies skip criteria to only primary-key handle, common-handle prefix, and virtual generated columns; adds explanatory comments about Row Format v2 and NOT NULL→NULL effects.
Test Target Build `pkg/table/tables/BUILD.bazel`	Registers the new test file in the go_test srcs and increases the go_test shard_count from 43 to 50.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I hop through rows with whiskers bright,
No hidden NULLs in moonlit byte,
Tests patter soft to check each case,
Schemas settle back in place,
A joyful thump — encoding right.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	The PR implementation aligns with both linked issues: removes null-skip condition from CanSkip() per `#61709` to ensure NULL_COL_IDS are populated, and includes regression tests validating the NOT NULL→NULL DDL fix addressed by `#10682`.
Out of Scope Changes check	✅ Passed	All changes are in scope: CanSkip() fix in tables.go, comprehensive regression tests in canskip_test.go, and BUILD.bazel updates for test integration align with the stated objectives.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check	✅ Passed	The title accurately describes the main change: fixing a bug in CanSkip that incorrectly omits null column IDs in row encoding.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

pkg/table/tables/canskip_test.go

coderabbitai

♻️ Duplicate comments (1)

pkg/table/tables/canskip_test.go (1)
229-252: ⚠️ Potential issue | 🟠 Major

End-to-end regression still misses the original trigger and encoding signal.

Line 230 creates c with DEFAULT 0, so this flow does not reflect the nil/nil-default trigger state from the original bug. Also, Lines 239-252 only verify SQL readback, which cannot prove null-column-ID presence in row-format encoding.

Please switch to a schema that can hit the nil/nil-default path and add an assertion on encoded row/null-column-ID behavior (not only SELECT results).
Suggested minimal SQL setup adjustment
- tk.MustExec("CREATE TABLE t_canskip (id INT PRIMARY KEY, c INT NOT NULL DEFAULT 0)")
+ tk.MustExec("CREATE TABLE t_canskip (id INT PRIMARY KEY, c INT NOT NULL)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/table/tables/canskip_test.go` around lines 229 - 252, The test currently
creates column c with DEFAULT 0 so it never exercises the original
nil/nil-default trigger; change the CREATE TABLE in this test (t_canskip /
column c) to declare c INT NOT NULL with no DEFAULT (so the NOT NULL/no-default
-> ALTER TO NULL path is hit), then modify the inserts to include one row that
omits c (so the storage-level default nil is produced) and one that sets c
explicitly; finally add an assertion that inspects the encoded storage
representation (use the existing test harness helpers around
testkit.MustExec/testkit.MustQuery or the project helper that decodes raw row
bytes) to assert the presence of the null-column-ID in the encoded row (instead
of relying solely on SQL SELECT results).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/table/tables/canskip_test.go`:
- Around line 229-252: The test currently creates column c with DEFAULT 0 so it
never exercises the original nil/nil-default trigger; change the CREATE TABLE in
this test (t_canskip / column c) to declare c INT NOT NULL with no DEFAULT (so
the NOT NULL/no-default -> ALTER TO NULL path is hit), then modify the inserts
to include one row that omits c (so the storage-level default nil is produced)
and one that sets c explicitly; finally add an assertion that inspects the
encoded storage representation (use the existing test harness helpers around
testkit.MustExec/testkit.MustQuery or the project helper that decodes raw row
bytes) to assert the presence of the null-column-ID in the encoded row (instead
of relying solely on SQL SELECT results).

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a4d4c2 and 6764dfe.

📒 Files selected for processing (2)

pkg/table/tables/BUILD.bazel
pkg/table/tables/canskip_test.go

The previous e2e test used CREATE TABLE (c INT NOT NULL DEFAULT 0) then ALTER TABLE MODIFY COLUMN c INT NULL. After this DDL, OriginDefaultValue remains '0' (not nil) because noReorgDataStrict returns true for same-type INT changes, so generateOriginDefaultValue is never called. This means the test never hit the actual bug trigger condition (DefaultValue=nil AND OriginDefaultValue=nil AND value=NULL). Fix: Use scenarios that actually produce nil/nil state: - CREATE TABLE (c INT NULL) without explicit DEFAULT - ALTER TABLE ADD COLUMN d INT NULL without explicit DEFAULT Also add InfoSchema assertions to verify the schema state after DDL, ensuring both DefaultValue and OriginDefaultValue are nil.

codecov · 2026-03-01T00:33:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.5148%. Comparing base (fba827a) to head (78c9343).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #66603        +/-   ##
================================================
- Coverage   77.6883%   77.5148%   -0.1735%     
================================================
  Files          2007       1930        -77     
  Lines        549650     537831     -11819     
================================================
- Hits         427014     416899     -10115     
+ Misses       120975     120926        -49     
+ Partials       1661          6      -1655

Flag	Coverage Δ
integration	`41.3762% <ø> (-6.8079%)`	⬇️
unit	`76.6459% <ø> (+0.2982%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`56.7974% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`48.8096% <ø> (-12.0806%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Benjamin2037 · 2026-03-01T00:37:17Z

/retest

ti-chi-bot bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Feb 28, 2026

ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 28, 2026

table: fix gofmt/gci comment formatting in CanSkip

4a4d4c2

ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 28, 2026

Benjamin2037 added the component/ddl This issue is related to DDL of TiDB. label Feb 28, 2026

pantheon-ai bot reviewed Feb 28, 2026

View reviewed changes

pkg/table/tables/canskip_test.go Outdated Show resolved Hide resolved

pkg/table/tables/canskip_test.go Outdated Show resolved Hide resolved

table: fix invalid import in canskip_test.go (parser/model → parser/ast)

b21d013

Benjamin2037 changed the title ~~table: fix CanSkip incorrectly omitting null column ID in row encoding~~ table: fix CanSkip incorrectly omitting null column ID in row encoding（WiP） Feb 28, 2026

Benjamin2037 changed the title ~~table: fix CanSkip incorrectly omitting null column ID in row encoding（WiP）~~ table: fix CanSkip incorrectly omitting null column ID in row encoding Feb 28, 2026

table: update BUILD.bazel for canskip_test.go and shard_count

6764dfe

Benjamin2037 changed the title ~~table: fix CanSkip incorrectly omitting null column ID in row encoding~~ table: fix CanSkip incorrectly omitting null column ID in row encoding（WIP） Feb 28, 2026

coderabbitai bot reviewed Feb 28, 2026

View reviewed changes

Benjamin2037 added the severity/major label Mar 1, 2026

Benjamin2037 changed the title ~~table: fix CanSkip incorrectly omitting null column ID in row encoding（WIP）~~ table: fix CanSkip incorrectly omitting null column ID in row encoding Mar 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

table: fix CanSkip incorrectly omitting null column ID in row encoding#66603

table: fix CanSkip incorrectly omitting null column ID in row encoding#66603
Benjamin2037 wants to merge 5 commits intopingcap:masterfrom
Benjamin2037:fix/canskip-null-column-encoding-61709

Benjamin2037 commented Feb 28, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

pantheon-ai bot commented Feb 28, 2026 •

edited

Loading

Uh oh!

ti-chi-bot bot commented Feb 28, 2026

Uh oh!

coderabbitai bot commented Feb 28, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

codecov bot commented Mar 1, 2026 •

edited

Loading

Uh oh!

Benjamin2037 commented Mar 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Benjamin2037 commented Feb 28, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Root Cause

The Bug

Trigger Condition

Cross-Component Behavior Difference

Why It Stayed Hidden for ~5 Years

History

What changed and how does it work?

Check List

Release note

Summary by CodeRabbit

Uh oh!

pantheon-ai bot commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ti-chi-bot bot commented Feb 28, 2026

Uh oh!

coderabbitai bot commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Mar 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Benjamin2037 commented Mar 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Benjamin2037 commented Feb 28, 2026 •

edited by coderabbitai bot

Loading

pantheon-ai bot commented Feb 28, 2026 •

edited

Loading

coderabbitai bot commented Feb 28, 2026 •

edited

Loading

codecov bot commented Mar 1, 2026 •

edited

Loading