table: fix CanSkip incorrectly omitting null column ID in row encoding#66603
table: fix CanSkip incorrectly omitting null column ID in row encoding#66603Benjamin2037 wants to merge 5 commits intopingcap:masterfrom
Conversation
pingcap#61709) Remove the null-skip condition in CanSkip that violates Row Format v2 spec. When DefaultValue=nil, OriginDefaultValue=nil, and value is NULL, the column ID was omitted from the encoded row. After NOT NULL→NULL DDL changes, this caused TiFlash to misinterpret missing null column IDs as 'use default' instead of 'value is NULL', leading to data inconsistency between TiKV and TiFlash. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
|
Review Complete Findings: 3 issues |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughSimplifies CanSkip logic in table row encoding to avoid omitting NULL column IDs after NOT NULL→NULL DDL, and adds a new comprehensive test suite for CanSkip; also updates BUILD to include the test and tweaked test shard_count. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/table/tables/canskip_test.go (1)
229-252:⚠️ Potential issue | 🟠 MajorEnd-to-end regression still misses the original trigger and encoding signal.
Line 230 creates
cwithDEFAULT 0, so this flow does not reflect the nil/nil-default trigger state from the original bug. Also, Lines 239-252 only verify SQL readback, which cannot prove null-column-ID presence in row-format encoding.Please switch to a schema that can hit the nil/nil-default path and add an assertion on encoded row/null-column-ID behavior (not only
SELECTresults).Suggested minimal SQL setup adjustment
- tk.MustExec("CREATE TABLE t_canskip (id INT PRIMARY KEY, c INT NOT NULL DEFAULT 0)") + tk.MustExec("CREATE TABLE t_canskip (id INT PRIMARY KEY, c INT NOT NULL)")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/table/tables/canskip_test.go` around lines 229 - 252, The test currently creates column c with DEFAULT 0 so it never exercises the original nil/nil-default trigger; change the CREATE TABLE in this test (t_canskip / column c) to declare c INT NOT NULL with no DEFAULT (so the NOT NULL/no-default -> ALTER TO NULL path is hit), then modify the inserts to include one row that omits c (so the storage-level default nil is produced) and one that sets c explicitly; finally add an assertion that inspects the encoded storage representation (use the existing test harness helpers around testkit.MustExec/testkit.MustQuery or the project helper that decodes raw row bytes) to assert the presence of the null-column-ID in the encoded row (instead of relying solely on SQL SELECT results).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/table/tables/canskip_test.go`:
- Around line 229-252: The test currently creates column c with DEFAULT 0 so it
never exercises the original nil/nil-default trigger; change the CREATE TABLE in
this test (t_canskip / column c) to declare c INT NOT NULL with no DEFAULT (so
the NOT NULL/no-default -> ALTER TO NULL path is hit), then modify the inserts
to include one row that omits c (so the storage-level default nil is produced)
and one that sets c explicitly; finally add an assertion that inspects the
encoded storage representation (use the existing test harness helpers around
testkit.MustExec/testkit.MustQuery or the project helper that decodes raw row
bytes) to assert the presence of the null-column-ID in the encoded row (instead
of relying solely on SQL SELECT results).
The previous e2e test used CREATE TABLE (c INT NOT NULL DEFAULT 0) then ALTER TABLE MODIFY COLUMN c INT NULL. After this DDL, OriginDefaultValue remains '0' (not nil) because noReorgDataStrict returns true for same-type INT changes, so generateOriginDefaultValue is never called. This means the test never hit the actual bug trigger condition (DefaultValue=nil AND OriginDefaultValue=nil AND value=NULL). Fix: Use scenarios that actually produce nil/nil state: - CREATE TABLE (c INT NULL) without explicit DEFAULT - ALTER TABLE ADD COLUMN d INT NULL without explicit DEFAULT Also add InfoSchema assertions to verify the schema state after DDL, ensuring both DefaultValue and OriginDefaultValue are nil.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #66603 +/- ##
================================================
- Coverage 77.6883% 77.5148% -0.1735%
================================================
Files 2007 1930 -77
Lines 549650 537831 -11819
================================================
- Hits 427014 416899 -10115
+ Misses 120975 120926 -49
+ Partials 1661 6 -1655
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
/retest |
What problem does this PR solve?
Issue Number: close #61709
Problem Summary:
CanSkip()inpkg/table/tables/tables.goincorrectly omits null column IDs from the Row Format v2 encoded row data. When a column hasDefaultValue=nil,OriginDefaultValue=nil, and the value isNULL,CanSkipreturnstrue, causing the column ID to be absent from both the not-null and null column ID arrays.This violates the Row Format v2 spec which explicitly states:
Root Cause
The Bug
In
CanSkip()(pkg/table/tables/tables.go), the following condition was introduced in PR #12634 (merged 2020-01-02) as a storage optimization:This optimization assumed: "if both defaults are nil and the value is NULL, we can skip encoding this column because the decoder will produce NULL anyway." This assumption holds within TiDB but breaks across components that have different decode-fallback logic.
Trigger Condition
The bug requires a specific DDL sequence:
NOT NULL DEFAULT <value>column (e.g.,c INT NOT NULL DEFAULT 42)ALTER TABLEchanges the column toNULL(NOT NULL → NULL DDL)DefaultValueandOriginDefaultValueare reset tonilin TiDB's schemaINSERTwithNULLvalue →CanSkipreturnstrue→ null column ID not encodedorigin_default_valuefrom its own schema cache → may use the pre-DDL default value (e.g.,42)Cross-Component Behavior Difference
When a column ID is missing from the encoded row, different components handle it differently:
findColID→notFound=true→defDatumcallback → nil → NULLdecodeRowV2→ column not found → falls back toorigin_default_valuefrom schemaThis is the core of the data inconsistency: TiDB happens to produce the correct result through its
defDatumcallback returning nil, but TiFlash's decode path usesorigin_default_valuewhich may still hold the pre-DDL non-NULL default.Why It Stayed Hidden for ~5 Years
The optimization was "locally correct but globally wrong":
History
NULL_COL_IDSis not filled when encode row to bytes #61709 filedWhat changed and how does it work?
Removed the null-skip condition from
CanSkip()(3 lines deleted):Updated comments to document why this condition was removed, referencing the Row Format v2 spec and issue
NULL_COL_IDSis not filled when encode row to bytes #61709.Added regression tests in
canskip_test.go(7 test cases):TestCanSkipNullColumnAfterNotNullToNullDDL— core regression test for the NOT NULL→NULL DDL scenarioTestCanSkipNullColumnAlwaysNullable— nullable column with nil defaults should not skipTestCanSkipPKColumn— PK columns still skip (unchanged behavior)TestCanSkipVirtualGeneratedColumn— virtual generated columns still skip (unchanged behavior)TestCanSkipNonNullValueWithDefault— non-null values don't skip (unchanged behavior)TestCanSkipNullColumnWithNonNilDefault— null value with non-nil default doesn't skip (unchanged behavior)TestCanSkipEndToEndNotNullToNull— end-to-end DDL chain testAfter this fix, null column IDs are always encoded in the row data, ensuring all downstream components can correctly distinguish between "value is NULL" and "column not found (use default)".
Impact on TiDB itself: None. All three TiDB decoders (
DatumMapDecoder,ChunkDecoder,BytesDecoder) produce identical results before and after the fix. The decode path changes fromnotFound=true → defDatum callback → NULLtoisNil=true → NULL(more direct, same result).Performance impact: Negligible. Each affected column adds 1-4 bytes to the encoded row. The trigger scenario is narrow (nullable column with nil defaults and NULL value).
Backward compatibility: Safe. Old rows (encoded before fix) continue to decode correctly. New rows are more correct. Mixed reads are safe. Rollback is safe.
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
Tests
Bug Fixes