Skip to content

roachtest: randomize volume type and XFS#159010

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
golgeek:ludo/rt-random-xfs
Dec 22, 2025
Merged

roachtest: randomize volume type and XFS#159010
craig[bot] merged 1 commit intocockroachdb:masterfrom
golgeek:ludo/rt-random-xfs

Conversation

@golgeek
Copy link
Contributor

@golgeek golgeek commented Dec 8, 2025

Prior to this patch, tests had three possiblities for volume types:

  1. Specify PreferLocalSSD() DisableLocalSSD() options.
  2. Force a specific volume type via VolumeType() (volume type being provider specific, this was only possible for tests that target a single provider).
  3. Rely on the default behavior for PreferLocalSSD() and use the default volume type in roachprod for the provider in case local SSD was not selected.

The way PreferLocalSSD() was implemented meant that most of the tests were actually running on local SSDs (when available on the machine type), leading to a gap in the testing strategy with regards to volume types.

This patch brings the new ClusterSpec option RandomizeVolumeType(). In case volume type is not forced and neither PreferLocalSSD() or DisableLocalSSD() are specified, specifying RandomizeVolumeType() will take precendence over the --local-ssd argument and will randomly select a volume type from those available in the targeted provider:

  • AWS: gp3, io2, local SSD
  • GCE: pd-ssd, local SSD
  • Azure: premium-ssd, premium-ssd-v2, ultra-disk, local SSD
  • IBM: 10iops-tier

Note: volume type randomization gives the same weight to all options.

This patch also introduces a random chance of provisioning XFS as a filesystem via the option RandomlyUseXfs().
The option is built in the same way as RandomlyUseZfs(), granting a 20% chance of XFS if present.
If both RandomlyUseZfs() and RandomlyUseXfs() are used, they both get a 20% chance, leaving 60% chance for the default ext4.

All/most of the KV tests, and a few admission-control are switched to volume type/filesystem randomization as requested in this issue.

Epic: none
Fixes: 146661
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@golgeek golgeek force-pushed the ludo/rt-random-xfs branch 2 times, most recently from a9e44dc to 32683f0 Compare December 8, 2025 23:19
@golgeek
Copy link
Contributor Author

golgeek commented Dec 9, 2025

TC runs (focus on kv/admission-control):

  • Azure
  • AWS
  • GCE
  • and an extra general run on Azure (cloud with the most variability in the randomization).

@golgeek golgeek added the backport-25.2.x Flags PRs that need to be backported to 25.2 label Dec 9, 2025
@golgeek golgeek force-pushed the ludo/rt-random-xfs branch 3 times, most recently from 68aa2c2 to f9ad6ed Compare December 10, 2025 21:15
@golgeek golgeek added the backport-26.1.x Flags PRs that need to be backported to 26.1 label Dec 11, 2025
@golgeek
Copy link
Contributor Author

golgeek commented Dec 11, 2025

Backport to 25.2 requires this one to be approved first, and chances the automated backport will fail.

@golgeek golgeek marked this pull request as ready for review December 11, 2025 20:11
@golgeek golgeek requested a review from a team as a code owner December 11, 2025 20:11
@golgeek golgeek requested review from herkolategan, williamchoe3 and xxmplus and removed request for a team December 11, 2025 20:11
@github-actions
Copy link

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Dec 11, 2025
@github-actions
Copy link

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

Prior to this patch, tests had three possiblities for volume types:
1. Specify `PreferLocalSSD()` `DisableLocalSSD()` options.
2. Force a specific volume type via `VolumeType()` (volume type being
   provider specific, this was only possible for tests that target
   a single provider).
3. Rely on the default behavior for `PreferLocalSSD()` and use the
   default volume type in roachprod for the provider in case local SSD
   was not selected.

The way `PreferLocalSSD()` was implemented meant that most of the tests
were actually running on local SSDs (when available on the machine
type), leading to a gap in the testing strategy with regards to volume
types.

This patch brings the new ClusterSpec option `RandomizeVolumeType()`.
In case volume type is not forced and neither `PreferLocalSSD()` or
`DisableLocalSSD()` are specified, specifying `RandomizeVolumeType()`
will take precendence over the `--local-ssd` argument and will randomly
select a volume type from those available in the targeted provider:
- AWS: gp3, io2, local SSD
- GCE: pd-ssd, local SSD
- Azure: premium-ssd, premium-ssd-v2, ultra-disk, local SSD
- IBM: 10iops-tier

Note: volume type randomization gives the same weight to all options.

This patch also introduces a random chance of provisioning XFS as a
filesystem via the option `RandomlyUseXfs()`.
The option is built in the same way as `RandomlyUseZfs()`, granting a
20% chance of XFS if present.
If both `RandomlyUseZfs()` and `RandomlyUseXfs()` are used, they both
get a 20% chance, leaving 60% chance for the default ext4.

Epic: none
Fixes: 146661
Release note: None
Copy link
Contributor

@nameisbhaskar nameisbhaskar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good! Added a comment to add log lines to the randomise logic code for easy debugging.

if s.RandomlyUseZfs || s.RandomlyUseXfs {
rng, _ := randutil.NewPseudoRand()
if rng.Float64() <= 0.2 {
randFloat := rng.Float64()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may help if we add some log lines in this logic. Given the number of conditions, it may be difficult to debug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a logger at hand in this code path so it's not really easy to log, especially without messing up the -retry_[1-3] logger used at the caller level.

The drawing logic is fairly straightforward, and the outcome is already exposed via ExposedMetamorphicInfo which will be propagated in the GH issue.

I'd be inclined to just merge this without complicating things and reconsider later if needed.

@golgeek
Copy link
Contributor Author

golgeek commented Dec 22, 2025

TFTRs!

bors r=nameisbhaskar,williamchoe3,herkolategan

@craig
Copy link
Contributor

craig bot commented Dec 22, 2025

@craig craig bot merged commit 86f34c6 into cockroachdb:master Dec 22, 2025
24 checks passed
@blathers-crl
Copy link

blathers-crl bot commented Dec 22, 2025

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from 08b6b00 to blathers/backport-release-25.2-159010: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 25.2.x failed. See errors above.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

golgeek added a commit to golgeek/cockroach that referenced this pull request Jan 14, 2026
After cockroachdb#159010 introduced metamorphic volume types, an issue could occur
when a test requested a specific count via `SSD(n)` or `VolumeCount(n)`,
but the randomly selected volume type didn't match the option used.

For example, a test specifying `SSD(8)` would get only 1 disk
when `RandomizeVolumeType()` selected `pd-ssd`, because it only applied
to local SSDs while `VolumeCount` (which controls persistent disk count)
defaulted to 0.

This patch unifies `SSD()` and `VolumeCount()` into a single `Disks()`
option that specifies the number of disks regardless of storage type.
The `DiskCount` field replaces both `SSDs` and `VolumeCount`
in `ClusterSpec`, and `getGCEOpts` now routes this value to either
`SSDCount` or `PDVolumeCount` based on the type of storage used.

Multi-disk configurations with persistent disks work on all clouds.
Multi-disk local SSDs remain GCE-only; when local SSD is selected
on other clouds with DiskCount > 1, the system gracefully falls back
to persistent disks.

Fixes: cockroachdb#160518
Closes: cockroachdb#160187
Closes: cockroachdb#160205
Closes: cockroachdb#160225
Closes: cockroachdb#160542
Closes: cockroachdb#160695
Closes: cockroachdb#161058

Release note: None
golgeek added a commit to golgeek/cockroach that referenced this pull request Jan 14, 2026
After cockroachdb#159010 introduced metamorphic volume types, an issue could occur
when a test requested a specific count via `SSD(n)` or `VolumeCount(n)`,
but the randomly selected volume type didn't match the option used.

For example, a test specifying `SSD(8)` would get only 1 disk
when `RandomizeVolumeType()` selected `pd-ssd`, because it only applied
to local SSDs while `VolumeCount` (which controls persistent disk count)
defaulted to 0.

This patch unifies `SSD()` and `VolumeCount()` into a single `Disks()`
option that specifies the number of disks regardless of storage type.
The `DiskCount` field replaces both `SSDs` and `VolumeCount`
in `ClusterSpec`, and `getGCEOpts` now routes this value to either
`SSDCount` or `PDVolumeCount` based on the type of storage used.

Multi-disk configurations with persistent disks work on all clouds.
Multi-disk local SSDs remain GCE-only; when local SSD is selected
on other clouds with DiskCount > 1, the system gracefully falls back
to persistent disks.

Fixes: cockroachdb#160518
Closes: cockroachdb#160187
Closes: cockroachdb#160205
Closes: cockroachdb#160225
Closes: cockroachdb#160542
Closes: cockroachdb#160695
Closes: cockroachdb#161058
Closes: cockroachdb#161064

Release note: None
craig bot pushed a commit that referenced this pull request Jan 15, 2026
161088: roachtest: unify SSD() and VolumeCount() r=williamchoe3,DarrylWong,srosenberg a=golgeek

After #159010 introduced metamorphic volume types, an issue could occur when a test requested a specific count via `SSD(n)` or `VolumeCount(n)`, but the randomly selected volume type didn't match the option used.

For example, a test specifying `SSD(8)` would get only 1 disk when `RandomizeVolumeType()` selected `pd-ssd`, because it only applied to local SSDs while `VolumeCount` (which controls persistent disk count) defaulted to 0.

This patch unifies `SSD()` and `VolumeCount()` into a single `Disks()` option that specifies the number of disks regardless of storage type. The `DiskCount` field replaces both `SSDs` and `VolumeCount` in `ClusterSpec`, and `getGCEOpts` now routes this value to either `SSDCount` or `PDVolumeCount` based on the type of storage used.

Multi-disk configurations with persistent disks work on all clouds. Multi-disk local SSDs remain GCE-only; when local SSD is selected on other clouds with DiskCount > 1, the system gracefully falls back to persistent disks.

Fixes: #160518
Closes: #160187
Closes: #160205
Closes: #160225
Closes: #160542
Closes: #160695
Closes: #161058
Closes: #161064

Release note: None

Co-authored-by: Ludovic Leroux <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Jan 15, 2026
After #159010 introduced metamorphic volume types, an issue could occur
when a test requested a specific count via `SSD(n)` or `VolumeCount(n)`,
but the randomly selected volume type didn't match the option used.

For example, a test specifying `SSD(8)` would get only 1 disk
when `RandomizeVolumeType()` selected `pd-ssd`, because it only applied
to local SSDs while `VolumeCount` (which controls persistent disk count)
defaulted to 0.

This patch unifies `SSD()` and `VolumeCount()` into a single `Disks()`
option that specifies the number of disks regardless of storage type.
The `DiskCount` field replaces both `SSDs` and `VolumeCount`
in `ClusterSpec`, and `getGCEOpts` now routes this value to either
`SSDCount` or `PDVolumeCount` based on the type of storage used.

Multi-disk configurations with persistent disks work on all clouds.
Multi-disk local SSDs remain GCE-only; when local SSD is selected
on other clouds with DiskCount > 1, the system gracefully falls back
to persistent disks.

Fixes: #160518
Closes: #160187
Closes: #160205
Closes: #160225
Closes: #160542
Closes: #160695
Closes: #161058
Closes: #161064

Release note: None
golgeek added a commit to golgeek/cockroach that referenced this pull request Jan 15, 2026
After cockroachdb#159010 introduced metamorphic volume types, an issue could occur
when a test requested a specific count via `SSD(n)` or `VolumeCount(n)`,
but the randomly selected volume type didn't match the option used.

For example, a test specifying `SSD(8)` would get only 1 disk
when `RandomizeVolumeType()` selected `pd-ssd`, because it only applied
to local SSDs while `VolumeCount` (which controls persistent disk count)
defaulted to 0.

This patch unifies `SSD()` and `VolumeCount()` into a single `Disks()`
option that specifies the number of disks regardless of storage type.
The `DiskCount` field replaces both `SSDs` and `VolumeCount`
in `ClusterSpec`, and `getGCEOpts` now routes this value to either
`SSDCount` or `PDVolumeCount` based on the type of storage used.

Multi-disk configurations with persistent disks work on all clouds.
Multi-disk local SSDs remain GCE-only; when local SSD is selected
on other clouds with DiskCount > 1, the system gracefully falls back
to persistent disks.

Fixes: cockroachdb#160518
Closes: cockroachdb#160187
Closes: cockroachdb#160205
Closes: cockroachdb#160225
Closes: cockroachdb#160542
Closes: cockroachdb#160695
Closes: cockroachdb#161058
Closes: cockroachdb#161064

Release note: None
golgeek added a commit to golgeek/cockroach that referenced this pull request Jan 15, 2026
After cockroachdb#159010 introduced metamorphic volume types, an issue could occur
when a test requested a specific count via `SSD(n)` or `VolumeCount(n)`,
but the randomly selected volume type didn't match the option used.

For example, a test specifying `SSD(8)` would get only 1 disk
when `RandomizeVolumeType()` selected `pd-ssd`, because it only applied
to local SSDs while `VolumeCount` (which controls persistent disk count)
defaulted to 0.

This patch unifies `SSD()` and `VolumeCount()` into a single `Disks()`
option that specifies the number of disks regardless of storage type.
The `DiskCount` field replaces both `SSDs` and `VolumeCount`
in `ClusterSpec`, and `getGCEOpts` now routes this value to either
`SSDCount` or `PDVolumeCount` based on the type of storage used.

Multi-disk configurations with persistent disks work on all clouds.
Multi-disk local SSDs remain GCE-only; when local SSD is selected
on other clouds with DiskCount > 1, the system gracefully falls back
to persistent disks.

Fixes: cockroachdb#160518
Closes: cockroachdb#160187
Closes: cockroachdb#160205
Closes: cockroachdb#160225
Closes: cockroachdb#160542
Closes: cockroachdb#160695
Closes: cockroachdb#161058
Closes: cockroachdb#161064

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-25.2.x Flags PRs that need to be backported to 25.2 backport-26.1.x Flags PRs that need to be backported to 26.1 backport-failed o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. target-release-26.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants