fix: restore compatibility with Ubuntu 26.04 / ansible-core 2.20 #3864

Draft
Rico Lin (ricolin) wants to merge 62 commits into main from rlin-ubuntu2604-support

Conversation

Rico Lin (ricolin) commented Apr 22, 2026

Companion PR to the Ubuntu 26.04 work on the collection repos. Tracks two atmosphere-side fixes discovered while validating an AIO deploy on Ubuntu 26.04 with ansible-core 2.20 and Python 3.14.

Companion PRs:

Checked:


Test matrix

┌─────────────┬────────┬─────────┬─────────────────────────────────────┬─────────────────┬────────────────────────────────┬────────┐
│ OS          │ Python │ Backend │ Scenario                            │ Wallclock       │ Tempest                        │ Result │
├─────────────┼────────┼─────────┼─────────────────────────────────────┼─────────────────┼────────────────────────────────┼────────┤
│ 26.04       │ 3.14   │ OVN     │ fresh deploy                        │ ~75 min         │ 163/164 pass                   │ ✅     │
├─────────────┼────────┼─────────┼─────────────────────────────────────┼─────────────────┼────────────────────────────────┼────────┤
│ 24.04.1     │ 3.12   │ OVS     │ fresh deploy                        │ 102 min         │ 163/164 pass                   │ ✅     │
├─────────────┼────────┼─────────┼─────────────────────────────────────┼─────────────────┼────────────────────────────────┼────────┤
│ 22.04.3     │ 3.10   │ OVN     │ fresh deploy                        │ 93 min          │ 163/164 pass                   │ ✅     │
├─────────────┼────────┼─────────┼─────────────────────────────────────┼─────────────────┼────────────────────────────────┼────────┤
│ 22.04.3     │ 3.10   │ OVN     │ previous version → in-place upgrade │ 99 min + 22 min │ N/A (upgrade, cluster healthy) │ ✅     │
└─────────────┴────────┴─────────┴─────────────────────────────────────┴─────────────────┴────────────────────────────────┴────────┘

TODO: update Zuul CI to test against Ubuntu 26.04

Mohammed Naser (mnaser) and others added 30 commits April 15, 2026 12:44
Add a Go binary (cmd/atmosphere) that deploys Atmosphere components
in parallel waves using a DAG-based dependency graph, reducing
deployment time from ~60 minutes to ~22 minutes.

Key components:
- pkg/dag: Generic Graph[T] library with topological sort, subgraph
  extraction, and parallel wave execution via errgroup
- internal/deploy: Component registry (42 components), Deployer
  interface with AnsibleDeployer, and 3-mode Orchestrator
- cmd/atmosphere: CLI with deploy subcommand (--inventory, --tags,
  --playbook-dir, --concurrency flags)

Three operating modes:
- No tags: full DAG parallel deployment (11 waves)
- Single tag: pass-through to ansible-playbook (backwards compatible)
- Multiple tags: DAG-aware subgraph with parallel waves

The orchestrator spawns concurrent ansible-playbook processes with
generated per-component playbooks piped via /dev/stdin, avoiding
multi-play parsing overhead. Output is streamed with [component]
prefixes for clear CI log interleaving.

Backwards compatibility: existing ansible-playbook usage, tags, and
variables are completely unchanged. The orchestrator is additive.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
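
A minimal Python sketch of the wave planning this commit describes, under stated assumptions. The real implementation is the Go `pkg/dag` package, which this does not reproduce; `plan_waves` and `EXAMPLE_DEPS` are hypothetical names, and the five-component graph is an illustrative subset rather than the 42-component registry.

```python
def plan_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group components into waves; a wave only depends on earlier waves."""
    remaining = {name: set(needs) for name, needs in deps.items()}
    unknown = {d for needs in remaining.values() for d in needs} - remaining.keys()
    if unknown:
        raise ValueError(f"unknown dependencies: {sorted(unknown)}")

    waves: list[list[str]] = []
    done: set[str] = set()
    while remaining:
        # Everything whose dependencies are already deployed can run in parallel.
        ready = sorted(name for name, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


# Hypothetical subset of the component graph, for illustration only.
EXAMPLE_DEPS = {
    "kubernetes": set(),
    "ceph": set(),
    "csi": {"kubernetes", "ceph"},
    "ingress-nginx": {"kubernetes"},
    "valkey": {"csi"},
}

if __name__ == "__main__":
    for index, wave in enumerate(plan_waves(EXAMPLE_DEPS)):
        print(f"wave {index}: {', '.join(wave)}")
```

Components in the same wave have no dependency on each other, which is what makes it safe to run them as concurrent processes.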
Update molecule converge playbooks to build and use the atmosphere
binary for deployment:

- default: full DAG deploy (no tags)
- csi: multi-tag with ceph,kubernetes,csi (or kubernetes,csi)
- keycloak: multi-tag with all keycloak dependencies
- pxc: single-tag pass-through for percona-xtradb-cluster

The multi-tag mode resolves DAG ordering automatically, running
independent components in parallel where possible.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
Adjust DAG dependencies based on actual role analysis:

- ingress-nginx: drop cluster-issuer dep (only needs kubernetes)
- pxc, valkey, kube-prometheus-stack, loki: add csi dep (all use PVCs)
- lpfc, multipathd, iscsi, udev: remove kubernetes dep (pure host config)
- rook-ceph: depend on kubernetes only (operator, not storage consumer)
- rook-ceph-cluster: add ceph dep (needs ceph monitors)
- nova: add neutron dep, drop ovn/coredns (transitive via neutron)
- neutron: add coredns dep (dnsmasq_dns_servers uses coredns)
- magnum: depend on octavia, barbican, heat (configures all three clients)
- openstack-exporter: depend on cinder, neutron (only hard runtime deps)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
- Add ensure-go role (v1.24.4) to molecule pre-run playbook
- Set CGO_ENABLED=0 and explicit Go PATH in all converge build tasks
- Add kubernetes, csi, valkey to keycloak scenario tags (transitive deps)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
When rendering playbooks piped via /dev/stdin, ansible-playbook has no
collection context. Prefix bare role names with vexxhost.atmosphere. so
Ansible can resolve them from the installed collection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
Use vexxhost.atmosphere.* fully-qualified collection names for both
playbooks (PlaybookType) and roles (RoleType). This removes the need
for --playbook-dir since Ansible resolves collection references
directly. Also removes the openstacksdk prerequisite step since
dependent roles already call it and Ansible does atomic writes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
Add a ResourceCoordinator that serializes components sharing a named
resource (e.g., 'apt'). Components ceph and kubernetes declare the apt
resource since they come from external collections where we cannot add
retries. For all roles within vexxhost.atmosphere that use package
management, add retries (5 attempts, 10s delay) to gracefully handle
dpkg lock contention during parallel deployment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
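
The serialization idea in this commit can be sketched in Python as follows. This is an assumption-laden illustration, not the Go `ResourceCoordinator` from the PR: the class shape, the `run` method, and its arguments are all hypothetical.

```python
import threading
from contextlib import ExitStack


class ResourceCoordinator:
    """Serializes callers that declare the same named resource (hypothetical API)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: dict[str, threading.Lock] = {}

    def _lock_for(self, name: str) -> threading.Lock:
        # Create each named lock once, even when threads race to request it.
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    def run(self, resources, func):
        """Run func() while holding every named resource it declares."""
        with ExitStack() as stack:
            # Acquire in sorted order so two components can never deadlock.
            for name in sorted(set(resources)):
                stack.enter_context(self._lock_for(name))
            return func()
```

Two components that both declare `"apt"` would run one after the other, while a component that declares no resources is never blocked.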
Mark multipathd and iscsi with the 'apt' resource since they install
packages on the same hosts as ceph/kubernetes (external collections
without retries). Also set changed_when: false on all molecule converge
build/deploy tasks to pass idempotence checks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
Environment values containing Jinja expressions with single quotes
(e.g., ceph container image) broke YAML parsing when wrapped in
single-quoted YAML strings. Switch to Go's %q format which uses
double quotes, safely containing single quotes in the values.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
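
A small Python analogue of the quoting problem, offered as a sketch only. The commit uses Go's `%q`; here `json.dumps` stands in, relying on the fact that a JSON string is also a valid YAML double-quoted scalar. The Jinja expression and the helper name `yaml_double_quoted` are hypothetical.

```python
import json


def yaml_double_quoted(value: str) -> str:
    """Render value as a double-quoted scalar (JSON strings are valid YAML)."""
    return json.dumps(value, ensure_ascii=False)


# Hypothetical environment value containing single quotes inside Jinja.
value = "{{ images['ceph'] }}"

# Naive single-quoting: the scalar ends at the quote before "ceph".
naive = f"'{value}'"

# Double-quoting keeps the inner single quotes intact.
safe = yaml_double_quoted(value)

if __name__ == "__main__":
    print("naive:", naive)
    print("safe: ", safe)
```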
ipmi-exporter deploys directly into the monitoring namespace using
kubernetes.core.k8s (not Helm with create_namespace: true), so it
needs the namespace to exist first. kube-prometheus-stack creates it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
keepalived and percona-xtradb-cluster deploy raw k8s resources into
the openstack namespace without creating it. memcached (via Helm with
create_namespace: true) creates the namespace. Add memcached as a
dependency so the namespace exists before these components run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
prometheus-pushgateway enables serviceMonitor which requires the
ServiceMonitor CRD from kube-prometheus-stack. Without this dep,
the Helm install fails with 'no matches for kind ServiceMonitor'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
The vexxhost.kubernetes collection uses kubernetes.core.k8s modules
in early plays before the Python kubernetes package is installed by
later plays. When running in parallel mode, this race becomes more
visible. Install the package in pre-run to ensure it's available
system-wide before any playbooks execute.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
libvirt, kube-prometheus-stack, and valkey all create Certificate
resources using cert-manager.io/v1 CRDs directly via kubernetes.core.k8s.
They also reference a ClusterIssuer named 'self-signed' created by
the cluster-issuer role. Add cluster-issuer as a dependency so the
CRDs and issuer exist before these components deploy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
The kube_prometheus_stack role starts by waiting for the Keycloak
StatefulSet to be ready and then creates realms/clients. Without
keycloak in its dependency list, it can start before keycloak is
deployed, causing 'list object has no element 0' errors when checking
the StatefulSet status.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
The rook_ceph_cluster role creates Keystone users, services, and
endpoints for Swift/RGW integration using openstack.cloud modules.
Without keystone being deployed first, these calls fail with SSL
connection errors to the identity endpoint.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
Manila creates compute flavors (needs Nova endpoint), uploads images
(needs Glance via Nova chain), and its Helm values reference endpoints
for nova, neutron, and cinder. Without these services deployed first,
manila fails with EndpointNotFound for the compute service.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Mohammed Naser <mnaser@vexxhost.com>
rook-ceph-cluster creates an OpenStack user in the 'service' domain
using openstack.cloud.identity_user. The 'service' domain is created
by OpenStack-Helm's ks-user bootstrap jobs (via helm-toolkit). By
depending on barbican (the first core service deployed), we ensure
the service domain exists before rook-ceph-cluster tries to use it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <copilot@github.com>
Move Go binary build to pre-run and add a custom Zuul run playbook
that runs molecule prepare, atmosphere deploy, br-ex networking (AIO),
and molecule verify as separate plays. This replaces the parent job's
molecule test invocation so deploy output streams directly to Zuul
logs instead of being buffered through molecule.

Also adds atmosphere_deploy_tags to CSI and keycloak job definitions
so each scenario deploys only its required components.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <copilot@github.com>
All converge playbooks now use the atmosphere_deploy_tags variable
instead of hardcoded tags. The Zuul run.yml imports the molecule
converge playbook directly, so the same converge logic runs both
locally (molecule converge) and in CI (Zuul run playbook).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <copilot@github.com>
Cinder's Helm chart creates PVCs that need the Ceph CSI provisioner
to be running. Add ceph-provisioners as a dependency so the storage
class and provisioner are ready before cinder deploys.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <copilot@github.com>
…build

ceph-provisioners only needs ceph monitors and CSI driver, not
rook-ceph-cluster. Also removes duplicate Go binary build from
pre.yml since converge.yml already handles it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <copilot@github.com>
Each Zuul job now sets run: to the scenario's converge.yml followed
by a verify playbook, so Zuul streams deploy output directly. Molecule
prepare and inventory setup move to pre.yml. Converge playbooks use
hosts: all with delegate_to/run_once so they work in both molecule
(localhost) and Zuul (remote node) contexts.

Also fixes ceph-provisioners to depend only on ceph (not
rook-ceph-cluster) since it only needs ceph monitors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <copilot@github.com>
In Zuul, Go is installed on the remote instance (via ensure-go) not
on the executor (localhost). Remove delegate_to: localhost so the go
build and atmosphere deploy commands run where Go and the collection
are available.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <copilot@github.com>
The `zuul.project.src_dir` variable is a relative path (e.g.
`src/github.com/vexxhost/atmosphere`). When the deploy task uses it as a
prefix in `cmd` while also having `chdir` set to the same relative path,
Ansible resolves the binary path as if it were relative to the new cwd,
doubling the path and causing a FileNotFoundError.

Fix by using `./bin/atmosphere` and `./inventory.yaml` in the cmd field
since `chdir` already navigates to the correct directory.

Also fix pre-commit end-of-file issue in orchestrator.go and add a
release note for the parallel deployment orchestrator feature.

Agent-Logs-Url: https://github.com/vexxhost/atmosphere/sessions/60b11e18-b92e-476f-86db-2a6c2ac4db06

Co-authored-by: mnaser <435815+mnaser@users.noreply.github.com>
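
The path doubling described in this commit can be reproduced with plain path arithmetic. The sketch below is illustrative: `/home/zuul` is an assumed working directory and `resolve` is a hypothetical helper, not Ansible code.

```python
import posixpath


def resolve(cwd: str, chdir: str, cmd_path: str) -> str:
    """Return the path the command resolves to after changing into chdir."""
    new_cwd = posixpath.normpath(posixpath.join(cwd, chdir))
    return posixpath.normpath(posixpath.join(new_cwd, cmd_path))


SRC_DIR = "src/github.com/vexxhost/atmosphere"  # relative, like zuul.project.src_dir
HOME = "/home/zuul"  # assumed starting directory, for illustration only

# Relative prefix repeated in cmd while chdir already points at it.
broken = resolve(HOME, SRC_DIR, f"{SRC_DIR}/bin/atmosphere")

# Path written relative to the directory chdir switches to.
fixed = resolve(HOME, SRC_DIR, "./bin/atmosphere")

if __name__ == "__main__":
    print(broken)
    print(fixed)
```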
The atmosphere deploy binary calls ansible-playbook internally. Add
.venv/bin to PATH in the deploy task so the binary can find
ansible-playbook installed in the uv virtual environment.

Agent-Logs-Url: https://github.com/vexxhost/atmosphere/sessions/60b11e18-b92e-476f-86db-2a6c2ac4db06

Co-authored-by: mnaser <435815+mnaser@users.noreply.github.com>
…ok path

Go 1.19+ refuses to execute binaries found via relative PATH entries
(CVE-2022-30580). Using `PATH=.venv/bin:...` fails because `.venv/bin`
is a relative entry.

Switch to `ansible.builtin.shell` with `. .venv/bin/activate` so that
the shell activation script adds the ABSOLUTE path of `.venv/bin` to
PATH before invoking `./bin/atmosphere deploy`. The atmosphere binary
then finds ansible-playbook via an absolute path, satisfying Go 1.19+
security requirements.

Agent-Logs-Url: https://github.com/vexxhost/atmosphere/sessions/60b11e18-b92e-476f-86db-2a6c2ac4db06

Co-authored-by: mnaser <435815+mnaser@users.noreply.github.com>
In the original serialized playbook, nova was deployed before neutron.
Neutron's post-install network creation requires the nova availability
zone to exist. Swap the dependency so nova deploys first, then neutron.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Copilot <copilot@github.com>
Replace the generic OpenStack quota task with service-specific quota
commands for compute, volume, and network resources. This avoids
querying load-balancer quotas during Manila deployment, which can fail
when the Octavia endpoint uses an untrusted certificate.

Signed-off-by: Yaguang Tang <yaguang.tang@vexxhost.com>
The parallel orchestrator generates minimal single-role playbooks for
RoleType components, which bypasses pre_tasks defined in the original
sequential playbooks (e.g., playbooks/openstack.yml). This means the
atmosphere_ceph_enabled deprecation guard was silently skipped.

Add a runPreflightChecks() method that runs the same validation checks
before any component deployment begins, called from both deployFullDAG
and deployMultipleTags. The deploySingleTag path is unaffected since it
passes through to the full site.yml which already includes pre_tasks.

Change-Id: If068daa27a3f4475e570f08ab6d2cd52effb2914
Signed-off-by: Dong Ma <dong.ma@vexxhost.com>
Rico Lin (ricolin) and others added 19 commits April 17, 2026 11:55
Magnum's Helm install doesn't require octavia to be running. The only
octavia reference is the octavia_client endpoint URL in helm values,
which is a deterministic string generated from openstack_helm_endpoints.
Octavia is only needed at runtime when users create Kubernetes clusters.

This allows magnum to start after barbican and heat complete (~13:30)
instead of waiting for octavia (~13:44), saving ~4.5 minutes on the
critical path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: ricolin <rlin@vexxhost.com>
When running molecule locally (outside Zuul), the verify playbook
cannot find workspace-generated variables (endpoints, secrets)
because the inventory fallback was /dev/null. Set
ATMOSPHERE_ZUUL_INVENTORY in tox.ini to point at the project root
inventory.yaml so Ansible discovers group_vars for all playbooks
(prepare, converge, verify). Touch the file before molecule runs
to ensure it exists for the prepare step.

In Zuul, molecule_environment overrides this env var with the
Zuul-generated inventory path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: I8098d8bbc2d5617bc5dd137d5dad17920bf73d69
Signed-off-by: ricolin <rlin@vexxhost.com>
Both roles have built-in retry logic (retries: 5, delay: 10) that
handles transient dpkg lock contention. Removing the apt resource
serialization allows them to run in parallel with ceph and kubernetes
during Wave 0, saving ~2 minutes of serial wait time.

Signed-off-by: ricolin <rlin@vexxhost.com>
The ceph component previously held the apt resource lock for its entire
~8 minute duration, but only used apt for ~30 seconds (Docker, cephadm
packages). The remaining time (bootstrap, mon, mgr, OSD creation) does
not touch apt.

Split into two components:
- ceph-packages: installs Docker and cephadm deps (holds apt lock ~1-2m)
- ceph: runs the full ceph playbook (no apt lock, depends on ceph-packages)

The main ceph playbook re-runs the cephadm role dependencies
idempotently (packages already installed = fast skip). This allows
kubernetes to start installing as soon as ceph-packages finishes,
rather than waiting for the entire ceph bootstrap to complete.

Signed-off-by: ricolin <rlin@vexxhost.com>
Agent-Logs-Url: https://github.com/vexxhost/atmosphere/sessions/21aacaca-4069-450a-a09d-0a1cddca9963

Co-authored-by: ricolin <7250045+ricolin@users.noreply.github.com>
The apt resource declarations were removed from multipathd and iscsi
components assuming their built-in retry logic would handle dpkg lock
contention. However, the kubernetes component (which runs in the same
wave) uses the external vexxhost.kubernetes.kubelet role that does NOT
have retry logic on its apt tasks. When multipathd or iscsi held the
dpkg lock, kubelet failed immediately with rc:100.

Re-add the apt resource to serialize these components with kubernetes
and ceph, preventing dpkg lock contention entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: If6fca34a1d6f84d3213ea7b777a8b6cf9c35a126
Signed-off-by: ricolin <rlin@vexxhost.com>
Revert the re-added apt resource lock on multipathd and iscsi. Both
roles already have built-in retry logic (retries: 5, delay: 10s) that
handles dpkg lock contention gracefully.

The original concern was that the kubelet role in
vexxhost.kubernetes lacks retry logic on its apt tasks. This is being
addressed upstream in vexxhost/ansible-collection-kubernetes#262 by
adding retry logic directly to the kubelet role.

With retry logic on both sides, serializing multipathd and iscsi behind
the apt resource is no longer necessary, recovering ~1-3 minutes of
Wave 0 parallelism.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: ricolin <rlin@vexxhost.com>
The Go deployer spawns ansible-playbook subprocesses that need:
- PATH: to find ansible-playbook in the venv
- ANSIBLE_COLLECTIONS_PATH: to find collections when running as root
  via become:true (root defaults to /root/.ansible/collections)

Signed-off-by: ricolin <rlin@vexxhost.com>
In Zuul CI, molecule/ansible-compat installs the collection to a cache
directory and sets ANSIBLE_COLLECTIONS_PATH accordingly. The previous
commit unconditionally overrode this with ~/.ansible/collections, causing
ansible-playbook to fail finding the vexxhost.atmosphere.* roles.

Make PATH and ANSIBLE_COLLECTIONS_PATH conditional on zuul is not defined,
so CI inherits the correct paths from .venv/bin/activate and molecule's
prerun while local runs still get the paths they need.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: ricolin <rlin@vexxhost.com>
The ceph component previously held the apt resource lock for its entire
~8 minute duration, but only used apt for ~30 seconds (Docker, cephadm
packages). The remaining time (bootstrap, mon, mgr, OSD creation) does
not touch apt.

Split into two components:
- ceph-packages: installs Docker and cephadm deps (holds apt lock ~1-2m)
- ceph: runs the full ceph playbook (no apt lock, depends on ceph-packages)

The main ceph playbook re-runs the cephadm role dependencies
idempotently (packages already installed = fast skip). This allows
kubernetes to start installing as soon as ceph-packages finishes,
rather than waiting for the entire ceph bootstrap to complete.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: ricolin <rlin@vexxhost.com>
Re-add the apt resource lock to multipathd and iscsi to prevent dpkg
lock contention with ceph (containerd AppArmor install) and kubernetes
(kubelet package install), which lack retry logic in their upstream
collections.

Once the following upstream PRs merge and atmosphere pins the new
collection versions, this lock can be safely removed:
- vexxhost/ansible-collection-kubernetes#262
- vexxhost/ansible-collection-containers#114

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: ricolin <rlin@vexxhost.com>
Change-Id: I23272c63155e8d4f323a278baa638cbb3073559d
Address multiple deployment failures on fresh Ubuntu 26.04 installs:

1. The generate_workspace playbook and the Nova/Manila
   generate_public_key tasks failed to generate SSH keys because
   systemd mounts /tmp as tmpfs on Ubuntu 24.04+, and
   community.crypto.openssh_keypair calls chattr on the generated
   files, which tmpfs does not support. Switch those tasks to a
   disk-backed tempfile location.

2. Ubuntu 26.04 ships Python 3.14 and a newer ansible-core. The
   pinned community.general 7.3.0 (and friends) break with
   JMESPathError under the 2.19+ template engine. Bump the pinned
   Ansible collections to recent major versions and lift the
   ansible-core pin so everything runs natively on Python 3.14.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: I561fb8ba2e52c1f26d86a3e9be5d0615c735d46d
Signed-off-by: Rico Lin <rico@vexxhost.com>
Ansible 2.20+ deprecates INJECT_FACTS_AS_VARS defaulting to true and
warns when top-level ansible_* fact variables are used. Switch the
prepare.yml snapd purge condition to ansible_facts['distribution'].

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: Idfe0733773fa6aab60b7da00050d32547d772fe9
Signed-off-by: Rico Lin <rico@vexxhost.com>
Switch ceph key lookups in the ceph_provisioners, ceph_csi_rbd and
rook_ceph_cluster roles to the new vexxhost.ceph.key_info module, since
recent versions of the vexxhost.ceph collection removed state: info from
the vexxhost.ceph.key module.

Teach the storage_to_ceph_provisioners_helm_values filter plugin to
unwrap ansible-core 2.20 lazy value and lazy container wrappers before
validating atmosphere_storage, so that Pydantic's discriminated-union
resolution receives plain strings rather than _LazyValue instances.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: I3ed1ad21eef19d3251267b88a06b2002348a4d46
Signed-off-by: Rico Lin <rico@vexxhost.com>
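
One way to picture the unwrapping step is a recursive conversion to built-in types before the data reaches the validator. The sketch below is an assumption: it does not use ansible-core internals, `to_plain` is a hypothetical name, and it simply treats any `str`, mapping, or sequence subclass as a wrapper to flatten.

```python
from collections.abc import Mapping, Sequence


def to_plain(value):
    """Recursively rebuild value from exact built-in str, dict and list types."""
    if isinstance(value, str):
        # str() on a str subclass returns an exact str copy.
        return str(value)
    if isinstance(value, Mapping):
        return {to_plain(key): to_plain(item) for key, item in value.items()}
    if isinstance(value, Sequence) and not isinstance(value, (bytes, bytearray)):
        # Note: tuples are flattened to lists in this sketch.
        return [to_plain(item) for item in value]
    return value
```

A validator that dispatches on a discriminator field then compares against an exact `str` rather than a wrapper type.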
On Ubuntu 26.04 systemd resolves `LimitNOFILE=infinity` to 2147483584
(2^31 - 64, just below INT_MAX). Every container started by containerd
v2.x inherits that
value. Workloads that iterate over inherited file descriptors before
`execve` — for example HAProxy external-check scripts spawned by the
Percona XtraDB cluster — spend tens of seconds in the close loop and
get killed by their own timeout, which in turn crash-loops the HAProxy
pod and blocks Keycloak and the rest of the deploy.

Pin `containerd_limit_open_file_num` to 1048576 when importing the
Kubernetes and Ceph playbooks so the containerd role renders the
systemd unit with a sane limit on every distribution. Matches the
value previously used on Red Hat systems and the effective limit on
older Ubuntu kernels.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: Ic71fd7fbd9118c4d5a7b4d5cec2009ab062b5a19
Signed-off-by: Rico Lin <rico@vexxhost.com>
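
The arithmetic behind this commit can be checked directly. In the sketch below the two limits come from the commit message; `ASSUMED_RATE` is a made-up close() throughput used only to show the order of magnitude, not a measurement.

```python
INT_MAX = 2**31 - 1
INFINITY_NOFILE = 2_147_483_584  # what LimitNOFILE=infinity resolves to on 26.04
PINNED_NOFILE = 1_048_576        # value pinned by this change

# Assumed rate of close() calls per second, purely illustrative.
ASSUMED_RATE = 50_000_000


def close_loop_seconds(limit: int, closes_per_second: float) -> float:
    """Time to walk every possible descriptor up to limit at the given rate."""
    return limit / closes_per_second


if __name__ == "__main__":
    print("distance from INT_MAX:", INT_MAX - INFINITY_NOFILE)
    print("ratio of limits:", INFINITY_NOFILE / PINNED_NOFILE)
    print("loop at infinity:", close_loop_seconds(INFINITY_NOFILE, ASSUMED_RATE), "s")
    print("loop when pinned:", close_loop_seconds(PINNED_NOFILE, ASSUMED_RATE), "s")
```

Whatever the real per-call cost is, the pinned limit shrinks the loop by a factor of roughly 2048.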
Keycloak 24 runs the Quarkus augmentation step at first boot. The
upstream Bitnami chart defaults to `resourcesPreset: small`, which
caps memory at 768 MiB and triggered an `OOMKilled` before the server
could open its HTTP port, failing the Helm install with a startup
probe timeout.

Use the `medium` preset (up to 1536 MiB) as the Atmosphere default.
Operators can still override through `keycloak_helm_values`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: I1874ebaf1e12228ee09114c8f4e72e374a4e45f4
Signed-off-by: Rico Lin <rico@vexxhost.com>
Rico Lin and others added 7 commits April 22, 2026 23:16
Ansible-core 2.20 wraps rendered default values in lazy containers that
the loop keyword rejects with 'must resolve to a list, not str'. Define
the default list directly in the magnum and magnum_pre role defaults to
sidestep the wrapper.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: I1b0f09820b5452ee058e5c8fb61cc7edc7443e4b
Signed-off-by: Rico Lin <rico@vexxhost.com>
The molecule AIO override referenced '_magnum_images' which was removed
from role vars in the previous commit, and used string template syntax
that ansible-core 2.20 rejects for 'loop:' consumers. Inline the single
test image directly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: I9067305529696f6440e72ace4944b4ca7a9c3225
Signed-off-by: Rico Lin <rico@vexxhost.com>
The neutron-db-sync post-install hook replays the full Alembic migration
chain on a fresh install, which regularly exceeds the default 5-minute
Helm hook timeout on Ubuntu 26.04 test hosts and leaves the release in
the 'failed' state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: Ib266c8a1b36e214a9c737d8284d22f39738476e5
Signed-off-by: Rico Lin <rico@vexxhost.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: I9e64043f9049c13d241e75e5129c64aa5e90ca3a
Signed-off-by: Rico Lin <rico@vexxhost.com>
ansible-core 2.20 deprecates INJECT_FACTS_AS_VARS defaulting to true
and warns whenever a top-level ansible_* fact variable is referenced.
Switch the remaining molecule prepare/converge/scenario files to
ansible_facts['fact_name'] for default_ipv4, distribution, and env.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: I70de7369acfb06dec9eb3e2a8e500bf54ca40bf0
Signed-off-by: Rico Lin <rico@vexxhost.com>
The Tempest suite on Ubuntu 26.04 takes longer than the previous
20-minute Helm wait, so the kubernetes.core.helm task gives up and
the subsequent k8s_info call samples the Job before the Kubernetes
Job controller has finalised .status.succeeded. Even with every test
passing the role then reports Tempest failed.

Raise wait_timeout to 30 minutes and add a polling retry on the Job
lookup so the role waits until the Job actually reaches a terminal
state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: I419323efd5827163a93b300a4411593c8900806a
Signed-off-by: Rico Lin <rico@vexxhost.com>
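
The polling retry can be sketched as a loop that keeps reading the Job status until it is terminal. This is a hedged Python illustration rather than the role's Ansible task: `job_state`, `wait_for_job`, and the injected `fetch_status` callable are hypothetical, and the status dictionary mirrors the Kubernetes Job `.status` fields (`succeeded`, `conditions`).

```python
import time


def job_state(status: dict):
    """Return 'succeeded' or 'failed' once the Job is terminal, else None."""
    for condition in status.get("conditions", []):
        if condition.get("status") != "True":
            continue
        if condition.get("type") == "Complete":
            return "succeeded"
        if condition.get("type") == "Failed":
            return "failed"
    if status.get("succeeded", 0) > 0:
        return "succeeded"
    return None


def wait_for_job(fetch_status, retries: int, delay: float, sleep=time.sleep) -> str:
    """Poll fetch_status() until the Job reaches a terminal state."""
    for attempt in range(retries):
        state = job_state(fetch_status())
        if state is not None:
            return state
        if attempt < retries - 1:
            sleep(delay)
    raise TimeoutError(f"Job not terminal after {retries} attempts")
```

Sampling the Job once, right after the Helm wait gives up, corresponds to `retries=1` here, which is the case that misreports a passing run.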
Commit d1f405a raised the ansible-core requirement to >=2.20 as part
of Ubuntu 26.04 support. That release requires Python 3.12, which
Ubuntu 22.04 does not ship, so pip refuses to install Atmosphere at
all on a 22.04 host.

Lower the floor back to >=2.15.9. On Ubuntu 22.04 pip resolves to the
latest compatible release in the 2.17.x series, which is sufficient
for every collection pinned in galaxy.yml. On Ubuntu 26.04 pip still
picks up 2.20 or newer, preserving the Python 3.14 deployment path.

Validated end-to-end on Ubuntu 22.04.3: tempest 163/164 pass,
1 skip, 0 fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: Ic338274618d79c76f1162cbfd240e94d97da3547
Signed-off-by: Rico Lin <rico@vexxhost.com>
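
The resolution behaviour this commit relies on can be sketched as a lookup over each ansible-core series' minimum controller Python. The table below is an assumption based on published ansible-core requirements and should be checked against the package metadata; `newest_installable` is a hypothetical helper that works on minor series only, so the `>=2.15.9` floor is approximated as `(2, 15)`.

```python
# Assumed minimum controller Python per ansible-core series (verify on PyPI).
MIN_PYTHON = {
    (2, 15): (3, 9),
    (2, 16): (3, 10),
    (2, 17): (3, 10),
    (2, 18): (3, 11),
    (2, 19): (3, 11),
    (2, 20): (3, 12),
}


def newest_installable(python: tuple, floor: tuple):
    """Newest series at or above floor that the given Python can install."""
    candidates = [
        series
        for series, needed in MIN_PYTHON.items()
        if series >= floor and python >= needed
    ]
    return max(candidates, default=None)


if __name__ == "__main__":
    print("22.04, floor 2.15:", newest_installable((3, 10), (2, 15)))
    print("26.04, floor 2.15:", newest_installable((3, 14), (2, 15)))
    print("22.04, floor 2.20:", newest_installable((3, 10), (2, 20)))
```

With the floor at 2.20 a Python 3.10 host has no candidate at all, which matches the install failure described above.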

Rico Lin (ricolin) commented Apr 23, 2026

 Ubuntu 26.04 + 22.04 Cross-Compatibility — Final Report

  Mission

  Enable Atmosphere deployment on Ubuntu 26.04 (Resolute) while ensuring existing Ubuntu 22.04 users can upgrade to the new
  Atmosphere release without being forced onto 26.04.

  Result: ✅ VALIDATED on both OSes


PRs Included in This Validation

1. vexxhost/atmosphere #3864 — rlin-ubuntu2604-support

Title: fix: support Ubuntu 26.04 while keeping 22.04 supported
Final HEAD: 5b365711
URL: https://github.com/vexxhost/atmosphere/pull/3864

The main cloud-platform PR. 10 discrete fixes addressing deployment failures on fresh Ubuntu 26.04 installs, plus the relaxed
version floors that keep 22.04 compatible. Commits on the branch:

┌──────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ SHA          │ Message                                                                                                      │
├──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ d1f405ac     │ fix(deps): support Ubuntu 26.04 / Python 3.14 (bump ansible-core, collections, tmpfs fix)                    │
├──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ (multiple)   │ containerd NOFILE cap, lazy-value unwrap, magnum_images inline, Keycloak large, Neutron 15m timeout, ceph    │
│              │ key_info migration, facts migration                                                                          │
├──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ a6c055d9     │ fix(tempest): bump Helm wait timeout and retry job lookup                                                    │
├──────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 5b365711     │ fix(deps): lower ansible-core floor to keep Ubuntu 22.04 supported (this session)                            │
└──────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

------------------------------------------------------------------------------------------------------------------------------

2. vexxhost/ansible-collection-containers #118 — rlin-ubuntu2604-support

Title: feat: Ubuntu 26.04 support
Final HEAD: 31d5c81
URL: https://github.com/vexxhost/ansible-collection-containers/pull/118

Container/containerd role updates. Key change: make the LimitNOFILE cap conditional so 22.04 keeps the historic infinity, while
26.04 and Red Hat families get 1048576. This avoids the 7s HAProxy external-check close-fd hang triggered by 26.04 resolving
infinity to 2,147,483,584.

------------------------------------------------------------------------------------------------------------------------------

3. vexxhost/ansible-collection-ceph #105 — rlin-ubuntu2604-support

Title: feat: Ubuntu 26.04 support
Final HEAD: cc79c72
URL: https://github.com/vexxhost/ansible-collection-ceph/pull/105

Ceph collection. Softened the Python floor (>=3.11) and the ansible-core floor (>=2.18) so 22.04 can still install it, and
reverted an unrelated default image bump (Reef 18.2.1 → Tentacle 20.2.1) that slipped into the 26.04 commit. The new key_info
module added here is what atmosphere #3864 migrates to.

------------------------------------------------------------------------------------------------------------------------------

4. vexxhost/ansible-collection-kubernetes #268 — rlin-ubuntu2604-support

Title: feat: Ubuntu 26.04 support
Final HEAD: b5f02da
URL: https://github.com/vexxhost/ansible-collection-kubernetes/pull/268

Kubernetes collection. Softened the Python floor (>=3.10) and the ansible-core floor (>=2.15.9), and restored minimum-version
floors (instead of exact pins) on ansible.posix, community.crypto, community.general, and kubernetes.core so 22.04 can install
them.

------------------------------------------------------------------------------------------------------------------------------

How they interact

 atmosphere PR #3864
   └── depends on (galaxy.yml):
         ├── vexxhost.containers 1.6.6  ← PR #118
         ├── vexxhost.ceph >=3.2.0       ← PR #105
         └── vexxhost.kubernetes 3.0.1   ← PR #268

Test environment wires them together via an uncommitted molecule/aio/collections.yml that overrides Galaxy lookups with
git-branch installs:

 collections:
   - { name: https://github.com/vexxhost/ansible-collection-containers.git, type: git, version: rlin-ubuntu2604-support }
   - { name: https://github.com/vexxhost/ansible-collection-ceph.git,       type: git, version: rlin-ubuntu2604-support }
   - { name: https://github.com/vexxhost/ansible-collection-kubernetes.git, type: git, version: rlin-ubuntu2604-support }

All four must land together — the atmosphere PR's role code depends on the new module signatures introduced in the three
collection PRs.

  ┌─────────┬──────────────────┬──────────────────────┐
  │ OS      │ Result           │ Tempest              │
  ├─────────┼──────────────────┼──────────────────────┤
  │ 26.04   │ ✅ PASS          │ 163/164 pass, 1 skip │
  ├─────────┼──────────────────┼──────────────────────┤
  │ 22.04.3 │ ✅ PASS (93 min) │ 163/164 pass, 1 skip │
  └─────────┴──────────────────┴──────────────────────┘

  PRs merged / pushed this session

  ┌────────────────────────────────────────┬───────┬────────────┬───────────────────────────────────────────────────────────────┐
  │ Repo                                   │ PR    │ Final HEAD │ Change                                                        │
  ├────────────────────────────────────────┼───────┼────────────┼───────────────────────────────────────────────────────────────┤
  │ vexxhost/atmosphere                    │ #3864 │ 5b365711   │ 10 fixes: containerd NOFILE, lazy-value unwrap, magnum_images │
  │                                        │       │            │ inline, Keycloak resources, Neutron timeout, ceph key_info,   │
  │                                        │       │            │ tmpfs chattr, facts migration, tempest race, ansible-core     │
  │                                        │       │            │ floor soften                                                  │
  ├────────────────────────────────────────┼───────┼────────────┼───────────────────────────────────────────────────────────────┤
  │ vexxhost/ansible-collection-containers │ #118  │ 31d5c81    │ Soften galaxy deps, conditional NOFILE cap (22.04 keeps       │
  │                                        │       │            │ infinity)                                                     │
  ├────────────────────────────────────────┼───────┼────────────┼───────────────────────────────────────────────────────────────┤
  │ vexxhost/ansible-collection-ceph       │ #105  │ cc79c72    │ Soften python/ansible floors, revert Reef→Tentacle default    │
  │                                        │       │            │ bump                                                          │
  ├────────────────────────────────────────┼───────┼────────────┼───────────────────────────────────────────────────────────────┤
  │ vexxhost/ansible-collection-kubernetes │ #268  │ b5f02da    │ Soften python/ansible/k8s.core floors                         │
  └────────────────────────────────────────┴───────┴────────────┴───────────────────────────────────────────────────────────────┘

  Errors encountered & fixed (log entries 1–10)

   1. lazy-value Pydantic break → _deep_unwrap filter helper
   2. _magnum_images loop type error → inline list in defaults
   3. Neutron db-sync 5m helm timeout → 15m
   4. Keycloak OOMKilled medium → large preset
   5. containerd LimitNOFILE=infinity → 1,048,576 cap (HAProxy close-fd hang)
   6. tmpfs + chattr SSH keygen → disk-backed tempfile
   7. Ceph vexxhost.ceph.key state=info removed → key_info module
   8. ansible_default_ipv4.* deprecation → ansible_facts['default_ipv4'] (12 files)
   9. Tempest Job race → Helm wait 30m + retries: 30, delay: 10
   10. ansible-core>=2.20 unsatisfiable on py3.10 → floor back to >=2.15.9
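
For entry 1, the log names a `_deep_unwrap` filter helper but does not show it. The following is a minimal sketch of what such a helper might look like, assuming the break comes from lazy container types that validation rejects and that copying them into built-in `dict` and `list` objects is enough. The `LazyDict` and `LazyList` classes are stand-ins invented for the example, not ansible-core types.

```python
from collections.abc import Mapping, Sequence


def deep_unwrap(value):
    """Recursively copy mappings and sequences into plain dict and list objects."""
    if isinstance(value, Mapping):
        return {key: deep_unwrap(item) for key, item in value.items()}
    # Strings and bytes are sequences too, but must be returned unchanged.
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes, bytearray)):
        return [deep_unwrap(item) for item in value]
    return value


class LazyDict(dict):
    """Stand-in for a lazy mapping type (invented for this example)."""


class LazyList(list):
    """Stand-in for a lazy sequence type (invented for this example)."""


wrapped = LazyDict(images=LazyList([LazyDict(name="example")]))
plain = deep_unwrap(wrapped)
print(type(plain).__name__, type(plain["images"]).__name__, plain)
```

Note that this sketch also turns tuples into lists, which may or may not be acceptable depending on what the downstream model expects.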

  Key insight: the softening strategy

  The original 26.04 support commit (d1f405ac) hard-raised every floor to 26.04-native versions, accidentally locking 22.04 out
  entirely. The fix for each PR was the same pattern:

   - pyproject.toml: raise requires-python only to the minimum Python that must be supported (>=3.10/>=3.11), and drop
    requires-python pins that exceed what 22.04 provides
   - galaxy.yml: change exact pins (e.g., community.general: 12.6.0) to floors (>=4.5.0); pip/galaxy still resolve upward
    to the newest compatible version
   - Conditional behavior where unavoidable (NOFILE cap): gate on os_family == 'RedHat' or (Ubuntu >= 26.04)

  Result: a single Atmosphere codebase runs on both OSes, with each picking its native Ansible/Python stack.
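
To illustrate why floors unblock an older host where exact pins do not, here is a small sketch of the two requirement styles. It is deliberately simplified: it handles only `X.Y.Z` exact pins and `>=X.Y.Z` floors, the `satisfies` helper is hypothetical, and the candidate version 9.5.0 is made up for the example. Real resolution is done by pip and ansible-galaxy.

```python
def parse(version: str) -> tuple:
    """Turn "4.5" or "4.5.0" into a three-part integer tuple for comparison."""
    parts = [int(part) for part in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def satisfies(candidate: str, requirement: str) -> bool:
    """Check a candidate version against an exact pin or a ">=" floor."""
    requirement = requirement.strip()
    if requirement.startswith(">="):
        return parse(candidate) >= parse(requirement[2:])
    return parse(candidate) == parse(requirement.removeprefix("=="))


older = "9.5.0"   # made-up version an older host might resolve to
newer = "12.6.0"  # the exact pin quoted above

print(satisfies(older, "12.6.0"), satisfies(older, ">=4.5.0"))
print(satisfies(newer, "12.6.0"), satisfies(newer, ">=4.5.0"))
```

The exact pin accepts only one version, so any host that cannot install it fails outright; the floor accepts both candidates and leaves the choice to the resolver.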

oslo.middleware 8.0 (shipped in the Magnum main image) removes the
filter-style Healthcheck middleware and raises NotImplementedError on
import, crashing magnum-api on startup. Switch the api-paste.ini to a
composite root that mounts the healthcheck as an app under
/healthcheck, matching the pattern already used by the Glance chart.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Change-Id: Icb3f691e4292f051f0d33b66f362e3b176a5205a
Signed-off-by: Rico Lin <rico@vexxhost.com>
@ricolin
Member Author

┌─────────────┬────────┬─────────┬─────────────────────────────────┬─────────────────┬────────────────────────────┬────────┐
│ OS          │ Python │ Backend │ Scenario                        │ Converge        │ Tempest                    │ Result │
├─────────────┼────────┼─────────┼─────────────────────────────────┼─────────────────┼────────────────────────────┼────────┤
│ 26.04       │ 3.14   │ OVN     │ fresh deploy                    │ ~75 min         │ 163/164 pass               │ ✅     │
├─────────────┼────────┼─────────┼─────────────────────────────────┼─────────────────┼────────────────────────────┼────────┤
│ 26.04       │ 3.14   │ OVS     │ fresh deploy (this run)         │ 62 min          │ 129/131 pass, 0 failed     │ ✅     │
├─────────────┼────────┼─────────┼─────────────────────────────────┼─────────────────┼────────────────────────────┼────────┤
│ 24.04.1     │ 3.12   │ OVS     │ fresh deploy                    │ 102 min         │ 163/164 pass               │ ✅     │
├─────────────┼────────┼─────────┼─────────────────────────────────┼─────────────────┼────────────────────────────┼────────┤
│ 22.04.3     │ 3.10   │ OVN     │ fresh deploy                    │ 93 min          │ 163/164 pass               │ ✅     │
├─────────────┼────────┼─────────┼─────────────────────────────────┼─────────────────┼────────────────────────────┼────────┤
│ 22.04.3     │ 3.10   │ OVN     │ baseline → in-place upgrade     │ 99 min + 22 min │ cluster healthy            │ ✅     │
└─────────────┴────────┴─────────┴─────────────────────────────────┴─────────────────┴────────────────────────────┴────────┘
