Commit 7f472ec

feat: Add support for enabling EFA resources (#2936)
* feat: Add support for enabling EFA resources
* feat: Add support for creating placement group and ensuring subnet ID used supports the instance type provided
* chore: Update README and examples
* feat: Update AWS provider MSV to support `maximum_network_cards` attribute
* fix: Update self-managed example after last round of testing; improve EFA support wording
1 parent 6a1e124 commit 7f472ec

30 files changed: +366 -50 lines changed

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/antonbabenko/pre-commit-terraform
-    rev: v1.87.1
+    rev: v1.88.0
     hooks:
       - id: terraform_fmt
       - id: terraform_validate

README.md

Lines changed: 57 additions & 2 deletions
@@ -113,6 +113,60 @@ On clusters that were created prior to CAM support, there will be an existing ac

 Setting the `bootstrap_cluster_creator_admin_permissions` is a one time operation when the cluster is created; it cannot be modified later through the EKS API. In this project we are hardcoding this to `false`. If users wish to achieve the same functionality, we will do that through an access entry which can be enabled or disabled at any time of their choosing using the variable `enable_cluster_creator_admin_permissions`

+### Enabling EFA Support
+
+When enabling EFA support via `enable_efa_support = true`, there are two locations where this can be specified: one at the cluster level, and one at the nodegroup level. Enabling at the cluster level will add the EFA required ingress/egress rules to the shared security group created for the nodegroup(s). Enabling at the nodegroup level will do the following (per nodegroup where enabled):
+
+1. All EFA interfaces supported by the instance will be exposed on the launch template used by the nodegroup
+2. A placement group with `strategy = "cluster"` per EFA requirements is created and passed to the launch template used by the nodegroup
+3. Data sources will perform a reverse lookup of the availability zones that support the instance type selected based on the subnets provided, ensuring that only the associated subnets are passed to the launch template and therefore used by the placement group. This avoids the placement group being created in an availability zone that does not support the instance type selected.
+
+> [!TIP]
+> Use the [aws-efa-k8s-device-plugin](https://github.com/aws/eks-charts/tree/master/stable/aws-efa-k8s-device-plugin) Helm chart to expose the EFA interfaces on the nodes as an extended resource, and allow pods to request the interfaces be mounted to their containers.
+>
+> The EKS AL2 GPU AMI comes with the necessary EFA components pre-installed - you just need to expose the EFA devices on the nodes via their launch templates, ensure the required EFA security group rules are in place, and deploy the `aws-efa-k8s-device-plugin` in order to start utilizing EFA within your cluster. Your application container will need to have the necessary libraries and runtime in order to utilize communication over the EFA interfaces (NCCL, aws-ofi-nccl, hwloc, libfabric, aws-neuronx-collectives, CUDA, etc.).
+
+If you disable the creation and use of the managed nodegroup custom launch template (`create_launch_template = false` and/or `use_custom_launch_template = false`), this will interfere with the EFA functionality provided. In addition, if you do not supply an `instance_type` for self-managed nodegroup(s), or `instance_types` for the managed nodegroup(s), this will also interfere with the functionality. In order to support the EFA functionality provided by `enable_efa_support = true`, you must utilize the custom launch template created/provided by this module, and supply an `instance_type`/`instance_types` for the respective nodegroup.
+
+The logic behind supporting EFA uses a data source to look up the instance type and retrieve the number of interfaces that the instance supports, in order to enumerate and expose those interfaces on the launch template created. For managed nodegroups where a list of instance types is supported, the first instance type in the list is used to calculate the number of EFA interfaces supported. Mixing instance types with varying numbers of interfaces is not recommended for EFA (and in some cases mixing instance types is not supported, e.g. p5.48xlarge and p4d.24xlarge). In addition to exposing the EFA interfaces and updating the security group rules, a placement group is created per the EFA requirements and only the availability zones that support the instance type selected are used in the subnets provided to the nodegroup.
+
+In order to enable EFA support, you will have to specify `enable_efa_support = true` on both the cluster and each nodegroup that you wish to enable EFA support for:
+
+```hcl
+module "eks" {
+  source  = "terraform-aws-modules/eks/aws"
+  version = "~> 20.0"
+
+  # Truncated for brevity ...
+
+  # Adds the EFA required security group rules to the shared
+  # security group created for the nodegroup(s)
+  enable_efa_support = true
+
+  eks_managed_node_groups = {
+    example = {
+      instance_types = ["p5.48xlarge"]
+
+      # Exposes all EFA interfaces on the launch template created by the nodegroup(s)
+      # This would expose all 32 EFA interfaces for the p5.48xlarge instance type
+      enable_efa_support = true
+
+      pre_bootstrap_user_data = <<-EOT
+        # Mount NVME instance store volumes since they are typically
+        # available on instance types that support EFA
+        setup-local-disks raid0
+      EOT
+
+      # EFA should only be enabled when connecting 2 or more nodes
+      # Do not use EFA on a single node workload
+      min_size     = 2
+      max_size     = 10
+      desired_size = 2
+    }
+  }
+}
+```
+
 ## Examples

 - [EKS Managed Node Group](https://github.com/terraform-aws-modules/terraform-aws-eks/tree/master/examples/eks_managed_node_group): EKS Cluster using EKS managed node groups
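
Review note: the README text added in this hunk says a data source looks up the instance type and the launch template then exposes one EFA interface per supported interface. The following is a minimal sketch of how that enumeration could be written, not the module's actual source. The variable names, local names, the launch template resource name, and the device index convention are assumptions made for illustration; `maximum_network_cards` is the `aws_ec2_instance_type` attribute named in the commit message.

```hcl
# Hypothetical inputs, for illustration only
variable "enable_efa_support" {
  type    = bool
  default = false
}

variable "instance_type" {
  type    = string
  default = "p5.48xlarge"
}

# Look up the instance type to learn how many network cards it exposes
data "aws_ec2_instance_type" "this" {
  count = var.enable_efa_support ? 1 : 0

  instance_type = var.instance_type
}

locals {
  # Falls back to 0 when EFA support is disabled, so no interfaces are generated
  num_network_cards = try(data.aws_ec2_instance_type.this[0].maximum_network_cards, 0)

  efa_network_interfaces = [
    for i in range(local.num_network_cards) : {
      network_card_index = i
      # Assumption: the first card uses device index 0, all others use 1
      device_index   = i == 0 ? 0 : 1
      interface_type = "efa"
    }
  ]
}

resource "aws_launch_template" "example" {
  name_prefix   = "efa-example-"
  instance_type = var.instance_type

  # One network_interfaces block per network card reported by the data source
  dynamic "network_interfaces" {
    for_each = local.efa_network_interfaces

    content {
      network_card_index    = network_interfaces.value.network_card_index
      device_index          = network_interfaces.value.device_index
      interface_type        = network_interfaces.value.interface_type
      delete_on_termination = true
    }
  }
}
```

Deriving the interface list from a data source keeps the launch template in step with whatever instance type is supplied, which is presumably why the README requires `instance_type`/`instance_types` to be set when EFA support is enabled.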
@@ -135,15 +189,15 @@ We are grateful to the community for contributing bugfixes and improvements! Ple
 | Name | Version |
 |------|---------|
 | <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.3 |
-| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 5.34 |
+| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 5.38 |
 | <a name="requirement_time"></a> [time](#requirement\_time) | >= 0.9 |
 | <a name="requirement_tls"></a> [tls](#requirement\_tls) | >= 3.0 |

 ## Providers

 | Name | Version |
 |------|---------|
-| <a name="provider_aws"></a> [aws](#provider\_aws) | >= 5.34 |
+| <a name="provider_aws"></a> [aws](#provider\_aws) | >= 5.38 |
 | <a name="provider_time"></a> [time](#provider\_time) | >= 0.9 |
 | <a name="provider_tls"></a> [tls](#provider\_tls) | >= 3.0 |

@@ -240,6 +294,7 @@ We are grateful to the community for contributing bugfixes and improvements! Ple
 | <a name="input_eks_managed_node_group_defaults"></a> [eks\_managed\_node\_group\_defaults](#input\_eks\_managed\_node\_group\_defaults) | Map of EKS managed node group default configurations | `any` | `{}` | no |
 | <a name="input_eks_managed_node_groups"></a> [eks\_managed\_node\_groups](#input\_eks\_managed\_node\_groups) | Map of EKS managed node group definitions to create | `any` | `{}` | no |
 | <a name="input_enable_cluster_creator_admin_permissions"></a> [enable\_cluster\_creator\_admin\_permissions](#input\_enable\_cluster\_creator\_admin\_permissions) | Indicates whether or not to add the cluster creator (the identity used by Terraform) as an administrator via access entry | `bool` | `false` | no |
+| <a name="input_enable_efa_support"></a> [enable\_efa\_support](#input\_enable\_efa\_support) | Determines whether to enable Elastic Fabric Adapter (EFA) support | `bool` | `false` | no |
 | <a name="input_enable_irsa"></a> [enable\_irsa](#input\_enable\_irsa) | Determines whether to create an OpenID Connect Provider for EKS to enable IRSA | `bool` | `true` | no |
 | <a name="input_enable_kms_key_rotation"></a> [enable\_kms\_key\_rotation](#input\_enable\_kms\_key\_rotation) | Specifies whether key rotation is enabled | `bool` | `true` | no |
 | <a name="input_fargate_profile_defaults"></a> [fargate\_profile\_defaults](#input\_fargate\_profile\_defaults) | Map of Fargate Profile default configurations | `any` | `{}` | no |
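
Review note: the cluster-level `enable_efa_support` input documented in the table above is described in the README as adding the EFA required ingress/egress rules to the shared node security group. EFA needs the security group to allow all traffic to and from itself. A minimal sketch of such rules follows, assuming a hypothetical `node_security_group_id` variable; the resource names are illustrative and the module itself may express these rules differently.

```hcl
# Hypothetical: ID of the shared node security group
variable "node_security_group_id" {
  type = string
}

# Allow all inbound traffic from members of the same security group
resource "aws_vpc_security_group_ingress_rule" "efa" {
  security_group_id            = var.node_security_group_id
  referenced_security_group_id = var.node_security_group_id
  ip_protocol                  = "-1"
  description                  = "Node to node ingress for EFA"
}

# Allow all outbound traffic to members of the same security group
resource "aws_vpc_security_group_egress_rule" "efa" {
  security_group_id            = var.node_security_group_id
  referenced_security_group_id = var.node_security_group_id
  ip_protocol                  = "-1"
  description                  = "Node to node egress for EFA"
}
```

Because both rules reference the group itself, they open traffic only between nodes that share the group rather than to the wider VPC.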

examples/eks_managed_node_group/README.md

Lines changed: 2 additions & 2 deletions
@@ -30,13 +30,13 @@ Note that this example may create resources which cost money. Run `terraform des
 | Name | Version |
 |------|---------|
 | <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.3 |
-| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 5.34 |
+| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 5.38 |

 ## Providers

 | Name | Version |
 |------|---------|
-| <a name="provider_aws"></a> [aws](#provider\_aws) | >= 5.34 |
+| <a name="provider_aws"></a> [aws](#provider\_aws) | >= 5.38 |

 ## Modules

examples/eks_managed_node_group/main.tf

Lines changed: 25 additions & 1 deletion
@@ -7,7 +7,7 @@ data "aws_availability_zones" "available" {}

 locals {
   name            = "ex-${replace(basename(path.cwd), "_", "-")}"
-  cluster_version = "1.27"
+  cluster_version = "1.29"
   region          = "eu-west-1"

   vpc_cidr = "10.0.0.0/16"
@@ -37,6 +37,10 @@ module "eks" {

   enable_cluster_creator_admin_permissions = true

+  # Enable EFA support by adding necessary security group rules
+  # to the shared node security group
+  enable_efa_support = true
+
   cluster_addons = {
     coredns = {
       most_recent = true
@@ -241,6 +245,26 @@ module "eks" {
         ExtraTag = "EKS managed node group complete example"
       }
     }
+
+    efa = {
+      # Disabling automatic creation due to instance type/quota availability
+      # Can be enabled when appropriate for testing/validation
+      create = false
+
+      instance_types = ["trn1n.32xlarge"]
+      ami_type       = "AL2_x86_64_GPU"
+
+      enable_efa_support      = true
+      pre_bootstrap_user_data = <<-EOT
+        # Mount NVME instance store volumes since they are typically
+        # available on instances that support EFA
+        setup-local-disks raid0
+      EOT
+
+      min_size     = 2
+      max_size     = 2
+      desired_size = 2
+    }
   }

   access_entries = {
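
Review note: for the `efa` node group added above, the README describes a placement group plus a reverse lookup that keeps only the subnets in availability zones offering the instance type. The sketch below shows one way that lookup could be written; it is an illustration under assumptions, not the module's code. The variable names, the placement group name, and the output name are hypothetical.

```hcl
# Hypothetical inputs, for illustration only
variable "instance_type" {
  type    = string
  default = "trn1n.32xlarge"
}

variable "subnet_ids" {
  type = list(string)
}

# Which availability zones (by ID) offer this instance type in the current region
data "aws_ec2_instance_type_offerings" "this" {
  filter {
    name   = "instance-type"
    values = [var.instance_type]
  }

  location_type = "availability-zone-id"
}

# Keep only the provided subnets that sit in one of those availability zones
data "aws_subnets" "efa" {
  filter {
    name   = "subnet-id"
    values = var.subnet_ids
  }

  filter {
    name   = "availability-zone-id"
    values = data.aws_ec2_instance_type_offerings.this.locations
  }
}

# Cluster placement group, as required for EFA
resource "aws_placement_group" "efa" {
  name     = "efa-example"
  strategy = "cluster"
}

# Subnets that are safe to hand to the node group
output "efa_subnet_ids" {
  value = data.aws_subnets.efa.ids
}
```

Filtering the subnets before they reach the node group is what prevents the placement group from landing in an availability zone where the instance type cannot be launched.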

examples/eks_managed_node_group/versions.tf

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ terraform {
   required_providers {
     aws = {
       source  = "hashicorp/aws"
-      version = ">= 5.34"
+      version = ">= 5.38"
     }
   }
 }

examples/fargate_profile/README.md

Lines changed: 2 additions & 2 deletions
@@ -20,13 +20,13 @@ Note that this example may create resources which cost money. Run `terraform des
 | Name | Version |
 |------|---------|
 | <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.3 |
-| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 5.34 |
+| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 5.38 |

 ## Providers

 | Name | Version |
 |------|---------|
-| <a name="provider_aws"></a> [aws](#provider\_aws) | >= 5.34 |
+| <a name="provider_aws"></a> [aws](#provider\_aws) | >= 5.38 |

 ## Modules

examples/fargate_profile/versions.tf

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ terraform {
   required_providers {
     aws = {
       source  = "hashicorp/aws"
-      version = ">= 5.34"
+      version = ">= 5.38"
     }
   }
 }

examples/karpenter/README.md

Lines changed: 3 additions & 3 deletions
@@ -55,16 +55,16 @@ Note that this example may create resources which cost money. Run `terraform des
 | Name | Version |
 |------|---------|
 | <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.3 |
-| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 5.34 |
+| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 5.38 |
 | <a name="requirement_helm"></a> [helm](#requirement\_helm) | >= 2.7 |
 | <a name="requirement_kubectl"></a> [kubectl](#requirement\_kubectl) | >= 2.0 |

 ## Providers

 | Name | Version |
 |------|---------|
-| <a name="provider_aws"></a> [aws](#provider\_aws) | >= 5.34 |
-| <a name="provider_aws.virginia"></a> [aws.virginia](#provider\_aws.virginia) | >= 5.34 |
+| <a name="provider_aws"></a> [aws](#provider\_aws) | >= 5.38 |
+| <a name="provider_aws.virginia"></a> [aws.virginia](#provider\_aws.virginia) | >= 5.38 |
 | <a name="provider_helm"></a> [helm](#provider\_helm) | >= 2.7 |
 | <a name="provider_kubectl"></a> [kubectl](#provider\_kubectl) | >= 2.0 |

examples/karpenter/versions.tf

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ terraform {
   required_providers {
     aws = {
       source  = "hashicorp/aws"
-      version = ">= 5.34"
+      version = ">= 5.38"
     }
     helm = {
       source  = "hashicorp/helm"

examples/outposts/README.md

Lines changed: 2 additions & 2 deletions
@@ -49,14 +49,14 @@ terraform destroy
 | Name | Version |
 |------|---------|
 | <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.3 |
-| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 5.34 |
+| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 5.38 |
 | <a name="requirement_kubernetes"></a> [kubernetes](#requirement\_kubernetes) | >= 2.20 |

 ## Providers

 | Name | Version |
 |------|---------|
-| <a name="provider_aws"></a> [aws](#provider\_aws) | >= 5.34 |
+| <a name="provider_aws"></a> [aws](#provider\_aws) | >= 5.38 |
 | <a name="provider_kubernetes"></a> [kubernetes](#provider\_kubernetes) | >= 2.20 |

 ## Modules
