How to categorize this issue?
/area control-plane
/area performance
/area scalability
/kind enhancement
What would you like to be added:
Today etcd-druid deploys an etcd cluster in which each member uses a single SSD, shared between WAL and snapshot files. These SSDs come with IOPS limits. For clusters with heavy etcd read/write activity, etcd can slow down significantly, which in turn causes timeouts from kube-apiserver:
```
Trace[1428471700]: ---"Txn call failed" err:etcdserver: request timed out 7015ms (06:23:13.521)
E0729 06:23:13.532618 1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:\"etcdserver: request timed out\"}: etcdserver: request timed out" logger="UnhandledError"
```
Details of one such occurrence can be seen in Live Issue #7539.
Upstream etcd recommends using a dedicated disk for the WAL (https://etcd.io/docs/v2.3/admin_guide/). Since an additional SSD comes at an additional cost, this should be made configurable via the Etcd resource, as sketched below.
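A minimal sketch of what such an API could look like. The `walVolume` field and its sub-fields are hypothetical and not part of the current Etcd API; `storageClass`, `storageCapacity`, and `spec.etcd` are existing fields, and the storage class names are cluster-specific placeholders:

```yaml
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-main
spec:
  replicas: 3
  # Existing volume for the data directory and snapshots.
  storageClass: gp3
  storageCapacity: 25Gi
  etcd:
    # Hypothetical, opt-in field: when set, etcd-druid would provision a
    # second PVC per member and place the WAL on it.
    walVolume:
      storageClass: premium-ssd   # assumed class name, cluster-specific
      storageCapacity: 8Gi
```

Keeping the field optional would preserve the default single-disk layout, and its cost, for clusters that do not need a dedicated WAL disk.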
Why is this needed:
etcd clusters rely heavily on extremely fast SSDs, and their response times are sensitive to disk performance. For large or busy etcd clusters, IOPS can easily exceed the limits of the SSD used. To prevent timeouts from the kube-apiserver, which result in outages, it is essential to provide an option for individual etcd members to use multiple SSDs.
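For illustration, at the pod level this would amount to mounting a second PVC and pointing etcd at it. The volume names and mount paths below are illustrative only; `--data-dir` and `--wal-dir` are real upstream etcd flags:

```yaml
# Excerpt of a member pod spec (names are illustrative).
containers:
- name: etcd
  command:
  - etcd
  - --data-dir=/var/etcd/data
  # Real upstream flag; places the WAL on the dedicated disk.
  - --wal-dir=/var/etcd/wal
  volumeMounts:
  - name: etcd-data        # existing SSD for data directory and snapshots
    mountPath: /var/etcd/data
  - name: etcd-wal         # hypothetical second SSD dedicated to the WAL
    mountPath: /var/etcd/wal
```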