- My lab's computing setup; link
- Cloud GPU and whole-machine price comparison and notes; link
- Free compute available from companies (resources that professors can apply for); link
- Useful learning material on GPUs and setting up your clusters; link
Using the start-up funds provided by the university, I have built 9 stand-alone machines for my lab, the Precognition Lab, with a total of 32 RTX 3090/4090 GPUs (a mix of 4-GPU and 2-GPU machines). I have also set up a cluster of 3 compute nodes with a total of 24 RTX A6000 GPUs and a 100 TB NAS.
The rationale behind this hybrid setup is twofold: the stand-alone machines cost only 40% of what the cluster does, and they can be acquired quickly without requiring additional machine-room space.
As for the cluster, I have found that 100 Gb Ethernet suffices for the compute network, so there is no need to invest in an InfiniBand switch, which can cost two to three times more. With 3 nodes on this network, multi-node training scales essentially linearly (for example, a job that takes 6 hours on 1 node takes 2 hours on 3 nodes).
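
The linear-scaling claim can be checked with a quick calculation. The sketch below is only an illustration: the function name `scaling_efficiency` is made up for this example, and the inputs are the 6-hour and 2-hour figures quoted above.

```python
def scaling_efficiency(single_node_hours: float, multi_node_hours: float, num_nodes: int) -> float:
    """Parallel efficiency: speedup divided by node count (1.0 means perfectly linear)."""
    speedup = single_node_hours / multi_node_hours
    return speedup / num_nodes


# Figures from the text: 6 hours on 1 node, 2 hours on 3 nodes.
efficiency = scaling_efficiency(single_node_hours=6.0, multi_node_hours=2.0, num_nodes=3)
print(f"speedup: {6.0 / 2.0:.1f}x, efficiency: {efficiency:.0%}")
```

An efficiency well below 1.0 on your own jobs would suggest that the network (or the data pipeline) is the bottleneck, which is when a faster interconnect might start to pay off.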
Vendors in mainland China (updated 07/2022):
| Vendor | Machine | Duration | Price (RMB) | Note |
|---|---|---|---|---|
| Alibaba Cloud (阿里) | 8xV100 (16GB) | 1 year | 800,000 | CentOS only |
| | | 1 month | 71,000 | |
| | | 1 hour | 248.42 | |
| Huawei Cloud (华为云) | 8xV100 (32GB) | 1 year | 630,000 | |
| | | 1 month | 63,000 | |
| | | 1 hour | 131.5 | |
| Tencent Cloud (腾讯云) | 8xV100 (32GB) | 1 year | 458,000 (at 83% of list price) | link |
| | | 1 month | 46,000 | |
| | | 1 hour (TIONE) | 147 | |
| | 8xA100 (40GB) | 1 year | 1,135,000 (at 83% of list price) | |
| | | 1 month | 114,000 | |
| Baidu Cloud (百度云) | 8xA100 (40GB) | 1 year | 997,000 (at 83% of list price) | link |
| | | 1 month | 100,000 | |
| | 8xV100 (32GB) | 1 year | 593,000 | |
| | | 1 month | 59,000 | |
| | | 1 hour | 124.14 | |
| 矩池云 | 8xV100 (16GB) | 1 hour | 48 | |
| 智星云 | 8x3090 (24GB) | 1 month | 21,000 | |
| | | 1 hour | 36 | |
| | 8xA100 (40GB) | 1 month | 45,000 | |
| | | 1 hour | 76 | |
| | 8xV100 (32GB) | 1 month | 28,000 | |
| | | 1 hour | 48 | |
| 极链AI云 | | | | |
| 恒源云 | | | | |
| AutoDL | | | | link; most popular |
| OpenBayes | | | | link |
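
One way to read the hourly and monthly prices together is to ask how many hours of use per month make the monthly plan cheaper than paying by the hour. The sketch below uses the Alibaba Cloud 8xV100 figures from the table (71,000 RMB per month, 248.42 RMB per hour). The function name and the 30-day billing month are assumptions made for illustration, not vendor terms.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day billing month


def break_even_hours(monthly_price: float, hourly_price: float) -> float:
    """Hours of use per month above which the monthly plan is cheaper than hourly billing."""
    return monthly_price / hourly_price


# Alibaba Cloud 8xV100 (16GB): 71,000 RMB per month vs. 248.42 RMB per hour.
hours = break_even_hours(monthly_price=71_000, hourly_price=248.42)
utilization = hours / HOURS_PER_MONTH
print(f"monthly plan pays off above {hours:.0f} hours (~{utilization:.0%} utilization)")
```

Under these assumptions the monthly plan wins once the machine is busy for roughly 40% of the month; below that, hourly billing is the cheaper option.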
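Whole-machine purchase (quotes from 08/2022):
| Source | Machine | Price / Note |
|---|---|---|
| dbcloud深脑云 (Taobao) | 8x3090 | from about 200,000 RMB |
| Prof. Ming-Ming Cheng's experience | 8xV100 | link |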
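Junwei: GPU prices have dropped sharply recently (09/2022), so buying whole machines is clearly the better deal. The 3090 offers roughly the same compute as the V100, which makes it the most cost-effective card. I therefore think that several 8x3090 machines plus a network-attached storage (NAS) plus Kubeflow is the most economical and scalable setup; see the material further down on how to build your own compute cluster.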
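
The rent-versus-buy argument can be made concrete with a rough break-even estimate. The sketch below compares the 8x3090 purchase price quoted above (about 200,000 RMB) with the 智星云 8x3090 monthly rental (21,000 RMB). The function name and the `monthly_upkeep` parameter are hypothetical: upkeep stands in for power, hosting, and maintenance, none of which are given in the tables, and it defaults to zero.

```python
def break_even_months(purchase_price: float, monthly_rent: float, monthly_upkeep: float = 0.0) -> float:
    """Months of rental after which buying the machine becomes cheaper.

    monthly_upkeep is a hypothetical allowance for power, hosting, and
    maintenance of the purchased machine; it is not a figure from the tables.
    """
    saving_per_month = monthly_rent - monthly_upkeep
    if saving_per_month <= 0:
        raise ValueError("buying never pays off under these inputs")
    return purchase_price / saving_per_month


# 8x3090: about 200,000 RMB to buy vs. 21,000 RMB per month to rent.
months = break_even_months(purchase_price=200_000, monthly_rent=21_000)
print(f"break-even after about {months:.1f} months")
```

With zero upkeep the purchase pays for itself in under a year of continuous use; adding a realistic upkeep figure for your own machine room pushes the break-even point later.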
Vendors in NA (Updated 07/2022):
| Vendor | Machine | Duration | Price | Note |
|---|---|---|---|---|
| Google Cloud asia-Taiwan | 8xV100 (32GB) | 1 month | $12,837.30 | |
| | | 1 hour | $17 | |
| Google Cloud asia-Tokyo | 8xA100 (40GB) | 1 month | $18,216.98 | |
| vast.ai NA | 8xV100 (16GB) | 1 hour | $2.80 | |
| | 8xA100 (40GB) | 1 hour | $8.80 | |
| | 8xA6000 (48GB) | 1 hour | $4.40 | |
| | 10x1080Ti (11GB) | 1 hour | $2 | |
| | 8xA5000 (24GB) | 1 hour | $2.40 | |
| | 4x3090 (24GB) | 1 hour | $1.20 | |
| lambda NA | 8xV100 (16GB) | 1 hour | $4.40 | |
| | 8xV100 (16GB) | 1 hour (>3 months) | $3.20 | |
| | 8xA100 (40GB) | 1 hour (>3 months) | $8.00 | |
| | 1xA100 (40GB) | 1 hour | $1.10 | link |
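
Because the machines in the table above have different GPU counts, it can help to normalize the on-demand hourly prices to a price per GPU-hour. The sketch below does this for the vast.ai and lambda listings; the list layout and the function name are choices made for this example. Note that a GPU-hour on one card is not equivalent to a GPU-hour on another, so this compares prices only, not throughput.

```python
# (vendor, GPU model, GPU count, on-demand hourly price in USD), copied from the table above
listings = [
    ("vast.ai", "V100 16GB", 8, 2.80),
    ("vast.ai", "A100 40GB", 8, 8.80),
    ("vast.ai", "A6000 48GB", 8, 4.40),
    ("vast.ai", "1080Ti 11GB", 10, 2.00),
    ("vast.ai", "A5000 24GB", 8, 2.40),
    ("vast.ai", "3090 24GB", 4, 1.20),
    ("lambda", "V100 16GB", 8, 4.40),
    ("lambda", "A100 40GB", 1, 1.10),
]


def price_per_gpu_hour(gpu_count: int, hourly_price: float) -> float:
    """Hourly machine price divided by the number of GPUs in the machine."""
    return hourly_price / gpu_count


# Cheapest per-GPU price first; sorted() is stable, so ties keep table order.
ranked = sorted(listings, key=lambda row: price_per_gpu_hour(row[2], row[3]))
for vendor, gpu, count, price in ranked:
    print(f"{vendor:8s} {gpu:12s} ${price_per_gpu_hour(count, price):.2f} per GPU-hour")
```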
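
Free compute available from companies:
| Company | Note | Link |
|---|---|---|
| 幻方AI (High-Flyer AI) | Compute on the scale of 10,000 GPUs, free to apply for; advertised as "a summer of exhilarating research" | link |
| NVIDIA | A grant program that provides one free GPU | |
| AWS | When I took classes at CMU, the professor of each course could apply for about $100 of cloud credit per student | |
| Google Cloud | Similar to AWS | |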

Useful learning material on GPUs and setting up your clusters:
| Note | Link |
|---|---|
| GPU guide from Lambda | link |
| Understanding GPUs and deep learning | link |
| Tencent TEG 星辰 and 机智 teams | link |
| MPIJob | link |
| Machine learning platform | link |
| Cluster storage, Ceph cluster | link |
| A discussion on machine prices on Twitter (for NA) | link |
| A discussion on 1xA100 vs. 6x3090 on Zhihu | link |
| Prof. Ming-Ming Cheng's GPU cluster experience | link |
| Good cluster building guide from Lambda | link |
| How to decide on cloud GPUs vs. on-prem vs. hybrid | link |
