fix --use-cpu-initialization error when expert is not tensor-parallel#413
taozhiwei wants to merge 1 commit into deepspeedai:main
Conversation
Signed-off-by: taozhiwei <taozhiweigis@163.com>
Hi @taozhiwei, just curious: what is the case where the experts are not TP but the other layers are TP, given that experts usually have more aggregated weights than the other parts?
When the parameter --enable-expert-tensor-parallelism is not set, the experts are not TP. For example, ds_pretrain_gpt_125M_MoE64.sh does not set that parameter, and adding --use-cpu-initialization directly then causes an error to be reported. When I need to check whether the convergence curves are completely consistent, I add --use-cpu-initialization. @GuanhuaWang
Hi @taozhiwei, I think I should rephrase my question, since I am not asking about configurations: what are the application scenarios for the experts not using TP while the rest uses TP (i.e. expert not TP but non-expert TP)? To me, there is no such application, given that experts are usually much larger than the non-expert layers, so if TP is enabled it will always be applied to the experts first. To me, for TP-enabled cases there are only two:
Using --use-cpu-initialization fails when the non-expert layers are tensor-parallel but the experts are not.
In that case per_partition_size is equal to master_weight.shape[partition_dim], so my_weight_list has length 0 on every rank except rank 0. We therefore cannot use torch.cat; we should use assignment instead.
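Below is a minimal sketch of the failure mode and the fix described above. The function name, signature, and helper variables here are assumptions for illustration, loosely modeled on Megatron-style CPU weight initialization (_initialize_affine_weight_cpu); they are not the repository's exact code.

```python
# Sketch only: illustrates why torch.cat fails when the expert is not
# tensor-parallel, and how assigning the master weight avoids it.
import torch

def init_affine_weight_cpu_sketch(weight, master_weight, per_partition_size,
                                  partition_dim, rank, world_size, stride=1):
    per_partition_per_stride_size = per_partition_size // stride
    # Split the full (master) weight along the partition dimension.
    weight_list = torch.split(master_weight, per_partition_per_stride_size,
                              dim=partition_dim)
    # Each rank picks its own chunks.
    my_weight_list = weight_list[rank::world_size]

    with torch.no_grad():
        if per_partition_size == master_weight.shape[partition_dim]:
            # Expert is not tensor-parallel: torch.split produces a single
            # chunk, so my_weight_list is empty on every rank except rank 0
            # and torch.cat would fail. Assign the full master weight instead.
            weight.data.copy_(master_weight)
        else:
            # Regular tensor-parallel case: concatenate this rank's chunks.
            torch.cat(my_weight_list, dim=partition_dim, out=weight)
    return weight
```

The design choice is simply to detect the degenerate "no actual partitioning" case (local size equals the full size) and copy the master weight, rather than relying on torch.cat over a possibly empty list.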
Please help review @GuanhuaWang @tjruwase, thanks.