It seems torchrec does not support combining data parallelism and row-wise parallelism for an embedding table. Is there a plan to support this? Or is row-wise parallelism efficient enough on its own for multi-node training?
If I understand correctly, you want to data-parallel the row-wise shards of an embedding table? AFAIU this seems like a niche case, and I'm not sure it brings gains over the currently supported sharding schemes. Usually RW/CW sharding is efficient for multi-node training.
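For reference, here is a minimal sketch of how a sharding scheme is pinned per table through the planner's constraints today (the table names `large_table` and `small_table` are hypothetical; this only illustrates the constraint mechanism, not a recommended layout):

```python
# Sketch: restricting the planner to specific sharding types per table.
# Table names are hypothetical placeholders.
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

constraints = {
    # A large table: shard its rows evenly across all ranks (RW).
    "large_table": ParameterConstraints(
        sharding_types=[ShardingType.ROW_WISE.value],
    ),
    # A small table: replicate it on every rank (DP). DP is supported
    # per table, just not combined with RW on the same table.
    "small_table": ParameterConstraints(
        sharding_types=[ShardingType.DATA_PARALLEL.value],
    ),
}

planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=16, compute_device="cuda"),
    constraints=constraints,
)
# planner.plan(model, sharders) would then produce the sharding plan.
```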
Thanks for your reply, I see what you mean. But for massive training, say hundreds of GPUs, RW/CW sharding probably makes each GPU's shard of an embedding table too small. In that case, wouldn't DP+RW/CW be a better fit? Or should we just use TW+RW/CW?
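For what it's worth, torchrec's `ShardingType` enum already includes a hierarchical `TABLE_ROW_WISE` (TWRW) option, where a table is placed on a single host and row-wise sharded across that host's GPUs, so per-GPU shards don't shrink with the global world size. A minimal sketch of constraining a table to it (the table name is hypothetical):

```python
# Sketch: pinning a table to TWRW sharding, i.e. row-wise within one host.
# "user_id_table" is a hypothetical table name.
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

constraints = {
    "user_id_table": ParameterConstraints(
        sharding_types=[ShardingType.TABLE_ROW_WISE.value],
    ),
}
```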