How to process a large dataset in PyTorch DDP mode?

I have a large dataset, about 900 GB, and my machine has 1 TB of memory. I want to train a model in distributed mode across 10 GPUs. I used to do this with TensorFlow and Horovod: I split the dataset into 10 parts, and each process loaded only its own part.

Now I want to do the same in PyTorch, and I know DDP is the tool for it, but I have run into problems. I still let each process load its own part of the dataset, but training is very slow, roughly 1/10 the speed of TF + Horovod, and memory overflows during training. The tutorials I have read all have each process load the whole dataset and use a DistributedSampler to route the data to the different GPUs, but doing that would cost 10x the memory. I don't know whether I should use the DistributedSampler when each process loads only part of the dataset, and whether that is the reason the training is slow.
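For reference, a minimal sketch of the per-rank shard loading described above, assuming the 10 parts are saved as hypothetical files shard_0.pt through shard_9.pt (the directory name and file format are illustrative, not from the original setup):

    import os
    import torch
    import torch.distributed as dist
    from torch.utils.data import Dataset, DataLoader

    dist.init_process_group("nccl")  # assumes one process per GPU, launched via torchrun

    class ShardDataset(Dataset):
        # Each DDP process loads only its own pre-split shard file.
        def __init__(self, shard_dir):
            rank = dist.get_rank()  # 0..9 for 10 processes
            # Hypothetical layout: shard_0.pt ... shard_9.pt, one file per rank.
            self.samples = torch.load(os.path.join(shard_dir, f"shard_{rank}.pt"))

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            return self.samples[idx]

    # No DistributedSampler here: the shards are already disjoint across
    # processes, so a plain shuffling DataLoader sees each sample once per rank.
    loader = DataLoader(ShardDataset("/data/shards"), batch_size=64,
                        shuffle=True, num_workers=4, pin_memory=True)

In this arrangement the sampler's job (giving each rank a disjoint slice) is done by the pre-split files themselves, which is why no DistributedSampler appears.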

So: when the dataset is large, can I let each process load only part of the dataset, as in TF + Horovod? If each process loads only its own part, do I still have to use the DistributedSampler? And once the data is loaded, how can I avoid the extra memory usage beyond the dataset itself and speed up training?
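One common way to keep per-process memory bounded (a sketch, not from the original post; it assumes each shard is stored as a flat float32 binary file, and the path, sample count, and sample size below are hypothetical) is to memory-map the shard and read samples lazily:

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader

    class MemmapShard(Dataset):
        # Samples are read lazily from a flat binary file via np.memmap,
        # so only the pages actually touched are ever resident in RAM.
        def __init__(self, path, num_samples, sample_dim):
            self.path = path
            self.num_samples = num_samples
            self.sample_dim = sample_dim
            self.mm = None  # opened lazily so each DataLoader worker gets its own handle

        def __len__(self):
            return self.num_samples

        def __getitem__(self, idx):
            if self.mm is None:
                self.mm = np.memmap(self.path, dtype=np.float32, mode="r",
                                    shape=(self.num_samples, self.sample_dim))
            # np.array(...) copies the single sample out of the read-only map.
            return torch.from_numpy(np.array(self.mm[idx]))

    # Path, sample count, and dimensionality are placeholders.
    loader = DataLoader(MemmapShard("/data/shards/shard_3.bin", 1_000_000, 512),
                        batch_size=64, shuffle=True, num_workers=4)

With this pattern each process keeps far less than its full shard in RAM, and num_workers overlaps the disk reads with GPU compute.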

Answers

No answers yet.



