I have a large dataset, about 900 GB, on a machine with 1 TB of RAM, and I want to train a model with distributed training on 10 GPUs. With TensorFlow + Horovod I used to split the dataset into 10 shards, and each process loaded only its own shard. Now I want to do the same in PyTorch. I know I can use PyTorch's DDP, and I tried it, still letting each process load only its own shard, but I hit two problems: training is very slow, roughly 1/10 the speed of TF + Horovod, and memory overflows during training.

The tutorials I have read all have every process load the whole dataset and use a DistributedSampler to route different indices to different GPUs, but in my case that would cost 10x the memory. I don't know whether I should still use the DistributedSampler when each process loads only part of the dataset, and whether that is why training is so slow.
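Here is roughly what each process does at the moment (a minimal runnable sketch; `load_shard` is a stand-in for my real I/O, and the tiny random tensors stand in for one ~90 GB shard):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset

def load_shard(rank):
    # Stand-in for reading shard file `rank` from disk (~90 GB each in
    # my real setup); random tensors here just to keep the sketch runnable.
    x = torch.randn(1000, 32)
    y = torch.randint(0, 2, (1000,))
    return TensorDataset(x, y)

def main():
    dist.init_process_group("nccl")      # launched with torchrun, one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    dataset = load_shard(rank)           # each process keeps only its own 1/10
    loader = DataLoader(dataset, batch_size=64, shuffle=True,
                        num_workers=4, pin_memory=True)  # no DistributedSampler here

    model = DDP(torch.nn.Linear(32, 2).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        for x, y in loader:
            x = x.cuda(rank, non_blocking=True)
            y = y.cuda(rank, non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

if __name__ == "__main__":
    main()
```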
So my questions are:

1. When the dataset is this large, can each process load only its own shard, just like in TF + Horovod?
2. If each process loads only part of the dataset, do I still have to use the DistributedSampler?
3. Once the data is loaded, how do I avoid the extra memory use on top of the dataset itself and speed up training?
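For reference, this is the tutorial pattern I would like to avoid, since every process constructs the full dataset (the small `TensorDataset` below is just a runnable stand-in for the full 900 GB):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group("nccl")          # again one process per GPU via torchrun
full = TensorDataset(torch.randn(10000, 32),
                     torch.randint(0, 2, (10000,)))  # stand-in for the full dataset
sampler = DistributedSampler(full)       # shards by *index*; every rank still
                                         # holds the whole dataset in memory
loader = DataLoader(full, batch_size=64, sampler=sampler,
                    num_workers=4, pin_memory=True)

for epoch in range(2):
    sampler.set_epoch(epoch)             # so the shuffle differs per epoch
    for x, y in loader:
        pass                             # training step would go here
```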