Question: infra and design choices during training

Suppose you were building a computer vision training pipeline for millions of images.

What design choice would you make to access these images during training?

Assume all your images are on S3.

How would you access these images during training? How would you do it if the server performing the GPU training is local, not in the cloud?

Some possible design choices to consider…

  1. Access images “locally” by mounting the S3 bucket using something like S3FS-FUSE

  2. Each dataloader worker accesses images on the fly: modify the dataset module to use boto3 to download each image while preparing a training batch, similar to the S3Dataset module proposed by AWS

  3. Before training, download all images, then access them locally. With millions of images this can be a problem because they may not fit on disk (note that buying a large disk is cheap, so this solution can still be worth considering…)

  4. Wait, you are totally wrong: S3 is not a good idea, do not use it! Put all images on a standalone server and use an NFS mount
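
To make option 2 a bit more concrete, here is a rough sketch of what I have in mind. It is not the AWS S3Dataset implementation, just a minimal map-style dataset under my own assumptions: the class name `S3ImageDataset`, the `client_factory` argument, and the bucket/key names are all made up for illustration. The only boto3 calls used are `boto3.client("s3")` and `get_object`. The client is created lazily on first access, so that each dataloader worker process should end up with its own client rather than sharing one created in the parent.

```python
from typing import Any, Callable, Optional, Sequence


def _default_client_factory() -> Any:
    # Imported lazily so this module can be loaded without boto3 installed.
    import boto3

    return boto3.client("s3")


class S3ImageDataset:
    """Hypothetical map-style dataset that reads raw image bytes from S3.

    It only defines __len__ and __getitem__, which is the interface a
    PyTorch DataLoader expects from a map-style dataset.
    """

    def __init__(
        self,
        bucket: str,
        keys: Sequence[str],
        client_factory: Callable[[], Any] = _default_client_factory,
        transform: Optional[Callable[[bytes], Any]] = None,
    ) -> None:
        self.bucket = bucket
        self.keys = list(keys)
        self.transform = transform
        self._client_factory = client_factory
        self._client = None  # created on first use, not in __init__

    def _get_client(self) -> Any:
        # Lazy creation: the client is built the first time this copy of
        # the dataset fetches an object.
        if self._client is None:
            self._client = self._client_factory()
        return self._client

    def __len__(self) -> int:
        return len(self.keys)

    def __getitem__(self, index: int) -> Any:
        key = self.keys[index]
        response = self._get_client().get_object(Bucket=self.bucket, Key=key)
        data = response["Body"].read()  # raw, still-encoded image bytes
        # Decoding (e.g. into a tensor) is left to the optional transform.
        return self.transform(data) if self.transform else data
```

In practice you would pass a `transform` that decodes the bytes into an image or tensor, then hand the dataset to a dataloader with several workers. Every item costs one S3 GET request, so throughput will depend heavily on the number of workers and on image size.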

What do you think? How would you do this?

There are some CVOps folks in here: @kausthubk @richmond @lu.riera @chris @chris @yashikajain201 that I would love to hear from


[3] at home, and [4] if there is a bunch of servers, a powerful NAS with a caching SSD, and 1Gbps networking :slight_smile:


3 and 4 seem simpler, but my only apprehension with using a “download it locally” strategy (even with a huge hard disk around) is that it’s sort of difficult to have a large team working that way.

If you’re working alone then 3 and 4 seem alright.


Have to agree with the others…

Nothing beats local.

However, options like 1, or even 2, can work in a pinch, or if you simply can’t get additional capacity locally, or if your standalone server isn’t equipped with BEEFY transfer/read rates.