Suppose you were building a computer vision training pipeline for millions of images.
Assume all your images are on S3.
What design choice would you make to access these images during training? And how would you do it if the server performing the GPU training is local, not in the cloud?
Some possible design choices to consider…
- Access images "locally" by mounting the S3 bucket using something like S3FS-FUSE.
- Have each dataloader worker fetch images on the fly, by modifying the dataset module to use boto3 to download each image while preparing a training batch, similar to the S3Dataset module proposed by AWS (see the sketch after this list).
- Before training, download all images, then access them locally. With millions of images this is a problem if they don't fit on disk (though buying a large disk is cheap, so this solution can still be worth considering…).
- "Wait, you are totally wrong, S3 is not a good idea, do not use it! Put all images on a standalone server and use an NFS mount."
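For the second option, here is a minimal sketch of what I have in mind: a map-style PyTorch dataset where each dataloader worker pulls one object from S3 with boto3 per `__getitem__`. The bucket name, key list, labels, and transform are placeholders, and this is just an illustration of the idea, not the actual AWS S3Dataset implementation.

```python
import io

import boto3
from PIL import Image
from torch.utils.data import Dataset


class S3ImageDataset(Dataset):
    """Map-style dataset that fetches one image per __getitem__ from S3."""

    def __init__(self, bucket, keys, labels, transform=None):
        self.bucket = bucket      # placeholder bucket name
        self.keys = keys          # list of S3 object keys, one per image
        self.labels = labels      # parallel list of labels
        self.transform = transform
        self._client = None       # created lazily, once per worker process

    def _s3(self):
        # boto3 clients are not fork-safe, so build one per dataloader worker.
        if self._client is None:
            self._client = boto3.client("s3")
        return self._client

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        # Download the raw bytes and decode them in the worker process.
        obj = self._s3().get_object(Bucket=self.bucket, Key=self.keys[idx])
        image = Image.open(io.BytesIO(obj["Body"].read())).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[idx]
```

Used with a `DataLoader` and several workers (e.g. `num_workers=8`), the per-object S3 latency mostly overlaps with GPU compute, which is the main argument for this option over a full pre-download.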
What do you think? How would you do this?
There are some CVOps folks in here (@kausthubk @richmond @lu.riera @chris @yashikajain201) that I would love to hear from.