In my new project at work I had to process a fairly large set of image data for a multi-label, multi-class classification task. Despite GPU utilization being close to 100%, a single training epoch over 2 million images took close to 3.5 hours to run. This is a big issue if you’re running baseline experiments and want quick results. I first thought that, since I was processing the original full-size images, each of which was at least a few MBs, the bottleneck was disk I/O. I used ImageMagick mogrify to resize all 2 million images, which took a long time. To my astonishment, resizing the images didn’t reduce the training time at all! Well, not noticeably. So I went through the code and found that the major bottleneck was the image augmentation operations in Pytorch.
While browsing GitHub I found that people at Nvidia had recently released a library - DALI - that is supposed to tackle exactly this issue. The library is still under active development and supports fast data augmentation for all the major ML frameworks out there - Pytorch, Tensorflow, and MXNet.
Using Nvidia DALI, the data pipeline can be optimized by moving appropriate operations, such as JPEG decoding, resizing, and augmentation, to the GPU.
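To make this concrete, here is a minimal sketch of what such a pipeline can look like. It is illustrative, not the exact pipeline from my project, and it uses DALI's pre-1.0 `ops` API; operator names may differ in newer releases.

```python
# A minimal DALI pipeline sketch (illustrative, pre-1.0 ops API).
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types

class SimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, data_dir):
        super().__init__(batch_size, num_threads, device_id)
        self.input = ops.FileReader(file_root=data_dir, random_shuffle=True)
        # "mixed" splits JPEG decoding between CPU and GPU (nvJPEG)
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
        # augmentation ops run on the GPU, which is where the speedup comes from
        self.resize = ops.Resize(device="gpu", resize_x=224, resize_y=224)

    def define_graph(self):
        jpegs, labels = self.input()
        images = self.decode(jpegs)   # decoded images already live on the GPU
        images = self.resize(images)
        return images, labels
```

Further GPU-side augmentations (crop, mirror, normalize, and so on) can be chained the same way inside define_graph.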
For more details about the features of DALI, please see this beginner-friendly post by Nvidia developers titled Fast AI Data Preprocessing with NVIDIA DALI. In the rest of this post, I’ll show how to incorporate Nvidia DALI into your Pytorch code. Readers are welcome to suggest improvements to the code below.
We start by installing the required dependencies.
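DALI wheels are hosted on NVIDIA's own package index rather than on PyPI. The package name below is the one used around the time of writing; it has since been split by CUDA version (e.g. nvidia-dali-cuda110), so check NVIDIA's DALI installation guide for the name matching your setup.

```shell
# Pytorch and torchvision, if not already installed
pip install torch torchvision

# DALI from NVIDIA's package index; the package name depends on
# your CUDA version - consult the official DALI installation guide
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali
```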
With nvidia-dali installed, we can now integrate it into our Pytorch code. To create a dummy data set, we download the flower classification data provided by Udacity. The dataset contains two folders - train and valid. We use the images in the train folder, which comes organized hierarchically with one sub-folder per label, and flatten it into a single directory. We don’t use the provided labels; instead we generate dummy labels for demonstration.
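The flattening step can be sketched as follows. The function name, the choice of two dummy labels per image, and the label range are my own assumptions for demonstration; only the flatten-and-relabel idea comes from the text above.

```python
import os
import random
import shutil

def flatten(src_dir, dst_dir, num_labels=5, seed=0):
    """Copy images out of per-label sub-folders into one flat folder and
    attach random dummy labels (the real folder labels are ignored)."""
    rng = random.Random(seed)
    os.makedirs(dst_dir, exist_ok=True)
    records = []
    for label_folder in sorted(os.listdir(src_dir)):
        folder = os.path.join(src_dir, label_folder)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            # prefix with the old folder name to avoid filename clashes
            flat_name = f"{label_folder}_{name}"
            shutil.copy(os.path.join(folder, name),
                        os.path.join(dst_dir, flat_name))
            # two dummy labels per image, since the task is multi-label
            records.append((flat_name,
                            [rng.randrange(num_labels) for _ in range(2)]))
    return records
```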
Next we create a space-separated file that matches the format of the example given in the official Nvidia DALI documentation.
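A sketch of writing that file, one image per line followed by its labels. The helper name and the list-of-tuples input shape are my assumptions; the "image_name label1 label2 ..." line format follows DALI's external-source example.

```python
def write_file_list(records, list_path):
    """Write 'image_name label1 label2 ...' lines, one image per line,
    the space-separated format used by DALI's external-source example."""
    with open(list_path, "w") as f:
        for name, labels in records:
            f.write(name + " " + " ".join(str(l) for l in labels) + "\n")
```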
Next we create an ExternalInputIterator that batches our data; the DALI Pipeline uses it to read the data and feed it to the respective devices for processing. The code below has been adapted from the official code here to work with multiple labels. Thanks to Siddha Ganju for pointing me to the official tutorial.
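A sketch of such an iterator, adapted from DALI's external-source example to parse several labels per line. It returns raw, still-encoded JPEG bytes; the decoding happens later inside the DALI pipeline. The constructor arguments and the int64 label dtype are my assumptions.

```python
import os
import numpy as np

class ExternalInputIterator:
    """Yields (images, labels) batches: encoded JPEG bytes plus a label
    vector per image, parsed from a space-separated file list."""

    def __init__(self, batch_size, images_dir, file_list):
        self.batch_size = batch_size
        self.images_dir = images_dir
        with open(file_list, "r") as f:
            self.files = [line.rstrip() for line in f if line.strip()]

    def __iter__(self):
        self.i = 0
        self.n = len(self.files)
        return self

    def __next__(self):
        batch, labels = [], []
        for _ in range(self.batch_size):
            jpeg_filename, *label_ids = self.files[self.i].split(" ")
            with open(os.path.join(self.images_dir, jpeg_filename), "rb") as f:
                # keep the file encoded; DALI decodes it on the GPU later
                batch.append(np.frombuffer(f.read(), dtype=np.uint8))
            # multiple labels per image for the multi-label task
            labels.append(np.array(label_ids, dtype=np.int64))
            self.i = (self.i + 1) % self.n  # wrap around at epoch end
        return batch, labels
```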
Next we instantiate this iterator and feed it to an ExternalSourcePipeline, which extends DALI's Pipeline class and sends the data to the respective devices for the augmentation operations.
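A sketch of such a pipeline, following the pattern in DALI's official external-source tutorial (pre-1.0 ops API). The 224x224 resize is an assumption on my part; any GPU-side augmentations can be added in define_graph.

```python
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types

class ExternalSourcePipeline(Pipeline):
    def __init__(self, data_iterator, batch_size, num_threads, device_id):
        super().__init__(batch_size, num_threads, device_id, seed=12)
        self.source = data_iterator          # an ExternalInputIterator
        self.input = ops.ExternalSource()
        self.input_labels = ops.ExternalSource()
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
        self.resize = ops.Resize(device="gpu", resize_x=224, resize_y=224)

    def define_graph(self):
        self.jpegs = self.input()
        self.labels = self.input_labels()
        images = self.decode(self.jpegs)     # decoded straight onto the GPU
        images = self.resize(images)
        return images, self.labels

    def iter_setup(self):
        # pull the next batch from our iterator and feed both outputs
        images, labels = next(self.source)
        self.feed_input(self.jpegs, images)
        self.feed_input(self.labels, labels)
```

Note that the iterator passed in must already be an iterator object, i.e. wrap it with iter() before handing it to the pipeline.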
We are almost done. Now we instantiate a DALIGenericIterator that lets us iterate over the dataset just as we typically would in Pytorch.
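A sketch of the training-loop side. The output names "images" and "labels" and the epoch_size variable are illustrative; the output_map entries just name the two outputs returned by define_graph, in order.

```python
from nvidia.dali.plugin.pytorch import DALIGenericIterator

# eii: iter(ExternalInputIterator(...)); epoch_size: samples per epoch
pipe = ExternalSourcePipeline(eii, batch_size=32, num_threads=2, device_id=0)
pipe.build()
train_loader = DALIGenericIterator([pipe], ["images", "labels"], size=epoch_size)

for batch in train_loader:
    images = batch[0]["images"]   # already a CUDA tensor on the right device
    labels = batch[0]["labels"]
    # ... forward pass, loss, backward pass as usual ...
train_loader.reset()              # call at the end of every epoch
```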
I have yet to benchmark DALI in my own code and will update this post once I have the results.