Data Generators with Keras and TensorFlow on Google Colab
1. Introduction
This blog post is a tutorial on using data generators with Keras and TensorFlow on Google Colab. Data generators allow you to feed data into Keras in real time while training the model. This way, you can modify the data before feeding it to the neural network, or even load it from secondary storage. Data generators have two main use cases: (1) data augmentation and (2) loading a dataset that does not fit into RAM.
There are many posts out there that explain the use of data generators, but most of them do so in the context of a local computer. Recently, many people have started using Google Colab for machine learning projects, and using data generators with Google Colab was trickier than I expected; for example, there is a noticeable delay when loading files directly from Google Drive. Therefore, this post covers some of the dos and don'ts of using data generators with Google Colab. My research is focused on audio classification, so some of my opinions might be biased towards an audio context. Below is the definition of a data generator.
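Here is a minimal sketch of such a generator, built on keras.utils.Sequence. The class name, input dimensions, batch size, and number of classes are placeholders for illustration; adapt them to your own dataset.

```python
import numpy as np
from tensorflow import keras

class DataGenerator(keras.utils.Sequence):
    """Loads individual .npy examples from disk, one batch at a time."""

    def __init__(self, file_paths, label_paths, batch_size=32,
                 dim=(128, 128), n_classes=10, shuffle=True):
        self.file_paths = file_paths    # paths to the example .npy files
        self.label_paths = label_paths  # paths to the label .npy files (same order)
        self.batch_size = batch_size
        self.dim = dim                  # shape of a single example
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # Number of batches per epoch
        return len(self.file_paths) // self.batch_size

    def __getitem__(self, index):
        # Indices of the examples that make up this batch
        batch_idx = self.indices[index * self.batch_size:(index + 1) * self.batch_size]
        X = np.empty((self.batch_size, *self.dim))
        y = np.empty((self.batch_size,), dtype=int)
        for i, j in enumerate(batch_idx):
            X[i] = np.load(self.file_paths[j])          # each file holds one example of shape dim
            y[i] = np.load(self.label_paths[j])         # each label file holds a single class index
        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)

    def on_epoch_end(self):
        # Reshuffle the example order after every epoch
        self.indices = np.arange(len(self.file_paths))
        if self.shuffle:
            np.random.shuffle(self.indices)
```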
2. Save individual examples as NumPy arrays
Format each example in the dataset as a separate NumPy array, and store the train, validation, and test data in separate directories. If you keep thousands of files in a single directory, listing them with the ‘glob’ module becomes slow. Hence, it is a good idea to divide your dataset into sub-folders, or blocks. For example, ‘/content/train data/block-id-1/id-1.npy’, … , ‘/content/train data/block-id-3/id-642.npy’, etc. I generally store 320 examples per block.
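A sketch of writing the blocks to disk is shown below. The save_in_blocks helper and the root path are hypothetical; only the file naming follows the pattern above.

```python
import os
import numpy as np

BLOCK_SIZE = 320  # examples per block

def save_in_blocks(examples, root='/content/train data'):
    """Save each example as an individual .npy file, BLOCK_SIZE files per sub-folder."""
    for idx, x in enumerate(examples):
        block = idx // BLOCK_SIZE + 1
        block_dir = os.path.join(root, f'block-id-{block}')
        os.makedirs(block_dir, exist_ok=True)
        np.save(os.path.join(block_dir, f'id-{idx + 1}.npy'), x)
```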
The default sort in Python is not a natural sort; for instance, ‘id-100.npy’ would be placed before ‘id-2.npy’. Therefore, below is a function that performs a natural sort.
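A small helper along these lines does the trick; the name natural_sort is my own.

```python
import re

def natural_sort(file_list):
    """Sort file names so that 'id-2.npy' comes before 'id-100.npy'."""
    def key(path):
        # Split the path into digit and non-digit chunks; compare the digit chunks numerically
        return [int(tok) if tok.isdigit() else tok.lower() for tok in re.split(r'(\d+)', path)]
    return sorted(file_list, key=key)
```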
You can store the labels as NumPy files within the same folders as the training examples, for example ‘/content/train data/block-id-1/id-label-1.npy’. Below is a code snippet that uses the glob module to collect the train and validation file paths.
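The following sketch reuses the natural_sort helper from above; the directory names are the hypothetical ones used throughout this post.

```python
import os
import glob

def collect_paths(root):
    """Return the example file paths under 'root', ordered block by block."""
    paths = []
    for block_dir in natural_sort(glob.glob(os.path.join(root, 'block-id-*'))):
        paths.extend(natural_sort(glob.glob(os.path.join(block_dir, 'id-[0-9]*.npy'))))
    return paths

partition = {
    'train': collect_paths('/content/train data'),
    'validation': collect_paths('/content/validation data'),
}

# Each label file sits next to its example and follows the 'id-label-<n>.npy' pattern
labels = {
    split: [os.path.join(os.path.dirname(p),
                         os.path.basename(p).replace('id-', 'id-label-'))
            for p in paths]
    for split, paths in partition.items()
}
```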
As you can see in the code snippet, the train and validation sets are defined inside a dictionary called ‘partition’. The partition for the test set would be built in the same way as the validation set. The generators for train and validation are then declared as shown below.
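Something like the following, where DataGenerator is the class sketched in the introduction and the parameter values are placeholders:

```python
# Placeholder parameters; match dim and n_classes to your data
params = {'batch_size': 128, 'dim': (128, 128), 'n_classes': 10, 'shuffle': True}

training_generator = DataGenerator(partition['train'], labels['train'], **params)
validation_generator = DataGenerator(partition['validation'], labels['validation'], **params)

# The generators are then passed straight to model.fit, e.g.:
# model.fit(training_generator, validation_data=validation_generator, epochs=50)
```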
3. Save the database as a zipped file
In my experience, the best way to store data in Google Drive is as a .zip file. When you want to import the database into the Colab notebook, there are two ways of doing it: (1) extract the zip file directly from the mounted drive, as shown below, or (2) use a !wget command to download it to the notebook and subsequently extract it. For the second method, you will have to create a shareable link for the zip file in Drive. The first method is the most convenient and works fine in most cases. However, if your zip file is >10 GB and you extract it directly from the drive multiple times, the Google Drive API begins to fail. There are some posts on Stack Overflow that discuss the read/write limits imposed by the Google Drive API, and this is a problem I have personally encountered as well. Hence, if you have a lot of data, consider using the !wget command.
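Both options are sketched below. The archive path and the file id are placeholders; replace them with your own.

```python
from google.colab import drive

# Method 1: mount the drive and extract the archive directly from it
drive.mount('/content/drive')
!unzip -q "/content/drive/MyDrive/dataset/train.zip" -d "/content/train data"

# Method 2: download the archive via its shareable link, then extract it locally
# (<FILE_ID> comes from the shareable link; very large files may need an extra confirmation step)
!wget -q "https://drive.google.com/uc?export=download&id=<FILE_ID>" -O train.zip
!unzip -q train.zip -d "/content/train data"
```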
It is convenient to store the train, validation, and test sets as separate .zip files. This way, you can easily extract them into separate directories on the virtual machine, for example:
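```python
# One archive per split, each extracted into its own directory on the VM (placeholder paths)
!unzip -q "/content/drive/MyDrive/dataset/train.zip" -d "/content/train data"
!unzip -q "/content/drive/MyDrive/dataset/validation.zip" -d "/content/validation data"
!unzip -q "/content/drive/MyDrive/dataset/test.zip" -d "/content/test data"
```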
4. Do not load the dataset directly from Google Drive
I learnt this the hard way. It might be tempting to store your dataset in Google Drive and load examples directly from it. In Google Colab, however, files in the mounted drive are read through the Google Drive API; they are not stored on the virtual machine itself. Therefore, having the data generator load files directly from Google Drive creates a major bottleneck while training your neural network.
5. Some Failed Approaches
Initially, I was trying to load the NumPy arrays directly from Google Drive. A hack that I came up with was to load batches instead of individual files, which reduces the number of files the Google Drive API has to read. For example, if the batch size was 128, each NumPy file would contain 128 examples. The problem with this approach is that there is no straightforward way to shuffle the data while training.