Implementation of Adversarial Training for Free!
Maxime's fork of the official repo with the TensorFlow 2 migration is here; the Fashion-MNIST example is in the small_batch branch.
This is a from-scratch reimplementation of the paper "Adversarial Training for Free!", done in the context of the course IFT 6756 - Game Theory and ML.
The implementation is in the src folder; the notebooks contain a few tests of the paper's claims.
The cifar notebook tests the paper's claims on CIFAR-10 with a wide ResNet. It compares training with replay, adversarial training, and training for free.
The Fashion-MNIST test case is in the small_batch branch. Due to some parallel work, it uses an older version of the code, which is why it lives in a branch. The goal was to train a model for "free" with small batches, since the authors use large batches.
Last but not least, the audio notebook is an example of audio classification based on the one from the PyTorch tutorials, with a network that resembles M5 but with two of the convolution layers replaced by ResNet blocks. I used this last notebook for more in-depth testing because testing with CIFAR is time-consuming.
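For the curious, here is a rough sketch of what such an architecture could look like. The details (channel counts, kernel sizes, where the residual blocks sit) are my own guesses following the M5 layout from the PyTorch audio tutorial; the actual network in the notebook may differ:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResBlock1d(nn.Module):
    """Basic 1-D residual block: two convolutions plus a skip connection."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn2 = nn.BatchNorm1d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)

def m5_with_resblocks(n_output=35, n_channel=32):
    # M5-style front end (large first kernel on the raw waveform),
    # with two of the plain convolutions swapped for residual blocks.
    return nn.Sequential(
        nn.Conv1d(1, n_channel, kernel_size=80, stride=16),
        nn.BatchNorm1d(n_channel), nn.ReLU(), nn.MaxPool1d(4),
        ResBlock1d(n_channel), nn.MaxPool1d(4),
        nn.Conv1d(n_channel, 2 * n_channel, kernel_size=3, padding=1),
        nn.BatchNorm1d(2 * n_channel), nn.ReLU(), nn.MaxPool1d(4),
        ResBlock1d(2 * n_channel), nn.MaxPool1d(4),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        nn.Linear(2 * n_channel, n_output),
    )
```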
The PyTorch layer AdversarialForFree can be added to any model built with nn.Sequential to get adversarial training for free. Honestly, this just felt like a wasted opportunity by the authors. Just remember to call .step() after each gradient calculation; I've wasted too many hours wondering why no training was getting done.
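The exact implementation is in src; below is a minimal sketch of the idea and the intended usage, under my own assumptions about the internals (a persistent perturbation plus an FGSM-style ascent in .step()), so the real layer may differ:

```python
import torch
import torch.nn as nn

class AdversarialForFree(nn.Module):
    """Sketch of a 'free' adversarial-perturbation layer (my reading of the
    idea; the actual layer in src may differ). It adds a persistent
    perturbation delta to its input; .step() ascends delta along the sign
    of its gradient and projects it back into the epsilon ball."""

    def __init__(self, epsilon=8 / 255):
        super().__init__()
        self.epsilon = epsilon
        self.delta = None  # lazily initialized to match the batch shape

    def forward(self, x):
        if self.delta is None or self.delta.shape != x.shape:
            self.delta = torch.zeros_like(x, requires_grad=True)
        return x + self.delta

    @torch.no_grad()
    def step(self):
        # FGSM-style ascent on the stored perturbation, then projection.
        self.delta += self.epsilon * self.delta.grad.sign()
        self.delta.clamp_(-self.epsilon, self.epsilon)
        self.delta.grad.zero_()

# Usage: prepend the layer to an nn.Sequential and call .step() after
# every backward pass (forgetting this silently disables the attack).
adv = AdversarialForFree()
model = nn.Sequential(adv, nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
for _ in range(4):  # m replays of the same minibatch
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()  # update the weights...
    adv.step()        # ...and the perturbation, from the same gradients
```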
I ran into a few problems coding this up. For instance, I read the NeurIPS version of the paper, which does not have the pseudocode. Also, a Wide ResNet with 32 layers and 10 times wider does not seem to exist: according to the default implementation, the depth should be of the form 6n + 4 (so 28 or 34, but not 32).
This layer makes the training a bit harder. In fact, I claim it makes training more complicated than regular adversarial training. Surprisingly, even though training with replay reduces CPU-GPU data transfer, it doesn't make training faster; I guess this is due to PyTorch's (and CUDA's?) asynchronous nature. The biggest downsides of this method are the lack of control over how robust the model ends up against adversarial examples, and the cost of training with replays.
The results on the CIFAR-10 dataset can be seen below. FGSM (large step) comes from the cifar notebook. All of the times are for an RTX 3090. Training ran for at most 200 iterations, with learning-rate decays at iterations 60, 120, and 180 (divided by …).
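That schedule is a one-liner with PyTorch's MultiStepLR; here is a minimal sketch, where the decay factor of 10 is an assumption on my part:

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# milestones match the schedule above; gamma=0.1 (divide by 10) is assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 120, 180], gamma=0.1)

for iteration in range(200):
    # ... one pass of training ...
    scheduler.step()  # lr: 0.1 -> 0.01 -> 0.001 -> 0.0001
```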
As can be seen, it's not clear in this example whether training for free is worth using or not. The most interesting cases are the two examples that take less than an hour to train: it's possible to reach interesting performance levels with both training methods in a short amount of time. But even in the best case for the training-for-free model (m=10), the performance and training time are pretty close to those of the PGD-2 model. To me, it's unclear which is better, especially for short training times. If time is not a problem, the obvious choice is PGD.
Since testing on the CIFAR set took around 40 minutes, I didn't test it a lot, but the audio example was cheap enough to test extensively. free-1 at the top means that the model free-1 reached its best performance at 14.4s, and the values before it are the best performances of the other models within that period. All of the numbers are either seconds or percentages. The config column represents the K in PGD-K, or m in the case of free training. As can be seen, PGD-K is the definite winner except in the first case, confirming the results of "Fast is better than free". The figures below plot the loss progression; note the logarithmic time scale. In case the audio clips don't play, they can be found in the audio folder. The variants with s at the end use either half the step size in the case of free, or one fifth of the step size in the case of PGD. The attack was only PGD-100 with early stopping (stopping once all predictions are wrong). Click on the table if it looks blurry.
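For concreteness, here is a minimal sketch of that kind of attack. The epsilon and step size alpha below are placeholder values of my own, not necessarily the ones used in the notebooks:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=100):
    """Sketch of PGD with early stopping: quit once every prediction is wrong."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        logits = model(x + delta)
        if (logits.argmax(dim=1) != y).all():
            break  # every example is already misclassified
        loss = F.cross_entropy(logits, y)
        grad, = torch.autograd.grad(loss, delta)  # don't touch model grads
        with torch.no_grad():
            delta += alpha * grad.sign()
            delta.clamp_(-epsilon, epsilon)  # project into the epsilon ball
    return (x + delta).detach()
```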