Introduction:
This post explains a problem I ran into while developing a Deep Neural Network (with TensorFlow) for multi-label classification and applying the popular regularization technique called Dropout.
It tells the story of the problem I faced, how I asked for help on Stack Overflow (with a badly formulated question), and how I was later able to solve my own problem with the help of experimentation and the great tool TensorBoard.
Source: my own Stack Overflow question and answer
The story begins:
I have implemented a pretty simple Deep Neural Network to perform multi-label classification. Here's an overview of the model (biases omitted to keep the visualization simple):
That is, a 3-layer deep neural network with ReLU units and sigmoid output units.
The loss function is sigmoid cross-entropy and the optimizer is Adam.
The full code is available in the Stack Overflow question; I don't reproduce it here to keep the post from getting too long.
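For context, here is a minimal sketch of what such a network might look like in TensorFlow 1.x; the layer sizes, variable names, and the keep_prob placeholder are illustrative, not the exact code from the question.

```python
import tensorflow as tf

# Illustrative sizes; the real dataset has 6 output classes.
n_features, n_hidden, n_classes = 100, 64, 6

x = tf.placeholder(tf.float32, [None, n_features])
y = tf.placeholder(tf.float32, [None, n_classes])   # multi-label targets: 0/1 per class
keep_prob = tf.placeholder(tf.float32)              # dropout keep probability

def dense(inputs, units, activation=None):
    """Fully connected layer with the (problematic) wide initialization."""
    w = tf.Variable(tf.truncated_normal([int(inputs.shape[1]), units], mean=0.0, stddev=1.0))
    b = tf.Variable(tf.zeros([units]))
    z = tf.matmul(inputs, w) + b
    return activation(z) if activation is not None else z

h1 = tf.nn.dropout(dense(x, n_hidden, tf.nn.relu), keep_prob)
h2 = tf.nn.dropout(dense(h1, n_hidden, tf.nn.relu), keep_prob)
h3 = tf.nn.dropout(dense(h2, n_hidden, tf.nn.relu), keep_prob)
logits = dense(h3, n_classes)                       # sigmoid is applied inside the loss

loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)
predictions = tf.nn.sigmoid(logits)                 # per-class probabilities
```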
Without dropout
When I train this NN without Dropout I get the following results on test data:
ROC AUC - micro average: 0.6474180196222774
ROC AUC - macro average: 0.6261438437099212
Precision - micro average: 0.5112489722699753
Precision - macro average: 0.48922193879411413
Precision - weighted average: 0.5131092162035961
Recall - micro average: 0.584640369246549
Recall - macro average: 0.55746897003228
Recall - weighted average: 0.584640369246549
With dropout
When I train this NN with Dropout I get the following results on test data:
ROC AUC - micro average: 0.5
ROC AUC - macro average: 0.5
Precision - micro average: 0.34146163499985405
Precision - macro average: 0.34146163499985405
Precision - weighted average: 0.3712475781926326
Recall - micro average: 1.0
Recall - macro average: 1.0
Recall - weighted average: 1.0
As you can see from the recall values in the Dropout version, the NN output is always 1: it predicts the positive class for every class of every sample.
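As a side note, metrics like the ones above are typically computed with scikit-learn; a minimal sketch, assuming y_test holds the binary ground-truth labels and test_probs the sigmoid outputs of the model:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score

# y_test:     (n_samples, 6) binary ground-truth matrix
# test_probs: (n_samples, 6) sigmoid outputs of the network
y_pred = (test_probs >= 0.5).astype(np.int32)   # threshold each class at 0.5

print("ROC AUC - micro average:", roc_auc_score(y_test, test_probs, average="micro"))
print("ROC AUC - macro average:", roc_auc_score(y_test, test_probs, average="macro"))
for avg in ("micro", "macro", "weighted"):
    print(f"Precision - {avg} average:", precision_score(y_test, y_pred, average=avg))
    print(f"Recall - {avg} average:", recall_score(y_test, y_pred, average=avg))
```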
Question
It's true that it's not an easy problem, but after applying Dropout I expected at least similar results to the version without Dropout, not worse results, and certainly not this saturated output.
Why could this be happening? How can I avoid this behaviour? Do you see anything strange or wrong in the code?
Hyperparameters:
Dropout keep probability: 0.5 @ training / 1.0 @ inference (fed as shown in the sketch after this list)
Epochs: 500
Learning rate: 0.0001
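In TensorFlow 1.x this 0.5 / 1.0 scheme is usually handled by feeding the keep_prob placeholder with different values at train and test time; a short sketch, assuming the session and tensors from the model above:

```python
# Training step: keep 50% of the activations (tf.nn.dropout rescales the rest).
sess.run(train_op, feed_dict={x: x_batch, y: y_batch, keep_prob: 0.5})

# Inference: keep everything, i.e. effectively disable dropout.
test_probs = sess.run(predictions, feed_dict={x: x_test, keep_prob: 1.0})
```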
Dataset information:
Number of instances: more than 22,000
Number of classes: 6
Answer
And then I found the answer...
In the end I managed to solve my own question with some more experimentation; this is what I figured out.
I exported the TensorBoard graph along with the weights, biases, and activations so I could explore them in TensorBoard.
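In TensorFlow 1.x that kind of inspection is usually wired up with summary ops; roughly like this, with the names and tensors here being illustrative rather than the exact code I used:

```python
# Attach histogram summaries to whatever should show up in TensorBoard.
tf.summary.histogram("layer1/weights", w1)
tf.summary.histogram("layer1/biases", b1)
tf.summary.histogram("layer1/activations", h1)

merged = tf.summary.merge_all()
writer = tf.summary.FileWriter("logs/run1", sess.graph)   # also exports the graph

# Inside the training loop:
summary, _ = sess.run([merged, train_op],
                      feed_dict={x: x_batch, y: y_batch, keep_prob: 0.5})
writer.add_summary(summary, step)

# Then launch it with: tensorboard --logdir logs
```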
Then I realized that something wasn't right with the weights.
As you can see, the weights weren't changing at all. In other words, that layer "wasn't learning" anything.
But then the explanation was right in front of my eyes: the distribution of the weights was far too broad. Look at that histogram range, roughly [-2, 2]; that's too much.
Then I realized that I was initializing the weight matrices with
tf.truncated_normal(shape, mean=0.0, stddev=1.0)
which is a really high standard deviation for a sensible init. The obvious fix was to initialize the weights properly, so I switched to Xavier/Glorot initialization, and the weights then looked like this:
And the predictions went from being all positive back to a mix of positive and negative predictions. And, of course, with better performance on the test set thanks to Dropout.
In summary, the net without Dropout was able to learn something even with that overly broad initialization, but the net with Dropout wasn't, and needed a better initialization to avoid getting stuck.
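In code, the fix boils down to swapping the initializer; a sketch in TensorFlow 1.x, where the Xavier/Glorot initializer lived in tf.contrib.layers and the shapes here are illustrative:

```python
# Before: very wide initialization, standard deviation 1.0.
w = tf.Variable(tf.truncated_normal([n_in, n_out], mean=0.0, stddev=1.0))

# After: Xavier/Glorot initialization, variance scaled to the layer's fan-in and fan-out.
w = tf.get_variable("w", shape=[n_in, n_out],
                    initializer=tf.contrib.layers.xavier_initializer())
```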
Conclusion
I've written this post this way, instead of writing a post about "the importance of correct initialization of DNNs with Dropout", because I think the lesson is the same either way, but knowing the whole story exactly as it happened may be even more helpful.
This post describes exactly how I encountered the problem, how I failed to find the answer because I was asking the wrong question (blaming Dropout), and how I was then able to find the right answer by exploring and analyzing the data I had with the available tools.
In general, that's a good strategy for solving this kind of problem. That's why I think this post could help people struggling with Dropout and DNN initialization, as well as people struggling with other problems of the same kind.
I hope you find this useful!