
In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
In [2]:
#@title MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

Explore overfit and underfit

As always, the code in this example will use the tf.keras API, which you can learn more about in the TensorFlow Keras guide.

In both of the previous examples—classifying movie reviews and predicting fuel efficiency—we saw that the accuracy of our model on the validation data would peak after training for a number of epochs, and would then start decreasing.

In other words, our model would overfit to the training data. Learning how to deal with overfitting is important. Although it's often possible to achieve high accuracy on the training set, what we really want is to develop models that generalize well to a testing set (or data they haven't seen before).

The opposite of overfitting is underfitting. Underfitting occurs when there is still room for improvement on the test data. This can happen for a number of reasons: the model is not powerful enough, is over-regularized, or has simply not been trained long enough. In each case, the network has not learned the relevant patterns in the training data.

If you train for too long, though, the model will start to overfit and learn patterns from the training data that don't generalize to the test data. We need to strike a balance. Understanding how to train for an appropriate number of epochs, as we'll explore below, is a useful skill.
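One common way to pick the number of epochs automatically is early stopping: watch the validation loss and stop once it has not improved for a few epochs. The sketch below is a plain-Python illustration of that rule, not the method used in this notebook. The helper name `early_stopping_epoch` and the `patience` value are assumptions made for illustration, and the losses are the first five validation losses from the baseline run logged later in this notebook.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # The rule never triggered: every supplied epoch was used.
    return len(val_losses), best_epoch


# Validation losses from the first five epochs of the baseline run.
baseline_val_loss = [0.3334, 0.2828, 0.2916, 0.3139, 0.3442]
stop_epoch, best_epoch = early_stopping_epoch(baseline_val_loss, patience=2)
print(stop_epoch, best_epoch)  # prints "4 2"
```

Keras ships an `EarlyStopping` callback that applies the same idea during `fit`, so in practice you would rarely write this loop by hand.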

To prevent overfitting, the best solution is to use more training data. A model trained on more data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization. These place constraints on the quantity and type of information your model can store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.

In this notebook, we'll explore two common regularization techniques—weight regularization and dropout—and use them to improve our IMDB movie review classification notebook.

In [3]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt

print(tf.__version__)
2.15.0

Download the IMDB dataset

Rather than using an embedding as in the previous notebook, here we will multi-hot encode the sentences. This model will quickly overfit to the training set. It will be used to demonstrate when overfitting occurs, and how to fight it.

Multi-hot encoding our lists means turning them into vectors of 0s and 1s. Concretely, this means, for instance, turning the sequence [3, 5] into a 10,000-dimensional vector that is all zeros except at indices 3 and 5, which are ones.
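To make the encoding concrete, here is a toy version that uses a 10-dimensional vector instead of 10,000 so the result fits on one line. The helper name `multi_hot` is hypothetical and only serves this illustration. A repeated index (3 appears twice) still produces a single 1, because the encoding records presence, not counts.

```python
import numpy as np


def multi_hot(word_indices, dimension):
    """Encode one sequence of word indices as a vector of 0s and 1s."""
    vector = np.zeros(dimension)
    # Fancy indexing sets every listed position to 1; duplicates are harmless.
    vector[word_indices] = 1.0
    return vector


encoded = multi_hot([3, 5, 3], dimension=10)
print(encoded)
```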

In [4]:
NUM_WORDS = 10000

(train_data, train_labels), (test_data, test_labels) = keras.datasets.imdb.load_data(num_words=NUM_WORDS)

def multi_hot_sequences(sequences, dimension):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, word_indices in enumerate(sequences):
        results[i, word_indices] = 1.0  # set specific indices of results[i] to 1s
    return results


train_data = multi_hot_sequences(train_data, dimension=NUM_WORDS)
test_data = multi_hot_sequences(test_data, dimension=NUM_WORDS)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 [==============================] - 2s 0us/step

Let's look at one of the resulting multi-hot vectors. The word indices are sorted by frequency, so it is expected that there are more 1-values near index zero, as we can see in this plot:

In [5]:
plt.plot(train_data[0])
Out[5]:
[<matplotlib.lines.Line2D at 0x7fbd857eca60>]

Demonstrate overfitting

The simplest way to prevent overfitting is to reduce the size of the model, i.e. the number of learnable parameters in the model (which is determined by the number of layers and the number of units per layer). In deep learning, the number of learnable parameters in a model is often referred to as the model's "capacity". Intuitively, a model with more parameters will have more "memorization capacity" and will therefore be able to easily learn a perfect dictionary-like mapping between training samples and their targets. Such a mapping has no generalization power, so it is useless for making predictions on previously unseen data.

Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge is generalization, not fitting.

On the other hand, if the network has limited memorization resources, it will not be able to learn the mapping as easily. To minimize its loss, it will have to learn compressed representations that have more predictive power. At the same time, if you make your model too small, it will have difficulty fitting to the training data. There is a balance between "too much capacity" and "not enough capacity".

Unfortunately, there is no magical formula to determine the right size or architecture of your model (in terms of the number of layers, or the right size for each layer). You will have to experiment using a series of different architectures.

To find an appropriate model size, it's best to start with relatively few layers and parameters, then begin increasing the size of the layers or adding new layers until you see diminishing returns on the validation loss. Let's try this on our movie review classification network.

We'll create a simple model using only Dense layers as a baseline, then create smaller and larger versions, and compare them.

Create a baseline model

In [6]:
baseline_model = keras.Sequential([
    # `input_shape` is only required here so that `.summary` works.
    keras.layers.Dense(16, activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

baseline_model.compile(optimizer='adam',
                       loss='binary_crossentropy',
                       metrics=['accuracy', 'binary_crossentropy'])

baseline_model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 16)                160016    
                                                                 
 dense_1 (Dense)             (None, 16)                272       
                                                                 
 dense_2 (Dense)             (None, 1)                 17        
                                                                 
=================================================================
Total params: 160305 (626.19 KB)
Trainable params: 160305 (626.19 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
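The parameter counts in the summary above follow directly from the layer sizes: a Dense layer holds one weight per input-output pair plus one bias per output unit. The short check below reproduces the numbers for the baseline model. The helper name `dense_param_count` is an assumption made for this illustration.

```python
def dense_param_count(input_units, output_units):
    """Weights (input_units * output_units) plus one bias per output unit."""
    return input_units * output_units + output_units


# Input dimension followed by the units of each Dense layer in the baseline model.
layer_sizes = [10000, 16, 16, 1]
per_layer = [dense_param_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))  # prints "[160016, 272, 17] 160305"
```

Almost all of the capacity sits in the first layer, because it connects to the 10,000-dimensional input. Changing the number of hidden units therefore changes the total parameter count roughly proportionally.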
In [7]:
baseline_history = baseline_model.fit(train_data,
                                      train_labels,
                                      epochs=20,
                                      batch_size=512,
                                      validation_data=(test_data, test_labels),
                                      verbose=2)
Epoch 1/20
49/49 - 6s - loss: 0.4752 - accuracy: 0.8166 - binary_crossentropy: 0.4752 - val_loss: 0.3334 - val_accuracy: 0.8757 - val_binary_crossentropy: 0.3334 - 6s/epoch - 125ms/step
Epoch 2/20
49/49 - 1s - loss: 0.2472 - accuracy: 0.9120 - binary_crossentropy: 0.2472 - val_loss: 0.2828 - val_accuracy: 0.8876 - val_binary_crossentropy: 0.2828 - 1s/epoch - 23ms/step
Epoch 3/20
49/49 - 1s - loss: 0.1779 - accuracy: 0.9374 - binary_crossentropy: 0.1779 - val_loss: 0.2916 - val_accuracy: 0.8851 - val_binary_crossentropy: 0.2916 - 1s/epoch - 25ms/step
Epoch 4/20
49/49 - 1s - loss: 0.1398 - accuracy: 0.9519 - binary_crossentropy: 0.1398 - val_loss: 0.3139 - val_accuracy: 0.8795 - val_binary_crossentropy: 0.3139 - 1s/epoch - 26ms/step
Epoch 5/20
49/49 - 1s - loss: 0.1114 - accuracy: 0.9643 - binary_crossentropy: 0.1114 - val_loss: 0.3442 - val_accuracy: 0.8741 - val_binary_crossentropy: 0.3442 - 1s/epoch - 27ms/step
Epoch 6/20
49/49 - 1s - loss: 0.0890 - accuracy: 0.9739 - binary_crossentropy: 0.0890 - val_loss: 0.3798 - val_accuracy: 0.8695 - val_binary_crossentropy: 0.3798 - 1s/epoch - 29ms/step
Epoch 7/20
49/49 - 1s - loss: 0.0703 - accuracy: 0.9810 - binary_crossentropy: 0.0703 - val_loss: 0.4224 - val_accuracy: 0.8650 - val_binary_crossentropy: 0.4224 - 1s/epoch - 27ms/step
Epoch 8/20
49/49 - 1s - loss: 0.0549 - accuracy: 0.9879 - binary_crossentropy: 0.0549 - val_loss: 0.4659 - val_accuracy: 0.8620 - val_binary_crossentropy: 0.4659 - 1s/epoch - 29ms/step
Epoch 9/20
49/49 - 1s - loss: 0.0425 - accuracy: 0.9912 - binary_crossentropy: 0.0425 - val_loss: 0.5064 - val_accuracy: 0.8598 - val_binary_crossentropy: 0.5064 - 1s/epoch - 27ms/step
Epoch 10/20
49/49 - 1s - loss: 0.0318 - accuracy: 0.9950 - binary_crossentropy: 0.0318 - val_loss: 0.5430 - val_accuracy: 0.8588 - val_binary_crossentropy: 0.5430 - 1s/epoch - 25ms/step
Epoch 11/20
49/49 - 1s - loss: 0.0234 - accuracy: 0.9970 - binary_crossentropy: 0.0234 - val_loss: 0.5830 - val_accuracy: 0.8581 - val_binary_crossentropy: 0.5830 - 1s/epoch - 25ms/step
Epoch 12/20
49/49 - 1s - loss: 0.0174 - accuracy: 0.9985 - binary_crossentropy: 0.0174 - val_loss: 0.6278 - val_accuracy: 0.8560 - val_binary_crossentropy: 0.6278 - 1s/epoch - 27ms/step
Epoch 13/20
49/49 - 1s - loss: 0.0129 - accuracy: 0.9992 - binary_crossentropy: 0.0129 - val_loss: 0.6637 - val_accuracy: 0.8564 - val_binary_crossentropy: 0.6637 - 1s/epoch - 25ms/step
Epoch 14/20
49/49 - 1s - loss: 0.0098 - accuracy: 0.9996 - binary_crossentropy: 0.0098 - val_loss: 0.7038 - val_accuracy: 0.8555 - val_binary_crossentropy: 0.7038 - 1s/epoch - 25ms/step
Epoch 15/20
49/49 - 1s - loss: 0.0075 - accuracy: 0.9998 - binary_crossentropy: 0.0075 - val_loss: 0.7371 - val_accuracy: 0.8552 - val_binary_crossentropy: 0.7371 - 1s/epoch - 24ms/step
Epoch 16/20
49/49 - 1s - loss: 0.0059 - accuracy: 0.9999 - binary_crossentropy: 0.0059 - val_loss: 0.7652 - val_accuracy: 0.8545 - val_binary_crossentropy: 0.7652 - 1s/epoch - 24ms/step
Epoch 17/20
49/49 - 1s - loss: 0.0048 - accuracy: 1.0000 - binary_crossentropy: 0.0048 - val_loss: 0.7929 - val_accuracy: 0.8546 - val_binary_crossentropy: 0.7929 - 1s/epoch - 25ms/step
Epoch 18/20
49/49 - 1s - loss: 0.0039 - accuracy: 1.0000 - binary_crossentropy: 0.0039 - val_loss: 0.8237 - val_accuracy: 0.8547 - val_binary_crossentropy: 0.8237 - 1s/epoch - 24ms/step
Epoch 19/20
49/49 - 1s - loss: 0.0033 - accuracy: 1.0000 - binary_crossentropy: 0.0033 - val_loss: 0.8493 - val_accuracy: 0.8535 - val_binary_crossentropy: 0.8493 - 1s/epoch - 25ms/step
Epoch 20/20
49/49 - 1s - loss: 0.0028 - accuracy: 1.0000 - binary_crossentropy: 0.0028 - val_loss: 0.8671 - val_accuracy: 0.8532 - val_binary_crossentropy: 0.8671 - 1s/epoch - 29ms/step

Create a smaller model

Let's create a model with fewer hidden units to compare against the baseline model that we just created:

In [8]:
smaller_model = keras.Sequential([
    keras.layers.Dense(4, activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dense(4, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

smaller_model.compile(optimizer='adam',
                      loss='binary_crossentropy',
                      metrics=['accuracy', 'binary_crossentropy'])

smaller_model.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_3 (Dense)             (None, 4)                 40004     
                                                                 
 dense_4 (Dense)             (None, 4)                 20        
                                                                 
 dense_5 (Dense)             (None, 1)                 5         
                                                                 
=================================================================
Total params: 40029 (156.36 KB)
Trainable params: 40029 (156.36 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

And train the model using the same data:

In [9]:
smaller_history = smaller_model.fit(train_data,
                                    train_labels,
                                    epochs=20,
                                    batch_size=512,
                                    validation_data=(test_data, test_labels),
                                    verbose=2)
Epoch 1/20
49/49 - 5s - loss: 0.5880 - accuracy: 0.7570 - binary_crossentropy: 0.5880 - val_loss: 0.4639 - val_accuracy: 0.8434 - val_binary_crossentropy: 0.4639 - 5s/epoch - 110ms/step
Epoch 2/20
49/49 - 1s - loss: 0.3648 - accuracy: 0.8840 - binary_crossentropy: 0.3648 - val_loss: 0.3424 - val_accuracy: 0.8781 - val_binary_crossentropy: 0.3424 - 1s/epoch - 24ms/step
Epoch 3/20
49/49 - 1s - loss: 0.2697 - accuracy: 0.9120 - binary_crossentropy: 0.2697 - val_loss: 0.3023 - val_accuracy: 0.8860 - val_binary_crossentropy: 0.3023 - 1s/epoch - 24ms/step
Epoch 4/20
49/49 - 2s - loss: 0.2225 - accuracy: 0.9262 - binary_crossentropy: 0.2225 - val_loss: 0.2871 - val_accuracy: 0.8878 - val_binary_crossentropy: 0.2871 - 2s/epoch - 38ms/step
Epoch 5/20
49/49 - 2s - loss: 0.1920 - accuracy: 0.9363 - binary_crossentropy: 0.1920 - val_loss: 0.2836 - val_accuracy: 0.8871 - val_binary_crossentropy: 0.2836 - 2s/epoch - 31ms/step
Epoch 6/20
49/49 - 1s - loss: 0.1698 - accuracy: 0.9446 - binary_crossentropy: 0.1698 - val_loss: 0.2860 - val_accuracy: 0.8856 - val_binary_crossentropy: 0.2860 - 1s/epoch - 26ms/step
Epoch 7/20
49/49 - 1s - loss: 0.1520 - accuracy: 0.9513 - binary_crossentropy: 0.1520 - val_loss: 0.2908 - val_accuracy: 0.8849 - val_binary_crossentropy: 0.2908 - 1s/epoch - 26ms/step
Epoch 8/20
49/49 - 1s - loss: 0.1378 - accuracy: 0.9564 - binary_crossentropy: 0.1378 - val_loss: 0.2986 - val_accuracy: 0.8827 - val_binary_crossentropy: 0.2986 - 1s/epoch - 24ms/step
Epoch 9/20
49/49 - 1s - loss: 0.1253 - accuracy: 0.9611 - binary_crossentropy: 0.1253 - val_loss: 0.3092 - val_accuracy: 0.8798 - val_binary_crossentropy: 0.3092 - 1s/epoch - 23ms/step
Epoch 10/20
49/49 - 1s - loss: 0.1148 - accuracy: 0.9654 - binary_crossentropy: 0.1148 - val_loss: 0.3203 - val_accuracy: 0.8781 - val_binary_crossentropy: 0.3203 - 1s/epoch - 24ms/step
Epoch 11/20
49/49 - 1s - loss: 0.1050 - accuracy: 0.9684 - binary_crossentropy: 0.1050 - val_loss: 0.3333 - val_accuracy: 0.8760 - val_binary_crossentropy: 0.3333 - 1s/epoch - 24ms/step
Epoch 12/20
49/49 - 1s - loss: 0.0966 - accuracy: 0.9720 - binary_crossentropy: 0.0966 - val_loss: 0.3466 - val_accuracy: 0.8730 - val_binary_crossentropy: 0.3466 - 1s/epoch - 27ms/step
Epoch 13/20
49/49 - 2s - loss: 0.0888 - accuracy: 0.9750 - binary_crossentropy: 0.0888 - val_loss: 0.3621 - val_accuracy: 0.8710 - val_binary_crossentropy: 0.3621 - 2s/epoch - 34ms/step
Epoch 14/20
49/49 - 1s - loss: 0.0822 - accuracy: 0.9774 - binary_crossentropy: 0.0822 - val_loss: 0.3771 - val_accuracy: 0.8701 - val_binary_crossentropy: 0.3771 - 1s/epoch - 24ms/step
Epoch 15/20
49/49 - 1s - loss: 0.0756 - accuracy: 0.9810 - binary_crossentropy: 0.0756 - val_loss: 0.3974 - val_accuracy: 0.8669 - val_binary_crossentropy: 0.3974 - 1s/epoch - 29ms/step
Epoch 16/20
49/49 - 1s - loss: 0.0696 - accuracy: 0.9830 - binary_crossentropy: 0.0696 - val_loss: 0.4100 - val_accuracy: 0.8664 - val_binary_crossentropy: 0.4100 - 1s/epoch - 28ms/step
Epoch 17/20
49/49 - 1s - loss: 0.0638 - accuracy: 0.9856 - binary_crossentropy: 0.0638 - val_loss: 0.4286 - val_accuracy: 0.8648 - val_binary_crossentropy: 0.4286 - 1s/epoch - 25ms/step
Epoch 18/20
49/49 - 1s - loss: 0.0586 - accuracy: 0.9882 - binary_crossentropy: 0.0586 - val_loss: 0.4475 - val_accuracy: 0.8626 - val_binary_crossentropy: 0.4475 - 1s/epoch - 26ms/step
Epoch 19/20
49/49 - 1s - loss: 0.0540 - accuracy: 0.9897 - binary_crossentropy: 0.0540 - val_loss: 0.4640 - val_accuracy: 0.8612 - val_binary_crossentropy: 0.4640 - 1s/epoch - 30ms/step
Epoch 20/20
49/49 - 1s - loss: 0.0498 - accuracy: 0.9912 - binary_crossentropy: 0.0498 - val_loss: 0.4825 - val_accuracy: 0.8600 - val_binary_crossentropy: 0.4825 - 1s/epoch - 26ms/step

Create a bigger model

As an exercise, you can create an even larger model and see how quickly it begins overfitting. For now, let's add to this benchmark a network with much more capacity, far more than the problem warrants:

In [10]:
bigger_model = keras.models.Sequential([
    keras.layers.Dense(512, activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

bigger_model.compile(optimizer='adam',
                     loss='binary_crossentropy',
                     metrics=['accuracy','binary_crossentropy'])

bigger_model.summary()
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_6 (Dense)             (None, 512)               5120512   
                                                                 
 dense_7 (Dense)             (None, 512)               262656    
                                                                 
 dense_8 (Dense)             (None, 1)                 513       
                                                                 
=================================================================
Total params: 5383681 (20.54 MB)
Trainable params: 5383681 (20.54 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

And, again, train the model using the same data:

In [11]:
bigger_history = bigger_model.fit(train_data, train_labels,
                                  epochs=20,
                                  batch_size=512,
                                  validation_data=(test_data, test_labels),
                                  verbose=2)
Epoch 1/20
49/49 - 6s - loss: 0.3423 - accuracy: 0.8476 - binary_crossentropy: 0.3423 - val_loss: 0.2902 - val_accuracy: 0.8811 - val_binary_crossentropy: 0.2902 - 6s/epoch - 115ms/step
Epoch 2/20
49/49 - 1s - loss: 0.1362 - accuracy: 0.9513 - binary_crossentropy: 0.1362 - val_loss: 0.3250 - val_accuracy: 0.8711 - val_binary_crossentropy: 0.3250 - 1s/epoch - 24ms/step
Epoch 3/20
49/49 - 1s - loss: 0.0398 - accuracy: 0.9891 - binary_crossentropy: 0.0398 - val_loss: 0.4419 - val_accuracy: 0.8719 - val_binary_crossentropy: 0.4419 - 1s/epoch - 24ms/step
Epoch 4/20
49/49 - 1s - loss: 0.0045 - accuracy: 0.9996 - binary_crossentropy: 0.0045 - val_loss: 0.6167 - val_accuracy: 0.8692 - val_binary_crossentropy: 0.6167 - 1s/epoch - 27ms/step
Epoch 5/20
49/49 - 2s - loss: 4.8458e-04 - accuracy: 1.0000 - binary_crossentropy: 4.8458e-04 - val_loss: 0.6891 - val_accuracy: 0.8707 - val_binary_crossentropy: 0.6891 - 2s/epoch - 32ms/step
Epoch 6/20
49/49 - 1s - loss: 1.9646e-04 - accuracy: 1.0000 - binary_crossentropy: 1.9646e-04 - val_loss: 0.7298 - val_accuracy: 0.8699 - val_binary_crossentropy: 0.7298 - 1s/epoch - 27ms/step
Epoch 7/20
49/49 - 1s - loss: 1.2845e-04 - accuracy: 1.0000 - binary_crossentropy: 1.2845e-04 - val_loss: 0.7565 - val_accuracy: 0.8703 - val_binary_crossentropy: 0.7565 - 1s/epoch - 26ms/step
Epoch 8/20
49/49 - 1s - loss: 9.4439e-05 - accuracy: 1.0000 - binary_crossentropy: 9.4439e-05 - val_loss: 0.7777 - val_accuracy: 0.8708 - val_binary_crossentropy: 0.7777 - 1s/epoch - 25ms/step
Epoch 9/20
49/49 - 1s - loss: 7.3420e-05 - accuracy: 1.0000 - binary_crossentropy: 7.3420e-05 - val_loss: 0.7961 - val_accuracy: 0.8702 - val_binary_crossentropy: 0.7961 - 1s/epoch - 25ms/step
Epoch 10/20
49/49 - 2s - loss: 5.9020e-05 - accuracy: 1.0000 - binary_crossentropy: 5.9020e-05 - val_loss: 0.8114 - val_accuracy: 0.8709 - val_binary_crossentropy: 0.8114 - 2s/epoch - 31ms/step
Epoch 11/20
49/49 - 1s - loss: 4.8626e-05 - accuracy: 1.0000 - binary_crossentropy: 4.8626e-05 - val_loss: 0.8254 - val_accuracy: 0.8707 - val_binary_crossentropy: 0.8254 - 1s/epoch - 24ms/step
Epoch 12/20
49/49 - 1s - loss: 4.0785e-05 - accuracy: 1.0000 - binary_crossentropy: 4.0785e-05 - val_loss: 0.8379 - val_accuracy: 0.8710 - val_binary_crossentropy: 0.8379 - 1s/epoch - 23ms/step
Epoch 13/20
49/49 - 1s - loss: 3.4767e-05 - accuracy: 1.0000 - binary_crossentropy: 3.4767e-05 - val_loss: 0.8492 - val_accuracy: 0.8711 - val_binary_crossentropy: 0.8492 - 1s/epoch - 24ms/step
Epoch 14/20
49/49 - 1s - loss: 2.9978e-05 - accuracy: 1.0000 - binary_crossentropy: 2.9978e-05 - val_loss: 0.8598 - val_accuracy: 0.8712 - val_binary_crossentropy: 0.8598 - 1s/epoch - 26ms/step
Epoch 15/20
49/49 - 2s - loss: 2.6123e-05 - accuracy: 1.0000 - binary_crossentropy: 2.6123e-05 - val_loss: 0.8697 - val_accuracy: 0.8712 - val_binary_crossentropy: 0.8697 - 2s/epoch - 36ms/step
Epoch 16/20
49/49 - 1s - loss: 2.2977e-05 - accuracy: 1.0000 - binary_crossentropy: 2.2977e-05 - val_loss: 0.8791 - val_accuracy: 0.8710 - val_binary_crossentropy: 0.8791 - 1s/epoch - 30ms/step
Epoch 17/20
49/49 - 1s - loss: 2.0338e-05 - accuracy: 1.0000 - binary_crossentropy: 2.0338e-05 - val_loss: 0.8877 - val_accuracy: 0.8712 - val_binary_crossentropy: 0.8877 - 1s/epoch - 26ms/step
Epoch 18/20
49/49 - 1s - loss: 1.8135e-05 - accuracy: 1.0000 - binary_crossentropy: 1.8135e-05 - val_loss: 0.8959 - val_accuracy: 0.8714 - val_binary_crossentropy: 0.8959 - 1s/epoch - 29ms/step
Epoch 19/20
49/49 - 1s - loss: 1.6266e-05 - accuracy: 1.0000 - binary_crossentropy: 1.6266e-05 - val_loss: 0.9040 - val_accuracy: 0.8713 - val_binary_crossentropy: 0.9040 - 1s/epoch - 26ms/step
Epoch 20/20
49/49 - 1s - loss: 1.4654e-05 - accuracy: 1.0000 - binary_crossentropy: 1.4654e-05 - val_loss: 0.9115 - val_accuracy: 0.8713 - val_binary_crossentropy: 0.9115 - 1s/epoch - 24ms/step

Plot the training and validation loss

The solid lines show the training loss, and the dashed lines show the validation loss (remember: a lower validation loss indicates a better model). Here, the smaller network begins overfitting later than the baseline model (in the run logged above, its validation loss bottoms out at epoch 5 rather than epoch 2), and its performance degrades much more slowly once it starts overfitting.

In [12]:
def plot_history(histories, key='binary_crossentropy'):
  plt.figure(figsize=(16,10))

  for name, history in histories:
    val = plt.plot(history.epoch, history.history['val_'+key],
                   '--', label=name.title()+' Val')
    plt.plot(history.epoch, history.history[key], color=val[0].get_color(),
             label=name.title()+' Train')

  plt.xlabel('Epochs')
  plt.ylabel(key.replace('_',' ').title())
  plt.legend()

  plt.xlim([0,max(history.epoch)])


plot_history([('baseline', baseline_history),
              ('smaller', smaller_history),
              ('bigger', bigger_history)])

Notice that the larger network begins overfitting almost right away, after just one epoch, and overfits much more severely. The more capacity the network has, the quicker it will be able to model the training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large difference between the training and validation loss).
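The same comparison can be read off the numbers rather than the plot. The sketch below finds the epoch with the lowest validation loss for each model and the gap between training and validation loss after the last epoch. The values are copied from the training logs above. The helper name `best_epoch` is an assumption for illustration, and with a live session you could pass `history.history['val_binary_crossentropy']` instead of the literal lists.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i]) + 1


# val_binary_crossentropy for the first six epochs, copied from the logs above.
val_loss_logs = {
    "baseline": [0.3334, 0.2828, 0.2916, 0.3139, 0.3442, 0.3798],
    "smaller": [0.4639, 0.3424, 0.3023, 0.2871, 0.2836, 0.2860],
    "bigger": [0.2902, 0.3250, 0.4419, 0.6167, 0.6891, 0.7298],
}

# (training loss, validation loss) at epoch 20, also copied from the logs.
final_losses = {
    "baseline": (0.0028, 0.8671),
    "smaller": (0.0498, 0.4825),
    "bigger": (1.4654e-05, 0.9115),
}

for name in val_loss_logs:
    train_loss, val_loss = final_losses[name]
    gap = val_loss - train_loss
    print(f"{name}: best epoch {best_epoch(val_loss_logs[name])}, "
          f"final gap {gap:.4f}")
```

In this run the validation loss is lowest at epoch 5 for the smaller model, epoch 2 for the baseline, and epoch 1 for the bigger model, while the final gap grows from roughly 0.43 to 0.86 to 0.91 as capacity increases.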

Strategies to prevent overfitting

Add weight regularization

You may be familiar with the principle of Occam's razor: given two explanations for something, the explanation most likely to be correct is the "simplest" one, the one that makes the fewest assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weight values (multiple models) that could explain the data, and simpler models are less likely to overfit than complex ones.

A "simple model" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights only to take small values, which makes the distribution of weight values more "regular". This is called "weight regularization", and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:

  • L1 regularization, where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the "L1 norm" of the weights).

  • L2 regularization, where the cost added is proportional to the square of the value of the weight coefficients (i.e. to what is called the squared "L2 norm" of the weights). L2 regularization is also called weight decay in the context of neural networks. Don't let the different name confuse you: for plain stochastic gradient descent, weight decay is mathematically equivalent to L2 regularization.

L1 regularization encourages sparsity by driving some of your weight parameters to exactly zero. L2 regularization penalizes the weight parameters without making them sparse, which is one reason why L2 is more common.
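To see how the two penalties differ, the sketch below computes both for a small, made-up weight vector. It assumes the penalty is the regularization rate times the sum of absolute values (L1) or the sum of squares (L2), which is how Keras documents its `l1` and `l2` regularizers. The helper names, the weights, and the rate are illustrative choices, not values taken from a trained model.

```python
def l1_penalty(weights, rate):
    """L1 cost: rate times the sum of absolute weight values."""
    return rate * sum(abs(w) for w in weights)


def l2_penalty(weights, rate):
    """L2 cost: rate times the sum of squared weight values."""
    return rate * sum(w * w for w in weights)


# Illustrative weights only; one of them (2.0) is much larger than the rest.
weights = [0.5, -0.25, 0.0, 2.0]
rate = 0.001

print("L1 penalty:", l1_penalty(weights, rate))
print("L2 penalty:", l2_penalty(weights, rate))
```

The single large weight accounts for 2.0 of the 2.75 total in the L1 sum but 4.0 of the 4.3125 total in the L2 sum, which shows how L2 concentrates its penalty on large weights.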

In tf.keras, weight regularization is added by passing weight regularizer instances to layers as keyword arguments. Let's add L2 weight regularization now.

In [13]:
l2_model = keras.models.Sequential([
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

l2_model.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy', 'binary_crossentropy'])

l2_model_history = l2_model.fit(train_data, train_labels,
                                epochs=20,
                                batch_size=512,
                                validation_data=(test_data, test_labels),
                                verbose=2)
Epoch 1/20
49/49 - 6s - loss: 0.5222 - accuracy: 0.8167 - binary_crossentropy: 0.4800 - val_loss: 0.3775 - val_accuracy: 0.8781 - val_binary_crossentropy: 0.3316 - 6s/epoch - 120ms/step
Epoch 2/20
49/49 - 1s - loss: 0.3053 - accuracy: 0.9092 - binary_crossentropy: 0.2549 - val_loss: 0.3369 - val_accuracy: 0.8877 - val_binary_crossentropy: 0.2835 - 1s/epoch - 24ms/step
Epoch 3/20
49/49 - 1s - loss: 0.2551 - accuracy: 0.9303 - binary_crossentropy: 0.1992 - val_loss: 0.3430 - val_accuracy: 0.8851 - val_binary_crossentropy: 0.2857 - 1s/epoch - 31ms/step
Epoch 4/20
49/49 - 1s - loss: 0.2335 - accuracy: 0.9398 - binary_crossentropy: 0.1746 - val_loss: 0.3564 - val_accuracy: 0.8817 - val_binary_crossentropy: 0.2963 - 1s/epoch - 26ms/step
Epoch 5/20
49/49 - 1s - loss: 0.2178 - accuracy: 0.9462 - binary_crossentropy: 0.1565 - val_loss: 0.3688 - val_accuracy: 0.8777 - val_binary_crossentropy: 0.3069 - 1s/epoch - 24ms/step
Epoch 6/20
49/49 - 1s - loss: 0.2059 - accuracy: 0.9528 - binary_crossentropy: 0.1432 - val_loss: 0.3832 - val_accuracy: 0.8755 - val_binary_crossentropy: 0.3201 - 1s/epoch - 26ms/step
Epoch 7/20
49/49 - 1s - loss: 0.1993 - accuracy: 0.9534 - binary_crossentropy: 0.1354 - val_loss: 0.3995 - val_accuracy: 0.8737 - val_binary_crossentropy: 0.3349 - 1s/epoch - 25ms/step
Epoch 8/20
49/49 - 1s - loss: 0.1960 - accuracy: 0.9560 - binary_crossentropy: 0.1306 - val_loss: 0.4160 - val_accuracy: 0.8687 - val_binary_crossentropy: 0.3497 - 1s/epoch - 25ms/step
Epoch 9/20
49/49 - 1s - loss: 0.1874 - accuracy: 0.9598 - binary_crossentropy: 0.1206 - val_loss: 0.4249 - val_accuracy: 0.8672 - val_binary_crossentropy: 0.3583 - 1s/epoch - 25ms/step
Epoch 10/20
49/49 - 1s - loss: 0.1818 - accuracy: 0.9619 - binary_crossentropy: 0.1149 - val_loss: 0.4363 - val_accuracy: 0.8668 - val_binary_crossentropy: 0.3690 - 1s/epoch - 25ms/step
Epoch 11/20
49/49 - 1s - loss: 0.1789 - accuracy: 0.9613 - binary_crossentropy: 0.1110 - val_loss: 0.4653 - val_accuracy: 0.8627 - val_binary_crossentropy: 0.3970 - 1s/epoch - 25ms/step
Epoch 12/20
49/49 - 2s - loss: 0.1739 - accuracy: 0.9640 - binary_crossentropy: 0.1050 - val_loss: 0.4608 - val_accuracy: 0.8629 - val_binary_crossentropy: 0.3920 - 2s/epoch - 31ms/step
Epoch 13/20
49/49 - 1s - loss: 0.1686 - accuracy: 0.9667 - binary_crossentropy: 0.0997 - val_loss: 0.4695 - val_accuracy: 0.8635 - val_binary_crossentropy: 0.4005 - 1s/epoch - 28ms/step
Epoch 14/20
49/49 - 1s - loss: 0.1669 - accuracy: 0.9670 - binary_crossentropy: 0.0978 - val_loss: 0.4849 - val_accuracy: 0.8612 - val_binary_crossentropy: 0.4154 - 1s/epoch - 24ms/step
Epoch 15/20
49/49 - 1s - loss: 0.1652 - accuracy: 0.9694 - binary_crossentropy: 0.0953 - val_loss: 0.4965 - val_accuracy: 0.8622 - val_binary_crossentropy: 0.4262 - 1s/epoch - 27ms/step
Epoch 16/20
49/49 - 1s - loss: 0.1636 - accuracy: 0.9679 - binary_crossentropy: 0.0932 - val_loss: 0.5055 - val_accuracy: 0.8591 - val_binary_crossentropy: 0.4348 - 1s/epoch - 25ms/step
Epoch 17/20
49/49 - 1s - loss: 0.1598 - accuracy: 0.9699 - binary_crossentropy: 0.0889 - val_loss: 0.5130 - val_accuracy: 0.8587 - val_binary_crossentropy: 0.4420 - 1s/epoch - 25ms/step
Epoch 18/20
49/49 - 2s - loss: 0.1563 - accuracy: 0.9716 - binary_crossentropy: 0.0854 - val_loss: 0.5330 - val_accuracy: 0.8542 - val_binary_crossentropy: 0.4619 - 2s/epoch - 33ms/step
Epoch 19/20
49/49 - 2s - loss: 0.1600 - accuracy: 0.9685 - binary_crossentropy: 0.0884 - val_loss: 0.5344 - val_accuracy: 0.8584 - val_binary_crossentropy: 0.4625 - 2s/epoch - 38ms/step
Epoch 20/20
49/49 - 1s - loss: 0.1522 - accuracy: 0.9724 - binary_crossentropy: 0.0803 - val_loss: 0.5401 - val_accuracy: 0.8571 - val_binary_crossentropy: 0.4680 - 1s/epoch - 25ms/step

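l2(0.001) means that every coefficient in the weight matrix of the layer will add 0.001 * weight_coefficient_value**2 to the total loss of the network. Note that this penalty is included in the reported loss and val_loss but not in the binary_crossentropy metric, which is why loss is consistently higher than binary_crossentropy in the log above. To compare models, look at binary_crossentropy directly, since it does not have the regularization component mixed in.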
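
As a rough check against the training log above, the regularization term can be read off as the gap between the reported `loss` and the `binary_crossentropy` metric. The sketch below only does that arithmetic on the values printed for epochs 1 and 20. `l2_penalty` is a hypothetical helper that shows how one layer's contribution would be computed, applied here to a made-up weight list.

```python
# Values copied from the training log above (epoch 1 and epoch 20).
logged = {
    1: {"loss": 0.5222, "binary_crossentropy": 0.4800},
    20: {"loss": 0.1522, "binary_crossentropy": 0.0803},
}

# The difference between the two columns is the summed L2 penalty.
penalties = {
    epoch: row["loss"] - row["binary_crossentropy"]
    for epoch, row in logged.items()
}
for epoch, penalty in penalties.items():
    print(f"epoch {epoch}: total L2 penalty ~ {penalty:.4f}")


def l2_penalty(weights, factor=0.001):
    """Hypothetical helper: factor * sum(w**2) over one layer's kernel."""
    return factor * sum(w * w for w in weights)


# Made-up kernel values, only to show the formula at work.
example_penalty = l2_penalty([0.5, -1.0, 2.0])
print(example_penalty)
```

Because the logged numbers are averages over an epoch during which the weights keep changing, this gap is an approximation of the penalty rather than an exact value.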

Here's the impact of our L2 regularization penalty:

In [14]:
plot_history([('baseline', baseline_history),
              ('l2', l2_model_history)])

As you can see, the L2 regularized model has become much more resistant to overfitting than the baseline model, even though both models have the same number of parameters.

Add dropout

Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto. Dropout, applied to a layer, consists of randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training. Let's say a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5, 1.3, 0, 1.1]. The "dropout rate" is the fraction of the features that are being zeroed out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out. To balance for the fact that more units are active than at training time, the original formulation scales the layer's output values down at test time by the keep probability (1 minus the dropout rate). The tf.keras Dropout layer instead scales the surviving outputs up by 1 / (1 - rate) during training and leaves the outputs untouched at test time, which keeps the expected output the same in both modes.
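
The mechanics can be sketched in plain Python. This is an illustration, not the Keras implementation: it uses the "inverted" variant, in which surviving activations are scaled up by 1 / (1 - rate) during training and the layer is an identity at test time, which is how the tf.keras Dropout layer is documented to behave. The `dropout` function, its signature, and the fixed seed are assumptions made for this example.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout on a list of activations (illustrative, not the Keras code)."""
    if not training or rate == 0.0:
        return list(values)  # test time: pass activations through unchanged
    keep_prob = 1.0 - rate
    # Zero each entry with probability `rate`; scale survivors by 1 / keep_prob
    # so the expected value of every entry matches the test-time output.
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
activations = [0.2, 0.5, 1.3, 0.8, 1.1]

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)
print(train_out)  # some entries zeroed, the rest doubled
print(test_out)   # identical to `activations`
```

Scaling during training rather than at test time means the model can be used for inference without any extra bookkeeping about the dropout rate.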

In tf.keras you can introduce dropout in a network via the Dropout layer, which gets applied to the output of the layer right before it.

Let's add two Dropout layers in our IMDB network to see how well they do at reducing overfitting:

In [15]:
dpt_model = keras.models.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(NUM_WORDS,)),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation='sigmoid')
])

dpt_model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy','binary_crossentropy'])

dpt_model_history = dpt_model.fit(train_data, train_labels,
                                  epochs=20,
                                  batch_size=512,
                                  validation_data=(test_data, test_labels),
                                  verbose=2)
Epoch 1/20
49/49 - 6s - loss: 0.6325 - accuracy: 0.6302 - binary_crossentropy: 0.6325 - val_loss: 0.5132 - val_accuracy: 0.8480 - val_binary_crossentropy: 0.5132 - 6s/epoch - 127ms/step
Epoch 2/20
49/49 - 2s - loss: 0.4656 - accuracy: 0.7949 - binary_crossentropy: 0.4656 - val_loss: 0.3483 - val_accuracy: 0.8798 - val_binary_crossentropy: 0.3483 - 2s/epoch - 33ms/step
Epoch 3/20
49/49 - 1s - loss: 0.3558 - accuracy: 0.8668 - binary_crossentropy: 0.3558 - val_loss: 0.2904 - val_accuracy: 0.8854 - val_binary_crossentropy: 0.2904 - 1s/epoch - 25ms/step
Epoch 4/20
49/49 - 1s - loss: 0.2922 - accuracy: 0.8982 - binary_crossentropy: 0.2922 - val_loss: 0.2716 - val_accuracy: 0.8896 - val_binary_crossentropy: 0.2716 - 1s/epoch - 24ms/step
Epoch 5/20
49/49 - 1s - loss: 0.2440 - accuracy: 0.9170 - binary_crossentropy: 0.2440 - val_loss: 0.2802 - val_accuracy: 0.8864 - val_binary_crossentropy: 0.2802 - 1s/epoch - 24ms/step
Epoch 6/20
49/49 - 1s - loss: 0.2117 - accuracy: 0.9297 - binary_crossentropy: 0.2117 - val_loss: 0.2790 - val_accuracy: 0.8862 - val_binary_crossentropy: 0.2790 - 1s/epoch - 24ms/step
Epoch 7/20
49/49 - 1s - loss: 0.1877 - accuracy: 0.9407 - binary_crossentropy: 0.1877 - val_loss: 0.2911 - val_accuracy: 0.8837 - val_binary_crossentropy: 0.2911 - 1s/epoch - 28ms/step
Epoch 8/20
49/49 - 2s - loss: 0.1659 - accuracy: 0.9465 - binary_crossentropy: 0.1659 - val_loss: 0.3263 - val_accuracy: 0.8836 - val_binary_crossentropy: 0.3263 - 2s/epoch - 31ms/step
Epoch 9/20
49/49 - 1s - loss: 0.1465 - accuracy: 0.9548 - binary_crossentropy: 0.1465 - val_loss: 0.3257 - val_accuracy: 0.8821 - val_binary_crossentropy: 0.3257 - 1s/epoch - 25ms/step
Epoch 10/20
49/49 - 1s - loss: 0.1364 - accuracy: 0.9576 - binary_crossentropy: 0.1364 - val_loss: 0.3548 - val_accuracy: 0.8802 - val_binary_crossentropy: 0.3548 - 1s/epoch - 27ms/step
Epoch 11/20
49/49 - 1s - loss: 0.1193 - accuracy: 0.9622 - binary_crossentropy: 0.1193 - val_loss: 0.3700 - val_accuracy: 0.8802 - val_binary_crossentropy: 0.3700 - 1s/epoch - 27ms/step
Epoch 12/20
49/49 - 1s - loss: 0.1105 - accuracy: 0.9648 - binary_crossentropy: 0.1105 - val_loss: 0.4042 - val_accuracy: 0.8783 - val_binary_crossentropy: 0.4042 - 1s/epoch - 25ms/step
Epoch 13/20
49/49 - 1s - loss: 0.1019 - accuracy: 0.9667 - binary_crossentropy: 0.1019 - val_loss: 0.4202 - val_accuracy: 0.8780 - val_binary_crossentropy: 0.4202 - 1s/epoch - 25ms/step
Epoch 14/20
49/49 - 2s - loss: 0.0944 - accuracy: 0.9688 - binary_crossentropy: 0.0944 - val_loss: 0.4415 - val_accuracy: 0.8781 - val_binary_crossentropy: 0.4415 - 2s/epoch - 33ms/step
Epoch 15/20
49/49 - 2s - loss: 0.0852 - accuracy: 0.9724 - binary_crossentropy: 0.0852 - val_loss: 0.4552 - val_accuracy: 0.8766 - val_binary_crossentropy: 0.4552 - 2s/epoch - 31ms/step
Epoch 16/20
49/49 - 1s - loss: 0.0832 - accuracy: 0.9730 - binary_crossentropy: 0.0832 - val_loss: 0.5005 - val_accuracy: 0.8780 - val_binary_crossentropy: 0.5005 - 1s/epoch - 26ms/step
Epoch 17/20
49/49 - 1s - loss: 0.0766 - accuracy: 0.9760 - binary_crossentropy: 0.0766 - val_loss: 0.4924 - val_accuracy: 0.8764 - val_binary_crossentropy: 0.4924 - 1s/epoch - 25ms/step
Epoch 18/20
49/49 - 1s - loss: 0.0734 - accuracy: 0.9764 - binary_crossentropy: 0.0734 - val_loss: 0.5197 - val_accuracy: 0.8748 - val_binary_crossentropy: 0.5197 - 1s/epoch - 29ms/step
Epoch 19/20
49/49 - 2s - loss: 0.0688 - accuracy: 0.9784 - binary_crossentropy: 0.0688 - val_loss: 0.5287 - val_accuracy: 0.8739 - val_binary_crossentropy: 0.5287 - 2s/epoch - 31ms/step
Epoch 20/20
49/49 - 1s - loss: 0.0667 - accuracy: 0.9777 - binary_crossentropy: 0.0667 - val_loss: 0.5471 - val_accuracy: 0.8746 - val_binary_crossentropy: 0.5471 - 1s/epoch - 24ms/step
In [16]:
plot_history([('baseline', baseline_history),
              ('dropout', dpt_model_history)])

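Adding dropout is a clear improvement over the baseline model, although the log above shows that it does not eliminate overfitting: validation cross-entropy is lowest at epoch 4 (0.2716) and climbs to 0.5471 by epoch 20.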

To recap: here are the most common ways to prevent overfitting in neural networks:

  • Get more training data.
  • Reduce the capacity of the network.
  • Add weight regularization.
  • Add dropout.

Two important approaches not covered in this guide are data augmentation and batch normalization.