Activation functions: ReLU vs. Leaky ReLU

It’s never too late to board the ‘learning and discussing the insights’ train, so here are my two cents on my recent learnings and musings.


Before deep-diving into my specific insights, let’s get some foundation laid out with generic explanations of a few concepts, so everyone is on the same page. Leessss goooo!!

What is a Neural Network?

A neural network is a machine learning algorithm “inspired by the structure and function of the human brain” {Imitation of nature — quoting my previous article on GAN}. It consists of a collection of interconnected processing nodes or “neurons” that work together to learn patterns in data and make predictions or decisions based on that learning.

Here is a high-level overview of how a neural network functions:

  1. Data is fed into the input layer of the neural network. This data can be images, text, audio, or any other information that can be represented numerically.
  2. Each input is multiplied by a set of weights and passed through an activation function to produce an output value. The activation function determines whether the neuron should “fire” and pass on the signal to the next network layer.
  3. The output values from the first layer of neurons become the input values for the next layer of neurons. This process is repeated until the data reaches the final layer of the network, which produces the network’s prediction or decision.
  4. During training, the network adjusts the neurons' weights to minimize the error between its predicted output and the actual output. This is typically done using an optimization algorithm such as gradient descent.
  5. Once trained, the network can predict or decide on new, unseen data.

Neural networks can be used for various tasks, including image and speech recognition, natural language processing, and predictive analytics. They are particularly well-suited to tasks with complex data patterns that are difficult to capture using traditional machine learning algorithms.
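To make steps 1 through 3 concrete, here is a minimal sketch of a single forward pass through a tiny two-layer network in plain NumPy (the layer sizes and values are made up purely for illustration):

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x); ReLU itself is discussed later in the article.
    return np.maximum(0, x)

# Step 1: numeric input (e.g., 3 features of one sample)
x = np.array([0.5, -1.2, 3.0])

# Step 2: weights, bias, and an activation for the hidden layer
W1 = np.random.randn(4, 3) * 0.1   # 4 hidden neurons, 3 inputs (hypothetical sizes)
b1 = np.zeros(4)
h = relu(W1 @ x + b1)              # weighted sum + bias, then activation

# Step 3: the hidden outputs become inputs to the next (output) layer
W2 = np.random.randn(1, 4) * 0.1
b2 = np.zeros(1)
y_hat = W2 @ h + b2                # the network's prediction
print(y_hat)
```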

Now, diving further into the specifics.

What is an Activation Function?

An activation function is a mathematical function applied to the output of a neuron in a neural network. The purpose of an activation function is to introduce nonlinearity into the output of a neuron, which enables the neural network to model complex relationships between inputs and outputs.

The output of a neuron is calculated by multiplying the inputs by their respective weights, summing the results, and adding a bias term. The result of this calculation is then passed through an activation function, which transforms the output into a nonlinear form. Without an activation function, a neural network would be limited to modeling linear relationships between inputs and outputs.
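As a quick illustration of that calculation (plain NumPy, with toy numbers of my own choosing): the neuron computes a weighted sum plus a bias and only then applies the nonlinearity. Without the nonlinearity, two stacked layers collapse into a single linear map, which is exactly the limitation mentioned above.

```python
import numpy as np

x = np.array([1.0, 2.0])
W1, b1 = np.array([[0.5, -0.3], [0.2, 0.8]]), np.array([0.1, -0.1])
W2, b2 = np.array([[1.0, -1.0]]), np.array([0.05])

# Without an activation, layer2(layer1(x)) is still linear in x:
linear_two_layers = W2 @ (W1 @ x + b1) + b2
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)          # one equivalent linear layer
print(np.allclose(linear_two_layers, collapsed))     # True

# With a nonlinearity (here tanh), the composition is no longer linear:
nonlinear = W2 @ np.tanh(W1 @ x + b1) + b2
```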

Different activation functions are used in neural networks, including the sigmoid function, the hyperbolic tangent function, the rectified linear unit (ReLU) function, and many others. The choice of activation function depends on the specific requirements of the problem being solved and the characteristics of the data being used.


Now, to the specific functions we want to discuss.

What is ReLU?

ReLU stands for Rectified Linear Unit. The function is defined as f(x) = max(0, x), which returns the input value if it is positive and zero if it is negative. The output of the ReLU function is, therefore, always non-negative.
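In code, the function and its (sub)gradient are one-liners. A plain NumPy sketch:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied elementwise
    return np.maximum(0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 for negative inputs
    # (the subgradient at exactly 0 is conventionally taken as 0 here).
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```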

The ReLU function has become a popular choice for activation functions in neural networks because it is computationally efficient and largely avoids the vanishing gradient problem that can occur with other activation functions like the sigmoid or hyperbolic tangent.


The vanishing gradient problem can occur when the gradients of the activation function become very small for large or small input values, making it difficult to train the neural network effectively.
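A quick numerical illustration (again a NumPy sketch, with arbitrary sample points): the sigmoid’s gradient shrinks toward zero for large positive or negative inputs, while ReLU’s gradient stays at 1 for any positive input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_grad(x):
    return (x > 0).astype(float)

x = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
print(sigmoid_grad(x))  # roughly [4.5e-05, 6.6e-03, 0.25, 6.6e-03, 4.5e-05]
print(relu_grad(x))     # [0. 0. 0. 1. 1.]
```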

The ReLU function also has the desirable property of introducing sparsity into the network, meaning that many of the neurons in the network will be inactive for a given input, which can help to reduce overfitting and improve generalization performance.

However, ReLU can suffer from the “dying ReLU” problem, where a neuron with a large negative bias may never activate, resulting in a “dead” neuron. To avoid this, variants of ReLU have been proposed, such as Leaky ReLU, the Exponential Linear Unit (ELU), and others {moving to the next part}.

What is Leaky ReLU, and why?

The Leaky ReLU function is f(x) = max(ax, x), where x is the input to the neuron, and a is a small constant, typically set to a value like 0.01. When x is positive, the Leaky ReLU function behaves like the ReLU function, returning x. However, when x is negative, the Leaky ReLU function returns a small negative value proportional to the input x.
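Expressed in code, with the leak constant a = 0.01 chosen just as an example:

```python
import numpy as np

def leaky_relu(x, a=0.01):
    # f(x) = max(a*x, x): identity for positive x, a small slope a for negative x
    return np.where(x > 0, x, a * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))  # [-0.03  -0.005  0.     0.5    3.   ]
```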


This helps to avoid the “dying ReLU” problem with the standard ReLU function, where a neuron with a negative bias may never activate and become “dead.”

The main advantage of Leaky ReLU over the standard ReLU function is that it can help to improve the performance of deep neural networks by addressing the dying ReLU problem. By introducing a small slope for negative values of x, Leaky ReLU ensures that all neurons in the network can contribute to the output, even if their inputs are negative.

However, it is worth noting that the choice of the leakage constant a is a hyperparameter that needs to be tuned carefully, as setting it too high may cause the Leaky ReLU function to behave too much like a linear function, while setting it too low may not be enough to address the dying ReLU problem effectively.

Just an overall picture of the Dying ReLU problem:

The Dying ReLU problem can occur {while using the ReLU activation function} when the weights of a neuron are adjusted so that the bias term becomes very negative. When this happens, the neuron’s pre-activation (weighted sum plus bias) will always be negative, so its output will always be zero, which means it will not contribute to the network’s output. If many neurons in the network suffer from the Dying ReLU problem, the network’s overall capacity can be significantly reduced, which limits its ability to learn complex representations of the data.

To address the Dying ReLU problem, several variants of the ReLU activation function have been proposed, such as Leaky ReLU, the Exponential Linear Unit (ELU), and Parametric ReLU, among others. These variants produce a non-zero output for negative input values, which allows gradients to keep flowing through the neuron even when it receives negative inputs, helping to prevent the Dying ReLU problem.
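Here is a small numerical sketch of what a “dead” neuron looks like (the weights, bias, and data below are hypothetical): with a large negative bias, the ReLU neuron outputs zero and receives zero gradient for every sample, so gradient descent can never revive it, while a Leaky ReLU in the same position still passes a small gradient back.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # hypothetical inputs
w = np.array([0.1, -0.2, 0.3])          # hypothetical weights
b = -10.0                               # large negative bias

z = X @ w + b                           # pre-activation is negative for every sample
relu_out = np.maximum(0, z)
relu_grad = (z > 0).astype(float)       # gradient of ReLU wrt z

print(relu_out.max())                   # 0.0 -> the neuron never fires
print(relu_grad.sum())                  # 0.0 -> no gradient ever flows, weights never update

leaky_grad = np.where(z > 0, 1.0, 0.01) # Leaky ReLU still passes a small gradient
print(leaky_grad.min())                 # 0.01
```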


The main causes of dying ReLU are a high learning rate in the backpropagation step while updating the weights, or a large negative bias.

When to prefer ReLU over Leaky ReLU:

  1. When the neural network has a shallow architecture: ReLU is computationally efficient and simpler than Leaky ReLU, which makes it more suitable for shallow architectures.
  2. When the data is relatively clean and has few outliers: ReLU is less likely to introduce noise into the network since it only activates on positive input values. Therefore, it is suitable for datasets that have a limited amount of noise or outliers.
  3. When speed is a critical factor: Since ReLU has a simpler structure and requires fewer computations than Leaky ReLU, it can be faster to train and deploy. Therefore, it is preferred in scenarios where speed is critical, such as real-time applications.
  4. When the neural network is used for feature learning: ReLU can be more effective at learning features than Leaky ReLU, especially when used in the context of deep learning architectures. This is because ReLU encourages sparse representations, which can help to capture more informative features in the data.

When to prefer Leaky ReLU over ReLU:

  1. When the neural network has a deep architecture: Leaky ReLU can help to prevent the “Dying ReLU” problem, where some neurons may stop activating because they always receive negative input values, which is more likely to occur in deeper networks.
  2. When the data has a lot of noise or outliers: Leaky ReLU can provide a non-zero output for negative input values, which can help to avoid discarding potentially important information, and thus perform better than ReLU in scenarios where the data has a lot of noise or outliers.
  3. When generalization performance is a priority: Leaky ReLU can introduce some noise into the network, which can help to reduce overfitting and improve generalization performance. Therefore, it is preferred when generalization performance is a priority.
  4. When the neural network is used for regression tasks: Leaky ReLU can be more effective than ReLU for regression tasks, especially when the output range is not restricted to positive values since it can provide both positive and negative output values.

The choice between Leaky ReLU and ReLU depends on the specifics of the task, and it is recommended to experiment with both activation functions to determine which one works best for the particular scenario.

Additionally, other variants of ReLU and Leaky ReLU, such as the Exponential Linear Unit (ELU), Parametric ReLU, and others, may be better suited for certain scenarios.
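If you want to experiment with both, most frameworks make the swap a one-line change. A minimal sketch in PyTorch (my choice of framework, purely for illustration; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

def make_mlp(activation: nn.Module) -> nn.Sequential:
    # A small MLP where the activation is the only thing we vary.
    return nn.Sequential(
        nn.Linear(16, 32),
        activation,
        nn.Linear(32, 1),
    )

relu_net = make_mlp(nn.ReLU())
leaky_net = make_mlp(nn.LeakyReLU(negative_slope=0.01))  # the leak 'a' is a hyperparameter

x = torch.randn(8, 16)
print(relu_net(x).shape, leaky_net(x).shape)  # torch.Size([8, 1]) for both
```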

This article doesn’t follow my usual style and is more informative than insight-discussion. The main consideration I wanted to stress was penalization, and whether or not we use it depending on the scenario. This doesn’t just apply here but everywhere: as a researcher and learner, one always has to keep track of the multiple parameters that affect the performance of the overall algorithm.

Let us have a continuation article as part two, covering the Exponential Linear Unit (ELU) and Parametric ReLU, which will return to our original style of insight discussion, sometime in the future. Until then, Adios!!
