Manual Of Activations in Deep Learning

For any Machine Learning model, one of the most critical decisions is the choice of activation function. Let’s go through all the activation functions you’d ever wanna know about, listed in order of increasing usage.

Here’s a quick list if you wanna jump right into one.

  1. Identity function and binary step
  2. Sigmoid
  3. Softmax
  4. Hyperbolic tangent (tanH)
  5. SoftPlus
  6. Exponential Linear Unit (ELU) and its Scaled version, SELU
  7. Rectified Linear Unit (ReLU) and its variations.

With so much to cover, let’s start now.

The identity function is one that simply gives the input back as the output. Therefore,

f(x) = x

This function is very rarely used these days, and it is doubtful that you’d ever want to use it.

The graph of the Identity function is simply the straight line y = x.

The binary step function returns 1 if the input is positive and 0 otherwise. It, too, is a rarely used function. Thus,

f(x) = 1 if x > 0
f(x) = 0 if x ≤ 0

Its graph is a step that jumps from 0 to 1 at the origin.

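If you prefer to see things in code, here is a minimal NumPy sketch of both functions (the helper names are mine, not standard library functions):

```python
import numpy as np

def identity(x):
    # Returns the input unchanged: f(x) = x
    return x

def binary_step(x):
    # 1 for positive inputs, 0 otherwise
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(identity(x))     # the input, unchanged
print(binary_step(x))  # [0. 0. 0. 1.]
```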

The sigmoid function was previously the most used activation in all of Machine Learning. It maps any real number to a value between 0 and 1, producing the familiar S-shaped curve.

Mathematically,

sigmoid(x) = 1 / (1 + e^(-x))

The sigmoid function is now mostly limited to Logistic Regression and to the output nodes of Neural Nets for binary classification problems (where the target is 0 or 1), although earlier it was also used in hidden units.
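
As a quick illustration, here is a minimal hand-rolled NumPy sketch of the sigmoid; in practice you’d normally use your framework’s built-in version:

```python
import numpy as np

def sigmoid(x):
    # Maps any real number into (0, 1): 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-4.0, 0.0, 4.0])
print(sigmoid(x))  # approximately [0.018 0.5 0.982]
```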

The softmax function converts the individual components of a vector into probabilities that sum to one. Mathematically,

softmax(x)_i = e^(x_i) / Σ_j e^(x_j)

It is thus also called the Normalized Exponential function.

Individually, each component is passed through the exponential function before the normalization.

It finds application in the output node of Neural Nets for multi-class classification problems.
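
Here is a minimal NumPy sketch of the softmax. Subtracting the maximum before exponentiating is a common numerical-stability trick of my own choosing here; it does not change the result:

```python
import numpy as np

def softmax(x):
    # Exponentiate, then normalize so the outputs sum to 1.
    # Subtracting max(x) avoids overflow without changing the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```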

The tanh function was the successor of the sigmoid, as it consistently gave better performance in the hidden layers of Neural Nets. Its graph is very similar to the sigmoid’s, except that it saturates at -1 and 1 and is centred at 0.

Mathematically it is the ratio of the hyperbolic sine and cosine,

tanh(x) = sinh(x) / cosh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Nowadays, the tanh function is used less often in hidden layers, although some specific models still use it. Problems whose outputs need to lie in the range -1 to 1 use tanh in the output node.
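
A minimal sketch using NumPy’s built-in tanh, checked against the sinh/cosh ratio from the formula above:

```python
import numpy as np

x = np.array([-2.0, 0.0, 2.0])

# Built-in tanh, and the same values computed as sinh(x) / cosh(x)
print(np.tanh(x))               # approximately [-0.964 0. 0.964]
print(np.sinh(x) / np.cosh(x))  # same values
```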

The SoftPlus function is a softer, smoother version of the ReLU that you’ll see later.
Mathematically,

f(x) = ln(1 + e^x)

Where the ReLU gives zero gradients (for negative inputs), the SoftPlus still provides a smooth, non-zero gradient, so in that case you may use SoftPlus instead of ReLU. It should be noted that the SoftPlus function is computationally much more expensive.
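
A minimal NumPy sketch of the formula; note that real implementations usually add extra guards against overflow for large inputs, which I have left out here:

```python
import numpy as np

def softplus(x):
    # Smooth approximation of ReLU: ln(1 + e^x)
    return np.log1p(np.exp(x))

x = np.array([-3.0, 0.0, 3.0])
print(softplus(x))  # approximately [0.049 0.693 3.049]
```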

The ELU, or Exponential Linear Unit, is a newer, highly accurate and well-used activation function for hidden layers. It is a parameterized function, i.e. it has a parameter (technically a hyper-parameter, or tunable parameter) called alpha, symbol α. The ELU returns the input itself if it is positive, and otherwise returns alpha multiplied by the exponential of the input minus 1.

Mathematically,

f(x) = x if x > 0
f(x) = α(e^x - 1) if x ≤ 0

where α is usually a number between 0.1 and 0.3

The ELU has the potential to reach better accuracy than the ReLU. However, it is more computationally expensive.
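
A minimal NumPy sketch, with alpha defaulting to 0.1 (the low end of the range mentioned above); the function name is mine:

```python
import numpy as np

def elu(x, alpha=0.1):
    # x for positive inputs, alpha * (e^x - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, 0.0, 2.0])
print(elu(x))  # approximately [-0.086 0. 2.]
```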

The SELU, or Scaled Exponential Linear Unit, is a modification of the ELU that helps the network normalize its activations and can further improve accuracy. An additional scaling hyperparameter lambda is added, symbol λ. The SELU is given as,

f(x) = λx if x > 0
f(x) = λα(e^x - 1) if x ≤ 0

The standard values are α ≈ 1.67326 and λ ≈ 1.0507.

The SELU often works better than the ELU, yet, due to the added multiplication, it is even more computationally expensive.
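
A minimal NumPy sketch using the constants quoted above:

```python
import numpy as np

# Standard SELU constants, as quoted above
ALPHA = 1.67326
LAMBDA = 1.0507

def selu(x):
    # lambda * x for positive inputs, lambda * alpha * (e^x - 1) otherwise
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

x = np.array([-2.0, 0.0, 2.0])
print(selu(x))  # approximately [-1.52 0. 2.101]
```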

One of the most commonly used activation functions nowadays is the Rectified Linear Unit, or ReLU. What makes it so attractive is its sheer simplicity and effectiveness. The function simply eliminates negative values by setting them to zero, and retains positive inputs unchanged.

Mathematically,

f(x) = max(0, x)

The ReLU, although it often performs slightly worse than the ELU, SELU or its own modifications, is highly computationally efficient and is thus the most used activation function.
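
For completeness, a minimal NumPy sketch of the ReLU:

```python
import numpy as np

def relu(x):
    # Clip negative values to zero, keep positive values unchanged
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))  # [0. 0. 3.]
```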

The ReLU activation function has the undesirable attribute of zeroing out gradients for negative inputs, which leads to a problem called the dying ReLU. To solve this, the Leaky ReLU, or LReLU, instead diminishes negative values by multiplying them by 0.01.

Mathematically,

f(x) = x if x > 0
f(x) = 0.01x if x ≤ 0

Although the Leaky ReLU usually finds better optima, it is computationally more expensive and thus takes more time. It is therefore used less for tasks where speed is the more critical criterion or where computational resources are limited.
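
A minimal NumPy sketch, with the 0.01 slope from the formula exposed as an argument (my own choice, for easy experimentation):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Positive inputs pass through; negative inputs are scaled by the slope
    return np.where(x > 0, x, slope * x)

x = np.array([-3.0, 0.0, 3.0])
print(leaky_relu(x))  # [-0.03 0. 3.]
```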

The PReLU, or Parameterized ReLU, turns the coefficient of x in the Leaky ReLU from a fixed number into a parameter that can be learned through backpropagation. This parameter is written as alpha, symbol α. When formulated,

f(x) = x if x > 0
f(x) = αx if x ≤ 0

The PReLU usually finds even better optima than either the ReLU or the Leaky ReLU, yet takes more epochs and more time than either. As it has the additional learnable parameter alpha, it is more computationally expensive.
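
A minimal NumPy sketch of just the forward pass; in a real network, alpha would be a trainable parameter updated by the optimizer (the 0.25 starting value below is a common initialisation, not something fixed by the formula):

```python
import numpy as np

def prelu(x, alpha):
    # Same shape as the Leaky ReLU, but alpha is a learned parameter
    # that backpropagation updates during training.
    return np.where(x > 0, x, alpha * x)

alpha = 0.25  # a common initial value; it changes as training progresses
x = np.array([-2.0, 0.0, 2.0])
print(prelu(x, alpha))  # [-0.5 0. 2.]
```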

The SReLU, or S-shaped ReLU, can learn both concave and convex functions. Specifically, it consists of three piecewise linear segments, which are defined by four learnable parameters.

Because it is a more flexible function, the SReLU can learn more complex shapes.

SReLU is formulated as

f(x) = t_r + a_r(x - t_r) if x ≥ t_r
f(x) = x if t_l < x < t_r
f(x) = t_l + a_l(x - t_l) if x ≤ t_l

where t_l, a_l, t_r and a_r are the four learnable parameters.

The SReLU is the most computationally expensive function in this list of activations. It is also one of the best on the list accuracy-wise. The SReLU may be used when computational resources are plentiful.
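
A minimal NumPy sketch of the forward pass; the four parameter values below are arbitrary numbers chosen for illustration, since in a real network they are learned (often one set per channel):

```python
import numpy as np

def srelu(x, t_l, a_l, t_r, a_r):
    # Three linear pieces joined at the learnable thresholds t_l and t_r:
    # slope a_l below t_l, identity in the middle, slope a_r above t_r.
    return np.where(x >= t_r, t_r + a_r * (x - t_r),
           np.where(x <= t_l, t_l + a_l * (x - t_l), x))

x = np.array([-3.0, 0.0, 3.0])
print(srelu(x, t_l=-1.0, a_l=0.1, t_r=1.0, a_r=2.0))  # [-1.2 0. 5.]
```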

The SineReLU is probably the newest function on the list. It was invented by Wilder Rodriguez; to learn more about it, you can read his Medium article. It gets great results, so it’s worth trying. It’s formulated as

f(x) = x if x > 0
f(x) = ε(sin(x) - cos(x)) if x ≤ 0

The epsilon mentioned typically carries a base value of 0.0025, although the author uses 0.25 for Dense layers.

In my opinion, the SineReLU can, on some datasets, get better results than all of the previous functions, so it is another one you can try out. It is computationally more expensive than the Leaky ReLU and the ReLU, although it can outperform them.
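
A minimal NumPy sketch based on the formula above, with epsilon defaulting to the 0.0025 base value mentioned earlier:

```python
import numpy as np

def sine_relu(x, epsilon=0.0025):
    # Positive inputs pass through; negative inputs get a small
    # oscillating value, epsilon * (sin(x) - cos(x)).
    return np.where(x > 0, x, epsilon * (np.sin(x) - np.cos(x)))

x = np.array([-2.0, 0.0, 2.0])
print(sine_relu(x))  # approximately [-0.0012 -0.0025 2.]
```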

There are so many choices! Which one to pick is ultimately up to you, and each one’s effectiveness will vary with your application, so test them out! With experience, you’ll develop a gut feeling for which function to use when…

I really hope this helped you. If you have any suggestions, do give feedback in the comments.

Note: All LaTeX equations are by the author.
