Multilayer perceptron activation function

I put together a simple feed-forward neural network to learn the XOR function. When I use tanh(x) as the activation function, with the derivative 1 - tanh(x)^2, I get the correct result after about 1,000 iterations. However, when I use the logistic function g(x) = 1/(1 + e^(-x)) as the activation function, with the derivative g(x)*(1 - g(x)), I need about 50,000 iterations to get the correct result. What could be the reason?
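For reference, here is a minimal sketch of the kind of setup I mean (NumPy; the 2-4-1 layout, learning rate 0.5 and 5,000 epochs are illustrative choices, not my exact code):

    import numpy as np

    def tanh(x):      return np.tanh(x)
    def dtanh(y):     return 1.0 - y ** 2        # derivative, given the output y = tanh(x)
    def logistic(x):  return 1.0 / (1.0 + np.exp(-x))
    def dlogistic(y): return y * (1.0 - y)       # derivative, given the output y = g(x)

    def train_xor(act, dact, epochs=5000, lr=0.5, hidden=4, seed=0):
        rng = np.random.default_rng(seed)
        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
        T = np.array([[0], [1], [1], [0]], dtype=float)
        W1 = rng.uniform(-1, 1, (2, hidden)); b1 = np.zeros(hidden)
        W2 = rng.uniform(-1, 1, (hidden, 1)); b2 = np.zeros(1)
        for _ in range(epochs):
            # forward pass
            h = act(X @ W1 + b1)
            y = act(h @ W2 + b2)
            # backward pass (squared-error loss)
            d_y = (y - T) * dact(y)
            d_h = (d_y @ W2.T) * dact(h)
            # plain gradient-descent updates
            W2 -= lr * (h.T @ d_y); b2 -= lr * d_y.sum(axis=0)
            W1 -= lr * (X.T @ d_h); b1 -= lr * d_h.sum(axis=0)
        return y.ravel()

    # the logistic version typically needs far more epochs to reach the same error
    print("tanh after 5000 epochs:    ", train_xor(tanh, dtanh))
    print("logistic after 5000 epochs:", train_xor(logistic, dlogistic))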

Thanks.

1 answer

Yes, what you are observing is real. I have made similar observations when training neural networks with backpropagation. For the XOR problem, I set up a 2x20x2 network; with the logistic function it takes 3,000+ epochs to reach the result below:

    [0, 0] -> [0.049170633762142486]
    [0, 1] -> [0.947292007836417]
    [1, 0] -> [0.9451808598939389]
    [1, 1] -> [0.060643862846171494]

When using tanh as the activation function, here is the result after 800 epochs. tanh converges consistently faster than the logistic function.

    [0, 0] -> [-0.0862215901296476]
    [0, 1] -> [0.9777578145233919]
    [1, 0] -> [0.9777632805205176]
    [1, 1] -> [0.12637838259658932]

The shapes of the two functions are shown below (credit: Efficient BackProp):

[Figure: the logistic and tanh activation functions]

  • The left one is the standard logistic function: 1/(1 + e^(-x)).
  • The right one is the tanh function, also known as the hyperbolic tangent.

It is easy to see that tanh is antisymmetric with respect to the origin.
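A quick numerical check of this (NumPy, purely illustrative): tanh is an odd function, the logistic function is instead symmetric about (0, 0.5), and the two differ only by a shift and rescale of the output, since tanh(x) = 2*g(2x) - 1.

    import numpy as np

    logistic = lambda x: 1.0 / (1.0 + np.exp(-x))
    x = np.linspace(-3, 3, 7)

    # tanh is antisymmetric (odd): tanh(-x) == -tanh(x)
    print(np.allclose(np.tanh(-x), -np.tanh(x)))                    # True

    # the logistic function is not odd; it satisfies g(-x) = 1 - g(x)
    print(np.allclose(logistic(-x), 1.0 - logistic(x)))             # True

    # the two are related by a shift and rescale: tanh(x) = 2*g(2x) - 1
    print(np.allclose(np.tanh(x), 2.0 * logistic(2.0 * x) - 1.0))   # True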

According to Efficient BackProp:

Symmetric sigmoids, such as tanh, often converge faster than the standard logistic function.

Also from the wiki article on Logistic Regression:

Practitioners caution that sigmoidal functions that are antisymmetric with respect to the origin (for example, the hyperbolic tangent) lead to faster convergence when training networks with backpropagation.

See Efficient BackProp for a more detailed explanation of the intuition.

See the Elliott function for an alternative to tanh that is cheaper to compute. It is shown below as the black curve (blue is the original tanh).
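For reference, a small sketch of the symmetric Elliott function next to tanh (NumPy; the form x/(1 + |x|) and its derivative 1/(1 + |x|)^2 are the standard definitions, everything else is illustrative):

    import numpy as np

    # Symmetric Elliott function: same S-shape and (-1, 1) range as tanh,
    # but avoids the exponential, so each evaluation is cheaper.
    def elliott(x):
        return x / (1.0 + np.abs(x))

    def delliott(x):
        return 1.0 / (1.0 + np.abs(x)) ** 2

    x = np.linspace(-5, 5, 11)
    print(np.round(elliott(x), 3))
    print(np.round(np.tanh(x), 3))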

[Figure: Elliott function (black) vs. tanh (blue)]

[Figure: training comparison of tanh and Elliott]

Two things should be noted from the above graph. First, tanh usually needed fewer iterations to train than Elliott, so per-iteration training accuracy is not as good with Elliott (for example, on the encoder benchmark). However, look at the training time: even with the extra iterations it had to perform, Elliott completed the whole task in about half the time of tanh. Because each iteration is much cheaper to compute, the total training time is roughly halved while reaching the same final training error.
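The per-iteration speed difference is easy to check with a micro-benchmark (illustrative only; the actual speedup depends on hardware and on the rest of the training loop):

    import timeit
    import numpy as np

    x = np.random.default_rng(0).standard_normal(1_000_000)

    # compare 100 evaluations of each activation on the same array
    t_tanh    = timeit.timeit(lambda: np.tanh(x), number=100)
    t_elliott = timeit.timeit(lambda: x / (1.0 + np.abs(x)), number=100)
    print(f"tanh:    {t_tanh:.3f} s")
    print(f"elliott: {t_elliott:.3f} s")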
