You have applied the equations correctly. The problem is that you are mixing up the definitions of the softmax and the sigmoid function.
The softmax function is a way to normalize your data, making outliers "less interesting." In addition, it "squashes" your input vector in such a way that the sum of its elements is guaranteed to be 1.
In your example:
```
>>> np.sum([0.09003057, 0.24472847, 0.66524096])
1.0
```
This is just a generalization of the logistic function, with the additional constraint that every element of the output vector lies in the interval (0, 1) and that the elements sum to 1.0.
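For reference, here is a minimal softmax sketch; the input vector `[1.0, 2.0, 3.0]` is an assumption on my part (it is not in your question), but it happens to reproduce the three values quoted above:

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability, then normalize the exponentials
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

print(softmax(np.array([1.0, 2.0, 3.0])))  # assumed input, not from the question
# -> [0.09003057 0.24472847 0.66524096], which sums to 1.0
```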
The sigmoid function is another special case of the logistic function. It is just a real-valued, differentiable function with a characteristic S-shaped curve. It is interesting for neural networks because it is cheap to compute, non-linear, and bounded from below and above, so your activations cannot diverge; however, it saturates when the input gets "too large".
However, the sigmoid function does not guarantee that the outputs for an input vector sum to 1.0.
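A quick sketch of that point, again using the assumed input `[1.0, 2.0, 3.0]` (not from your question):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])      # assumed example input
sig = 1.0 / (1.0 + np.exp(-x))     # element-wise sigmoid
print(sig)                         # -> [0.73105858 0.88079708 0.95257413]
print(np.sum(sig))                 # -> ~2.564, clearly not 1.0
```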
In neural networks, the sigmoid is often used as an activation function for individual neurons, while a sigmoid/softmax normalization function is commonly used at the output layer, so that the outputs of the whole layer sum to 1. You have simply mixed up the sigmoid function (for single neurons) with sigmoid/softmax normalization functions (for the whole layer).
EDIT: To make this concrete, here is a simple example with outliers that demonstrates the behavior of the two functions.
Let's start with the sigmoid function:
```python
import numpy as np

def s(x):
    # element-wise logistic sigmoid
    return 1.0 / (1.0 + np.exp(-x))
```
And the normalized version (in small steps, to make it easier to read):
```python
def sn(x):
    # standardize the input (subtract the mean, divide by the standard deviation),
    # then pass the result through the sigmoid
    numerator = x - np.mean(x)
    denominator = np.std(x)
    fraction = numerator / denominator
    return 1.0 / (1.0 + np.exp(-fraction))
```
Now let's define some measurements of something with huge outliers:
```python
measure = np.array([0.01, 0.2, 0.5, 0.6, 0.7, 1.0, 2.5, 5.0, 50.0, 5000.0])
```
Now let's look at the results that s (sigmoid) and sn (normalized sigmoid) give:
```
>>> s(measure)
array([ 0.50249998,  0.549834  ,  0.62245933,  0.64565631,  0.66818777,
        0.73105858,  0.92414182,  0.99330715,  1.        ,  1.        ])
>>> sn(measure)
array([ 0.41634425,  0.41637507,  0.41642373,  0.41643996,  0.41645618,
        0.41650485,  0.41674821,  0.41715391,  0.42447515,  0.9525677 ])
```
As you can see, s just passes the values through the logistic function one by one, so the outliers all get squashed to 0.99, 1.0, 1.0, while the distances between the other values vary.
Looking at sn, we see that the function actually normalized our values: they are now nearly identical, except for the 0.95, which corresponds to the 5000.0.
What is this useful for, and how should you interpret it?
Think about the output layer of a neural network: an activation of 5000.0 for one class at the output layer (compared to the much smaller other values) means that the network is really sure that this is the "right" class for the given input. If you had used s there, you would end up with 0.99, 1.0, and 1.0 and could not tell which class is the right guess for your input.
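To make that last point visible in code (this reuses the s, sn, and measure defined above and is only a sketch of the idea):

```python
# Three largest outputs per function: with s they are almost indistinguishable,
# with sn the "winning" class clearly stands out.
print(np.sort(s(measure))[-3:])   # -> [0.99330715 1.         1.        ]
print(np.sort(sn(measure))[-3:])  # -> [0.41715391 0.42447515 0.9525677 ]
```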