You have applied the equations correctly. The problem is that you are mixing up the definitions of the softmax and the sigmoid function.
The softmax function is a way to normalize your data, making outliers "less interesting." In addition, it "squashes" your input vector in such a way that the sum of its elements is guaranteed to be 1.
In your example:
```
>>> np.sum([0.09003057, 0.24472847, 0.66524096])
1.0
```
This is just a generalization of the logistic function, with the additional constraint that every element of the output vector lies in the interval (0, 1) and that the elements sum to 1.0.
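For reference, here is a minimal softmax sketch; the input vector `[1.0, 2.0, 3.0]` is an assumption on my part (it is not in your question), but it happens to reproduce the three values quoted above:

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability, then normalize the exponentials
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

print(softmax(np.array([1.0, 2.0, 3.0])))  # assumed input, not from the question
# -> [0.09003057 0.24472847 0.66524096], which sums to 1.0
```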
The sigmoid function is another special case of the logistic function. It is just a real-valued, differentiable function with a characteristic S-shaped curve. It is interesting for neural networks because it is cheap to compute, non-linear, and bounded from below and above, so your activations cannot diverge; however, it saturates when the input gets "too large".
However, the sigmoid function does not guarantee that the outputs for an input vector sum to 1.0.
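A quick sketch of that point, again using the assumed input `[1.0, 2.0, 3.0]` (not from your question):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])      # assumed example input
sig = 1.0 / (1.0 + np.exp(-x))     # element-wise sigmoid
print(sig)                         # -> [0.73105858 0.88079708 0.95257413]
print(np.sum(sig))                 # -> ~2.564, clearly not 1.0
```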
In neural networks, the sigmoid is often used as an activation function for individual neurons, while a sigmoid/softmax normalization function is commonly used at the output layer, so that the outputs of the whole layer sum to 1. You have simply mixed up the sigmoid function (for single neurons) with sigmoid/softmax normalization functions (for the whole layer).
EDIT: To make this concrete, here is a simple example with outliers that demonstrates the behavior of the two functions.
Let's start with the sigmoid function:
```python
import numpy as np

def s(x):
    # element-wise logistic sigmoid
    return 1.0 / (1.0 + np.exp(-x))
```
And the normalized version (in small steps, to make it easier to read):
```python
def sn(x):
    # standardize the input (subtract the mean, divide by the standard deviation),
    # then pass the result through the sigmoid
    numerator = x - np.mean(x)
    denominator = np.std(x)
    fraction = numerator / denominator
    return 1.0 / (1.0 + np.exp(-fraction))
```
Now let's define some measurements of something with huge outliers:
```python
measure = np.array([0.01, 0.2, 0.5, 0.6, 0.7, 1.0, 2.5, 5.0, 50.0, 5000.0])
```
Now let's look at the results that s (sigmoid) and sn (normalized sigmoid) give:
```
>>> s(measure)
array([ 0.50249998,  0.549834  ,  0.62245933,  0.64565631,  0.66818777,
        0.73105858,  0.92414182,  0.99330715,  1.        ,  1.        ])
>>> sn(measure)
array([ 0.41634425,  0.41637507,  0.41642373,  0.41643996,  0.41645618,
        0.41650485,  0.41674821,  0.41715391,  0.42447515,  0.9525677 ])
```
As you can see, s just passes the values through the logistic function one by one, so the outliers all get squashed to 0.99, 1.0, 1.0, while the distances between the other values vary.
Looking at sn, we see that the function actually normalized our values: they are now nearly identical, except for the 0.95, which corresponds to the 5000.0.
What is this useful for, and how should you interpret it?
Think about the output layer of a neural network: an activation of 5000.0 for one class at the output layer (compared to the much smaller other values) means that the network is really sure that this is the "right" class for the given input. If you had used s there, you would end up with 0.99, 1.0, and 1.0 and could not tell which class is the right guess for your input.
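To make that last point visible in code (this reuses the s, sn, and measure defined above and is only a sketch of the idea):

```python
# Three largest outputs per function: with s they are almost indistinguishable,
# with sn the "winning" class clearly stands out.
print(np.sort(s(measure))[-3:])   # -> [0.99330715 1.         1.        ]
print(np.sort(sn(measure))[-3:])  # -> [0.41715391 0.42447515 0.9525677 ]
```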