How to implement the Softmax derivative independently of any loss function?

For a neural network library I implemented some activation functions and loss functions together with their derivatives. They can be combined arbitrarily, and the derivative at the output layer is simply the product of the derivative of the loss and the derivative of the activation.

However, I was not able to implement the derivative of the Softmax activation function independently of any loss function. Because of the normalization, i.e. the denominator in the equation, changing a single input activation changes all output activations, not just one.
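
To see this concretely, here is a quick numerical sketch (illustrative only, using a throwaway standalone softmax helper rather than the class below): nudging a single logit moves every output, so no purely elementwise derivative can match a gradient check exactly.

    import numpy as np

    def softmax(z):
        exps = np.exp(z - z.max())  # shift for numerical stability
        return exps / exps.sum()

    z = np.array([1.0, 2.0, 3.0])
    eps = 1e-6
    base = softmax(z)
    bumped = softmax(z + np.array([eps, 0.0, 0.0]))  # perturb only the first logit
    print((bumped - base) / eps)  # all three outputs change, not just the first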

Here is my Softmax implementation, whose derivative fails gradient checking by about 1%. How can I implement the Softmax derivative so that it can be combined with any loss function?

    import numpy as np

    class Softmax:

        def compute(self, incoming):
            exps = np.exp(incoming)
            return exps / exps.sum()

        def delta(self, incoming, outgoing):
            exps = np.exp(incoming)
            others = exps.sum() - exps
            return 1 / (2 + exps / others + others / exps)

    activation = Softmax()
    cost = SquaredError()

    outgoing = activation.compute(incoming)
    delta_output_layer = activation.delta(incoming, outgoing) * cost.delta(outgoing)

regression neural-network softmax backpropagation derivative
3 answers

Mathematically, the derivative of the Softmax output σ(j) with respect to the logit z_i (for example, z_i = W_i · X) has the form

    ∂σ(j) / ∂z_i = σ(j) · (δ_ij − σ(i))

where δ_ij is the Kronecker delta.

If you implement it iteratively:

    def softmax_grad(s):
        # input s is the softmax value of the original input x. Its shape is (1, n)
        # i.e. s = np.array([0.3, 0.7]), x = np.array([0, 1])

        # make the matrix whose size is n^2.
        jacobian_m = np.diag(s)

        for i in range(len(jacobian_m)):
            for j in range(len(jacobian_m)):
                if i == j:
                    jacobian_m[i][j] = s[i] * (1 - s[i])
                else:
                    jacobian_m[i][j] = -s[i] * s[j]

        return jacobian_m

Test:

    In [95]: x
    Out[95]: array([1, 2])

    In [96]: softmax(x)
    Out[96]: array([ 0.26894142,  0.73105858])

    In [97]: softmax_grad(softmax(x))
    Out[97]:
    array([[ 0.19661193, -0.19661193],
           [-0.19661193,  0.19661193]])
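
As an extra sanity check (a small sketch, not part of the original answer; it assumes a softmax function like the one used in the test above), central finite differences of softmax reproduce the same Jacobian:

    eps = 1e-6
    x = np.array([1.0, 2.0])
    numeric = np.empty((2, 2))
    for j in range(2):
        bump = np.zeros(2)
        bump[j] = eps
        # column j holds the derivative of every output w.r.t. logit j
        numeric[:, j] = (softmax(x + bump) - softmax(x - bump)) / (2 * eps)

    print(np.allclose(numeric, softmax_grad(softmax(x)), atol=1e-6))  # True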

If you are implementing a vectorized version:

    soft_max = softmax(x)

    # reshape softmax to 2d so np.dot gives matrix multiplication
    def softmax_grad(softmax):
        s = softmax.reshape(-1, 1)
        return np.diagflat(s) - np.dot(s, s.T)

    softmax_grad(soft_max)
    # array([[ 0.19661193, -0.19661193],
    #        [-0.19661193,  0.19661193]])
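
Equivalently (an illustrative variant, not part of the answer above), the same matrix can be built with an outer product, which avoids the reshape:

    def softmax_grad_outer(s):
        # J[i, j] = s[i] * (delta_ij - s[j]), i.e. diag(s) - s s^T
        return np.diag(s) - np.outer(s, s)

    softmax_grad_outer(soft_max)  # same matrix as above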

It should be like this (x is the input to the softmax layer, and dy is the delta coming from the loss above it):

    def delta(self, x, dy):
        y = self.compute(x)  # softmax output for the input x
        dx = y * dy
        s = dx.sum(axis=dx.ndim - 1, keepdims=True)
        dx -= y * s
        return dx

But then the error should be calculated as:

    yact = activation.compute(x)
    ycost = cost.compute(yact)
    dsoftmax = activation.delta(x, cost.delta(yact, ycost, ytrue))

Explanation: since the delta function is part of the backpropagation algorithm, its responsibility is to multiply the vector dy (in my code; outgoing in your case) by the Jacobian of the compute(x) function evaluated at x. If you work out what this Jacobian looks like for softmax [1] and then multiply it from the left by the vector dy, you will find after a bit of algebra that you get something that corresponds to my Python code.

[1] https://stats.stackexchange.com/questions/79454/softmax-layer-in-a-neural-network
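
To check that claim numerically, here is a small sketch (with made-up inputs and a standalone softmax helper, not part of the original answer): multiplying the full Jacobian by dy gives the same vector as the y * dy - y * sum(y * dy) shortcut used in delta above.

    import numpy as np

    def softmax(z):
        exps = np.exp(z - z.max())
        return exps / exps.sum()

    z = np.array([0.5, -1.0, 2.0])
    dy = np.array([0.1, -0.3, 0.2])              # pretend this came from the loss
    y = softmax(z)

    jacobian = np.diagflat(y) - np.outer(y, y)   # J[i, j] = y[i] * (delta_ij - y[j])
    via_jacobian = jacobian @ dy
    via_shortcut = y * dy - y * (y * dy).sum()

    print(np.allclose(via_jacobian, via_shortcut))  # True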


Here is a vectorized C++ version using intrinsics (22 times (!) faster than the version without SSE):

    #include <immintrin.h>  // AVX intrinsics
    #include <cassert>
    #include <cstddef>

    // How many floats fit into an __m256 "group".
    // Used by vectors and matrices, to ensure their dimensions are appropriate for
    // intrinsics.
    // Otherwise, consecutive rows of matrices will not be 16-byte aligned, and
    // operations on them will be incorrect.
    #define F_MULTIPLE_OF_M256 8

    // Check to quickly see if your rows are divisible by m256.
    // You can 'undefine' this to save performance, after everything was verified to be correct.
    #define ASSERT_THE_M256_MULTIPLES
    #ifdef ASSERT_THE_M256_MULTIPLES
        #define assert_is_m256_multiple(x)  assert( (x % F_MULTIPLE_OF_M256) == 0 )
    #else
        #define assert_is_m256_multiple(q)
    #endif

    // Usually used at the end of our Reduce functions,
    // where the final __m256 mSum needs to be collapsed into 1 scalar.
    static inline float slow_hAdd_ps(__m256 x){
        const float *sumStart = reinterpret_cast<const float*>(&x);
        float sum = 0.0f;

        for(size_t i = 0; i < F_MULTIPLE_OF_M256; ++i){
            sum += sumStart[i];
        }
        return sum;
    }


    f_vec SoftmaxGrad_fromResult(const float *softmaxResult,  size_t size,
                                 const float *gradFromAbove){ //<--gradient vector, flowing into us from the above layer
        assert_is_m256_multiple(size);
        // allocate vector where to store the output:
        f_vec grad_v(size, true); // true: skip filling with zeros, to save performance.

        const __m256 *end = (const __m256*)(softmaxResult + size);

        for(size_t i = 0; i < size; ++i){ // <--for every row
            // go through this i'th row:
            __m256 sum = _mm256_set1_ps(0.0f);

            const __m256 neg_sft_i = _mm256_set1_ps( -softmaxResult[i] );
            const __m256 *s        = (const __m256*)softmaxResult;
            const __m256 *gAbove   = (__m256*)gradFromAbove;

            for(; s < end; ){
                __m256 mul = _mm256_mul_ps(*s, neg_sft_i);   // sftmaxResult_j * (-sftmaxResult_i)
                mul        = _mm256_mul_ps(mul, *gAbove);

                sum = _mm256_add_ps(sum, mul); // adding to the total sum of this row.
                ++s;
                ++gAbove;
            }
            grad_v[i] = slow_hAdd_ps(sum); // collapse the sum into 1 scalar (true sum of this row).
        } // end for every row

        // reset back to the start and add the diagonal term, to account for the Kronecker delta:
        __m256 *g      = (__m256*)grad_v._contents;
        __m256 *s      = (__m256*)softmaxResult;
        __m256 *gAbove = (__m256*)gradFromAbove;

        for(; s < end; ){
            __m256 mul = _mm256_mul_ps(*s, *gAbove);

            *g = _mm256_add_ps(*g, mul);
            ++s;
            ++g;
            ++gAbove;  // gradFromAbove must advance in lockstep here as well
        }

        return grad_v;
    }

If for some reason someone wants a simple (non-SSE) version, here it is:

    inline static void SoftmaxGrad_fromResult_nonSSE(const float *softmaxResult,
                                                     const float *gradFromAbove, //<--gradient vector, flowing into us from the above layer
                                                     float *gradOutput,
                                                     size_t count){
        // Every pre-softmax element in a layer contributed to the softmax of every other element
        // (it went into the denominator). So the gradient will be distributed from every
        // post-softmax element to every pre-softmax element.
        for(size_t i = 0; i < count; ++i){
            // go through this i'th row:
            float sum = 0.0f;

            const float neg_sft_i = -softmaxResult[i];

            for(size_t j = 0; j < count; ++j){
                float mul = gradFromAbove[j] * softmaxResult[j] * neg_sft_i;
                sum += mul; // adding to the total sum of this row.
            }

            // NOTICE: equals, overwriting any old values:
            gradOutput[i] = sum;
        } // end for every row

        for(size_t i = 0; i < count; ++i){
            gradOutput[i] += softmaxResult[i] * gradFromAbove[i];
        }
    }
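
For a quick way to validate the C++ routines, the same computation can be written as a few lines of NumPy (an illustrative reference, not part of the answer; the function name is made up):

    import numpy as np

    def softmax_grad_from_result(softmax_result, grad_from_above):
        # grad_out[i] = softmax[i] * (grad_above[i] - sum_j grad_above[j] * softmax[j]),
        # which is what the two passes in the C++ versions above compute.
        dot = np.dot(grad_from_above, softmax_result)
        return softmax_result * (grad_from_above - dot)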
