Why is there no weight decay on the convolutional layers in the TensorFlow CIFAR-10 example?

It seems that there is no weight decay on the convolutional layers in the TensorFlow CIFAR-10 example. In fact, there is no weight decay on any layer except the two fully connected layers. Is this common practice? I thought weight decay was applied to all weights (except biases).

For reference, here is the corresponding code (wd is the weight decay coefficient):

  # conv1
  with tf.variable_scope('conv1') as scope:
    kernel = _variable_with_weight_decay('weights', shape=[5, 5, 3, 64],
                                         stddev=1e-4, wd=0.0)
    conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
    bias = tf.nn.bias_add(conv, biases)
    conv1 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv1)

  # pool1
  pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                         padding='SAME', name='pool1')
  # norm1
  norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                    name='norm1')

  # conv2
  with tf.variable_scope('conv2') as scope:
    kernel = _variable_with_weight_decay('weights', shape=[5, 5, 64, 64],
                                         stddev=1e-4, wd=0.0)
    conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1))
    bias = tf.nn.bias_add(conv, biases)
    conv2 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv2)

  # norm2
  norm2 = tf.nn.lrn(conv2, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                    name='norm2')
  # pool2
  pool2 = tf.nn.max_pool(norm2, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                         padding='SAME', name='pool2')

  # local3
  with tf.variable_scope('local3') as scope:
    # Move everything into depth so we can perform a single matrix multiply.
    dim = 1
    for d in pool2.get_shape()[1:].as_list():
      dim *= d
    reshape = tf.reshape(pool2, [FLAGS.batch_size, dim])
    weights = _variable_with_weight_decay('weights', shape=[dim, 384],
                                          stddev=0.04, wd=0.004)
    biases = _variable_on_cpu('biases', [384], tf.constant_initializer(0.1))
    local3 = tf.nn.relu(tf.matmul(reshape, weights) + biases, name=scope.name)
    _activation_summary(local3)

  # local4
  with tf.variable_scope('local4') as scope:
    weights = _variable_with_weight_decay('weights', shape=[384, 192],
                                          stddev=0.04, wd=0.004)
    biases = _variable_on_cpu('biases', [192], tf.constant_initializer(0.1))
    local4 = tf.nn.relu(tf.matmul(local3, weights) + biases, name=scope.name)
    _activation_summary(local4)

  # softmax, i.e. softmax(WX + b)
  with tf.variable_scope('softmax_linear') as scope:
    weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                          stddev=1/192.0, wd=0.0)
    biases = _variable_on_cpu('biases', [NUM_CLASSES],
                              tf.constant_initializer(0.0))
    softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
    _activation_summary(softmax_linear)

  return softmax_linear
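For readers unfamiliar with the helpers: below is a minimal sketch of what _variable_with_weight_decay and _variable_on_cpu do, paraphrased rather than copied from the tutorial source. The relevant detail is that wd controls an L2 penalty that gets collected into a 'losses' collection, so passing wd=0.0 for conv1 and conv2 means those kernels contribute no decay term at all.

import tensorflow as tf

def _variable_on_cpu(name, shape, initializer):
  # Sketch: create the variable pinned to the CPU.
  with tf.device('/cpu:0'):
    return tf.get_variable(name, shape, initializer=initializer)

def _variable_with_weight_decay(name, shape, stddev, wd):
  # Sketch: create the variable and, only when wd is non-zero, add an L2
  # penalty wd * l2_loss(var) to the 'losses' collection so the training
  # loss can sum it in later.
  var = _variable_on_cpu(name, shape,
                         tf.truncated_normal_initializer(stddev=stddev))
  if wd:
    weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
    tf.add_to_collection('losses', weight_decay)
  return var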
tensorflow conv-neural-network
1 answer

Weight decay does not necessarily improve performance. In my own experience, I have quite often found that my models do worse (as measured by some metric on a held-out set) with any significant amount of weight decay. It is a useful form of regularization to know about, but you should not add it to every model without considering whether it seems necessary, or without comparing performance with and without it.
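To make "with and without" concrete for the code in the question: the decay terms that _variable_with_weight_decay puts into the 'losses' collection are simply summed into the training loss, roughly as sketched below (a paraphrase of the tutorial's loss function, not the exact source; it reuses the tf import from the earlier sketch). Setting wd=0.0 for a layer therefore removes its penalty without changing anything else, which makes such comparisons cheap to set up.

def loss(logits, labels):
  # Sketch: cross entropy plus every weight-decay term that was added to
  # the 'losses' collection when the variables were created.
  labels = tf.cast(labels, tf.int64)
  cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
      labels=labels, logits=logits, name='cross_entropy_per_example')
  cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy')
  tf.add_to_collection('losses', cross_entropy_mean)
  return tf.add_n(tf.get_collection('losses'), name='total_loss')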

As for whether applying weight decay to only part of the model can work better than applying it to the whole model: it does seem less common to regularize only some of the weights this way, though I do not know of a theoretical reason for it. In general, neural networks already have plenty of hyperparameters to tune. Whether to use weight decay at all is already a question, and so is how strongly to regularize if you do. If you also ask which layers should be regularized this way, you quickly run out of time to test the performance of every combination of turning it on and off per layer.

I suspect there are models that would benefit from weight decay on only part of the model; I just do not think it is done often, because it is hard to test all the possibilities and figure out which one works best.
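If you do want to experiment with decaying only a subset of the weights, and your code does not already expose a per-variable wd knob like the tutorial's helper, one generic pattern is to build the penalty from a filtered list of trainable variables. A minimal sketch, where the 'local' name filter and the 0.004 coefficient are arbitrary illustrative choices:

def selective_weight_decay(wd=0.004, name_filter='local'):
  # Sum wd * l2_loss over only those trainable variables whose names match
  # the filter (here: just the fully connected 'local' layers), skipping
  # biases.  Add the result to the task loss to regularize only that subset.
  selected = [v for v in tf.trainable_variables()
              if name_filter in v.op.name and 'biases' not in v.op.name]
  return wd * tf.add_n([tf.nn.l2_loss(v) for v in selected])

# Usage: total_loss = cross_entropy_mean + selective_weight_decay()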
