How can I use Batch Normalization in TensorFlow?

I would like to use Batch Normalization in TensorFlow since I found it in the source code core/ops/nn_ops.cc . However, I did not find it documented on tensorflow.org.

BN has different semantics in MLP and CNN, so I'm not sure what exactly BN does.

I did not find the MovingMoments method.

C ++ code is copied here for reference:

 REGISTER_OP("BatchNormWithGlobalNormalization") .Input("t: T") .Input("m: T") .Input("v: T") .Input("beta: T") .Input("gamma: T") .Output("result: T") .Attr("T: numbertype") .Attr("variance_epsilon: float") .Attr("scale_after_normalization: bool") .Doc(R"doc( Batch normalization. t: A 4D input Tensor. m: A 1D mean Tensor with size matching the last dimension of t. This is the first output from MovingMoments. v: A 1D variance Tensor with size matching the last dimension of t. This is the second output from MovingMoments. beta: A 1D beta Tensor with size matching the last dimension of t. An offset to be added to the normalized tensor. gamma: A 1D gamma Tensor with size matching the last dimension of t. If "scale_after_normalization" is true, this tensor will be multiplied with the normalized tensor. variance_epsilon: A small float number to avoid dividing by 0. scale_after_normalization: A bool indicating whether the resulted tensor needs to be multiplied with gamma. )doc"); 
+62
python tensorflow
Nov 27 '15 at 3:17
source share
9 answers

Update July 2016 The easiest way to use batch normalization in TensorFlow is through higher-level interfaces provided in contrib / layers , tflearn , or slim .

The previous answer, if you want to do DIY : The documentation line for this has improved since the release - see the docs comment in the main thread instead of what you found. It specifies, in particular, that it is derived from tf.nn.moments .

You can see a very simple example of its use in the batch_norm test code . For a more practical use case, I included the helper class below and used the notes I scratched for my own use (no warranty!):

 """A helper class for managing batch normalization state. This class is designed to simplify adding batch normalization (http://arxiv.org/pdf/1502.03167v3.pdf) to your model by managing the state variables associated with it. Important use note: The function get_assigner() returns an op that must be executed to save the updated state. A suggested way to do this is to make execution of the model optimizer force it, eg, by: update_assignments = tf.group(bn1.get_assigner(), bn2.get_assigner()) with tf.control_dependencies([optimizer]): optimizer = tf.group(update_assignments) """ import tensorflow as tf class ConvolutionalBatchNormalizer(object): """Helper class that groups the normalization logic and variables. Use: ewma = tf.train.ExponentialMovingAverage(decay=0.99) bn = ConvolutionalBatchNormalizer(depth, 0.001, ewma, True) update_assignments = bn.get_assigner() x = bn.normalize(y, train=training?) (the output x will be batch-normalized). """ def __init__(self, depth, epsilon, ewma_trainer, scale_after_norm): self.mean = tf.Variable(tf.constant(0.0, shape=[depth]), trainable=False) self.variance = tf.Variable(tf.constant(1.0, shape=[depth]), trainable=False) self.beta = tf.Variable(tf.constant(0.0, shape=[depth])) self.gamma = tf.Variable(tf.constant(1.0, shape=[depth])) self.ewma_trainer = ewma_trainer self.epsilon = epsilon self.scale_after_norm = scale_after_norm def get_assigner(self): """Returns an EWMA apply op that must be invoked after optimization.""" return self.ewma_trainer.apply([self.mean, self.variance]) def normalize(self, x, train=True): """Returns a batch-normalized version of x.""" if train: mean, variance = tf.nn.moments(x, [0, 1, 2]) assign_mean = self.mean.assign(mean) assign_variance = self.variance.assign(variance) with tf.control_dependencies([assign_mean, assign_variance]): return tf.nn.batch_norm_with_global_normalization( x, mean, variance, self.beta, self.gamma, self.epsilon, self.scale_after_norm) else: mean = self.ewma_trainer.average(self.mean) variance = self.ewma_trainer.average(self.variance) local_beta = tf.identity(self.beta) local_gamma = tf.identity(self.gamma) return tf.nn.batch_norm_with_global_normalization( x, mean, variance, local_beta, local_gamma, self.epsilon, self.scale_after_norm) 

Note that I called it ConvolutionalBatchNormalizer , because it links the use of tf.nn.moments to sum along the axes 0, 1, and 2, whereas for non-convolutional use you only need the 0 axis.

Feedback is appreciated if you use it.

+48
Nov 27 '15 at 4:16
source share

The following works fine for me, it does not require calling an EMA application from the outside.

 import numpy as np import tensorflow as tf from tensorflow.python import control_flow_ops def batch_norm(x, n_out, phase_train, scope='bn'): """ Batch normalization on convolutional maps. Args: x: Tensor, 4D BHWD input maps n_out: integer, depth of input maps phase_train: boolean tf.Varialbe, true indicates training phase scope: string, variable scope Return: normed: batch-normalized maps """ with tf.variable_scope(scope): beta = tf.Variable(tf.constant(0.0, shape=[n_out]), name='beta', trainable=True) gamma = tf.Variable(tf.constant(1.0, shape=[n_out]), name='gamma', trainable=True) batch_mean, batch_var = tf.nn.moments(x, [0,1,2], name='moments') ema = tf.train.ExponentialMovingAverage(decay=0.5) def mean_var_with_update(): ema_apply_op = ema.apply([batch_mean, batch_var]) with tf.control_dependencies([ema_apply_op]): return tf.identity(batch_mean), tf.identity(batch_var) mean, var = tf.cond(phase_train, mean_var_with_update, lambda: (ema.average(batch_mean), ema.average(batch_var))) normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3) return normed 

Example:

 import math n_in, n_out = 3, 16 ksize = 3 stride = 1 phase_train = tf.placeholder(tf.bool, name='phase_train') input_image = tf.placeholder(tf.float32, name='input_image') kernel = tf.Variable(tf.truncated_normal([ksize, ksize, n_in, n_out], stddev=math.sqrt(2.0/(ksize*ksize*n_out))), name='kernel') conv = tf.nn.conv2d(input_image, kernel, [1,stride,stride,1], padding='SAME') conv_bn = batch_norm(conv, n_out, phase_train) relu = tf.nn.relu(conv_bn) with tf.Session() as session: session.run(tf.initialize_all_variables()) for i in range(20): test_image = np.random.rand(4,32,32,3) sess_outputs = session.run([relu], {input_image.name: test_image, phase_train.name: True}) 
+30
Jan 06 '16 at 13:26
source share

To add another alternative: with TensorFlow 1.0 (February 2017) there will also be tf.layers.batch_normalization API included in TensorFlow itself.

It is very easy to use:

 # Set this to True for training and False for testing training = tf.placeholder(tf.bool) x = tf.layers.dense(input_x, units=100) x = tf.layers.batch_normalization(x, training=training) x = tf.nn.relu(x) 

... except that it adds additional ops to the graph (to update its mean and variance variables) so that they will not be dependent on your op training. You can simply run OP separately:

 extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) sess.run([train_op, extra_update_ops], ...) 

or add update statements as dependencies of your training course manually, then just start your training process as usual:

 extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) with tf.control_dependencies(extra_update_ops): train_op = optimizer.minimize(loss) ... sess.run([train_op], ...) 
+15
Apr 07 '17 at 19:00
source share

There is also an “official” level of party normalization coded by developers. They don't have very good documents on how to use them, but here's how to use them (for me):

 from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm def batch_norm_layer(x,train_phase,scope_bn): bn_train = batch_norm(x, decay=0.999, center=True, scale=True, updates_collections=None, is_training=True, reuse=None, # is this right? trainable=True, scope=scope_bn) bn_inference = batch_norm(x, decay=0.999, center=True, scale=True, updates_collections=None, is_training=False, reuse=True, # is this right? trainable=True, scope=scope_bn) z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference) return z 

in order to actually use it, you need to create a placeholder for train_phase that indicates whether you are at the training or output stage (as in train_phase = tf.placeholder(tf.bool, name='phase_train') ). Its value can be populated during output or training with tf.session , as in:

 test_error = sess.run(fetches=cross_entropy, feed_dict={x: batch_xtest, y_:batch_ytest, train_phase: False}) 

or during training:

 sess.run(fetches=train_step, feed_dict={x: batch_xs, y_:batch_ys, train_phase: True}) 

and here is the function that implements BN according to them (this may be in the link I gave above, but I only insert it if they change):

 @add_arg_scope def batch_norm(inputs, decay=0.999, center=True, scale=False, epsilon=0.001, activation_fn=None, updates_collections=ops.GraphKeys.UPDATE_OPS, is_training=True, reuse=None, variables_collections=None, outputs_collections=None, trainable=True, scope=None): """Adds a Batch Normalization layer from http://arxiv.org/abs/1502.03167. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" Sergey Ioffe, Christian Szegedy Can be used as a normalizer function for conv2d and fully_connected. Args: -inputs: a tensor of size `[batch_size, height, width, channels]` or `[batch_size, channels]`. -decay: decay for the moving average. -center: If True, subtract `beta`. If False, `beta` is ignored. -scale: If True, multiply by `gamma`. If False, `gamma` is not used. When the next layer is linear (also eg `nn.relu`), this can be disabled since the scaling can be done by the next layer. -epsilon: small float added to variance to avoid dividing by zero. -activation_fn: Optional activation function. -updates_collections: collections to collect the update ops for computation. If None, a control dependency would be added to make sure the updates are computed. -is_training: whether or not the layer is in training mode. In training mode it would accumulate the statistics of the moments into `moving_mean` and `moving_variance` using an exponential moving average with the given `decay`. When it is not in training mode then it would use the values of the `moving_mean` and the `moving_variance`. -reuse: whether or not the layer and its variables should be reused. To be able to reuse the layer scope must be given. -variables_collections: optional collections for the variables. -outputs_collections: collections to add the outputs. -trainable: If `True` also add variables to the graph collection `GraphKeys.TRAINABLE_VARIABLES` (see tf.Variable). -scope: Optional scope for `variable_op_scope`. Returns: a tensor representing the output of the operation. """ with variable_scope.variable_op_scope([inputs],scope, 'BatchNorm', reuse=reuse) as sc: inputs_shape = inputs.get_shape() dtype = inputs.dtype.base_dtype axis = list(range(len(inputs_shape) - 1)) params_shape = inputs_shape[-1:] # Allocate parameters for the beta and gamma of the normalization. beta, gamma = None, None if center: beta_collections = utils.get_variable_collections(variables_collections,'beta') beta = variables.model_variable('beta',shape=params_shape,dtype=dtype,initializer=init_ops.zeros_initializer,collections=beta_collections,trainable=trainable) if scale: gamma_collections = utils.get_variable_collections(variables_collections,'gamma') gamma = variables.model_variable('gamma',shape=params_shape,dtype=dtype,initializer=init_ops.ones_initializer,collections=gamma_collections,trainable=trainable) # Create moving_mean and moving_variance variables and add them to the # appropiate collections. moving_mean_collections = utils.get_variable_collections(variables_collections, 'moving_mean') moving_mean = variables.model_variable('moving_mean',shape=params_shape,dtype=dtype,initializer=init_ops.zeros_initializer,trainable=False,collections=moving_mean_collections) moving_variance_collections = utils.get_variable_collections(variables_collections, 'moving_variance') moving_variance = variables.model_variable('moving_variance',shape=params_shape,dtype=dtype,initializer=init_ops.ones_initializer,trainable=False,collections=moving_variance_collections) if is_training: # Calculate the moments based on the individual batch. mean, variance = nn.moments(inputs, axis, shift=moving_mean) # Update the moving_mean and moving_variance moments. update_moving_mean = moving_averages.assign_moving_average(moving_mean, mean, decay) update_moving_variance = moving_averages.assign_moving_average(moving_variance, variance, decay) if updates_collections is None: # Make sure the updates are computed here. with ops.control_dependencies([update_moving_mean,update_moving_variance]): outputs = nn.batch_normalization(inputs, mean, variance, beta, gamma, epsilon) else: # Collect the updates to be computed later. ops.add_to_collections(updates_collections, update_moving_mean) ops.add_to_collections(updates_collections, update_moving_variance) outputs = nn.batch_normalization(inputs, mean, variance, beta, gamma, epsilon) else: outputs = nn.batch_normalization( inputs, moving_mean, moving_variance, beta, gamma, epsilon) outputs.set_shape(inputs.get_shape()) if activation_fn: outputs = activation_fn(outputs) return utils.collect_named_outputs(outputs_collections, sc.name, outputs) 

I am sure this is correct according to the discussion on github .




There seems to be another useful link:

http://r2rt.com/implementing-batch-normalization-in-tensorflow.html

+13
Jul 12 '16 at 5:23
source share

You can simply use the built-in batch_norm layer:

 batch_norm = tf.cond(is_train, lambda: tf.contrib.layers.batch_norm(prev, activation_fn=tf.nn.relu, is_training=True, reuse=None), lambda: tf.contrib.layers.batch_norm(prev, activation_fn =tf.nn.relu, is_training=False, reuse=True)) 

where prev is the result of your previous layer (it can be either completely connected or convolutional), and is_train is a Boolean place. Just use batch_norm as input for the next layer, then.

+12
Aug 24 '16 at 16:35
source share

Just heads-up, I can’t comment because I don’t have 50 reputation. But @ bgshi's answer above doesn't seem to be correct. If phase_train is set to false, it still updates ema and variance. This can be verified using the following code snippet.

 x = tf.placeholder(tf.float32, [None, 20, 20, 10], name='input') phase_train = tf.placeholder(tf.bool, name='phase_train') # generate random noise to pass into batch norm x_gen = tf.random_normal([50,20,20,10]) pt_false = tf.Variable(tf.constant(True)) #generate a constant variable to pass into batch norm y = x_gen.eval() [bn, bn_vars] = batch_norm(x, 10, phase_train) tf.initialize_all_variables().run() train_step = lambda: bn.eval({x:x_gen.eval(), phase_train:True}) test_step = lambda: bn.eval({x:y, phase_train:False}) test_step_c = lambda: bn.eval({x:y, phase_train:True}) # Verify that this is different as expected, two different x have different norms print(train_step()[0][0][0]) print(train_step()[0][0][0]) # Verify that this is same as expected, same x (y) have same norm print(train_step_c()[0][0][0]) print(train_step_c()[0][0][0]) # THIS IS DIFFERENT but should be they same, should only be reading from the ema. print(test_step()[0][0][0]) print(test_step()[0][0][0]) 
+9
Apr 09 '16 at 0:36
source share

Using the built-in Batch_norm TensorFlow layer, below is the code for loading data, building a network with one hidden ReLU level and L2 normalization and introducing batch normalization for both the hidden and the outer layer. It works fine and trains great. Just FYI, this example is mainly built on data and codes from the Udacity DeepLearning course. Postscript Yes, parts of it were discussed one way or another in the answers earlier, but I decided to collect everything in one piece of code so that you had an example of the whole process of learning a network with Batch Normalization and its evaluation

 # These are all the modules we'll be using later. Make sure you can import them # before proceeding further. from __future__ import print_function import numpy as np import tensorflow as tf from six.moves import cPickle as pickle pickle_file = '/home/maxkhk/Documents/Udacity/DeepLearningCourse/SourceCode/tensorflow/examples/udacity/notMNIST.pickle' with open(pickle_file, 'rb') as f: save = pickle.load(f) train_dataset = save['train_dataset'] train_labels = save['train_labels'] valid_dataset = save['valid_dataset'] valid_labels = save['valid_labels'] test_dataset = save['test_dataset'] test_labels = save['test_labels'] del save # hint to help gc free up memory print('Training set', train_dataset.shape, train_labels.shape) print('Validation set', valid_dataset.shape, valid_labels.shape) print('Test set', test_dataset.shape, test_labels.shape) image_size = 28 num_labels = 10 def reformat(dataset, labels): dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32) # Map 2 to [0.0, 1.0, 0.0 ...], 3 to [0.0, 0.0, 1.0 ...] labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32) return dataset, labels train_dataset, train_labels = reformat(train_dataset, train_labels) valid_dataset, valid_labels = reformat(valid_dataset, valid_labels) test_dataset, test_labels = reformat(test_dataset, test_labels) print('Training set', train_dataset.shape, train_labels.shape) print('Validation set', valid_dataset.shape, valid_labels.shape) print('Test set', test_dataset.shape, test_labels.shape) def accuracy(predictions, labels): return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) / predictions.shape[0]) #for NeuralNetwork model code is below #We will use SGD for training to save our time. Code is from Assignment 2 #beta is the new parameter - controls level of regularization. #Feel free to play with it - the best one I found is 0.001 #notice, we introduce L2 for both biases and weights of all layers batch_size = 128 beta = 0.001 #building tensorflow graph graph = tf.Graph() with graph.as_default(): # Input data. For the training data, we use a placeholder that will be fed # at run time with a training minibatch. tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size * image_size)) tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels)) tf_valid_dataset = tf.constant(valid_dataset) tf_test_dataset = tf.constant(test_dataset) #introduce batchnorm tf_train_dataset_bn = tf.contrib.layers.batch_norm(tf_train_dataset) #now let build our new hidden layer #that how many hidden neurons we want num_hidden_neurons = 1024 #its weights hidden_weights = tf.Variable( tf.truncated_normal([image_size * image_size, num_hidden_neurons])) hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons])) #now the layer itself. It multiplies data by weights, adds biases #and takes ReLU over result hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset_bn, hidden_weights) + hidden_biases) #adding the batch normalization layerhi() hidden_layer_bn = tf.contrib.layers.batch_norm(hidden_layer) #time to go for output linear layer #out weights connect hidden neurons to output labels #biases are added to output labels out_weights = tf.Variable( tf.truncated_normal([num_hidden_neurons, num_labels])) out_biases = tf.Variable(tf.zeros([num_labels])) #compute output out_layer = tf.matmul(hidden_layer_bn,out_weights) + out_biases #our real output is a softmax of prior result #and we also compute its cross-entropy to get our loss #Notice - we introduce our L2 here loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits( out_layer, tf_train_labels) + beta*tf.nn.l2_loss(hidden_weights) + beta*tf.nn.l2_loss(hidden_biases) + beta*tf.nn.l2_loss(out_weights) + beta*tf.nn.l2_loss(out_biases))) #now we just minimize this loss to actually train the network optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss) #nice, now let calculate the predictions on each dataset for evaluating the #performance so far # Predictions for the training, validation, and test data. train_prediction = tf.nn.softmax(out_layer) valid_relu = tf.nn.relu( tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases) valid_prediction = tf.nn.softmax( tf.matmul(valid_relu, out_weights) + out_biases) test_relu = tf.nn.relu( tf.matmul( tf_test_dataset, hidden_weights) + hidden_biases) test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases) #now is the actual training on the ANN we built #we will run it for some number of steps and evaluate the progress after #every 500 steps #number of steps we will train our ANN num_steps = 3001 #actual training with tf.Session(graph=graph) as session: tf.initialize_all_variables().run() print("Initialized") for step in range(num_steps): # Pick an offset within the training data, which has been randomized. # Note: we could use better randomization across epochs. offset = (step * batch_size) % (train_labels.shape[0] - batch_size) # Generate a minibatch. batch_data = train_dataset[offset:(offset + batch_size), :] batch_labels = train_labels[offset:(offset + batch_size), :] # Prepare a dictionary telling the session where to feed the minibatch. # The key of the dictionary is the placeholder node of the graph to be fed, # and the value is the numpy array to feed to it. feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels} _, l, predictions = session.run( [optimizer, loss, train_prediction], feed_dict=feed_dict) if (step % 500 == 0): print("Minibatch loss at step %d: %f" % (step, l)) print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels)) print("Validation accuracy: %.1f%%" % accuracy( valid_prediction.eval(), valid_labels)) print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels)) 
+2
Jul 12 '16 at 9:45
source share

So, a simple example of using this batchnorm class:

 from bn_class import * with tf.name_scope('Batch_norm_conv1') as scope: ewma = tf.train.ExponentialMovingAverage(decay=0.99) bn_conv1 = ConvolutionalBatchNormalizer(num_filt_1, 0.001, ewma, True) update_assignments = bn_conv1.get_assigner() a_conv1 = bn_conv1.normalize(a_conv1, train=bn_train) h_conv1 = tf.nn.relu(a_conv1) 
+1
Apr 10 '16 at 4:43 on
source share

Here is a simple network with two hidden layers and normalization of the party. It uses several convenient tensorflow.contrib.layers :

 from tensorflow.contrib.layers import fully_connected, batch_norm from tensorflow.contrib.framework import arg_scope batch_norm_params = { 'is_training': True, 'decay': 0.9, 'updates_collections': None, 'scale': True, } with arg_scope( [fully_connected], normalizer_fn=batch_norm, normalizer_params=batch_norm_params): hidden1 = fully_connected(inputs, n_hidden1, scope="hidden1") hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2") logits = fully_connected(hidden2, n_outputs, scope="output", activation_fn=None) 

Note. To run this, you need to add inputs (e.g. a placeholder) and determine the number of neurons you want on each layer ( n_hidden1 , n_hidden2 and n_outputs ).

Edit

The correct answer is the answer of Matthew Ratz.

0
Apr 17 '17 at 21:44 on
source share



All Articles