tf.train.replica_device_setter() is quite simple in its behavior: it makes a purely local decision to assign a device to each tf.Variable as it is created, iterating in round-robin order over the parameter server tasks.
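To make the round-robin behavior concrete, here is a minimal pure-Python sketch of that placement policy. This is not TensorFlow's actual implementation, just a mental model: variables are pinned to PS tasks in creation order, and everything else stays on the worker.

```python
from itertools import cycle

def make_round_robin_setter(num_ps_tasks):
    """Sketch of a device chooser: each new variable goes to the next
    parameter-server task; non-variable ops stay on the worker."""
    ps_devices = cycle("/job:ps/task:%d" % i for i in range(num_ps_tasks))

    def choose_device(op_type):
        if op_type == "Variable":
            return next(ps_devices)
        return "/job:worker"

    return choose_device

choose = make_round_robin_setter(3)
devices = [choose("Variable") for _ in range(4)]
# The fourth variable wraps around to /job:ps/task:0 again.
```

Because the decision depends only on how many variables have been created so far, two workers that create their variables in the same order will compute identical placements with no coordination.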
In the distributed version of TensorFlow, each device (for example, "/job:ps/task:17/cpu:0" ) maintains a map from variable names to variables, shared by all sessions that use that device. This means that when different worker replicas create a session using that device, if they assign the same symbolic variable (i.e. one with the same Variable.name property) to that device, they will see each other's updates.
When you perform "between-graph replication" across multiple replicas, tf.train.replica_device_setter() provides a simple, deterministic way to assign variables to devices. If you build an identical graph in each worker replica, each variable will be assigned to the same device and will be successfully shared without any external coordination.
Caution: with this scheme, your worker replicas must create identical graphs*, and there must be no randomness in how the graph is constructed. I once saw an issue where the order in which variables were created was determined by iterating over the keys of a Python dict, which is not guaranteed to happen in the same order across processes. This led to variables being assigned to different PS devices by different workers.
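The pitfall is easy to reproduce in miniature. In this hypothetical sketch, two workers create the "same" two variables but in different orders (as a nondeterministically ordered dict might produce), and the round-robin rule sends the same variable name to different PS tasks on each worker:

```python
def assign(var_names, num_ps_tasks=2):
    """Round-robin placement keyed only by creation order (a sketch,
    not TensorFlow's real code)."""
    return {name: "/job:ps/task:%d" % (i % num_ps_tasks)
            for i, name in enumerate(var_names)}

worker_a = assign(["weights", "biases"])  # creation order on worker A
worker_b = assign(["biases", "weights"])  # a different order on worker B

# worker_a places "weights" on /job:ps/task:0, but worker_b places it on
# /job:ps/task:1, so the two workers no longer share that variable.
```

The fix is to make variable-creation order deterministic, for example by iterating over sorted keys.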
As for your other question, you need to be careful about name clashes when training multiple models in the same processes. By default, all variables are shared in a single global namespace, so two variables from different networks with the same name will collide. One way to mitigate this problem is to wrap each model in a with tf.container(name): block (with a different value for name , for example "model_1" and "model_2" ) to put your variables in a separate namespace, called a "container" in TensorFlow jargon. You can think of a container as a prefix that is added to the names of all your variables as seen by the device. Support for containers in the API is still fairly preliminary, but there are plans to make them more useful in the future.
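The "container as a name prefix" intuition can be sketched without TensorFlow at all. This is only an illustrative model of the behavior described above (the names device_key and store are hypothetical, not TensorFlow API): prefixing each variable name with its container lets two models use the same variable name on the same device without colliding.

```python
def device_key(container, var_name):
    """Sketch: the key a device would use to look up a shared variable,
    with the container acting as a name prefix."""
    return "%s/%s" % (container, var_name) if container else var_name

store = {}  # stand-in for a device's variable-name -> variable map
store[device_key("model_1", "weights")] = "params for model 1"
store[device_key("model_2", "weights")] = "params for model 2"

# Without containers, both models would fight over the single key
# "weights"; with containers, the two entries coexist.
```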
* Technically, they only need to create their tf.Variable objects in the same order.
mrry