I've coded a deep feed-forward NN from scratch in R, and it seems more stable with "hard sigmoid" activations - max(0, min(1, x)) - than with ReLU. I'm trying to port it to TensorFlow and noticed that there is no built-in activation function for this, only relu6, which clips from above at 6. Is there a reason for this? (I realize I could do relu6(x * 6) / 6, but if the TF folks put the 6 there for a good reason, I'd like to know.) Also, I'd like to know whether others have had stability problems with ReLU in feed-forward networks (I'm aware of the issues with RNNs).
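For reference, here is a minimal sketch of the two formulations I mean, written against the TensorFlow Python API (the function names are my own, just for illustration):

```python
import tensorflow as tf

def hard_sigmoid(x):
    # Direct clamp to [0, 1]: max(0, min(1, x))
    return tf.clip_by_value(x, 0.0, 1.0)

def hard_sigmoid_via_relu6(x):
    # Same function expressed through relu6: relu6(6*x) / 6
    return tf.nn.relu6(x * 6.0) / 6.0
```

Both should give the same output, since relu6(6x)/6 = min(max(0, x), 1); my question is whether the fixed cutoff of 6 in relu6 was chosen for a specific reason.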