A minor correction here. We did explore placing ftz (flush-to-zero) on various instructions inside the matmul ops, but it turns out you don't need anything more than what is already baked into TF by default. All TF GPU primitives are built with -nvcc_options=ftz=true, so denormal results are flushed to zero. This means you get an implicit non-linearity after any non-matmul op, provided the values in the computation sit near the float32 denormal threshold of about 1e-38. Matmul ops, by contrast, are called through cuBLAS and have denormals enabled.
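
To make the "implicit non-linearity" concrete, here is a minimal CPU-side sketch in NumPy. It only emulates the flushing behaviour and is not how the TF GPU kernels are implemented; the helper name `ftz` and the constant `TINY` are made up for illustration.

```python
import numpy as np

# Smallest positive *normal* float32 (about 1.18e-38). Any nonzero value
# smaller in magnitude than this is a denormal.
TINY = np.finfo(np.float32).tiny


def ftz(x):
    """Emulate flush-to-zero on float32 (hypothetical helper, not a TF API).

    Values whose magnitude is below the smallest normal float32 become 0;
    everything else passes through unchanged.
    """
    x = np.asarray(x, dtype=np.float32)
    return np.where(np.abs(x) < TINY, np.float32(0.0), x)


# A value just above the denormal threshold passes through unchanged.
x = np.float32(1.5) * TINY
kept = ftz(x)

# Halving it drops below the threshold, so the result is flushed to 0.
flushed = ftz(np.float32(0.5) * x)

# A linear op f would satisfy 2 * f(0.5 * x) == f(x). With flushing it does not.
is_linear_here = bool(np.float32(2.0) * flushed == kept)
print(is_linear_here)
```

The check compares `2 * f(0.5 * x)` against `f(x)`: a purely linear op would give the same answer for both, while the flushed version does not once the intermediate value falls into the denormal range. Far away from 1e-38 the helper is the identity, which is why the effect only shows up at that scale.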