Yes. The optimizer keeps a higher-precision copy, so training is likely slower and needs more memory than an equivalent full-precision model. I'd also imagine it takes several update steps to achieve what one step does in full precision, because each gradient update only nudges the latent weight a little, and the quantized weight only flips once the latent value crosses a threshold between the three states, rather than moving a small amount in the right direction every step.
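For concreteness, here's roughly how that latent-weight scheme works in quantization-aware training. This is a minimal PyTorch sketch, not the actual BitNet implementation; the `TernaryLinear` name and the per-tensor absmean scaling are my assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Linear):
    """Linear layer with full-precision latent weights; the forward
    pass uses a ternary {-1, 0, +1} snapshot of them."""

    def forward(self, x):
        w = self.weight                         # latent full-precision copy (what the optimizer updates)
        scale = w.abs().mean().clamp(min=1e-8)  # per-tensor absmean scale (an assumption)
        w_q = (w / scale).round().clamp(-1, 1) * scale  # snap each weight to one of three states
        # Straight-through estimator: forward computes with w_q, but the
        # gradient flows back to the latent weights as if unquantized.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste, self.bias)

# Tiny demo: gradients land on the latent FP weights, not the ternary copy.
layer = TernaryLinear(8, 4)
out = layer(torch.randn(2, 8))
out.sum().backward()
print(layer.weight.dtype, layer.weight.grad is not None)  # torch.float32 True
```

Note that in this sketch the ternary copy is rebuilt from the latent weights on every forward pass, so during training both copies (plus any optimizer state such as Adam's moment estimates) live in memory at once, which is where the extra footprint comes from.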
Isn't memory use in training higher, since they maintain high-precision latent weights in addition to the quantized weights used in the forward pass?