I just checked in one possible R calculation of two-sided significance under a binomial model, using the simple null hypothesis that A and B share a common rate (and that this rate is exactly the observed pooled rate, a simplifying assumption): http://winvector.github.io/rateTest/rateTestExample.html . The long and short of it is that you get slightly different significances depending on what model you assume, but in all cases it is easy to calculate an exact significance subject to your assumptions. In this case it says differences this large would be seen only about 1.8% to 2% of the time under the null hypothesis (a two-sided test). So the result isn't that likely under the null hypothesis (and then you make a leap of faith that maybe the rates really are different). I've written a lot about these topics on the Win-Vector blog: http://www.win-vector.com/blog/2014/05/a-clear-picture-of-po... .
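For concreteness, here is a minimal R sketch of the kind of exact calculation described above. The counts nA, kA, nB, kB are hypothetical stand-ins; the experiment's actual numbers are in the linked writeup.

    # Hypothetical counts, for illustration only.
    nA <- 1000; kA <- 50   # visitors and conversions for variant A
    nB <- 1000; kB <- 75   # visitors and conversions for variant B

    # Null hypothesis: A and B share a common rate, taken to be exactly
    # the observed pooled rate (the simplifying assumption mentioned above).
    pNull <- (kA + kB)/(nA + nB)

    # Observed absolute difference in rates.
    obsDiff <- abs(kA/nA - kB/nB)

    # Exact two-sided significance: total probability, under the null, of
    # every outcome pair (a, b) whose rate difference is at least as large
    # as the one observed.
    probA <- dbinom(0:nA, nA, pNull)
    probB <- dbinom(0:nB, nB, pNull)
    extreme <- abs(outer((0:nA)/nA, (0:nB)/nB, `-`)) >= obsDiff
    sum(outer(probA, probB)[extreme])

The enumeration is O(nA * nB), which is perfectly tractable for counts in the thousands; swapping in a different null (a different assumed common rate, say) just changes pNull.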

They said they ran an A/A test (a very good idea), but the numbers seem slightly implausible under the assumption that the two variants are identical (which, again, doesn't immediately imply the two variants are in fact different).

The important thing to remember is that your exact significances/probabilities are a function of the unknown true rates, your data, and your modeling assumptions. The usual advice is to control the undesirable dependence on modeling assumptions by using only "brand name" tests. I actually prefer ad-hoc tests, but you should discuss what is assumed in them (one-sided versus two-sided, pooled data for the null, and so on). You definitely can't assume away a thumb on the scale.
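For comparison, the "brand name" route in R would look something like the following, applied to the same hypothetical counts as in the sketch above. Note that each test bakes in its own assumptions (a pooled normal approximation versus an exact conditional test), which is exactly the kind of thing worth stating out loud.

    # Pooled two-proportion test (normal approximation, two-sided by default).
    prop.test(c(kA, kB), c(nA, nB))

    # Fisher's exact test on the 2x2 contingency table
    # (columns are variants, rows are conversions/non-conversions).
    fisher.test(matrix(c(kA, nA - kA, kB, nB - kB), nrow = 2))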

Also, this calculation does not compensate for any multiple-trial or early-stopping effect. It (rightly or wrongly) assumes this was the only experiment run and that it was stopped without looking at the rates.
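If you did want to account for multiple looks, even a crude correction changes the picture noticeably. A sketch, assuming (hypothetically) the reported significance had been one of k = 5 comparisons:

    # Bonferroni adjustment: scale the p-value by the number of comparisons.
    # k = 5 is hypothetical; the calculation above assumes k = 1.
    k <- 5
    p.adjust(0.019, method = "bonferroni", n = k)  # min(1, k * 0.019) = 0.095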

This may look like a lot of code, but the code doesn't change from one dataset to the next.




What do you mean by "brand name tests"?



