You can prove the ANN works, but how do you check that it conforms to any specification, other than by testing and accepting a high probability of real-world failure?
Ultimately, you have a high-level, fuzzy requirement that is hard to even specify correctly. Just look at, say, the latest YouTube guidelines about content they try to make tractable with ML...
And that's just a classifier; now try to do the same with a machine translation system.
In fact, this is exactly what I was thinking about when writing the last paragraph of my comment.