For people with a signal processing background, this is actually quite trivial. The basic technique has been in use for probably a hundred years (radar being the classic example) and has nothing at all to do with Google. Basically, by correlating what you receive against a known reference, you get a signal-to-noise ratio improvement proportional to the duration of the signal you're attempting to detect (in this case, audio), which lets you detect very weak signals in the presence of strong noise or other unwanted signals. Look up "pulse compression" or "correlation detection" if you're interested.
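To make the idea concrete, here is a minimal sketch of correlation detection in plain Python. Everything in it is an assumption for illustration: the function names `cross_correlate` and `detect`, the random +/-1 "reference" standing in for a known audio clip, and the noise level. It is not how any particular service does it. The point is that summing over the whole reference makes the signal term grow with its length N while the noise term grows only like sqrt(N), so a reference buried under noise with four times its power still produces a clear peak.

```python
import random

def cross_correlate(recording, template):
    """Slide the template across the recording and return the dot product at each offset."""
    m = len(template)
    return [
        sum(r * t for r, t in zip(recording[k:k + m], template))
        for k in range(len(recording) - m + 1)
    ]

def detect(recording, template):
    """Return (offset, score) for the offset with the largest correlation score."""
    scores = cross_correlate(recording, template)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical known reference: 400 samples of +/-1 (unit power).
template = [rng.choice((-1.0, 1.0)) for _ in range(400)]

# Recording: Gaussian noise with standard deviation 2 (power 4, i.e. about -6 dB SNR),
# with the reference added in at an offset the detector is not told about.
true_offset = 700
recording = [rng.gauss(0.0, 2.0) for _ in range(2000)]
for i, s in enumerate(template):
    recording[true_offset + i] += s

offset, score = detect(recording, template)
print("detected offset:", offset, "true offset:", true_offset)
```

Making the reference longer raises the peak further above the noise floor, which is the "improvement proportional to duration" mentioned above. For real audio you would normally use an FFT-based correlation rather than this direct loop, which is only practical for short signals.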
This is very true. At UC Berkeley, the very first linear algebra course all EE/CS freshmen take (EE 16A) has a lab that does EXACTLY this. You match part of a song against a very noisy recording. If freshmen are taught to do this, it's easily something Google can do.
If it's background noise, is it still a violation? Not suggesting an answer, just asking. Like how, in some countries at least, it's okay to catch passers-by in your video shot and broadcast them uncensored?