I'd like to compare parallel CPU and GPU versions with Thrust: http://thrust.github.io
When is a vector big enough that it is worth processing on the GPU?
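For context, here is a minimal sketch of the kind of comparison I have in mind (the operation and sizes are just placeholders): the same element-wise transform on a `thrust::host_vector` (CPU) and a `thrust::device_vector` (GPU), timed separately so the GPU timing includes the host/device copies.

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <chrono>
#include <iostream>

int main() {
    const std::size_t n = 1 << 20;  // vector size to vary in the experiment
    thrust::host_vector<float> h_x(n, 1.0f), h_y(n, 2.0f);

    // CPU version: with host iterators, Thrust dispatches to the host backend.
    auto t0 = std::chrono::high_resolution_clock::now();
    thrust::transform(h_x.begin(), h_x.end(), h_y.begin(), h_y.begin(),
                      thrust::plus<float>());
    auto t1 = std::chrono::high_resolution_clock::now();

    // GPU version: the copies to and from the device are included in the
    // timing, since that transfer cost is exactly what the question is about.
    auto t2 = std::chrono::high_resolution_clock::now();
    thrust::device_vector<float> d_x = h_x, d_y = h_y;
    thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(),
                      thrust::plus<float>());
    thrust::host_vector<float> h_result = d_y;
    auto t3 = std::chrono::high_resolution_clock::now();

    std::cout << "CPU: "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "GPU (incl. transfers): "
              << std::chrono::duration<double>(t3 - t2).count() << " s\n";
}
```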
It depends on the type of vector operation(s) you are doing and on the machine you are on. For one-off vector operations, it is never worth the cost of transferring the data to the GPU.
If you are going to have a lot of temporary vectors as part of a larger algorithm, it is usually beneficial to copy the inputs once, do all the computations on the GPU, and copy the results back, as sketched below.
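A sketch of that "copy once, compute everything on the GPU, copy back once" pattern (the specific operations are made up for illustration): intermediates stay in `thrust::device_vector`s, so only the first copies and the final result cross the PCIe bus.

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

float pipeline(const thrust::host_vector<float>& h_a,
               const thrust::host_vector<float>& h_b) {
    // One host-to-device copy per input.
    thrust::device_vector<float> d_a = h_a;
    thrust::device_vector<float> d_b = h_b;
    thrust::device_vector<float> d_tmp(d_a.size());  // temporary lives on the GPU

    // Chain several operations without touching host memory in between.
    thrust::transform(d_a.begin(), d_a.end(), d_b.begin(), d_tmp.begin(),
                      thrust::multiplies<float>());
    thrust::transform(d_tmp.begin(), d_tmp.end(), d_a.begin(), d_tmp.begin(),
                      thrust::plus<float>());

    // Only the final scalar result comes back to the host.
    return thrust::reduce(d_tmp.begin(), d_tmp.end(), 0.0f,
                          thrust::plus<float>());
}
```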
You would still need to transfer the data from the CPU's L1/L2 caches to the GPU (much as data moves for inter-core communication). While that is cheaper than a copy over the PCIe bus, it is not free.