Unfortunately, all experience shows that both programmers and compilers are much worse at parallelizing code than CPUs are. There have been many attempts at architectures that let compilers or programmers express code in more parallel ways (VLIW architectures, Intel's Itanium, IBM's Cell as used in the PS3), and they all categorically failed.
Successful processors offer a sequential execution model, and handle the parallelism internally. Even CUDA is designed somewhat like this: you express your code in a largely linear fashion, and rely on the NVIDIA-created compiler and on the GPU itself to run it in parallel.
CUDA is quite different. In CUDA you do not express parallel code in a linear, i.e. sequential, fashion and then hope that CUDA will discover the dependence chains, extract them from the sequential code, and execute them in parallel.
The main feature of CUDA is that in order to describe an algorithm that is applied to an array, you just write the code that applies the algorithm to a single element of that array, much like writing only the body of a "for" loop.
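As a concrete illustration, here is a minimal sketch (the kernel name and parameters are my own, not anything from the posts above) of the classic SAXPY loop reduced to its body; each thread computes one element:

    // CPU version: for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];
    // CUDA version: write only the loop body, once, per element.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element index
        if (i < n)                                      // guard: the grid may overshoot n
            y[i] = a * x[i] + y[i];
    }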
Then the CUDA run-time will take care of creating an appropriate number of execution threads, taking into account the physical configuration of the available GPU: how many array elements can be processed by a single instruction, how many execution threads can share a single processing core, how many processing cores exist in the GPU, and so on. When there are more array elements than the GPU resources can process at once, the excess thread blocks are queued and scheduled onto the cores as they free up, effectively looping until the entire array has been processed; alternatively, the programmer writes an explicit looping construct.
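The usual idiom for that explicit looping construct (assumed here; the comment above does not spell it out) is the grid-stride loop, in which each thread strides through the array by the total number of threads in the grid:

    __global__ void saxpy_strided(int n, float a, const float *x, float *y) {
        int stride = blockDim.x * gridDim.x;  // total number of threads in the grid
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            y[i] = a * x[i] + y[i];           // each thread handles every stride-th element
    }

This version handles any n under any launch configuration, e.g. saxpy_strided<<<32, 256>>>(n, 2.0f, x, y);.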
The CUDA programmer writes the code that processes a single array element, thereby informing the CUDA run-time that this code is independent of its replicas processing any other array element, except when the programmer explicitly references other array elements, which is normally avoided.
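For contrast, a sketch of that exception (again a hedged example of my own): a 1-D three-point average that explicitly reads neighboring elements. Writing to a separate output array keeps the threads independent despite the shared reads:

    __global__ void blur3(int n, const float *in, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;  // explicit neighbor reads
    }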
The task of a CPU capable of non-sequential instruction execution (a.k.a. out-of-order execution) and of initiating multiple instructions simultaneously (a.k.a. superscalar execution) is quite different.
The main problem for such a CPU is determining the dependence relationships between instructions, by examining the register numbers encoded in the instructions. Based on the detected dependencies and on the availability of operands, the CPU can schedule the instructions for execution in parallel over multiple execution units.
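A tiny illustration of what the hardware discovers on its own (ordinary C-style code, with names of my choosing): in the first loop every addition depends on the previous one through sum, forcing serial execution; splitting the work into two independent accumulators creates two dependence chains that the register-based detection can schedule side by side:

    float sum_serial(const float *a, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i];                // each add depends on the previous add via 'sum'
        return sum;
    }

    float sum_two_chains(const float *a, int n) {
        float s0 = 0.0f, s1 = 0.0f;     // two independent dependence chains
        for (int i = 0; i + 1 < n; i += 2) {
            s0 += a[i];                 // these two adds target different registers,
            s1 += a[i + 1];             // so the CPU can issue them in the same cycle
        }
        if (n & 1) s0 += a[n - 1];      // odd-length tail
        return s0 + s1;
    }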
There is an open research problem here: is there any better way to pass the information about inter-instruction dependencies from the compiler, which already knows them, to the CPU that runs the machine code, other than by using fake register numbers whose only purpose is to express the dependencies, and which the CPU's renaming mechanism must replace with the numbers of the physical registers before execution?
Agreed, CUDA is very different. But still, it's another example where programmers just write sequential, single-flow-of-execution code, and it gets executed in parallel according to "external" rules.