In theory, there's no reason one couldn't compile a linked list in source to a flat array at runtime if the access pattern was right (and linear types should make that kind of optimization a lot more practical. This might even be something one could implement "in userspace" in a language with linear types - like a safe version of an iterator. Certainly I'd be excited to try - I'm not claiming this is immediately production-quality). Or one could ask where the string is coming from in the first place, and trust some kind of fusion-style optimization to remove the intermediate datastructure entirely at runtime.
Peeling off the front 8 or 16 elements should be no harder than peeling off the front 1, though naively you might have to handle each partial case from 1 to 15.
GHC already does a lot of loop fusion and optimizing away of linked lists (which is good given the unfortunate overdue of cons-lists over nicer things among Haskell users) which can be extended to work with user code using rewrite rules. Libraries like text, bytestring, and vector do exactly the elision of intermediate structures that you talk about, allowing you to convert from a vector to a list, apply some function that annoyingly only works with lists, and convert back into a vector, and have that compile into a single tight for-loop in many cases.
I wonder what the upcoming linear types work will do.
> Peeling off the front 8 or 16 elements should be no harder than peeling off the front 1, though naively you might have to handle each partial case from 1 to 15.
I think I under-specified my requirement. :-) By "16 bytes at a time," I mean, "run a single CPU instruction on those 16 bytes."
But yeah, I get your drift. I can see how it might be theoretically possible. I suppose the key gains might be in how much confidence a programmer can have that their code compiles down to the right set of instructions.
Ultimately what we want is a representation that makes it easy to understand the performance characteristics of a given piece of code (and I'd agree that this is an area where the purist functional tradition has lagged behind). In the past assembly was this, but I don't know that it can be these days; I've read that on modern hardware, the biggest factor affecting performance is cache efficiency. So even seeing the sequence of machine instructions might not be enough to allow a developer to reason about performance, because two similar-looking sequences of CPU instructions can easily end up having very different performance characteristics (branch prediction can have a similar effect AIUI).
In the long term I hope for a "nice" language (in particular one in which checked correctness is easy) with explicit performance semantics along the lines of http://www.cs.cmu.edu/~rwh/papers/iolambda/short.pdf . But yeah under the current state of the art there is no ideal approach, so there's a tradeoff in practice even though I don't think there should be one in theory.
In theory, there's no reason one couldn't compile a linked list in source to a flat array at runtime if the access pattern was right (and linear types should make that kind of optimization a lot more practical. This might even be something one could implement "in userspace" in a language with linear types - like a safe version of an iterator. Certainly I'd be excited to try - I'm not claiming this is immediately production-quality). Or one could ask where the string is coming from in the first place, and trust some kind of fusion-style optimization to remove the intermediate datastructure entirely at runtime.
Peeling off the front 8 or 16 elements should be no harder than peeling off the front 1, though naively you might have to handle each partial case from 1 to 15.