
> This particular code cannot be effectively optimized because of the early exit inside the loop. What if `size` was enormous and the first character was negative? One extra byte of lookahead might SIGSEGV! So it must gingerly step one byte at a time.

If `size` was enormous and the first character "was negative"¹, then one extra byte of lookahead would never SIGSEGV; `size` is enormous, and by definition, that is presumably more than our lookahead.

If you mean "the first character of the lookahead's block", then it still doesn't make sense. Even in the case where we exit later, we still can't access later bytes of the lookahead if we don't know them to be in bounds.
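For concreteness, I'm assuming the loop under discussion looks roughly like this (a sketch of my own; the name `c_is_ascii_branchy` is taken from the call shown further down the thread, but the body is my reconstruction, not the article's actual code):

```c
#include <stddef.h>

/* Sketch of the early-exit loop being discussed: return 0 as soon as a
 * negative (non-ASCII, assuming signed char) byte is seen, reading no
 * further bytes.  Hypothetical reconstruction, not the article's code. */
int c_is_ascii_branchy(const char *s, size_t size) {
    for (size_t i = 0; i < size; i++) {
        if (s[i] < 0)          /* high bit set: early exit */
            return 0;
    }
    return 1;
}
```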

In the example Godbolt you linked to, the compiler is doing 8-byte jumps with the index. (It's not doing a great job, as we're still reading a byte at a time. But the bounds checks and the index increment do get reduced.) But what the compiler is doing in your example could also be done in the early exit example. It'd be basically the same code, but with .LBB0_10 replaced by something like:

    cmpb $0, (%rdi,%rdx)
    js .do_exit
    cmpb $0, 1(%rdi,%rdx)
    js .do_exit
    # etc. for the remaining 6 reads
  .do_exit:
    xorl    %eax, %eax
    retq
This is very similar to your Godbolt, which also does 8 individual memory accesses. We know they won't segfault here for the same reason we know they won't in your Godbolt: we check ahead that we can access the next 8 bytes and abort to the non-unrolled version if not.

The cmp/js's are very likely not as efficient as the mov/or's, but that's not the point; the point is the unroll is possible.
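To make "the unroll is possible" concrete, here is the same transformation expressed in C rather than assembly (my own sketch, reusing the hypothetical `c_is_ascii_branchy` shape from the thread): the outer bounds check guarantees the next 8 reads are in bounds, and the early exit is fully preserved.

```c
#include <stddef.h>

/* Unroll-by-8 with the early exit preserved: the outer check guarantees
 * all 8 reads are in bounds, so no lookahead can fault. */
int c_is_ascii_unrolled(const char *s, size_t size) {
    size_t i = 0;
    while (size - i >= 8) {            /* 8 in-bounds bytes guaranteed */
        for (int k = 0; k < 8; k++)    /* constant trip count; trivially flattened */
            if (s[i + k] < 0)
                return 0;              /* still stops at the first negative char */
        i += 8;
    }
    for (; i < size; i++)              /* scalar tail */
        if (s[i] < 0)
            return 0;
    return 1;
}
```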

It is not the early exit that prevents a loop unroll. Something else is stopping it: either the compiler does not think it worthwhile, it is not sufficiently intelligent, or some arcana of C prevents it.

(Your example alternates between mov and or, in a very interesting pattern, which makes me suspect it also knows something there, and perhaps more opportunity.)

¹This is an ugly example, since the implementation is free to make plain `char` unsigned (range 0–255), in which case no character is ever negative.



> one extra byte of lookahead would never SIGSEGV; `size` is enormous, and by definition, that is presumably more than our lookahead

But `size` is just an arbitrary number, while the code as written must not proceed past the first negative char. For example, some jerk may write:

    c_is_ascii_branchy("\xFF", a_gazillion);
but the function has to stop at the first char.

> It is not the early exit that prevents a loop unroll

It really is!


>> It is not the early exit that prevents a loop unroll

> It really is!

I think that's the dispute: does an early exit prohibit the compiler from generating code with additional reads? Or is the actual prohibition about causing segfaults? Intuitively, I'd think that so long as it produces the correct answer and avoids visible side effects (like segfaults), any number of additional reads from anywhere would be legal, but I don't know the spec well enough to claim this is actually correct.


Without any additional tricks, a vectorized loop could fail in practice with a segfault, which is certainly observable.

However, a compiler could align the memory accesses to avoid this problem on a particular platform, say based on its knowledge of the 4k protection granularity on x86. For example, if you 32-byte align your reads, and the source algorithm reads the first byte of a 32-byte chunk, it's safe to read the remaining 31.
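The arithmetic behind that claim can be sanity-checked: x86 pages are at least 4 KiB, and 4096 is a multiple of 32, so a 32-byte-aligned block can never straddle a page boundary. A small sketch of my own (the function name is mine, purely illustrative):

```c
#include <stdint.h>

/* Returns 1 if the 32-byte-aligned block containing `addr` lies entirely
 * within the 4 KiB page containing `addr`.  Because 4096 % 32 == 0, this
 * holds for every address, which is why touching one byte of an aligned
 * block makes the remaining 31 bytes safe to read. */
int aligned_block_within_page(uintptr_t addr) {
    uintptr_t block = addr & ~(uintptr_t)31;    /* 32-byte-aligned base */
    uintptr_t page  = addr & ~(uintptr_t)4095;  /* 4 KiB page base      */
    return block >= page && block + 32 <= page + 4096;
}
```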

Actual implementations of things like strlen do exactly this, but they are generally hand-written: I haven't seen compilers do it during autovectorization. One problem is that it produces false positives in tools like Valgrind.

Bonus fact: icc actually breaks the rules in similar scenarios and vectorizes (without any alignment) loops that gcc and clang don't, and may produce writes where none existed in the source, causing segfaults or concurrency bugs in correct programs. So any time you see icc successfully vectorize a loop that other compilers don't, you have to wonder whether it was smarter, or simply breaking the rules.


> Without any additional tricks...

Yes, the question is about which "additional tricks" are legal according to the C standard. That is, what are the limits of what is technically allowed if you are wearing nothing but your language-lawyer hat? Is the rule that anything goes as long as you don't segfault, or something more stringent? For that matter, is a segfault actually evidence that you have broken the rules? Note that this is different from the rules for "undefined" behavior, as that pertains to the C source code rather than the generated assembly.

My impression is that the C standard assumes a hypothetical virtual machine that does not specify details like page sizes. I think this means that according to the standard, over-reads must be legal or illegal for a conforming compiler regardless of whether they cause a segfault on a particular real-world architecture. Or maybe the C standard doesn't make any claims as to whether the generated code will actually run on any particular architecture?

> or simply breaking the rules

Which rule? My conclusion right now is that maybe there isn't one, and that the standard doesn't actually guarantee that a conforming compiler will generate code that runs to completion without segfault in the real world. There are simply too many differences in actual CPU's to make any firm guarantees. It's obviously fair to consider a compiler "buggy" if it does this, but I'm not sure it's noncompliant.


> Yes, the question is about which "additional tricks" are legal according to the C standard. That is, what are the limits of what is technically allowed if you are wearing nothing but your language lawyer hat?

I probably should have been clearer, but it is my position that the suggested trick (alignment) is definitely allowed by the standard. As you point out, the standard works at a much higher level and is never going to say anything about memory protection, page size, etc. The program just has to produce the side effects required by the standard (output, etc.), and whatever it wants to do to achieve that is fine. Even with my language-lawyer hat firmly in place, I can't see any possible reason it would be disallowed: the standard doesn't even have the tools or language in place to start talking about it.

> I think this means that according to the standard, over-reads must be legal or illegal for a conforming compiler regardless of whether they cause a segfault on a particular real-world architecture. Or maybe the C standard doesn't make any claims as to whether the generated code will actually run on any particularly architecture?

The standard makes no guarantees about any of it. It basically just says "these side effects must occur". How that is implemented is totally up to the compiler and runtime environment.

Note though that what I'm suggesting will "always work" in the sense that a compiler targeting x86 would produce x86-specific code that uses alignment and knowledge of the page boundary to avoid faults. This won't fail on some other architecture, because the code wouldn't be compiled that way for a different architecture. It's no different than the compiler using x86-specific instructions when compiling for x86: obviously it must do this, and the result can't run on some other architecture.

Added: I should be clear that this is a valid transformation for the compiler to make, but it is not conforming for a programmer to write a program that does this at a source level!

That is, the compiler can take a conforming program that doesn't make any over-reads at the source level, and compile it to assembly that makes "known safe" over-reads at the assembly level.

However, if you wrote the equivalent of the transformed algorithm yourself, in the source, it would not be a conforming program, because out-of-bounds reads are definitely disallowed by the standard, regardless of what you "know" about the platform. This sort of asymmetry between what the compiler can do and what you can validly write in the source language is pervasive.
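One way to see the asymmetry: a programmer who wants a word-at-a-time trick must make the reads genuinely in bounds at the source level, for instance by over-allocating padding. A sketch of my own (not anything from the thread; the function name is mine):

```c
#include <stdlib.h>
#include <string.h>

/* Copy `src` into a buffer padded out past the next multiple of 8 bytes,
 * so that later 8-byte-at-a-time reads are in bounds in the abstract
 * machine, not merely "known safe" on a particular piece of hardware. */
char *copy_with_padding(const char *src, size_t len) {
    size_t padded = (len + 7) & ~(size_t)7;  /* round up to multiple of 8 */
    char *buf = calloc(padded + 8, 1);       /* zeroed padding, never OOB */
    if (buf)
        memcpy(buf, src, len);
    return buf;
}
```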

> Which rule?

I was a bit loose in my language. I mean the most important rule: that a conforming implementation must produce the side effects that would be produced by the abstract machine. Informally, that the program "works at all rather than simply crashing".

You can compile a simple, conforming program with icc that produces "Segmentation fault" rather than "Hello world" or whatever other compilers produce and the abstract machine would produce. This is breaking the rules (rule #1, basically).


> but the function has to stop at the first char.

Yes, and my example of an unrolled version of the early exit loop does stop at the first character, despite being unrolled.

(That is an interesting example, though, and I think it is a good proof that generating extra reads is not permitted in general.)



