A function with an unlikely slowpath can easily end up arranged as
top part
jxx slow
fast middle part
end:
bottom part
ret
slow:
slow middle part
jmp end
There may be more than one slow part, the slow parts might actually be exiled from inside a loop and not a simple linear code path and can themselves contain loops, etc. Play with __builtin_expect and objdump --visualize-jumps a bit and you’ll encounter many variations.
In addition to what others said, I'd simply point out that all 'ret' does on x86 is pop an address off the top of the stack and jump to it. It's more of a "helper" than a special instruction and it's use is never required as long as you ensure the stack will be kept correct (such as with a tail-call situation).
Right, I didn't want to get into it but definitely using 'ret' "properly" has big performance benefits. My point was just that it won't prevent your code from running, it's not like x86 will trigger an exception if they don't match up.
RET does more these days. If Intel CET is enabled then it also updates the hardware shadow stack, and the program will crash if RET is bypassed unless the SSP is adjusted. IIRC Windows x64 also has pertinent requirements on how the function epilog restores registers and returns since it will trace portions of the instruction stream during stack unwinding.
the call is still in tail position whether or not it reuses the stack frame. there are also more involved ways to do tail call optimization than a direct single-jump compilation when you leave ret behind entirely, such as in forth-style threaded interpreters
You don’t need recursion to make use of tail call elimination. In Scheme and SML all tail calls are eliminated. GCC also does it, but less often. Still, it’s not recursion that triggers it.
i only meant that "optimized/eliminated tail call" is more useful terminology than an uneliminated tail call not counting as "a tail call". i find this distinction useful when discussing clojure, for instance, where you have to explicitly trampoline recursive tail calls and there is a difference between an eliminated tail call and a call in tail position which is eligible for TCO
i'm not sure how commonly tail calls are eliminated in other forthlikes at the ~runtime level since you can just do it at call time when you really need it by dropping from the return stack, but i find it nice to be able to not just pop the stack doing things naively. basically since exit is itself a threaded word you can simply¹ check if the current instruction precedes a call to exit and drop a return address
in case it's helpful this is the relevant bit from mine (which started off as a toy 64-bit port of jonesforth):
.macro STEP
lodsq
jmp *(%rax)
.endm
INTERPRET:
mov (%rsi), %rcx
mov $EXIT, %rdx
lea 8(%rbp), %rbx
cmp %rcx, %rdx # tail call?
cmovz (%rbp), %rsi # if so, we
cmovz %rbx, %rbp # can reuse
RPUSH %rsi # ret stack
add $8, %rax
mov %rax, %rsi
STEP
¹ provided you're willing to point the footguns over at the return stack manipulation side of things instead
My gut (been a while since I've been that low level) is various forms of inlining and/or flow continuation (which is kinda inlining, except when we talk about obfuscation/protection schemes where you might inline but then do fun stuff on the inlined version.)
If compilation uses jmp2ret mitigation, a trailing ret instruction will be replaced by a jmp to a return thunk. It is up to the return thunk to do as it pleases with program state.