Good production compilers recognize all these swap idioms and will compile them down (i.e., canonicalize them) to the same thing. In the case of things like swapping by adding and xoring, they'll often undo the cleverness and just use temporary registers anyway. These are the kinds of micro-optimizations that compilers excel at, so write the code clearly, leave that job to them, and worry about the high-level stuff that the compiler can't do for you.
I don’t think the compilers recognize the specific idiom here, but instead (what you also say) perform normalizations (e.g. into static single assignment form), expression analysis and elimination, yielding the same result. For the present case, I’d imagine a sequence of transformations like this:
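Something like the following is what I have in mind - a rough C sketch of my own, with the SSA-style rewrites shown as comments (not any particular compiler's actual IR):

void swap_addsub(unsigned *x, unsigned *y)   /* toy example */
{
    unsigned a = *x, b = *y;
    a = a + b;   /* SSA:  a1 = a0 + b0                                 */
    b = a - b;   /* SSA:  b1 = a1 - b0 = (a0 + b0) - b0  =>  b1 = a0   */
    a = a - b;   /* SSA:  a2 = a1 - b1 = (a0 + b0) - a0  =>  a2 = b0   */
    *x = a;      /* so after simplification the stores reduce to       */
    *y = b;      /* *x = b0, *y = a0: a plain swap, no arithmetic left */
}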
True; I probably overstated it in terms of idiom recognition (compiler enthusiast, but by no means expert). I suspect that a similar SSA or data flow analysis combined with some judicious rewrite rules like (X ^ Y) ^ Y -> X [0] are enough for the xor swap trick as well.
This example is kind of funny in a compiled language because a lot of modern compilers use a static single assignment (SSA) form as an intermediate representation. SSA effectively creates temporaries for each subsequent assignment operation. This helps with data-flow analysis but also makes it so that code like in the post can actually be less space efficient without optimizations in compiled languages.
It's better in a number of ways, and since each bit is independent, it could actually be faster on really low-end/old processors.
Although these days just about any modern OoO processor will just detect swaps (even if there isn't an actual instruction) and the renamer makes it zero cost.
The only weird thing is that the article has managed to reach the front page. I was under the impression all number (and pointer) swap methods are very well known - with xor being the most applicable.
Edit: I am hard pressed to think of an assembly language that doesn't have an exchange (swap) between registers but does have an xor instead - e.g. the 6502.
Indeed, I first learned about xor-swap from an example in the IBM System/360 Principles of Operations manual, circa 1968. (For some funky reason I don't understand, the Programming Examples section vanished from later editions of PrincOps.) I can't see any reason for wanting to use arithmetic instead.
6502's xor is not efficient (IIRC) for swapping because the result is always stored in A, and the second operand is either an immediate or memory - i.e., it is too much of an accumulator machine. Similarly for the 6800: even though the xor can target multiple registers, the source still has to be memory, so while it's possible to create an xor swap, the alternatives are going to be more efficient. The Z80/8080/etc. have an exchange.
I suspect it might be useful on ARMv1 because xor can operate between registers and it doesn't (IIRC) have an xchg. ARM wasn't nearly register-starved enough, though, nor did it have special register uses (index vs. accumulator, etc.), for a trick like this to be really useful.
PS: The fastest way to swap a register pair on 6502 is probably storing the first register into the instruction stream as the operand of an immediate load, then doing a register-to-register transfer, followed by the immediate load. Bonus if the code is running in the zero page, saving a cycle on the store. (3+2+2 = 7 cycles.)
I had it as an interview question a few years ago. I explained it before the whiteboard was complete but many people aren't necessarily familiar with such things.
> The only weird thing is that the article has managed to reach the front page.
This is the kind of detail not surfaced in a Boot Camp.
This isn’t to harsh on BCs. They have a specific goal and a short fuse, and the kind of things they cover use languages that are so high level you tend to have no visibility into the actual machine.
I am not a Javascript expert: Would the XOR version work in Javascript where numbers are Doubles? Perhaps that's the reason the article used addition/subtraction instead?
By the way, while one indeed needs the ability to "cancel out"/invert, we can derive similar swaps in weaker algebraic structures (and don't need full Abelian groups). For example, non-zero rationals can be divide-swapped: a = a / b; b = a / (1 / b); a = b / a; or similar for subtraction-swap. In general, a quasigroup seems to suffice: a = a * b; b = a / b; a = b \ a (where * is the group operation and / and \ are the right and left divisions, respectively).
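For instance, here's the divide-swap written out in C with doubles - a toy sketch of mine; over the non-zero rationals it's an exact swap, while with floating point it's only exact up to rounding (it happens to be exact for the values below):

#include <stdio.h>

int main(void) {
    double a = 6.0, b = 3.0;
    a = a / b;                   /* a = a0 / b0          = 2.0       */
    b = a / (1.0 / b);           /* b = (a0/b0) / (1/b0) = a0 = 6.0  */
    a = b / a;                   /* a = a0 / (a0/b0)     = b0 = 3.0  */
    printf("a=%g b=%g\n", a, b); /* prints: a=3 b=6                  */
    return 0;
}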
The classic use of XOR-swap was to traverse a tree without needing a stack. As you traverse the tree, the forward links are swapped with backward links, so you can find your way back. Used in the mark phase of some early garbage collectors.
Yah, or just store both the forward and backwards links in a single variable by xoring the forward and backward links together and storing it. Then the list traversal direction is picked by selecting either the head or tail and starting the operation. Halves the cost of a doubly linked list, while allowing the same function to be used for forward and backwards traversal.
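A minimal sketch of that idea in C (my own toy code, assuming the usual flat address space so pointer/integer round-trips through uintptr_t behave as expected):

#include <stdint.h>
#include <stdio.h>

struct node {
    int value;
    uintptr_t link;   /* (uintptr_t)prev ^ (uintptr_t)next */
};

int main(void) {
    struct node n[4] = {{10, 0}, {20, 0}, {30, 0}, {40, 0}};

    /* Build links: each node stores prev ^ next, with 0 standing in for
       NULL at both ends of the list. */
    for (int i = 0; i < 4; i++) {
        uintptr_t prev = (i > 0) ? (uintptr_t)&n[i - 1] : 0;
        uintptr_t next = (i < 3) ? (uintptr_t)&n[i + 1] : 0;
        n[i].link = prev ^ next;
    }

    /* Traverse forward from the head; starting from &n[3] instead walks
       the same list backward with the very same loop. */
    struct node *prev = NULL, *cur = &n[0];
    while (cur) {
        printf("%d\n", cur->value);
        struct node *next = (struct node *)(cur->link ^ (uintptr_t)prev);
        prev = cur;
        cur = next;
    }
    return 0;
}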
Halves the cost provided that touching that many cache lines is free. Even after you rewrite them back to how they were, the cache doesn't know, so has to write back every line.
If it's a balanced tree, you need only log(n) stack entries you can statically provision, and then other threads can traverse the tree at the same time.
Most of the traditional optimizations turned into pessimizations decades back. That has been a good thing.
> Even after you rewrite them back to how they were
Why would you do that? I was under the impression that the trick with the "XOR-ly linked list" is to keep these in RAM strictly in the XOR'd form. What exactly did you mean by "has to write back every line"?
If you're interested in this sort of thing "The Art of Garbage Collection" is the standard go-to text these days.
It's been a little while, but as I remember, the main reason for pointer reversing is to be able to not use any extra space to keep track of the list of grey (currently being processed) objects in a standard three-color collector. As you'll recall, a basic tracing collector marks each object as one of 3 colors: black, white, and grey. It must maintain the invariant that black objects never have pointers to white objects. A (full) GC cycle begins with all objects marked white, then the GC roots are marked grey. Then the collector repeatedly removes an object from the grey set, marks all of the white objects it points to as grey, and then marks the just removed grey object black. The GC cycle ends when there are no more grey objects, at which point all white objects are garbage and can be reclaimed.
This is often implemented by using a single bit in the object header to keep track of white/black, and a stack to keep track of the set of grey objects. In this setup, a white object is marked grey by setting the color bit in the object header to whichever color indicates black, and putting it in the grey set. That is, grey objects are really marked black, but in the grey set. In other words, the color bit in the object header is really a "white"/"not white" bit. This is the easiest way to ensure a white object never gets added twice to the grey set without having to explicitly mark the object as grey. (Some systems actually invert the sense of the color bit in the object header between major GC collections. In one major collection 0 means white, and the next major collection 0 means black. This avoids having to scan the heap marking all objects white at the beginning of the GC cycle.)
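As a concrete illustration of that setup (my own toy sketch, not taken from any particular collector): one "not white" bit per object plus an explicit stack holding the grey set.

#define MAX_CHILDREN 4
#define MAX_GREY 1024

typedef struct Obj {
    int marked;                      /* 0 = white, 1 = "not white"          */
    struct Obj *child[MAX_CHILDREN]; /* outgoing pointers, may be NULL      */
} Obj;

static Obj *grey_stack[MAX_GREY];    /* the grey set, kept as a stack       */
static int grey_top;

static void shade(Obj *o) {          /* white -> grey */
    if (o && !o->marked) {
        o->marked = 1;               /* the bit really means "not white";   */
        grey_stack[grey_top++] = o;  /* greyness is membership in the stack */
    }
}

void mark_from_roots(Obj **roots, int nroots) {
    grey_top = 0;
    for (int i = 0; i < nroots; i++)
        shade(roots[i]);
    while (grey_top > 0) {                     /* pop a grey object, shade   */
        Obj *o = grey_stack[--grey_top];       /* its white children; the    */
        for (int i = 0; i < MAX_CHILDREN; i++) /* popped object is now black */
            shade(o->child[i]);
    }
    /* Everything still unmarked (white) is unreachable and can be reclaimed. */
}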
One way to implement the logical stack of grey objects is to use a singly-linked list. If you want to implement a linked list of objects without adding an extra pointer to every object's header, and you don't want to allocate any extra space in the middle of a garbage collection, then you need to figure out some kind of clever hack to re-purpose some existing space. I forget the exact details, but (1) if an object doesn't have any pointers in it, you can immediately mark it black without first adding it to the grey set ... and that's where I lose track of the details about how you keep track of which object pointer you've borrowed to reverse in order to use a linked list to implement your grey object stack.
So, I'm missing the key detail that makes the whole trick work, but at a high level, you have extra storage space of one pointer to point at the top of the grey stack. The object at the bottom of the grey stack has its pointer nulled out, and off to the side you keep track of that last final pointer to restore when you empty the grey stack (which ends the GC cycle). When you pop an object off of the grey stack, you point its pointer to the previous object popped off of the grey stack (thus restoring (re-reversing) its reversed pointer).
However, I can't for the life of me remember how you keep track of which pointer you've reversed in each object without using any extra space. I don't think you can always just use the first pointer in an object, because that could trap you in a cyclic sub-graph.
EDIT: I've found a few different sets of lecture notes online, but they're super vague... just saying you reverse pointers in order to make a linked list of grey objects. However, if you reverse all of the pointers, you don't get a linked list, but a general graph. You need to have exactly one pointer per object in order to have a singly linked list. If you have some rule of always using the first (or last) non-null pointer in an object, you might still wind up in a situation where you're reversing a cycle of pointers. I found a nice slide deck showing the steps, but the example was acyclic, and there was no explanatory text.
I thought maybe there's a trick with using two bits for color in the object header, and using that to mark which pointer has been reversed by also marking the target object, but that breaks if there are multiple pointers to the specially marked object.
Maybe the unmentioned trick is to use tagged pointers to keep track of the one pointer in each grey object that has been reversed. I really hope that's not it. That just feels... less brilliant.
Aha... it seems I've been lied to! It seems the proof of correctness is for a binary graph (each node has an out-degree of at most 2, but cycles are allowed in the graph) and needs 2 extra bits of state per node. [0][1]
Now, one can represent any higher-degree graph as a binary graph by simply replacing any higher-degree nodes with a binary tree with the required number of out-nodes, but then that increases the number of extra bits required.
Gries's 2006 less formal writeup [1] is simple to extend to higher-degree nodes. Instead of his 2-bit counter, one can use a counter that goes from 0 to the number of out-links. My guess is that real implementations use tagged pointers to indicate which pointer has been reversed instead of using a counter in Gries's write-up.
I feel so lied to! No wonder I couldn't figure out how to do it without tagged pointers! Though, please correct me if you find a pointer reversal algorithm that doesn't use tagged pointers and is guaranteed to terminate in the case of arbitrary graph cycles.
That's not quite what I'm talking about. If you look at Gries's 2006 paper (the less formal one, not his formal proof of the algorithm), each node has 2 bits for state and two child pointers.
State 0: Node is white
State 1: Node is grey and child[0] points to parent
State 2: Node is grey and child[1] points to parent
State 3: Node is black and both child pointers are restored
To generalize this algorithm to nodes with k children, you need to either pack ceil(log2(k+1)) bits into object header and tag bits in the first few fields of the object, or else have a white/not-white bit in the object header and have one tag bit in each object field, indicating which pointer has been reversed to point to the parent node. The latter is less complex, but results in O(k^2) time complexity in the repeated scanning of the object to find the currently reversed pointer each time you recurse back up to the node while traversing the graph. To really maximize speed, I suspect the O(k^2) algorithm is used in practice for small k, and the more complex counter spread across tag bits of multiple pointers is used for nodes with larger k.
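For what it's worth, here's how I read that 2-bit scheme as code - a toy C sketch of the binary pointer-reversal (Deutsch-Schorr-Waite) traversal, not Gries's own formulation:

#include <stddef.h>

typedef struct Node {
    int state;                /* 0 white, 1 grey (child[0] reversed),  */
    struct Node *child[2];    /* 2 grey (child[1] reversed), 3 black   */
} Node;

void mark(Node *root) {
    Node *parent = NULL, *cur = root;
    while (cur != NULL || parent != NULL) {
        if (cur != NULL && cur->state == 0) {
            /* First visit: turn grey, reverse child[0] to point at the
               parent, and descend into what child[0] used to point to. */
            Node *next = cur->child[0];
            cur->state = 1;
            cur->child[0] = parent;
            parent = cur;
            cur = next;
        } else if (parent == NULL) {
            break;                        /* back at the root: marking done */
        } else if (parent->state == 1) {
            /* child[0]'s subtree is finished: restore child[0], reverse
               child[1] to the grandparent, and descend into child[1]. */
            Node *grand = parent->child[0];
            parent->child[0] = cur;
            parent->state = 2;
            cur = parent->child[1];
            parent->child[1] = grand;
        } else {
            /* parent->state == 2: child[1] is finished too. Restore it,
               turn the node black, and pop one level up the reversed chain. */
            Node *grand = parent->child[1];
            parent->child[1] = cur;
            parent->state = 3;
            cur = parent;
            parent = grand;
        }
    }
    /* All reachable nodes are now black and every pointer is back where it
       started; cycles are safe because non-white nodes are never re-entered. */
}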
By any chance, were you referring instead to Richard Jones' "The Garbage Collection Handbook: The Art of Automatic Memory Management"?
Garbage Collection interests me greatly, and I was looking forward to exploring a new reference - as it turned out, I couldn't find anything titled "The Art of Garbage Collection", so I thought I'd ask.
By the same logic, you could argue that `x++` doesn't work for floats. As long as your intermediate values stay between `Number.MIN_SAFE_INTEGER` and `Number.MAX_SAFE_INTEGER`, you'll be fine.
"Please never use this in any production code. The less we have to think about a piece of code, the better it is. It's a fun thought experiment nevertheless!"
For what it's worth, the Python example isn't too relevant since it's hiding a number of intermediate variables (that said, swapping without intermediate variables tends to be more of a curiosity anyway -- and a motivated person can "well actually" the whole concept into oblivion). What it does is push a and b onto the interpreter stack, then (in the version of Python 3 I'm running, though the more recent one should be similar) it pops those into two local C variables and pushes them in a different order, then it pops them and stores them into local python variables.
import dis
def swap(a, b):
    a, b = b, a
    return a
dis.dis(swap)
The downvotes on the OP comment and the negativity in the replies are totally baffling to me. The OP's post was interesting and relevant, despite giving non-algorithmic examples. The article's title doesn't restrict the topic to algorithmic solutions - that's only implicit in the article's body - and even if it were explicit, that still wouldn't render the OP comment unrelated. It would have been enough to simply talk about the difference between an algorithmic and language-feature solution without implying something is wrong with even posting the latter.
And on a modern CPU the machine code does not map 1-to-1 to what actually happens when this is executed, the phrase "out-of-order execution" ought to give away that even though your machine code says A, then B, then C, the CPU may decide it was better to do A, then C, then B instead.
Express your actual intent, if your chosen language has a "swap" intrinsic, use it, if not write the swap in a natural and idiomatic way, and let other people do the lifting.
Congratulations, your program is now thread-unsafe because you thought using a temporary variable is expensive :-p using mutable static variables is a terrible idea in most situations.
The person that developed this trick is obviously familiar with the mathematical properties of numbers. But this is the wrong approach from many perspectives, including the impact on performance, the potential for under/overflow, and the lack of readability.
The author explicitly states that it should never be used due to readability. I'm interested in your point about performance, though - as a layman in this area, I would have thought that the additional operations were cheaper than whatever the overhead of a variable is, even for a primitive value type. Is that definitely not the case?
A modern CPU makes heavy use of out of order execution to do things quickly. Having the results of one operation depend on another makes it harder for the CPU to do this.
If you have this:
a = a + b
b = a - b
a = a - b
// use a for something
c = a + 7
then the final value of 'a' depends on both 'a' and 'b', so execution of the final line can't happen until the original 'a' and 'b' have both been fetched from memory/calculated/whatever.
However, if you have this:
temp = b
b = a
a = temp
// use a for something
c = a + 7
then the final value of 'a' depends only on 'b', so the final line can be executed even if the original 'a' isn't yet available.
In most cases swapping two variables with a temporary variable will be either zero-overhead -- because you've just changed the names you're referring to those variables by, and the compiler knows it -- or very, very low-overhead. You're not actually saving any memory with the fancy tricks.
> under/overflow
Assuming the 2 variables are treated as unsigned ints, even with under/overflows the algorithm works, since if a=a+b overflows, then b=a-b is guaranteed to underflow, thus returning the original a.
The overflowed bit is irrelevant here
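A quick demonstration in C (toy example of mine), using uint8_t so the wraparound is easy to see - unsigned arithmetic in C is defined modulo 2^N, which is exactly why the wraparounds cancel:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t a = 200, b = 100;   /* a + b = 300, which wraps to 44 mod 256 */
    a = (uint8_t)(a + b);       /* a = 44                                 */
    b = (uint8_t)(a - b);       /* 44 - 100 wraps to 200 = original a     */
    a = (uint8_t)(a - b);       /* 44 - 200 wraps to 100 = original b     */
    printf("a=%u b=%u\n", (unsigned)a, (unsigned)b);  /* a=100 b=200      */
    return 0;
}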
I remember just after college I applied for a .net job; when I came in for the interview, they gave me a paper exam with a bunch of questions on how to do silly little tricks, including that one specifically. I didn't take that job.
I tried to explain to a previous employer how much their little "brainteaser" test was a turn-off to legitimately talented engineers. I explained that everybody in town knows about the test and basically refuses to even entertain the company because of it.
Bear in mind this is a place building relatively simple web apps... standard line of business stuff. The test was questions like "how many words can you unscramble from these letters" etc.
A majority of the engineers felt the test meant you "knew you were working with smart people". A majority of the code was unmaintainable sludge. (though a few people were extremely talented)
I can't comment on the rest of the interview, but this question in particular wouldn't necessarily be bad for some positions if so many people hadn't memorized a few solutions. A large part of some jobs is applying first-principles reasoning to a set of arbitrary requirements, and, predicated on the assumption that a reasonable interviewer thought this was a good question or that they explicitly asked you to do your best to reason through the problem, you'd be pretty safe starting with what you know and seeing how far that got you.
A sample chain of reasoning might go as follows:
- We have to take some first step transforming (X,Y) into (f(X,Y),Y) or (X,f(X,Y)). If one has a solution the other must as well, so assume (X,f(X,Y)).
- The function f can't destroy any information since swapping two variables twice ends up where we started.
- That immediately requires that f(X,Y) must properly depend on Y or else some information would be destroyed. Moreover, it has to properly depend on X if we're to ever make forward progress in swapping the information content of the two variables.
- The problem doesn't specify how big the numbers are, so as a first pass at the problem there should probably exist a solution that works with 1-bit numbers.
- There are only 16 two-input functions on bits. Only 2 of them (XOR and XNOR) properly depend on both variables and don't destroy information.
- Applying XOR or XNOR a few times you'll easily find a small number of operations that swap two variables.
For an alternative formulation you might not bother with or notice the 1-bit simplification. You'll go on to the second operation yielding (g(X,f(X,Y)), f(X,Y)). Again looking for something that makes your life easier you assume g is some kind of partial inverse of f so that you've actually almost completed the swap (i.e., that your partial solution at this point is (Y, f(X,Y))). That suggests some kind of a group-structure and lends itself to the XOR/XNOR we previously discovered, the ADD/SUB people commonly attempt as well (and we live in the real-world, say something to the interviewer about wraparound arithmetic or needing an extra bit if you go that route), and all sorts of other methods.
And so on. If you have reason to actually try to _solve_ this problem instead of regurgitating answers you'll find a half-dozen easy methods which should still leave plenty of time for the rest of the interview.
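Spelled out, the XOR route from that chain of reasoning is the classic three-step swap (toy example of mine):

#include <stdio.h>

int main(void) {
    unsigned x = 0xCAFE, y = 0xBEEF;
    x ^= y;                      /* x = x0 ^ y0                 */
    y ^= x;                      /* y = y0 ^ (x0 ^ y0) = x0     */
    x ^= y;                      /* x = (x0 ^ y0) ^ x0 = y0     */
    printf("%x %x\n", x, y);     /* prints: beef cafe           */
    return 0;
}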
Side-note: I recall needing silly tricks like that more in .NET than other platforms. They didn't have an optimized popcount even as recently as the beginning of Covid for God's sake.
XOR swap, as others have noted, is another very common way of implementing a swap without temporary variables. It's also a good exercise for formal verification, here's a proof in Coq of XOR swap's correctness in C.[0]
This was my technical interview question for university. Something like “you have two registers, and only basic mathematical operators. Swap the values contained in the two registers” on a whiteboard.
I wonder if there is an efficient hardware implementation for swap... I imagine plainly swapping bits of data[i] with data[n-i] is an intense and cache-unfriendly algorithm.
I'm reasonably certain you'd more commonly say that the range is doubled. When we say "size", we usually mean the "width" of the variable, as e.g. demonstrated by the "sizeof()" operator in C, which gives the size/width in bytes[1].
[1] Actually a multiple of sizeof(char), where sizeof(char) is defined as 1. And a char is technically not 8 bits everywhere, but most of the time nowadays it is.
For future reference, don't assume everyone on HN writes C/C++. I know what those terms mean but I was trying to point out that OP probably meant the exact same thing.
Telling someone who doesn't know what widening a variable means that they meant widening a variable without explaining what widening a variable means helps no one.
(Plus adding 1 bit to the "bit size" of a variable _does_ double the base 10 size of the variable if you want to be really pedantic)
I think it's pretty intuitive that the size of a variable is the amount of memory it takes, while the range of a variable describes the bounds of the values it can represent.
This is consistent with other uses: The size of an array, the size of a data structure in general, the size of a file... In all of those you are concerned about how much memory or disk space is taken.
So I think calling the range "size" is counterintuitive and gives the wrong idea, and I do not agree that the "base 10 size" of a variable is equivalent to its range. (What does "base 10" have to do with it anyway? The range is doubled no matter what base we're operating in.)
https://godbolt.org/z/joh1jhdhq