One of my all-time favourites is this: suppose you have a linked list that eventually cycles: that is, the link of one of the nodes in the list points to a previous node in the list, but not necessarily the first node. Write a function to compute the cycle length of the cycle. Now if it was just a simple circular list, this would be trivial: set a pointer to a node, and move another pointer a node at a time until the two pointers meet. However, this list has an initial sequence of nodes of unknown length that is not in the cycle. So the trick is really to find a node that is guaranteed to be in the cycle.
A nice way to do this is to start with two pointers to the head to the list, and then advance one by a single node at a time, and the other by two nodes at a time. It is easy to see that they will eventually meet on the cycle. Once that happens, it is easy to count the cycle length.
Now suppose you are running a casino and you are using a random number generator. For any given seed, the generator will eventually cycle, but not necessarily to the seed. You can compute the cycle lengths for a given seed with two variables by analogy with the linked list solution stated above.
If I ask this question (I never would) and I get the "tortoise and the hare" answer, it's much more likely that the candidate already knew the answer rather than deduced it themselves. And I have learned exactly nothing about that candidate, except that they probably looked up common interview questions before coming.
A more reasonable answer to this question would be to store the nodes you've already seen in a hash map, keyed to their position, and stop when you encounter the same node twice. This in turn leads to interesting discussions about list size, memory cost, and more general questions about the properties of a hash map.
Only if you're looking for that specific answer? Seems pretty trivial to write a memory heavy solution with an array or hash which simply asks: "Have I seen this node before? Is it the first node?"
Does that tell you anything interesting about your candidate's ability to think critically and problem solve, though? There are far better questions (and indeed, types of questions) that can cover more ground and be more candidate-friendly.
Who couldn't figure out the idea behind tortoise and the hare? Maybe not the implementation, but the concept is probably the simplest way of going about it.
I am no programmer, and I managed to implement it not too long ago all on my own. It took no more than 10 minutes from idea to working implementation.
As the linked article in parent post points out, this was an open problem in CS for 12 years before someone came up with the tortoise and hare algorithm. You really expect a candidate to come up with it from the top of their head, in a few minutes, during a stressful interview?
No not the specific solution, but the concept of having to pointers moving at different speeds did be pretty clear no?
It is really not a hard problem. If you are dealing with cirular lists you will end up having to solve it sooner or later. The "open question in CS for 12 years" is probably easier explained by "nobody could be bothered" than "it is hard".
Considering there are limited use cases for pointers moving at different speeds in other fields, no, I wouldn't say it would be easy.
The idea itself, once you hear it, instantly makes sense. But having the ingenuity to thinking of it during a stressful period shouldn't be used as a measure of how good a candidate would be for a position. Especially considering there are hundreds of different ways of approaching such problems.
I think the opposite. It is a simple problem. I wouldn't expect someone to write a working solution on a whiteboard, but in a reasoning discussion come up with the answer.
Say if you have one list that might be circular from the "starting" element or not circular at all. You simply save the pointer to the first element an compare every subsequent element to that. If you find the end of the list it is not circular. If you find the first element it is circular.
The step from that to a general approach for finding whether a list contains circular references should be simple.
I am however coming from scheme where linked lists are abundant.
Edit: not to say I think it is a good problem for an interview, but I would expect someone to be able to solve the problem in one way or another it without googling. What I am questioning is that it is a hard problem, or a very hard solution.
I mostly agree with this, however if the job requires a C.S. degree or equivalent and you assume all qualified candidates have been exposed to this, then perhaps it can be a useful question for testing how well you can explain a solution.
Just look at how varied and complicated some of the explanations are for this out there. Some take mathematical approaches, others (me) try to boil it down to a few intuitive sentences... I can see it giving some insight into a candidate.
Unless you're interviewing Dijkstra, it's a ridiculous question to ask. Anyone who's heard it before knows the answer instantly (I knew the answer before I even finished reading the description) and anyone unfamiliar with it will likely flail.
Nah, I wouldn't say everyone would flail but Dijkstra. There are simpler solutions to cycle detection.. maybe not as efficient as Floyd, but you could answer the question. For example you could start with something basic and unoptimized like keep a data structure of all the nodes you walk, and when you hit one that's in the structure then you've encountered a cycle. This is a fair approach to engineering solutions to novel problems -- don't panic, get something working, then optimize where necessary.
And if you already know the best practice, like I said, it's not just about whether you know the answer -- it's how simply you can explain it to a teammate and the theory behind why it works. That's also a useful skill to interview for, especially for more experienced candidates.
I’m not convinced anyone unfamiliar would likely fail. A pretty obvious solution is to just have a list of all nodes you’ve been to, and check if the current node is in that list. Not the most optimal solution, but one I’d imagine a lot of decent programmers would be able to come up with.
Not saying it’s a good interview question, just that most good programmers shoukd be able to come up with some form of answer if they understand linked lists.
>Unless you're interviewing Dijkstra, it's a ridiculous question to ask.
like many ridiculously looking interview questions, they are actually a filter. For this particular types of questions young people tend to quickly solve them. In 1987, in 9th grade, during one of the first computer classes, never before seeing or touching programming (previous years of school in a small town in USSR), I myself came up with Dijkstra algorithm in less than 10 minutes when we were assigned task of finding shortest path. I remember the teacher looking impressed :) These days i treat it more like a signal - if people asks such puzzles a lot then they either intentionally filtering or even don't understand that and seriously think it is important CS(cience).
> and you assume all qualified candidates have been exposed to this
The point is that that's an absurd thing to assume. It would almost never come up organically (and has only really come up in the context of awful interview question blogs). It's not a taught/talked about algorithm.
Rote memorization is also a poor metric for job performance.
Expecting a candidate to actually know this trick is however condoning the fact that the technical interview is just a matter of cheating and cramming.
Whether it is a good or a bad interview question depends on how you use it. I've used a variant of this when interviewing CS PhD candidates. I don't expect anyone to know how to solve this immediately. If anyone did, I'd assume they'd seen it before, and ask something else.
What I want is to see how someone thinks about algorithmic problems. As they talk through what they're thinking, there are certain to be problems or limitations with their first suggestion. I'll probe on these, and see whether they understand those issues as they're pointed out. I'll give hints and see how they assimilate new ideas and information. In the end, 75% of candidates will get to a reasonable solution, and 25% will get to an optimal one, given some hints along the way. I'm much less interested in whether they get to the optimal one, or even in how fast they get there, than in what happens during the discussion along the path to a solution. You learn a lot about how someone thinks from that. It's also useful to learn how someone communicates - this is also an essential part of a PhD - there's a lot of back-and-forth goes on regarding possible research ideas, so seeing how someone communicates their ideas is useful.
In summary, in an interview, such questions can be good as the basis of a dialog, but are useless unless the interviewer understands this is what they're actually trying to achieve.
You're not getting any insight into the candidate's problem solving ability in general, though. You may be getting insight into how they think about a specific class of algorithmic problems, and possibly (although it's unlikely) algorithmic problems in general.
What this interview question does is confirm your own biases regarding what "algorithmic thinking" is and little more.
Of course you're correct, and this is why such a question only forms a part of such an interview. I'm much more interested in what someone has built. We'll have a good discussion about any software they've built, and what they found interesting or difficult about that. Or basically anything where they showed creativity.
Still, in my area of CS, algorithmic thinking is an important aspect of the sort of systems we design, so this sort of question does help me build up a picture of the person's skills and approach to problems.
I have found though that there's pretty good correlation between how someone goes about answering this question (the mental process they follow, not whether they can jump to a solution), and how interesting the systems they've previously built are. And, although my sample size is small, the people who did best on this question also did best over the next few years on the systems they built and analyzed during their PhD.
OK, I wholeheartedly agree with that. I was thinking, though, that discussing a problem such as the one with the linked list at least would give one sample point of the candidate's ability to reason. In any case, what are you supposed to do if you are restricted to making your decision based on interviews?
It is a data point, but of questionable value generally. Consider this: even academic settings give students multiple chances to demonstrate their true level of competence over a course of months in a very constrained subject and a highly controlled setting. I don't think it's rational to think a panel of interviewers can do it in a matter of hours.
I think interviews are basically just a way we fool ourselves into thinking what is essentially a random chance biased by a self-selected set of applicants is (relatively) more objective.
I don't know that there is a good alternative. I am quite sure the status quo is thoroughly broken.
I've used a variant of this when interviewing CS PhD candidates.
If in fact it's for an actual PhD qualifying exam (or something similar), then this question might be OK.
The problem with the current crisis in interviewing is that it's become almost standard practice to ask questions like this (or its siblings: knapsack, outré graph search or sorting questions, etc) for what are basically run-of-the-mill API monkey / grunt finance programming / etc jobs.
As if the message these companies intend to convey is: "Fuck, we have no idea how to assess these candidates, nor do we have the time. So if we just ask a few gee-whiz questions that 75% of them will fail, then that might be an indication that the other 25% just might be a little smarter. Or at least very good at cramming. Because after all, that's how we got through college, too."
It's essentially impossible to require candidates not to prepare for a technical interview, but someone who can't discover the tortoise and hare algorithm on their own probably shouldn't have a computer science degree.
> someone who can't discover the tortoise and hare algorithm on their own probably shouldn't have a computer science degree
This sentence is simply ridiculous. Most engineers aren't able to discover this trick. I wouldn't.
The same goes for the rod cutting problem or the maximum sub-array problem (Kadane algorithm): expecting an engineer to find them, especially in an interview, is totally inane.
This amount of nontrivial, yes. Similarly, I'd expect a computer scientist to come up with a proof of correctness for either binary search or merge sort (the out of place variant) in ~30 minutes. It's not particularly difficult.
The number of incorrect binary search algorithms published in the literature over the years would indicate otherwise…
I agree that reasoning about the correctness of a binary search algorithm is an important job skill for a computer scientist. Deriving the tortoise and hare algorithm on the fly, though, is not.
You didn't say the answer has to use O(1) space, so I'd just throw the nodes in a WeakSet, and when I reach a node that's already in the WeakSet, that's the start of the loop. Or better yet, use a WeakMap to associate nodes with their position, so that once I find the start of the loop I can immediately subtract to get the loop length, no second pass needed.
I always wonder if odd(and perhaps pointless) interview Q like this have been contorted from the original(much useful) problem like pollard's rho or it just happened to reduce to more useful ones.
Digging deeper into such questions yields fun observations which may be the only useful thing that comes from questions like these. Like the length of the Longest Increasing Subsequence of a sequence A, has a fun fact to do with the Chromatic Number of a graph derived formed from a matching between A and sorted(A).
Theoretical Math is beautiful and the answer to so many things.
I was asked this exact question during an interview for a senior scala dev position and nearly walked out. I asked the guy to show me how one would go about implementing a cycle with prepend only api and of course did not get an answer or the job
A proof that the algorithm detects the existence of a cycle.
If there is a cycle of length n then eventually both pointers get into the cycle. Since at each step the distance between the two pointers is increased by one module n then the distance becomes zero. Also the time to detect the loop is bounded by the time it takes both pointers to get into the loop plus the cycle length. Also this proof shows that the key ingredient is that the difference between the pointers must be coprime with n but still n steps are required to guarantee that the difference eventually becomes zero. So there is no advantage in choosing other values for the increment of the pointers except to achieve that the slowest get faster into the cycle.
if you program in Lisp the function list-length gives nil for circular lists.
Ninety-nine problems contains similar problems with lists.
I should have said that the difference between increments must be coprime with n to guarantee that the differece becomes 0 module n, that is both pointers get equal.
Consider a cycle of length n. Consider two integers, i and j that have arbitrary initial values in [0, n).
We want to prove that the following loop always terminates:
while i % n != j % n:
i += 1
j += 2
If we represent this as an equation over time, where t is the number of loops, we want to prove that for some t the following is true:
(i + t) % n = (j + 2t) % n
Solving for t:
(i % n) + (t % n) = (j % n) + (2t % n)
(i % n) - (j % n) = t % n
(i - j) % n = t % n
We can generalize for arbitrary speeds of the pointers. Let's say i moves at speed x and j moves at speed y:
(i + x*t) % n = (j + y*t) % n
(i % n) + (x*t) % n = (j % n) + (y*t % n)
(i - j) % n = t*(y - x) % n
This will always have a solution if (y-x) is coprime with n.
A trivial case where it doesn't work, for example, is x=2, y=4, n=any even number. Imagine if i starts on an odd number and j starts on an even number. It's clear that i will always be odd and j will always be even. Note that y-x (2 in this case) shares a prime factor (not coprime) with all even numbers.
Note that if you add the requirement that i and j start on the same node, then no matter what step size you use for x and y, as long as they are integers > 0 and x != y, then i and j will meet again if and only if there is a cycle.
For particular values of x, y, and loop period there may be values of i and j such that going forward from those they never meet, but it is not possible to get into one of those positions if a past state had i == j.
Proof:
In the following I'll use your notation for the most part, except allowing for a "lead in" of length L from that common starting node to the start of the loop, so the nodes are numbered 0, 1, 2, ..., L-1, L, L+1, ..., L+n-1, with the nodes in the loop being {L, L+1, L+2, ..., L+n-1}.
At time t, i has taken xt steps and j has taken yt steps. Let t >= L/x and t >= L/y, so that i and j are both past the lead in into the loop.
Then i is xt-L steps into the loop, and j is yt-L into the loop. These will be at the same spot in the loop if xt-L = yt-L mod n, which happens whenever (x-y)t = 0 mod n. All multiples of n are solutions of this, so just pick any that are >= L/x and L/y, and you have a solution.
If i and j do not start at the same place, all bets are off. That 0 is replaced with the differences between the starting node numbers, and as your example shows may not have a solution for some combinations of parameters.
Maybe it would be helpful to rewrite the loop with a condition as you described, and %= i and j with n after incrementing (by 1 or by 2) both variables in the loop body.
Intuitive explanation: Once the pointers enter the cycle the fast one will eventually approach the slow one from behind like runners on a stadium track. If it's one node behind the slow one, then they'll meet on the next move (slow advances one, fast advances two). And if the fast one is two nodes behind, on the next move it'll be one node behind.
Counterexample: loop of 8 nodes, numbered 0, 1, ..., 7.
Start with one pointer at node 0, moving with speed 5. Start the second pointer at node 1, with speed 3 which is coprime to 5, the speed of the first pointer.
The pointer positions at each step are:
0 5 2 7 4 1 6 3 repeat
1 4 7 2 5 0 3 6 repeat
The condition we need in order to guarantee it works regardless of where the two pointers start is that the difference between the speeds is coprime to the cycle length.
Off the top of my head, because if you don’t know where the last element points back to, and there’s an even chance it could be any previous element, the median element is the one on the middle. Two at a time means when you reach the last element and loop back. The lagging pointer will point to the middle element so. Across a population on lists, it will perform better that ratios that favour the lagging pointer being closer to ge back end of the list when the leading pointer loops.
Also, with a 3 skip you could jump 2 ahead of the lagging pointer when you loop back behind and jump over it, which means when it advances it won’t catch you so they won’t meet.
The second part of this confused me as well, because the algorithm only detects the existence of the cycle, not its length. However, once you know a node in the cycle (which this algorithm will provide), you simply go around the cycle once more and you have the length.
Look in the Linux kernel to see how often linked lists are used. The list of a process’s memory mapping are, for example, maintained as linked list in parallel with a balanced tree. Lots of queues are maintained as linked lists. Free slabs in the slab allocator are kept in a linked list. The list goes on.
Yeah, until you need to remove an element from the middle in constant time - this is sometimes useful for things like the implementation of an LRU cache.
If you iterate over the list within an order of magnitude as often as you remove an element from the middle, an array will still be faster despite not having constant time removal, no matter the size of the list.
To restate, if you have a workload where you iterate over your collection ~500 times, and remove an element from the middle ~1000 times, an array will usually outperform a linked list on a modern computer, no matter the size of the list.
It is not until you do removes/adds far more often, and far from the end of the collection, that the linked list will perform better overall.
That's why I mentioned the LRU implementation. It does not iterate over the list, only adds elements to the front, deletes from the back and moves from the middle to the front. Refs to nodes are stored in a map, so iteration is not necessary to find an element.
You use a hash map, and make the linked list a linked list of hash entries. So, if you're using chaining (as opposed to the many variations on open addressing):
class Entry<K,V> {
Entry<K,V> younger; // LRU doubly linked list
Entry<K,V> older; // LRU doubly linked list
Entry<K,V> next; // for hash map bucket chaining
V value;
K key;
}
The naive solution using two separate data structures would look like:
class MapEntry<K,V> {
Entry<K,V> next;
DoublyLinkedListEntry<MapEntry<K,V>> LruEntry;
K key;
V value;
}
class DoublyLinkedListEntry<T> {
ListEntry<T> prev;
LintEntry<T> next;
T value;
}
The commenter mentions that in the last sentence. There is some separate data structure that also has references to nodes in the middle of the linked list.
He says because of cache misses due to each element being allocated individually.
But then working with references is bad in general? If your vector stores references to objects it's bad? Your object fields point to strings - bad too.
Should we ditch Lisp and Java?
Note, data should not be continuous to stay in cache. Cache is more like paged virtual memory, it consists of cache lines.
Before subscribing to the "never use linked lists" view I would like to see a nontrivial program rewritten from lists and references to the "allocate everything at once" approach and measure the performance.
If you want to avoid references and allocate everything in place, it may force your program to do a lot of copying.
Also, not only CPU performance matters - programmer performance is important too. So-called "high-level languages" try to optimise this.
>But then working with references is bad in general? If your vector stores references to objects it's bad? Your object fields point to strings - bad too.
It's not strictly bad, but it's useful to minimize the number of pointer derefrences you have wherever you can. A non-intrusive linked list will have 2 pointer dereferences to access any bit of data. You'll also have n+1 pointer dereferences to access element n. If you have fixed size small objects, then a vector of values is almost always better than a vector of pointers to the small objects. An intrusive linked list will save you a pointer dereference, but you still have the n dereferences to access element n.
>Should we ditch Lisp
Lisp's lists model a linked list with chains of cons cells, but there's no hard requirement for them to actually be implemented as linked lists. A typical approach is to implement lists in terms of Bagwell's VLists, which are a kind of middle ground between linked lists and vectors. You have reduced number of pointer dereferences, plus increased cache locality, whilst still being able to log n index, insert, delete, and not require large up-front allocations.
>and Java
If you subscribe to the "everything is an object" model religiously, then yes, you're probably doing harm. As always, there's no hard rules here and it always depends on your problem and data access requirements. You can usually get performance gains by using memory pools, arrays of structures/primitives, and entity systems instead of inheritance hierarchies.
You're misunderstanding the advice about contiguous. It's not that it's more likely to stay in cache, but if you're accessing data sequentially it's more likely the data you're going to access next is already in cache.
Most (all I've read/looked at) benchmarks in Java have data structures backed by linked lists utterly smashed by things implemented by an array.
There was in the last year or two a good c++ talk where a game or game engine changed their storage to be more array like and got roughly a 30% boost in performance. Memory locality is usually king, which is why linked lists are rarely used.
If you already know your data (which must be the case with arrays anyway), you can simply pre-allocate a pool from where the elements of the linked list will be taken. This guarantees that elements will be allocated from the same region in memory. It is a simple technique that I use whenever I believe that a linked list will be needed in a high-performance context.
Vectors are also usually more memory-efficient, period. A singly-linked list uses a pointer for each element an a doubly-linked uses two, while a vector has constant [a pointer and two integers (size and capacity)] overhead. (Unless the vector was dynamically resized and isn't near capacity.)
Cache consists of cache lines (usually 64 bytes). If an element in the list is smaller than the cache line a whole line is put in the cache anyway. This means that data is cached that you have not read yet. When using a contiguous array this unread data contains the next elements. So they are cached without ever having been read.
Essentially, many workloads access data sequentially and therefore modern cache architectures have special optimizations to make these memory accesses as fast as possible, by prefetching the next item in the sequence before it is actually needed.
Well, anyways I value code simplicity more than anything.
Maybe CPUs will learn to do cache prefetch for link oriented languages: if a "load and dereference" pattern is detected, CPU could prefetch the data referenced by pointers it has in registers or recently fetched cache line.
BTW, linked list elements are not necessary located far away from each other, if we allocated them one after another chance are they are near each other in memory.
Dependent on various internal arbitrary or intentional decisions, copying garbage collectors can in their normal workings place linked list cells in order in memory. While this does still take more memory than arrays, it can take advantage of the caching & sequential auto prefetch as well.
> data [needn't] be continuous to stay in cache. Cache is more like paged virtual memory, it consists of cache lines.
It sounds like you're saying everything will be OK so long as each chunk of 64 bytes (cache line) is used together. But one page is typically 4 KB, and if you use for example all 64 bytes of one cache line, but only one cache line per page, you will suffer from TLB misses.
If you read one field of all elements all at once, sure, a structure of arrays is probably right. But what if you instead read all fields of one element at once? Isn’t it advantageous to have those fields adjacent in memory with an array of structures?
The problem with trying to substitute lists with vectors is that their iterators behave differently. I.e. vector iterators point to positions rather than elements and are prone to being invalidated. So sometimes it's nice to have a vector that supports iterators that behave like list iterators[1].
You mean how it's implemented? Umm, well it's been a while, but basically an ipointer is a proxy for an iterator that is stored internally by the msevector. These "internally stored" iterators are updated when necessary. For example, when insert() or erase() is called.
One nice thing about it is that it roughly conforms to the principle of "only pay for what you use". That is, the run-time cost is roughly proportional to the number of ipointers you have and the frequency of operations that modify the size of the vector.
One caveat is that this mechanism is not thread safe. But whenever you need to share the vector among threads, you can swap it with a vector that is safe to share[1].
And for those that are into memory safety, there is also a memory-safe vector[2] that supports ipointers.
Is this the sort of explanation you're looking for?
Thanks, that's clear enough :-). In hindsight I can't imagine what alternative I was thinking of .. I had some idea you might have put the additional cost in the iterator by maintaining only an epoch counter in the vector, but that's obviously not enough to do the right thing in the presence of insert and erase.
Your library looks like a good toolset. While I still find the code pretty impenetrable, the number of tests I can see give me confidence. Bookmarked for reference when I'm using C++ again.
For instance, my current use case is an STM32F, which accesses CCM in a single cycle. And intrusive linked lists are a god send for managing pools in embedded systems without traditional memory management.
C++ folks tend to dislike linked lists because it's awkward to make C++ lists intrusive. The STL list containers store the pointers in a separate allocation from the object itself, so they're slower than they ought to be.
Where did you get that? The STL implementations I know definitely don't. See https://github.com/llvm-mirror/libcxx/blob/master/include/li... for an example - __list_node_base contains the pointers, __list_node contains the object and derives from __list_node_base, so allocating the node with object and pointers is one allocation.
Yeah, that's true for values you're willing to allocate inline. The key advantage of intrusive linked lists is being able to add and remove normally allocated objects without having to construct separate list nodes. The STL does have splice(), but it's awkward -- you always have to keep the object in a list and refer to it using iterators instead of normal references.
struct node* BuildWithLocalRef() {
struct node* head = NULL;
struct node** lastPtrRef= &head; // Start out pointing to the head pointer
int i;
for (i=1; i<6; i++) {
Push(lastPtrRef, i); // Add node at the last pointer in the list
lastPtrRef= &((*lastPtrRef)->next); // Advance to point to the new last pointer
}
// head == {1, 2, 3, 4, 5};
return(head);
}
> This technique is short, but the inside of the loop is scary. This technique is rarely used,
but it's a good way to see if you really understand pointers.
> This technique is never required to solve a linked list problem, but it will be one of the
alternative solutions presented for some of the advanced problems. The code is shorter
this way, but the performance is probably not any better.
(Incidentally, I found Nginx is written in the same style recently as I read it - manual linked list manipulation with this double pointer technique everywhere..)
I work with traditional RDBMS & and old hierarchical linked-list DB. The linked-list has great performance with super speedy Fortran, but it is a major pain to retrieve any information. Having to write complicated pointer commands is tedious at best. Select * is such a luxury.
A nice way to do this is to start with two pointers to the head to the list, and then advance one by a single node at a time, and the other by two nodes at a time. It is easy to see that they will eventually meet on the cycle. Once that happens, it is easy to count the cycle length.
Now suppose you are running a casino and you are using a random number generator. For any given seed, the generator will eventually cycle, but not necessarily to the seed. You can compute the cycle lengths for a given seed with two variables by analogy with the linked list solution stated above.