Linked List Problems (2002) [pdf] (stanford.edu)
271 points by starschema on Nov 25, 2017 | 110 comments



One of my all-time favourites is this: suppose you have a linked list that eventually cycles: that is, the link of one of the nodes in the list points to a previous node in the list, but not necessarily the first node. Write a function to compute the cycle length of the cycle. Now if it was just a simple circular list, this would be trivial: set a pointer to a node, and move another pointer a node at a time until the two pointers meet. However, this list has an initial sequence of nodes of unknown length that is not in the cycle. So the trick is really to find a node that is guaranteed to be in the cycle.

A nice way to do this is to start with two pointers at the head of the list, and then advance one by a single node at a time, and the other by two nodes at a time. It is easy to see that they will eventually meet on the cycle. Once that happens, it is easy to count the cycle length.

Now suppose you are running a casino and you are using a random number generator. For any given seed, the generator will eventually cycle, though not necessarily back to the seed. You can compute the cycle length for a given seed with two variables, by analogy with the linked list solution stated above.
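The two-phase idea above can be sketched as follows (a minimal Python sketch, with an assumed Node class; not from the original post):

    class Node:
        def __init__(self, value, next=None):
            self.value, self.next = value, next

    def cycle_length(head):
        """Return the length of the cycle reachable from head, or 0 if none."""
        slow = fast = head
        while fast is not None and fast.next is not None:
            slow = slow.next          # tortoise: one node per step
            fast = fast.next.next     # hare: two nodes per step
            if slow is fast:
                # The meeting point is guaranteed to be on the cycle,
                # so walk around once more to measure its length.
                length, probe = 1, slow.next
                while probe is not slow:
                    probe, length = probe.next, length + 1
                return length
        return 0  # fast fell off the end: no cycle

For a list 1→2→3→4→5 whose last node points back to node 3, this reports a cycle length of 3.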


As others have said, this is actually a canonical example, perhaps the best one, of a terrible interview question [1].

[1] https://news.ycombinator.com/item?id=7953725


Totally agree.

If I ask this question (I never would) and I get the "tortoise and the hare" answer, it's much more likely that the candidate already knew the answer rather than deduced it themselves. And I have learned exactly nothing about that candidate, except that they probably looked up common interview questions before coming.

A more reasonable answer to this question would be to store the nodes you've already seen in a hash map, keyed to their position, and stop when you encounter the same node twice. This in turn leads to interesting discussions about list size, memory cost, and more general questions about the properties of a hash map.
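A hedged sketch of that hash-map approach (the Node class and names are illustrative, not from the comment):

    class Node:
        def __init__(self, value, next=None):
            self.value, self.next = value, next

    def find_cycle(head):
        """Return (cycle_start_index, cycle_length), or None if the list ends.

        Trades O(n) extra memory for simplicity: a dict maps each visited
        node to the position at which it was first seen.
        """
        seen = {}
        node, index = head, 0
        while node is not None:
            if node in seen:                       # seen this node before?
                return seen[node], index - seen[node]
            seen[node] = index
            node, index = node.next, index + 1
        return None                                # hit the end: no cycle

Because nodes are keyed to their position, the cycle length falls out as a subtraction, with no second pass.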


Plus, someone should be able to get the hash table brute force answer coming in cold.


Only if you're looking for that specific answer? Seems pretty trivial to write a memory-heavy solution with an array or hash which simply asks: "Have I seen this node before? Is it the first node?"


Does that tell you anything interesting about your candidate's ability to think critically and problem solve, though? There are far better questions (and indeed, types of questions) that can cover more ground and be more candidate-friendly.


I liken these questions to a problem with a fancy car. If you have the exact tool it's trivial, but if not it's damn near impossible.

And a horrible interview question. It shows only that the interviewee has heard the question before or not.


Who couldn't figure out the idea behind tortoise and the hare? Maybe not the implementation, but the concept is probably the simplest way of going about it.

I am no programmer, and I managed to implement it not too long ago all on my own. It took no more than 10 minutes from idea to working implementation.


As the linked article in parent post points out, this was an open problem in CS for 12 years before someone came up with the tortoise and hare algorithm. You really expect a candidate to come up with it off the top of their head, in a few minutes, during a stressful interview?


No, not the specific solution, but the concept of having two pointers moving at different speeds would be pretty clear, no?

It is really not a hard problem. If you are dealing with circular lists you will end up having to solve it sooner or later. The "open question in CS for 12 years" is probably easier explained by "nobody could be bothered" than "it is hard".


Considering there are limited use cases for pointers moving at different speeds in other fields, no, I wouldn't say it would be easy.

The idea itself, once you hear it, instantly makes sense. But having the ingenuity to think of it during a stressful period shouldn't be used as a measure of how good a candidate would be for a position. Especially considering there are hundreds of different ways of approaching such problems.


I think the opposite. It is a simple problem. I wouldn't expect someone to write a working solution on a whiteboard, but I would expect them to come up with the answer in a reasoned discussion.

Say you have one list that might be circular from the "starting" element, or not circular at all. You simply save the pointer to the first element and compare every subsequent element to that. If you find the end of the list it is not circular. If you find the first element it is circular.

The step from that to a general approach for finding whether a list contains circular references should be simple.

I am however coming from scheme where linked lists are abundant.

Edit: not to say I think it is a good problem for an interview, but I would expect someone to be able to solve the problem in one way or another without googling. What I am questioning is that it is a hard problem, or a very hard solution.


Perhaps it wasn't, for you. It is for the VAST majority.


> but the concept of having two pointers moving at different speeds would be pretty clear, no?

> this was an open problem in CS for 12 years before someone came up with the tortoise and hare algorithm.

So, empirically, "no".


I mostly agree with this, however if the job requires a C.S. degree or equivalent and you assume all qualified candidates have been exposed to this, then perhaps it can be a useful question for testing how well you can explain a solution.

Just look at how varied and complicated some of the explanations are for this out there. Some take mathematical approaches, others (like me) try to boil it down to a few intuitive sentences... I can see it giving some insight into a candidate.


Unless you're interviewing Dijkstra, it's a ridiculous question to ask. Anyone who's heard it before knows the answer instantly (I knew the answer before I even finished reading the description) and anyone unfamiliar with it will likely flail.


Nah, I wouldn't say everyone but Dijkstra would flail. There are simpler solutions to cycle detection... maybe not as efficient as Floyd's, but you could answer the question. For example you could start with something basic and unoptimized, like keeping a data structure of all the nodes you walk; when you hit one that's already in the structure, you've encountered a cycle. This is a fair approach to engineering solutions to novel problems -- don't panic, get something working, then optimize where necessary.

And if you already know the best practice, like I said, it's not just about whether you know the answer -- it's how simply you can explain it to a teammate and the theory behind why it works. That's also a useful skill to interview for, especially for more experienced candidates.


I’m not convinced anyone unfamiliar would likely fail. A pretty obvious solution is to just have a list of all nodes you’ve been to, and check if the current node is in that list. Not the most optimal solution, but one I’d imagine a lot of decent programmers would be able to come up with.

Not saying it’s a good interview question, just that most good programmers should be able to come up with some form of answer if they understand linked lists.


>Unless you're interviewing Dijkstra, it's a ridiculous question to ask.

Like many ridiculous-looking interview questions, they are actually a filter. Young people tend to solve this particular type of question quickly. In 1987, in 9th grade, during one of the first computer classes, never before having seen or touched programming (the previous years of school were in a small town in the USSR), I came up with Dijkstra's algorithm myself in less than 10 minutes when we were assigned the task of finding the shortest path. I remember the teacher looking impressed :) These days I treat it more like a signal - if people ask such puzzles a lot, then they are either intentionally filtering, or they don't even understand that and seriously think it is important CS(cience).


> and you assume all qualified candidates have been exposed to this

The point is that that's an absurd thing to assume. It would almost never come up organically (and has only really come up in the context of awful interview question blogs). It's not a taught/talked about algorithm.

Rote memorization is also a poor metric for job performance.


My CS degree had a duration of 5 years, and I graduated during the 90's.

Excuse me if I don't recall a complete year of Algorithms and Data Structures lectures and evaluation projects done about 25 years ago.


This is the tortoise and the hare (https://en.wikipedia.org/wiki/Cycle_detection)

Expecting a candidate to actually know this trick is however condoning the fact that the technical interview is just a matter of cheating and cramming.


Whether it is a good or a bad interview question depends on how you use it. I've used a variant of this when interviewing CS PhD candidates. I don't expect anyone to know how to solve this immediately. If anyone did, I'd assume they'd seen it before, and ask something else.

What I want is to see how someone thinks about algorithmic problems. As they talk through what they're thinking, there are certain to be problems or limitations with their first suggestion. I'll probe on these, and see whether they understand those issues as they're pointed out. I'll give hints and see how they assimilate new ideas and information. In the end, 75% of candidates will get to a reasonable solution, and 25% will get to an optimal one, given some hints along the way. I'm much less interested in whether they get to the optimal one, or even in how fast they get there, than in what happens during the discussion along the path to a solution. You learn a lot about how someone thinks from that. It's also useful to learn how someone communicates - this is an essential part of a PhD too: there's a lot of back-and-forth that goes on regarding possible research ideas, so seeing how someone communicates their ideas is useful.

In summary, in an interview, such questions can be good as the basis of a dialog, but are useless unless the interviewer understands this is what they're actually trying to achieve.


You're not getting any insight into the candidate's problem solving ability in general, though. You may be getting insight into how they think about a specific class of algorithmic problems, and possibly (although it's unlikely) algorithmic problems in general.

What this interview question does is confirm your own biases regarding what "algorithmic thinking" is and little more.


Of course you're correct, and this is why such a question only forms a part of such an interview. I'm much more interested in what someone has built. We'll have a good discussion about any software they've built, and what they found interesting or difficult about that. Or basically anything where they showed creativity.

Still, in my area of CS, algorithmic thinking is an important aspect of the sort of systems we design, so this sort of question does help me build up a picture of the person's skills and approach to problems.

I have found though that there's pretty good correlation between how someone goes about answering this question (the mental process they follow, not whether they can jump to a solution), and how interesting the systems they've previously built are. And, although my sample size is small, the people who did best on this question also did best over the next few years on the systems they built and analyzed during their PhD.


How does one, then, get an insight into the candidate's problem-solving ability in general?


By working with them on a number of non-trivial problems over the course of some time, usually measurable in weeks or months at least.


OK, I wholeheartedly agree with that. I was thinking, though, that discussing a problem such as the one with the linked list at least would give one sample point of the candidate's ability to reason. In any case, what are you supposed to do if you are restricted to making your decision based on interviews?


It is a data point, but of questionable value generally. Consider this: even academic settings give students multiple chances to demonstrate their true level of competence over a course of months in a very constrained subject and a highly controlled setting. I don't think it's rational to think a panel of interviewers can do it in a matter of hours.

I think interviews are basically just a way we fool ourselves into thinking that what is essentially random chance, biased by a self-selected set of applicants, is (relatively) more objective.

I don't know that there is a good alternative. I am quite sure the status quo is thoroughly broken.


I've used a variant of this when interviewing CS PhD candidates.

If in fact it's for an actual PhD qualifying exam (or something similar), then this question might be OK.

The problem with the current crisis in interviewing is that it's become almost standard practice to ask questions like this (or its siblings: knapsack, outré graph search or sorting questions, etc) for what are basically run-of-the-mill API monkey / grunt finance programming / etc jobs.

As if the message these companies intend to convey is: "Fuck, we have no idea how to assess these candidates, nor do we have the time. So if we just ask a few gee-whiz questions that 75% of them will fail, then that might be an indication that the other 25% just might be a little smarter. Or at least very good at cramming. Because after all, that's how we got through college, too."


It's essentially impossible to require candidates not to prepare for a technical interview, but someone who can't discover the tortoise and hare algorithm on their own probably shouldn't have a computer science degree.


> someone who can't discover the tortoise and hare algorithm on their own probably shouldn't have a computer science degree

This sentence is simply ridiculous. Most engineers aren't able to discover this trick. I wouldn't.

The same goes for the rod cutting problem or the maximum sub-array problem (Kadane algorithm): expecting an engineer to find them, especially in an interview, is totally inane.


Of course, nobody expects engineers and scientists to have the same skills.


Then why is tortoise and the hare credited to Floyd if it’s so trivial?


Attribution in science works in mysterious ways, and I don't claim to know how it works.

In any case, I didn't say “trivial”. If you can only derive trivial things, then you shouldn't be a computer scientist.


Are you suggesting computer scientists should be able to derive non trivial results on job interviews? Really?


This amount of nontrivial, yes. Similarly, I'd expect a computer scientist to come up with a proof of correctness for either binary search or merge sort (the out of place variant) in ~30 minutes. It's not particularly difficult.


The number of incorrect binary search algorithms published in the literature over the years would indicate otherwise…

I agree that reasoning about the correctness of a binary search algorithm is an important job skill for a computer scientist. Deriving the tortoise and hare algorithm on the fly, though, is not.


You didn't say the answer has to use O(1) space, so I'd just throw the nodes in a WeakSet, and when I reach a node that's already in the WeakSet, that's the start of the loop. Or better yet, use a WeakMap to associate nodes with their position, so that once I find the start of the loop I can immediately subtract to get the loop length, no second pass needed.


This algorithm can also be used for integer factorisation: https://en.wikipedia.org/wiki/Pollard%27s_rho_algorithm


I always wonder if odd (and perhaps pointless) interview questions like this have been contorted from an original (much more useful) problem like Pollard's rho, or whether they just happened to reduce to more useful ones.

Digging deeper into such questions yields fun observations, which may be the only useful thing that comes from questions like these. Like how the length of the Longest Increasing Subsequence of a sequence A has a fun connection to the Chromatic Number of a graph formed from a matching between A and sorted(A).

Theoretical Math is beautiful and the answer to so many things.


I recently received this as an interview question.

This is one of those tricks once you know it, you'll remember it forever.


Which makes it a fun thing to be aware of but a really terrible interview question. It's been kicking around forever as such, unfortunately.


I was asked this exact question during an interview for a senior Scala dev position and nearly walked out. I asked the guy to show me how one would go about implementing a cycle with a prepend-only API, and of course did not get an answer, or the job.


A proof that the algorithm detects the existence of a cycle.

If there is a cycle of length n then eventually both pointers get into the cycle. Since at each step the distance between the two pointers increases by one modulo n, the distance eventually becomes zero. The time to detect the loop is bounded by the time it takes both pointers to get into the loop, plus the cycle length. This proof also shows that the key ingredient is that the difference between the pointers must be coprime with n, but still n steps are required to guarantee that the difference eventually becomes zero. So there is no advantage in choosing other values for the increments of the pointers, except to get the slower pointer into the cycle faster.


Thank you, super.


If you program in Lisp, the function list-length gives nil for circular lists. The Ninety-Nine Lisp Problems contain similar problems with lists.

I should have said that the difference between increments must be coprime with n to guarantee that the difference becomes 0 modulo n, that is, both pointers become equal.


> It is easy to see that they will eventually meet on the cycle.

Not easy for me, but thanks for the algorithm, maybe I will find time and energy to understand and believe. Why by two at a time and not by 3?


Consider a cycle of length n. Consider two integers, i and j that have arbitrary initial values in [0, n).

We want to prove that the following loop always terminates:

    while i % n != j % n:
        i += 1
        j += 2
If we represent this as an equation over time, where t is the number of iterations, we want to prove that for some t the following is true:

    (i + t) % n = (j + 2t) % n
Solving for t:

    (i % n) + (t % n) = (j % n) + (2t % n)
    (i % n) - (j % n) = t % n
    (i - j) % n = t % n
We can generalize for arbitrary speeds of the pointers. Let's say i moves at speed x and j moves at speed y:

    (i + x*t) % n = (j + y*t) % n
    (i % n) + (x*t) % n = (j % n) + (y*t % n)
    (i - j) % n = t*(y - x) % n
This will always have a solution if (y-x) is coprime with n.

A trivial case where it doesn't work, for example, is x=2, y=4, n=any even number. Imagine if i starts on an odd number and j starts on an even number. It's clear that i will always be odd and j will always be even. Note that y-x (2 in this case) shares a prime factor (not coprime) with all even numbers.
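That failing case is easy to check with a quick simulation (a hypothetical helper, not from the comment):

    def meets(i, j, x, y, n, max_steps=1000):
        """Simulate two pointers on a cycle of length n; True if they ever meet."""
        for _ in range(max_steps):
            if i % n == j % n:
                return True
            i, j = i + x, j + y
        return False

    # Speeds 1 and 2: the difference 1 is coprime with any n, so they meet.
    assert meets(i=3, j=0, x=1, y=2, n=6)
    # Speeds 2 and 4 on an even cycle, starting on opposite parities: never.
    assert not meets(i=1, j=0, x=2, y=4, n=6)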


Thank you very much. I like the equation over time.

Your explanation is difficult to follow when you do distributivity of % over +. I mean when you move from

    (i + t) % n = (j + 2t) % n
to

    (i % n) + (t % n) = (j % n) + (2t % n)
Here I read the '=' as normal number equality. A simplified example to demonstrate that distributivity is not valid here:

    (5 + x) % 3 = 4
    5 % 3 + x % 3 = 4
In the second line x=5 is a solution. But in the first line x is not a solution.

It's better to express your explanation using "= (mod n)" from the beginning.

    (i + t) % n = (j + 2t) % n 
    i + t = j + 2t (mod n)


Note that if you add the requirement that i and j start on the same node, then no matter what step size you use for x and y, as long as they are integers > 0 and x != y, then i and j will meet again if and only if there is a cycle.

For particular values of x, y, and loop period there may be values of i and j such that going forward from those they never meet, but it is not possible to get into one of those positions if a past state had i == j.

Proof:

In the following I'll use your notation for the most part, except allowing for a "lead in" of length L from that common starting node to the start of the loop, so the nodes are numbered 0, 1, 2, ..., L-1, L, L+1, ..., L+n-1, with the nodes in the loop being {L, L+1, L+2, ..., L+n-1}.

At time t, i has taken xt steps and j has taken yt steps. Let t >= L/x and t >= L/y, so that i and j are both past the lead in into the loop.

Then i is xt-L steps into the loop, and j is yt-L into the loop. These will be at the same spot in the loop if xt-L = yt-L mod n, which happens whenever (x-y)t = 0 mod n. All multiples of n are solutions of this, so just pick any that are >= L/x and L/y, and you have a solution.

If i and j do not start at the same place, all bets are off. That 0 is replaced with the differences between the starting node numbers, and as your example shows may not have a solution for some combinations of parameters.


Hi! Just a question, since I've never been good with proofs. Why use the modulo operator in the while-loop? Couldn't we just test if i != j?


The modulo represents the cycle.

Maybe it would be helpful to rewrite the loop with a condition as you described, and %= i and j with n after incrementing (by 1 or by 2) both variables in the loop body.


Intuitive explanation: Once the pointers enter the cycle the fast one will eventually approach the slow one from behind like runners on a stadium track. If it's one node behind the slow one, then they'll meet on the next move (slow advances one, fast advances two). And if the fast one is two nodes behind, on the next move it'll be one node behind.


3 will also work. People pick 2 intuitively, and it minimizes the overall runtime of the algorithm.

(google "Proof of Floyd's Cycle Chasing" for details)


Another advantage of 2 is that it works even if the two pointers do not start on the same element.


It's not specifically 2, any two speeds that are coprime will have the same property.


Counterexample: loop of 8 nodes, numbered 0, 1, ..., 7.

Start with one pointer at node 0, moving with speed 5. Start the second pointer at node 1, with speed 3 which is coprime to 5, the speed of the first pointer.

The pointer positions at each step are:

  0 5 2 7 4 1 6 3 repeat
  1 4 7 2 5 0 3 6 repeat
The condition we need in order to guarantee it works regardless of where the two pointers start is that the difference between the speeds is coprime to the cycle length.
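The two position sequences above can be reproduced mechanically (an illustrative helper, not from the comment):

    def positions(start, speed, n, steps):
        """Positions visited on a cycle of length n by a pointer of the given speed."""
        return [(start + speed * t) % n for t in range(steps)]

    a = positions(0, 5, 8, 8)   # 0 5 2 7 4 1 6 3, then repeats
    b = positions(1, 3, 8, 8)   # 1 4 7 2 5 0 3 6, then repeats
    # Speeds 5 and 3 are coprime with each other, yet the pointers never meet,
    # because their difference (2) shares a factor with the cycle length (8):
    assert all(pa != pb for pa, pb in zip(a, b))
    # A speed difference coprime with 8 (e.g. speeds 5 and 4) does meet:
    assert any(pa == pb for pa, pb in zip(a, positions(1, 4, 8, 8)))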


Off the top of my head: because if you don’t know where the last element points back to, and there’s an even chance it could be any previous element, the median element is the one in the middle. Moving two at a time means that when the leading pointer reaches the last element and loops back, the lagging pointer will point to the middle element. So, across a population of lists, it will perform better than ratios that leave the lagging pointer closer to the back end of the list when the leading pointer loops.

Also, with a 3-skip you could jump 2 ahead of the lagging pointer when you loop back behind it and jump over it, which means that when it advances it won’t catch you, so they won’t meet.


"It is easy to see that they will eventually meet on the cycle. Once that happens, it is easy to count the cycle length."

Is that so?


The second part of this confused me as well, because the algorithm only detects the existence of the cycle, not its length. However, once you know a node in the cycle (which this algorithm will provide), you simply go around the cycle once more and you have the length.


More specifically you stop moving the hare and send the tortoise around again until it meets the hare counting your steps along the way.


"Just say no to linked lists!" https://youtu.be/fHNmRkzxHWs?t=2099


Look in the Linux kernel to see how often linked lists are used. The list of a process’s memory mappings, for example, is maintained as a linked list in parallel with a balanced tree. Lots of queues are maintained as linked lists. Free slabs in the slab allocator are kept in a linked list. The list goes on.


Yeah, until you need to remove an element from the middle in constant time - this is sometimes useful for things like the implementation of an LRU cache.


If you iterate over the list within an order of magnitude as often as you remove an element from the middle, an array will still be faster despite not having constant time removal, no matter the size of the list.

To restate, if you have a workload where you iterate over your collection ~500 times, and remove an element from the middle ~1000 times, an array will usually outperform a linked list on a modern computer, no matter the size of the list.

It is not until you do removes/adds far more often, and far from the end of the collection, that the linked list will perform better overall.


That's why I mentioned the LRU implementation. It does not iterate over the list, only adds elements to the front, deletes from the back and moves from the middle to the front. Refs to nodes are stored in a map, so iteration is not necessary to find an element.
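A minimal sketch of that scheme, assuming the usual dict-plus-doubly-linked-list layout (all names here are illustrative):

    class LRUCache:
        """Minimal LRU cache: a dict for O(1) lookup plus a doubly linked
        list (most recent at the front) for O(1) removal from the middle."""

        class _Node:
            __slots__ = ("key", "value", "prev", "next")
            def __init__(self, key, value):
                self.key, self.value = key, value
                self.prev = self.next = None

        def __init__(self, capacity):
            self.capacity = capacity
            self.map = {}                 # key -> node
            self.head = self.tail = None  # head = most recently used

        def _unlink(self, node):          # remove from the middle in O(1)
            if node.prev: node.prev.next = node.next
            else: self.head = node.next
            if node.next: node.next.prev = node.prev
            else: self.tail = node.prev

        def _push_front(self, node):
            node.prev, node.next = None, self.head
            if self.head: self.head.prev = node
            self.head = node
            if self.tail is None: self.tail = node

        def get(self, key):
            node = self.map.get(key)
            if node is None:
                return None
            self._unlink(node)            # move-to-front on access
            self._push_front(node)
            return node.value

        def put(self, key, value):
            if key in self.map:
                self._unlink(self.map.pop(key))
            node = self._Node(key, value)
            self.map[key] = node
            self._push_front(node)
            if len(self.map) > self.capacity:
                evicted = self.tail       # delete from the back
                self._unlink(evicted)
                del self.map[evicted.key]

The list is never iterated: the dict jumps straight to a node, and the doubly linked list unlinks it from the middle in constant time.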


But how do you find the element to be moved from the middle if not by iterating over the list?


You use a hash map, and make the linked list a linked list of hash entries. So, if you're using chaining (as opposed to the many variations on open addressing):

    class Entry<K,V> {
        Entry<K,V> younger;  // LRU doubly linked list
        Entry<K,V> older;    // LRU doubly linked list
        Entry<K,V> next;     // for hash map bucket chaining
        V value;
        K key;
    }
The naive solution using two separate data structures would look like:

    class MapEntry<K,V> {
        MapEntry<K,V> next;  // for hash map bucket chaining
        DoublyLinkedListEntry<MapEntry<K,V>> lruEntry;
        K key;
        V value;
    }

    class DoublyLinkedListEntry<T> {
        DoublyLinkedListEntry<T> prev;
        DoublyLinkedListEntry<T> next;
        T value;
    }


The commenter mentions that in the last sentence. There is some separate data structure that also has references to nodes in the middle of the linked list.


Or, you've removed one indirection and made the linked list actually a part of another data structure, as in how LRU caches are generally implemented.


The entire LRU used multiple data structures, but the ability to efficiently remove from the middle of the linked list is still important.


Oy! I can't believe I missed that.


He says it's because of cache misses, due to each element being allocated individually.

But then working with references is bad in general? If your vector stores references to objects it's bad? Your object fields point to strings - bad too.

Should we ditch Lisp and Java?

Note, data need not be contiguous to stay in cache. The cache is more like paged virtual memory; it consists of cache lines.

Before subscribing to the "never use linked lists" view I would like to see a nontrivial program rewritten from lists and references to the "allocate everything at once" approach and measure the performance.

If you want to avoid references and allocate everything in place, it may force your program to do a lot of copying.

Also, not only CPU performance matters - programmer performance is important too. So-called "high-level languages" try to optimise this.


>But then working with references is bad in general? If your vector stores references to objects it's bad? Your object fields point to strings - bad too.

It's not strictly bad, but it's useful to minimize the number of pointer dereferences you have wherever you can. A non-intrusive linked list will have 2 pointer dereferences to access any bit of data. You'll also have n+1 pointer dereferences to access element n. If you have fixed-size small objects, then a vector of values is almost always better than a vector of pointers to the small objects. An intrusive linked list will save you a pointer dereference, but you still have the n dereferences to access element n.

>Should we ditch Lisp

Lisp's lists model a linked list with chains of cons cells, but there's no hard requirement for them to actually be implemented as linked lists. A typical approach is to implement lists in terms of Bagwell's VLists, which are a kind of middle ground between linked lists and vectors. You have reduced number of pointer dereferences, plus increased cache locality, whilst still being able to log n index, insert, delete, and not require large up-front allocations.

>and Java

If you subscribe to the "everything is an object" model religiously, then yes, you're probably doing harm. As always, there's no hard rules here and it always depends on your problem and data access requirements. You can usually get performance gains by using memory pools, arrays of structures/primitives, and entity systems instead of inheritance hierarchies.


You're misunderstanding the advice about contiguous data. It's not that it's more likely to stay in cache, but that if you're accessing data sequentially, it's more likely the data you're going to access next is already in cache.

Most (all I've read/looked at) benchmarks in Java have data structures backed by linked lists utterly smashed by things implemented by an array.

There was in the last year or two a good C++ talk where a game or game engine changed their storage to be more array-like and got roughly a 30% boost in performance. Memory locality is usually king, which is why linked lists are rarely used.



If you already know your data (which must be the case with arrays anyway), you can simply pre-allocate a pool from where the elements of the linked list will be taken. This guarantees that elements will be allocated from the same region in memory. It is a simple technique that I use whenever I believe that a linked list will be needed in a high-performance context.
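One way to sketch that pool idea is an index-based list over preallocated parallel arrays (illustrative names; Python only shows the bookkeeping, while in C the same layout gives actual memory locality):

    class PooledList:
        """Singly linked list whose nodes live in preallocated parallel arrays,
        so consecutive allocations land next to each other in memory."""
        def __init__(self, capacity):
            self.value = [None] * capacity
            self.next = [-1] * capacity   # -1 marks end of list
            self.free = list(range(capacity - 1, -1, -1))  # stack of free slots
            self.head = -1

        def push_front(self, v):
            slot = self.free.pop()        # take a slot from the pool
            self.value[slot] = v
            self.next[slot] = self.head
            self.head = slot

        def to_list(self):
            out, i = [], self.head
            while i != -1:
                out.append(self.value[i])
                i = self.next[i]
            return out

Slots freed by deletions would simply be pushed back onto the free stack, so the pool never grows beyond its initial allocation.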


Vectors are also usually more memory-efficient, period. A singly-linked list uses a pointer for each element and a doubly-linked list uses two, while a vector has constant overhead [a pointer and two integers (size and capacity)]. (Unless the vector was dynamically resized and isn't near capacity.)


> if you're accessing data sequentially it's more likely the data you're going to access next is already in cache

Why?


Cache consists of cache lines (usually 64 bytes). If an element in the list is smaller than the cache line a whole line is put in the cache anyway. This means that data is cached that you have not read yet. When using a contiguous array this unread data contains the next elements. So they are cached without ever having been read.


Cache prefetching. (https://en.wikipedia.org/wiki/Cache_prefetching)

Essentially, many workloads access data sequentially and therefore modern cache architectures have special optimizations to make these memory accesses as fast as possible, by prefetching the next item in the sequence before it is actually needed.


Ah, yes.

Well, anyways I value code simplicity more than anything.

Maybe CPUs will learn to do cache prefetch for link-oriented languages: if a "load and dereference" pattern is detected, the CPU could prefetch the data referenced by pointers it has in registers or in a recently fetched cache line.

BTW, linked list elements are not necessarily located far away from each other; if we allocate them one after another, chances are they are near each other in memory.


The time complexity of insert etc. is superior, and a good reason for an abstraction on top of an array, I would think.


Dependent on various internal arbitrary or intentional decisions, copying garbage collectors can in their normal workings place linked list cells in order in memory. While this does still take more memory than arrays, it can take advantage of the caching & sequential auto prefetch as well.


Common Lisp has value types, it is not all about lists.

Likewise Java has primitive types, arrays and eventually will get value types, because they already feel the pressure in FinTech of not having them.


> data [needn't] be continuous to stay in cache. Cache is more like paged virtual memory, it consists of cache lines.

It sounds like you're saying everything will be OK so long as each chunk of 64 bytes (cache line) is used together. But one page is typically 4 KB, and if you use for example all 64 bytes of one cache line, but only one cache line per page, you will suffer from TLB misses.


If you throw your computer out of the window it may break. Why would I access only 64 bytes per page? I suggest you delete this comment.

We are not discussing VM here; I brought it up as an analogy just to say cache lines are independent and need not be contiguous.


Yes, organizing your data in an array of structures is also bad for performance.


Doesn’t this entirely depend on access patterns?

If you read one field of all elements all at once, sure, a structure of arrays is probably right. But what if you instead read all fields of one element at once? Isn’t it advantageous to have those fields adjacent in memory with an array of structures?


The problem with trying to substitute lists with vectors is that their iterators behave differently. I.e. vector iterators point to positions rather than elements and are prone to being invalidated. So sometimes it's nice to have a vector that supports iterators that behave like list iterators[1].

[1] shameless plug: https://www.codeproject.com/Articles/1087021/Stable-Iterator...


Can you briefly describe how the msevector + ipointer works? I tried to look at the code but dense C++ is not my forte.


You mean how it's implemented? Umm, well it's been a while, but basically an ipointer is a proxy for an iterator that is stored internally by the msevector. These "internally stored" iterators are updated when necessary. For example, when insert() or erase() is called.

One nice thing about it is that it roughly conforms to the principle of "only pay for what you use". That is, the run-time cost is roughly proportional to the number of ipointers you have and the frequency of operations that modify the size of the vector.

One caveat is that this mechanism is not thread safe. But whenever you need to share the vector among threads, you can swap it with a vector that is safe to share[1].

And for those that are into memory safety, there is also a memory-safe vector[2] that supports ipointers.

Is this the sort of explanation you're looking for?

[1] https://github.com/duneroadrunner/SaferCPlusPlus#nii_vector

[2] https://github.com/duneroadrunner/SaferCPlusPlus#ivector


Thanks, that's clear enough :-). In hindsight I can't imagine what alternative I was thinking of .. I had some idea you might have put the additional cost in the iterator by maintaining only an epoch counter in the vector, but that's obviously not enough to do the right thing in the presence of insert and erase.

Your library looks like a good toolset. While I still find the code pretty impenetrable, the number of tests I can see give me confidence. Bookmarked for reference when I'm using C++ again.


Eh, like anything it depends on your use case.

For instance, my current use case is an STM32F, which accesses CCM in a single cycle. And intrusive linked lists are a godsend for managing pools in embedded systems without traditional memory management.


C++ folks tend to dislike linked lists because it's awkward to make C++ lists intrusive. The STL list containers store the pointers in a separate allocation from the object itself, so they're slower than they ought to be.


Where did you get that? The STL implementations I know definitely don't. See https://github.com/llvm-mirror/libcxx/blob/master/include/li... for an example - __list_node_base contains the pointers, __list_node contains the object and derives from __list_node_base, so allocating the node with object and pointers is one allocation.


Yeah, that's true for values you're willing to allocate inline. The key advantage of intrusive linked lists is being able to add and remove normally allocated objects without having to construct separate list nodes. The STL does have splice(), but it's awkward -- you always have to keep the object in a list and refer to it using iterators instead of normal references.


A data structure with the linked list inline is the other way around, e.g.:

    class Foo {
      Foo *m_next;
      Foo *m_next_mru;
      Foo *m_next_sibling;
    };
With the STL implementation, only one list can have the item inlined; all the other linked lists need two indirections to get to the next element.


Still wondering why they are being taught in programming courses.

Although they are so many things that are taught, that should never be used in the real world.


Still, Amazon wouldn't stop asking LinkedList for in every damn interview.


> 7. Build — Local References

    struct node* BuildWithLocalRef() {
      struct node* head = NULL;
      struct node** lastPtrRef= &head; // Start out pointing to the head pointer
      int i;
      for (i=1; i<6; i++) {
        Push(lastPtrRef, i); // Add node at the last pointer in the list
        lastPtrRef= &((*lastPtrRef)->next); // Advance to point to the new last pointer
      }
      // head == {1, 2, 3, 4, 5};
      return(head);
    }
> This technique is short, but the inside of the loop is scary. This technique is rarely used, but it's a good way to see if you really understand pointers.

> This technique is never required to solve a linked list problem, but it will be one of the alternative solutions presented for some of the advanced problems. The code is shorter this way, but the performance is probably not any better.

This is what Linus Torvalds called "good taste" code and used in the Linux kernel. There was some argument about it though: https://news.ycombinator.com/item?id=5030845.

(Incidentally, I recently found that Nginx is written in the same style as I read it - manual linked list manipulation with this double-pointer technique everywhere.)


I work with traditional RDBMS & and old hierarchical linked-list DB. The linked-list has great performance with super speedy Fortran, but it is a major pain to retrieve any information. Having to write complicated pointer commands is tedious at best. Select * is such a luxury.


For other data structures, refer to http://www.techiedelight.com/list-of-problems/


I remember solving these in my spare time while bored at my first internship as a developer. Good memories


I have been coding for decades now and never had a use case for a linked list.


I use lists daily. Why not? A list is a stack, or a queue if you reverse it.


What is LISP


In case you're serious, and too lazy to google it... [1]

LISP is a family of programming languages.

[1] https://www.google.com/search?q=LISP


It was sort of a Jeopardy response. I'm sorry if it was too opaque. The OP's article just brought LISP into my mind.



