Not a single solution has error handling. There is actually a very good iterator pattern in the Mongo driver mgo. The idea is that you have some function Next that takes a pointer: you pass a pointer into Next, the iterator fills that address with data, and Next returns a boolean indicating success. Then you just put that in a for loop to walk the whole collection. With error handling, it looks like this:
var v someType
for iter.Next(&v) {
    // do something with v here
}
if err := iter.Err(); err != nil {
    // the iterator had an error; handle that here.
}
For the vast majority of cases, I would expect panic/recover to work for iteration over in-memory data structures. For something like a database iterator, where the possible errors are more varied and likely, I agree an explicit iterator object is better. I probably should have better defined the scope of problems I had in mind.
Iterators should really disappear and only be used when they are truly needed. Just take a look at all the examples in the article and you will see that the iterators are unnecessary noise. Most of the time, all you want is to get all the values in a collection; why should you even need to read the word "iterator" in that code?
Modern languages should all support some form of foreach, e.g. Java:
for (Customer c : customers) { ... } // no iterator
C#, Scala and many others get this right as well. Go doesn't.
And, this:
close(ch) // Remember to close or the loop never ends!
You need them because the examples are intentionally simplified. Even a data structure as simple as a dictionary has a useful variety of iterators: over keys, over values, and in some languages, over (key,value) tuples. Explicit iterators are really only noise for the "natural" iterator, i.e. in order iteration over an array.
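For what it's worth, Go's range already exposes all three of these for its built-in maps -- a minimal sketch, with an illustrative map m:

m := map[string]int{"a": 1, "b": 2}
for k := range m { fmt.Println(k) }       // keys only
for _, v := range m { fmt.Println(v) }    // values only
for k, v := range m { fmt.Println(k, v) } // (key, value) pairs

The complaint in this thread is about custom collections, which get no such hook.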
I agree that having it hidden by a language feature is nice, but somewhere there is an iterator -- the class that supports that iteration must at least specify how it's accomplished. I would prefer if there were an easy way to integrate with the range keyword in Go.
EDIT: Adding a reply to some additional comments. The close() one in particular really threw me. It makes sense in terms of how channels work, but is really unfortunate. When I first saw the channel-based approach, I thought it was a neat solution. But then I found all the problems with it (including the best example I could find of it online missing that close(), which is why I pointed it out!).
I think Go is close to getting it right if they would do what I mention at the end of the article: provide a way to hook into the range keyword.
> Explicit iterators are really only noise for the "natural" iterator, i.e. in order iteration over an array.
Not really; they are noise in a lot of situations. Right now, I can only think of one case where they are necessary: lazy traversal. Either the collection is huge or each element is expensive to materialize, so you want to produce elements one at a time. Iterators are perfect for this.
They are unnecessary in all other cases. For example, for your (key, value) example:
for (Entry<String, String> entry: map.entrySet()) { ... }
Go exposes too many implementation details to my taste.
Fair enough. My day-to-day coding with Go right now involves large data sets where materializing something like that would be expensive. I'll admit my view of the issue is strongly biased by my very real need for iterators or something equivalent that can avoid that cost!
I believe that's what the parent is talking about. There isn't any reason for Go not to have similar syntactic sugar for something as common as looping through an iterator.
To be honest, every release of Go feels more and more like Java anyway. I won't be surprised when Go eventually ends up being a JVM-less Java in the same way a new OS will end up being unix. After all, "Those who don't understand Unix are condemned to reinvent it, poorly." – Henry Spencer
Maybe Java is the same way? Not to say unix (plan 9?) or java (scala?) are perfect, but there is a reason they are popular.
Indeed. Disdain toward Java is strange, given Go's popularity and its similarity to earlier Java versions.
Sure, the JVM has long startup time (why should we care for most applications?) and gc/gcc-go produces nice binaries. But the Java ecosystem has many attractions as well: a common runtime for different programming languages, a wealth of high-quality libraries, a mature GC, great package management (via Maven or sbt), good IDEs, hotswap, instrumentation, etc.
Most of Go's advantages (lightweight threads, channels, good compile times) are also available in Java and the JVM.
IMHO it would've been more interesting if Google had pushed Java AOT compilation forward instead. But I guess the point of 20% time is that people can do what the heck they want. Perhaps RoboVM will push the envelope instead ;).
I wonder if the disapproval of the JVM is mostly shaped by slow and ugly AWT/Swing applications and huge frameworks that require lots of XML configuration. (Swing can be pretty, see IntelliJ. I try to avoid frameworks ;).)
Have to agree that the JVM's long startup time is pretty overblown. It is a fraction of the total startup time of a typical app, and there are quite a few caching solutions already, e.g. Drip.
And IMHO one technology is responsible for much of the hatred towards Java and that is Spring.
I think Google's disapproval of the JVM is shaped by Oracle's lawsuits and is purely political, as they were extremely in favour of the JVM before those occurred.
That's probably true, but some of the Go creators (Pike, for example) have been saying very negative things about Java since long before the Oracle lawsuits. Often in fairly trollish ways.
It's very important when looking at the channel method to realise that goroutines are NOT garbage collected. If there is any chance of the loop calling break or panicking, there is a chance of leaking goroutines. Since you might not know how your code is going to be used, this is an issue.
A possible solution is to use two channels, the items channel and a fail channel. Whenever your iterator sends, it needs to use a select which either sends the item on the channel, or receives from a fail channel (in which case it exits the function.)
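A minimal sketch of that two-channel pattern (the names ints/stop and the data slice are illustrative, not from the article):

func ints(data []int) (items <-chan int, stop chan<- struct{}) {
    out := make(chan int)
    done := make(chan struct{})
    go func() {
        defer close(out) // sender closes, so the caller's range loop terminates
        for _, v := range data {
            select {
            case out <- v: // the consumer accepted the value
            case <-done: // the consumer gave up; return so the goroutine isn't leaked
                return
            }
        }
    }()
    return out, done
}

A caller that wants to bail out early does close(stop) before breaking; the closed channel makes the select's second case ready, so the goroutine returns instead of blocking forever on a send.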
I think that while the channel method looks nice because of Go's special treatment of its basic types (for better or worse), it's not practically the best answer to use in your code.
I mentioned this pitfall, but didn't show how to solve it. I haven't seen a complete example and haven't taken the time to work out the details myself. I'm less concerned about complexity in the iterator implementation, but handling this correctly also seems to require changing the caller code, since it needs to send something on the fail channel. Then the onus is on the caller to remember to do so, which is just as bad as having to remember not to call break.
Of course, whether this is a real issue or not also depends on whether you're writing internal or external APIs. There are lots of other ways to generate uncollectible garbage too...
That only applies if there is a break or panic in the iterator, not the loop using the iterator. If the loop using the iterator closes the channel, the program will crash: the sender will panic on its next send to the closed channel.
The sender is supposed to close the channel to signify to the receiver that there is no more data to be sent. The receiver is not supposed to close the channel. (You could hack around this by recovering from the panic, but this is bad style.)
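For the simple (leak-prone) channel iterator, that convention looks like this -- a minimal sketch with illustrative names:

func generate(data []int) <-chan int {
    ch := make(chan int)
    go func() {
        defer close(ch) // the sender closes; the receiver's range loop then terminates cleanly
        for _, v := range data {
            ch <- v
        }
    }()
    return ch
}

The receiver just writes for v := range generate(data) { ... } and never touches close.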
Note that there is no requirement that you close channels -- they aren't like file descriptors. They will be garbage collected after they're no longer needed.
goroutines, on the other hand, will stick around unless you clean them up. goroutines aren't iterators, and you shouldn't use them as iterators. They're more analogous to threads.
The most obvious pattern, a simple for loop and a separate iterator type, is imho missing. Something like "for iter.Next() { fmt.Println(iter.Value()) }" is commonly used in Go and doesn't look that bad either...
I'll buy that -- it's sort of mentioned in the post as an object-based wrapper around the closure version, except I listed the functions as HasNext() and Next(). More boilerplate for the iterator implementer, but I agree, it reads better for the caller.
type Iterator interface {
    // Next returns true if the iterator contains subsequent elements
    // and advances its state to the next element if that is possible.
    Next() (ok bool)
    // Value returns the current value.
    Value() interface{}
}
which you then use like this:
iter := s.Iterator()
for iter.Next() {
    fmt.Printf("%s\n", iter.Value())
}
I think this is a pretty good pattern: it's concise, looks less ugly than the callback and closure based iterators, and it's not using anything heavyweight like channels, so there should be no significant performance degradation.
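For reference, a slice-backed implementation of that interface can be this small (a sketch; the sliceIter name and its constructor are mine, not from the benchmark code):

type sliceIter struct {
    data []interface{}
    pos  int // index of the current element; starts at -1, before the first element
}

func newSliceIter(data []interface{}) *sliceIter {
    return &sliceIter{data: data, pos: -1}
}

// Next advances the iterator and reports whether a current element exists.
func (it *sliceIter) Next() bool {
    if it.pos+1 >= len(it.data) {
        return false
    }
    it.pos++
    return true
}

// Value returns the element the iterator currently points at.
func (it *sliceIter) Value() interface{} {
    return it.data[it.pos]
}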
Out of curiosity, I added this pattern to OP's test code as StatefulIterator (https://github.com/ewencp/golang-iterators-benchmark/pull/1), in 2 versions: one in which the iterator is a struct, and one in which the iterator is an interface. The results look like this (the last 4 rows are my additions):
As you can see, the struct version is the fastest of the lot: 2653265 ns for ints and 3582051 ns for structs (versus the callback version with 3478962 ns and 4437885 ns, respectively). The interface version is slower, but still in the same ballpark as the callback and closure based ones, while being much more readable (but that's only my personal opinion). The not-so-stellar performance of interfaces is a bit disappointing, but that is a known issue.
Seriously? Using a green thread is a pattern for iteration over an array? Or creating a new function object per iteration is efficient? The "patterns" 2 & 3 listed in the article are too grossly inefficient to be considered suitable for implementing fixed or dynamic iterators.
To build on thrownaway2424's response, the point in the article was to demonstrate the pattern for building the iterator. Very simply structured data helps do that. In reality you'd probably be iterating over a more complicated data structure, usually where the iteration requires knowledge of the data structure that should be considered private to the implementation.
Sometimes raw performance is not really called for. A lot of programmers are looking for ways to, for example, iterate over lots of different kinds of things that all provide the interface to do so, regardless of their actual types or the details of their allocation.
Sure, if you're about to iterate over a terabyte of double-precision height map data or something like that, please use an array.
Without such advice, that article is dangerous. A "lot of programmers" are going to read that and assume these patterns differ between them only in terms of "performance" which equates to "ns/op".
Also, I wouldn't use an array to iterate over a terabyte of double-precision height map data -- have you tried that? I have, kind of. In a genome sequence visualization application. Try it sometime. It's fun.
I'm curious because that's what makes Python the best language for me in a lot of cases.
Iterators that don't load 100,000,000 rows at a time for a result set are... useful. And that's what you get with the simple (iterator/generator func)/db cursor pattern in Python.
Something similar in Go would go a long way in helping me get over the fixed brace position Go enforces. :)
"I wrote two versions of each. One directly iterates over a []int, the other introduces a struct and uses a []*struct which it needs to extract the data from. The goal was to hopefully avoid benefits from cache coherence"
Can someone explain to me that last sentence? What's cache coherence?
Cache coherence is when data that you need close in time is physically close in memory, and therefore loading the first piece of data loads the second into cache so that it will be quick to load. The performance difference between fetching from CPU cache and main memory is about a factor of 10, so this is a significant effect.
When you're writing performance critical code this is a good thing to have happen, and it can be worth a lot of effort to make sure that you benefit as often as possible. However general purpose code usually does not wind up with data that is optimally organized in memory. But in a benchmark the usage patterns are simple enough that it is trivial to get well-organized data in memory without even intending to. Whether this is likely to be predictive of real world performance depends on how carefully coded the real world program is.
Cache coherence can also be a problem for performance, leading to what is called false sharing.
When reading some value in RAM, values in the same "cache line" (that are right next to it) get cached. However, if another thread has to modify those values that just got cached, then the cache needs to be invalidated. This often happens in producer/consumer implementations.
If this is discovered to be a problem, it's often solved with what is called cache padding. I.e. if you've got variables A and B that are right next to each other and they are modified from different threads, then you add fake variables (padding) between them that's the size of a cache-line (which is between 32 bytes and 256 bytes, depending on architecture).
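In Go, that padding looks something like this -- a sketch assuming 64-byte cache lines (the struct and field names are illustrative):

type counters struct {
    a uint64   // written by thread 1
    _ [56]byte // padding: pushes b onto a different 64-byte cache line
    b uint64   // written by thread 2
}

Without the padding, writes to a would keep invalidating the cache line that b lives on, even though the two threads never touch each other's data.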
The author is actually a little bit confused here.
Cache coherence means that CPU cores with separate caches, but accessing the same RAM, have a coherent view. It is usually implemented at the hardware level with cache coherence protocols (MESI etc.).
What the author wanted to say: the goal was to hopefully avoid cache misses. With []int the values are sequential in memory, so the probability is high that the next value is already loaded into the cache. With []*struct there is a sequential array of pointers which have to be dereferenced. Each dereference means a random access to a memory location somewhere in RAM, and probably more cache misses.
The author is correct to measure this, because cache miss patterns are hard to predict and this optimization might not be worth it.
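Concretely, the two layouts in question look like this (an illustrative fragment, not the benchmark's actual types):

vals := []int{1, 2, 3} // the ints themselves are contiguous; iteration walks sequential addresses
type item struct{ n int }
ptrs := []*item{{1}, {2}, {3}} // only the pointers are contiguous; each item is a separate allocation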
Ah, sorry, I was using it in the locality sense (as in, coherent access). Admittedly less common, but not uncommon, in algorithms; downright confusing if you're in an architecture frame of mind.
Testing performance by iterating over an array can be dangerous because all the data is conveniently grouped together such that the processor will have the next data it needs in cache once it's loaded the first item (most of the time; when it crosses cache lines it'll cause another miss, then a bunch more hits). This can lead to huge performance boosts, but rarely reflects how data is laid out in memory unless you're very lucky.
The change made was to still iterate over a slice (i.e. an array), but the array now contains pointers to structs, which are allocated separately. This should scatter them across memory such that each step of the iteration needs to chase a different pointer to a different chunk of memory. This is a benchmarking issue which has bitten me in the past, so I made at least a feeble attempt to account for it.
However, I actually don't know enough about Go's memory allocation to really make sure I could force the conditions I wanted, so it's still possible all the data was lined up nicely, but just spread a bit farther apart. (In fact, this may even be likely since the allocation approach was very simple. Trying to do a lot of allocations and deallocations and filling in the data over a longer period would probably do a better job of addressing this concern.)
What's wrong with returning a populated slice? It's basically the same operation as returning a pre-sized, pre-filled buffered channel. Then you can "range" on the returned value.
The only downside is that you can't defer computation of values until they are asked for.
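A minimal sketch of that approach (the Values helper and the map m are illustrative):

func Values(m map[string]int) []int {
    out := make([]int, 0, len(m)) // pre-sized, like the pre-sized buffered channel
    for _, v := range m {
        out = append(out, v)
    }
    return out
}

The caller just writes for _, v := range Values(m) { ... }; there is nothing to close and no goroutine to leak.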
Don't underestimate the time cost of coroutining (literally goroutining, and using a channel, or to-and-fro with HasNext and Next). And the unfinished generator is using memory, too. Code runs fastest when it can sit in cache, so creating a slice in a tight loop and returning it is probably optimal if the actual data you just materialized is small.
Sure, it's better than the goroutine approach, but that doesn't say much. It becomes quite suboptimal if you have a large amount of data. I also think that you're likely underestimating the cost of the additional memory allocations.
Iterator callbacks, generators, or functors feel very un-Go-like. It's trying to force a Python or OCaml idiom on to Go. Basically, what you want is to implement a HasNext and a Next function and write a "for" loop. All you need is this:
for iter.HasNext() {
    val := iter.Next()
    // do stuff with val
}
There's no reason to look down on the humble for loop. Complexity is a bad thing, usually.
Channels and goroutines are also a very Go-like way to process data. But APIs that return channels assume that the caller wants concurrency, so be sure that that's true.
Right, but you have to consider the complexity of the iterator implementation as well. Every time I implement iterators for non-trivial data structures in languages without generators I want to hit myself in the face for using that language.
> There's no reason to look down on the humble for loop. Complexity is a bad thing, usually.
No reason? By that you mean that there are no downsides? A hasNext()/next() pair often means that calling next() when there is no next element (hasNext() == false) will result in a runtime exception (NoSuchElementException in Java). That's fine if you always do a hasNext() check before calling next(), but people do make mistakes like that. You might say that a loop with a hasNext() conditional is simple enough, but on the other hand I'm not too fond of even using indexing loops in Java (for (int i = 0; i < array.length; i++) {}) if you only need to traverse the array one item at a time; you don't need the expressive power of indexing the array rather than simply iterating over it, and this increased power makes you more likely to introduce bugs: wrong initial value of the index variable, wrong loop conditional, wrong increment operation, or mixing up the index variables in nested loops (the two similar-looking characters 'i' and 'j' are often used in that case). Even ignoring this argument, you might have to write a more complicated loop where the hasNext() or next() is a bit scattered due to some conditionals. (EDIT: since you are talking specifically about for loops this point may not be very relevant.)
I would say that a construct that doesn't encapsulate enough so that you might get a runtime error is in a way worse than a construct that encapsulates enough to guarantee that that mistake can not happen at runtime. It might not be overall worse, but it does have a downside, IMO.
You gosters are planning on writing a lot of for loops, huh?
Y'all can't be just plain ignorant about what you can do in a language that has the features to support a proper collection library... So what's up with that? You like them for loops? Or does everything that's so awesome about go make it worth the apparently complete crap state of collections?
I suspect all this has to do with a choice of greater type safety over expressiveness, but I'm not sure. I know there's this big generics discussion in the go community. The thing about generics is you sprinkle butt-ugly and cross-eyed syntax on top of expressive and you get...expressive ugly cross eyed butt syntax collections.
Help me out here -- I read these rave Go reviews and then I see this kind of thing and I think "well, it'd be better than Java -- maybe way better -- but it feels like Java redux, reheated."
I'm just learning Go, I'm certainly no evangelist. One huge difference for me, however, doesn't have to do with the language. It's deployment. It's really nice for deployment to be scp rather than a bunch of package installs, making sure your classpaths are right, etc.
That said, one of the language features that drew me in is structural typing. I'd yet to see a well supported statically typed language that used structural typing, and it seemed like it would make software maintenance a lot easier. Of course, tools are often a better solution than language features, so maybe that's a bad reason to use a language.