Your CPU has 4 cores, but 8 threads due to Hyper-threading. It's quite normal that not all programs will benefit from the additional threads. It would be interesting to see how the compiler performs on more physical cores.
Compiling programs is usually one of the cases that does benefit from filling all the hyperthreads, though? That's why people often cargo cult running "make -j<cores * 2>".
In case other people don't see why this would be the case: hyperthreads are good for running a second thread while the first one stalls waiting for something such as a memory read. Compilers often work with indirect data structures such as irregular graphs and symbol tables, which don't cache perfectly and may cause memory stalls.
That is a quote from the CL, not the hardware of the person you replied to. That being said, I'm pretty interested to see how this would perform on other systems, specifically production servers.
This is not only a nice enhancement to the Go compiler and thus for all Go users, but it also shows why Go as a language is so relevant. Single-thread performance of processors has not increased very much in recent years, while core counts have continued to grow - I just configured a 20-core/40-thread Xeon workstation and got prices between $4k and $5k.
Go is a great language which combines C-like speeds for low-level code with a good concurrency model, making your program comparatively easy to parallelize on a per-function basis. As the work on the Go compiler shows, it is not trivial for complex programs, but definitely possible.
Task parallelism is fine with the goroutine model, but not data parallelism, which is just as important if not more so. (Although note that task parallelism depends on having good locking strategies, and having more static checking than Go offers in this department is always nice...)
Why do we use GPUs, for example? Data parallelism. What have all the advances in video/audio codec implementation been the result of? Data parallelism again. What makes hashing fast? You guessed it...
> Go is a great language which combines C-like speeds
I am using Go in production. It does not deliver C-like speeds (I have a direct comparison after rewriting one of our services from C to Go), but there is a lot of room for additional optimization in the future. Go will never be as efficient as C because of additional indirection in certain cases and additional bookkeeping. But in many cases it's faster than Java.
I am also using Go in production, and I really am getting C-like speeds. Low-level code is basically the same efficiency - the Go compiler does not optimize as well as good C compilers, but that gap closes constantly, and there is some overhead due to e.g. bounds checking. But many of those checks can be optimized away, and of course safe C code would usually have explicit checks as part of the program anyway.
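For what it's worth, here is a minimal sketch (my own illustration, not from the CL) of the two common ways those checks go away: an explicit bounds hint before a run of indexed accesses, and a range loop that needs no checks at all. Whether the hint actually eliminates the checks depends on the compiler version:

package main

import "fmt"

// sum8 uses an early bounds hint; many compiler versions can then prove
// the later indexed accesses are in range and drop the per-access checks.
func sum8(s []int64) int64 {
    _ = s[7] // bounds hint: panics up front if len(s) < 8
    return s[0] + s[1] + s[2] + s[3] + s[4] + s[5] + s[6] + s[7]
}

// sumRange avoids explicit indexing entirely, so no checks are needed.
func sumRange(s []int64) int64 {
    var total int64
    for _, v := range s {
        total += v
    }
    return total
}

func main() {
    s := []int64{1, 2, 3, 4, 5, 6, 7, 8}
    fmt.Println(sum8(s), sumRange(s))
}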
Go is incapable of ever achieving C like speeds, and that's not because of runtime safety checks like bounds checking. It's because Go still has a garbage collector, and that always comes with a price tag.
Go also often has to copy structs and buffers that would be zero-copy in C, C++, and Rust code. Passing slices to functions which require arrays, or converting between strings and byte slices, requires a copy, for instance (see the sketch below).
There's also the matter of poor control of stack vs. heap allocation and poor escape analysis to mitigate that.
All in all, Go provides very impressive performance when compared to other high-level languages, but calling it C-level performance is pure hype.
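To make the conversion point concrete, here is a tiny sketch (my own example): converting between string and []byte in either direction allocates and copies, because strings are immutable and the resulting slice must not alias them.

package main

import "fmt"

func main() {
    s := "hello, world"

    // []byte(s) allocates a fresh backing array and copies the bytes,
    // because the resulting slice is mutable while the string is not.
    b := []byte(s)
    b[0] = 'H'

    // string(b) copies again, for the same reason in reverse.
    s2 := string(b)

    fmt.Println(s, s2) // the original string is untouched
}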
> It's because Go still has a garbage collector, and that always comes with a price tag.
This is not really true, at least not as such a general claim.
1. malloc()/free() can be either faster or slower than garbage collection. It depends on implementation details and is situational. Note in particular the issues of memory locality and temporary allocations.
2. This assumes an omniscient programmer who has perfect knowledge of when to allocate or free. In reality, memory management is a traditional cross-cutting concern, and in order to deal with it manually in a modular fashion, you will typically introduce additional overhead. Examples:
- in C++, m[a] = b, where m is a map and a and b are strings, requires that both a and b be copied. This is not necessary in a GCed language.
- naive reference counting has significant overhead, significantly more than a modern tracing GC.
- even unique_ptr has overhead (though less so than reference counting).
Garbage collection allows you to have the necessary global knowledge to (mostly) eliminate memory management as a cross-cutting concern, because the GC is allowed to peek behind the scenes and ignore abstraction boundaries.
3. The things C is fast at usually also involve allocation-free code, i.e. iterating over arrays or matrices. The performance of such code is not going to be any different in a GCed language, assuming an optimizer of comparable caliber.
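As a concrete (and deliberately simplistic) sketch of that last point: a tight loop over pre-allocated slices never allocates, so the GC has nothing to do while it runs, and performance comes down to the same arithmetic and memory traffic a C version would have.

package main

import "fmt"

// dot computes a dot product over two pre-allocated slices of equal length.
// Nothing inside the loop allocates, so the garbage collector is idle here.
func dot(a, b []float64) float64 {
    var sum float64
    for i := range a {
        sum += a[i] * b[i]
    }
    return sum
}

func main() {
    a := make([]float64, 1024)
    b := make([]float64, 1024)
    for i := range a {
        a[i], b[i] = float64(i), float64(i)
    }
    fmt.Println(dot(a, b))
}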
I agree with almost everything in your post, but I'm surprised about your comment that unique_ptr has overhead. Since unique_ptr is a single item struct with no vtable I'd expect no space overhead at all. Similarly, while it does need to run code in its destructor, I'd expect a trivial, non-virtual destructor like that to be reliably inlined and the result to be no worse than a manual call to delete. If that's not true, though, I'd definitely like to know, so please let me know if I've missed something.
It has runtime overhead whenever a move assignment occurs. You have to check if the target of the assignment is null (or otherwise call the destructor) and you have to assign null to the source. This is all overhead compared to a plain pointer assignment. An optimizing compiler will of course try to eliminate most of that overhead, but that's not always possible.
>> - in C++, m[a] = b, where m is a map, and a and b are strings requires that both a and b are being copied. This is not necessary in a GCed language.
1. You are assuming here that a and b will not be used anymore and are optimizing for a special case.
2. The general problem that I was trying to illustrate is that shared ownership requires either reference counting or avoiding it through copying if you don't have automatic memory management.
Your example didn't have any assumptions.
What are you assuming then? If you need to reuse the variable in Go, then it will also be copied into the map, so your point is even less valid in that case.
> In a GCed language, both the map and the original code can safely share a reference without having to worry about coordinating lifetime.
Go is a GCed language. Show me Go code that will do what you claim (map[var1] = var2, then using var1 (type string) and var2 (type string) later in the code).
I don't know OCaml, but I asked you for Go code in a thread about Go and you didn't deliver. You wrote that I am covering only a special case in my C++ code, but you are doing exactly the same.
What I see in your code are pointers to a static part of memory, since you are using string literals. Show me the code and assembly of a function that takes string values unknown at compile time, then mutates them (with a third string, also supplied to the function and not known at compile time), and returns those values later in the function.
> I'm honestly baffled why you think that there is a need to copy the variables or why you can't use them later or whatever you believe there.
What you wrote is a SPECIAL case in SOME of the GC'd languages that have immutable strings. If you have a language with mutable strings, then this will not work. I know it will not work in Go either, even though Go has immutable strings (so not even all GC'd languages with immutable strings optimize this); that's why I asked for Go code in a Go thread.
> I don't know OCaml but I asked you for Go code in thread about Go and you didn't deliver.
I don't really use Go and made a general comment about GC that was not limited to Go in a subthread about fairly general observations about GC, because somebody made a general statement about GC that was not limited to Go.
The language is not relevant for the observation I made. Nor does it really matter if we're storing strings or other heap-allocated objects. It's a question of lifetime management.
> What you wrote is SPECIAL case in SOME of the GC'd languages that have immutable strings.
OCaml strings are mutable, actually. But you seem to be confusing lifetime issues with aliasing issues, anyway.
The underlying problem is that without copying C++ would not know when to free the strings. It would be whenever they went out of scope in the calling function or when they were deleted from the map, whichever is later; the alternative is reference counting (also expensive, and not used by std::string). A GCed language can avoid that because the garbage collector will only free them when they are no longer reachable from either (or any other location).
You replied "No" to that what I wrote about Go which is not true. I never even once wrote about lifetimes. Don't you see it? Look closely again what I wrote in my comments. I think you are confused about what I was disagreeing because you didn't read carefully what I wrote. I agree with what you wrote about lifetimes 100% and I was before that discussion started because I write C++ and those things are basic knowledge. But it wasn't about lifetimes from the beginning...
> The language is not relevant for the observation I made. Nor does it really matter if we're storing strings or other heap-allocated objects.
No, because what you wrote is only true depending on the language and on where those strings are stored; in some cases they will be copied, so what you wrote at the beginning would not be entirely true. And I disagreed with that only; I didn't mention lifetimes even once.
If instead of what you wrote you had written (capital letters to show the diff):
- in C++, m[a] = b, IN THE SPECIAL CASE where m is a map and a and b are strings, AND A AND B OUTLIVE THE MAP ASSIGNMENT, requires that both a and b be copied UNLESS YOU USE SHARED OWNERSHIP. This is not necessary in SOME GCed languages, in SPECIAL CASES.
Then I would have no reason to disagree with you in the first place.
> I agree with what you wrote about lifetimes 100%
Then you wasted a lot of time for both of us by getting sidetracked by a detail that I hadn't even mentioned in my original comment. This is about the issues with having one object being referenced from multiple locations. I gave a concrete example to illustrate this issue, and you spent an entire subthread arguing the specifics of the example without seeming to understand the issue it was meant to illustrate. This is not specific to maps or strings. It's a general issue of manual memory management whenever you're dealing with shared ownership of the same object.
> You replied "No" to that what I wrote about Go which is not true.
I think you are reading too much into an expression of disagreement. You kept fixating on the specifics of Go; I was trying to get back to the semantics of GCed languages vs. manual memory management. That's what my "no" was about.
Edit: I also did a quick test for Go, just to end that part of the argument, too. Contrary to your statement, Go doesn't seem to copy strings when adding them to maps, either. While there's no way to take the address of a string literal in Go, you can add lots of instances of a very large string and check the memory usage (compared to an empty string). It turns out to be nearly the same.
You could see exactly what I was writing about; you just didn't read it carefully enough, or chose to interpret it as you saw fit. I was right in what I wrote from the beginning. You are generalizing too much, which is what started this whole discussion in the first place, and when I argued about the specific details in my comments you didn't address them at all; you were fixated on lifetimes the whole time, ignoring what I was writing. So if you want to blame someone for wasting time, blame yourself for not being specific enough and then not replying to what I actually wrote.
> I think you are reading too much into an expression of disagreement. You kept fixating on the specifics of Go; I was trying to get back to the semantics of GCed languages vs. manual memory management. That's what my "no" was about.
There was only one sentence in that comment to which you could write "No". Next time, if you want to get back to lifetimes in a discussion with someone, write "I want to discuss lifetimes" instead of writing "No" to something they wrote. Do you see the difference? Words matter.
> I also did a quick test for Go, just to end that part of the argument, too. Contrary to your statement, Go doesn't seem to copy strings when adding them to maps, either. While there's no way to take the address of a string literal in Go, you can add lots of instances of a very large string and check the memory usage (compared to an empty string). It turns out to be nearly the same.
No, you are wrong. Look at the runtime hashmap implementation and the generated assembly (go tool objdump binary_name > asm.S) for a function that does what I described in my comments. You will see that there is a copy.
> You are generalizing too much which brought this whole discussion in the first place and when I argued about specific details which I wrote in my comments you didn't address them at all, you were fixated on lifetimes whole time ignoring what I was writing.
Because you were derailing the discussion and I was trying to get it back on track. My whole original comment was about lifetimes and GC vs. manual memory management. You were getting sidetracked by details that weren't relevant for that point, which was specifically about GCed languages vs. manual memory management in general and didn't even mention Go [1].
> No, you are wrong. Look at runtime hashmap implementation and generated assembly (go tool objdump binary_name > asm.S) for function that do what I was writing in my comments. You will see that there is a copy.
If this were true, the following program would take >65 GB of memory. In reality, it requires some 110 MB.
package main

import (
    "fmt"
    "strings"
)

func main() {
    m := make(map[int]string)
    value := strings.Repeat(".", 65536) // one 64 KiB string, shared by every entry
    for i := 0; i < 1000000; i++ {
        m[i] = value
    }
    fmt.Println(len(m))
}
Note that even if it were as you said, nothing would prevent Go from changing to an implementation that doesn't copy strings.
You are doing the same thing again. I wrote three times about the use case to test and you are ignoring it, living in your own bubble. This was my last message; replying to you is a waste of my time.
The garbage collector does not impact the speed of tight code - when it runs it of course takes CPU power, but it runs in parallel to the program code, so this comes pretty cheap. And the GC only consumes resources when you allocate heap space. This is generally not necessary, and heap allocation also comes with a steep price when done in C. Coding efficiently in Go requires some different best practices than in C, so this might have given you the impression of inefficiencies, but once you have adapted to good Go practices, there is very little difference.
Exactly. If you write in a way that allocations are controlled, GC becomes irrelevant to performance. Well-optimized Go without allocations is indistinguishable from C (modulo compiler differences).
- in general, understand which types are values, and which have to be allocated. E.g. the slice structure itself is a value type (typically 24 bytes), but it points to an array, which might be heap allocated. Re-slicing creates a new slice structure, but does not touch the heap - extending a slice might.
- to analyze your code, use go build -gcflags="-m"; this will print out optimization information, in particular which allocations "escape to heap", meaning they are heap allocated rather than stack allocated (see the sketch after this list).
- Interfaces are really great, but putting a value into an interface{} typically involves a heap allocation (the interface value itself is two words, 16 bytes). Also, method invocation via interfaces has some overhead. Method invocation on non-interface types is as fast as a plain function call.
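Here is a minimal sketch of what that -m output is about (my own example; the exact diagnostics vary by compiler version). Building it with go build -gcflags="-m" should report, roughly, that the &point{...} in newPoint escapes to the heap, while the value passed to sumCoords does not:

package main

import "fmt"

type point struct{ x, y int }

// The argument is passed by value and does not outlive the call,
// so it can live on the stack.
func sumCoords(p point) int { return p.x + p.y }

// The pointer is returned, so the value must outlive the stack frame;
// the compiler reports something like "&point{...} escapes to heap".
func newPoint(x, y int) *point { return &point{x, y} }

func main() {
    p := point{1, 2}
    fmt.Println(sumCoords(p), newPoint(3, 4))
}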
This is true. But much of the GC can be done on parallel threads; that does reduce the total amount of work the other threads can get done, but it keeps the threads doing the real work nice and speedy.
> Single thread performance of processors has not increased very much in the recent years, while the core count continued to grow
Frequencies have not increased much, but performance certainly has. Better single-core performance at the same frequency is the main advantage Intel CPUs currently have over AMD ones.
Yes, I was so happy to learn that AMD is finally pushing 8 cores for mainstream processors; they were stuck on 4 cores for too long. Of course, building software that fully uses those cores is the challenge, but that's where Go comes in.
It's not just the building, it's the ability to run multiple virtualised environments (for example at the moment I'm running Xubuntu, two vagrant machines and Microsoft's Windows 10/Edge testing VM).
As I elaborated on elsewhere, if you focus on task parallelism to the exclusion of data parallelism, like Go tends to do, then you are not making good use of the hardware.
What does it mean for a programming language to "tend to focus on task vs data parallelism"? In my mind, Go gives you primitives from which to build either data or task parallel algos, and the programmer can choose the kind of algorithm that they find appropriate (perhaps data parallelism is always faster, but perhaps some developers find task parallelism easier and their time is not well spent squeezing out the performance delta?). Is there some set of primitives that would cater more naturally to data parallelism? If these primitives exist, are they really 'better' than Go's primitives, or do they simply make task parallelism harder without making data parallelism easier? These aren't rhetorical questions; I'm genuinely curious to hear your opinion.
You want SIMD and parallel for, parallel map/reduce, etc. Goroutines are too heavyweight for most of these tasks: you will be swamped in creation, destruction, and message passing overhead. What you need is something like TBB or a Cilk-style scheduler.
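To make that concrete, here is a rough sketch (my own illustration, not anything Go ships) of the kind of chunked parallel-for you would want instead of one goroutine per element: one goroutine per available core, each walking a contiguous chunk, so the scheduling overhead is paid per chunk rather than per iteration.

package main

import (
    "fmt"
    "runtime"
    "sync"
)

// parallelFor splits [0, n) into one contiguous chunk per worker, so the
// goroutine and synchronization overhead is paid once per chunk rather
// than once per element.
func parallelFor(n int, body func(i int)) {
    workers := runtime.GOMAXPROCS(0)
    chunk := (n + workers - 1) / workers
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        lo, hi := w*chunk, (w+1)*chunk
        if hi > n {
            hi = n
        }
        if lo >= hi {
            break
        }
        wg.Add(1)
        go func(lo, hi int) {
            defer wg.Done()
            for i := lo; i < hi; i++ {
                body(i)
            }
        }(lo, hi)
    }
    wg.Wait()
}

func main() {
    data := make([]float64, 1<<20)
    parallelFor(len(data), func(i int) { data[i] = float64(i) * float64(i) })
    fmt.Println(data[42])
}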
"with a good concurrency model, making your program comparatively easy to parallelize on a per-function basis"
Actually Go's shared-memory multithreading concurrency model encourages program architectures that are not easily parallelizable at all. And it's harmful to praise it. We have much better share-nothing models that encourage parallelizable programs from the start.
Compilation is not a good example of this, though. Parallel compilation is something that any language that allows for separate compilation has supported for decades (make -j N at its most basic, with some language implementations providing more sophisticated support).
Taking a thread-oblivious program and making it multithreaded can be very hard. It requires global transitive knowledge of the function call graph and all shared data structures.
How was this achieved? Was it by a dev who was intimately familiar with the code base? Was it just applying the race detector repeatedly until it stopped squawking? Something else?
A reason I've heard against parallelizing compilers is that you get parallelism for free in large builds, as your build system will build separate code-gen units (files in C, crates in Rust, and I believe packages in Go) that don't depend on each other in parallel anyway. Adding it to the compiler as well just creates synchronization overhead.
Still, the benchmarks seem promising. A 40% improvement on std lib packages is nothing to sneeze at. The end of the CL explains that this is a best-case scenario, one that will mostly benefit recompilation of a single package, which is the most common compilation activity.
I would agree. Basically all decent build systems will allow you to compile units that aren't dependent on each other in parallel. Threading the compiler may provide some gains, but it's likely that for a good portion of your build time your CPU is already fully utilized anyway (And without any locking overhead).
I feel like the real gains would be with a threaded linker instead, but I can't say I have any data to back that up. But I mean, you can't link until everything is compiled, so presumably there are free CPU cores that could be used.
Go's compilation unit is a package, not a file. This limits the amount of parallelism you can get outside of the compiler, compared to, say, C. You have bottleneck packages like the runtime which have to be compiled before other packages, and in the case where you do incremental compilation you probably only have a few packages to compile anyway. Some of this is briefly touched on in the CL (see the comments on the amount of concurrency that should be enabled), and there are issues on the bug tracker that go into more detail.
Since most builds are incremental and only touch a single package, this should still be a win. And it may be a win whenever there is just one package on the critical path. (I don't know whether Go packages must be built after their dependencies, but it stands to reason that they are.)
I got very interested in Go recently, but I have a question: how do you do your debugging in Go? I am using VSCode, and although VSCode's Go support is the best among editors, there seems to be no official support for debugging.
This reflects my team's experience. Initially, skilled programmers (mainly from C#) complained almost hourly about the lack of IDE features, in particular debugging. Fourteen months in, with 9 major services in production, I hear no more conversation about this, and still nobody has bothered to set up dlv. Breakpoints, to our surprise, are not actually indispensable. I've also noticed a positive cultural change: we move faster, and the team talks about business problem solving more than about programming itself, to the point where I believe a rich-functionality IDE over-complicates the things we do here.
> I've also noticed a positive cultural change: we move faster, and the team talks about business problem solving more than about programming itself, to the point where I believe a rich-functionality IDE over-complicates the things we do here.
I think that this point about team conversations is incredibly important. Apart from Ruby, Go is the only language that I know where the technical conversations seem to consistently focus on the problem at hand, rather than coding concerns like the finer points of language features. The company that I work for does both Ruby and C#, and the difference in the conversations around the two is very noticeable.
Unrelated question for any Googlers looking at this: I like how PolyGerrit has an "Old UI" option at the bottom of the page. When I enabled PolyGerrit on my internal Gerrit instance, that button did not show up, and the docs were rather sparse.
Question - is it possible to turn on PolyGerrit with the ability to switch back to the old UI in open-source Gerrit? I found the option to turn on PolyGerrit here[0]
If you're compiling a package with lots of functions in it, Go can now have multiple cores working concurrently on (much of the work of) compiling different functions.
Go would already work on more than one _package_ at once, using multiple processes, which helped big builds covering many packages (think building Juju, Kubernetes, or Docker from scratch). The benefit here is in the common situation where you make a change in one package and recompile: more of your cores can be put to use getting that package rebuilt.
This is a changelist for review, after preliminary work to move state out of globals, etc. If you click "Show more" it will show much more, including discussion of the internals and wall- and CPU-time benchmark results.
The Go compiler is already much faster than your typical C and C++ compiler.
Now the compilation step, which on those projects is driven via make -j4, can be done natively by the Go compiler within a module itself, compiling the functions of one module in parallel.
Modules themselves would already be parallelizable with make -j4, but since go build is its own tool, independent of make, they re-invented that part themselves. It's already significantly faster than make, but comes with maintenance costs; go build is much stricter than traditional dependency systems.
It's not really significant at all. Compile times of bigger modules will get a bit faster, much as with zapcc, which made C++ compile times 50% faster by caching.
There's a Download button that pops up both a bunch of git commands and a link to .patch files. Git fetching seems to work for me. If you're interested in built binaries, that I don't know.
All benchmarks are from my 8 core 2.9 GHz Intel Core i7 darwin/amd64 laptop.
First, going from tip to this CL with c=1 costs about 3% CPU and has almost no memory impact.
Comparing this CL to itself, from c=1 to c=2 improves real times 20-30%, costs 5-10% more CPU time, and adds about 2% alloc.
From c=1 to c=4, real time is down ~40%, CPU usage up 10-20%, alloc up ~5%.
Going beyond c=4 on my machine tends to increase CPU time and allocs without impacting real time.
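If you want to try to reproduce numbers like these yourself, and assuming the concurrency knob this CL adds is the compiler's -c flag (as the c=1/c=2/c=4 labels above suggest), something along these lines should work once the CL is in your toolchain:

    # hypothetical invocation; -gcflags passes flags through to the compiler
    go build -a -gcflags='-c=4' std    # rebuild the standard library with 4-way backend concurrency
    go build -a -gcflags='-c=1' std    # baseline, no backend concurrency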