On Stack Overflow's recent battles with the .NET Garbage Collector (samsaffron.com)
183 points by sams99 on Oct 28, 2011 | 72 comments



Let me preface this rant by saying I love C# and .NET. I think it's a really good platform to build on, as demonstrated by its growth outside of the Microsoft world.

This article highlights why (good) C/C++ developers have such a hard time moving to C#/.NET. In .NET, memory management is fire-and-forget 99.9999999999% of the time, whereas a big chunk of learning and programming C/C++ is about memory management. Most C, and some C++, programmers would have baulked at holding a massive object graph in memory: releasing it becomes a massive headache, and that's before you even get into synchronisation issues. The .NET team have done a really good job of taking this headache away (yes, I realise they're not the first).

However, we're now at the point where there are a lot of C# programmers who've never heard of a linker (by default, csc compiles and links in one go), have little concept of static vs dynamic linking, and, more importantly, don't have the first clue about memory management. This manifests itself in things like not using IDisposable properly or consistently, using the string + operator rather than StringBuilder/StringWriter, and having no idea about the difference between class and struct, especially when it comes to parameter handling. Even books like (the excellent) C# in Depth are pretty light on the memory management side of things, deferring it to books about the CLI and the CLR. The truth is, if you're going to write a big application, you're going to need some concept of this.
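
To make the last point concrete, here's a minimal sketch (illustrative only, not from any of the books mentioned) of how class vs struct differ in parameter handling:

    using System;

    struct PointStruct { public int X; }  // value type: copied when passed by value
    class PointClass  { public int X; }   // reference type: the reference is copied

    class Demo
    {
        static void Mutate(PointStruct p) { p.X = 42; } // mutates a local copy
        static void Mutate(PointClass p)  { p.X = 42; } // mutates the caller's object

        static void Main()
        {
            var s = new PointStruct();
            var c = new PointClass();
            Mutate(s);
            Mutate(c);
            Console.WriteLine(s.X + " " + c.X); // prints "0 42"
        }
    }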


I agree with your criticism: developers on managed platforms (Java, .NET, and even less complex GCs like Ruby's) often forget that memory still needs to be allocated and deallocated.

Often we think the magic memory fairy is going to make everything OK. It is easy to fall into this trap even if you understand the underlying concepts, but at high scale it clearly doesn't hold.


Could you explain what you mean about string + vs StringBuilder? I'm guilty of this one.


marcgravell's post answers it pretty well: there's absolutely nothing wrong with using + for individual expressions. Strings in C# are immutable, so when you concatenate with the + operator, you're creating another instance of the string in memory. The first one still exists even though you now have the new string.

Much like marc said, it's absolutely fine for small amounts, but when you're doing a lot of string operations involving a loop, using StringBuilder will be faster and more memory efficient.

This is a good read about it; I highly recommend it: http://www.dotnetperls.com/stringbuilder-1


Actually, there's absolutely nothing wrong with using + for individual expressions; it is the correct way of doing it (a + b + c + d compiles to String.Concat(a, b, c, d)). The problem is doing this in a loop, where you get a lot of throwaway strings from intermediate concatenations. In a loop, you should concatenate via StringBuilder.
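
For example, a small self-contained sketch of the two cases (hypothetical `parts` input):

    using System.Text;

    class ConcatDemo
    {
        // Each += allocates a brand-new string: O(n^2) copying overall.
        static string Slow(string[] parts)
        {
            string s = "";
            foreach (var part in parts)
                s += part;
            return s;
        }

        // StringBuilder appends into a growable buffer, avoiding a fresh
        // string allocation per iteration.
        static string Fast(string[] parts)
        {
            var sb = new StringBuilder();
            foreach (var part in parts)
                sb.Append(part);
            return sb.ToString();
        }

        static void Main()
        {
            var parts = new[] { "a", "b", "c" };
            System.Console.WriteLine(Slow(parts) + " " + Fast(parts));
        }
    }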


In C#, strings are immutable, so string + string creates a new string. This means copying both strings into the new result string. That is a lot of copying, and it can be pretty slow if you do it on large strings, or often on small strings, especially if you concatenate multiple strings at once (evaluated pairwise, A + B + C + D means that A is copied 3 times, B 3 times, C twice, and D once). StringBuilder is mutable, so appends don't recopy data. It is really only a problem if you do it on large strings more than a few times.

On the other hand I am not sure if this is a problem in .NET. I know the semantics look bad but no one said the execution model had to match the semantics. If I remember correctly the JVM has the capability to rewrite the code to the faster version, though I am not sure if the CLR does.


The JVM does rewrite individual concatenations as StringBuilders (so a+b+c+d results in one StringBuilder with three .append() operations), but the real performance problem is concatenating in a loop.

    String result = "";
    for (String str : myStrings) {
        result += str + " "; // on each execution, a new StringBuilder is created
    }
as opposed to

    StringBuilder result = new StringBuilder();
    for (String str : myStrings) {
        result.append(str).append(" ");
    }
The JVM can't hoist the generated StringBuilder out of loops to produce the second, faster piece of code. I believe the CLR is the same.


Later JVMs do escape analysis, though - so there's not necessarily any difference between the two methods...


At least in Oracle Java 1.6 update 24, which is supposed to have escape analysis enabled by default, there is a huge difference. I don't know if Java 7's analysis is smarter.


I think Stack Overflow is the first high-volume .NET site that has gone in depth publicly about how they are solving their performance issues. Microsoft should be paying them for the continued great PR.


It's great that they share their findings and part of their code. Even though I'm not a .NET dev, I find the blog posts really interesting and their approach/tools inspirational.

But wasn't Slashdot the first high-volume site sharing their solutions to performance issues?


Notice I said .NET. You can find plenty of presentations and blogs on PHP performance, Ruby/Rails performance, and Python/Django performance, but not many on .NET. I guess that's simply because most high-volume sites don't use .NET.


I'd say it's rather because most high-volume sites using .NET are more commercialized, and often run by companies that typically haven't been very open about anything they do internally.


Also, Stack Overflow is one of the few .NET sites hackers/devs flock to and care about. I work on a large-scale .NET web app, but I don't think many Hacker News readers would really care about the application itself (and, as you pointed out, my company would probably kill me if I started blogging about the internals!).


Oops, sorry, I missed it.


Yes, I think I remember what you were talking about.

Slashdot seemed like one of the first millions-of-hits-per-day sites with heavily dynamic content. All with Perl and MySQL, right? I still don't really understand how they did it, but I remember feeling grateful that they described it openly.


I always assumed Slashdot was pre-generating static files (every few seconds or so, at least of the main top stories) that were served to the users. Isn't that why their comment form said that your comment might not appear until a few seconds after submitting it?


Yes, but there were also profile options to modify your own view, right? So the main page seems at least partially dynamic too.



I wish Newegg would share how they do it. The site is always extremely fast, even under heavy load, and the pages all seem to be data heavy.


MySpace had a big writeup on their transition to .NET years ago; I don't think they went into any follow-up detail though.


.NET for all your massive downscaling needs.


I wonder whether all those gymnastics were more prudent than simply writing a C module to filter questions and feeding the results back to the C# web app. Opinions?


Jeff doesn't know C. "Should he learn C?" was an ongoing topic in Jeff and Joel's StackExchange podcast, which was actually somewhat entertaining. Time for another episode.


Which episode was that?


It has been a while so I can't remember. They were getting people to do the transcripts for their shows.

http://meta.stackoverflow.com/questions/19244/community-podc...

If your Google-Fu is good.


I was thinking the same thing. This problem seems extremely well suited for C/C++, and C#/.NET makes it very easy to call directly into native libraries.


I agree this is a totally valid option; trouble is, we have no C experts on the core team, so it would raise the cost of maintenance. Also, the code is fairly complex; moving to structs was fairly simple compared to a C port.


Sounds like the perfect opportunity for someone to become a C expert. Something like this just throws up a lot of red flags. If Microsoft changes some of the methodology in its GC or further demotes structs, your solution may fall apart. It just sounds like something that requires tight manual memory management to get right and remain stable.


Well, in essence what they did is store their tag data right on the stack, using CLR structs (which are pretty similar to a C struct or C++ struct/class: allocated on the stack by default). I guess they could also have written their own memory allocator, but I'm not so sure it would have been any simpler than the custom allocator they would have implemented in C#, had they gone with the AllocHGlobal route. And then they would still have to deal with data marshaling between their native and managed code. Not saying that is necessarily a problem, but it's not always trivial either (especially marshaling custom struct types without causing boxing).
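
For reference, a hedged sketch of the shape the AllocHGlobal route takes (illustrative only, not their code):

    using System;
    using System.Runtime.InteropServices;

    class UnmanagedDemo
    {
        static void Main()
        {
            // 1 MB the GC never scans or moves; we must free it ourselves.
            IntPtr buffer = Marshal.AllocHGlobal(1024 * 1024);
            try
            {
                Marshal.WriteInt32(buffer, 42);               // raw write
                Console.WriteLine(Marshal.ReadInt32(buffer)); // raw read: 42
            }
            finally
            {
                Marshal.FreeHGlobal(buffer); // manual deallocation, C-style
            }
        }
    }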


Or, at the very least, perhaps drop to unmanaged code?


Fantastic read. Thanks for posting it!

After reading this article I felt extremely humbled. As a developer doing relatively "simple" work day to day, seeing this level of programming proficiency is very impressive. There isn't a chance I'd be able to figure out a problem such as this. At least certainly not within the timeframe that the SO team fixed it in.

It's articles like this that make me want to continue pushing my limits as a developer.


> There isn't a chance I'd be able to figure out a problem such as this.

Don't sell yourself short. You generally don't know what your actual limits are until you really push hard against them or break through them. I've had to hunt down and fix some very unusual problems, with few or no references and no results from Google.

Those kinds of mind-bending problems actually give a great deal of satisfaction. I'm serious: you never really know your limits until you break them. I think the trick is to keep growing and expanding your abilities and knowledge.


Very true. This is generally why I appended that with "at least not within the timeframe that the SO team did".

I think that I have gotten to where I am by constantly pushing myself, but reading articles like this just reminds me that I still have a ways to go.


Why not write a script that periodically takes a machine off the load balancer, forces a GC on it, and then adds it back again?

I think this would be a cleaner approach than changing the entire code base.


This wasn't a change to the entire codebase, only one portion of it (the tag engine). But more importantly, when only 6 web servers are running stackoverflow.com, there's a decent chance an untriggered GC stalls servers in the rotation anyway; granted, this approach lessens the chance of a user-facing stall overall. Also, it complicates the build process, since the queue of servers being rotated out gets involved: for example, if servers 3-4 are out for a GC while 1-2 are starting the build loop, you're running on only 5 and 6. While that's fine and doesn't even hit 50% CPU, it's a little risky. Worse, you have the very real possibility of taking ALL the servers down during a build-and-GC combo, which we'd never want to happen.


They rewrote a relatively small part of their code base: a small part of their in-memory cache. For a team that a) wrote the code base and b) changes it every day, this wasn't a big deal so much as it was an interesting problem. From an operations standpoint, I'd consider app servers that need to be rotated every hour or so a very brittle architecture. (I don't let that kind of thing go without protesting wildly.)


I disagree. Your solution attempts to fix the symptom. Their solution fixes the root cause of the problem.


It does? What is the "root cause" though? You seem to take it to mean the generation of references that must be traversed in gen2, but that sounds like just as much a hack to me: they modify the code in ways that don't represent the semantics of the problem to work around a limitation of the implementation.

Hell, you could argue the "root cause" is the use of a garbage-collected environment in the first place. All popular GC implementations have latency issues like this. All of them. If you can't deal with occasional high latencies, you should identify that requirement before choosing Java or C#.

All the solutions provided are just patching around that issue. I don't see anything in the post that looks like a root cause, and the GP's post has the advantage of being much simpler to implement.


I consider the root cause to be a mismatch between the memory management needs of the application and the assumptions of the GC. Their solution matches the application's memory usage to the assumptions of the GC.

What would be more useful to them is to be able to provide either hints to the GC, or to actually have some control over memory management. That is, keep the application the same, but let it influence GC behavior. They're forced to go at it the other way: keep the GC the same, but change the application so what the GC does becomes the right thing.

> they modify the code in ways that don't represent the semantics of the problem to work around a limitation of the implementation.

That is always true when you start optimizing for performance. And nick_craver explains why the proposed suggestion is easier said than done; there are hidden complexities. Personally, once an application needs that kind of external control, alarm bells go off in my head.


.NET 3.5 SP1 actually has GC notifications that are intended for this purpose: http://msdn.microsoft.com/en-us/library/cc713687.aspx
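
A sketch of how those notifications could drive the rotation idea (the rotation helpers are hypothetical placeholders; full-GC notifications also require concurrent GC to be disabled):

    using System;

    class GcWatcher
    {
        static void RemoveFromRotation() { /* hypothetical load-balancer call */ }
        static void AddToRotation()      { /* hypothetical load-balancer call */ }

        static void Main()
        {
            // Requires <gcConcurrent enabled="false"/> in the app config.
            GC.RegisterForFullGCNotification(10, 10); // thresholds are tuning guesses

            while (true)
            {
                if (GC.WaitForFullGCApproach() == GCNotificationStatus.Succeeded)
                {
                    RemoveFromRotation();       // pull this box out of rotation
                    GC.Collect();               // take the pause while out
                    GC.WaitForFullGCComplete();
                    AddToRotation();            // and put it back
                }
            }
        }
    }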


  I can not stress how important it is to have 
  queryable web logs.
I'd love to read some (actionable) articles on how to set up the benchmarking/logging stuff you guys have. For sure I'll check out the .NET Memory Profiler and MiniProfiler linked in the article, but I'd be grateful if anyone wants to post more comprehensive info on where to start.


I'm not one to hate on C# or to be a programming language snob, but this certainly looks like a case where poor performance and a lot of extra work were created simply because of a deficiency in a language. Languages like C# are advertised as being simple to use, and for the most part they are. They fit the bill in a lot of different use cases.

But it certainly seems to me that if you want to build a high-performance, scalable piece of software, you don't want to have to bang your head against the wall and fight deficiencies in the language implementation itself just to get your software working properly.


You are conflating the language with the runtime, first. And second, we were experiencing the occasional page taking maybe one second, which we considered unacceptable but 99% of the world wouldn’t notice.

(NB, languages like Ruby and PHP are lovely, but are over 10x slower than C# and Java.)


I'm so tired of benchmarks that play with ints and floats which show how fast Java is.

Most business app work is string manipulation. If you think Ruby is slow, look at the timings for "sequential" operation in the table on this page (yes, it's my site):

http://roboprogs.com/devel/2009.12.html

If you think that I have badly blundered in my methodology, get the benchmark code, fiddle with it, and run it on your own hardware:

https://github.com/roboprog/mp_bench

FYI: I'm not trying to say that "Ruby is the most awesomest language, like ever, d00dz!!!". I like different tools for different jobs. I'm just tired of seeing people judge implementation speeds based on bit twiddling benchmarks, rather than stuff that at least churns through a large number of strings, if not other object types, and does some I/O.


I noticed some differences in the code that I think make the versions not equivalent:

Perl: print $local_process_var, " ", &gen_pg_template(), "\n"

Python: print local_process_var, " ", gen_pg_template()

Java: System.out.println( timestamp + " " + genPgTemplate() );

I'm not sure, but I think the Perl/Python code sends the strings as separate arguments to print, which outputs them directly. In Java you concatenate 3 strings before sending the result to print. This adds some extra CPU usage that the other versions don't have.


One item of possible note: it wasn't possible to have list-driven output syntax in Java before 1.5. I'm not sure whether they added an output method with ellipses (varargs) then or not. Even if so, I'm not sure it would be faster.

Still, I could modify the scripting output lines to bring them down to Java's level :-)


Thanks for the observation. Not sure when I will get to revisiting this, but that seems like an easy inconsistency to clean up, and I probably should.


You're comparing the library-supplied date formatter for each language, not the language's speed at processing Strings. And the Java threading code is mostly benchmarking the speed of thread creation - in a real-world implementation you'd be using a thread pool with fewer lines of code.


Agreed about the thread usage: the code I put together for threads in the various languages is as naive as possible. Having a thread pool in proportion to the number of CPUs and pulling tasks from a queue (a la nginx) would be much more efficient.

Interesting comment about the date formatter. I might put something in sometime to compare the relative time spent formatting vs. concatenating. I suppose I'd be grateful if somebody made a fork with versions showing the split of time between those steps.


Yes, sorry, I meant the implementation of the language, i.e. the CLR.

That being said, having to rewrite an entire engine because of the underbelly of the CLR seems dubious, especially for a development environment that purports to make life easier for the developer.

I have nothing against C#; one of the trading frameworks I use (NinjaTrader) uses it exclusively, and it's pretty nice. It's just that I would hate to have invested weeks/months into a software project, using what I imagined were tried-and-true software techniques, and then have to rework them to get around the particulars of the implementation.

That being said, I guess all languages have similar problems. For example, when I was working on a backend server back in 2001, we found that our product was "leaking" memory on HP-UX, even though none of the memory leak detectors found any problems, and the leak didn't occur on Solaris, AIX, or Windows (we had a cross-platform server). It turned out that on high-load systems our app was allocating a bunch of small strings all over the place, and the default HP-UX memory allocator was fragmenting memory over time. Switching to another memory allocator that handled this situation fixed the problem.

I haven't come across any C/C++ books that say "don't allocate millions of small memory blocks, because your memory might fragment even though you free them properly." So maybe I misspoke; I guess to some degree this does manifest itself in all language implementations.


Isn't that the issue Firefox was having with their so-called "memory leaks"?


How is this a deficiency in the C# language, exactly? Sounds more like an issue in the current version of the .NET CLR to me.


What is the deficiency? How do other languages solve problems created by unfortunate coincidences between your usage pattern and the GC's timing?


Some languages solve this problem by not having a GC at all :) However, with a problem involving huge in-memory structures like this, even manually allocating and deallocating everything properly would probably be non-trivial.


So would you go with C or C++? Because otherwise I would say the consensus is that C# is at the higher end of the performance scale.


I can't say that the Java Virtual Machine would have done much better, since it operates under pretty much the same theory (generational garbage collection).

At least Ruby has the option to mark some strings as outside of garbage collection -- to make them permanent symbols/atoms.

We need an option in these modern GC languages to select reference counting, or perhaps for portions of code to kick it old school and simply ask for some objects to be allocated outside the GC heap, on the assumption that they will be around for a while and that the programmer will manually throw that subset of objects away if they become unwanted.

TMTOWTDI. (and all ways seem to have their place at times)


You can intern strings manually in .NET (I think you can in Java too), which effectively makes them ineligible for collection. There are serious memory concerns with that approach, since nothing in the tag engine is a constant.

In C# you can use the unsafe keyword to get more or less back into the C/C++ world. While your new'd objects are still GC'd, you have the option of just grabbing an allocator and getting unmanaged memory if you feel like it.
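
A rough sketch of both escape hatches (illustrative values; requires compiling with /unsafe):

    using System;

    class EscapeHatches
    {
        static unsafe void Main()
        {
            // 1. Interning: the string goes into the intern pool and lives
            //    for the rest of the process, so it is never collected.
            string tag = string.Intern("garbage-collection");

            // 2. unsafe + stackalloc: stack memory the GC never sees.
            int* counts = stackalloc int[256];
            counts[0] = 1;

            Console.WriteLine(tag + " " + counts[0]);
        }
    }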

It's worth noting that the .NET GC works quite well for typical cases, and only noticeably chokes on this one use case we have and only under load that most websites will never see (and most users don't notice even on SO, but we watch the logs religiously to make sure these things don't become noticeable). There are also some improvements coming in .NET 4.5's GC that may mitigate this, but we're naturally not willing to wait for it.

I worry people may be taking "don't use .NET, its GC sucks" away from this blog post, which really isn't the point (or accurate).


> I worry people may be taking "don't use .NET, its GC sucks" away from this blog post, which really isn't the point (or accurate).

I get the feeling some already came in with that opinion, so they will take away what they want to. I personally got the impression that this is something that would have caused pain in almost any language with a generational GC.


I definitely do NOT mean to imply that the .NET GC is inferior. The very fact that you have options of how to manage memory at all in C#/.NET makes it superior to the JVM in that regard.

Admission: I am reluctant to work on Windows, though. But since that is the Stack Exchange environment, C#/.NET seems hands-down the best platform for it.


Pray tell, what language or framework would not experience minor performance issues running at the scale they do?


Slightly off topic, for any SO employees here: have you ever experimented with using mono-fastcgi-server for any part of your sites, and what were your findings? Assuming any of your code compiles under Mono.


We have not. Has anyone run any benchmarks/tests comparing the Mono GC to the Microsoft GC? I am pretty sure we would need to change large chunks of our system to run it on Mono.


My understanding (from discussions with some Fog Creek devs who've done the mono dance for FogBugz) is that ASP.NET on mono is an awful lot of trouble, and doesn't run under IIS.

http://www.mono-project.com/FAQ:_ASP.NET

Seems to validate the IIS bit.


These two phrases, early in the article, could give someone an interesting problem to solve:

> Our initial tag engine design was a GC nightmare. The worst possible type memory allocation. Lots of objects with lots of references that lived just enough time to get into generation 2 (which was collected every couple of minutes on our web servers).

> Interestingly, the worst possible scenario for the GC is quite common in our environment. We quite often cache information for 1 to 5 minutes.

(Curiously, this is under a header mentioning "abuse of the garbage collector". I'd put it the other way around ...)


Less a stop watch in the brain, more likely a case of simply pressing F12 in the browser. All of the consoles have profilers these days.

I must admit I am a little surprised that tags are stored as a free text field.


The issue popped up once in a hundred times; in general, people do not notice.

Tags are stored both in a PostTags table and as free text. FTS is faster for many queries.


I admit it surprised me that we stored tags in an FTS field when I started, but after a few months it made quite a bit of sense; the queries are just way faster.


To SamS and the SO team: what drove the choice to Free Text Search? Any other options evaluated at the time?


Great post. I always learn something new when the SO guys share their experiences with .NET in a large-scale web application. Thanks for the link to the memory profiler. Hopefully it will be better than struggling with WinDbg.


If they used an array of structs they would've cut down on the GC's graph traversal significantly.

Another option I was surprised they didn't mention was to run GC.Collect periodically to clear out gen2.
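
For illustration, a sketch of the array-of-structs idea (made-up field names, not SO's actual tag engine):

    // With an array of structs, the GC sees one object (the array) whose
    // elements are stored inline, so a gen-2 collection has no per-element
    // references to chase.
    struct TagPost
    {
        public int PostId;
        public int TagId;
    }

    class TagIndex
    {
        // A single large allocation instead of millions of small objects.
        TagPost[] entries = new TagPost[10000000];
    }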


Running Collect periodically probably won't help much in this scenario - the huge data structure isn't actually being collected during the GC.Collect. But the GC doesn't know this until it spikes the CPU while it walks a bajillion references.

Since nothing is actually getting collected, if you ran GC.Collect periodically, you would just be incurring more lag with no benefit. I think your idea might help more in a different situation, one where you do have large amounts of memory getting reclaimed with each GC sweep.



