If your text manipulation programs are locale-aware, they may be interpreting the input as a multibyte encoding, and they must do considerably more preprocessing to operate in a semantically correct way. For example, a Unicode-aware grep may recognize more forms of equivalence, and a Unicode-aware sort must apply locale collation rules. See e.g. http://en.wikipedia.org/wiki/Unicode_equivalence
With the C locale, text is more or less treated as plain bytes.
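The difference is easy to see with sort: the C locale compares raw bytes, while a UTF-8 locale applies collation rules. A small sketch (the `en_US.UTF-8` locale name is an assumption; substitute any UTF-8 locale installed on your system):

```shell
# Byte-order sorting under the C locale: all uppercase ('A'=65, 'B'=66)
# sorts before all lowercase ('a'=97, 'b'=98).
printf 'b\nA\na\nB\n' | LC_ALL=C sort

# Locale-aware collation typically interleaves case instead
# (locale name is an assumption, not guaranteed to exist everywhere):
printf 'b\nA\na\nB\n' | LC_ALL=en_US.UTF-8 sort
```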
According to the comments in that thread, this issue was fixed in GNU grep 2.7 (my system currently has grep 2.14 on it, so this must have been some time ago).
It's no longer quadratic in as many cases, but it's still true that UTF-8 string operations cost, at best, several CPU cycles per character consumed, even when the input is pure ASCII. LC_ALL=C pretty much guarantees one CPU cycle or less per input byte. Basics like strlen, strchr, and strstr are significantly faster in the "C" locale.
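You can measure the gap on your own machine by timing the same search in both locales; the file path and input size below are arbitrary, and the `en_US.UTF-8` locale name is an assumption:

```shell
# Generate ~1M lines of plain ASCII test input (path is arbitrary).
yes 'the quick brown fox' | head -n 1000000 > /tmp/bench.txt

# Time a fixed-string search in a UTF-8 locale vs. the C locale.
# On grep >= 2.7 the difference is much smaller than it used to be,
# but the C locale is still typically faster.
time LC_ALL=en_US.UTF-8 grep -c fox /tmp/bench.txt
time LC_ALL=C grep -c fox /tmp/bench.txt
```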