
Could you expand on why

   export LC_ALL=C

would "make all your commands 3x faster"?



If your text manipulation programs are locale-aware, they may be interpreting the input as a multibyte encoding, and may need to do a lot more preprocessing to operate in a semantically correct way. For example, a Unicode-aware grep may understand more forms of equivalence, and a locale-aware sort applies the locale's collation rules instead of comparing raw bytes. See e.g. http://en.wikipedia.org/wiki/Unicode_equivalence

With the C locale, text is more or less treated as plain bytes.
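To make the "plain bytes" point concrete, here is a minimal sketch using sort. It only relies on POSIX printf and sort; the four input lines are made up for illustration. Under the C locale the comparison is by byte value, so all uppercase ASCII letters come before all lowercase ones. What a UTF-8 locale prints instead depends on the collation tables installed on your system, so no output is claimed for that case.

```shell
# Sort four made-up lines under the C locale. Comparison is by raw byte
# value, so 'A' (0x41) and 'B' (0x42) sort before 'a' (0x61) and 'b' (0x62).
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# prints:
# A
# B
# a
# b
```

Setting LC_ALL=C as a prefix like this scopes it to the one command, which is a gentler alternative to exporting it for the whole session.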


Actually, it was more like 2000X[1] -- and I believe that it still stands as Brendan Gregg's biggest performance win.

[1] http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance...


According to the comments in that thread, this issue was fixed in GNU grep 2.7 (my system currently has grep 2.14 on it, so this must have been some time ago).


GNU grep is, or was, very slow in a UTF-8 locale. I'm not sure about other commands; perhaps anything that processes text, such as awk and sed?



It's no longer quadratic in so many cases, but it's still true that UTF-8 string operations require, in the best case, several CPU cycles per character consumed, even when the input is an ASCII subset. LC_ALL=C pretty much guarantees one or fewer CPU cycles per input character. Basics like strlen and strchr and strstr are significantly faster in "C" locale.
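If you want to check this on your own machine, a rough sketch follows. It builds a throwaway ASCII file and runs the same grep under both locales. The locale name en_US.UTF-8 is an assumption (it may not be installed on your system, in which case the second run may fall back to the C locale), and the file contents are made up. For pure ASCII input the match counts should agree, so the only difference left to measure is time.

```shell
# Build a throwaway ASCII test file (path chosen by mktemp).
tmp=$(mktemp)
i=0
while [ "$i" -lt 5000 ]; do
  echo "line $i some plain ascii text"
  i=$((i + 1))
done > "$tmp"

# Run the same fixed-pattern search under the C locale and under an
# assumed UTF-8 locale; LC_ALL is set per command, not exported.
c_count=$(LC_ALL=C grep -c 'ascii' "$tmp")
u_count=$(LC_ALL=en_US.UTF-8 grep -c 'ascii' "$tmp" 2>/dev/null)

echo "C locale matches:     $c_count"
echo "UTF-8 locale matches: $u_count"

rm -f "$tmp"
```

To compare speed, wrap each grep invocation in your shell's time facility and use a much larger file and a less trivial pattern; on a recent grep the gap for a simple case like this may be small.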



