
Whenever I review C code, I first look at the string function uses. Almost always I'll find a bug. It's usually an off by one error dealing with the terminating 0. It's also always a tangled bit of code, and slow due to repeatedly running strlen.

But strings in BASIC are so simple. They just work. I decided when designing D that it wouldn't be good unless string handling was as easy as in BASIC.




In the case of C, it's a design decision Dennis Ritchie made that came down to the particular instruction set of the PDP-11, which could efficiently process zero-terminated strings.

So a severely memory limited architecture of the 70s led to blending of data with control - which is never a safe idea, see naked SQL. We now perpetuate this madness of nul-terminated strings on architectures that have 4 to 6 orders of magnitude more memory than the original PDP-11.

It's also highly inefficient, because the length of a string is a fundamental property that must be recomputed frequently if not cached.

Bottom line, unless you work on non-security sensitive embedded systems like microwave ovens or mice, there is absolutely no place for nul-terminated strings in today's computing.


Mr. Bright, I just want to thank you for creating D.

It is by far my favorite language, because it is filled with elegant solutions to hard language problems.

As a perfectionist, there are very few things I would change about it. People rave about Rust these days, but I rave about D in return.

Just wanted to say thanks (and that I bought a D hoodie).


Your words have just convinced me to try out D. Maybe some good will come out of it :)


Thanks for the kind words!


Hello Walter! All things considered, you are probably the best person to ask for tips on string handling in C.

Would you mind sharing the things that you look for, from the obvious to the subtle? I would love to see some rejected pull requests if possible. If I were writing C under your direction, what would you drill into me?

Thank you, it is an honour to address you here.


1. whenever you see strncpy(), there's a bug in the code. Nobody remembers if the `n` includes the terminating 0 or not. I implemented it, and I never remember. I always have to look it up. Don't trust your memory on it. Same goes for all the `n` string functions.

2. be aware of all the C string functions that do strlen. Only do strlen once. Then use memcmp, memcpy, memchr.

3. assign strlen result to a const variable.

4. for performance, use a temporary array on the stack rather than malloc. Have it fail over to malloc if it isn't long enough. You'd be amazed how this speeds things up. Use a shorter array length for debug builds, so your tests are sure to trip the fail over.

5. remove all hard-coded string length maximums

6. make sure size_t is used for all string lengths

7. disassemble the string handling code you're proud of after it compiles. You'll learn a lot about how to write better string code that way

8. I've found subtle errors in online documentation of the string functions. Never use them. Use the C Standard. Especially for the `n` string functions.

9. If you're doing 32 bit code and dealing with user input, be wary of length overflows.

10. check again to ensure your created string is 0 terminated

11. check again to ensure adding the terminating 0 does not overflow the buffer

12. don't forget to check for a NULL pointer

13. ensure all variables are initialized before using them

14. minimize the lifetime of each variable

15. do not recycle variables - give each temporary its own name. Try to make these temporaries const, refactor if that'll enable it to be const.

16. watch out for `char` being either signed or unsigned

17. I structure loops so the condition is <. Avoid using <=, as odds are high that'll result in a fencepost error

That's all off the top of my head. Hope it's useful for you!
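A minimal sketch of how tips 2, 3, 4, 9 and 12 might look together in C. The names `with_joined`, `joined_equals` and `TMP_BUF_SIZE` are made up for illustration and are not from any library; the buffer size is an arbitrary assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stack buffer size. Build with e.g. -DTMP_BUF_SIZE=8 in debug
   builds so the tests are sure to exercise the malloc fallback. */
#ifndef TMP_BUF_SIZE
#define TMP_BUF_SIZE 64
#endif

typedef int (*joined_fn)(const char *s, size_t len, void *ctx);

/* Builds a + sep + b in a temporary and hands it to fn.
   Returns fn's result, or -1 on NULL input, length overflow or malloc failure. */
int with_joined(const char *a, char sep, const char *b, joined_fn fn, void *ctx)
{
    if (a == NULL || b == NULL || fn == NULL)
        return -1;

    /* strlen exactly once per string; the results are const. */
    const size_t alen = strlen(a);
    const size_t blen = strlen(b);

    /* Reject lengths whose sum (plus separator and terminator) would wrap. */
    if (alen > SIZE_MAX - 2 || blen > SIZE_MAX - 2 - alen)
        return -1;
    const size_t total = alen + 1 + blen;   /* excludes the terminating 0 */

    /* Stack buffer first; fall back to malloc when it is too short. */
    char stackbuf[TMP_BUF_SIZE];
    char *const buf = (total < sizeof stackbuf) ? stackbuf : malloc(total + 1);
    if (buf == NULL)
        return -1;

    /* The lengths are known, so memcpy is enough. */
    memcpy(buf, a, alen);
    buf[alen] = sep;
    memcpy(buf + alen + 1, b, blen);
    buf[total] = '\0';                      /* room for total + 1 bytes exists */

    const int result = fn(buf, total, ctx);

    if (buf != stackbuf)
        free(buf);
    return result;
}

/* Example callback: returns 1 if the joined string equals the string in ctx. */
int joined_equals(const char *s, size_t len, void *ctx)
{
    const char *const expected = ctx;
    assert(s != NULL && expected != NULL);
    return strlen(expected) == len && memcmp(s, expected, len) == 0;
}
```

For instance, `with_joined("usr", '/', "bin", joined_equals, "usr/bin")` returns 1 and never touches malloc, while two 99-character inputs take the fallback path with the default size.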


So many potential pitfalls to string functions. But memcpy and friends can have pitfalls too.

I was working on a RISC processor and somebody started using various std lib functions like memcpy from a linux tool chain. I got a bug report - it crashed on certain alignments. Made sense - this processor could only copy words on word alignment etc.

So I wrote a test program for memcpy. Copy 0-128 bytes from a source buffer from offsets 0-128 to a destination buffer at offset 0-128, all combinations of that. Faulted on an alignment issue in code that tried to save cycles by doing register-sized load and store without checking alignment. That was easy! Fixed it. Ran again. Faulted again - different issue, different place.

Before I was done, I had to fix 11 alignment issues. A total fail for whoever wrote that memcpy implementation.

What was the lesson? Well, writing exhaustive tests is a good one. Not blindly trusting std intrinsic libraries is another.
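A test of the kind described above could be sketched like this in C. It is not the original test program; `test_copy_exhaustive`, `broken_word_copy` and the fill pattern are assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAX_LEN = 128, MAX_OFF = 128, FILL = 0xAA };

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Tries every length 0..128 with every source and destination offset 0..128.
   Returns how many combinations copied wrongly or touched the byte just
   before or just after the destination range. */
unsigned long test_copy_exhaustive(copy_fn copy)
{
    static unsigned char src[MAX_OFF + MAX_LEN];
    static unsigned char dst[1 + MAX_OFF + MAX_LEN + 1]; /* guard byte each end */
    unsigned long failures = 0;

    assert(copy != NULL);
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(1 + i % 100);           /* never equals FILL */

    for (size_t len = 0; len < MAX_LEN + 1; len++) {
        for (size_t soff = 0; soff < MAX_OFF + 1; soff++) {
            for (size_t doff = 0; doff < MAX_OFF + 1; doff++) {
                unsigned char *const d = dst + 1 + doff;
                memset(dst, FILL, sizeof dst);
                copy(d, src + soff, len);
                const int ok = memcmp(d, src + soff, len) == 0
                            && d[-1] == FILL && d[len] == FILL;
                failures += !ok;
            }
        }
    }
    return failures;
}

/* Deliberately broken, hypothetical copy: moves whole 4-byte words only and
   drops any tail, to show the harness catching a bug. */
void *broken_word_copy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n - n % 4);
}
```

On a working C library, `test_copy_exhaustive(memcpy)` should return 0, while `test_copy_exhaustive(broken_word_copy)` returns a nonzero failure count. A faulting implementation would crash rather than return, which is just as informative.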

But the one I took with me was, why the hell isn't there an instruction in every processor to efficiently copy from arbitrary source to arbitrary destination with maximum bus efficiency? Why was this a software issue at all! I've been facing code issues like this for decades, and it seems like it will never end.

</rant>


The x86 does have a builtin memcpy instruction. But whether it is best to use it or not depends on which iteration of the x86 you're targeting. Sigh.


>why the hell isn't there an instruction in every processor to efficiently copy from arbitrary source to arbitrary destination with maximum bus efficiency?

Uh, you're not a hardware designer and it shows. What if there's a page fault during the copy, do you handle it in the CPU? That said, have a look at the RISC-V vector instructions (not yet stable AFAIK) and ARM's SVE2: both should allow a very efficient memcpy (among other things) much more easily than with current SIMD ISAs.


Do they manage alignment? Say a source string starting at offset 3 inside a dword, to a destination at offset 1? That's the issue. Not just block copy of aligned register-sized memory.

Page fault is irrelevant. It already can happen in block copy instructions.


No, they don't provide alignment but they provide a way to write code once whatever the size of the implementation's vector registers.

As for a block copy instruction, AFAIK there's no such thing in RISC-V, for example.


So, no, they don't have anything like an arbitrary block copy that adjusts for alignment. Not surprising; nobody does. So we struggle in software, and have libraries with 11 bugs etc.


strncpy() suffers from its naming. It never was a string function in reality. It is a function to write and clear a fixed size buffer. It was invented to write filenames in the 14 character buffer of a directory entry in early Unix. It should have been named mem-something and people would never have come to the idea of using it for general string routines.


If it respects null terminator, then it is a string function.


It basically expects a string as the source and a fixed-size, not necessarily zero-terminated, buffer as the destination.


Super insightful list.

What will be the alternative for strncpy/strncat? I thought they're a safer strcpy/strcat but now I need something to replace them.

I assume snprintf for sprintf, vsnprintf for vsprintf.

No idea what to do with gmtime/localtime/ctime/ctime_r/asctime/asctime_r, any alternatives for them too?


My alternative is to do a strlen for each string, then use memcpy memset memchr instead.

> I thought they're a safer strcpy/strcat

Let's look at the documentation for strncpy, from the C Standard:

"The strncpy function copies not more than n characters (characters that follow a null character are not copied) from the array pointed to by s2 to the array pointed to by s1."

There's a subtle gotcha there. It may not result in a 0 terminated string!

"If the array pointed to by s2 is a string that is shorter than n characters, null characters are appended to the copy in the array pointed to by s1, until n characters in all have been written."

A performance problem if you're using a large buffer.

Yeah, always prefer snprintf.

The time functions? I'm just very careful using them.
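Both gotchas quoted from the Standard above are easy to check for yourself. A small C sketch, where `has_terminator`, `trailing_zeros` and `copy_string` are hypothetical helper names made up for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the first n bytes of buf contain a terminating 0. */
int has_terminator(const char *buf, size_t n)
{
    assert(buf != NULL);
    return memchr(buf, '\0', n) != NULL;
}

/* Counts the 0 bytes at the end of the first n bytes of buf. */
size_t trailing_zeros(const char *buf, size_t n)
{
    assert(buf != NULL);
    size_t count = 0;
    while (count < n && buf[n - 1 - count] == '\0')
        count++;
    return count;
}

/* Bounded copy via snprintf: the result is always 0 terminated when
   dst_size > 0. Returns 1 if src fit completely, 0 if it was truncated
   or the arguments were unusable. */
int copy_string(char *dst, size_t dst_size, const char *src)
{
    if (dst == NULL || src == NULL || dst_size == 0)
        return 0;
    const int needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

With `char small[5]`, calling `strncpy(small, "hello", sizeof small)` leaves `has_terminator(small, sizeof small)` false, because all five bytes are used by the letters. With `char big[2000]`, the same call pads the remaining 1995 bytes with 0. `copy_string(small, sizeof small, "hello")` instead stores "hell" plus the terminator and reports the truncation by returning 0.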


Also most of the time people have serious performance regressions with strncpy() as the function overwrites all the rest of the buffer with 0.

     char buffer[2000];
     strncpy(buffer, "hello", sizeof buffer);
writes "hello" and 1995 0 to the buffer.


Thank you Walter! I will be sure to internalize this. There are some terrific tips in here, such as using shorter array lengths for debug build and avoiding <= as a loop condition. And I don't recall ever seeing char signed, but now I'm terrified.

Thank you, have a great weekend!


char being signed used to be commonplace. Either signedness is allowed by the C Standard, so it's best not to assume one way or the other.
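A small C sketch of how that assumption can bite. The function names are made up for illustration; the only library fact relied on is `CHAR_MIN` from `<limits.h>`.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Reports how plain char behaves on this implementation. */
int char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Portable: look at each byte as unsigned char before comparing. */
size_t count_non_ascii(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        const unsigned char byte = (unsigned char)s[i];
        if (byte > 0x7F)
            count++;
    }
    return count;
}

/* Not portable: where char is signed, s[i] can never exceed 0x7F,
   so this silently counts nothing. */
size_t count_non_ascii_naive(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] > 0x7F)
            count++;
    }
    return count;
}
```

For the UTF-8 bytes of "café", written as `"caf\xC3\xA9"`, the portable version counts 2 everywhere. The naive version counts 2 only where `char_is_signed()` returns 0, and 0 elsewhere.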


Thank you for the great list. Could you give examples of 8. subtle errors in online documentation?


The trouble stems from the C Standard being copyrighted. Hence, anyone writing online documentation is forced to rewrite and rephrase what the Standard says. The Standard is written in very precise language, and is vetted by the best and most persnickety C programmers.

But the various rewrites and rephrases? Nope. If you absolutely, positively want to get it right, refer to the C Standard.

printf is particularly troublesome. The interactions between argument types and the various formatting flags are not at all simple.

Other sources of error are the 0 handling of `n` functions, and behavior when a NaN is seen.


With so many gotchas, it irks me when they still teach C for the undergraduates.


IIRC, in the early days of the Commodore PET, it used a method of keeping track of strings that was fine in an 8k machine but was too slow in a 32k machine. They had to make a change that avoided quadratic time on the larger machine. So string handling in BASIC wasn't always that simple.


It always blows my mind when I remember 8-bit computers had garbage-collected strings.


+1 for the PET mention since it was my first "computer". much overlooked in favour of the 64


Ah, yes. I recall the luxury of a Commodore of my very own (a C128), after using PETs in school. We had a whole three of them at the time, with a shared, dual-floppy drive for the set.

Naturally, our teacher wisely pushed hard on figuring what you could out on paper first.


  > Naturally, our teacher wisely pushed hard on figuring what
  > you could out on paper first.
Specifically in the case of the Commodores (I grew up on a C128) I find this observation backwards. Sure, if you only had three machines for twenty students then time on the machine was valuable. But on those machines there was so much to explore with poke (and peek to know what to put back). From changing the colours of the display to changing the behaviour of the interpreter.

I think that I discovered the for loop at eight years old just to poke faster!


I’m not sure if it was the original purpose of C, or if it’s what made C popular, but compared to BASIC, processing strings in C was much faster.


Everything was faster in C - it was compiled and BASIC was interpreted.

Better comparison would be between C and Turbo Pascal strings in DOS times. TP strings were limited to 255 characters but they were almost as fast as C strings, in some operations (like checking length) they were faster, and you had to work very hard to create a memory leak or security problem using them.
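For anyone who never used them, the idea behind those strings can be sketched in C as a length byte followed by the characters. This is only an illustration of the layout, not Turbo Pascal's actual runtime; `ShortString` and the `ss_` functions are made-up names, and truncating on overflow is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One length byte, then up to 255 characters. No terminator is needed. */
typedef struct {
    unsigned char len;
    char data[255];
} ShortString;

/* Length is a single load, not a scan for a terminating 0. */
size_t ss_length(const ShortString *s)
{
    assert(s != NULL);
    return s->len;
}

/* Assigns from a C string, truncating at 255 characters.
   Returns the number of characters dropped. */
size_t ss_assign(ShortString *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    const size_t srclen = strlen(src);
    const size_t n = srclen < sizeof dst->data ? srclen : sizeof dst->data;
    memcpy(dst->data, src, n);
    dst->len = (unsigned char)n;
    return srclen - n;
}

/* Appends src to dst, truncating at 255 characters in total.
   Returns the number of characters dropped. */
size_t ss_append(ShortString *dst, const ShortString *src)
{
    assert(dst != NULL && src != NULL);
    const size_t room = sizeof dst->data - dst->len;
    const size_t n = src->len < room ? src->len : room;
    memcpy(dst->data + dst->len, src->data, n);
    dst->len = (unsigned char)(dst->len + n);
    return src->len - n;
}
```

Because the length travels with the data, there is no terminator to forget and no way to write past the 255 bytes through these functions, which is roughly why such strings were hard to misuse.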

I learned Pascal before C, and the whole mess with arrays/strings/pointers was shocking to me.


UCSD and Turbo Pascal had it easy with the 255 byte strings. They had real strings, but these were compiler extensions. Real Pascal didn't have string support: you could only work with packed arrays of char of fixed size, and as the language was extremely strongly typed, two packed char array types of different lengths were considered different types, so you had to write procedures and functions for every packed array size you used.


Brian Kernighan on "Why Pascal is not my Favorite Programming Language" (https://www.lysator.liu.se/c/bwk-on-pascal.html) [1981].

Turbo Pascal wasn't released until 1983, if the wiki is to be believed.


I find it strange that he complained about Pascal's lack of dynamic arrays, when the Pascal solution is to use pointers (exactly what C does for all arrays and strings anyway).

Many of his other points are solved by Turbo Pascal and Delphi/Object Pascal.

But of course nowadays there are better languages for real world programming. It's just a shame that there's nothing as simple and elegant for teaching programming (*).

(*) lisp is even more elegant, but it has a lot of gotchas and it's so far from mainstream that using it for teaching isn't a good idea IMHO


I learned C before Pascal and having to write so much code to deal with 255 character limits was kind of jarring.



