Whenever I review C code, I first look at the string function uses. Almost always I'll find a bug. It's usually an off by one error dealing with the terminating 0. It's also always a tangled bit of code, and slow due to repeatedly running strlen.
But strings in BASIC are so simple. They just work. I decided when designing D that it wouldn't be good unless string handling was as easy as in BASIC.
In the case of C, it's a design decision Dennis Ritchie made that came down to the particular instruction set of the PDP-11, which could efficiently process zero-terminated strings.
So a severely memory limited architecture of the 70s led to blending of data with control - which is never a safe idea, see naked SQL. We now perpetuate this madness of nul-terminated strings on architectures that have 4 to 6 orders of magnitude more memory than the original PDP-11.
It's also highly inefficient, because the length of a string is a fundamental property that must be recomputed frequently if not cached.
Bottom line, unless you work on non-security sensitive embedded systems like microwave ovens or mice, there is absolutely no place for nul-terminated strings in today's computing.
Hello Walter! All things considered, you are probably the best person to ask for tips on string handling in C.
Would you mind sharing the things that you look for, from the obvious to the subtle? I would love to see some rejected push requests if possible. If I were writing C under your direction, what would you drill into me?
1. whenever you see strncpy(), there's a bug in the code. Nobody remembers if the `n` includes the terminating 0 or not. I implemented it, and I never remember. I always have to look it up. Don't trust your memory on it. Same goes for all the `n` string functions.
2. be aware of all the C string functions that do strlen. Only do strlen once. Then use memcmp, memcpy, memchr.
3. assign strlen result to a const variable.
4. for performance, use a temporary array on the stack rather than malloc. Have it fail over to malloc if it isn't long enough. You'd be amazed how this speeds things up. Use a shorter array length for debug builds, so your tests are sure to trip the fail over.
5. remove all hard-coded string length maximums
6. make sure size_t is used for all string lengths
7. disassemble the string handling code you're proud of after it compiles. You'll learn a lot about how to write better string code that way
8. I've found subtle errors in online documentation of the string functions. Never rely on it. Use the C Standard. Especially for the `n` string functions.
9. If you're doing 32 bit code and dealing with user input, be wary of length overflows.
10. check again to ensure your created string is 0 terminated
11. check again to ensure adding the terminating 0 does not overflow the buffer
12. don't forget to check for a NULL pointer
13. ensure all variables are initialized before using them
14. minimize the lifetime of each variable
15. do not recycle variables - give each temporary its own name. Try to make these temporaries const, refactor if that'll enable it to be const.
16. watch out for `char` being either signed or unsigned
17. I structure loops so the condition is <. Avoid using <=, as odds are high that'll result in a fencepost error
That's all off the top of my head. Hope it's useful for you!
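A few of the tips above (2, 3, 4, 6, 9, 10, 12) can be sketched together roughly like this. It is only an illustration: `joined_equals` and `STACK_BUF_SIZE` are hypothetical names, not anything from a real library, and the buffer sizes are arbitrary assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sizes: deliberately tiny in debug builds so that ordinary
   tests trip the malloc fail-over path (tip 4). */
#ifdef NDEBUG
enum { STACK_BUF_SIZE = 256 };
#else
enum { STACK_BUF_SIZE = 8 };
#endif

/* Hypothetical helper: builds "dir/name" in a scratch buffer and compares
   it with expected. Returns 1 if equal, 0 if not, -1 on NULL input,
   length overflow, or allocation failure. */
int joined_equals(const char *dir, const char *name, const char *expected)
{
    /* Tip 12: check for NULL pointers. */
    if (dir == NULL || name == NULL || expected == NULL)
        return -1;

    /* Tips 2, 3, 6: one strlen per string, kept in a const size_t. */
    const size_t dir_len = strlen(dir);
    const size_t name_len = strlen(name);
    const size_t expected_len = strlen(expected);

    /* Tip 9: refuse lengths whose sum (plus '/' and the 0) would wrap. */
    if (dir_len > SIZE_MAX - 2 || name_len > SIZE_MAX - 2 - dir_len)
        return -1;
    const size_t total = dir_len + 1 + name_len + 1; /* includes the 0 */

    /* Tip 4: stack buffer first, malloc only when it is too small. */
    char stack_buf[STACK_BUF_SIZE];
    const int needs_heap = total > sizeof stack_buf;
    char *const heap_buf = needs_heap ? malloc(total) : NULL;
    if (needs_heap && heap_buf == NULL)
        return -1;
    char *const buf = needs_heap ? heap_buf : stack_buf;

    /* Tip 2: lengths are known, so memcpy/memcmp replace strcpy/strcat. */
    memcpy(buf, dir, dir_len);
    buf[dir_len] = '/';
    memcpy(buf + dir_len + 1, name, name_len);
    buf[total - 1] = '\0';
    assert(buf[total - 1] == '\0'); /* tip 10: check termination again */

    const int equal = (total - 1 == expected_len)
                      && memcmp(buf, expected, expected_len) == 0;
    free(heap_buf); /* free(NULL) is a no-op */
    return equal;
}
```

In a debug build, `joined_equals("usr", "lib", "usr/lib")` needs exactly 8 bytes and stays on the stack, while anything longer goes through malloc, so both paths get exercised by small test inputs.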
So many potential pitfalls to string functions. But memcpy and friends can have pitfalls too.
I was working on a RISC processor and somebody started using various std lib functions like memcpy from a linux tool chain. I got a bug report - it crashed on certain alignments. Made sense - this processor could only copy words on word alignment etc.
So I wrote a test program for memcpy. Copy 0-128 bytes from a source buffer from offsets 0-128 to a destination buffer at offset 0-128, all combinations of that. Faulted on an alignment issue in code that tried to save cycles by doing register-sized load and store without checking alignment. That was easy! Fixed it. Ran again. Faulted again - different issue, different place.
Before I was done, I had to fix 11 alignment issues. A total fail for whoever wrote that memcpy implementation.
What was the lesson? Well, writing exhaustive tests is a good one. Not blindly trusting std intrinsic libraries is another.
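For anyone who wants to try the same kind of exhaustive test, here is a rough sketch of the idea described above: every length 0-128 from every source offset 0-128 to every destination offset 0-128, with guard bytes checked on both sides. The names (`count_copy_failures`, `short_copy`, `copy_fn`) and the guard size are my own assumptions, not the original test program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
    LEN_COUNT = 129, /* lengths 0..128 */
    OFF_COUNT = 129, /* offsets 0..128 */
    GUARD = 16,      /* assumed number of guard bytes checked on each side */
    BUF_SIZE = (OFF_COUNT - 1) + (LEN_COUNT - 1) + GUARD,
    FILL = 0xFF      /* never appears in the source pattern below */
};

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Returns 1 if the n bytes at p all still hold the fill value. */
static int all_fill(const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] != FILL)
            return 0;
    return 1;
}

/* Byte-by-byte compare, so the check does not lean on the library under test. */
static int same_bytes(const unsigned char *a, const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}

/* Deliberately broken copy (hypothetical), used to show the harness catches bugs. */
void *short_copy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n > 0 ? n - 1 : 0);
}

/* Returns the number of failing (length, src offset, dst offset) combinations. */
unsigned long count_copy_failures(copy_fn copy)
{
    static unsigned char src[BUF_SIZE];
    static unsigned char dst[BUF_SIZE];
    unsigned long failures = 0;

    assert(copy != NULL);
    for (size_t i = 0; i < BUF_SIZE; i++)
        src[i] = (unsigned char)(i % 251); /* values 0..250, never FILL */

    for (size_t len = 0; len < LEN_COUNT; len++) {
        for (size_t s = 0; s < OFF_COUNT; s++) {
            for (size_t d = 0; d < OFF_COUNT; d++) {
                memset(dst, FILL, sizeof dst);
                copy(dst + d, src + s, len);

                const size_t before = d < GUARD ? d : GUARD;
                const int ok = same_bytes(dst + d, src + s, len)
                               && all_fill(dst + d - before, before)
                               && all_fill(dst + d + len, GUARD);
                if (!ok)
                    failures++;
            }
        }
    }
    return failures;
}
```

Calling `count_copy_failures(memcpy)` should report 0 on a sane library, and passing `short_copy` should report plenty of failures. On a target that faults on misalignment, the crash itself is the failure report, which matches what happened in the story.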
But the one I took with me was, why the hell isn't there an instruction in every processor to efficiently copy from arbitrary source to arbitrary destination with maximum bus efficiency? Why was this a software issue at all! I've been facing code issues like this for decades, and it seems like it will never end.
>why the hell isn't there an instruction in every processor to efficiently copy from arbitrary source to arbitrary destination with maximum bus efficiency?
Uh, you're not a hardware designer and it shows. What if there's a page fault during the copy? Do you handle it in the CPU?
That said, have a look at the RISC-V vector instructions (not yet stable AFAIK) and ARM's SVE2: both should allow a very efficient memcpy (among other things) much more easily than with current SIMD ISAs.
Do they manage alignment? Say a source string starting at offset 3 inside a dword, to a destination at offset 1? That's the issue. Not just block copy of aligned register-sized memory.
Page fault is irrelevant. It already can happen in block copy instructions.
So, no, they don't have anything like an arbitrary block copy that adjusts for alignment. Not surprising; nobody does. So we struggle in software, and have libraries with 11 bugs etc.
strncpy() suffers from its naming. It never was a string function in reality. It is a function to write and clear a fixed-size buffer. It was invented to write filenames into the 14-character buffer of a directory entry in early Unix. It should have been named mem-something, and people would never have come to the idea of using it for general string routines.
My alternative is to do a strlen for each string, then use memcpy memset memchr instead.
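As a minimal sketch of that approach (one strlen, then memcpy with explicit termination), assuming a hypothetical helper named `copy_string` that is not part of the standard library:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a standard function. Copies src into dst, whose
   capacity is dst_size bytes including the terminating 0, truncating if
   necessary. dst is always 0 terminated when dst_size > 0. Returns the
   length of src, so a result >= dst_size tells the caller it was truncated. */
size_t copy_string(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);

    const size_t src_len = strlen(src); /* the only strlen */
    if (dst_size == 0)
        return src_len;

    /* Leave room for the terminating 0. */
    const size_t copy_len = src_len < dst_size ? src_len : dst_size - 1;
    memcpy(dst, src, copy_len);
    dst[copy_len] = '\0';
    return src_len;
}
```

Unlike strncpy, this never leaves the destination unterminated and never pads the rest of the buffer with zeros.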
> I thought they're a safer strcpy/strcat
Let's look at the documentation for strncpy, from the C Standard:
"The strncpy function copies not more than n characters (characters that follow a null character are not copied) from the array pointed to by s2 to the array pointed to by s1."
There's a subtle gotcha there. It may not result in a 0 terminated string!
"If the array pointed to by s2 is a string that is shorter than n characters, null characters are appended to the copy in the array pointed to by s1, until n characters in all have been written."
A performance problem if you're using a large buffer.
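Both behaviours quoted above are easy to observe. A small sketch, with hypothetical helper names (`strncpy_unterminated`, `strncpy_zero_bytes`) chosen just for this illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if strncpy(dst, src, n) left dst[0..n) without a terminating 0. */
int strncpy_unterminated(char *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    strncpy(dst, src, n);
    return memchr(dst, '\0', n) == NULL;
}

/* Returns how many of the n destination bytes are 0 after strncpy,
   which makes the zero padding visible. */
size_t strncpy_zero_bytes(char *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memset(dst, 'x', n); /* pre-fill so the padding is distinguishable */
    strncpy(dst, src, n);

    size_t zeros = 0;
    for (size_t i = 0; i < n; i++)
        if (dst[i] == '\0')
            zeros++;
    return zeros;
}
```

With an 8-byte destination, copying an 8-character source leaves no terminator at all, while copying "abc" writes 3 characters followed by 5 zero bytes. With a 64 KB buffer that padding is 64 KB of writes for a 3-character string.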
Yeah, always prefer snprintf.
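For what it's worth, snprintf only helps if the return value is checked. A minimal sketch, where `format_pair` is a made-up helper and the "name=value" format is just an example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats "name=value" into dst.
   Returns 0 on success, -1 if the output was truncated or snprintf failed.
   When dst_size > 0, snprintf always 0 terminates dst. */
int format_pair(char *dst, size_t dst_size, const char *name, int value)
{
    assert(dst != NULL && name != NULL);

    /* snprintf returns the length the full output would have had,
       not counting the terminating 0, or a negative value on error. */
    const int needed = snprintf(dst, dst_size, "%s=%d", name, value);
    if (needed < 0)
        return -1;
    if ((size_t)needed >= dst_size)
        return -1; /* output did not fit and was truncated */
    return 0;
}
```

A truncated result is still a valid 0 terminated string, which is the safety win, but silently using a truncated string can be its own bug.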
The time functions? I'm just very careful using them.
Thank you Walter! I will be sure to internalize this. There are some terrific tips in here, such as using shorter array lengths for debug build and avoiding <= as a loop condition. And I don't recall ever seeing char signed, but now I'm terrified.
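On the signed `char` point, here is a small sketch of where it usually bites: comparisons against values above 0x7F, and the <ctype.h> functions, which require an argument representable as unsigned char (or EOF). The helper names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Uppercases s in place. The cast to unsigned char matters: where char is
   signed, a byte such as 0xE9 is negative, and passing a negative value
   other than EOF to toupper is undefined behaviour. */
void upper_in_place(char *s)
{
    assert(s != NULL);
    for (size_t i = 0; s[i] != '\0'; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}

/* Counts bytes with the high bit set. Written as s[i] > 0x7F without the
   cast, the test could never be true on platforms where char is signed. */
size_t count_non_ascii(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        if ((unsigned char)s[i] > 0x7F)
            count++;
    return count;
}
```

Whether plain `char` is signed is implementation-defined, so code like this can pass every test on one platform and misbehave on another.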
The trouble stems from the C Standard being copyrighted. Hence, anyone writing online documentation is forced to rewrite and rephrase what the Standard says. The Standard is written in very precise language, and is vetted by the best and most persnickety C programmers.
But the various rewrites and rephrases? Nope. If you absolutely, positively want to get it right, refer to the C Standard.
printf is particularly troublesome. The interactions between argument types and the various formatting flags are not at all simple.
Other sources of error are the 0 handling of `n` functions, and behavior when a NaN is seen.
IIRC, in the early days of the Commodore PET, it used a method of keeping track of strings that was fine in an 8k machine but was too slow in a 32k machine. They had to make a change that avoided quadratic time on the larger machine. So string handling in BASIC wasn't always that simple.
Ah, yes. I recall the luxury of a Commodore of my very own (a C128), after using PETs in school. We had a whole three of them at the time, with a shared, dual-floppy drive for the set.
Naturally, our teacher wisely pushed hard on figuring what you could out on paper first.
> Naturally, our teacher wisely pushed hard on figuring what
> you could out on paper first.
Specifically in the case of the Commodores (I grew up on a C128) I find this observation backwards. Sure, if you only had three machines for twenty students then time on the machine was valuable. But on those machines there was so much to explore with poke (and peek to know what to put back). From changing the colours of the display to changing the behaviour of the interpreter.
I think that I discovered the for loop at eight years old just to poke faster!
Everything was faster in C - it was compiled and BASIC was interpreted.
Better comparison would be between C and Turbo Pascal strings in DOS times. TP strings were limited to 255 characters but they were almost as fast as C strings, in some operations (like checking length) they were faster, and you had to work very hard to create a memory leak or security problem using them.
I learnt Pascal before C and the whole mess with arrays/strings/pointers was shocking to me.
UCSD and Turbo Pascal had it easy with the 255-byte strings. They had real strings, but these were compiler extensions. Real Pascal didn't have string support: you could only work with fixed-size packed arrays of char, and because the language was extremely strongly typed, two packed char arrays of different lengths were considered different types, so you had to write procedures and functions for every packed array size you used.
I find it strange that he complained about Pascal's lack of dynamic arrays, when the Pascal solution is to use pointers (exactly what C does for all arrays and strings anyway).
Many of his other points are solved by Turbo Pascal and Delphi/Object Pascal.
But of course nowadays there are better languages for real world programming. It's just a shame that there's nothing as simple and elegant for teaching programming (*).
(*) lisp is even more elegant, but it has a lot of gotchas and it's so far from mainstream that using it for teaching isn't a good idea IMHO