The advice given in this article is bad. Reducing the length of the copy by one will still fail to null-terminate the string if the source exceeds the destination length. You need to add
dst[sizeof(dst)-1] = 0;
or memset the array to zero beforehand. You are not guaranteed to have a zero at the end of the input buffer otherwise (neither local variable arrays nor malloc’d arrays are guaranteed to be zeroed).
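Spelled out, the pattern looks like this (the 64-byte dst is just an example; src is whatever you're copying from):

    char dst[64];
    strncpy(dst, src, sizeof(dst) - 1);   /* may leave dst unterminated if src is too long */
    dst[sizeof(dst) - 1] = 0;             /* so force termination explicitly */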
strncpy sucks for a second reason: it writes n bytes no matter what the source is. That means a call like the one sketched below will fill 16K of memory with zeroes despite only needing to copy a 6-byte string.
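The original example isn't shown above; presumably it was something along these lines, with the 16 KB buffer and "hello" source as stand-ins:

    char buf[16384];                       /* 16 KB destination buffer */
    strncpy(buf, "hello", sizeof(buf));    /* copies 6 bytes, then writes ~16 KB of padding zeros */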
strlcpy/strlcat do the right thing (copy up to n-1 bytes and null-terminate); shame they aren’t standardized. In their absence, I suggest snprintf instead:
snprintf(buffer, sizeof(buffer), "%s", src);
Because snprintf returns the number of bytes that would be written, it can be used to detect overlong input strings, reallocate the buffer as necessary, and also to implement efficient concatenation. It’s also surprisingly fast in most implementations.
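For instance, a sketch of that grow-and-retry pattern (copy_grow is a made-up helper name; it assumes *bufp is NULL or was heap-allocated):

    #include <stdio.h>
    #include <stdlib.h>

    /* Copy src into *bufp, growing the buffer if snprintf reports it would not fit.
       Returns the logical length of src, or -1 on error. */
    static int copy_grow(char **bufp, size_t *sizep, const char *src)
    {
        int needed = snprintf(*bufp, *sizep, "%s", src);
        if (needed < 0)
            return -1;                              /* encoding error */
        if ((size_t)needed >= *sizep) {             /* truncated: grow and retry */
            char *tmp = realloc(*bufp, (size_t)needed + 1);
            if (tmp == NULL)
                return -1;
            *bufp = tmp;
            *sizep = (size_t)needed + 1;
            snprintf(*bufp, *sizep, "%s", src);     /* second pass now fits */
        }
        return needed;
    }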
It's a shame that many of the solutions to these problems exist only as non-standardized calls, split between BSD and GNU. Personally I use asprintf quite a lot, and I recall strndup being quite useful for creating an identical copy of a buffer. strndup should perform about the same as allocating a buffer and then calling strlcpy, since it too copies at most n bytes and null-terminates.
Yes, it seems I lied when I said snprintf is surprisingly fast. For most practical purposes it's fine, but if you need to do string manipulation in a hurry, strcpy/strlcpy are going to be faster.
If you’re going to allocate the memory you might as well just use the library functions that allocate exactly the space you need and copy the string into it. These tricks are for when you use the stack as an ‘optimization’.
With a long src, it fails to null-terminate dest. dest[7] will be whatever the contents of uninitialized memory were, so reading dest as a string is likely to run past the end.
strlcpy has a better API, though sadly it's not standard on Linux.
Yes. “snprintf(dest, sizeof dest, "%s", src);” where dest is a char array is almost the programmer-friendly version of strncpy() that does exactly what one would expect, neither more nor less. It always '\0'-terminates its output as long as it is passed a size > 0.
The only issue with that idiom is that it is only defined when src is a '\0'-terminated string. Although it will only write the specified number of characters, it will read from src until the end of the string, and it invokes undefined behavior if src does not point to a well-formed string.
A mnemonic is that snprintf needs to return the length of the string that would have been output if there had been enough room (not counting the terminating '\0'), so it needs to compute the length of the string pointed by src.
A close variant, below, avoids this issue but instead has the problem that the printf star argument is typed as an int, which is narrower than size_t on a typical 64-bit compilation platform.
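The variant itself isn't shown above; presumably it was along these lines (a reconstruction, not the original code):

    /* The precision caps how many bytes snprintf reads from src, so src need not
       be NUL-terminated here -- but the star precision argument is only an int. */
    snprintf(dest, sizeof dest, "%.*s", (int)(sizeof dest - 1), src);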
> but instead has the problem that the printf star argument is typed as an int, which is narrower than size_t on a typical 64-bit compilation platform.
If you are using >2 GB buffers and anticipate >2 GB strings, it probably makes sense to track lengths explicitly and use memcpy() etc instead of string routines anyway.
I was more thinking of the case where >2GiB strings are not useful for normal use and the programmer does not anticipate them, but a malicious user can cause such strings to happen, for instance by sending them over the network in minutes or hours, causing unforeseen behavior.
However, while it may also be possible, in some code, for a malicious user to control the buffer size, the int precision argument, as used in this construct, derives from the buffer size, and not the input string.
If the user can control the buffer size, then yes, we get the very undesirable buffer overflow via overflow from positive to negative[0]:
A negative precision is taken as if the precision were omitted.
An annoyance with snprintf is that it has a failure mode, and in particular it can fail with ENOMEM under memory pressure.[1] This isn't merely theoretical as glibc implements snprintf by reusing the stdio machinery--effectively instantiating a temporary FILE structure, buffer, etc. glibc tries to stack allocate these objects using magic constants, but the code is incredibly complex. It's been awhile since I dove into the code, but IIRC snprintf could fail if you try to compose strings longer than the magic internal constants and malloc fails.
strlcpy() has the same semantics as snprintf(buf, sizeof buf, "%s", src) but without the failure mode. Because glibc is so stubborn, and rather than trying to wrap snprintf() to abort on OOM, I invariably include this simple implementation in a common header,
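The snippet itself isn't reproduced here; a minimal sketch along the same lines (not the commenter's actual code) might look like:

    #include <stddef.h>
    #include <string.h>

    /* Copies at most size-1 bytes, always NUL-terminates when size > 0, and
       returns strlen(src) so callers can detect truncation. */
    static size_t xstrlcpy(char *dst, const char *src, size_t size)
    {
        size_t len = strlen(src);
        if (size > 0) {
            size_t n = len < size ? len : size - 1;
            memcpy(dst, src, n);
            dst[n] = '\0';
        }
        return len;
    }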
which is substantially simpler than the canonical version from OpenBSD.
A related problem with snprintf is the mixture of signed and unsigned types for communicating object size. Mixing signed and unsigned types is, IME, error prone, so if I have code that uses snprintf more than a few times I usually wrap it in a function that separates the status from the [logical] size return values. People complain that strlcpy is problematic because it similarly communicates status (i.e. truncation) and object size through the same channel. But 1) it's not nearly as error prone as mixing signed and unsigned types and 2) idiomatic use of strlcpy is easy to code and mentally parse, and I've never felt the urge to wrap strlcpy in a helper routine. If truncation is always a failure then I'll simply use a two-line routine that returns an error code. But more often than not I rely on silent truncation. IMO, C code shouldn't be doing complex string operations using native C strings[2], and where it does make sense to use C strings it's usually things like configuration values where garbage in is garbage out; if an overlong property name is truncated it's no different than if it was misspelled. (People seriously overestimate the utility of supporting dynamic object sizes everywhere, and underestimate the inherent complexity it causes, which is significant even in low-level languages like C++ or Rust that make it safer and more convenient.)
[1] Linux's overcommit doesn't save you from, e.g., process resource limits.
[2] In fact, IMO complex parsing and composition shouldn't even be done with any kind of generic string API. If you get to the point where you're parsing and composing highly structured text, you should be using proper techniques with specialized data types. Writing ad hoc string munging code is error prone and a maintenance nightmare in any language, including scripting languages. But if I must, then not only do I avoid C, I avoid any low-level, statically typed language. Scripting languages were literally invented for writing ad hoc string munging code.
Another important caveat to snprintf: because glibc's snprintf might use dynamic memory allocation it's not async-signal-safe. OpenBSD notably rewrote their snprintf implementation to be async-signal-safe (with the exception of floating-point formatting). They did this so extension functions like dprintf(), commonly used in signal handlers for debugging or logging, would be easier to implement, and also because a lot of software assumes that (or doesn't even consider whether) snprintf is async-signal-safe.
I wonder whether glibc's dprintf() is async-signal-safe....
The first is inefficient. The declaration statement writes eight zeros to dest. strncpy() then copies from src, stopping early if it finds a NUL, and pads the remainder with zeros so that a total of len bytes are always written to dest.
I think it's disingenuous to describe the behavior of strlcpy() as "silently truncating". strlcpy() is documented as "If the return value is >= dstsize, the output string has been truncated. It is the caller's responsibility to handle this."
Careful! Errno can be set even when there is not an error. Its value is only relevant when the return value of the function indicates an error.
That said, there are some functions where the possible return values can't indicate an error. In those cases errno should be set to zero before the function call, and a change during the call will alert you to an error.
Why it's like this I have no idea but it's pretty annoying.
> Why it's like this I have no idea but it's pretty annoying.
The bit about errno possibly being non-zero but not indicating error unless the function does? I guess because the called function can itself call functions / libs which set errno and properly handle those (or just not care). Without the "errno only has meaning if the function returns error" guard, every single function needs to always reset errno to zero before returning, which is annoying and will almost certainly not happen in the majority of cases.
When the function has no way to communicate error, there is no other option but to use errno. But it's good that it's the exception, I imagine.
Ok, but, we've already got that kind of foot-shooting enabled by strncpy(), and glibc can't remove it as it is part of ISO C. If glibc must contain a truncating string copy, I'd much rather have strlcpy than strncpy. Do you find that persuasive?
On the plus side, Drepper left Redhat in 2010 and has been out as the glibc DFL since 2012.[0]
Amusingly, a request for strlcpy was the very 2nd comment on that LWN article.
Unfortunately, glibc still hasn't added it — Roland rejected it again in 2015.[1] Also unfortunately, RH hired Drepper back in 2017. Because they hate their other employees, I guess. Fortunately, no one is eager to give him back the reins of glibc.
The C11 Annex K functions, strcpy_s et al., have a better API than even strlcpy, and are an (optional) part of the C11 standard. They are what you should be using.
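For reference, a sketch of the Annex K flavor (assuming an implementation that actually provides it, which few do; copy_name is a made-up example):

    #define __STDC_WANT_LIB_EXT1__ 1   /* opt in to Annex K where available */
    #include <string.h>

    void copy_name(char dst[32], const char *src)
    {
        /* Unlike strncpy/strlcpy, strcpy_s reports failure (and empties dst)
           rather than truncating when src does not fit in the 32-byte buffer. */
        if (strcpy_s(dst, 32, src) != 0) {
            /* handle the error */
        }
    }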
It's worth noting that strlcpy and strlcat are quite small, very stable, and can be brought into any program that needs them. Yes, it's an extra step but not an onerous one. Of course it's nice to autoconf them if practical.
Especially since `strncpy` takes a `size_t`, meaning that when `buflen` is zero, `buflen - 1` wraps around from `-1` to `SIZE_MAX` in unsigned arithmetic. I think OP just slightly misremembered what is on the man page:
If my count is good, this is a correction of a correction of a correction of the original article's correction. And the average developer is expected to get it right the first time? It's a miracle anything built in C actually works.
Yes. An average developer with knowledge of C will. Honestly, around my 3rd year I created a lib containing a struct for strings (well, two, and a couple of functions working with them) and I don't think I ever used an strX function again.
I was pretty proud of the "string for already allocated buffer" part of this lib tbh.
Anyway, I'll never take C for a four-hour coding challenge, but for bigger, non-web projects, C is really awesome, thanks to BLAS and to the fact that most languages implement a mostly painless C FFI.
I would add that while I do normally use regular C strings rather than my own library, I still can't remember the last time I used any of the random `str*` functions. I don't do much string composition anyway, but the little I do is almost always using `snprintf` or variants (which have a non-braindead API, unlike `strncpy`). Tons easier to read and much more effective than a bunch of `strcpy`, `strcat` and such.
The only thing I wish the standard library had is an `asprintf`, which would return an allocated buffer. I've written my own, but the easy version utilizing regular `snprintf` requires calling it twice (once to get the length, and once to actually put the string in the allocated buffer). A properly supported version would be much more efficient.
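A sketch of that two-pass approach (xasprintf is a made-up name, not the commenter's code):

    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Format into a freshly malloc'd buffer; returns NULL on error. */
    static char *xasprintf(const char *fmt, ...)
    {
        va_list ap, ap2;
        va_start(ap, fmt);
        va_copy(ap2, ap);
        int len = vsnprintf(NULL, 0, fmt, ap);         /* pass 1: measure */
        va_end(ap);
        char *buf = (len < 0) ? NULL : malloc((size_t)len + 1);
        if (buf != NULL)
            vsnprintf(buf, (size_t)len + 1, fmt, ap2); /* pass 2: format */
        va_end(ap2);
        return buf;
    }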
Huh, well that's embarrassing. It looks like it's still in there too. I googled the man page but I guess I saw an older version. I might submit a patch if I find time. It should really be something like this:
if (n > 0) {
strncpy(buf, str, n - 1);
buf[n - 1] = '\0';
}
Of course, the error isn't super huge considering if you're using a zero-length buffer as a null-terminated string you're going to have other problems. But they still do check for 0, so IMO it's worth correcting.
He's assuming that dest is zero-initialized, which I believe it will be as written. However, he should mention that because there will be plenty of similar cases where dest will not be initialized.
edit: As asveikau notes below, this is not the case
No, only variables with static storage duration (globals and statics) are zero-initialized. Neither stack variables nor malloc'ed memory will be zero-initialized. OP should have memset the buffer to 0 or calloc'ed the memory.
I just don't understand why C hasn't been blessed with a proper string type yet. Object Pascal has had one (actually, several) for decades now and it doesn't hinder the language's ability to handle low-level memory manipulation (you can still manually copy string memory, convert them back/forth into raw Char pointers, etc.), and generally serves to make string handling much, much safer for most applications. It does however, result in slightly more memory consumption for the length/reference count tags and does incur some overhead in the form of compiler-generated reference count checks at the end of functions. But, IMO, the advantages outweigh the disadvantages for general-purpose C programming and you're always free to fall back to the more manual methods of handling character arrays.
So, am I missing something, or is there some concrete reason why this can't be implemented?
Strings-as-char-* are part of the public APIs of damn near every C library, including the standard library, so there's no way to change that idiom without breaking the world or causing a huge migration tax. The kind of people using C are often doing so specifically because they want to avoid that kind of churn and want a maximally-stable platform, even if what it's stabilized to is sub-optimal.
Yes, but none of that would need to change. I hate to keep harping on Object Pascal, but it really is a nice implementation: in OP, you can pass a string to a C API, such as the Windows APIs, like this:
where FileName is a String and pChar is the more traditional C-style pointer to an array of characters. The compiler will prevent you from passing a Unicode/Wide string to an API call that expects an ANSI string pointer, or vice-versa. So, interop with the system-level APIs of Windows or Linux is seamless and easy.
A built in slice type (i.e. a (pointer, length) tuple) would prevent many of the issues with C strings without departing so far from the language design.
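A rough sketch of what such a slice could look like (hypothetical; nothing like this exists in standard C):

    #include <stddef.h>

    struct slice {
        const char *ptr;   /* start of the character data */
        size_t      len;   /* number of bytes; no NUL terminator needed */
    };

    /* Taking a substring is just arithmetic -- no copying, no NUL writing.
       The caller is responsible for ensuring off + n <= s.len. */
    static struct slice subslice(struct slice s, size_t off, size_t n)
    {
        struct slice r = { s.ptr + off, n };
        return r;
    }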
If done from the beginning, slices could have gotten the `T[]` syntax and array-pointer decay could have been replaced by the easier to use and safer array-slice decay. But of course backwards compatibility prevents this now.
Backwards compatibility doesn't prevent adding a slice type with specialized syntax. The stumbling block is 1) defining sane semantics agreeable to a majority of people and 2) implementing it in a major implementation.
I hold out hope it'll happen, but I'm probably in denial. The fact that VLA function parameters were made optional in C11 doesn't bode well, but perhaps that's because it's incomplete--syntax and semantics stops short of what's needed to spur adoption of safer APIs. With a proper slice construct that was easy to use and integrate into idiomatic C code then there'd be more demand for implementing and using the necessary VLA compiler machinery, if not VLAs themselves.
Easy, see how Go devs are reluctant to adopt features from other modern languages?
Travel back in time, when they were busy implementing C.
Compare how ESPOL, NEWP, PL/I, PL/S, PL/X, PL.8, BLISS, Algol 68 implemented arrays, strings and unsafe code blocks, and how C decided to go its own way.
BCPL was originally designed as a means to bootstrap CPL, not to be used alone.
See a pattern there?
Then AT&T could not sell UNIX, gave it away at a symbolic price of about $100 and the rest is history.
How often does C add things like this in general? I want to say... basically never?
You're free to roll your own String type that has the length. No clue why people don't just do that though (I assume many do, but obviously a ton do not).
Most somewhat larger C shops will have their own internal utility library or they'll use one of the more popular batteries included packages which typically takes care of memory allocation, string handling, lists and a bunch of other useful stuff.
:-) You're certainly right here, but given that everyone thought that C was "going to be sent to the farm" a while ago and that shows zero signs of being true any time soon, it might be in the best interest of everyone to add such improvements, as long as they don't affect any existing C code. As I stated in my other reply, the only issue that I can see with rolling one's own String type is the handling of the reference counting for implicit deallocation.
> given that everyone thought that C was "going to be sent to the farm" a while ago
For certain definitions of "everyone," maybe... I am no longer primarily a C programmer, and haven't been since around the end of 2013, and I'm still firmly in the "C isn't going anywhere" camp. I just happen to think this stance matches reality.
Just to clarify, I am definitely also not in the camp that thought that C was going away, rather that my general feeling is that most developers are afraid of it because of issues like this, and more sane semantics regarding one of the more widely-used aspects of any language would allow for greater adoption while not sacrificing reliability or backwards-compatibility. IOW, I think it would be a win-win.
C doesn't really support properly encapsulated user-defined types, so yes, one can create their own String type but it will still be a pain to use and error-prone.
One example is the stretchy buffer library that I saw posted recently here.
There are many object oriented string types for C in various independent GitHub repositories, but they're not really interesting for C programmers because they enjoy the simplicity of null-terminated strings and moving data around with `for` loops.
I understand that (one of my favorite books on my shelf is "C Interfaces and Implementations" that shows some of the cool stuff that you can do with any C implementation, including "proper" strings), but they're not something supported in the compiler, which is, unless I'm mistaken, necessary for the reference count checks.
My point is that implementing a new string type has zero effect upon existing C programs if they don't use the new string type, so I'm confused as to why it hasn't been done. If a C developer doesn't want to use them, then "no harm, no foul".
Well, then you have two string types. The language will instantly become complicated as we indecisively choose between two different string types for each function in our APIs.
I see no harm in adding more functions around the string type we already have, but as I mentioned in https://news.ycombinator.com/item?id=17248446, `snprintf` is the mother-of-all-string-functions that does everything you need, so not much else is needed.
I would argue that you still have one string type, while the traditional C "string" type is actually an array of characters, or a pointer to an array of characters. :-)
Re: snprintf - yes I saw that and it definitely does do most of the heavy lifting, but it still is something that the developer needs to handle manually (I know, I know, not everyone should be using C...).
No, char* is definitely a string type. By having another one, that would be two.
Believe me, working with C and C++ code and converting back and forth between std::string and char* is a nightmare. Let's not design that into the language itself.
It seems[1] that since C++11 .data() and .c_str() are the same function. c_str() is also documented as having constant complexity. If it made a copy, wouldn't it have to be linear?
In practice, all modern C++ compilers and standard libraries just set a null character to the c_str()[len] position and reallocate the string if the capacity of the string buffer cannot contain the extra byte. It is never a linear operation unless you're working with old or niche C++ compilers. In C++11 this is required in the standard.
Yes, it is easy to go back and forth between std::string and char* using the std::string constructor and ::c_str(). The "nightmare" part is that when interacting with C from C++, you can never work with the internal std::string data, so you have to manage copies of buffers and copy it back into a std::string every single time you interact with C functions. It's really nasty to look at in large quantities.
You can call snprintf twice, the first time with a NULL buffer and a zero length, and the second with a newly allocated buffer whose size is the return value of the first snprintf call plus one (for the terminating null).
Use strlcpy/strlcat instead.[0] strlcpy takes the full size of the destination buffer, limits the copy to N-1, and nul-terminates the result for you. It's like the "correct" example in TFA, but with less annoying boilerplate.
Some more verbose design/rationale for the really curious.[1]
Another thing to keep in mind is the sometimes surprising behavior of strncpy(large_buffer, short_string, sizeof(large_buffer))[2]:
If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written.
Strlcpy doesn't do that. Just use strlcpy. On Linux, it can be found in the libbsd package.
Probably back in the 1970s `(characters, \000)` looked "elegant", and `(counter, characters)`, wasteful.
This decision, if not the single most prolific, is likely one of the top 3 sources of security exploits in C code. All for the want of saving a few bytes.
Up until the early '90s, almost all the different Pascals had a (byte counter, character) string, which meant that if your string was longer than 255 bytes you had to do exactly the same kind of acrobatics that C did, and then some (for various reasons related to Pascal's type system).
Then they added 16-bit strings. And eventually 32-bit strings, which are useless for keeping in memory that 5GB file you process these days. I'm not up-to-date - does pascal have a 64-bit string type?
C, on the other hand, still uses the same error-prone and exploit-inducing strcpy, strcat and friends - and they work for those 12GB gene texts and .csv files.
C never looked elegant, but it was extremely effective in the limited memory / limited CPU days of yore, and that's why it won against e.g. Pascal.
> I'm not up-to-date - does pascal have a 64-bit string type?
Seems like an easy fix is a single type of string with dynamic size depending on the first byte, like UTF-8:
0bbbbbbb : 1 byte header for a string up to 127 bytes long
10bbbbbb + hh : 2 byte header for a string up to 16384 bytes long
110bbbbb + hhhhhh: 4 byte header for a string up to 512 MB long
1110bbbb + hhhhhhhhhhhhhh: 8 byte header for a string up 1.15 EB long
... etc. (so as not to repeat the mistakes of Pascal; who knows how long 1.15 EB will be enough for everybody)
You would optimize the functions for the fast path with a single bit test for short <128 byte strings, and end up with a sane, safe string until the end of time. With minimal memory impact and much faster than testing for nul terminators, strlen() all over the place etc.
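A sketch of how that fast-path decode might look (hypothetical layout; only the first two header sizes are shown):

    #include <stddef.h>
    #include <stdint.h>

    /* Decode the variable-width length header described above.
       Returns the string length and stores the header size in *hdr_size. */
    static size_t decode_len(const uint8_t *hdr, size_t *hdr_size)
    {
        if ((hdr[0] & 0x80) == 0) {            /* 0bbbbbbb: 7-bit length, fast path */
            *hdr_size = 1;
            return hdr[0];
        }
        if ((hdr[0] & 0xC0) == 0x80) {         /* 10bbbbbb + 1 byte: 14-bit length */
            *hdr_size = 2;
            return ((size_t)(hdr[0] & 0x3F) << 8) | hdr[1];
        }
        /* ... longer headers (110..., 1110...) handled similarly ... */
        *hdr_size = 0;
        return 0;
    }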
Sure it was, just not with the FOSS licenses of today, apparently you are too young to remember.
AT&T was forbidden to sell UNIX, so they provided the source code to universities at the symbolic price of $100, which was basically free when compared at how much something like VMS or z/OS would cost in licensing.
When Linux came into the scene, C was already on its way out, being slowly replaced by C++ on OS/2, Windows, Mac OS, Symbian, BeOS, NewtonOS as the way to go for writing applications. The latter three were even written in C++ (Newton also had NewtonScript).
In 1994 we were already teaching C++ to first year CS students at my university. Proper C++, following Bjarne's ARM book, already talking about RAII and stronger type checking.
It was UNIX FOSS with the assertion that portable software had to be written in C that eventually pushed C everywhere, even to the platforms that were already getting cosy with C++.
No, but some people do; and some C programs written 30 years ago do support it, but no Pascal program does (not from 30 years ago, not those written 20 years ago, not those written 10 years ago, and not those written yesterday unless Pascal now has 64-bit length strings).
Furthermore, using mmap (which was always the right way to do it), it doesn't matter if it's 512KB or 1GB or 5GB or 20GB, and is supported by C string practices but none of the Pascal strings (or those suggested on this thread).
On the other hand, the Pascal code that can't handle 5GB strings also doesn't have the same pervasive security risk from buggy string manipulation that C code does. Not a bad tradeoff, especially given that you probably want to work on substrings within that 5GB and slicing in C essentially doesn't work because null termination either doesn't exist for your substrings (making them incorrect and potentially unsafe, depending on what you're doing) or the substring null termination breaks your larger string and prevents it from actually behaving like a 5GB string.
Safe string manipulation in C devolves into always passing a char array and a count anyway, which is not coincidentally exactly what Pascal strings look like.
Indeed safe strings in C are hard to do, but the nul termination is not the end-all-be-all of C strings; indeed, there is often a way to specify maximum string length (e.g. the %.*s format in the printf family, strncpy). So mmapped strings are actually useful.
The bigger problem is embedded nuls, which can't easily be sidestepped in the C string framework.
Regardless, C won when the world was not yet well connected, and vulnerabilities mattered much less. It might not have won if the match was held today. The wake up calls were the Morris worm (intentional) and the AT&T outage (unintentional, and more of a logic error iirc) but by then it was too late.
Note how it's not a feature of a programming language per se, but rather a detail of implementation of a common data structure. Likely strings not being their own type but being directly represented as `char[]` also looked elegant and commendably parsimonious back in the day.
Can't blame people who worked on machines with 128 kilobytes of RAM for that. They likely did not expect that their well-meaning hacks would take over most of the computing world.
Also many of those micro-controllers are more powerful than the computers where we ran CP/M on, where we had plenty of system languages to choose from.
They have become overwhelmingly C due to synergy effects.
No default runtime checks for array bounds (allows to alter arbitrary data and often code), and no default runtime checks for bounds of stack allocation ("smashing the stack"). Both easily lead to RCEs.
There are many worse string formats than C's NUL terminated:
Last byte of string has bit 7 set.
First byte of string has length, so strings are limited to 255 bytes in length.
First two bytes of string has length, so strings are limited to 65535 bytes in length.
Strings are stored in fixed length buffers with space padding to the end.
Length prefixed strings are stored in a fixed length buffer, so you are limited to the buffer length. I think this was the case for PL/I "varying" strings.
Back in the day, C was better than PASCAL because it had strdup, meaning it had a heap and you could put strings in it.
C++ string is mediocre. It solves some problems, but what if:
You have very long strings and you are worried about heap fragmentation, so you are better off with something like a linked list of segments, each in its own malloc block. But can you extend std::string? Nope, oh well.
You want strings to be semipredicates. I mean that strings should be able to have a NULL value, as I can do with C (return NULL for 'char *'). Can std::string do this? Nope. Can it be extended? Nope.
"since C++17", I see.. I'm curious if it uses more space than "char *"
Also it's not great because I should be able to pass such a string through functions that expect std::string. A NULL string should act just like an empty string except that you can test it for NULL.
Yes. std::optional<T> allocates the T in-place. Just checked compiler explorer, and on g++ 8.1, sizeof(std::string) == 32, sizeof(std::optional<std::string>) == 40.
There isn't one (because what you want for long strings makes short string operations too slow, or because you need strings stored in a certain format or in a certain place for other reasons).
So I like the idea of polymorphism so you can adjust the implementation. C++ has polymorphism, but for some reason it is not enabled for strings (all the member functions would have needed to be virtual to allow it).
Truncation is not as bad as a buffer overflow. However, it is still not correct. You have to properly handle the case. And if truncating is the correct answer, make that explicit.
In practice, I almost never use fixed size buffers for strings unless I know the size at compile time.
Newer? I thought strncpy dates back to the time Unix filenames were 14 characters, max, adding padding zeroes when needed in some fixed-length kernel structures.
That’s also the reason strncpy always writes len bytes; not keeping garbage content in those 14-byte buffers allows the system to use memcmp to compare file names.
Looking at the comments in this post, I'm resigning myself that there simply is no correct solution for copying or concating strings in C. Null-terminated strings are a fundamentally broken concept. I think the long-term solution is simply to move to a different language (Rust, C++, D, Go, whatever) where we have the benefit of hindsight and have (pointer, length) string types, which solve all the problems null-terminated strings introduce.
You could also validate on interface transition and use your own type internally; which is often what happens when using another language and exporting C style library bindings.
I'm just shouting into the void here, but why does anyone find it acceptable that C is almost fifty years old - a half-century - and we still have new articles published about the correct way to copy memory. And then, immediately following them, comments in response to those articles saying the article is wrong and that you should actually do it this other way. Nobody has figured this out in 50 years?
A few new people have to learn this stuff. That's not just to maintain legacy code. Computers do not have nicely behaved, safe, garbage collected strings. Someone has to understand the code for how that stuff is bootstrapped, and that code is going to have gunky memory copies in it where a one word mistake will bring down the show.
C suffers from a macho culture, where many developers deeply believe that only others make mistakes.
Even Dennis was quite clear that C was not supposed to be used without help from lint (developed in 1979).
"Although the first edition of K&R described most of the rules that brought C's type structure to its present form, many programs written in the older, more relaxed style persisted, and so did compilers that tolerated it. To encourage people to pay more attention to the official language rules, to detect legal but suspicious constructions, and to help find interface mismatches undetectable with simple mechanisms for separate compilation, Steve Johnson adapted his pcc compiler to produce lint [Johnson 79b], which scanned a set of files and remarked on dubious constructions."
An arbitrary number of arbitrary bytes that you hope ends in a null, but you'll never know unless you check, and even when you check, do you really know? Was that null really where that string was supposed to end?
There's never going to be a single, universally valid and correct way to deal with a "type" that is structurally just a step above random garbage.
I'm with you. I started with C some ~20 years ago, and you learn very early on about null terminating strings. You are constantly thinking about it for every manipulation you do. This article isn't exactly ground breaking news.
It smells more like an amateur just got bit and decided to write up a blog about it.
It's safe (C99, C++11) and easily extendible. Format strings are fun!
Not the fastest, but if the bottleneck of your program is concatenating strings, just do it manually.
This is a safe, if slow, alternative for strncpy, but it does not safely replace strncat. The C standard does not define the behavior if the code "sprintf(buf, "%s some further text", buf);" is used.
and I meant that `snprintf` can replace that. But if you actually only have two strings, one with a larger buffer than what it contains, and one to be concatenated, then you can't use `snprintf` like that.
snprintf gives back the length it wrote (or would have written), which you can use to append more text:
char buf[1024];
size_t pos = 0;
int ret = snprintf(buf, sizeof(buf), "%s", src1);
if (ret >= 0 && (size_t)ret < sizeof(buf)) { pos += ret; } else { error(); }
ret = snprintf(buf + pos, sizeof(buf) - pos, "%s", src2);
...
C99 also introduced swprintf for wide char strings, with a different return value convention. Just to add to the pain when you change char to wchar.
Under swprintf, and related functions, %s still takes a char pointer, not wchar_t! So when you make everything wide, you have to edit all the %s to %ls.
Worth noting that strncpy doesn't stand for secure string copy or anything like that. Using strncpy for copying strings would be a mistake, even if technically you can do that.
Rather, it's a fixed-size string copy function. This structure is very rare in regular environments, but it can come up in embedded environments. For instance, if you want to have a string in a binary file which is at most 10 bytes, you may want to avoid storing the termination byte when the string is exactly 10 bytes long. For instance, such a structure was used in UNIX to store file names, as they used to be limited to 14 bytes, and storing the terminator would be a waste of space.
I like learning about these caveats, but I have been asked tricky stuff like this in interviews before with gets() and the like.
As a person who interviews other people, I find that it's waaay more valuable that someone is generally aware that they should watch out for this class of pitfalls than that they know any specifics about a given function.
I've met people who basically had memorized the description of this phenomenon for gets(), but then their preferred solution was just to replace it with fgets() but then they don't know about checking for newlines or have any thoughts on what to do when individual lines are too long.
I'd much rather hire someone who says to herself, "Oh, I need to read some characters from an input source using C. RED ALERT! Let me really research the specifics here."
Instead of someone who thinks, "Oh, I need to read some characters from an input source using C. Good thing I memorized that trivia about gets() and can totally solve this in the best way immediately with the highest upvoted Stack Overflow solution of fgets() that I didn't bother to deeply grok."
I find that when interviews are geared towards puzzle solving or esoteric trivia, the people who do well are mostly of the second type (the ones I wouldn't want to hire).
Whereas someone of the first type might flounder around and struggle in a 20-minute programming task to process strings in C, directly because that person cares more about having a bigger picture point of view of what's actually going on rather than esoteric memorization of specific function signatures and usage mechanics.
In other words, if I gave some kind of C string processing question in an interview for 20-30 minutes, one very excellent answer should be, "sorry man, not gonna try to do this in 20 minutes because in reality I know there are string handling landmines I would need to research and slowly process, and I would never believe this is worth committing to memory for a short interview."
Interesting how the "solutions" to the buffer overflow problems don't provide for all of the modern assumptions of programming with strings. I would love to know the history of the development of strncpy.
I suspect that strncpy was intended for filling in fields in records to be written to files, where you want all the bytes to be clean, with no random junk (potentially sensitive) after the null byte. For instance struct utmp in Unix or something of its ilk.
Exactly. Or the fixed-length directory entries in old Unix file systems. If the file name is exactly 14 chars long, you don't care about NUL termination but if it is any less you want to zero out the remaining bytes. Strncpy is made for that.
It's not restricted to writing to disk. When these structures cross the userspace/kernel boundary on a system call, you really don't want to leave uninitialized bytes following the NUL terminator and return them to some user process.
The modern use case for strncpy is for filling in the .sun_path member of struct sockaddr_un. Most people assume that the path needs to be NUL terminated, but the BSD Sockets API actually relies on the declared sockaddr length parameter. It's not superfluous and the kernel will only read .sun_path up to the end of the declared size of the sockaddr structure; it doesn't expect NUL termination though it will obey internal NUL termination.
Moreover, the statically declared size of .sun_path in the libc headers doesn't limit the maximum length of the path. On most implementations you can create domain socket paths larger than this. Indeed, when you use an API like getsockname() you normally should check for truncation by comparing the returned sockaddr length with the size of the buffer you passed. Just like with snprintf() and strlcpy(), if the returned logical length is greater than your buffer size the path was truncated. IIRC, not all implementations (or any?) include a NUL byte as part of the length so you can very well end up with a .sun_path that isn't NUL terminated if your buffer only barely fit the path. Likewise if you didn't 0-initialize the path buffer and the actual path was shorter, though IIRC kernels handle this second case differently--some might NUL terminate for good measure if there's space.
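A sketch of that idiom (the helper name and the length clamping are illustrative, not taken from the comment):

    #include <stddef.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    int bind_unix(int fd, const char *path)
    {
        struct sockaddr_un sa;
        socklen_t len;

        memset(&sa, 0, sizeof(sa));
        sa.sun_family = AF_UNIX;
        /* No NUL terminator is required by the kernel; the declared length
           passed to bind() tells it how much of sun_path to read. */
        strncpy(sa.sun_path, path, sizeof(sa.sun_path));
        len = offsetof(struct sockaddr_un, sun_path) + strlen(path);
        if (len > sizeof(sa))
            len = sizeof(sa);
        return bind(fd, (struct sockaddr *)&sa, len);
    }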
Furthermore, on Linux, there is an extension: the first byte of sun_path can be null. In that case, the rest of the path is still valid up to the given length and specifies an "abstract address": it's a namespace outside of the filesystem. Sockets bound to abstract addresses automatically disappear on the last close.
This drives home the idea that "damn it, this is not a null terminated string; here is the null-byte-based extension to prove it!"
Not even this works. They forgot to force the last byte as a NULL, which is a classic bug in C. Either that or memset the char array before using it. But what the blog poster did is a pure bug.
There is no absolutely foolproof way to code in C such that no matter how someone changes the program, they will be spared from making a mistake.
If we just program for today, sizeof buffer is much better than proliferating a preprocessor constant that may or may not correctly reflect the object being overwritten.
For the silly mistake of changing an array to a pointer without taking care of sizeofs, GCC gives us some diagnostics:
-Wsizeof-pointer-memaccess
Warn for suspicious length parameters to certain string and memory
built-in functions if the argument uses "sizeof". This warning
warns e.g. about "memset (ptr, 0, sizeof (ptr));" if "ptr" is not
an array, but a pointer, and suggests a possible fix, or about
"memcpy (&foo, ptr, sizeof (&foo));". This warning is enabled by
-Wall.
-Wsizeof-array-argument
Warn when the "sizeof" operator is applied to a parameter that is
declared as an array in a function definition. This warning is
enabled by default for C and C++ programs.
Well, this is not very helpful advice, because strncpy(a, b, sizeof(a)-1) is in no way safer than strncpy(a, b, sizeof(a)); neither is guaranteed to be 0-terminated. And from malloc(), as in the examples, comes no 0-terminated buffer, but random garbage memory. What would be safer is to always 0-terminate the buffer after copying, and to use the simplest copy possible:
strncpy(a, b, sizeof(a));
a[sizeof(a)-1] = 0;
But this is more boilerplate and hence more error-prone.
Even safer, use strlcpy() (if available) or snprintf() which both 0-terminate (except under Windows, maybe). (But beware when preparing something for copying from trusted to untrusted: strncpy() clears the rest of the buffer while strlcpy() and snprintf() do not, so you might leak info via uninitialised memory behind the end of the string if you copy out that buffer across a trust boundary. Actually, the author's 'sizeof()-1' solution is less secure in this context.) So, use:
snprintf(a, sizeof(a), "%s", b);
And don't tell me anything about speed, please. Your main concern with C is not micro optimisations but robustness and avoiding undefined behaviour (and that snprintf() is not too slow).
And for multiple concats, use multiple snprintfs(), like so:
char *i = a, *e = a + sizeof(a);
i += snprintf(i, e-i, "%s", b1);
i += snprintf(i, e-i, "%s", b2);
i += snprintf(i, e-i, "%s", b3);
This is the most concise way I know to write this that works without buffer overflow (your main enemy, even more vile than missing 0-termination), without thinking too much, without writing too much boilerplate, and that is relatively robust against breaking in code restructuring (like, appending more stuff in the middle). The idiom also resembles a bit old style C++ iterators ('i' and 'e').
Oh, and a truncated string is usually not good anyway, be it 0-terminated or not. So you do need to check for that after all that stringing stuff:
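The check itself is missing above; presumably it was something like this (a reconstruction, not the original):

    if (i > e - 1) {
        /* at least one snprintf truncated its output; handle it */
    }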
Don't miss that '-1' there. Off-by-one is another enemy to know well. And despite that check handling missing 0-termination, do not be tempted to fall back to strcpy(), because missing 0-termination is bad(tm).
Phew!
C is bad with strings. The above resembles C++ iterators ('i' and 'e') and works fine with any good snprintf implementation (i.e., probably not under Windows).
And do not copy structs with memcpy, just assign them! memcpy() is for arrays only. This is not going to go away, is it?
Mostly good advice, but the multiple `snprintf` example is wrong - it contains a buffer overflow. snprintf returns the number of characters that would be written, so when you do
i += snprintf(i, e-i, "%s", b1);
i would end up past e if b1 is overlong. Then in the next line
i += snprintf(i, e-i, "%s", b2);
e-i is negative, but snprintf takes a size_t so this will overflow badly.
The best solution, AFAICT, is the following:
int res;
res = snprintf(i, e-i, "%s%s%s", b1, b2, b3);
if (res < 0 || res >= e-i) {
    /* handle error / overflow */
    return -1;
} else {
    i += res;
}
/* subsequent snprintf's here */