Git's list of banned C functions (github.com/git)
876 points by muds on March 4, 2021 | 613 comments



It's really wild, as a person coming from other languages who has written maybe ten lines of C in his life, that the functions that seem to be massive footguns in C are, like, "format a string" or "get time in GMT." That's... really scary.


Unfortunately, much of the pain with C surrounds dealing with strings. It's been a bit of a theme on Hacker News for the past few days, but it's actually a pretty good spotlight on something I feel is not always appreciated - strings in C are actually hard, and even the safest widely available functions like strlcpy and strlcat are still only good if truncation is a safe option in a given circumstance (it isn't always).
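
To make the truncation point concrete, here is a minimal sketch of handling it explicitly, assuming a platform that provides strlcpy (BSD out of the box, libbsd on Linux); the helper name is made up:

    #include <string.h>   /* strlcpy lives here on BSD; on Linux it may come from <bsd/string.h> */

    /* Hypothetical helper: copy src into dst, treating truncation as an error. */
    static int copy_or_fail(char *dst, size_t dstsize, const char *src)
    {
        if (strlcpy(dst, src, dstsize) >= dstsize)
            return -1;   /* src did not fit; the caller must decide what to do */
        return 0;
    }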

(~~Technically~~ Optionally, C11 has strcpy_s and strcat_s, which fail explicitly on truncation. So if C11 is acceptable for you, that might be a reasonable option, provided you always handle the failure case. Apparently, though, they are not usually implemented outside of the Microsoft CRT.)
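
And a sketch of what handling the failure case with the Annex K functions might look like, on the rare implementation that actually provides them (this follows the C11 wording rather than the MSVC variant):

    #define __STDC_WANT_LIB_EXT1__ 1   /* request Annex K, if the implementation offers it */
    #include <string.h>

    /* Sketch only: returns 0 on success, non-zero if the copy did not fit
       (strcpy_s reports that as a runtime-constraint violation). */
    static int checked_copy(char *dst, rsize_t dstsize, const char *src)
    {
        return strcpy_s(dst, dstsize, src) == 0 ? 0 : -1;
    }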

edit: Updated notes regarding C11.


Whenever I review C code, I first look at the string function uses. Almost always I'll find a bug. It's usually an off by one error dealing with the terminating 0. It's also always a tangled bit of code, and slow due to repeatedly running strlen.
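
A hypothetical example of the kind of thing that turns up (not taken from any particular review): the size check is off by one because it forgets the terminating 0, and strlen/strcat rescan the string on every pass:

    #include <string.h>

    /* Hypothetical snippet, not from any real review. */
    void join_words(char *words[], size_t n)
    {
        char buf[64];
        buf[0] = '\0';
        for (size_t i = 0; i < n; i++) {
            /* off by one: when the lengths sum to exactly sizeof buf, the
               terminating 0 lands one byte past the end of the buffer */
            if (strlen(buf) + strlen(words[i]) <= sizeof buf)
                strcat(buf, words[i]);   /* and strcat rescans buf every time */
        }
        /* ... use buf ... */
    }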

But strings in BASIC are so simple. They just work. I decided when designing D that it wouldn't be good unless string handling was as easy as in BASIC.


In the case of C, it's a design decision Dennis Ritchie made that came down to the particular instruction set of the PDP-11, which could efficiently process zero-terminated strings.

So a severely memory-limited architecture of the 70s led to the blending of data with control - which is never a safe idea; see naked SQL. We now perpetuate this madness of nul-terminated strings on architectures that have 4 to 6 orders of magnitude more memory than the original PDP-11.

It's also highly inefficient, because the length of a string is a fundamental property that must be recomputed frequently if not cached.

Bottom line, unless you work on non-security sensitive embedded systems like microwave ovens or mice, there is absolutely no place for nul-terminated strings in today's computing.


Mr. Bright, I just want to thank you for creating D.

It is by far my favorite language, because it is filled with elegant solutions to hard language problems.

As a perfectionist, there are very few things I would change about it. People rave about Rust these days, but I rave about D in return.

Just wanted to say thanks (and that I bought a D hoodie).


Your words have just convinced me to try out D. Maybe some good will come out of it :)


Thanks for the kind words!


Hello Walter! All things considered, you are probably the best person to ask for tips on string handling in C.

Would you mind sharing the things that you look for, from the obvious to the subtle? I would love to see some rejected pull requests if possible. If I were writing C under your direction, what would you drill into me?

Thank you, it is an honour to address you here.


1. whenever you see strncpy(), there's a bug in the code. Nobody remembers if the `n` includes the terminating 0 or not. I implemented it, and I never remember. I always have to look it up. Don't trust your memory on it. Same goes for all the `n` string functions.

2. be aware of all the C string functions that do strlen. Only do strlen once. Then use memcmp, memcpy, memchr.

3. assign strlen result to a const variable.

4. for performance, use a temporary array on the stack rather than malloc. Have it fail over to malloc if it isn't long enough. You'd be amazed how this speeds things up. Use a shorter array length for debug builds, so your tests are sure to trip the fail over. (See the sketch after this list.)

5. remove all hard-coded string length maximums

6. make sure size_t is used for all string lengths

7. disassemble the string handling code you're proud of after it compiles. You'll learn a lot about how to write better string code that way

8. I've found subtle errors in online documentation of the string functions. Never use them. Use the C Standard. Especially for the `n` string functions.

9. If you're doing 32 bit code and dealing with user input, be wary of length overflows.

10. check again to ensure your created string is 0 terminated

11. check again to ensure adding the terminating 0 does not overflow the buffer

12. don't forget to check for a NULL pointer

13. ensure all variables are initialized before using them

14. minimize the lifetime of each variable

15. do not recycle variables - give each temporary its own name. Try to make these temporaries const, refactor if that'll enable it to be const.

16. watch out for `char` being either signed or unsigned

17. I structure loops so the condition is <. Avoid using <=, as odds are high that it will result in a fencepost error

That's all off the top of my head. Hope it's useful for you!
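
A minimal sketch of the tip-4 pattern (the names and sizes here are made up for illustration), folding in tips 2, 3 and 12 along the way:

    #include <stdlib.h>
    #include <string.h>

    #ifdef DEBUG
    #define TMP_LEN 16        /* small in debug builds so tests exercise the malloc path */
    #else
    #define TMP_LEN 1024
    #endif

    void process(const char *s)
    {
        const size_t len = strlen(s);   /* strlen once, result in a const (tips 2 and 3) */
        char tmp[TMP_LEN];
        char *p = tmp;

        if (len + 1 > sizeof tmp) {
            p = malloc(len + 1);        /* fail over to the heap for long inputs */
            if (p == NULL)
                return;                 /* handle the NULL case (tip 12) */
        }
        memcpy(p, s, len + 1);          /* +1 copies the terminating 0 as well */

        /* ... work on the private copy in p ... */

        if (p != tmp)
            free(p);
    }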


So many potential pitfalls to string functions. But memcpy and friends can have pitfalls too.

I was working on a RISC processor and somebody started using various std lib functions like memcpy from a linux tool chain. I got a bug report - it crashed on certain alignments. Made sense - this processor could only copy words on word alignment etc.

So I wrote a test program for memcpy. Copy 0-128 bytes from a source buffer from offsets 0-128 to a destination buffer at offset 0-128, all combinations of that. Faulted on an alignment issue in code that tried to save cycles by doing register-sized load and store without checking alignment. That was easy! Fixed it. Ran again. Faulted again - different issue, different place.
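
A rough sketch of that kind of exhaustive test (the buffer sizes are illustrative, and a real harness should also verify that bytes outside the destination range were left untouched):

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    void test_memcpy(void)
    {
        unsigned char src[512], dst[512];

        for (size_t i = 0; i < sizeof src; i++)
            src[i] = (unsigned char)i;               /* recognizable source pattern */

        for (size_t soff = 0; soff <= 128; soff++)
            for (size_t doff = 0; doff <= 128; doff++)
                for (size_t len = 0; len <= 128; len++) {
                    memset(dst, 0xAA, sizeof dst);
                    memcpy(dst + doff, src + soff, len);
                    for (size_t k = 0; k < len; k++)  /* verify byte by byte, not via memcmp */
                        assert(dst[doff + k] == src[soff + k]);
                }
    }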

Before I was done, I had to fix 11 alignment issues. A total fail for whoever wrote that memcpy implementation.

What was the lesson? Well, writing exhaustive tests is a good one. Not blindly trusting std intrinsic libraries is another.

But the one I took with me was, why the hell isn't there an instruction in every processor to efficiently copy from arbitrary source to arbitrary destination with maximum bus efficiency? Why was this a software issue at all! I've been facing code issues like this for decades, and it seems like it will never end.

</rant>


The x86 does have a builtin memcpy instruction. But whether it is best to use it or not depends on which iteration of the x86 you're targeting. Sigh.


>why the hell isn't there an instruction in every processor to efficiently copy from arbitrary source to arbitrary destination with maximum bus efficiency?

Uh, you're not a hardware designer and it shows... What if there's a page fault during the copy - do you handle it in the CPU? That said, have a look at the RISC-V vector instructions (not yet stable AFAIK) and ARM's SVE2: both should allow very efficient memcpy (among other things) much more easily than with current SIMD ISAs.


Do they manage alignment? Say a source string starting at offset 3 inside a dword, to a destination at offset 1? That's the issue. Not just block copies of aligned, register-sized memory.

Page fault is irrelevant. It already can happen in block copy instructions.


No, they don't handle alignment, but they provide a way to write the code once, whatever the size of the implementation's vector registers.

As for a block copy instruction, AFAIK there's no such thing in RISC-V, for example.


So, no, they don't have anything like an arbitrary block copy that adjusts for alignment. Not surprising; nobody does. So we struggle in software, and have libraries with 11 bugs etc.


strncpy() suffers from its naming. It never was a string function in reality. It is a function to write and clear a fixed-size buffer. It was invented to write filenames into the 14-character buffer of a directory entry in early Unix. It should have been named mem-something, and people would never have come up with the idea of using it for general string routines.
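
A sketch of that original use case, with the struct simplified from memory (so treat the layout as illustrative): fill a fixed 14-byte name field, zero-padding the rest, with no guarantee of a terminating 0:

    #include <string.h>

    struct old_dirent {
        unsigned short d_ino;
        char           d_name[14];   /* fixed-size field, NOT necessarily 0-terminated */
    };

    void set_name(struct old_dirent *e, const char *filename)
    {
        /* copies up to 14 bytes and zero-fills the remainder; if filename is
           14 characters or longer, no terminating 0 is written at all */
        strncpy(e->d_name, filename, sizeof e->d_name);
    }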


If it respects the null terminator, then it is a string function.


It basically expects a string as the source and a fixed-size, not necessarily zero-terminated, buffer as the destination.


Super insightful list.

What would be the alternative for strncpy/strncat? I thought they were a safer strcpy/strcat, but now I need something to replace them.

I assume snprintf for sprintf, vsnprintf for vsprintf.

No idea what to do with gmtime/localtime/ctime/ctime_r/asctime/asctime_r, any alternatives for them too?


My alternative is to do a strlen for each string, then use memcpy memset memchr instead.
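
One way that might look for concatenation (a made-up helper; the caller decides what to do when the result won't fit):

    #include <string.h>

    /* Measure each string once, check the destination explicitly,
       then memcpy including the terminating 0. */
    static int str_append(char *dst, size_t dstsize, const char *src)
    {
        const size_t dlen = strlen(dst);
        const size_t slen = strlen(src);
        if (dlen + slen + 1 > dstsize)
            return -1;                       /* would not fit; nothing truncated */
        memcpy(dst + dlen, src, slen + 1);   /* +1 copies the terminating 0 */
        return 0;
    }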

> I thought they're a safer strcpy/strcat

Let's look at the documentation for strncpy, from the C Standard:

"The strncpy function copies not more than n characters (characters that follow a null character are not copied) from the array pointed to by s2 to the array pointed to by s1."

There's a subtle gotcha there. It may not result in a 0 terminated string!

"If the array pointed to by s2 is a string that is shorter than n characters, null characters are appended to the copy in the array pointed to by s1, until n characters in all have been written."

A performance problem if you're using a large buffer.

Yeah, always prefer snprintf.
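
For a plain bounded copy, the snprintf version might look like this; the return value is the length the full result would have had, so anything >= the buffer size means truncation:

    #include <stdio.h>

    /* Sketch: returns 0 on success, -1 if the result had to be truncated. */
    static int copy_with_snprintf(char *dst, size_t dstsize, const char *src)
    {
        int n = snprintf(dst, dstsize, "%s", src);
        return (n >= 0 && (size_t)n < dstsize) ? 0 : -1;
    }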

The time functions? I'm just very careful using them.


Also, most of the time people get serious performance regressions with strncpy(), as the function overwrites all the rest of the buffer with 0.

     char buffer[2000];
     strncpy(buffer, "hello", sizeof buffer);
writes "hello" and 1995 zero bytes to the buffer.


Thank you Walter! I will be sure to internalize this. There are some terrific tips in here, such as using shorter array lengths for debug builds and avoiding <= as a loop condition. And I don't recall ever seeing char signed, but now I'm terrified.

Thank you, have a great weekend!


char being signed used to be commonplace. But it is allowed by the C Standard, and it's best not to assume one way or the other.


Thank you for the great list. Could you give examples of 8. subtle errors in online documentation?


The trouble stems from the C Standard being copyrighted. Hence, anyone writing online documentation is forced to rewrite and rephrase what the Standard says. The Standard is written in very precise language, and is vetted by the best and most persnickety C programmers.

But the various rewrites and rephrases? Nope. If you absolutely, positively want to get it right, refer to the C Standard.

printf is particularly troublesome. The interactions between argument types and the various formatting flags are not at all simple.

Other sources of error are the 0 handling of `n` functions, and behavior when a NaN is seen.
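
As an illustration of the kind of subtlety involved (my own example, not necessarily one of the documentation errors meant above): snprintf with a size of 0 writes nothing but still returns the length the result would have had, which is how you size a buffer before allocating it:

    #include <stdio.h>
    #include <stdlib.h>

    /* Format an int into a freshly allocated string; returns NULL on failure. */
    static char *format_int(int x)
    {
        int needed = snprintf(NULL, 0, "%d", x);   /* size 0: nothing written */
        if (needed < 0)
            return NULL;
        char *buf = malloc((size_t)needed + 1);    /* +1 for the terminating 0 */
        if (buf != NULL)
            snprintf(buf, (size_t)needed + 1, "%d", x);
        return buf;
    }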


With so many gotchas, it irks me when they still teach C to undergraduates.


IIRC, in the early days of the Commodore PET, it used a method of keeping track of strings that was fine in an 8k machine but was too slow in a 32k machine. They had to make a change that avoided quadratic time on the larger machine. So string handling in BASIC wasn't always that simple.


It always blows my mind when I remember 8-bit computers had garbage-collected strings.


+1 for the PET mention since it was my first "computer". much overlooked in favour of the 64


Ah, yes. I recall the luxury of a Commodore of my very own (a C128), after using PETs in school. We had a whole three of them at the time, with a shared, dual-floppy drive for the set.

Naturally, our teacher wisely pushed hard on figuring out what you could on paper first.


  > Naturally, our teacher wisely pushed hard on figuring out
  > what you could on paper first.
Specifically in the case of the Commodores (I grew up on a C128) I find this observation backwards. Sure, if you only had three machines for twenty students then time on the machine was valuable. But on those machines there was so much to explore with poke (and peek to know what to put back). From changing the colours of the display to changing the behaviour of the interpreter.

I think that I discovered the for loop at eight years old just to poke faster!


I'm not sure if it was the original purpose of C, or if it's what made C popular, but compared to BASIC, processing strings in C was much faster.


Everything was faster in C - it was compiled and BASIC was interpreted.

A better comparison would be between C and Turbo Pascal strings in DOS times. TP strings were limited to 255 characters, but they were almost as fast as C strings; in some operations (like checking length) they were faster, and you had to work very hard to create a memory leak or security problem using them.

I learnt Pascal before C, and the whole mess with arrays/strings/pointers was shocking to me.


UCSD and Turbo Pascal had it easy with the 255-byte strings. They had real strings, but these were compiler extensions. Real Pascal didn't have string support: you could only work with packed arrays of char of fixed size, and as the language was extremely strongly typed, two packed char types of different lengths were considered different types, so you had to write procedures and functions for every packed array size you used.


Brian Kernighan on "Why Pascal is not my Favorite Programming Language" (https://www.lysator.liu.se/c/bwk-on-pascal.html) [1981].

Turbo Pascal wasn't released until 1983, if the wiki is to be believed.


I find it strange that he complained about Pascal's lack of dynamic arrays, when the Pascal solution is to use pointers (exactly what C does for all arrays and strings anyway).

Many of his other points are solved by Turbo Pascal and Delphi/Object Pascal.

But of course nowadays there are better languages for real world programming. It's just a shame that there's nothing as simple and elegant for teaching programming (*).

(*) Lisp is even more elegant, but it has a lot of gotchas and it's so far from mainstream that using it for teaching isn't a good idea IMHO


I learned C before Pascal and having to write so much code to deal with 255 character limits was kind of jarring.


I teach at a university as an external lecturer. Teaching strings in C is the hardest thing I have to do every time. The university decided to teach C to first-year students with no previous experience. My feedback was to do a pre-course in Python to let them relax a bit with programming as a concept, and then teach C in a second course.


> I teach at a university as an external lecturer. Teaching strings in C is the hardest thing I have to do every time.

But if you keep up the good work you will one day go from

  extern void *lecturer;
to

  static const lecturer;


More commonly

     volatile unsigned short lecturer;


Actually it usually ends up being much simpler than a compiled language. Something like this:

    delete from schema.hr.employee

    where employee.employee_type = 'Lecturer'

    having rownum = cast(dbms_random.value(1,count(*)) as int)
Most Deans' computers have it mapped to alt-delete. They don't even know what it does-- it's just called the "reduce budget function". Which is really unfortunate, because when they hit ctrl-alt-delete on a frozen system but miss the ctrl key by accident, some poor lecturer gets fired, and at the end of the semester the Dean says "Huh, wonder where that budget surplus came from."

Once an entire physics department was disbanded when their Dean's keyboard had a broken ctrl key.


In C++ we'd have to decide if lecturer needs to support move semantics.


Probably just delete.


Not if they're tenured. Then you can assume they'll never move.


Minor detail: lecturers don't get tenure.

The job role of 'professor' may be able to get tenure (I think these roles usually do), but 'lecturer' really means 'full-time temporary teacher, with a contract for a specified amount of time'.


I occasionally adjunct. What students call me at the beginning of the semester is always awkward:

Them: "Hello Professor"

Me: "Technically I'm not a professor."

Them: "Okay, we'll just call you Doctor."

Me: "Yeah, about that... not a doctor either."

Them: "So why are we paying you?"

Me: "Technically, you're paying the school. And the school is paying me... very little"

Them: "Answer the question"

Me: "Because I know stuff that you don't."

Mostly they still just call me professor and I feel awkward every time.


I knew someone who was TA'ing a class back when they were in grad school. I heard a story about him - to get ahead of this uncertainty he gave the class three options for what to call him:

1) 'Steve' (his first name)

2) 'Mr. Wolfman' (his last name)

3) 'Darth Wolfman' (the funny option, obviously not meant to be taken seriously)

Guess what the class overwhelmingly voted for? :)


I don't think you should feel awkward. I refer to all my teachers in emails as professor ( unless I want to list more detailed honorifics ). My current analytics guy is clearly very smart, seems to be in that adjunct zone, but I address him as professor out of sheer respect.


For all the complicated social protocols in that neck of the woods, this would be simple in Japan. You're just 先生 (sensei) and that's it.


If they ever ask "What do we call you?" you should answer,

"God-Boss."

(Pace Steven Brust.)


I teach first-years in Australia, where boys from private schools call me "sir". When I'm feeling mean, I tell them to drop and give me ten pushups.


> ten pushups

I'm guessing you don't teach computer science


OK. 10 push_backs() then.



Not the case in the UK at least.


    Professor(Professor&&) = delete;


I'm not really const. I'm definitely volatile depending on the budget. It's definitely a side gig.


I need a side gig, for shits and giggles. I miss uni a lot, for the community of it. Would you recommend it?


I love teaching students what I know. I would love it to be a full-time job. But then I realized I got it due to my work experience, so...


I can confirm, this is exactly what happened to me.


In my school, we had two days to understand the basics of text editors, git (add, commit, rebase, reset, push) and basic bash commands (ls, cd, cp, mv, diff and patch, find, grep...) + pipes, then a day to understand how while, if/else and function calls work, then a day to understand how pointers work, then a day to understand how malloc(), free() and strings work (we had to remake strlen and strcpy, and protect them). Two days, over the weekend, to do a small project to validate this.

Then on the Monday, it was makefiles if I remember correctly, then open(), read(), close() and write(). Then linking (and new libc functions, like strcat). A day to consolidate everything, including bash and git (a new small project every hour for 24 hours; you could of course wait until the end of the day to execute each of them). And then some recursion and the eight queens problem. Then a small weekend project, a sudoku solver (the hard part was to work with people you never met before tbh).

The 3rd week was more of the same: basic struct/enum exercises, then linked lists the next day, maybe static and other keywords in between. I used the B-tree day to understand how linked lists worked (and how pointer incrementation and casting really work), and I don't remember the last day (I was probably still on linked lists). Then a big, 5-day project, and either you're in, or you're out.

I assure you, strings were not the hardest part. Not having any leaks was.


This heavily filters for people who have had experience with programming in high-school or even before that, there's no way for a programming novice to pass that grueling routine.

And then people rhetorically ask themselves why students coming from economically disadvantaged households are under-represented in this industry (one of the best-paying industries in this day and age). Stuff like that has got to change.


> one of the best paying industries in this time and age

Medicine is still better paid, and better paid universally. Silicon Valley is really the outlier here; in most of Europe and the rest of the world, programmers don't get paid that much in comparison.


Software development is usually better paid in Poland than medicine. Medicine starts to pay well way later in life, and only in certain specialisations.


And that's ultimately why we had the most excess deaths in 2020 in the EU.


Medicine also requires, after college, medical school and a residency - typically 6 to 9 years work. Programming requires none of this.


In the US. In many other places, medicine is an undergraduate field of study, you're in a hospital from your first year, and by year 3 you're being paid.


> Programming requires none of this.

But it requires you to refresh your knowledge constantly so from this point of view it's similar


If you're arguing that medicine does not, then I hope you're doing software engineering.


> so from this point of view it's similar


In the Netherlands this seems to be true. However, as a programmer you can work from home in many cases, especially now. So suppose that a junior psychiatrist makes 5000 EUR gross in NL [1] and a junior developer 2600 EUR gross [2].

A few things though:

1. A psychiatrist has to commute 1 to 2 hours per day. So that salary is not for 8 hours per day, but 9 hours at minimum. Adjusting their salary to an 8-hour basis, it needs to be multiplied by 8/9, or by something like 8/10 for a longer commute.

2. The psychiatrist has to be on location. The cost associated with that is hard to quantify, but it is there. For example, I always sleep during the afternoon for 20 minutes, a psychiatrist can't do that. Also, I can take a break whenever I want, a psychiatrist can be on call for 24 hours straight in severe cases. Let's suppose this gives a cost of 1/16 as a multiplier (half an hour of extra work per day).

So the combined multiplier for a psychiatrist is at most 16/19, and their salary is then about 4200 EUR. This can be amazing or not so much, depending on your own personal preference. My personal multiplier is 0.8 on top of all of this, so for me a 5000 EUR salary is worth 3360 EUR if it comes from working as a psychiatrist.

As a developer I experience something different, which is:

1. I do not have to commute, I can if I want to, but don't have to.

2. I do not have to be on location, nor do I have a strict schedule for going client after client. I can take random breaks during the day if it helps me be more productive.

So a developer's salary of 2600 EUR is much more like an actual 2600 EUR in that sense. Moreover, my personal multiplier for being a developer is 1. There are some things I dislike and some things I absolutely love about being a dev (e.g. being a true netizen in the sense that you can randomly interact with APIs if you want to).

To conclude: the absolute values are far apart, but the relative values might not. It differs on a person by person basis, and I haven't discussed the whole picture of course (e.g. needing to stay sharp as a dev, I don't know how that works for psychiatrists).

[1] https://www.monsterboard.nl/vacatures/zoeken/?q=Psychiater&w...

[2] https://www.glassdoor.nl/Salarissen/junior-web-developer-sal...


At what age are you a junior developer and at what age are you a junior psychiatrist in NL? A bachelor's developer could be as young as 21 I guess, but at least for most jobs in medicine you can't work independently until much later. Maybe it's different for psychiatry?


It's not different for psychiatry. You need to have done a bachelor + master in medicine and on top of that a specialization. I don't know how long that takes though, but I wouldn't be surprised if it's 8+ years.


Junior developers aren't really a thing in many places. They're just developers.


In NL they are.


Medicine has been more poorly paid than FAANG software engineering in the last two places I've lived (South Africa, Australia)


FAANG engineers are also massively overpaid compared to your average software engineer in any random country.

When people on HN discuss salaries, or I see a job posting from a Silicon Valley company, I can't help thinking that we don't even pay our CTO that much. Frequently you could get two developers for the same price here in Denmark.


Those companies have one heck of a combined market position and it is all built with software. I would say their software engineers are paid more but I doubt that "overpaid" applies.

Think about what your CTO could do in that setting and realize that he's probably worth more to FAANG shareholders than to you hence the salary differential.

For the record, I do not work at a FAANG.


> Medicine has been more poorly paid than FAANG software engineering in the last two places I've lived (South Africa, Australia)

Interesting. I'm in South Africa, right now. The largest offer for a senior C# dev *right now* on www.pnet.co.za is R960k/a.

Twelve years ago, the GP that I was dating, who worked in a *state hospital* (i.e. not making as much as she could have in private practice) was making more than that.

I don't believe that doctors' salaries over the last 12 years have effectively been lowered. OTOH, if you know of places where they are offering more than R1.8m/a for senior developers, then by all means give me their contact details.


Happy to refer you to AWS! I was making over 1m rand (TC) in a mid-level non-dev position. I had friends in development making above the number you're talking about.


Medicine has other filtering systems.


Having gone through the same experience, I can tell you that it isn't necessarily the case. More often than not, those who had some programming experience in a high-level language would get discouraged by the difficulty and drop out.

In the end, it was mostly those who didn't get discouraged and who socialized with the other students that remained.

I myself did not have any programming experience before going through that ordeal.


My experience with C courses structured around automatically validated homework is that they filter out not only "the weak" but also people with previous (especially C on Unix) experience, because nobody with any kind of practical Unix experience will write code that passes these kinds of rigorous C-standard conformance and memory-leak checks; for practical applications, doing all that is not only unnecessary but also detrimental to runtime efficiency.


I think a passing test suite, no diff after clang-format, clean valgrind and clang-analyze checks are not too much to ask for. As long as the requirements are documented and the system is transparent and allows resubmission.

But I agree there is a risk of academic instructors going way overboard in practice, e.g. by flagging actually useful minor standard conformance violations (like zero length arrays or properly #ifdef'd code assuming bitfield order).


My aversion to such systems is primarily motivated by the fact that every such system I've seen somehow penalized resubmissions. I probably don't have anything against "you have to write a program that compiles with this gcc/llvm command line without producing any diagnostics and then passes this intentionally partially documented test suite". But in most cases the first part ends up meaning something like "cc -Werror -std=c89 -ansi -strict", where the real definition of what that means depends on what exactly the "cc" is, and the teachers usually don't document that and don't even see why it is required (i.e. you can probably produce some set of edge-case inputs to prove that gcc is or isn't a valid implementation of some definition of C, but the conjecture does not work the other way around).


In most of my courses that did something like this there was no resubmission.* The professor supplied a driver program, sample input the driver would use to test your program, expected output, and a Makefile template that gave you the 3 compilers + their flags that your program was expected to compile against and execute without issue. His server would do the compile-and-run for all 3 against the sample input and against hidden input revealed with the grade. He used the same compiler versions as were on the school lab computers.

* As a potentially amusing aside, a different course in a different degree program had a professor rage-quit after his first semester because he didn't want to deal with children -- he had a policy of giving 0s on papers with no name or class info on them, and enough students ("children") failed to do that correctly but complained hard enough to overturn the policy and get a resubmit.


You shouldn't underestimate the novice. The professors who run such weeder classes will have the data, though, so you don't have to believe anyone's experiences if you can instead ask a professor... For what it's worth, I'll add to the sibling comments and state that in my experience, too, prior programming experience is less correlated with success than you seem to think. (I had it, though I quickly found out after my first week "I thought I knew C; I do not know C.")

Those who have been subjected to such programs can also probably agree that the filtering of the first semester (and there is a filter, but again we think it's a fair one not dependent on prior programming experience or other such privilege) ends up normalizing everyone, for the benefit of everyone. For the people who started at 0, they're now Somewhere nearby everyone else, ready for the next (harder) material, and for people who started with some "advantages" they've discovered they... are also now Somewhere, not Somewhere Else ahead of everyone like they might have been at the very start. In these sorts of programs, people with prior experience find that they couldn't sleep through their classes and get A's like they might have pulled off in high school, their advantages were not actually that significant after all, and indeed some from-nothings can and do perform better than they.

For anyone who just wants access to the software industry's wealth, I'd encourage them to ignore college entirely. There may be a case-by-case basis to consider college, especially if you need economic relief now in the form of scholarships/grants/loans only accessible through the traditional college protocol, but in general, avoid.

(If you want something besides just access to the wealth, you have more considerations to make.)


I went through a very similar gauntlet in my first undergrad computers class. I didn’t know anything about programming or linux, but it was fine.

I think the filter is more effective for finding those who can quickly adapt, learn, and grok a methodical mindset. Not necessary characteristics to be a programmer, but necessary characteristics to excel at programming.


Does it though, or is it more survivorship bias and maybe lucking into finding someone who will spend hours mentoring you?

I've mentored quite a few first semester students (in my spare time, to help. Not as a job) and there is no way some of them would've passed without serious help.

At some point I used to think privately that CS should have a programming test as an admission exam, because these students did drag everyone down. If medicine and law have admission restrictions, why not CS too?

But I have changed my opinion because I think everyone deserves a real opportunity, and our school system does not provide a level playing field sadly. (Also the medicine & law admission criteria are GPA based and that is the last thing I'd want for CS.)

Anyway the real filter was always maths.


I don't understand your concern. How is teaching programming in school discriminatory? Would it be better to not teach programming at all?


Depends how you teach it. Imagine teachers at primary schools started teaching English by analyzing Shakespeare.

Or imagine math was taught by giving kids all the axioms and requiring them to derive the other rules needed to solve tasks as needed :)

Kids from well off families would be ok - it would just be considered another random thing you have to teach your kids to help them make it.

But other kids would suffer and think "English and math is not for me".


I understand what you mean. This has nothing to do with programming, it's a general (and difficult!) concern regarding everything that is taught at schools.

By that same argument, schools should not teach anything that is not widely known by 100% of the parents of each kid. Otherwise, it would be discrimination to those kids whose parents cannot help. I disagree very strongly with this principle.

I have two kids, and the best things that they learn in school are precisely those that I'm unable to teach them. For a start: mastery of the language, since I'm not a native speaker of the language of the place where we live. I would be frankly enraged if the school lowered its language requirements to accommodate the needs of my kids, who do not speak it at home!


> By that same argument, schools should not teach anything that is not widely known by 100% of the parents of each kid. Otherwise, it would be discrimination to those kids whose parents cannot help. I disagree very strongly with this principle.

Not at all what I mean. I mean schools (at least primary schools) should be designed for the top 80% or 90%, not for the top 10% or 20%. You can never get to 100%, but giving up from the start and going for 20% makes no sense.

You should expect people taking math at university to be able to solve linear equations, and explaining it there is a waste of time; but you shouldn't expect kids in primary school to be able to do the same, and it is your responsibility to prepare them in case they want to pursue an academic career.

If public schools teach linear equations it's ok to assume that knowledge at university.

If they don't - it's not.

It should be the same with teaching programming, and anything else is just funding rich people's kids' education with everybody's taxes.

The whole point of common public low-level education is to maximize the number of people participating in the economy. It's much better if everybody can read and write. Whole industries are impossible without this. And so is democracy.

It's the same with basic programming and math literacy. It benefits the whole society if vast majority of people have it.

If you "weed out" 60% or 80% of population just because they happen to be born in the wrong environment or went to the wrong school - you lose massive amounts of money and economic/scientific potential. Then you have to import these people from countries which don't fuck their own citizens in such a way.


I agree that public school should not leave any kids behind. I also want my taxes to be raised to fund a higher-level education for kids who may find it useful, even if it's only a small percentage of kids.


Sure but that's only fair if the assumed skills at higher levels are attainable for an average person that went to a public school.

BTW "no child left behind" isn't practical, there are people who can't learn basic stuff no matter how hard you try. But "less than X% kids left behind" is for some low value of X.


paganel doesn't think it's discriminatory to teach programming. Rather, he thinks orwin describes a class that's too fast-paced - a class that wouldn't teach much, and would mostly weed out kids who hadn't taught themselves before they reached college.

He fears while a professor might imagine they're weeding out people who lack 'dedication' or 'aptitude' they're actually weeding out people who didn't grow up with a PC at home.


My professor (head of CS dept) referred to these as 'weed out classes'.

If that sounds evil, imagine the grief, wasted money, time, frustration, and stress of letting people get 3-4 years into computer science and then drop out because it's fucking hard.

So my second-hardest classes were freshman year. 3rd year (micro-architecture and assembler) finally bested them.


I don't really get the correlation between household income and programming experience in high school.

Their parents can't afford a laptop? They can't afford an Internet connection? The kids don't have a good place to learn in their house? They don't have time?

Is programming affected more than other subjects like math, English/grammar, science, etc?


> Their parents can't afford a laptop?

Yes! There are millions of kids in the US whose parents can't afford a cheap $300 laptop. The federal government pays for school lunches because there are so many kids who wouldn't even be getting decent food otherwise.

> They can't afford an Internet connection?

See above. Also, there are many places in the US where getting broadband service is very difficult. Including places just an hour outside of Washington, DC. My parents were only able to get conventional broadband service a few years ago. Prior to that they paid exorbitant fees for satellite internet service with a 500mb per month cap.

> The kids don't have a good place to learn in their house?

Imagine being a kid with 3 siblings and your parent(s) living in a studio apartment. Or a kid that doesn't have a stable "home" at all.

> They don't have time?

That can be an issue too, depending on age. A teenager may be working outside of school hours to help take care of the family's financial needs.


> Their parents can't afford a laptop? They can't afford an Internet connection? The kids don't have a good place to learn in their house? They don't have time?

All of the above, and it's surprising this isn't obvious. It may be hard to notice or internalize if you've never seen it and only know privilege, but possession of all or even some of those things is not a guarantee for everyone. Believe it or not, there are some who don't come home to a computer, caring (or even existent!) parents, stable meals, or free time.


It isn't obvious to me because I've never lived in the US. I was genuinely asking, not trying to shame people who can't afford a computer.

Maybe I didn't use the right words to formulate my question.


Thanks for clarifying the context of your question. Very helpful, and changes the tone completely. Half of the people in the U.S. "don't pay taxes", that is, don't make enough money to owe taxes. So that's one issue. I mentor a Hispanic kid whose mother's English was so weak, and her knowledge of 'the system' so limited, that she couldn't take advantage of programs to provide used computers to her kids, or low-cost Internet access to her household. And the $10/month for low-cost Internet access WAS out of reach. 10 people in a two-bedroom apartment was also their norm.


> I don't really get the correlation between household income and programming experience in high school.

> Their parents can't afford a laptop?

Holy crap, the amount of privilege shown off in just two sentences is absolutely astounding.

This may come as a shock to you, but a very significant number of people don't have a couple hundred dollars to buy a low-end used laptop. 40% of Americans would struggle to come up with $400 for an emergency expense [0], let alone save $400 for a laptop.

[0] https://www.cnbc.com/2019/07/20/heres-why-so-many-americans-...


> 40% of Americans would struggle to come up with $400 for an emergency expense [0], let alone save $400 for a laptop.

It actually doesn't say that, it says they don't have $400 in cash equivalents but may be able to produce it by selling "assets". So a person who keeps all their savings in CDs or investments also counts, although only for expenses you can't put on credit cards.


The working poor simply don't have the safety net of good credit or wealthy families that the vast majority of HN commenters do. If you're living paycheck to paycheck and get a $400 surprise expense, you don't have $400 just sitting in some money market account, or two shares of SPY they can just sell, because the poverty wages being paid to millions of working people leave zero margin to build any sort of financial security, leading to a kind of precariousness that is unimaginable to the educated professional with a comfortable upbringing. "Assets" means things like a wedding ring, some power tools, a computer, or maybe even a 1995 Dodge Neon; if you go to a pawn shop, you can see the sorts of things people pawn (or sell) when they desperately need $400 for an emergency expense.

They often take payday loans, mortgaging their next minimum-wage paycheck; since the next paycheck minus payment no longer covers their regular living expenses, they take another predatory loan or pawn another heirloom. 80% of people who take a payday loan have to renew it because they can't repay it. I have a deep personal dislike for Dave Ramsey, but he does a good job of explaining how even minor emergency expenses can lead to a cycle of debt and further despair. (https://www.daveramsey.com/blog/get-out-payday-loan-trap)

There is so much more instability and precarity in this country than most PMC people can imagine.


Only 19% of the people who would not be able to pay cash or its equivalent said they would be able to sell something. 29% said they would be unable to pay.

Figure 12 on page 21 of the underlying report - https://www.federalreserve.gov/publications/files/2017-repor... .


Do you think people living paycheck-to-paycheck, constantly having to decide which bill to pay and which to allow to collect a late fee, have investments or CDs?


74% of American families own a computer. The other 26% do not. Incomes are also highly clustered - chances are any given cohort of high school students has either 90% or more or 10% or fewer kids with computers, without a lot in between. Relatively poor districts may have a very general computer literacy class (typing, word processing, spreadsheets) but won't have a programming class, because the logistics of getting everyone adequate time at a computer to complete the work are impossible.

https://www.statista.com/statistics/756054/united-states-adu...


This may be less so now than it was 10 years ago, but I absolutely promise you that having a decent computer (not great, not a gaming pc, but just something that a kid can feel comfortable experimenting with) that is readily available (and not being shared with siblings) is absolutely a luxury.


In the UK, every student does Maths, English and Science to a basic level. Maths and English in particular were held up as non-negotiable if you ever wanted a job, I suspect so you had a reasonable level of literacy and numeracy to be able to count money, read letters, etc.

Conversely, programming was not available in my fairly middle-class school. In terms of money, we only have to look to the laptops schools are providing to students (or not depending on government funding) to see how many children don't have access to a laptop. A good place to learn can also be hard to find for large families in small houses which is sadly all too common for low income households.


The cost of computers has come down a ton, but it was a much bigger deal in the 90's and earlier. A lot of people didn't have computers at home. A decent x86 system (like a 486 with VGA etc) was at least $2500 or so. That's without any programming tools... compilers weren't free. When I meet fellow developers who didn't have computers growing up, I realize how privileged and lucky I was.


"Is programming affected more than other subjects like math, English/grammar, science, etc?"

Probably a bit more, as it is common to learn other subjects from a book, but learning programming without a computer... sounds hard.


Dijkstra probably would disagree, but his isn't a common opinion.


Ooh, the Epitech cursus. Nice.

Also, I'd say "not having segfaults" is the hardest thing to get right when you're going through that.


Eh, not really possible in my experience... more like ‘incidentally becoming a gdb wizard in order to be productive with C’!


Seeing valgrind come up with 0 leaks after like 10 hours straight on a lab was such a good feeling


F. This sounds too hard. I mean, I know how to turn code into money but I'd fail this.


I don't think this looks like a beginners' course though. My students have zero experience.


Let's face it, the Moulinette was the hardest part.


No, the hardest part is not punching the asteks in the face when you ask them for help on a problem that has stumped you for two hours, and they take one look at your code and go "C'est pas à la norme!" ("it's not up to the norm!") AND THEY WON'T EVEN TELL YOU WHERE.


Piscine ?


Most of the C I wrote was while in college. I think understanding the question "why are strings in C hard?" is a good gateway to understanding how programming languages and memory work generally. I agree with you, though, that teaching C as an introductory language is probably not the best — our "Programming in C" course was taken in sophomore year.

I wouldn't want to use it in my day job, but I'm glad that it was taught in university, just to give the impression that string manipulation is not quite as straightforward as it's made to appear in other languages.

The early days of Swift also reminded me of this problem – strings get even more challenging when you begin to deal with unicode characters, etc.


It's also because other languages have better-designed strings. D, Go, Rust, etc. have pointers too, but their string handling is based on slices and arrays, which are approximately 10,000 times less footgunny.


I am seeing Python becoming the go-to language for many academics because it's easy to hack something together that somehow works.

Unfortunately most of those developers don't care much about efficiency, and Python is inefficient out of the box compared to other high-level languages like Java [1] or C#. The OO Java courses circulating in academia lack modern functional (and, to be frank, educational) concepts and must be refreshed first.

I personally would recommend starting with Java and Maven because it's still faster than C# [2], open source, and has a proven track record in regard to stability and backwards compatibility. Plus, quickly introduce the Spring Framework and Lombok to reduce boilerplate code.

For advanced systems programming I suggest looking into Rust instead of C/C++.

And last but not least the use of IDE's should be encouraged and properly introduced, so aspiring developers are not overwhelmed by them and learn how to use them properly (e.g. refactoring, linting, dead code detection, ...). I recommend Eclipse with Darkest Theme DevStyle Plugin [3] for a modern look.

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

[2] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

[3] https://marketplace.eclipse.org/content/darkest-dark-theme-d...


I generally agree with you (especially on updating the Java courses), though I think it is important to teach C/C++ after some experience with a higher-level language, if for nothing else than the large amount of already existing code bases.

I also like the newfound interest in some FP languages; I, for example, had a mandatory Haskell course in first year — we did not cover monads in that course, but I think it is a great introduction for students to a different take from the more imperative world.


Python and Java fill very different niches in the ecosystem. You're not going to cobble something together quickly in Java, the language just isn't designed to do that. Python is the software equivalent of cardboard-and-hot-glue prototyping, which is fairly common in academia.


I studied IT in the early '00s in Poland; the course was a little outdated, but it had some advantages.

We started from 0 with no assumption of any computer knowledge, and for the first 2 years most courses used Delphi (console only, no GUI stuff - basically it could just as well have been Turbo Pascal; some Linux enthusiasts used FPC instead of Delphi and it worked).

We all complained that we wanted C++ back then, but I learnt to appreciate Pascal later. After the first few months we knew every nook and cranny, and there were very few corner cases and gotchas. So basically we focused on algorithms and not on avoiding the traps the language set for us.

Most people had no programming experience and after a few weeks they wrote correct programs no problem.

I doubt this would happen if we started with C++ as most people wanted, and I think it's better than Python as a starting language because it teaches about static typing and difference between compile- and run-time.

Sadly it's a dead language now.


I've actually had more success teaching assembly as a first language than C. There's less magic, and you borderline have to start with the indirection of pointers in a way that people seem to grok a lot easier than the last month of the semester of learning C.


+1, my university's program seemed to work well with "program anything" (Python), "program with objects" (Java), "program some cool lower-level stuff" (C)


If I had to choose a language to teach programming to absolute beginners, I think I'd actually go with Go.

I understand the predilection for Python but there are some parts of Python that are just... odd.


Python is great fun, and you can be really productive with it, but for people first coming into programming, a language with an explicit and strict type system is invaluable.

I used to think that everyone should be taught Python first, because it lets you focus on the meat of computer science - algorithms, data manipulation, actually _doing_ something - but after helping my girlfriend out with some comp sci 101-104 projects, I really think Go, Java, or Rust should be everyone's first language. It's hard for someone new to the field to understand the nuances that come with Python's corner-cutting. You can work yourself into some weird corners because of how permissive the language is, where in a (strongly) typed language, the compiler just says no.


I use Python a lot professionally these days, after having been a C#/Java developer for a while (and some experience with C and Free Pascal). I absolutely love the language.

I always feel a little iffy when people talk about Python like it's a language ideally suited to beginners.

Dynamic typing puts so much power in your hands to create expressive structures. But it requires discipline to use properly. It's a great trade off for me but I don't think it would be for beginners.


My reasons for suggesting Python compared to Java for example is due to the fact that I teach to Electrotechnical engineers and there are plenty of libraries to experiment with Raspberry and stuff and it's a little bit higher level than C. Every language has its own difficulty to teach but the fact the companies have a banned.h it's basically saying "well C gives you functions for C, but don't use them". It makes it unecessary harder to explain something to people with no experience.


> You can work yourself into some weird corners because of how permissive the language is, where in a (strongly) typed language, the compiler just says no.

Could you share an example?


Here's one that my (intro programming, non-major) students have just been tripping over this week:

  if word == "this" or "that":
      ...
Not an error, always runs. Very mysterious to a beginner. (Shared with C/C++) Another one:

  counter = "0"
  for thing in things:
      if matches(thing):
          counter += 1
The error is in the init, by someone who is overzealous with their quoting, but the error is reported, as a runtime error, on the attempted increment, which throws a TypeError and helpfully tells them "must be str, not int", and of course I know exactly why it's reporting the problem there and why it's giving that error, but it's a bit confusing to the newbie programmer and it doesn't even turn up until they actually test their code effectively, which they are also still just learning how to do.


There are parts of Go that are similarly odd. Arrays and slices, and the hoops you have to jump through to do something as simple as adding a new item to a list, are very unlike anything else, for example.

In Python, the weird stuff is generally easy to avoid/ignore until it's actually needed.


Very true, but if you're teaching CS then this also exposes students to the topic of ownership-versus-reference in a much gentler way than C.


If I remember correctly, slices in Go also have ownership semantics, in a sense that so long as any slice exists, the array is kept alive by the garbage collector.

Or do you mean value vs reference semantics? In that case, I think C pointers are simpler as a fundamental concept, and slices are best defined in terms of pointer + size.


In Java you do the exact same reallocation dance as append does behind the scenes when using arrays.


Java standard library does that, not you directly. The issue here is not performance, but rather ergonomics. For example, in Go, you can forget to assign the result of append to the variable, and it'll even work most of the time (because there was still some unused capacity in the array, so there was no need for reallocation).


Which method? Note I'm talking about the language-level array, not ArrayList. Also this was years ago.


Java arrays don't have add() at all - they're fixed-size once allocated.

ArrayList etc do have add(), and they implement it by re-allocating the backing array once capacity is exceeded.

In practice, you'd use the ArrayList anyway. I don't think it's worthwhile comparing Go and Java "language-only", because the standard library is as much a part of the language definition as the fundamental syntax; indeed, what goes where is largely arbitrary. E.g. maps in Go are fundamental, but the primary reason is that they couldn't be implemented as a library in Go with proper type safety, due to the lack of generics.


Sorry to bug you since this is unrelated. I'm a huge fan of teaching others and I was wondering how you got to be an external lecturer at a college? I'd love to teach classes related to software engineering and data structures. Would you mind emailing me (in my profile) about this?


So this is what happened for me: I went for a walk in the forest with my wife and some of her friends. One of the friends had a husband working at a university.

We started talking, and basically we discovered that what he was teaching was really related to what I do for work, so he asked me to become a "mentor", meaning a professional who helps students with their theses.

In the meantime I went to talk during his class about product management as an engineer where basically I said "I'm an engineer like you, go and talk to customers, it's part of the job", plus extreme programming stuff etc...

After that there was a position open and this professor recommended me because I told him it was one of my goals to be a teacher as well.

And then from there I met the head of department. He was happy with me being versatile; I usually handle C, database design, or Java.

But the usual route is to go to the university you like and look for open positions.

I need to get confirmed every semester and apply again. Usually this job is done by people with a main job and sometimes it happens you don't have time in a semester.


I’d imagine you look for job openings for “adjunct professor” or “instructor” at universities. You can look forward to part-time employment with no benefits and no chance for tenure (this isn’t a dig at adjunct faculty, it’s unfair how it works in the US). Depending on the field of study and school, you need anywhere from bachelors to a PhD to qualify.


You are a good teacher.

20 years ago I was in the exact situation of one of your students, i.e. I was put in front of the C language in the first semester of the first year. I barely, barely passed, then gloriously failed a similar course in the second semester, which I only passed (with an A, to put it in US university terms) a couple of years later, after I had managed to learn Python by myself in the meantime.


Thanks! I just think that you can get practical programming concepts (like loops and conditions) across without the need to understand that a string is an array of chars and that chars are actually integers.

Because if you cover the basics of programming with something like Python, you can fully concentrate in the second course on low-level hardware stuff, how to use memory, etc., which is really important for my students, them being electrotechnical engineers.


Yep, agree. I used a lot of assembler on the C64 and Amiga until I touched so-called high-level programming languages for the first time. For me, thinking in strings was really a weird concept.

Nowadays I find it extremely strange to think of bits and bytes when being confronted with strings.


Question: how do you teach for-loops?

That is something I have a hard time conveying as a teacher. My problem is that I have done this for so long that I have no idea what there is not to understand about loops... it's such a simple thing. But my (undergrad biology) students regularly have a hard time grokking the concept no matter what explanation I use.


Not OP but I'm teaching undergrad C. I'm assuming you have covered while loops beforehand; if not, start there and cover with them the constructs that you would normally use a for loop for. Example:

  int i = 0; //a
   
  while(i < 10) //b
  {
      printf("%d\n", i); //c
      ++i; //d
  }
and introduce for loops as a special case of the while loop:

      //a        //b     //d
  for(int i = 0; i < 10; ++i)
  {
      printf("%d\n",i); //c
  }
Then outline situations when you would use a for loop over a while loop, fixed number of repetitions, use with arrays etc.


Well, I took the time and made some simple animations of how that works. For that and for pointers I use animation so they can visualize what's happening. (Wasn't my idea, I googled "how to explain pointers".)


That's probably because pointers are considered "hard", too hard to exist in other languages. It's interesting that in the 80s, the standard library modeled C++ iterators on pointer semantics because they assumed everyone could do pointer arithmetic, but nowadays the concept is not mainstream at all.


But that's the point, right? I have two hours per week for a semester. So basically I tried to be really fast at the beginning with if/else and loops, gave them an exercise that counted towards the final score, and then it was pointers and pointer-related stuff.


I don’t think the reason for hiding pointers is because they are hard — it’s just that arbitrary pointer arithmetics are especially error prone and can be avoided in most codebases.


Well, hard to get right anyway. In any case, newer generations of programmers are less likely to be familiar with the pattern.


My partner was on a doctoral training course at Oxford and they had to learn C over a few days; string manipulation is the hardest thing she remembers doing out of any medical science crash course they studied over 2 terms


> Technically C11 has strcpy_s and strcat_s

"Theoretically" is the word you're looking for: they're part of the optional Annex K so technically you can't rely on them being available in a portable program.

And they're basically not implemented by anyone but microsoft (which created them and lobbied for their inclusion).


Microsoft doesn't actually implement Annex K! Annex K is based on MSFT's routines, but they diverged. So Annex K is portable nowhere, in addition to having largely awful APIs.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1967.htm#:~...

> Microsoft Visual Studio implements an early version of the APIs. However, the implementation is incomplete and conforms neither to C11 nor to the original TR 24731-1. For example, it doesn't provide the set_constraint_handler_s function but instead defines a _invalid_parameter_handler _set_invalid_parameter_handler(_invalid_parameter_handler) function with similar behavior but a slightly different and incompatible signature. It also doesn't define the abort_handler_s and ignore_handler_s functions, the memset_s function (which isn't part of the TR), or the RSIZE_MAX macro.The Microsoft implementation also doesn't treat overlapping source and destination sequences as runtime-constraint violations and instead has undefined behavior in such cases.

> As a result of the numerous deviations from the specification the Microsoft implementation cannot be considered conforming or portable.


I didn’t know that it was Microsoft that lobbied for them; that perplexes me since I thought Microsoft’s version of them were a bit different (for example, I think C11’s explicitly fail on overlapping inputs where Microsoft specifies undefined behavior) and because Microsoft didn’t bother supporting C99 for the longest time. (Probably still don’t, since VLA was not optional in C99, IIRC. I think Microsoft was right to avoid VLA, though.)


VLA syntax can be useful because you can cast other pointers to them - for instance you can cast int* to int[w][h] and then access it as [y][x] instead of [y*w+x].

As a bonus this crashes icc if you do it.
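
For concreteness, a sketch of that trick (my own example, C99; the cast target is a pointer to a VLA row type rather than the 2D array type itself):

  #include <stdio.h>
  #include <stdlib.h>
  int main(void) {
      size_t w = 4, h = 3;
      int *flat = malloc(w * h * sizeof *flat);
      if (flat == NULL)
          return 1;
      /* view the flat buffer as h rows of w ints */
      int (*grid)[w] = (int (*)[w])flat;
      for (size_t y = 0; y < h; y++)
          for (size_t x = 0; x < w; x++)
              grid[y][x] = (int)(y * w + x);  /* same cell as flat[y*w + x] */
      printf("%d\n", grid[2][3]);             /* prints 11 */
      free(flat);
      return 0;
  }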


>As a bonus this crashes icc if you do it.

I thought this was a pretty funny thing but unfortunately when I tried this on ICC it seemed to compile just fine.

Though I am amused by one thing: the VLA version generates worse code on all compilers I've tried. Seems to validate the common refrain that VLAs tend to break optimizations. (Surely it's worse when you have an on-stack VLA though.)


This does it: https://gcc.godbolt.org/z/5fz8sM

I'm not sure if they take bug reports if you're not a customer, but this one goes back at least 8 years.


Microsoft's implementations are distinct and incompatible (and they haven't changed to be compatible with the standard versions because of backwards compatibility).


Important to note that strcpy_s doesn't truncate, it aborts your app if it fails:

> "if the destination string size dest_size is too small, the invalid parameter handler is invoked"

> "The invalid parameter handler dispatch function calls the currently assigned invalid parameter handler. By default, the invalid parameter calls _invoke_watson, which causes the application to close and generate a mini-dump."

https://docs.microsoft.com/en-us/cpp/c-runtime-library/refer...


They are also implemented by embedded compilers such as IAR and Keil.


The issue is pretending that C even has strings as a semantic concept. It just doesn't. C has sugar to obtain a contiguous block of memory storing a set number of bytes and to initialize them with values you can understand as the string you want. Then you are passing a memory address around and hoping the magic value byte is where it should be.

C is semantically so poor, I find it hard to understand why people use it for new projects today. C++ is over complicated but at least you can find a good subset of it.


C is a good language for solo projects because of its simplicity. By simplicity, I mean understanding what it’s doing under the hood. ‘Portable assembly’ is not an unfitting title.

Big C projects work well when they are carefully maintained (like Git).


> but at least you can find a good subset of it.

It's a constantly shifting subset, though. Moving slowly is a feature of C for some.


You can pick a subset of C++ now and it will not be worse in the future. There may be better ways to do things added but that seems a weird thing to complain about and you can just ignore it if you want.


> strings in C are actually hard,

Strings in C are more like a lie. You get a pointer to a character and the hope there is a null somewhere before you hit a memory protection wall. Or a buffer for something completely unrelated to your string.

And that's with ASCII, where a character fits inside a byte. Don't even think about UTF-8 or any other variable-length character representation.

In fairness, the moment you realize ASCII strings are a tiny subset of what a string can be, you also understand why strings are actually very complicated.


> In fairness, the moment you realize ASCII strings are a tiny subset of what a string can be, you also understand why strings are actually very complicated.

Oh absolutely, but it's a pretty reasonable expectation that any contemporary language should handle that complexity for you. The entire job of a language is to make the fundamental concepts easier to work with.


C very much does make the fundamental concepts easier to work with, it merely disagrees with you about exactly which concepts are fundamental :).


Sadly, strings are, at the same time, complicated enough to be left outside the fundamental concepts of a language, but far too useful to be left outside the fundamental concepts of a realistically viable language.


What I don’t understand is why C programmers use the built in strings. It’s like rolling your own sorting algorithm every time you need it. Surely someone could write a better string library in C that hides the complexity. The real problem is that C programmers are apparently allergic to using other people’s code.


Because most projects involve interfacing with other third-party libraries that will undoubtedly not know about this other third-party library that implements a nice string.


It can contain a to_cstring() function and problem solved.

If it uses a struct with length of string and pointer to a c-style string, even the conversion can be elided (at the price of some inflexibility/unnecessary copying while in use)


That would mean you lose the ability to do all sorts of optimizations and memory sharing. Or at least, you can do them, but then the c_string() function requires copying the data. And that also means that it's a one-way thing: you can't use the copy on something that wants to modify the string, and expect your FancyString instance to reflect the modification.


I think nearly every C programmer has gone through the phase of Oooh I'll write my own string library! Sure, that works. Except you have to call system libraries and all kinds of other external functions all of which naturally assume the conventional char arrays. So you spend a bunch of time converting back and forth until eventually realizing it's silly, just learn the convention and go with it.


There are a large number of those libraries. Every large C project eventually seems to grow its own string class.


... except for libc, which apparently is hardly ever questioned.


>>> Surely someone could write a better string library in C that hides the complexity.

In short, it's not possible to write a nice string library in C because C simply doesn't support objects, and by extension doesn't support libraries.

Strings are a perfect example of an "object" in what was later known as object-oriented programming. C doesn't have objects; it's the last mainstream language that's simply not object oriented at all, and that prevents you from making things like a nice string library.

If you're curious, the closest thing you will see in the C world is something like GTK, a massive C library including string and much more (it's more known as a GUI toolkit but there are many lower level building blocks). It's an absolute nightmare to use because everything is some abuse of void pointers and structs.


What rubbish! You do not need objects to make a library. Structs, typedefs, and functions do just fine. There are even techniques in C to define abstract data types if you want!

Take another look at https://developer.gnome.org/glib/stable/glib-Strings.html#g-... . That’s all C, baby, and could be replicated in a completely independent strings-only library built on the standard library if you wished. The reasons no such library exists are ecological, not technical.


I think you mean GLib as seen here https://developer.gnome.org/glib/stable/glib-Strings.html

GLib and GTK are closely aligned parts of GNOME so they are easy to get mixed up.


Right. The string library is in glib.

There were a few big libraries in the ecosystem if I remember well, GTK, glib and another two. They're from the same origin and often mixed together.

It's been almost a decade since I dabbled in this stuff day-to-day. I think being forced to use glib is the turning point in a developer's life where you realize you simply have to move on to a more usable language.


So true


Strings have nothing to do with objects. You can write a string library, e.g. sds (https://github.com/antirez/sds). It's just not standard.


The challenge is not to write a string library, but to write a "nice" string library.

Let's say, something that's easier to use and doesn't have all the footguns of the char arrays.

The library you link doesn't come anywhere close to that. It's 99% like the standard library and it has the exact same issues.


I would love to see what you mean by "exact same issues".

sds strings contain their lengths, so operating on them you don't have to rely on null termination, which (to my knowledge as a lower-midlevel C programmer) is the most prevalent reason why people take issue with C strings.

If you mean that they're not really "strings" but byte arrays I would say that I agree, but to all intents and purposes that's what the C ecosystem considers as strings.

Keeping an API which is very similar to the standard library is also a plus, as it doesn't force developers to change the way they reason about the code.
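
For readers who haven't seen it, the core idea is just to carry the length with the bytes. A minimal sketch of that idea (my own struct, not the actual sds layout - sds hides its header in front of the buffer and hands out a plain char* to the bytes):

  #include <stdlib.h>
  #include <string.h>
  struct lstr {
      size_t len;   /* number of bytes in buf, not counting the '\0' */
      char   buf[]; /* flexible array member (C99) */
  };
  static struct lstr *lstr_new(const char *src, size_t len) {
      struct lstr *s = malloc(sizeof *s + len + 1);
      if (s == NULL)
          return NULL;
      s->len = len;
      memcpy(s->buf, src, len);
      s->buf[len] = '\0';  /* keep a terminator for easy interop with C APIs */
      return s;
  }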


> sds strings contain their lengths, so operating on them you don't have to rely on null termination, which (to my knowledge as a lower-midlevel C programmer) is the most prevalent reason why people take issue with C strings.

Wait, haven't I seen that idea somewhere else...?

> If you mean that they're not really "strings" but byte arrays I would say that I agree, but to all intents and purposes that's what the C ecosystem considers as strings.

Aha, strings as byte arrays but with a built-in length marker.

But yeah, Pascal is sooo outmoded and inferior to C...

Sigh.


You’re moving goalposts now. Just earlier you wrote that you couldn’t write a library in C because C does not support objects, not that you couldn’t write a nice library (for whatever definition of “nice” you want to use, which will be different from someone else’s).

In fact there are several libraries for string-like objects; the main barrier to use them is that none of them is standard. You can at least acknowledge that before talking about nice-ness, which is a whole other point.


I'm partial to https://github.com/antirez/sds these days


The only problem I have with antirez's lib is that he didn't make it into a single header library.


Is it so hard to add a single source file to your build system?

If yes, then you can do #include "sds.c" in some random source file. In fact, that's what so-called header-only libraries in C implicitly do. shudder


A C file implies a compilation unit. For the projects I write I like to have a single compilation unit per binary (what's called a unity build). In the case of C, this doesn't bring much speed to the table, but it allows for a simpler build toolchain none the less.


strcpy is a coding challenge we use for interviews where I work. I typically ask candidates to write the standard version and ask them why they might not want to use it, to see if they are aware of the risks. After that I ask them to modify the code to be buffer safe. And for those claiming C++ knowledge, I ask them to make it work for wchar_t as well to see if they can write a template. Some people really struggle with this.
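
For anyone curious what a passing answer might look like, here's one possible shape (a sketch; the names and the error convention are mine, not anything we grade against):

  #include <stddef.h>
  /* Naive strcpy: copies up to and including the terminating 0, no bounds check. */
  char *my_strcpy(char *dst, const char *src) {
      char *ret = dst;
      while ((*dst++ = *src++) != '\0')
          ;
      return ret;
  }
  /* Buffer-safe variant: never writes more than dst_size bytes, always
     0-terminates when dst_size > 0; returns 0 on success, -1 on truncation. */
  int my_strcpy_safe(char *dst, size_t dst_size, const char *src) {
      if (dst_size == 0)
          return -1;
      size_t i;
      for (i = 0; i + 1 < dst_size && src[i] != '\0'; i++)
          dst[i] = src[i];
      dst[i] = '\0';
      return src[i] == '\0' ? 0 : -1;
  }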


If only C had followed the Pascal way to have the size with a string - so much human suffering could have been avoided!


It was considered:

> C designer Dennis Ritchie chose to follow the convention of null-termination, already established in BCPL, to avoid the limitation on the length of a string and because maintaining the count seemed, in his experience, less convenient than using a terminator.[1][2]

* https://en.wikipedia.org/wiki/Null-terminated_string#History

Ritchie et al had experience with the B language:

> In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.

* https://www.bell-labs.com/usr/dmr/www/chist.html


> ~~Technically~~ Optionally, C11 has strcpy_s and strcat_s which fail explicitly on truncation. So if C11 is acceptable for you, that might be the a reasonable option, provided you always handle the failure case.

One of the big problems with C programmers is they often neglect to check for and handle those failure cases. Did you know that printf() can fail, and has a return value that you can check for error? (Not you, personally, but the "HN reader" you) Do you check for this error in your code? Many of the string functions will return special values on error, but I frequently see code that never checks. Unfortunately, there isn't a great way to audit your code for ignored return values with the compiler, as far as I know. GCC has -Wunused-result, but it only outputs a warning if the offending function is attributed with "warn_unused_result".

I'm not a huge fan of using return values for error checking, but we have the C library that we have.
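
For your own functions you can at least opt in to that warning; a sketch of the GCC/Clang attribute mentioned above (the function and its error convention are made up for illustration):

  #include <stdio.h>
  /* GCC and Clang warn if a caller ignores this function's return value. */
  __attribute__((warn_unused_result))
  static int write_greeting(FILE *out) {
      if (fprintf(out, "hello\n") < 0)  /* fprintf reports errors with a negative return */
          return -1;
      return 0;
  }
  int main(void) {
      write_greeting(stdout);            /* -Wunused-result: warning here */
      if (write_greeting(stdout) != 0)   /* checked: no warning */
          return 1;
      return 0;
  }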


Truncation, even if it is wrong in an application logic sense, is strictly superior to UB (and in practice, buffer overruns, which can be exploitable). That's the main benefit of strlcpy/strlcat. It is certainly possible to construct a security bug through truncation! But it is much more common to have security bugs from uncontrolled buffer overruns.
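
And when truncation itself would be a logic bug, the usual idiom is to treat it as an error; roughly this (strlcpy is a BSD extension, not ISO C - on glibc systems it may come from libbsd or a local copy):

  #include <string.h>  /* strlcpy on the BSDs; elsewhere it may live in <bsd/string.h> */
  int copy_name(char *dst, size_t dst_size, const char *src) {
      /* strlcpy always 0-terminates (if dst_size > 0) and returns strlen(src);
         a return value >= dst_size means the copy was truncated */
      if (strlcpy(dst, src, dst_size) >= dst_size)
          return -1;  /* refuse to proceed with a silently truncated name */
      return 0;
  }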


Yeah. I just avoid str manipulations in general in C and when I have to, fuzz it ... (but still, the perf cliff is definitely new to learn in the past few days).


The decision to make C strings null terminated with implied length instead of length + blob continues to trip us up, 30+ years later. There's a good reason the "safe" versions of those functions all take length parameters. But way back when this approach was chosen, I don't think the state of the art could fully predict this outcome.

But also, "strings" and "time" are actually very complex concepts, and these functions operate on often outdated assumptions about those underlying abstractions.


I would argue that C's fundamental mistake (well, more like limitation due to hardware of the time) was allowing arrays to decay to pointers; arrays hold valuable type information (the length!) that is lost once converted to a pointer.

C99 came so very very close with VLAs. You can declare a function like:

  int main(int argc, char *argv[argc]) { ... }
But C99 requires the compiler to discard the type annotations and treat the declaration as equivalent to:

  int main(int argc, char **argv) { ... }
Imagine a world where the C string functions were declared as:

  char *strndup(s, n)
    const char s[n];
    size_t n;
  {
    /* now we can do sizeof(s) and bounds checking! */
  }
(You'd have to use K&R style declarations to get around the fact that the pointer argument comes before the length argument, alas.)

Edit: and then C11 made VLA support optional, since the feature didn't get used much, because the feature was only half-baked to begin with... sigh.


It wasn't a limitation due to hardware of the time. It was a deliberate choice due to C's ancestry as a derivative of B.

In B, there was only one data type: the machine word. The actual meaning was determined by the operators used on it. Thus, given x, (x + 1) would be integer addition, but *x would dereference it as a pointer (to another word). There was no need to distinguish between integer and pointer arithmetic, because their semantics were the same - pointers were not memory addresses of bytes, but of words, and thus (x + 1) would also mean "the next element after x", if x is actually a pointer.

When it came to arrays, B didn't have them as a type at all. It did have array declarations - but what they did was allocate the memory, and give you a variable of the usual word type pointing at that memory (which could be reassigned!). Thus, arrays "decayed" to pointers, but in a broader sense than they do in C.

This all works fine on machine where everything is a word, and only words are addressable. But C needed to run on byte-addressable architectures, hence why it needed different types, and specifically pointer types to allow for pointer arithmetic - as something like (p + 1) needs to shift the address by more than 1 byte, depending on the type of p. But they still tried to preserve the original B behavior of being able to treat arrays as pointers seamlessly, hence the decay semantics.

BTW, this ancestry explains some other idiosyncrasies of C. For example, the fact that the array/pointer indexing operator can have its operands ordered either way - both a[42] and 42[a] are equally valid - is also straight from B. A more obvious example: the reason why C originally allowed you to omit variable types altogether, and assumed int in that case, is because int is basically the "word type" of B, and thus C code written in this manner very much resembles B. And then there's "auto", which was needed in B to declare locals because there was no type, but became redundant (and yet preserved) in C.

https://en.wikipedia.org/wiki/B_(programming_language)#Examp...


What you want works, you just used the wrong syntax:

    #include <stdio.h>
    
    void foo(int len, const char (*str)[len]){
        printf("%zu\n", sizeof(*str));
        printf("%.*s", len, *str);
    }
    
    int main(void){
        // note: not nul-terminated
        const char text[] = {'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!', '\n'};
        // prints 13, then 'hello world!'
        foo(sizeof(text), &text);
        return 0;
    }


Why yes, yes it does! I did know (but seem to have forgotten) that it's only the first level of array-ness that decays to a pointer.

It's unfortunate that the resulting VLA-enhanced function is no longer compatible with the original:

  /* original function */
  void foo(size_t len, const char *str);
  /* compatible signature but str[] decays to sizeless pointer */
  void foo(size_t len, const char str[len]);
  /* allowed but signature is no longer compatible with original */
  void foo_improved(size_t len, const char (*str)[len]);
  /* (this is how the non-VLA caller would see the signature) */
  void foo_improved(size_t len, const char (*str)[]);
So what your example does show is that existing compilers already support this concept (no need for fancy dependent types) but the C99 standard explicitly prohibits compilers from acting on the VLA information contained within the const char str[len] declaration.


See WalterBright's 2009 blog post, C’s Biggest Mistake, about exactly this.

https://digitalmars.com/articles/C-biggest-mistake.html


Ah, indeed! Thanks for the link.

I think many of the "safe C" variants get tripped up by starting with fat pointers (length + pointer as an atomic value) and then have trouble (rightly so!) when trying to squeeze them through the C standard ABI; it's a square peg in round hole sort of situation.

The key observation from WalterBright's post is that the C standard ABI already has a way to pass fat pointers, using a pair of arguments (size_t and char *) in an ad-hoc manner.

It's the not-useful-but-legal C99 VLA declaration in the function prototype that could, if one is willing to violate the C99 spec, allow a compiler to automatically derive a fat pointer inside the body of a function in a manner that is backwards-compatible with the C standard ABI.


It's because to get that to actually work you need dependent types. Which—it's not gonna happen.


But the caller to strndup() has already provided both the (pointer to the) array and the length! Note that C99 does permit declaring a VLA in the body of the function:

  char *strndup(size_t n; const char s[n], size_t n) {
    char buf[n];                /* alloc a temporary VLA */
    assert(sizeof(buf) == n);   /* yep! */
    assert(sizeof(s) == n);     /* nope, sizeof(s) == sizeof(char *) */
  }
So there's absolutely no reason (other than being in violation of the C99 specification) for the compiler to refuse to let you make the assertion that sizeof(s) == n.

And given the prototype for this VLA-enhanced strndup(), a smart C compiler could catch errors like this:

  char * bugged_func() {
    char buf[20];
    /* do stuff with buf, e.g. snprintf() into it */
    return strndup(buf, 30);    /* error: 30 > sizeof(buf) */
  }
Since of course within a function the C type system is already tracking the size of an array -- so no additional type information is required, and certainly not dependent types!


With standard VLAs, you always have a guarantee of being able to access sizeof(buf) bytes from buf, for any variable buf. With your syntax, that guarantee would no longer hold, unless c had dependent types that could prove said guarantee.


Well, given that C is fundamentally about separate compilation and external linkage, most "guarantees" in the language are really just promises or contracts. As demonstrated in david2ndaccount's comment, standard C already handles VLA function arguments just fine (without any need for dependent types).

The only issue is that C99 insists that the first dimension of an array argument must decay to a pointer, discarding the associated type information of that array's dimension.


We wouldn't need to go all the way to dependent types, which would guarantee at compile-time that array accesses are safe. Even if all the bounds checking happened at run-time it would still be tremendously helpful.


Bounds checking can be done, and it doesn't need any special language features. Tcc does it, as do some of the sanitizers (present in gcc and clang).


Can you bounds check dynamically sized arrays? For example, a function that receives the size as a separate argument?

    double f(double *xs, int n){
        return xs[g()];
    }


Yes. In my hypothetical world where the C compiler makes use of the VLA declaration in the function arguments, it would certainly be possible for the compiler to insert automatic bounds checking in this case:

  double f(xs, n)
    double xs[n];
    size_t n;
  {
    size_t _tmp0 = g();   /* temporary var created by compiler */
    assert(_tmp0 < n);    /* bounds check inserted by compiler */
    return xs[_tmp0];
  }
The key to making this possible is telling the compiler about the relationship between double* xs and size_t n; once the compiler has the knowledge that the type of xs is double [n] (array of double with first dimension n) it would be able to automatically insert dynamic bounds checks.


> dynamically sized arrays

Yes. What you can't do is associate bounds information with some specific pointer to an array, but this will work, for instance:

  int *x = malloc(2 * sizeof(int));
  x[1]; //ok
  x[2]; //runtime error


For reasons that were never clearly articulated, the prefix approach was considered odd, backwards, and to have numerous downsides, at least where I learned C. In hindsight, I can only cringe at that attitude. Strings as added in later Pascal, about 40 years ago now, were memory safe in a way that C strings still are not.


Hey, languages used length,blob even when C was invented. HP Access BASIC used that kind.

It was a limitation, because they chose a byte length (to save space). So strings up to 255 characters only. It was decades before folks were comfortable with 32-bit length fields. And that still limited you to 4GB strings. In the bad old days, memory usage was king.


The funny thing is that you can just use the topmost bit of the length to indicate that the string length is >127, and chain as many length bytes as you want before you begin the string proper (to save space). It would be still a better encoding than a null at the end.


This way you would trade in a null-byte-terminated variable length string for essentially a null-bit-terminated variable length number (plus the remaining string). I am not convinced that this actually would be much safer.


Unicode does variable length bit strings too, so I'm not a visionary or anything. It would be safer for no other reason than that such a pattern could only occur at the start of the string, with zero special handling, while a null could occur anywhere in a zero-terminated string.


This is just the LEB128 format, which is commonly used, and I don't think there are any serious problems with it.
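
For reference, decoding an unsigned LEB128 length is only a few lines (a sketch; the 10-byte cap for 64-bit values and the error convention are my choices):

  #include <stddef.h>
  #include <stdint.h>
  /* Returns the number of bytes consumed, or 0 on truncated/oversized input.
     The low 7 bits of each byte are payload (little-endian); a set high bit
     means "more bytes follow". */
  static size_t uleb128_decode(const uint8_t *buf, size_t avail, uint64_t *out) {
      uint64_t value = 0;
      unsigned shift = 0;
      size_t max = avail < 10 ? avail : 10;  /* 10 bytes cover a 64-bit value */
      for (size_t i = 0; i < max; i++) {
          value |= (uint64_t)(buf[i] & 0x7F) << shift;
          if ((buf[i] & 0x80) == 0) {  /* high bit clear: last byte */
              *out = value;
              return i + 1;
          }
          shift += 7;
      }
      return 0;
  }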


Interesting. Thank you for sharing this!


At least you don't have (obvious) performance problems with it, because you will effectively never need more than 9 (usually 2 or 3) of these bytes.

But sure, on modern 64-bit systems just using a 64-bit integer makes much more sense. On a small embedded 8-bit or 16-bit microcontroller it might make sense.


You are correct, I was trying to show that such a scheme was practical even in the early 1980s when zero-termination was beginning to dominate. This could well be used on 64-bit systems (just with a larger word size than a byte), though the utility of such a thing is questionable.


In a toy language I once wrote I got around that by encoding binary values as quaternary values and using a ternary system on top of that with a termination character: 11 = 1; 01 = 0; 00 = end; 10 was unused.

Having truly unbounded integers was rather fun. Of course performance was abysmal.


Pascal compilers of the time when Pascal was popular typically had a switch that disabled range checks for speed reasons.

Such a system would effectively remove that feature. Yes, you could disable range checks when indexing into a string, but you still would have to figure out how many length bytes there are. That would only be a little bit faster than a full range check.

Because of that, I don’t see how that would have been useful at the time.


The prefix approach turns the neat "strings are just character arrays are just pointers" pattern into something a lot more clunky, because now you've got this really basic data type that is actually a struct and now you have to have an opinion on how wide the length value is and short strings get a lot of memory overhead in just lengths, and so on.

In hindsight, I think the complexity is worth the safety, but I could see why it felt more elegant to use null-terminated strings at the time.


It's a classic case of moving the complexity from one part of the system to another. "Strings are just character arrays" seems simple and elegant, but in reality is a giant mess, because strings are not just character arrays, any more than dates are just an offset from an epoch.

Human concepts are inherently messy. "Elegant" solutions just shove the mess down the road.


On the contrary, I think "Strings are just character arrays are just pointers" is the solution, not the problem. As with non-character arrays, you must always pair the pointer with a length. (I don't like the idea of prefixing the length because it prevents substringing.)

The problem is the null termination, which is not general to arrays (though it is sometimes used with arrays of pointers).
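
A sketch of what "pair the pointer with a length" looks like in practice (my own names; this is essentially what other languages call a slice or string view), and why substringing stays cheap:

  #include <stddef.h>
  struct strview {          /* borrowed view into someone else's storage */
      const char *ptr;
      size_t      len;
  };
  /* Substringing is just pointer arithmetic: no copy, no terminator games. */
  static struct strview sv_substr(struct strview s, size_t start, size_t len) {
      if (start > s.len)
          return (struct strview){ s.ptr + s.len, 0 };
      if (len > s.len - start)
          len = s.len - start;
      return (struct strview){ s.ptr + start, len };
  }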


You sometimes see sentinel values used with varargs (only if they're all the same type), which is basically just an array with more boilerplate.

That being said. Length as a first parameter and the rest of the arguments being the variadic bit is also quite normal.


You're just moving the complexity from strings into unsigned integers. Your strings are limited by the size of whatever you put in the head of the string.

Sure 16 exabytes sounds like a lot today, but so did 4 billion ip addresses. Differently bad is not better.


I didn’t say anything about unsigned integers, or any specific approach to encoding strings at all, so I’m not sure what you’re trying to say here.


It doesn't just buy safety. It also makes it possible to include null bytes inside of strings.


This is the part that boggles my mind. It's not like fat pointers just didn't exist at the time. You need fat pointers any time you do anything with dynamic non-string binary data.

Null is always 1 byte minimum so at best you save size_t-1 bytes per string. Ignoring clever structures like LEB128 varint length.

This is a classic case of "simple is actually complex". How many billions of dollars have null-terminated strings cost? Hope that 3 bytes of overhead per string saved is worth it.


and with length+array you don't need 2 copies of so many functions (array input vs string input)

No matter how you slice it, null termination was a mistake.


Pascal strings are not inherently memory safe:

   cat_pascal_strings(pascalstr *uninited_memory,
                      pascalstr *left,
                      pascalstr *right);
how big is uninited_memory? Can left and right fit into it?

You need to design language constructs around Pascal strings to make them actually safe. Such as, oh, make it impossible to have an uninitialized such object. The object has to know both its allocation size and the actual size of the string stored in it.

What is unsafe is constructing new objects in an anonymous block of memory that knows nothing about its size.

C programs run aground there not just with strings!

   struct foo *ptr = malloc(sizeof ptr);  // should be sizeof *ptr!!

   if (ptr) {
      ptr->name = name;
      ptr->frobosity = fr;
 
Oops! The wrong sizeof allocated only the size of a pointer: 4 or 8 bytes, typically nowadays, but the structure is 48 bytes wide.

"struct foo" itself isn't inferior to a Pascal RECORD; the problem is coming from the wild and loose allocation side of things.

Working with strings in Pascal is relatively safe, but painfully limiting. It's a dead end. You can't build anything on top of it. Can you imagine trying to make a run-time for a high level language in Pascal? You need to be in the driver's seat regarding how strings work.


"Can you imagine trying to make a run-time for a high level language in Pascal"

You mean like the strings in Delphi? Yeah, I can, since I use them daily. Strings in Delphi nowadays are actually more like classes in Java than old Pascal strings. Then, depending on your intent, they end up as arrays or old-style strings after the linker goes over your code. Best of both worlds, and on top of it, if you really want, you can definitely shoot yourself in the foot with unsafe operations. So in the end it's the best of both worlds and the worst of a third, though for the third one you really need to go out of your way to make it as bad as C strings are.


> Working with strings in Pascal is relatively safe, but painfully limiting. It's a dead end. You can't build anything on top of it. Can you imagine trying to make a run-time for a high level language in Pascal? You need to be in the driver's seat regarding how strings work.

I doubt string representation is really the blocker here since C-strings are now pretty much just used by some but not all C programmers. QString and GString and C++ std::string and Rust strings and Go strings and Java strings and so on are not null terminated


I can totally imagine trying to make a run-time for a high-level language in any sensible Pascal dialect, such as Turbo/Borland Pascal. I mean, forget strings - that thing had syntax specifically to implement interrupt handlers, or access absolute memory addresses.

Better yet, how about Modula-2? I can't help but think that the programming language landscape would be much better if that language occupied the niche that C does today.


In the Pascal that I remember, strings were always 256 bytes and the first byte tracked the length, meaning they were always safe, though might get truncated. The LongString just did allocations whenever it needed, and was also safe, as long as you weren’t rolling your own pointer math.


> struct foo *ptr = malloc(sizeof ptr); // should be sizeof *ptr!!

This is why whenever I use sizeof, I pass a type, not a variable.


On platforms with limited register sets, keeping length around burns a register that could be used for something else. From the mindset of assembler programmers wanting a high level language that suits them, a sentinel that doesn't consume extra machine resources was preferable. Not to mention all the precious RAM those multi-byte lengths squander.


Ok, I hadn't considered the registers thing. That almost makes sense. Almost. But variable length arrays are so useful I think it warrants adding a register just for dealing with fat pointers.

Like I get why it happened. It is just crazy how long it has stuck around.


Strings as they existed in standard Pascal were extremely limited in how you could work with them, since it didn't have true dynamic arrays.

Strings as implemented in e.g. Borland Pascal were better. But then, the length-prefixed implementation had its own downsides. For example, it had to decide how many bits to use for length. 16-bit Pascal would generally use a single byte, and in BP at least, you could even access it as a character via S[0]. Thus, strings were limited to 256 bytes max - and because this was baked into the ABI, it wasn't something that could be easily changed later.

Hence when Delphi decided to fix it, they basically had to introduce a whole new string type, leaving the old one as is. And then they added a bunch of compiler switches so that "string" could be an alias for the new type or the old, as needed in that particular code file.


(Many of) The trade-offs were known to Ritchie et al; writing in 1993:

> None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled `*e'. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.

[…]

> C treats strings as arrays of characters conventionally terminated by a marker. Aside from one special rule about initialization by string literals, the semantics of strings are fully subsumed by more general rules governing all arrays, and as a result the language is simpler to describe and to translate than one incorporating the string as a unique data type. Some costs accrue from its approach: certain string operations are more expensive than in other designs because application code or a library routine must occasionally search for the end of a string, because few built-in operations are available, and because the burden of storage management for strings falls more heavily on the user. Nevertheless, C's approach to strings works well.

* https://www.bell-labs.com/usr/dmr/www/chist.html

He mentions Algol 68 and Pascal [Jensen 74].


Thanks for the reference!

I personally don't think that the qualitative pros/cons of the chosen approach or alternatives that we're discussing today, 30-ish years later, would be all that new to the designers of C in 1993. The difference is that we've had 30-ish years to watch those decisions play out over millions of lines of code in software running at scales and levels of complexity that programmers in 1993 could only dream of.

Also, software security was barely an issue in 1993. Today, it's a massive issue.


> Also, software security was barely an issue in 1993. Today, it's a massive issue.

That was him reflecting on things in 1993, but the C team designed things in ~1970. That was basically the Stone or Iron Age of computing.


Thank you (and the others) in this thread. Very insightful, particularly with what motivated the thinking then.


We could say that a string in C is an implicit type like a list in Lisp.


Oh Pascal, why couldn't we have had you instead.


"Oh Pascal" reminded me of a book titled, "Oh! Pascal!" by Doug Cooper. I used it to learn Pascal.


Best programming book I ever read. His later "Condensed Pascal" isn't quite as good -- a little too, well, condensed. Too bad that's the only one I could find to buy after having had to return Oh, Pascal to the city library.

(OK, it's hard to compare; Code Complete and other much later stuff might be just as good. Too many decades between when I read them to say for sure.)


You can, it's called Delphi.


Or Nim [1]. :-) It's hard to describe succinctly, but "Pascal meets Ada meets Python meets Lisp-like syntax macros" starts to convey it. Forget operator overloading trying to square peg-round hole whatever the operator set is - Nim has user-defined operators. And a dozen other nice things.

https://nim-lang.org


> But also, "strings" and "time" are actually very complex concepts, and these functions operate on often outdated assumptions about those underlying abstractions.

Even in safer languages such as Rust, there are often quæstions as to why certain string operations are either impossible, or need to be quite complicated for a rather simple operation, and they are then met with responses such as “Did you know that the length of a string can grow from a capitalization operation depending on the locale settings of environment variables?”

P.s.: In fact, I would argue that strings are not necessarily all that complicated, but simply that many assume that they are simpler than they are, and that code that handles them is thus written on such assumptions as that the length of a string remains the same after capitalization, or that the result is not under the influence of environment variables.


> locale settings of environment variables

Also known as "why does my code that parses floats fail in Turkey?"

Also also known as the discrepancy between a string's length-as-in-bytes, its length-as-in-code-points, and its length-as-in-how-humans-count-glyphs.

Strings are hard.

Edit to respond to your addendum:

> P.s.: In fact, I would argue that strings are not necessarily all that complicated, but simply that many assume that they are simpler than they are, and that code that handles them is thus written on such assumptions that the length of a string remain the same after capitalization, or that the result not be under influence of environment variables.

I don't think I agree with that, though we may just be disagreeing on semantics. I think the big mistake many of us make is confusing two different abstractions for the same one. We've got this high level abstraction for "text" that includes issues like locale and encoding and several other things. And then we've got this low level abstraction for "text" that is just a blob of bytes. And we often mix the abstractions because it often turns out okay anyway. Otherwise we have to confront demons like "a UTF-8 string containing 10 characters can be anywhere between 10 and 40 bytes long".
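
A small illustration of the byte/code-point gap in C (my helper; it assumes valid UTF-8 input, and still says nothing about how humans count glyphs):

  #include <stdio.h>
  #include <string.h>
  /* Count code points by skipping UTF-8 continuation bytes (10xxxxxx). */
  static size_t utf8_codepoints(const char *s) {
      size_t count = 0;
      for (; *s != '\0'; s++)
          if (((unsigned char)*s & 0xC0) != 0x80)
              count++;
      return count;
  }
  int main(void) {
      const char *s = "na\xC3\xAFve";                     /* "naïve", with a 2-byte 'ï' */
      printf("bytes: %zu\n", strlen(s));                  /* 6 */
      printf("code points: %zu\n", utf8_codepoints(s));   /* 5 */
      return 0;
  }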


> Also known as "why does my code that parses floats fail in Turkey?"

I am quite certain that I have produced code that lowercases or uppercases and then checks for “i” in them, that I now realize would fail under Turkish locale settings as under that “i” does not uppercase to “I”, as one might expect.


> Why does my code that parses floats fail in Turkey

Because you, or someone, called

  fuck_my_program();
which is defined in "idiot.h" as

  #define fuck_my_program() setlocale(LC_ALL, "")
and the project is missing:

  #define setlocale(x, y) BANNED(setlocale)
Hope that helps!


I agree with you, but it would be much better if every locale-dependent function had an argument for an explicit locale object.


The problem is you'd have to pass an annoying extra argument (even if just a NULL) to numerous functions which have no alternative without that argument.

Technically it would be better, especially from a multi-threading point of view. The locale stuff was designed in the 1980's, before multi-threading was a mainstream technique.

Say you have a multi-threaded global server which has to localize something in the context of a session, to the locale of the user making the request.

Still, for thread support, you don't necessarily need a cluttering argument. The locale can be made into a thread-specific variable. In Lisp I would almost certainly prefer for the locale to be a dynamic variable. (It would be pretty silly to be passing an argument to influence whether the decimal point is a comma, while the radix of integers is being controlled by *print-base*.)

What you want is for the locale stuff to be broken out into a completely separate library: a whole separate set of loc_* functions: loc_strtod, loc_printf, and so on.


The threading aspect is one thing yes, but I think programmers forgetting or never realizing that these functions will magically behave differently for some of their users is a bigger problem.

I don't think having to pass a locale argument would be that big of a problem - you could always have wrapper functions for the C locale, although they should be implemented directly for performance.

> What you want is for the locale stuff to be broken out into a complete separate library: a whole separate set of loc_* functions: loc_strtod, loc_printf, and so on.*

Yes, that would be ideal.


I think 30-40 years ago it was perfectly appropriate to null-terminate strings. Every byte actually counted.

I remember thinking about setting the high bit to denote the end of string to save space.

Nowadays the binary for "hello world" might be as big as a whole operating system of the past.

(though honestly I can't recall the size of the OS on a boot floppy, but the original floppies were 160k)


Just a surreal reminder that 30 years ago was 1991. By that time it already didn't make sense to null terminate strings (except perhaps on embedded platforms).


To be fair, in 1991, 1mb of RAM was a substantial amount for a PC, so a single word per string instead of a byte per string could still add up quickly.


30+ years -> 50+ years

Funny how the mind forgets to increment these counters each year.


C89 was 32 years ago, so I think saying 30+ years is fair.


Some of us learned C off of the original K&R book.


The reason that the safe functions take length parameters is that they produce a new object in uninitialized memory, a pointer to which is specified by the caller.

It has nothing to do with null termination.

And that uninitialized memory is not self-describing in any way in the C language; the same is true in machine language.

This is a problem you have to bootstrap yourself somehow if you are to have any higher level language.

The machine just gives you a way to carve out blocks of memory that don't know their own type or size. C doesn't improve on that, but it is not the root cause of the situation. Without C, you still have to somehow go from that chaos to order.

Copying two null terminated strings into an existing null-terminated string can be perfectly safe without any size parameters.

   void replace_str(char *dest_str, const char *src_left, const char *src_right);
If dest_str is a string of 17 characters, we know we have 18 bytes in which to catenate src_left and src_right.

This is not very useful though.

Now what might be a bit more useful would be if dest_str had two sizes: the length of string currently stored in it, and the size of the underlying storage. This particular operation would ignore the former, and use the latter. It could replace a string of three characters with a 27 character one.
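
Sketching that "two sizes" idea out (names are mine; error handling reduced to a single failure code):

  #include <string.h>
  struct strbuf {
      char   *data;  /* cap bytes of storage */
      size_t  len;   /* bytes currently in use, excluding the '\0' */
      size_t  cap;   /* total storage, including room for the '\0' */
  };
  /* Replace the contents with left+right; refuses (rather than overflows)
     when the result does not fit in the existing storage. */
  static int strbuf_replace(struct strbuf *dst, const char *left, const char *right) {
      size_t llen = strlen(left), rlen = strlen(right);
      if (llen + rlen + 1 > dst->cap)
          return -1;
      memcpy(dst->data, left, llen);
      memcpy(dst->data + llen, right, rlen);
      dst->data[llen + rlen] = '\0';
      dst->len = llen + rlen;
      return 0;
  }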


Null terminated strings are remnants of an era when computers had little memory available. So, at the time it seemed smart to discard the length field and use a single byte-sized terminator (null). If you are writing an operating system for a machine with little memory to spare, this seems like a good decision. Of course things are very different now when memory is not a problem and the goal is safety.


> If you are writing an operating system for a machine with little memory to spare, this seems like a good decision.

Also registers! Especially in syscall interface, consider eg:

  int renameat(int olddirfd,char* oldpath,int newdirfd,char* newpath); /* first example I found that had 2 paths */
If you have registers edx,ecx,esi,edi,ebx available, nul-terminated strings make this fit into:

  edx olddirfd,ecx oldpath,
  esi newdirfd,edi newpath
If you need separate length fields, there simply aren't enough registers:

  edx olddirfd,ecx oldpath.ptr,esi oldpath.len,
  edi newdirfd,ebx newpath.ptr,??? newpath.len


> memory is not a problem and the goal is safety.

People keep repeating this. What about embedded systems? For instance, I have to know how an object is structured and how is allocated, exactly and without surprises. The behavior has to be predictable and as simple and fast as possible. You can (likely) achieve that with C.


As someone who learned C as their first language, strings in every single language after that have felt like cheating.

"What? You mean I can type an arbitrary string and it works? I don't need to worry about terminators or the amount of memory I've allocated? You can concatenate two strings with +?!? What is this magic?"


It always makes me wonder if there's some hidden overhead that I'm absorbing. When I program in C I feel like I know a lot better what the generated instructions will be. Using higher-level languages for embedded programming where resources are tight makes me uncomfortable.


In addition to the overhead from dyn alloc and the GC as someone else mentioned, there is also the size overhead that comes with every object in an OO language. The obj overhead for Java is JVM-dependent, but I believe it to be somewhere around 16 bytes.

A mostly unrelated stackoverflow post I found[0] states that an empty standard string in Java occupies 40 bytes due to the normal object overhead and overhead related to the internal byte array for the char storage. Obviously what you gain in return is convenience in programming as well as runtime-enforced safety from buffer overflows. Whether this is worth it depends on what you're doing.

In general, you're definitely absorbing overhead with any managed lang, although it need not be hidden. The specifics should be documented somewhere for whatever platform you're using, and most GCs are pretty tuneable nowadays.

[0] https://stackoverflow.com/questions/56827569/what-is-an-over...


In cases where performance actually matters, you usually have other tools, like StringBuilder in C#. But in my experience, 99% of code that works with dynamic strings is not performance-sensitive, but is often very correctness-sensitive and readability-sensitive.


Together with the parent comment, this is exactly how I feel every time I use something that is not C (I do mostly embedded). Like: OK, I can use a slice here, but what is a slice? What is it doing that I don't know about? Where's the catch? What does it look like in memory? Is it calling a function when I access it? Is it being copied or referenced? And so on...


Coming from a hardware education and moving to an almost strictly Python career, I simply cannot enjoy the full fanciness of Python as I'm constantly worried about what weird things it's probably doing in the background. Particularly shaken after an incident of creating objects and appending them to a list, and at the end of the loop the list was the last item n times.


Basically every (non-C) string implementation in existence has to do dynamic memory allocation under the hood, which is hidden from the programmer. That’s the overhead.


> When I program in C I feel like I know a lot better what the generated instructions will be.

If you don't know about the Compiler Explorer, definitely check it out: https://godbolt.org


There certainly is --- it's very easy to make accidentally quadratic (or worse) algorithms in languages where data structures automatically resize themselves to their contents.
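
The same trap, spelled out by hand in C: growing a buffer to exact fit on every append copies everything each time, while growing geometrically keeps appends cheap on average (a sketch with my own names; error handling trimmed to the realloc check):

  #include <stdlib.h>
  #include <string.h>
  /* Exact-fit growth: each append may copy all existing bytes, so building
     an n-byte buffer this way is O(n^2) overall. */
  char *append_exact(char *buf, size_t *len, const char *s, size_t slen) {
      char *p = realloc(buf, *len + slen);
      if (p == NULL)
          return NULL;
      memcpy(p + *len, s, slen);
      *len += slen;
      return p;
  }
  /* Doubling the capacity keeps appends amortized O(1). */
  char *append_doubling(char *buf, size_t *len, size_t *cap, const char *s, size_t slen) {
      if (*len + slen > *cap) {
          size_t newcap = *cap ? *cap : 16;
          while (newcap < *len + slen)
              newcap *= 2;
          char *p = realloc(buf, newcap);
          if (p == NULL)
              return NULL;
          buf = p;
          *cap = newcap;
      }
      memcpy(buf + *len, s, slen);
      *len += slen;
      return buf;
  }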


To be honest, with today’s compilers you can’t be all that sure about the generated code. They do some insane tricks under the hood to make naively written code perform well. As per a “famous” blog post, C is a high-level language.


Yeah, every time I decide to play with C for nostalgia's sake, I immediately get hung up on just how painful everything is, especially strings.

I still love C, but I'd do my best not to have to write anything serious with it again.


I think the key is to understand the historical context of C, what it was competing with, and what concerns people writing C had.

Compared to the alternative (straight assembler) at the time as a systems programming language, C is a massive step up.

Also, the UNIX way was independent processes, so the APIs did not need to be thread safe, as there was no threading in the target architectures.

Now given the massive amount of existing C out there from the time of such architectures, you either have to move the API and language on to make it incompatible with existing code, or support the old baggage. The language has kept compatibility, and in this case, the github peeps have deprecated APIs using macros, so it's a reasonable approach.

An alternative approach would be to move the language on, but by its nature it won't be compatible with C, so you give it a new name. You call it things like Go, or Rust, or Swift. These are all C with the dangerous bits removed. It'll be interesting in 40 years' time to see if people are having the same conversation about these languages - 'OMG, how did people write stuff in Rust? It can't cope with [insert feature of distributed quantum computing]. It's really scary'


I wouldn't say that Go is an alternative approach. I mean, what's the difference between Go and Java AOT with Graal? But Rust is truly an alternative to C/C++.


In many ways Go is the safer more productive language for C lovers. There are only a couple major differences between the two languages so I think that it is one of the easier places to convince C lovers to move to. I have met a number of strong C proponents and they all seem to be relatively OK with Go, especially compared to C++, Rust and Python.

The major differences in my head are:

- Garbage Collection

- Interfaces

There are some other more minor things like defer and the built-in generic types but many C programmers already had macros to implement similar things.

So sure, Go doesn't solve all of the problem spaces that C does. But if you are in a problem space where you don't need C (or another low-level systems language), Go may be a middle ground to move C programmers to.


When we talk about a language, we tend to mix together the language itself and its ecosystem/runtime/tooling. For example, C# can be compiled to produce single-megabyte executables, and you could conceivably use it without a GC, but that's very non-standard tooling with poor support. So practically speaking, C# can't do that for most people for most use cases, even if hypothetically the language could.

So currently the issue is you can't write bare-metal stuff in Go, for example the Linux kernel or Arduinos. And you can't get rid of the GC, which is a problem if you have realtime requirements.

Go is 'close enough' to C in many areas like writing server-side code and CLI tools - its GC pauses are short, and even though C could produce an executable in a few KB rather than an MB, that has no practical relevance in those areas.


I think you are agreeing with me. Go fails to solve some problems that C does, but feels fairly comfortable to C programmers. Therefore it works well for C programmers when solving problems that can be solved by Go.


As a low-level programmer, I am not convinced. The Go runtime really confuses the use case.

If I needed a C alternative, rust gives me all (most of) the power of C, plus safety, without all the drawbacks of a heavy runtime.

If I am okay with paying the garbage collection tax, etc, then I would use scala or kotlin. If I'm using a higher level language, then I expect higher level features like exceptions, generics, reflection, etc.

I'm sure Go is a great language with its own use cases, but it's not a C replacement.


I didn't say it was a C replacement, I said it was approachable to C programmers.

Most of the C lovers that I have spoken with find that Rust is far too complicated with templates and lifetimes. Similar arguments can be provided for Scala and Kotlin with their more functional syntax.


> what's the difference between Go and Java AOT with Graal?

The thing that C/Go authors brought in straight from the seventies: utter garbage naming of identifiers.

But then again, the name of the language itself should be warning enough.


This is git, not github.


A better way of looking at it is that functions which expose very simple operations were among the first ones to be placed into the standard library -- and consequentially are the least well thought out.


This is a lot like how in JavaScript you have footguns like the with statement, or in Python 2 where you have Unicode issues, etc. I am sure we could define a new C standard that excludes these functions as obsolete, but the linked header file is a pretty sensible interim solution. C is an old language and it’s kind of amazing that code written 30 years ago can still by and large be compiled by a modern compiler. Ever try to run a 3-year-old React project using today’s React? :)


> in JavaScript you have footguns like the with statement

I've been coding in JS on a daily basis for more than 10 years and today I learned there is a `with` statement in JS.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

Edit: well, seems like it's been deprecated/forbidden since ES5 (2009), so it makes sense I've never seen it.


An old grad school connection of mine wrote a formal semantics for ES5 and couldn't explain what "with" did in words.


This is JavaScript, so there are probably tons of weird edge cases, but basically, within a with (expr) { … } block, everything defined on expr is put in scope:

  with (Math) with (console) log(PI)
It reminds me a bit of Clojure's doto.


Or java’s static import


And me around 20 years - also never even heard of the `with` statement! I think to qualify as a footgun, people actually need to be using it in the real world.


I have seen it in the wild. I first learned of it from Douglas Crockford’s JavaScript the Good Parts. He also had some things to say about the new keyword and prototype inheritance and how we should stop using them. Ironically, while he was dismissed on that suggestion then, VueJS has pretty much implemented exactly what he had in mind in their v3.


It should be called something like appendixgun then because you don't use it but it still has a chance of causing needless suffering and pain.


The appendix is a reservoir of gut bacteria in case you lose yours in an infection or by taking too many antibiotics.


There are lots of more conventional footguns in JS, though - everything about the Date object/class, for example.


Was also popular in ActionScript, the JS cousin that powered dynamic Flash content:

    with (myShape.graphics) {
      beginPath();
      setFill(0xFF0000);
      moveTo(6, 9);
      // ...
    }


It amuses me that HN hates JS so much, that even a topic about problems with C turns into a JS-bashing thread.

Also, I just want to remind you that JS isn't just React. There are plenty of libraries written in C that introduce breaking changes over the course of 3 years. Nothing will stop people from finding ways to complain about JS though, I know. The hate-boner is very real.


I've been a long-time Javascript hater. Probably didn't help that I started out 20 years ago, and dealing with cross-browser support was a big issue. And of course, let's say no more about Internet Explorer shudder. And then NPM - a direct result of JavaScript's anaemic standard library.

Anyway, things have changed a lot, and I recently worked on my first ever web app with native ES6 - no transpiling to ES5! It was... not nearly as bad as it used to be! Modules are a thing, and the language has evolved with things like async/await, evolved for the better, I think. The standard library is still horribly anaemic though - the number of "helper" functions needed is ridiculous.

But still, I would no longer classify myself as a hater. Progress at last :)


If static-typing is your thing, you should give a try to TypeScript. It's easily the biggest game changer that happened to the JS world in the recent years.


There are some JS problems that TypeScript doesn't solve, like Array.prototype.sort and .map, but it's still quite nice.


I appreciate Javascript's LISPy qualities, but it has an inordinate number of footguns and a relative lack of standard, stable libraries. Coming from languages like Java and Erlang that are relatively scrupulous about such things is a bit jarring.

I do like Typescript though, as it adds some really nice ergonomics.


I think in most cases it's probably not hate but a deep, deep love.


JavaScript, LISP under C disguise. No wonder it's "popular" on HN.

Assorted musing : Rust, OCaml under C disguise.


I think most people on HN like Javascript, or at least its idea? I mean, it's a very C-like functional language, especially since ES6 put JS on the right road (for me at least)?


It looks similar, but it isn’t. You can’t blow your stack in JavaScript while in C that’s practically a language feature and a design goal.


Because individual libraries choosing to change quickly is comparable to language stability how? The relevant comparison would be "run a 3y old react app (or a 20 year old website using JS) in a modern browser or interpreter"


Yes, and it would still run fine I guess. I think only eval() changed over time. APIs and so on are still the same except for some Netscape stuff.


Insofar as stdlib for C is a library, I think it’s not the worst comparison.


The string stuff is kind of the original sin, but to be honest almost all programming environments have massive footguns when it comes to times/dates. Python's datetime story is _extremely_ painful to deal with. Try doing .... I dunno, anything apart from getting the current time and doing an ISO format of a Javascript Date object.

I think stuff has kinda gotten better, but while Unicode had emoji to kinda save the day, dates never had this moment and we're still suffering through major messes on a daily basis because of it.


Python's dates are very unlikely to cause quadratic or exponential performance dips, segfaults, or remote code execution vulnerabilities. (And JS now has Date#toISOString, since ES5.)

C's string manipulation functions are a regular source of the worst vulnerabilities in software.

Even if they're in the same category of legacy cruft, they're not even remotely in the same magnitude of consequences.


I think you misread my message, I was saying that "time stuff is messed up everywhere", not "python's time stuff is like C's string stuff". The C string stuff is a mess.

I was also saying that JS dates _can_ generate ISOStrings. But good luck doing any serious manipulation without issues. Hell, there isn't even a `strftime` equivalent for JS dates! And so much stuff ends up going through locales that you can't rely on it for machine transformations.

I would be careful about ascribing the quadratic perf discussion to be a C thing though.... I find loads of "accidentally quadratic" stuff in loads of languages all the time. People are really bad about this (lots of confusion between "this is built-in to the data structure" and "this is cheap").

Anyways, yeah. Strings are uniquely awful. Other C APIs suffer from issues, but I find those issues are on par with other language thing. Granted, it's sometimes _because of C_ that other languages suffer from the issues (by relying on C layers for the logic).


Yeah, there is a culture of complacency in C probably owing to the enormous historical baggage of legacy code that has to be supported and the blurred line between stdlib and system call.


I disagree completely. Devs who use C are the least complacent about security in my experience. The problems are from previous eras before they knew about many of these things. A ton of people in modern languages couldn't name a single dangerous function, though they do exist in every language. You'd be amazed at how many race condition vulns result from TOCTOU errors just in authentication, or checking for the existence of a file before opening it, etc.

It's absolutely true that decades ago the C community was complacent, but it's not true now. Source: I taught secure coding in C/C++ in the 00s.


What you said. Nobody is complacent. Anyone who thinks the Linux or OpenBSD (etc.) kernel developers take the lazy way out is talking about a thing they know little about. I do think better languages than C exist and maybe could even be used as a basis for new systems. But I have yet to see a mature OS that’s as secure and as performant as these. Closest might be the chips I’ve seen that have an embedded Java byte code interpreter.


I agree in principle but think these security-focused C developers are missing the forest for the trees. Every developer having the responsibility of cultivating their own pet list of banned functions is, frankly, NOT the way to achieve security. Those things need to be enforced at the widest level possible (OS, or language) to have the needed effect.


Two ways to look at it. If I told you that your computer or phone could run the same OS and all the same programs but at 1/10th of the speed so that certain classes of bugs could be eliminated, would you make that switch right here and now? I don’t mean theoretically I mean the device you are looking at right now. If not, would you do that in a year?

On the other hand, Moore’s law and all that. Computers will get faster over time so at some point we might not care. And the opposite of my question is also true: if you could switch to a faster OS written in assembly, would you (assuming all functionality stays the same), knowing certain classes of bugs are more likely?

It seems to me that the cost of these kinds of bugs is amortized such that it is cheaper to use C than to switch. Expressed in those terms, we will only switch to a different language for all our systems stuff when the cost of the rewrite and the cost of the performance penalty are clearly and significantly less than the cost of the bugs we are likely to experience.


You're ignoring the fact computer science has advanced since the 1970s when C was created. C is not full of footguns because that's the only way to build a fast language, it's unsafe because it's old and full of legacy baggage. Modern systems languages (primarily Rust, and to a lesser extent Zig) are on par with C in terms of performance, yet eliminates entire classes of potential safety bugs. Rewriting of course has a (major) cost, but I don't think the argument that using C is somehow inevitable in order to get fast code holds any water.


Any runtime security measure produces overhead (array bounds checking, dynamically checked borrow rules like Rust RefCell, etc.), at least in computational cycles. There is no magic formula.

Calculating mandelbrot fractals to measure speed might be a nice exercise in which Rust or Zig can compete with C. But in a real software implementation, when you need to open a file you still have to call the OS function fopen(). Whatever thing File::open (Rust) is doing before calling fopen() is overhead.

How can you avoid that overhead? Write in C (at your own risk).


Compile time security measures have absolutely no runtime overhead. Also, I don’t see what you mean by File::open — there is a kernel call somewhere there. But if you are writing a new OS you are free to implement the fopen call as you wish - C has no advantage here.


Not all bounds checking can be done at compile time, can it? You can’t check if a file exists on a target system before it is opened at compile time, can you?


Benchmarks or it didn’t happen. The last OS project I saw was written in Rust and was twice as slow as Linux. It also required that all your software be written in Rust.

This is why we keep seeing the “X is faster than C” articles: if you use the standard C library in a sort of not great way (sscanf) vs a more intelligent version of the code in another language you will get faster than C results. But on the whole doing less work is always faster. Not doing bounds checking on an array will always be faster than doing bounds checking. How could it not be? No amount of computer science can make bounds checking take negative time.

I am not saying C is magically faster. I am saying that by letting you not do critical safety checks it will be faster. Rust has a similar capability for some things but if your goal is to write unsafe Rust for the sake of performance, then is it worth the switch?


Reminds me of something I once read on Evan Martin's (creator of the Ninja build system) blog [1].

"Underspecifying and overspecifying.

Ninja executes commands in parallel, so it requires the user to provide enough information to get that correct. But at the other extreme, it also doesn't mandate that it has a complete picture of the build. You can see one discussion of this dynamic in this bug in particular (search for "evmar" to see my comments). You must often compromise between correctness and convenience or performance and you should be intentional when you choose a point along that continuum. I find some programmers are inflexible when considering this dynamic, where it's somehow obvious that one of those concerns dominates, but in my experience the interplay is pretty subtle; for example, a tool that trades off correctness for convenience might overall produce a more correct ecosystem than a more correct but less convenient alternative, if programmers end up avoiding the latter. (That could be one reason Haskell isn't more successful. Now that I work in programming languages I see this dynamic play out regularly.)"

[1] http://neugierig.org/software/blog/2020/05/ninja.html


I’m sorry, but that's comparing apples to oranges. A serious OS project is a tremendous multi-decade project, it has not much to do with the implementation language.


With the recent push from Rust community to implement OS kernels in Rust I think it is apt.


It's not really complacency: it's that the standard library is intentionally minimalistic to maintain portability and backwards compatibility. If you want sensible string handling, it's usually best to use a high level utility library like GLib(https://developer.gnome.org/glib/stable/) or Apache Portable Runtime(http://apr.apache.org/), or roll your own safe string type (preferably non-null terminating)


No, if you want sensible string handling, the sane choice is usually to choose to use a language that is not C. Not always, but definitely usually.


It’s not hard to have strings like you do in other languages in C. It is hard when you treat char foo[] as if it was a string object like you have in JavaScript or Java or Python. C strings are just chunks of memory terminated by \0. They can still be mildly useful that way but if you actually want to do string operations you need to use a library designed for the problem (variable length, storing length with the object, Unicode support, etc.). Problem is that most people don’t start with such a library so they end up doing the hard work themselves in an ad hoc manner.

You can’t fuck up String(“Hello “) + String(“world”) but you can definitely fuck up strcat(buf, “Hello “); strcat(buf, “world”);.
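
For illustration, a minimal sketch (buffer size and strings are made up here):

    #include <stdio.h>
    #include <string.h>

    void greet(void)
    {
        char buf[8];   /* too small: "Hello world" needs 12 bytes including the NUL */

        /* The classic footgun -- nothing checks that buf is big enough:
           strcpy(buf, "Hello ");
           strcat(buf, "world");      // writes past the end of buf
        */

        /* A bounded alternative: snprintf never writes more than sizeof buf and
           always NUL-terminates (here the result is truncated to "Hello w"). */
        snprintf(buf, sizeof buf, "%s%s", "Hello ", "world");
    }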


there's nothing inherently unportable about strings though.


Why do you need backward compatibility with a compiled language? Other languages like Rust and JavaScript (even) avoid that with a pragma tag on the source.


Because not everything is recompiled from source. That's why stable ABIs need to exist.


Good point, thanks. Could the headers contain the pragmas?


That assumes you have a header, which only exists at compile time for the developer. The running program knows nothing about it.


Why would a program need to know (e.g.) the details of what system calls or stdlib functions that a procedure it invokes uses? Aren’t C functions pretty well separated from each other except for the odd signal handler and assuming a stable ABI? In my view most of the issues with C are semantics within the function blocks.


The program doesn't "know" anything. The executable has a header used by the loader that tells it to use libc. If the libc version does not contain the symbols that the program expects or the symbols do not have compatible definitions (meaning identical function signature and ABI, including struct layout) the program will probably crash. If the program links against any shared libraries they'll use the same version of libc that's loaded with the executable.

There are ways around this that are varying degrees of acceptable. Versioning libc itself is outside the scope of the language, since it really depends on how the system linkers and loaders are implemented.


The parameters and return value are not in the object files.


c standard library doesn't really relate directly to system calls (at least in modern os'es). In particular, the stdio.h functions are buffered by default, while their system call analogues are not. For unixes, system call wrappers are typically found in <unistd.h>, not the "official" c standard library


I mean on Linux you're not encumbered by this because the syscall api is stable but in practice most GNU/Linux distros assume glibc. You can't correctly resolve a hostname on Linux without farming out to glibc -- hell even the kernel punts to userspace for dns names but you can technically ignore it if you want.

On BSDs and macOS you're always SOL because the syscall api isn't stable and only the C wrappers are.


While it's true that there are a lot of unsafe functions in C, it's not really a mistake. C is a fundamentally unforgiving language. You just have to accept the fact you're driving a naked supercar with no seatbelts.

It's easy to survive: just don't crash. :)

And, functions aside, it's trivial to write a C program that bombs out without calling any functions at all, safe or otherwise.

It's a language from a different era, for sure. Back then no one had the computing power to build Rust. And remember that before C, they were writing Unix in assembly language. So sprintf() was a big step up!


Yeah, because of NUL-terminated strings. They cause so many problems it's not even funny. Even something simple like computing the length of the string is a linear time operation that risks overflowing the buffer. People attempted to fix these problems by creating variations of those functions with added length parameters, thereby negating nearly all benefits of NUL-terminated strings.

Why can't we just have some nice structures instead?

  struct memory {
      size_t size;
      unsigned char *address;
  };

  enum text_encoding { TEXT_ENCODING_UTF8, /* ... */ };

  struct text {
      enum text_encoding encoding;
      struct memory bytes;
  };
All I/O functions should use structures like these. This alone would probably prevent an incredible amount of problems. Every high-level language implements strings like this under the hood. Only reason C can't do it is the enormous amount of legacy code already in existence...
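
A rough sketch of what such an interface could look like (write_text is a made-up name for illustration, not a proposal for a real API):

    #include <unistd.h>   /* write(), ssize_t */

    /* The size always travels with the pointer, so the callee never has to
       scan for a NUL or trust a separately-passed length. */
    ssize_t write_text(int fd, struct text t)
    {
        return write(fd, t.bytes.address, t.bytes.size);
    }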


That would be nice. You hit on the other hell with C strings: modern encodings where wchar_t and mb* are useless and replacements essentially don't exist yet with char8_t, char32_t etc. Then there's the locale chaotic nonsense [1]. A new libc starting fresh would be nice.

1. https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...


> A new libc starting fresh would be nice.

Agreed. I want to make something like this on top of Linux one day. I discarded the entire libc and started from scratch with freestanding C and nothing but the Linux system call interface. Turns out the Linux system call interface is so much nicer.

https://github.com/matheusmoreira/liblinux/blob/master/examp...


If you list the languages you use, I'd be happy to point out the "footguns" in each of them. For all the warts on C, there really is no language that can compete for what it has accomplished over ~50 years.

Recall that during the rise of C, people were writing machine code on punch cards. Assembly -> Machine code has far more footbullets than C, it is a tradeoff between hand holding and tiny fast code.

Wow, this blew up.

To all the people popping off about how great other languages are, tell me: when will we see the Unreal Engine written in Python, or Pascal, or Algol, or Rust, or Go... the next big step is WebASM (or .cu), and that's way more footbullet-y than C. And what is the native language all of your sub-30 year old interpreted languages were written in? Thank you!


Yeah there are footguns in every language. But this is not a boolean question about the presence of footguns, this is about how much one has to know to be able to handle a language safely.

I know C/C#/Python/Rust/Javascript.

After a decade of using C I am still not totally sure if I didn't dangle a pointer somewhere in precisely the wrong way to create havoc. And yeah, that means I have to get better, etc. But that is not the point. The point is that even with a lot of experience in the language you can still easily shoot yourself in the foot and not even notice it.

Meanwhile, after a month of using Rust I felt confident that I didn't shoot myself in the foot, because I know what the compiler guarantees, e.g. ownership. While in C shooting myself in the foot happened quite often, in Rust I would have to specifically find a way to do it without the compiler yelling at me, and quite frankly I haven't found such a way yet.

Javascript is odd, because the typesystem has quite a few footguns in it. This is why such things like Elm or Typescript exist: to avoid these footguns.

I don't want to take away from the accomplishments of C, and I still like the language, but to claim it is equally likely in all languages to shoot yourself in the foot is not true.


This is a grossly inaccurate description of computing at the time of the rise of C. C was competing with Pascal/Modula, BLISS, PL/I, BCPL, and so on, not assembly on punched cards.

The “C competing with assembly” meme was very specific to microcomputer game and operating system development, not more general microcomputer application development, and not to minicomputer or mainframe development.


Mainframes very quickly were outclassed by minicomputers. They could not respond quickly to technology changes as fast. C was indeed king for decades.


C was not without competition on microcomputers, either. A lot of DOS software was written in Pascal, for example - and it wasn't any slower for that.


A lot of it was written in Turbo Pascal, which (among many other things that would have caused Niklaus Wirth to break out in hives) let you include inline machine code (and later, inline assembly language).


That's true, but so did C compilers at the time. And most software didn't actually make use of it - they didn't have to, because Turbo Pascal had e.g. pointer arithmetic as powerful as in C.

But anyway, we were talking about the real world software on microcomputers, not just standards in the abstract. With that in mind, I think TP/BP is a better example of Pascal in the wild than anything Wirth ever made.


Recall that during the rise of C, people were writing machine code on punch cards.

Or Fortran, Algol, Lisp, Cobol, Basic, Pascal, ...


My favorite assembly foot gun was a guy I worked with had a cute routine. You had a call to the routine, followed by a null terminated string after that. The routine would spit the string to the terminal. And then return to the location after the string.

He had some bug where in one place it returned to the start of the string, executed it, and kept going. The end result just happened to be a nop. Had been like that in production for a couple of years.


And when you fix it, everything breaks.


Consider the fact that Simula-67, which predated C by 3 years, had classes and objects very similar to what Java offers (and then some - e.g. coroutines), and a built-in string processing library that used object-oriented syntax.

The reason why C won had little to do with its advantages as a language over the competitors. It just happened to be the systems language for Unix, which was the winner in the early OS wars on microcomputers (for unrelated reasons). Once it became so established, there was a positive feedback loop: you would write portable code in C, because you knew that it was the fastest language that most platforms out there would support. And then any new platform would offer a C compiler, because they wanted to be able to run all the existing C code out there. And so, here we are.


Your edit really isn't helping your case.

Those of us who have always known about less dangerous 'system' languages (Pascal probably being the most popular) lament the fact that so much code got written in C instead.

It wasn't inevitable. It was preventable! It just didn't happen that way for reasons which are largely historical.

I don't work for the Rust Evangelism Strike Force, my main project is written in (as little) C (as possible), but I beg anyone who has a choice: use something else! Rust is... fine, Zig is promising. Ada still works!

Writing out the set {Python, Pascal, Algol, Rust, Go} tempts me to say uncharitable things about your understanding of the profession, but I accept you were just being snarky so I'll just gesture in the direction of how $redacted that is.


> when will we see the Unreal Engine written in...

Why would a huge C++ (not C, btw) codebase with roots going back to the 90s be rewritten in any other language?

And in fact how is the language Unreal Engine written in relevant to C having footguns?


Not that I don't believe there are any, but I'd love to hear your perspective...

Go (golang)


defer having function scope instead of, well, scope scope.

Using defer to unlock locks can lead to some fun deadlocks if you don't realize the issue with the scope, and it's completely unintuitive to someone with experience with other implementations of similar concepts.


channel programming and the races caused by closing channels. channels seem nice and easy until they don’t.

the whole var/:=/= assignment combined with the error handling style and the shorthand is another one


Only close channels when trying to tell the receiver that you're not sending more data. Otherwise let the garbage collector deal with it. Channels seem easy until they don't until they do again in my experience.

Don't understand your second point.


yeah the lack of determinism in selecting a channel can be tricky for causing bugs where order matters. Luckily in smaller cases you're likely to encounter them as flakey tests (eg 1/2 the time)

    select {
    case <-ch1:
    case <-ch2:
    }


thanks for this, i couldn’t articulate it so clearly. Another gotcha is responding to closed channels, it makes sense in the grand scheme of things but when the program grows it gets tricky


Well, shit. Got me there.


There's far more critical code in the world running on COBOL and s3[79]0 assembler. COBOL is vastly more important than C.


citation needed

I'm sure there's a lot of important things that rely on COBOL, but by most definitions of "critical", I think this is way off the mark.


COBOL is still used in many banking systems such as ATMs. These are 'critical' systems by most any definition of the word 'critical'.


Sure. But is there far more critical COBOL than C out there?

The OS kernel for nearly every PC and server on earth is written in C.

Almost every electronic device on earth complicated enough to require software is probably running at least some firmware written in C.

I think those both outnumber ATMs by a hefty margin.


take a look at VMS to behold how elegantly and seamlessly we wrote software in whichever language suited the task most appropriately, barely conscious of the concept of foreign functions being a horror story for future generations instead of being entirely standard and letting us switch from BLISS to Pascal to ADA and C and C++ and DEC BASIC and throw in a few DCL macros or optimize the compilers to which source was of course provided and so that's how we delivered really rather efficient programs, really, in LOC, since nobody was hammering square pegs into semantic round holes or chasing effects when you needed to manage types intrinsically to your critical path routines but didn't need to extend the type system throughout your code base because RDB quaintly would let you call it in FORTRAN and get down to typology from the elements of mathematical sets themselves if you must, nicely isolated and ACID wrapped with system wide cluster transaction management available for everything and everyone's needs and the filesystem designed orthogonal to the database letting you think of your data in a single unified concept with primordial os level guarantee of data integrity for corollary sanity checking


That's a hugely broad definition of critical, enough to encompass most of business and finance software.


but there's a significant albeit rarely seen or sought for ecosystem for managing software on mainframes the source to which is long lost and the authors deceased and the machine code is what you have to work with. you can't describe the work as being attractive, but the tooling which exists is first class and has been keeping the world turning safely for a good half century I'd say in reality...


Nearly everything around you runs code that was written in C, and absolutely nothing you can actually see runs COBOL code.


Which language is z/OS written in?


Just because I want to know people's opinions (+ they may be more than happy to shit on C# :)

C#

But please, nothing about using unsafe.


gmtime is just not thread-safe that's all, since it returns a static structure; gmtime_r is not banned.
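
In other words (a sketch; gmtime_r is POSIX, not ISO C):

    #include <time.h>

    void example(void)
    {
        time_t now = time(NULL);

        struct tm *a = gmtime(&now);   /* points into static storage; a later call,
                                          possibly from another thread, overwrites it */

        struct tm buf;
        struct tm *b = gmtime_r(&now, &buf);   /* caller supplies the buffer: thread-safe */
        (void)a; (void)b;
    }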


Thanks, I am now a decade out of the C game and I was wracking my brain on what the problem with gmtime would be. My best guess was dodgy is_dst portability /shrug


Yeah, found this which explained it for me :)

https://lgtm.com/rules/2154840805/


Many of C's problems relate to string handling. These are all legacy functions which have been replaced with safe alternatives many decades ago.

strcpy() was replaced with a safer strncpy() and in turn has been replaced with strlcpy().

The list is a ban of the less safe versions, where more modern alternatives exist.


strncpy() is not a "safer" strcpy(). It can avoid some errors involving writing past the end of the target array (if you tell it the correct length for that array), but it's not a true string function, and it can leave the target unterminated and therefore not a valid string.

http://the-flat-trantor-society.blogspot.com/2012/03/no-strn...


This is true, and many people don't realize it. I used to call a wrapper function that would always set the last byte to 0.
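
Roughly this (a sketch, not the exact code; the name is made up):

    #include <string.h>

    /* strncpy that always leaves dst NUL-terminated, truncating if needed. */
    char *strncpy0(char *dst, const char *src, size_t n)
    {
        strncpy(dst, src, n);
        if (n > 0)
            dst[n - 1] = '\0';
        return dst;
    }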


I never could really understand the point of strncpy()... we always end up wrapping to deal with writing an unterminated string.

Was it intended for fixed length records?


It is for fixed length records, which is why it also zeroes the remaining space.


Arguably naming it with “str” is itself a security vulnerability.


No argument. At best it is a "string to fixed record" function, hence the name, but it is not a string function.


Yes. strncpy was intended for copying file names into a buffer that was only zero terminated when the name was shorter than the maximum length of a file name in Unix (14 bytes. See https://stackoverflow.com/a/1454071, https://devblogs.microsoft.com/oldnewthing/20050107-00/?p=36...)

You can also use it to overwrite part of an existing string, but I think that’s a side effect of the above.


Yes it was. On early Unix systems, each entry in a directory was bascially:

    struct dir
    {
      char name[14];
      int  inode;
    };
Adding a NUL byte might waste a full byte that could otherwise be used---remember, back when C was first developed, 10M disks were large and very expensive.
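
So the intended use looked something like this (a sketch reusing the struct above; set_entry is a made-up name):

    #include <string.h>

    /* strncpy pads the remaining bytes of name with NULs, and a 14-character
       name intentionally gets no terminator at all. */
    void set_entry(struct dir *d, const char *fname, int ino)
    {
        strncpy(d->name, fname, sizeof d->name);
        d->inode = ino;
    }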


In the interest of satisfying pedantry I think we can agree that strncpy() is intended to be a safer strcpy() for a subset of uses.

As you say, it does in fact obviate some errors. A value judgement as to which behaviors are more or less safe may be subjective, but the intent is not.


strncpy was never intended to be a safer strcpy. It was created for a very specialized use case in the Unix kernel--copying a string-ish identifier to a fixed-size char field that only uses NUL termination if the identifier is shorter than the field size. Because of how the C language and the Unix kernel coevolved, it became part of the standard C library by default. I've seen it used for its original semantics in only a handful of places, but in general it's almost always misused.

To be clear, strncpy does not guarantee NUL termination. It takes a C string as the source argument, but it doesn't write out a C string; it writes out a very esoteric data structure that is unfortunately easily confused with a C string.

By contrast, strlcpy was intended to be a safer string copy routine: https://www.usenix.org/legacy/event/usenix99/full_papers/mil... In particular, it was designed to be what people seem to think strncpy is. Its return value semantics are controversial, though mostly only among the glibc crowd as every other Unix libc, including musl and Solaris, now provides it. But the semantics were designed based on experience in fixing old C code, and observations about how developers tend to write C code, not based on prescriptive theories about how people should manipulate C strings in C code.
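
Which makes the intended call pattern a one-liner to check (sketch, assuming your libc provides strlcpy or you ship the usual tiny implementation):

    #include <string.h>   /* strlcpy on the BSDs, macOS, musl, Solaris; not in glibc */

    void example(const char *src)
    {
        char dst[64];

        /* strlcpy returns the length it tried to create, so one comparison
           detects truncation. */
        if (strlcpy(dst, src, sizeof dst) >= sizeof dst) {
            /* src didn't fit; handle the truncation explicitly */
        }
    }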


Still, unless you're writing something that has to be very low-level all the way through, it's better to use a string-handling library than the stdlib tools for strings.


The first thing you do is not use any strings. You'll be amazed how much you can get done in languages that aren't so obsessively centered around stringified programming.


It was a design decision of QNX that the kernel never uses strings. Everything the kernel handles is fixed length, except messages, and messages go from one user process to another. The kernel does not allocate space for them. I think they got that right.

There's a QNX user process that's always present, called "proc", which handles pathnames and the "resource managers", programs which respond to path names. But that's in user space, and has all the tools of a user-space program.


There are absolutely things that can be written without string handling. Then again, there are things that can't. Not handling strings in the kernel probably was a good decision. That userland I'll bet has string handling though, to be useful to users.


Most of the code I write has a spec of input and output being some form of text. Still, I tend to write that in languages that have safe string handling and drop into C only when the profiler indicates that's useful.

When handling strings in C, it's useful to use the string functions from glib or pull in one of the specifically safe string handling libraries and not use any C stdlib functions for strings at all.

There are a number of C strings libraries safer to use than the standard library, and many of them are simpler, more feature-rich, or both.

* https://github.com/intel/safestringlib (MIT licensed)

* https://github.com/rurban/safeclib (MITish)

* https://github.com/mpedrero/safeString (MIT licensed)

* https://github.com/antirez/sds (BSD 2-clause, and gives you dynamic strings)

* https://github.com/maxim2266/str (BSD 3-clause)

* https://github.com/xyproto/egcc (GPL 2.0, includes GC on strings)

* https://github.com/composer927/stringstruct (GPL 3.0)

* https://github.com/c-factory/strings (MIT licensed)

* https://github.com/cavaliercoder/c-stringbuilder (MIT licensed, does dynamic)

If one does use the C standard library directly for handling strings, the advisories from CERT, NASA, Github, and others should be welcome advice (CERT's advice, BTW, includes recommending a safer strings library right off).


Yes, sure, write Unix CLI plumbing tools without strings.


Until you want to communicate with the user, filesystem, or web.


Why are these functions deprecated in favor of others but not removed? I know in Javascript this can happen so as to not break older websites, but in a compiled language this shouldn't be a problem right?


Removing anything breaks existing source code that has been tested to work. After all just because something may lead to issues it doesn't mean it will always lead to issues.

Also in many systems the C library is linked dynamically and shared among all programs so even though a program is compiled it still relies on the underlying system to provide the function.

Finally, I'm certain that if a C standard removes something, it'll be treated as the equivalent of that standard not existing. C programmers are already a conservative bunch without such changes.


In a compiled language, when you remove a function it fails to compile. So removing them from the standard library forces code changes - they're not usually drop in replacements because the semantics were wrong in the first place.

Removing strcpy would make the Python transition look easy.


The expectation of a C89 programmer is that a valid C89 program can be compiled for any machine that has a C89 compiler, and likewise for C95, C99, C11, and C17. Furthermore, it's expected that any C89 program can be compiled unchanged on any future version of C, and the standard library is part of the definition of the language, and therefore functions cannot be removed.


At a certain point we have to say that it’s wrong for someone to expect C89 should still be the LCD.

And yes: it should all still compile, but none of that prohibits the compiler from issuing flashing red/yellow warning messages to your terminal for using footgun functions, preferably with uncomfortable audible notifications too.

All of this is silly though, because even in a strict C89 environment you can still have your own safe wrappers over the unsafe functions. I find that very little of modern programming has a hard dependency on ultramodern compiler features (e.g. you can theoretically build React/Redux using only ES3 (1998ish) if you like. Generics using type-erasure can be implemented with macros. Etc.).

Also, C89 conformance doesn’t mean much: you can have a conforming C89 system that doesn’t even have a heap - nor a stack for autos! (IBM Z/series uses a linked-list for call-frames, crazy stuff!)


I think for new code in environments that support newer standards C89 shouldn't be used. For the increasingly rare places new C is being written where C89 is the latest tooling available and the code handles strings, a safer string library is nearly a must. I strongly recommend a safer string library no matter which standard, but I'm nobody.

When updating existing code C89 (maybe K&R) might be what's used so minor code changes won't undo that.

I tend to write most of my code in something higher-level than C and only resort to C or assembly in performance-critical sections as found with a profiler. Plenty of general-purpose languages have memory-safe strings built into the language, and honestly I keep hoping the Cisco/Intel safestrings library or something like SDS gets the standard library blessing one day.


> I think for new code in environments that support newer standards C89 shouldn't be used

Why stop there? Don't use C. Use Rust!


Rust doesn't (yet) support all the targets C does. Mostly weird embedded stuff that needs it, and gccrs might solve that problem, but it's not always possible.

When it is possible, I certainly agree that Rust is nicer.


> And yes: it should all still compile, but none of that prohibits the compiler from issuing flashing red/yellow warning messages to your terminal for using footgun functions, preferably with uncomfortable audible notifications too.

As long as it is done like in recent versions of Visual C++ where i can disable that useless compiler output pollution with a #define, usually with a snide remark about Visual C++ right above it.


> disable that useless compiler output pollution

The compiler is trying to help you write better code - suppressing warnings should not be taken lightly.


This is not the same as the regular warnings though, what Visual C++ is doing isn't helping writing better code - it is suggesting to replace standard functions which are available everywhere, in code where I actually know what I'm doing, with functions that are available to Visual C++ and pretty much nowhere else.

As I wrote in another comment, something that may lead to issues isn't the same as something that will always lead to issues - e.g. if I check a string's length or actually calculate and allocate the necessary memory before calling strcpy, it is perfectly fine and safe to use it. But Visual C++ doesn't know about that; it complains like some stupid greenhorn that read somewhere "never use gotos" and then is surprised when he sees some Linux kernel code with gotos everywhere for cleanup, thinking that those people writing the kernel do not know what they're doing.
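
E.g. something like this is perfectly well-defined (a sketch; copy_of is just an illustrative name):

    #include <stdlib.h>
    #include <string.h>

    /* Safe strcpy: the destination is sized from the source first. */
    char *copy_of(const char *s)
    {
        char *p = malloc(strlen(s) + 1);   /* +1 for the terminating NUL */
        if (p)
            strcpy(p, s);                  /* cannot overflow: p is exactly big enough */
        return p;
    }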


The C Standard Committee doesn’t actually ship a compiler the way the people behind Java, Python, Lua, C#, Go, Rust, etc. do. The best they can do is deprecate particular functions and hope compiler writers and standard library writers follow along. But the compiler writers have vocal customers who insist the deprecations are overly cautious.


There are actually very few _dangerous_ functions in C (gets is the only one that comes to mind). Others have massive caveats (strncpy) but still have their place. Others are just known to have certain gotchas (strcpy, strcat, sprintf).

The reality of C is that if we deprecated every objectionable function in the stdlib we wouldn't have anything left.
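
For the record, gets() is the one with literally no correct way to call it (sketch):

    #include <stdio.h>

    void read_line(void)
    {
        char line[256];

        /* gets(line);  -- no size parameter exists, so any input longer than the
           buffer overflows it; that's why C11 removed gets() entirely. */

        fgets(line, sizeof line, stdin);   /* bounded replacement (note: keeps the '\n') */
    }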


> There are actually very few _dangerous_ functions in C

I think you mean there are very few functions that cannot possibly be used correctly (namely gets). Most C functions are dangerous - they can lead to crashes and security vulnerabilities if used incorrectly - but that's just an expected consequence of using a language with no provisions for memory-safety.

> The reality of C is that if we deprecated every objectionable function in the stdlib we wouldn't have anything left.

Somewhat ironically, malloc is actually perfectly safe[0] - using the return value has some issues, but calling it is always[0] fine.

0: Assuming the OS-level memory allocator is sanely configured WRT overcommit, anyway.


It's not great if you're working on a new release and you realize you also need to change something unrelated because the language changed under you, especially if it's just a bugfix but a high-priority one, or consider the head-aches caused by source-only distributions suddenly breaking for all your new users (or existing users switching to a new computer or spinning up a fresh VM).


Why wouldn't it be an issue with a compiled language?

Its nearly the exact same reasoning as "we're not going to break older websites"


In Javascript there's an expectation that Javascript written 15 years ago for Netscape will also work on Firefox 89. Is that also the case with C, wrt compiler versions? I've always assumed it wasn't.


It's very much the case, so long as you stick to standard C (the full limitations of which very few people are actually aware of).

Runtime backwards compatibility is similarly extensive on platforms that care about it. You can still take a DOS app written in ANSI C89 the year that standard was released, and run it on (32-bit) Windows 10, and it'll work exactly the same. In fact, you can do this with apps all the way back to DOS 1.0.


Wow, that's super interesting. Thank you :)


strlcpy() isn't standard. You have to provide your own implementation if you want your code to be portable.


This is something git does. That's why they prefer it - it's available to git everywhere.


It's 4 lines of code to implement it. So even if it is not available on a platform (glibc, mmmh, because of Drepper's stubbornness), it's no problem.

    size_t strlcpy(char *dst, const char *src, size_t dstsize)
    {
        size_t len = strlen(src);
        if (dstsize)
            *((char*)mempcpy(dst, src, min(len, dstsize-1))) = 0;
        return len;
    }


and

     *((char*)mempcpy(dst, src, min(len, dstsize-1))) = 0;
can be replaced by

     ((char*)memcpy(dst, src, min(len, dstsize-1)))[min(len, dstsize-1)] = 0;
if you don't have mempcpy


These still lead to lots of bugs via off by one errors on lengths or other buffer misuse.


What’s scary is programmers assuming any function as being safe. C programmers don’t trust anything, and they’re better programmers for it.


This statement is not even wrong. Good programmers of any language are aware of the footguns in their language and the things their compilers assume. Bad programmers don't.

C has unsafe basic functions because the programs written then were much simpler, and this sufficed. There's decades of PL research resulting in new languages that give better guarantees than C, allowing you to worry less about wrestling with the language and more on your business logic.

> C programmers don’t trust anything, and they’re better programmers for it.

By that token, frontend JS programmers trust things even less than C programmers, and they're even better programmers for it. \s (in reality, FE JS devs mainly wish that browser environments were more consistent and predictable, and would disagree that they are better developers because of it).


That's an unfortunate result of backward compatibility. If it were Python, it would just become v2 and tell people to suck it up.


Notice that it's a giant PITA to work with any variable-length data, because the language lacks adequate means to abstract away safe, fast memory access with generic types, RAII and borrow checkers. Compared to C, both C++ and Rust (very different beasts) feel like pals of JavaScript: basic operations with dynamic strings and arrays just work™.


Well that's input validation for ya. It's not enough to say "give me a string", or even "give me a file path", and then only check that it has ASCII characters. You have to validate that this input could conceivably be a file on this system that someone would use.

"../../../../../../../../../../../../../../../../../../../../etc/shadow" is not a file someone would ever reasonably want to access. But is there an easy way to look for nonsense paths without potentially limiting functionality, or writing more code than you wanted to? Nope.

The same footgun exists in all languages; C's design just has a hair trigger.


That example is a typical "confused deputy" security vuln irrespective of the language, not a "validate input" one. Meaning, the typical unix interface to the filesystem is such that it's hard/impossible to express "I only have access to this folder and am only interested in paths within that folder". `chroot` is too dramatic a form of sandboxing to be used for all use cases.

BTW, in macOS there are "secure bookmarks" (see NSURL docs) that are effectively capability tokens: when user drags a file, or selects it in an Open File dialog (which runs isolated from the app), the kernel creates an app-specific token that grants access to that file to the app, so it can access it beyond its sandbox.


I remember an entire lecture about the use and abuse of sprintf and related functions as a means of exploit. Yeah, when you delve into the internals of C you find things that are terrifying if you are concerned about reliability, security, or performance. The same is true though for many languages. The problem is, as is often the case, the Iron Triangle: good, fast, cheap - pick two. Different sections of the language are written by developers under different constraints and pressures, which leads to different choices. In my experience every language implementation has at least one area that was done quickly for expediency or done poorly because no one else was able to (or wanted to) work on it.


I was hoping that someone would have already pointed out that this is not quite correct phraseology, as the qualifier "old ways to" has to be inserted: e.g. "old ways to get time in GMT". The "new" ways have been around for nigh on 30 years in some cases, and aren't really "new" now. I've been using localtime_r() for almost that long, for example. Coming from another language, don't be fooled into thinking that what you are looking at in these lists is the current state of the art for the language.


C is a well-stocked kitchen of a language.

Unfortunately, it is riddled with sharp knives that can cut you, open flames that can burn you, gas that can smother you, water that can drown you and food that can make you sick if you prepare it incorrectly.

Some react to this potential safety threat by banning the use of knives, stoves, sinks, and food from a kitchen.

Fortunately most attempts at safety just require having a microwave to prepare the frozen pizza or Uber Eats delivery.


Strings are half the reason I never recommend someone learns C as their first language. Python is much easier, and then you can pick up C afterwards when you've got a better baseline knowledge

I've seen the same question posted way too often by beginners to C. "I've created a char*. Why am I getting <random fault> when I try to write to it?"


All this stems from C strings being zero-terminated.

This in turn stems from the dedicated CPU support for working with zt strings that traces all the way back to PDP-11. So what C does here is exactly what it has always been doing - it provides a thin wrapper of the existing hardware functionality.

The variadic arguments are of the same nature - they basically allow for manual call stack parsing, again something that is a level down from the application code.

It's also easy to see how an API like sprintf and scanf came about - someone just got tired of writing a bunch of boilerplate code to print a float with N decimals aligned to the left with a plus sign. So they threw together a function call "spec" (the format string), added call stack parsing support (va_args) and - voila - a beautifully concise print/scan interface. It is a very clever construct, you've gotta give it that.
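
A minimal sketch of that machinery, for anyone who hasn't touched it (sum is just an illustrative example, not from any real codebase):

    #include <stdarg.h>

    /* The callee walks the arguments itself, trusting count to be right --
       nothing checks the types or the number of arguments actually passed. */
    int sum(int count, ...)
    {
        va_list ap;
        int total = 0;

        va_start(ap, count);
        for (int i = 0; i < count; i++)
            total += va_arg(ap, int);
        va_end(ap);
        return total;
    }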

The flip side is that it required people to pay close attention to how they use it, which wasn't that bold of a requirement back then. But as time went on, the average skill of C programmers went down, their use of the language did too, and so more and more people started to step on the same rakes.

So, here we are. Zero-terminated strings are forbidden and va_args calls are nothing short of magic.


I wonder if these headers were applied to the majority of C projects in, let’s say, all projects that were part of a Linux distribution- how much would fail to compile. My guess is: a lot.


Many of the problems with C descend from a common root, the decision to use bare pointers (memory addresses) as the basic way to refer to strings, arrays etc.

If they had used a {pointer, size} pair instead, it would have avoided all of these string problems, most buffer overflows, even the GTA Online loading problem that was on HN recently.


For what it's worth, while what @Camillo says is both true and important, people usually do not mention the trade offs involved or why that decision was attractive at the time.

These days (ptr,size) is probably 16 bytes -- longer than almost all words in the English language (the scrabble SOWPODS maxes out at 15). A pointer alone is 8B. Back at the dawn of C in 1970, memory was 7-8 orders of magnitude more expensive than today (about 1 cent per bit in 1970 USD). (Today, cache memory can be almost as precious, but I agree that the benefits of bounded buffers probably outweigh their costs.)

8B pointers today are considered memory-costly enough "in the large" that even with dozens of GiB machines common, Intel introduced an x32 mode to go back to 32-bit addressing aka 4B pointers. [1] There are obviously more pointers than just char* in most programs, but even so.

Anyway, trade offs are just something people should bear in mind when opining on the "how it should be"s and "What kind of wacky drugs were the designers of language XYZ on?!!?".

[1] https://stackoverflow.com/questions/9233306/32-bit-pointers-...


Pascal, which had sized strings, was in wide use before C. Many people, including Bill Atkinson, who wrote many of the original Macintosh applications, thought C was a step backwards.

Pascal, to save one byte, limited strings to length 255. Bad decision.


> Pascal, which had sized strings, was in wide use before C. Many people, including Bill Atkinson, who wrote many of the original Macintosh applications, thought C was a step backwards.

Sure, but parent wasn't saying "it was not possible", they said "It was too expensive".

And sure enough, the market drifted to the cheaper solution: you could run slightly more applications if your OS and applications were all written in C than if they were written in Pascal, Modula, etc.


If they had used "fat strings" for the standard lib, there would have been at least four different types by now, with 8- to 64-bit lengths. Maybe even with signed char as the length field on some systems, unsigned char on others. Or signed and unsigned for all the int sizes, for a total of 8 types.

I think the sentinel character was the best choice in hindsight and at the time in that regard.

But I wish the xxx_s versions and strdup would have made it into the standard like 30 years ago.


There are no C standard functions, aside from malloc(), calloc() and realloc(), that have to allocate memory to work. I think that's intentional on the part of the C standard library.


strdup would be one in C2x. Maybe they should have done an allocating "stradd" while they were at it ...


What you call "format a string" is actually "begin a perilous expedition into uncharted memory".


To respond to some of the comments.

It is not that there is anything intrinsically wrong with these functions. You can technically use all of them and I have been using all of them, safely, for decades.

The issue is they are huge traps to the point that in a larger piece of software one can say "well, it's just not worth it".

You can go much, much, much further than that.

In couple embedded projects I worked some of the rules were:

* dynamic allocation after application has started is banned -- any heap buffers and data structures must be allocated at the start of the application and after that any allocation is a compile time error,

* any constructs that would prevent statically calculating stack usage were banned (for example any form of recursion except when exact recursion depth is ensured statically),

* any locks were banned,

* absolutely every data structure must have size ensured, in a simple way, beyond any reasonable doubt,

etc.
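
To make the first rule concrete, the code ends up shaped roughly like this (simplified sketch; names and sizes are made up):

    #include <stddef.h>

    /* Everything is sized for the worst case at compile time, so there is no heap
       at all -- on projects like these malloc() isn't even linked in, and any
       attempt to use it breaks the build. */
    #define MAX_CONNECTIONS 16
    #define RX_BUF_SIZE     512

    static unsigned char rx_buf[MAX_CONNECTIONS][RX_BUF_SIZE];

    unsigned char *rx_buffer_for(size_t conn)
    {
        return conn < MAX_CONNECTIONS ? rx_buf[conn] : NULL;   /* no allocation, ever */
    }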


It is interesting to read the rules you came up with to limit memory usage, and then to think of the criticisms one gets in Java for limiting memory usage. In Java we try to limit new as much as possible to prevent the GC from pausing too much, or inconveniently, or for too long. And basically all the rules you say are what we also use in Java.

Except when you have these rules in Java, the ironic counter-point is "if you are doing this much memory control yourself, you should just use C or C++ or something".

I'll keep your comment in mind next time I see that rebuttal. Thank you.


Having almost 20 years of experience with Java... but are not following recent garbage collector developments.

There is a bunch of misconceptions about Java. Java is actually very performant and memory allocation is generally cheaper than in C (except for the inability to make good use of the stack in Java). What's slow about Java is all the shit that has been implemented on top of it, but that's another story for another time.

For example, allocation in Java is basically incrementing the pointer. And deallocation for most objects is basically forgetting the object exists.

No, you don't want to "limit the use of new", that's wrong approach.

What you want is to have objects that are either permanent or last very short amount of time.

The worst types of objects are the ones with a kind of intermediate lifetime, i.e. ones that are allowed to mature out of eden. These cost a lot to collect.

The objects that have very short lifetime are extremely cheap to collect.

So if your function takes arguments, creates couple of intermediate objects and then never returns them (for example they were just necessary for inner working of the function) and your function does not call a lot of other heavy stuff, then it is very likely the cost of those temporary objects will be very low. Also, they tend to be allocated very close to each other and so pretty well cached.


This was very insightful. Thank you.


Well that seems to align pretty well with Entities/Aggregates and ValueObjects in DDD.


Actually I experienced worse restrictions when I was at Siemens writing embedded. Expanding on your list, here are some extras:

- ternary operator("?") was strictly forbidden. One had to use full "if () {..}else {..}" syntax with comments inside each branch even if the branch was empty

- a dynamic array written in an abstract way, when used in a specific project, had to become a constant static one, with the values precalculated and copy/pasted into that project's source. This was a fun one to do maintenance work on years later.

- magic numbers inside code were forbidden. All numbers had to be defined in a specific header, with an explanation of why that number has said value.

- no variable-length parameter lists. All functions had to have fixed parameters

- use of macros kept to a minimum. Code review sometimes wasted 50% of its time arguing over uses of macros that were not already "classic" from the project's point of view

- operator overloading was strictly forbidden. Overloading functions was forbidden too.


All of these except for the first two seem like good rules in general. Was the ban on the ternary operator just a style/readability thing?


Maintenance reasons, so future juniors would not have problems diving into projects directly. Chaining several ternary operators on the same line is great for showing off your C prowess, but a PITA to decipher later, hence the rule.
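
For example, a minimal illustration (made-up names) of why the rule exists:

    /* terse, but easy to misread once it nests or wraps */
    static int pick_rate(int n, int slow, int medium, int fast)
    {
        return (n < 10) ? slow : (n < 100) ? medium : fast;
    }

    /* the mandated style: verbose, but each branch is obvious */
    static int pick_rate_verbose(int n, int slow, int medium, int fast)
    {
        if (n < 10) {
            return slow;        /* small inputs */
        } else if (n < 100) {
            return medium;      /* mid-size inputs */
        } else {
            return fast;        /* everything else */
        }
    }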


Anything enforcing MISRA has essentially (almost) no way of allocating memory at runtime.


MISRA: "Motor Industry Software Reliability Association"

https://en.m.wikipedia.org/wiki/MISRA_C


It’s funny, I worked exclusively with MISRA at the start of my career. Eventually I started a job at a FAANG and received quizzical comments on why I implemented a memory arena.

The argument was to allocate memory freely and let the allocator pool memory as necessary. Fair enough, it was simpler and fit the standard expectation of development.

The issue is that if you talk with the allocator team they complain of not being able to fix performance issues fast enough due to allocations firing off left and right in the middle of a request.

I never realized that my view of C programming is heavily influenced by MISRA until your comment.

I know game engine programming follows a similar, perhaps unspoken, convention.
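
For readers who haven't seen one: a memory arena in its simplest form is just a preallocated block plus a bump pointer. A minimal sketch (made-up names, no alignment handling or thread safety):

    #include <stddef.h>

    #define ARENA_SIZE (1 << 20)

    static unsigned char arena[ARENA_SIZE];
    static size_t arena_used;

    /* Hand out the next `size` bytes, or NULL when the arena is exhausted. */
    static void *arena_alloc(size_t size)
    {
        if (size > ARENA_SIZE - arena_used)
            return NULL;
        void *p = &arena[arena_used];
        arena_used += size;
        return p;
    }

    /* Free everything at once, e.g. at the end of a request or a frame. */
    static void arena_reset(void)
    {
        arena_used = 0;
    }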


Custom allocators are quite common, it's not an arcane convention. I think the rule of thumb is preallocate until it gets questionable in complexity, then write your custom allocator - and really it's only applicable to code with a real-time deadline (hard or soft). Otherwise the system allocator is going to be a lot smarter than yours once it leaves microbenchmarks.


The lack of runtime allocations in game engine programming comes from a different motivation: allocations are expensive, garbage collections are expensive, cache coherency matters, and you're chucking around a lot of very similar looking objects, so... object pools!


Yeah, the first time we coded a scroller shooting game with my friend (at school), we were baffled that our terminal-based scroller lagged more than the raycaster we did two weeks prior. Was it a C vs C++ thing?

Turns out, creating then destroying every single missile/enemy was extremely costly
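
A sketch of the object pool idea the parent mentions (made-up names, fixed capacity, no threading): instead of malloc/free per missile, grab and release slots from a preallocated pool.

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_MISSILES 128

    struct missile {
        float x, y, vx, vy;
        bool  alive;
    };

    static struct missile pool[MAX_MISSILES];

    /* Reuse a dead slot instead of allocating; an O(n) scan is fine for small n. */
    static struct missile *missile_spawn(void)
    {
        for (size_t i = 0; i < MAX_MISSILES; i++) {
            if (!pool[i].alive) {
                pool[i].alive = true;
                return &pool[i];
            }
        }
        return NULL;    /* pool exhausted: skip spawning this frame */
    }

    static void missile_kill(struct missile *m)
    {
        m->alive = false;   /* the slot becomes reusable; nothing is freed */
    }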


How often does the dynamic allocation rule lead to an ad-hoc allocator appearing inside the program?

Also doesn’t the OS lie? I thought the memory wasn’t really physically assigned until first use.


In my experience dynamic allocation is banned in either (a) small embedded environments or (b) high scrutiny environments (soft realtime, safety critical, etc).

In both cases the project size is small enough, or the scrutiny is high enough that the ad-hoc allocator doesn't develop. The environment is also simple enough that the memory cheats you're thinking of don't exist (or you can squash them by touching all allocated memory up front).


Why would you want to implement your own allocator?

The goal of these rules is to improve reliability and timeliness of your application. If you intend on working around those rules to do what the rules explicitly forbid then either you or the rules are wrong.


Suppose your problem intrinsically has variable numbers of variable size records coming and going at runtime. Sure you can allocate a big array upfront. But it seems to me that the code which directs records to slots in this array is an allocator.


> How often does the dynamic allocation rule lead to an ad-hoc allocator appearing inside the program?

You could maybe call file-scope buffers with a size counter a form of dynamic memory allocation, e.g. for storing RS232 or CAN messages, since they shrink and grow.

The important thing is that you want to know that flooding one buffer won't flood another, which malloc could allow if it were used for unrelated buffers.
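
A rough sketch of the kind of fixed-size, file-scope buffer meant here (illustrative only; names, sizes, and the single-threaded assumption are mine):

    #include <stdint.h>
    #include <stddef.h>

    /* Fixed-capacity FIFO for incoming RS232/CAN bytes; no malloc involved,
     * so overflowing it can never corrupt an unrelated buffer. */
    #define RX_CAPACITY 256

    static uint8_t rx_buf[RX_CAPACITY];
    static size_t  rx_head, rx_tail, rx_count;

    static int rx_push(uint8_t byte)
    {
        if (rx_count == RX_CAPACITY)
            return -1;                          /* full: drop or flag an error */
        rx_buf[rx_head] = byte;
        rx_head = (rx_head + 1) % RX_CAPACITY;
        rx_count++;
        return 0;
    }

    static int rx_pop(uint8_t *out)
    {
        if (rx_count == 0)
            return -1;                          /* empty */
        *out = rx_buf[rx_tail];
        rx_tail = (rx_tail + 1) % RX_CAPACITY;
        rx_count--;
        return 0;
    }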


> Also doesn’t the OS lie? I thought the memory wasn’t really physically assigned until first use.

That depends on the OS. Linux lies (overcommits), Windows doesn't. In embedded it's more typical to have a special OS like VxWorks or FreeRTOS that doesn't lie to you, or to have no OS at all (like basically every Arduino project).


Linux doesn't lie; it's just probably not doing what your simplified view of memory allocation assumes.

On Linux, memory allocation is basically assigning a range of addresses that may or may not be backed by pages in physical memory.

This allows doing a lot of interesting and useful stuff.

If you really want the memory for some reason (for example you need to guarantee your operation finishes without running out of memory), you need to touch the pages or force them in some other way (for example using mlock()).

It is just that developers are mostly oblivious to how memory management works on Linux and are then surprised that it doesn't do exactly what they want.

Most people I work with can't tell how much memory is available on a Linux box if their life depended on it.
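
For instance, a sketch of reserving memory up front on Linux (assumes RLIMIT_MEMLOCK allows locking this much; the size is made up):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t size = 64 * 1024 * 1024;
        void *buf = malloc(size);           /* reserves address space */
        if (buf == NULL)
            return 1;

        /* Force the pages into physical memory now; this fails up front
         * (ENOMEM/EPERM) instead of faulting or getting OOM-killed later. */
        if (mlock(buf, size) != 0) {
            perror("mlock");
            return 1;
        }

        /* ... the application now owns `size` bytes of resident memory ... */
        return 0;
    }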


On my machine, which to my knowledge hasn't had this setting changed, ulimit -l reports each process can only lock 64KiB of memory... I feel like "you should use mlock()" isn't really practical advice.


It's an embedded system, it's very likely there's no OS in the first place.


Actually it is irrelevant whether you use an operating system or not.

One project I worked with these rules was on Verix OS.

The rules are intended more to reduce application complexity and unpredictability, which typically helps reliability regardless of the setting.


It was mostly about whether the OS would actually allocate physical memory, or whether it would do Linux-style overcommit. Without an OS, the latter is very unlikely.


Again, Linux DOES NOT overcommit memory.

Linux does not lie about what available memory is. Rather, it is most developers that do not understand how memory is managed on Linux.

What you probably mean is that you don't get physical memory when you run malloc().

That's because when you allocate memory on Linux you allocate virtual address space rather than physical memory.

Basically, you get a bunch of addresses that may or may not be backed by physical pages.

If you want physical memory you just need to use mlock() along with malloc() and you are all fine.


You're redefining overcommit from how everyone else uses it.

https://www.kernel.org/doc/Documentation/vm/overcommit-accou...


If you want to "allocate" (reserve) physical memory just call malloc and then mlock. No lie. You get physical pages or error.

Get over it. What you call "allocation" is two distinct operations on Linux, one of which is unfortunately called malloc in the standard C library, and that's where your confusion comes from.


Overcommitting means providing more virtual memory to be mapped than is available in physical memory. It works by mapping memory as readonly, mapped to a zero page, and "committing" a real physical page when a write occurs to a page for the first time. The act of asking for virtual memory and mapping it is considered allocation. Linux overcommits virtual memory. Some OSes, like Windows, actually commit physical pages to back virtual memory when you try to map new virtual memory.

I think you know these things and it's mostly just a semantics argument, but this is the widely agreed definition.


The stack thing was always the big worry for me. Without a comprehensive static code analysis tool that's hard to do. And runtime stack checking adds quite a bit of overhead, especially if you also have to worry about running on the interrupt stack and possibly switching.


> dynamic allocation after application has started is banned -- any heap buffers and data structures must be allocated at the start of the application and after that any allocation is a compile time error,

how do you ensure that?


One way I can think of is to include the banned.h after you have performed all init processes

(It would have to be in the .c files, not the headers, might not be so clean)
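
A sketch of what such a header could look like, borrowing the same trick git's banned.h uses (this file and its names are hypothetical, not part of git):

    /* no_late_alloc.h -- include it in each .c file after the point where the
     * init allocations are done; any later use of these names becomes a
     * compile error with a telling name. */
    #ifndef NO_LATE_ALLOC_H
    #define NO_LATE_ALLOC_H

    #define LATE_ALLOC_BANNED(func) sorry_##func##_after_init_is_banned

    #undef malloc
    #define malloc(x)       LATE_ALLOC_BANNED(malloc)
    #undef calloc
    #define calloc(n, x)    LATE_ALLOC_BANNED(calloc)
    #undef realloc
    #define realloc(p, x)   LATE_ALLOC_BANNED(realloc)

    #endif /* NO_LATE_ALLOC_H */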


It would be nice if the error messages generated would suggest replacement functions that they deem appropriate. I see that I'm not supposed to use gmtime, localtime, ctime, ctime_r, asctime, and asctime_r; but what do they think I should use?


From the commit messages

> The ctime_r() and asctime_r() functions are reentrant, but have no check that the buffer we pass in is long enough (the manpage says it "should have room for at least 26 bytes"). Since this is such an easy-to-get-wrong interface, and since we have the much safer strftime() as well as its more convenient strbuf_addftime() wrapper, let's ban both of those.

(https://github.com/git/git/commit/91aef030152d121f6b4bc3b933...)

> The traditional gmtime(), localtime(), ctime(), and asctime() functions return pointers to shared storage. This means they're not thread-safe, and they also run the risk of somebody holding onto the result across multiple calls (where each call invalidates the previous result). All callers should be using their reentrant counterparts.

(https://github.com/git/git/commit/1fbfdf556f2abc708183caca53...)
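
For reference, the reentrant pattern the commits point to looks roughly like this (a sketch using plain gmtime_r()/strftime(), not git's strbuf_addftime() wrapper):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t now = time(NULL);
        struct tm tm_utc;
        char out[64];

        /* gmtime_r() writes into a caller-supplied struct, so it is thread-safe */
        if (gmtime_r(&now, &tm_utc) == NULL)
            return 1;

        /* strftime() takes the buffer size, so it cannot overflow `out`;
         * a return of 0 means the result did not fit. */
        if (strftime(out, sizeof(out), "%Y-%m-%d %H:%M:%S", &tm_utc) == 0)
            return 1;

        puts(out);
        return 0;
    }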


Yes, but every hapless user shouldn't have to go searching through a bunch of commit messages to find the suggested replacement. Bad UX.


It seems pretty safe to assume a developer contributing C code to git itself would know how to use git blame (or the GitHub interface for it).


I find it highly backwards that documentation on "what to use instead of X" is in the commit message disabling X. One _might_ do it and might remember to do it, but IMO it makes absolutely no sense for this not to be documented properly in code, as suggested by OP.

By that logic, a non-insignificant amount of (good) comments in code could be removed and people asked to "git blame the code and check out the commit that made it for the documentation". Of course this could be done, but it sounds ridiculous even typing it out.


I disagree. Commit messages exist for the very purpose of adding context to your code base. If you added <complex_function> for something that needs context, sure, MAYBE add a comment, but I really hope to find a few paragraphs disambiguating the problem in the git commit. If I'm _really_ lucky, maybe I find a PR number or Jira ticket reference as well.

If you're truly clueless as to what could be substituted for these commands, then you don't understand why they're banned. So our first step? Figure out why they're banned. And how would we sanely approach this? Probably by checking the commit message for _why that code is there in the first place_. That's a very safe, sane, and not-at-all backwards assumption. After you understand why it's there, a quick google search might help out if the commit message didn't already include information on alternatives.

Lastly, yeah, I totally agree a large amount of GOOD comments should be relegated to the git commits if all they're doing is adding additional context around a complex piece of logic. Comments do not exist to edify a code base in any way other than adding context. They're too easy to let become stale, whereas a git commit will always reference exactly the code you're blaming.

So, I have to really disagree that it's ridiculous or in any way absurd. In fact, I think a lot of code suffers from NOT using git as a way to extend context around a code base. It's SUPER easy with most development environments to select a block of text and blame it. It's so easy that it's almost always my go-to to increase my context of what's been happening around a particular part of the code base.


What if that code is refactored, moved around and changed so many times that it's nearly impossible to find the "documentation" for the line you're interested in? I mean, sure, you could spend a few hours going through commit messages, but wouldn't it be nice if there was a simple comment next to the code that gives the info right away?

Also commits shouldn't be changed so if you want to improve the doc and provide more details, well you can't.


Or changed by an autoformatter.

Granted these concerns are probably less likely to apply to this particular file.

Perhaps they just felt that anyone contributing to git would already know why not to use those/what to use instead (but then there would be no need to ban them).


So, you are tied to Git for eternity to preserve documentation?

Might work in practice for a long time, but Git is a version control system, not a documentation system.


For developer documentation - yes, absolutely!


> By that logic, a non-insignificant amount of (good) comments in code could be removed and people asked to "git blame the code and check out the commit that made it for the documentation". Of course this could be done, but it sounds ridiculous even typing it out.

Yes, exactly. You want to understand how a codebase changed and evolved over time? Git is your friend. If you want the facts of the code today? The source code is your friend. That's why the way the Linux and Git projects store history in their Git repositories makes sense. See also https://news.ycombinator.com/item?id=26348965

Try navigating the Git codebase with a git-blame sidebar (VS Code probably has that somewhere) so you can see the history of the source files. If you wonder why something is what it is, you can check out the commit that last modified it. Or go even further back and figure out the context in which it was first added. If you truly want to understand a change, a git repository with well-written commit messages is a pleasure to understand and dig into.


> Yes, exactly. You want to understand how a codebase changed and evolved over time? Git is your friend. If you want the facts of the code today? The source code is your friend.

100% agree. Though I don't mind if comments also leave historical information about the code. Can't be too much -- there is a delicate balance.

Do note, however, that you said it yourself: If you want the facts of the code today, go to the source code. In my opinion, the "facts of the code of git" are that functions X,Y,Z are "banned", but the code does not tell me why, or what to use instead. It just bans them. I would expect to see something in the code, not (just) in a git commit. It's also not that I can't google these functions (a couple of minutes will answer these questions), or that I should be experienced enough to know why they're evil, it's that it's IMO a reasonable, developer-friendly and good thing to do.


Why make it harder, and why make it impossible to update if there are other suggested alternatives that are available since whenever the commit was made?


> Why make it harder

Because there is no way for a commit message to become outdated or detached from what it talks about, both of which are very much issues with comments.

> why make it impossible to update if there are other suggested alternatives that are available since whenever the commit was made?

Because that doesn't really matter.


> Because that doesn't really matter.

Ok, so maybe rather than have this file we should run “git log | grep BANNED” and build a list of functions from that? Or maybe we could change all error messages to be “go look at the commit history to work out why this happened”.

No? Maybe putting context in source files (or better yet, an error message!) rather than in a side channel like the commit message has value when it comes to understanding and updating, and it won’t be lost under the weight of future commits.


Your source code should describe what the program should do today. It should not contain all the historical artifacts about your source code, as it'll grow too big and unmanageable. Instead, use Git to store temporal information: data about change and the reasoning behind it. Git is basically a timeline, instead of the hard facts of today.

That's why it makes sense to describe the background and reasoning behind a change in a Git commit, instead of inside your source files as comments.


Totally agree, which is why nobody is suggesting adding the background and reasoning behind the change to the source file as a comment.

They are suggesting adding a more informative error, which may include a subset of that background and reasoning. An error message that points you to the functions you should use instead is infinitely more informative than one that says “this is banned. Bye.”


Precisely.


Code is evergreen, whereas a git commit represents a change at a single point in time. It will always be limited by the knowledge the author had available to them.

The commit message from 2020 with suggested alternatives might very well go stale. Does the author go and force a noop commit so they can document new best practice in a new commit message?


> Because there is no way for a commit message to become outdated or detached from what it talks about, both of which are very much issues with comments.

What if they think of another reason why one of the same functions should be disabled?


I think you are confused; this is not for any hapless user. Developers search through and read commit messages all the time.


The UX of using this list is not by manually searching through the list and seeing the reason behind them. You include the file together with the rest of your sources and now you get compilation errors if you try to use them. Can't think of a better UX for banned functions.

Discovering why the thing is banned you only have to do once, if you care. If you're just modifying something quickly and minor in Git, you might not even care why.


Strangely there is no mention of strtok which has a similar issue.


The commits actually do give that info. Take for instance this commit:

https://github.com/git/git/commit/c8af66ab8ad7cd78557f0f9f5e...

It actually gives examples and a lengthy explanation and reasoning behind the ban.


But why put that info in commit message instead of a comment in the file itself?


Because comments can be tedious and get out of sync with the repo. Why not check the git history? I wish more repos could be like this!


> Why not check the git history?

Because that is effort every person who uses the file has to do over and over again, whereas maintaining the file is effort that has to be done once by one person.


Someone here commented to use git blame to find the commit that banned the functions and read the commits. These people making the suggestions.. must hate other people and their time. Also, what if someone.. for example runs a code formatter on the file, making git blame useless? Is it really so difficult to make a manual or explain properly in the comments about what replacements to use?


Try:

  git blame -w -M
This will ignore whitespace and detect moved or copied lines.


git blame --ignore-rev / --ignore-revs-file


It sounds like you want a manual. Personal preference I guess. The maintainers seem to have decided to keep it in the history. It's not like this was ever meant for anything other than git itself.


If all you have is Git everything looks like a commit.


I really wish tooling like this was more common:

https://github.com/eamodio/vscode-gitlens/tree/v11.2.1#curre... (screenshot)

> Current Line Blame: Adds an unobtrusive, customizable, and themable, blame annotation at the end of the current line


Or even in the compile error message itself.


Now that's what a good commit message looks like!


Commit messages like that are common in the Linux kernel project, which is where git came from (though this particular commit message is a bit on the longer side).

It makes more sense if you think of it as an email message justifying why the project maintainer should accept that change, because that's what they were before git even existed. Still today, unless you're one of the Linux kernel subsystem maintainers, you have to convert your changes to emails with git-format-patch/git-send-email and send them to the right mailing list. Even the Linux kernel subsystem maintainers keep writing commits in that style out of habit (and because Linus will rant at them if they don't).


Also, why the functions are banned.


It would be even nicer if it redefined the call to a safe version and then generated a warning message informing the programmer of the substitution.


You can't do that because the semantics are different in most cases.


I love seeing "strncpy" right after "strcpy."

If someone wants some fun, try this:

1. Slurp up all the FOSS projects that extend back to 90s or early 2000s.

2. Filter by starting at the earliest snapshot and finding occurrences of strcpy and friends that don't have the "n" in the middle.

3. For those occurrences, see which ones were "fixed" by changing them to strncpy and friends in a later commit somewhere.

4. See if you can isolate that part of the code that has the strncpy/etc. and run gcc on it. Gcc-- for certain cases (string literals, I think)-- can report a warning if "n" has been set to a value that could cause an overflow.

I'm going to speculate that there was a period where C programmers were furiously committing a large number of errors to their codebases because the "n" stands for "safety."


Meh, most of us understood the sharp edges of strings pretty well. Before, we'd check the length of strings before strcpy; strncpy let us do it without that, and just slap a 0 in if needed. Safe? No. Better? A bit. Do I ever want to do string manipulation again with C? Nope.


Understanding the sharp edges is one thing. Being able to avoid them in practice is another. The history of memory safety problems in C string handling, especially involving strcpy/strncpy, strongly suggests to me that they're unavoidable even for C programmers who are skilled, knowledgeable, and experienced.


As an embedded programmer mostly working with C who considers themselves skilled, knowledgeable, and experienced...

I agree.


Ok, memcpy(dst, src, strlen(src)) it is then!


Yay for errors: it should be memcpy(dst, src, strlen(src)+1). strlen doesn't count the trailing 0. If your dst is not already zeroed you will have an unterminated string.


Should have let me use strncpy then, shouldn't you? ; P


It would be interesting to see the rationale behind these bans, and what the suggested alternatives are. Some are obvious, like `strcpy`, but I can't remember what the problem with `sprintf` or the time functions are.

If you are doing something like `sprintf(buffer, "%f, %f", a, b)`, yes it is tricky to choose the size of buffer frugally, but if you replace that by `ftoa` and constructing the string by hand, you are likely to introduce more bugs.

Edit: as pointed out in another post, you can do git blame to see the rationale for each ban, quite interesting.


The trouble with printf-family functions is their variadic nature. If the arguments don't match the format string, you can wreak all sorts of havoc.

A fun exercise you can do is put a "%s" in the format string, omit the string argument and see what happens to the stack.


That's however relatively easy to verify programmatically, and indeed any recent compiler will complain about that.

I'd say the usual trap is rather the size of the target buffer, because that requires bigger static analysis guns. (I'm ignoring things like "%n", because then you're playing with fire already.)


I think the big three C compilers have pragmas that you can tag printf/scanf-style functions with that will cause the compiler to verify the argument list.


__attribute__((format(printf, x, y))) for GCC:

https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Function-Attribut...

There are options other than printf too.
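
A sketch of applying it to your own printf-style wrapper (the wrapper name is made up):

    #include <stdarg.h>
    #include <stdio.h>

    /* The format attribute tells GCC/Clang to type-check the variadic
     * arguments against the format string: argument 1 is the format,
     * and the checked arguments start at position 2. */
    __attribute__((format(printf, 1, 2)))
    static void log_msg(const char *fmt, ...)
    {
        va_list ap;
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
    }

    /* log_msg("%s", 42);   -- would now trigger -Wformat at compile time */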


There's that, but with sprintf/vsprintf specifically, there's no way to keep it from storing characters past the end of your buffer. For example:

    char buf[2];
    sprintf(buf, "%d", n);
This will happily write to buf[2] and beyond if n is negative or greater than 9.
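
The usual fix is snprintf(), which takes the buffer size, never writes past it, and always NUL-terminates (sketch):

    #include <stdio.h>

    static void format_number(int n)
    {
        char buf[2];
        /* At most sizeof(buf) bytes are written, including the NUL; a return
         * value >= sizeof(buf) means the output was truncated. */
        int needed = snprintf(buf, sizeof(buf), "%d", n);
        if (needed < 0 || (size_t)needed >= sizeof(buf)) {
            /* handle the truncation or encoding error here */
        }
    }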


This was my reaction as well. Banning strncpy just encourages haphazard manual copying.


From the commit message:

If you're thinking about using it, consider instead:

  - strlcpy() if you really just need a truncated but
    NUL-terminated string (we provide a compat version, so
    it's always available)

  - xsnprintf() if you're sure that what you're copying
    should fit

  - strbuf or xstrfmt() if you need to handle
    arbitrary-length heap-allocated strings


strlcpy is safer, but effectively running strlen(src) on every call is a good wtf


I think you're meant to use snprintf instead. It would be great to see documentation on the alternatives!



strlcpy is the safe way, that is used by git.


strncpy doesn't do what you think it does (it is not analogous to strncat). strncpy does not terminate strings on overflow. In C terms, it is not actually a string function and shouldn't be named with `str`.

snprintf or nul-plus-strncat do what you want, but snprintf has portability problems on overflow. Most projects I've been on rely on strlcpy (with a polyfill implementation where not available).


strncpy may be surprising the first time you see it not null-terminate, sure. But if you use the n version of all the string functions (which you should anyway) then it's safe.


snprintf will always terminate the string, and won't overflow the buffer.


sprintf() warnings have gotten pretty sophisticated these days. I discovered GCC's -Wformat-overflow the other day. It complained that the buffer for a date string wasn't big enough; e.g., sprintf(buf, "%04d-%02u-%02u", year, month, day), where year, month, and day are 16-bit shorts, and buf was probably eleven or twelve bytes.

It may actually be a bug that I got the warning, because the range of each input was checked, and I think the compiler is supposed to be smart enough to remember that.


These functions are one of the many reasons why I tend to have a C with some C++ classes dialect I use in my own projects.

std::string needs some tweaks, but it can mostly be treated as a built in and it wipes out a huge set of C string issues.


I have only ever dabbled in C, just to look at other people's code and occasionally when I really needed speed, so I am at what I would call a "Pretty Pathetic" level, able to recognize that I am looking at C.

However, I look at old books on C, and then I look at this list, and I wonder if it would not have been helpful to, after mentioning that a function was banned, suggest what the replacement is, even as a comment.


You're not wrong. But a seasoned C developer looks at this list and nods along. (I'm a little out of practice, but I have war stories for most of these).

It's likely that the authors of this list didn't think the comments would be worthwhile for the audience (git developers).


I wonder how they copy strings with strcpy and strncpy both banned. strlcpy? But it is not conforming to major standards. Or just memcpy with extra code?


Edited: Looks like they have safe alternatives: "

  - strlcpy() if you really just need a truncated but
    NUL-terminated string (we provide a compat version, so
    it's always available)

  - xsnprintf() if you're sure that what you're copying
    should fit

  - strbuf or xstrfmt() if you need to handle
    arbitrary-length heap-allocated strings
"


https://github.com/git/git/commit/e488b7aba743d23b830d239dcc... Yes:

> we provide a compat version, so it's always available


This gets me interested. Link [1] below shows their implementation of strlcpy(). This is a questionable implementation. With strncpy, the source string "src" may not be NULL terminated IIRC. The git implementation requires "src" to be NULL terminated. If not, an invalid read. EDIT: according to the strlcpy manpage [2], "src" is required to be NULL terminated, so strlcpy imposes more restrictions and is not a proper replacement of strncpy.

Furthermore, imagine "src" has a million characters but we only want to copy the first 3 chars. The git implementation would traverse the entire string to find the length first, but a proper implementation only needs to look at the first 3 chars. So, they banned strncpy and provided a worse solution to that.

[1]: https://github.com/git/git/blob/master/compat/strlcpy.c

[2]: https://linux.die.net/man/3/strlcpy


You have found the answer - strlcpy is not a replacement for strncpy at all (it's arguably a safer version of strcpy), and git people didn't invent this, it's the existing BSD strlcpy interface.


Thanks for the confirmation. But my concern remains: they banned strncpy without a proper replacement. In addition, I didn't know the extra restriction of strlcpy until today (I have never used it before because it is not conforming to C99/POSIX). I might have fallen into this trap.


The problem is actually often the opposite: in the real world many treat strncpy as if it behaves like strlcpy. Note that strlcpy is equivalent to:

    snprintf(buf, sizeof(buf), "%s", string);
strlcpy is on track for future standardization in POSIX, for Issue 8, but even as a de facto standard, it exists in libc on *BSD, macOS, Android, Solaris, QNX, and even Linux using musl.

https://www.austingroupbugs.net/view.php?id=986#c5050

But you're correct in that it is not a replacement for strncpy because no code should be using strncpy.


Agreed. It's O(strlen(src)) no matter how little you copy. I guess looping through chars only up to `size` would perform better on average.

I see this `strlcpy` recommended everywhere.


A few days ago people here were discussing the quadratic sscanf() behavior. strlcpy() has the same problem.


Take a step back and consider strlcpy isn't supposed to be a drop in replacement for strncpy (a function which already exists).


strncpy is a dangerous function because it doesn't nul terminate on overflow. The danger is that it's named misleadingly (str* functions otherwise always work in nul-terminated strings).

(strcpy is just banned because there's no bounds check, and they want to force use of strlcpy instead).


strncpy also has the dubious behavior of zero-filling out to N even if strlen(src) is much shorter than N.


memccpy? Most platforms have it, and it's being added to C2X.

See https://developers.redhat.com/blog/2019/08/12/efficient-stri...
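
A sketch of truncation-aware copying with memccpy() (assumes dstsize > 0 and a libc that provides it, which is POSIX and slated for C23):

    #include <string.h>

    /* Copy src into dst (of size dstsize), reporting truncation.
     * memccpy() stops after copying the NUL and returns a pointer just past
     * it in dst, or NULL if no NUL was found within dstsize bytes. */
    static int copy_str(char *dst, size_t dstsize, const char *src)
    {
        char *end = memccpy(dst, src, '\0', dstsize);
        if (end == NULL) {
            dst[dstsize - 1] = '\0';    /* force termination on truncation */
            return -1;
        }
        return 0;
    }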


To its credit, it's convenient that the C pre-processor is so powerful that it facilitates baking a "C the good parts" concept directly into the compilation process.


In compilers that implement GCC extensions (such as Clang), you can use the "poison" directive to achieve the same effect (but with a better error message):

#pragma GCC poison printf sprintf fprintf

[0] https://gcc.gnu.org/onlinedocs/gcc-3.2/cpp/Pragmas.html


I'm an idiot, I read the headline and thought these were banned from Git entirely. As in, you couldn't commit them to any repo using Git, at all. Thought that seemed a bit harsh.

Turns out you just can't use them when you contribute code to the Git project. That makes sense, and seems reasonable.


Critiquing poor code practices is beyond the scope of git at this time


Should be easy to implement, will have a pull request ready tomorrow.

Edit: wait, I can't use strcpy?! Screw that, then I'm not open sourcing my AGI!


Funnily enough, strtok() is not listed :)


This one has my vote for the weirdest library function ever.


The storing of state between calls is beautiful in all its wickedness.


They should probably add sscanf.


First thing I looked for. It looks like it was used here:

https://github.com/git/git/blob/master/object-file.c#L1293

And currently used here (at least):

https://github.com/git/git/blob/master/refs.c#L1235


Some functions are missing which would normally cause a warning with most linters and static security analysis tools (e.g. the atoX family, mktemp, etc.). The problem is most people I know don't run external linters (maintaining good linting rules is hard to scale in larger projects, and in my >3 decades of writing C only a few companies[0] I've seen managed the linting rules as part of their "definition of done").

While I think such rules are a good idea, they only make sense if applied consistently, and their value depends on how religiously the tooling (duct tape and "process") enforces them (even so, you're still only one `#ifdef` away from undoing that "safety"). Having GCC[1] now support static analysis is a killer feature for this type of problem.

On the other end of the spectrum we have Huawei which instead of linting their code is finding creative ways to trick auditing tools and hide such warnings from auditors:

[0] https://news.ycombinator.com/item?id=22712338

[1] https://developers.redhat.com/blog/2021/01/28/static-analysi...

[2] https://grsecurity.net/huawei_and_security_analysis


The Git Mailing List Archive on lore.kernel.org (found in the README from the git mirror on GitHub) has more context [0] [1] [2]. From Jeff King on 2018-07-24:

  The strncpy() function is less horrible than strcpy(), but
  is still pretty easy to misuse because of its funny
  termination semantics. Namely, that if it truncates it omits
  the NUL terminator, and you must remember to add it
  yourself. Even if you use it correctly, it's sometimes hard
  for a reader to verify this without hunting through the
  code. If you're thinking about using it, consider instead:

    - strlcpy() if you really just need a truncated but
      NUL-terminated string (we provide a compat version, so
      it's always available)

    - xsnprintf() if you're sure that what you're copying
      should fit

    - strbuf or xstrfmt() if you need to handle
      arbitrary-length heap-allocated strings
I just did a search on the keywords 'banned' and 'strncpy' [3]

[0] https://lore.kernel.org/git/20180724092828.GD3288@sigill.int...

[1] https://lore.kernel.org/git/20190103044941.GA20047@sigill.in...

[2] https://lore.kernel.org/git/20190102093846.6664-1-e@80x24.or...

[3] https://lore.kernel.org/git/?q=banned+strncpy


Psst:

https://github.com/git/git/commits/master/banned.h

(Git development is done by emailing patches. Those patches include the git commit message, which we can see just by looking at the history of the file. Sometimes there's additional discussion on the ML, but the most important details are in the commit message because the git development team is very disciplined about that.)


Ha, yep, whoops


It would be great if the BANNED() macro could suggest the correct function to use.


The right function may change based on the use case, that's why they may not have wanted to suggest an alternative outright.


You could send a pull request, it doesn’t seem too complicated to implement


The similar list from ClickHouse repository: https://github.com/ClickHouse/ClickHouse/blob/master/base/ha...


Are there some details on what's wrong with these?


The commit messages that added them explain the reasoning


I wish they had put that in comments instead of in the commit messages. It's not the first time I've seen this particular list of banned functions being shared online, and every time it happens someone has to explain that the most interesting info is hidden in the commit messages.


All the string functions have buffer overrun vulnerabilities if not used carefully. I'm not sure about the time functions though.


The time functions are either non-reentrant, or, for the _r versions, have the same problem with buffer overruns.

https://github.com/git/git/commit/1fbfdf556f2abc708183caca53...

https://github.com/git/git/commit/91aef030152d121f6b4bc3b933...


Very much this. I frequently write small games in C, and the number of times I have been bitten by baffling behaviour because a string somewhere was copied into an array that was too short is large! Apart from that, I love the simplicity of the language and the stdlib, and it's definitely my preferred hobby programming environment.

It would be good to know what the commonly-accepted alternatives are.


I'm pretty sure you could google each of these with the word 'dangerous'

For example: https://lgtm.com/rules/2154840805/


At least they didn't ban memcpy()...

Much like with all other forms of effective censorship, I see this as a quick short-term "fix" with hidden long-term costs[1]. IMHO this sort of anti-thinking just leads to even worse, more dogmatic and cargo-cult, programmers who know less and less about the basics and then go on to make even more subtle errors.

Somehow the collective software industry has managed to propagate the notion that people are incapable of doing even basic arithmetic. Yet they think people are capable of creating complex systems with even more subtle behaviour? The justification would normally be because it's not directly affecting security. WTF. It's beyond stupid.

The only C function I think should be truly banned is gets(), because it is actually impossible to calculate what size of buffer it needs. That is not true of any of the others on this list.

[1] By short and long, I mean decades vs centuries.


OK, so no strncpy, strncat etc, what are the alternatives used in git then? I'm a long-time C coder but I do not know what will be used to replace strncpy/strncat and all those gmtime/localtime/ctime/asctime.


Ah this is a very good idea. I guess you still have to make sure that all your translation units include this header, which isn't completely foolproof.

Static analysis would probably be more robust, but way more involved.


Best of both worlds: use static analysis to ensure the header is included?


gcc has a -include option, so this can be done once in the Makefile and get the benefit everywhere (unless you’re being clever).


I remember visual studio having an option to force include a file, surely something like that would exist for other toolchains


You don't need fancy static analysis. You can find out whether the banned functions are called just by inspecting the compiled object file. Add it to the build step and done.



Our forbidden functions header is similar; it's got about 30 functions including most of the str* family to enforce the use of our safer versions.


Is there no linting software that can catch these kinds of issues? Like using strlen with sscanf like I've been hearing about lately?


I am so thankful git isn't forcefully including this header in every C language project and that we have a choice when using git! :-)


I used C many years ago so I’m quite out of it. What are the replacements for these? I would have thought these were all necessary.


Let's use a loophole ;) - (strcpy)(a,b)


Now I am getting really curious whether other companies with supposedly strong engineering knew about sscanf issues.


Just replace strcpy(a,b) with strcpyn(a,b,INT_MAX)

/joke


I'm pretty sure I've seen similar logic in my life.


Been a while, eh?

It should be strncpy(a,b,(size_t)-1)!


SIZE_MAX does exist.


What would be helpful is an explanation of how each function ends up being misused so people can learn from this.



View the git history for the file. Each commit that adds functions has a detailed explanation of what is wrong with the functions.


About 20 years too late. Those should have been moved to a "deprecated" header file decades ago.


Maybe instead of just writing a banned message, it should give the name of the alternative function to use.


Yes, this is right. Any decent C programmer knows that these functions are cursed.


I hope one day to see it rewritten in a safer language.


There's a nice Go implementation of git: https://github.com/go-git/go-git


gets() and scanf() should be on that list due to potential buffer overflow.


why is strncpy banned? what's wrong about it?


Can a C guru provide a TL;DR of why these are bad?


    - strcpy: no bounds check
    - strcat: no bounds check
    - strncpy: does not nul-terminate on overflow
    - strncat: no major issues, probably to force usage of strlcat
    - sprintf: no bounds check
    - vsprintf: no bounds check
    - gmtime: returns static memory
    - localtime: returns static memory
    - ctime: returns static memory
    - ctime_r: no bounds check
    - asctime: returns static memory
    - asctime_r: no bounds check
The str functions all have safer alternatives. The time functions have reentrant alternatives, and/or alternatives that provide a bounds check.


getc?


`gets` would be the ultimate banned C function, I suspect nobody thought it was worth spelling out though.


scanf?


Just banning is not fair; include the alternatives.



