The Git source code audit, viewed as a Rust programmer (litchipi.github.io)
160 points by todsacerdoti on Jan 24, 2023 | hide | past | favorite | 64 comments


Some of these comparisons are a little lackluster - sure, the issues may be relatively easy to reproduce if you translate the C code 1:1 to (unsafe) Rust, but you wouldn't even have such code in the first place if you wrote idiomatic Rust:

- Confusing different integer types is much harder not only because the type checker will complain, but also because you don't really come across anything other than usize/isize unless it's a field in a protocol or something.

- Strings in Rust are not zero-terminated, so that specific example might not even happen at all.

- Integers, pointers and references are not implicitly cast to booleans, so you can't accidentally mix up comparing a pointer to NULL (which is only ever required for FFI calls) and comparing an integer to 0.

- You have proper types to indicate that a function encountered an error, so all the issues stemming from checking the integer return code the wrong way (e.g. interpreting an error value as a return value) don't exist.

- You simply don't have to invoke potentially dangerous functions like `memcpy` because there are abstractions that take care of this safely and more intuitively.

That's not to say that idiomatic Rust code wouldn't come with its own problems, but they tend to be much more concentrated to fundamental logic bugs rather than silly avoidable things like overflows or misinterpreting values.
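To illustrate the `memcpy` point above, here is a minimal sketch of the safe abstractions the standard library provides for copying bytes; the variable names are just for illustration:

```rust
fn main() {
    let src = [1u8, 2, 3, 4];
    let mut dst = [0u8; 4];

    // The bounds-checked stand-in for memcpy: panics if the slice lengths
    // differ, instead of reading or writing out of bounds.
    dst.copy_from_slice(&src);
    assert_eq!(dst, src);

    // With a growable destination there is no manual size bookkeeping at all.
    let mut buf = Vec::new();
    buf.extend_from_slice(&src);
    assert_eq!(buf, src);
}
```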


> - Confusing different integer types is much harder not only because the type checker will complain, but also because you don't really come across anything other than usize/isize unless it's a field in a protocol or something.

For me it's the opposite: usize is only used for indexing arrays, isize only used for memory offsets (which is a really niche use case outside of low level code), and most application code should be written with u32, u64, i32 and i64. Why is that? Because usize has a platform-dependent size and generally the logic of the program should be platform-agnostic.

So you often need to index something based on a u32 (or something) and then there's a cast. How do you avoid it?

Well first off, your own types can implement Index<u32> instead of Index<usize>. But better yet, you can have an Idx type that represents an index, and then implement Index<Idx>. That's seen for example in the petgraph API [0]: you can index a graph with either a node index [1] or an edge index [2]

[0] https://docs.rs/petgraph/latest/petgraph/graph/struct.Graph....

[1] https://docs.rs/petgraph/latest/petgraph/graph/struct.Graph....

[2] https://docs.rs/petgraph/latest/petgraph/graph/struct.Graph....
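A minimal sketch of the newtype-index idea, in the spirit of petgraph's NodeIndex (the `NodeIdx`/`Nodes` names here are hypothetical, not petgraph's API):

```rust
use std::ops::Index;

// Hypothetical newtype wrapping a u32 index.
#[derive(Clone, Copy)]
struct NodeIdx(u32);

struct Nodes<T> {
    items: Vec<T>,
}

impl<T> Index<NodeIdx> for Nodes<T> {
    type Output = T;
    fn index(&self, idx: NodeIdx) -> &T {
        // The single usize cast lives here, not at every call site.
        &self.items[idx.0 as usize]
    }
}

fn main() {
    let nodes = Nodes { items: vec!["a", "b", "c"] };
    let i = NodeIdx(1);
    assert_eq!(nodes[i], "b"); // no cast at the use site
}
```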


Also what’s fun is that when you have usize and u64, which are both the same size on 64-bit systems, you might instinctively cast between them. But this could truncate the u64 on 32-bit systems, and it could truncate the usize on future 128-bit systems!

So neither type implements From/Into the other, the best you get are TryFrom/TryInto where you either have to handle an error or discard it after “proving” overflow can’t happen.

This is unambiguously the correct behavior, but it is sort of frustrating at times.
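What this looks like in practice, roughly (the numbers are arbitrary):

```rust
fn main() {
    let n: u64 = 1 << 40;

    // No From/Into between u64 and usize in either direction;
    // TryFrom forces the truncation case to be handled (or explicitly
    // discarded with expect/unwrap after "proving" it can't happen).
    match usize::try_from(n) {
        Ok(idx) => println!("fits on this target: {idx}"),
        Err(_) => println!("would truncate on this target"),
    }

    // The reverse direction is fallible too (think a 128-bit usize someday):
    let len: usize = 3;
    let as_u64 = u64::try_from(len).expect("usize fits in u64 on this target");
    assert_eq!(as_u64, 3);
}
```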


> So you often need to index something based on a u32 (or something) and then there's a cast. How do you avoid it?

I imagine the nuance here is in the "or something", since that could imply something signed or larger than 32 bits, but for that specific case, I feel like a large amount of code that people write nowadays can be comfortably assumed not to need to run on 16-bit processors. Although `usize` only implements `TryFrom<u32>` and not `From<u32>`, even the standard library documentation for Rust simplifies things by saying that `usize` will be either 4 or 8 bytes (https://doc.rust-lang.org/std/primitive.usize.html), although I understand this isn't a hard contract and could change in the future. I guess it depends a bit on the use case, but I think that most of the time just using `try_from` and panicking with an appropriate error message on error would be fine, and if you want to be extra sure, you can always use something like `const_assert!` (https://docs.rs/static_assertions/1.1.0/static_assertions/ma...) with the `BITS` constant (https://doc.rust-lang.org/std/primitive.usize.html#associate...) to catch it at compile time.
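A sketch of that compile-time check; on recent stable Rust a plain `const` assertion works without the `static_assertions` crate:

```rust
// Fails to compile on any target where `usize` is narrower than 32 bits,
// so the `expect` below can never actually fire there.
const _: () = assert!(usize::BITS >= 32);

fn main() {
    let n: u32 = 42;
    // `try_from` + `expect` documents the assumption at the call site.
    let idx = usize::try_from(n).expect("u32 should fit in usize");
    assert_eq!(idx, 42);
}
```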


Oh, u32 is generally fine exactly because 16 bits is so niche. But you still have a cast between u32 and usize (and .into() isn't that much better). Casts or conversions on each collection access make code hard to read.

So if your application can work with u32 everywhere (on either 32-bit or 64-bit platforms), you should use u32 and impl Index<u32> on your collection, to avoid the cast.

It's only if you need the less common use case of u32 on 32-bit platforms and u64 on 64-bit platforms that you should bother with usize.


>> So if your application can work with u32 everywhere (on either 32-bit or 64-bit platforms), you should use u32 and impl Index<u32> on your collection, to avoid the cast.

This seems strange. Why does Rust not automatically use whatever size index you hand it?


Ultimately it is because the Rust developers decided that there would be no implicit conversions. In a lot of cases implicit conversions are convenient, but many people feel that the complexity just isn’t worth it.


It’s not out of the question that additional `Index` impls for other integer types could be added to the standard library at some point; the issue is discussed often enough. But it’s not trivial to do because some currently valid code could become ambiguous if impls are added, so some sort of a transition would be required.


I'm not sure it's the best policy, but I tend to represent every counter of anything as usize. It doesn't matter if it's a memory map.

I imagine that applications with less addressing space also have less of every other resource, and can count less of everything.


> I imagine that applications with less addressing space also have less of every other resource, and can count less of everything.

That's a fair point, and I've seen here on HN a design approach that everything should have a limit (because programs often explode in unexpected ways, such as degraded performance, when operating outside the programmer's expectations)

But, usize doesn't necessarily capture the right limit, so.. I'm not sure


Well, yes. I favor your program blowing up on integer overflow too, so it's very predictable how it behaves. And if it's not overflow that is the concern here, well, it's not solved by going with another type either.

Almost all the time, usize has a convenient limit that is much larger than needed. On the few exceptions, yes, you have to design your types and deal with type conversions.


It's sometimes good to use a smaller size. Many string/dynamic array libraries e.g. will use 32-bit types for size and capacity even on 64-bit platforms to conserve space, and by that virtue also be more cache-efficient. I think the Zig compiler also explicitly uses 32-bit types in a lot of places, even though it's targeted mostly for 64-bit platforms.


> Because usize has a platform-dependent size and generally the logic of the program should be platform-agnostic.

Fair point, but it really only matters if the overflow is handled gracefully. If it's handled as an unrecoverable error condition, it often 1. doesn't happen during normal operation and 2. can be invoked by a determined attacker, regardless of the exact size. Sure, the exact characteristics might be different, but in the end it doesn't make much of a difference.

Petgraph has a great API, and in general using newtypes over integers for collection indices is a great way to avoid another logic bug: mixing up indices for different collections.


Defining a newtype avoids confusing indexes of, say, a graph and a Vec, but doesn't avoid confusing indexes of two graphs

Rust has a design pattern called lifetime branding (also called generativity), which uses phony lifetimes to prevent, at compile time, confusing indexes of two separate collections of the same type. This can also enable disabling out-of-bounds checks without triggering UB (because, with branding, we can be sure that the index is in bounds at the moment of creating it; essentially we move the bounds check from indexing time to index creation time)
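A minimal sketch of the branding pattern (the `Brand`/`BrandedVec`/`scope` names are made up for this example; real crates like `indexing` are more elaborate):

```rust
use std::marker::PhantomData;

// An invariant lifetime brand: `'id` cannot be unified across two
// different `scope` calls, which is what ties indexes to one collection.
#[derive(Clone, Copy)]
struct Brand<'id>(PhantomData<fn(&'id ()) -> &'id ()>);

struct BrandedVec<'id, T> {
    items: Vec<T>,
    _brand: Brand<'id>,
}

#[derive(Clone, Copy)]
struct BrandedIndex<'id> {
    idx: usize,
    _brand: Brand<'id>,
}

impl<'id, T> BrandedVec<'id, T> {
    // Bounds are checked once, when the index is created...
    fn index_of(&self, idx: usize) -> Option<BrandedIndex<'id>> {
        (idx < self.items.len()).then(|| BrandedIndex { idx, _brand: self._brand })
    }
    // ...so lookups could skip the check (a real crate would use
    // `get_unchecked` here; the sketch keeps the safe call).
    fn get(&self, idx: BrandedIndex<'id>) -> &T {
        &self.items[idx.idx]
    }
}

// The `for<'id>` bound mints a fresh brand per call, so an index from one
// `scope` can't be passed to a `BrandedVec` from another: it won't typecheck.
fn scope<T, R>(items: Vec<T>, f: impl for<'id> FnOnce(BrandedVec<'id, T>) -> R) -> R {
    f(BrandedVec { items, _brand: Brand(PhantomData) })
}

fn main() {
    let out = scope(vec![10, 20, 30], |v| {
        let i = v.index_of(1).unwrap();
        *v.get(i)
    });
    assert_eq!(out, 20);
}
```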

Here's an earlier mention of that [0] (7 years ago), and here's a crate from 3 years ago, indexing [1] but I'm not sure about recent developments on that.

Now, petgraph doesn't use branding for its index types, so if you have two graphs you can confuse their indexes. On the other hand, petgraph was specifically designed so that you can reuse the node and edge numbering across many graphs (so that if you have a subgraph for example, the nodes and edges share the same ids), in this situation it's kind of hard to use branding

There's another pattern for not confusing index types which is to make different index types for each different collection and make the collection work only with that type; this is done e.g. in typed-index-collections [2] - but it doesn't use branding so two collections with the same index type have interchangeable indexes

Anyway right now this stuff is mostly folklore but I wish it were more used.

[0] https://www.reddit.com/r/rust/comments/3oo0oe/sound_unchecke...

[1] https://github.com/bluss/indexing https://docs.rs/indexing/0.4.1/indexing/

[2] https://crates.io/crates/typed-index-collections https://www.reddit.com/r/rust/comments/hr6xcu/announcing_typ...


> For me it's the opposite: usize is only used for indexing arrays, isize only used for memory offsets (which is a really niche use case outside of low level code), and most application code should be written with u32, u64, i32 and i64. Why is that? Because usize has a platform-dependent size and generally the logic of the program should be platform-agnostic.

While that may make it platform agnostic, doesn't that make it non-time-agnostic? What I mean is, I lived through the 16-to-32-bit transition. If Rust had been around then, and you wrote code that used u16, and you tried to use it now, you would probably bump against that 16-bit limit. The platforms grew with time, but the code didn't.

We're starting into the 32-to-64-bit transition (we've been slowly easing into it for years, in fact). Don't lock yourself into 32-bit code (unless the algorithm requires it). You're locking yourself out of the future.


64 bits is huge! We won't be needing more than 64 bits to address memory anytime soon. In fact, current processors use only 48 bits, leaving the rest unused / up for grabs with flags.

And note that the usize that is the index of Vec (and arrays, etc) is not an address but just a counter; it can count up to more than 10^19 elements. You probably won't be alive to see Vecs grow that big, even a Vec<u8>.

The most interesting platform with 128-bit pointers is CHERI, but its address space fits in 64 bits; the other 64 bits are used for a tag, to validate pointers (and prevent certain kinds of buggy memory access that can lead to corruption). If and when Rust supports CHERI it should probably do so with usize = 64 bits, because that's how much you need to count to address a Vec<u8> over the whole address space.


I've often thought a 48-bit architecture might be ideal. 256TB of address space. Use 48-bit posits instead of double-precision floating point for similar precision. A 48-bit pixel could be 16-bit rgb or 12-bit rgba. I've also tried to re-encode the RISC-V instruction set and I think 24-bit opcodes with immediate data following them would be better than what they've got. Here's to 48-bit CPUs! Hahaha.


64 bits is huge... for addressing memory. That's what you were already using usize for, though. What about the other places, where you're using u32? Is 64 bits huge for them? All of them? Will it still be huge for all of them in 10 years? 20?


Indeed, over the decades, the field moved from static weak typing (with a few languages being strong) to dynamic weak typing to strong typing with type inference.

C has not benefited from any of this progress.


Modern C compilers have these features. They just need to be enabled.


Maybe. But old code does not benefit from them. And new code either has to deal with old code or is often written in a style that does not use them. They are opt-in.

More modern languages have certain features enabled by default. They are either mandatory or opt-out.


For what it's worth, if you enable all the useful warnings you'll be warned about all the implicit casts that can cause problems in any modern C compiler. You'll have to specify three or four flags that say "yes I want all the warnings", but you'll know when you do dangerous casts.

Rust, with its mandatory explicit casts, is a lot less strict in this sense: your compiler won't complain because you decided to do the cast yourself, which is why you should remember to be wary of simple casts every time you use them. Nothing wrong with casting constants and upcasting, but you have to double-check because it's easy to introduce bugs this way.

If you want to make full use of Rust's safety features, you may also want to use checked numeric operations rather than the standard operators that may under/overflow.

Rust is a great tool for writing safer programs, but it won't defend you from many if not most DoS attacks if you're not careful


Note that if we're talking opt-in warnings, Rust's linting tool Clippy offers a whole bunch of warnings related to `as` casts. See https://rust-lang.github.io/rust-clippy/master/index.html#as... .


Not sure why this was downvoted. I’m an unabashed Rust apologist and everything in here is reasonable.

`as` casts are generally a code smell, but they’re sometimes the best (or only) way to accomplish a task. This can absolutely make it more challenging to identify correct and incorrect examples at a glance.

And Rust is awesome in that you can use explicit numerical operations that wrap, panic, saturate, etc on overflow. But there is no clippy lint (that I know of) to make sure you use them (which I would love).


> But there is no clippy lint

There actually is! You can activate the lint with a `#[deny(clippy::integer_arithmetic)]` or by using the lint group `restriction`.[0]

  // https://rust-lang.github.io/rust-clippy/master/index.html#integer_arithmetic
  #[deny(clippy::integer_arithmetic)]

  fn main() {
      let (a, b) = (u32::MAX, 1);
      //let c = a + b; // clippy: `error: integer arithmetic detected`
      let (c, carry) = a.overflowing_add(b); // or any of checked_*, saturating_*, unchecked_*, wrapping_*
      println!("{a} + {b} = {c} (carry bit: {carry})");
      // output: `4294967295 + 1 = 0 (carry bit: true)`
  }
Playground link: https://play.rust-lang.org/?version=stable&mode=debug&editio...

[0]: https://rust-lang.github.io/rust-clippy/master/index.html#in...


TIL, thanks!


> Rust is a great tool for writing safer programs, but it won't defend you from many if not most DoS attacks if you're not careful

Just want to stop by and point out "behavior not considered unsafe": https://doc.rust-lang.org/reference/behavior-not-considered-...

I have a simple phrase I use to explain this. Rust tries to give you memory safety. It does not imply nor check for correctness.


It'll be interesting to see where the gitoxide[1] project goes, being a rust implementation of git

[1] https://github.com/Byron/gitoxide/


Gitoxide is useful in other Rust programs that need to interact with git such as Cargo[1] or Helix[2].

[1]: https://github.com/Byron/gitoxide/issues/106

[2]: https://github.com/helix-editor/helix/pull/3890


I wouldn’t hold my breath. The scope of Git is quite large and gitoxide is just getting started. What I’d like to see is the Git core team adopt Rust gradually for certain subprograms.


I don't know how far he intends to take it, but the author of gitoxide (which is just a library, like libgit2) uses it to power a git CLI called gix. As of October it looks like it can do fetches and clones: https://github.com/Byron/gitoxide/discussions/623 .

Since the git CLI is subcommand-based, it wouldn't be too hard to provide replacements for individual git subcommands one at a time. If you replaced git clone, push, pull, commit, checkout, and rebase, that would suffice for 95% of git invocations.


That would fit how git itself evolved. A fair amount of code was in perl and stayed that way for a while before being rewritten in C.


many of the porcelain commands are or used to be in perl/bash


As a full time C programmer, implicit type conversion is probably my least favourite feature of C. Especially compared to other infamous aspects of the language.

For all its drawbacks, unchecked memory access is a defining feature of C. Try to remove it and you end up with a language that is very clearly distinct from C, hence Rust, Java, etc.

I think implicit conversion should just be removed from the language. The ergonomic benefits don't outweigh the harms.


This says again that the right behaviour for int overflow should be to trap, rather than wrap around or go into the weeds with UB.


I think the Rust approach of having different methods for wrapping and checked addition is excellent. If you have two i32, x.wrapping_add(y) [0] is a sum where you have wrapping semantics. x.saturating_add(y) [1] is a saturating sum (when it reaches the max it stays there instead of overflowing), and x.checked_add(y) [2] is a sum that catches overflow.

If you are doing wrapping arithmetic a lot, there's a convenience wrapper (heh), Wrapping [3], so that if x and y have type Wrapping<i32>, then x + y always wraps (like calling wrapping_add on two i32). There's a Saturating<i32> too but it's nightly-only for now.

The only downside of Rust's approach is that if you have x + y on two plain integers, the language itself doesn't specify whether it will panic (and interrupt the execution of the program) or not. By default, debug builds will panic and release builds wrap (on the understanding that unintended overflows are always a bug, but catching this bug on release builds has some performance penalty, so it isn't caught by default on release), but you can control this by setting overflow-checks = true on the release profile in Cargo.toml [4]

I think future versions of Rust should set overflow-checks = true by default even on release builds, but right now it's only a matter of adding a single line on Cargo.toml to enable it, and almost every program should do it. With this on, x + y will panic on overflow if they are i32, will wrap if they are Wrapping<i32> and will saturate if they are Saturating<i32>

[0] https://doc.rust-lang.org/std/primitive.i32.html#method.wrap...

[1] https://doc.rust-lang.org/std/primitive.i32.html#method.satu...

[2] https://doc.rust-lang.org/std/primitive.i32.html#method.chec...

[3] https://doc.rust-lang.org/stable/std/num/struct.Wrapping.htm...

[4] https://doc.rust-lang.org/cargo/reference/profiles.html#over...
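The four flavors side by side, as a quick sketch:

```rust
use std::num::Wrapping;

fn main() {
    let (x, y) = (i32::MAX, 1);

    assert_eq!(x.wrapping_add(y), i32::MIN);   // wraps around
    assert_eq!(x.saturating_add(y), i32::MAX); // clamps at the max
    assert_eq!(x.checked_add(y), None);        // overflow reported as None

    // Opting into wrapping semantics via the type instead of the method:
    let (a, b) = (Wrapping(i32::MAX), Wrapping(1));
    assert_eq!((a + b).0, i32::MIN);           // `+` wraps for Wrapping<i32>

    // Plain `x + y` here would panic in a debug build and wrap in a
    // default release build, per the profile's overflow-checks setting.
}
```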


It's off by default in production because all those overflow checks add up (hah). Rust's first priority has always been low-level/systems code, where these things matter. It (semi-controversially) makes runtime bounds checks despite that priority, because they're so important to memory safety, but with that guard-rail numeric overflows become less important- they can only cause logic bugs, not safety issues, so it makes sense to me that performance would then be prioritized by default


Also, thanks to the language, most bounds checks can be elided in practice with idiomatic code. This isn’t as often the case with arithmetic operations.

I do wish there was a(n optional) clippy lint to warn against non-explicit integer arithmetic operations though.


What do they add up to? You can turn overflow checks on in release mode with -C overflow-checks=on, so it feels like it should be really easy to benchmark the current impact. I couldn't find any benchmarks doing this, though.


I just did a very un-scientific benchmark to test this.

I chose ripgrep as a conservative choice, since most of the time should be I/O anyway.

I grepped for the word "Jesus" in a file that contained 100 copies of the King James Bible (426MB of text). I ran it 100 times and averaged the results. Timed with /usr/bin/time.

--release without overflow checking was 0.0702 seconds on average.

--release with overflow checking was 0.0727 seconds, or about 3.5% slower

That's more than I expected for a program that's I/O bound.


ripgrep author here.

> That's more than I expected for a program that's I/O bound.

Based on your description of your methodology and a reasonable assumption that you have more than ~426MB of available RAM, your benchmark is absolutely not I/O bound. Or at least, not bound by the speed of your disk. In your benchmark, your haystack is almost certainly entirely within memory and served from your operating system's page cache. It's unlikely you're hitting disk at all.

And even if you constructed your benchmark such that every search flushed your cache first and forced re-reading the file from disk, it may or may not be I/O bound. If your disk gives you 6 GB/s read speeds, you gotta do better than a traditional finite state machine to match that speed. (That is, probably SIMD.)


Very true! I wouldn't expect any blocking IO here (except for the first run). I would still expect the majority of the time be spent reading from memory vs CPU bound "number crunching".

I should have been clearer. In my day job we are latency sensitive and we'll refer to excessively hitting main memory as being IO bound.


Your query might be memory bound, but might not be. You've got to attain very high speeds to be memory bound. In your case, you chose a query that likely matches quite frequently, no? If so, it's certainly not going to be memory bound because of the overhead of dealing with all those matches.

Try a literal that doesn't match or matches very rarely and you'll have a better shot at being limited by memory bandwidth. (But still maybe not.)


You might consider trying a harder regex for a benchmark like this. The problem with 'Jesus' is that it's a simple literal and is basically the best case for a tool like ripgrep. It won't even touch the regex engine and will instead stay in highly optimized SIMD code that is unlikely to be impacted by whether overflow checks are enabled or not. (Because it's already optimized to produce good codegen.)

Try something like '\w+\s+\w+' instead. That will force the regex engine to run and you might see more impact from overflow checks. (But I'm just guessing.)


Can you say the difference in the binary program size?


I'm sure it varies wildly by the type of program, and I'm sure much smarter people than me on the Rust team did their benchmarking homework when they originally made the choice. Just saying that to me, it seems like a reasonable default


It seems like a seriously wrong default for a language whose story is supposed to be safety. People wanting unsafe languages already have C and C++ for that :/.


To improve safety at large, Rust has to displace C++. People use C++ in part because of its speed (rightly or wrongly - but it is undeniably part of C++ culture). A language which is consistently slower than C++ will not displace C++. So, this was a point where the designers decided to hold their nose for now.



Of course the int overflows can cause safety issues. We're talking about Git CVE's regarding exactly that, after all. It's true that the int overflow checks incur overhead (has anyone measured it?). RISC-V missed the boat in not giving a hardware trap for it.


Indexing into an array or vec will always be checked in safe Rust, so if an overflow gives you a bad index you'll just get a panic (or an Option::None if you decided to access it that way); never a safety issue
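The two access styles mentioned, sketched briefly:

```rust
fn main() {
    let v = vec![1, 2, 3];

    // Fallible access: out-of-range just gives None, no panic.
    assert_eq!(v.get(10), None);
    assert_eq!(v.get(1), Some(&2));

    // `v[10]` would panic with "index out of bounds": a crash,
    // not a memory-safety violation.
}
```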


Indexes are only checked for being outside the array, not for being wrong. Getting a[3] when you wanted a[5] is a bug, potentially a serious or dangerous one. SPARK Ada goes to great lengths to statically verify that there are no int overflows in the code that it checks, and Ada itself specifies trapping overflow, though as an "optimization", the overflow check is often disabled. OCaml also checks overflow for its machine int type. GHC doesn't, and that seems like a mistake.


"Safety" means a specific thing here, and whatever bugs might happen with this, they do not make the code "unsafe"

It also helps that, in practice, when writing idiomatic Rust you'll almost always be accessing array/vec members using an iterator. Direct indexing is uncommon, which means indexing bugs specifically are very uncommon. Of course that's not the only bug that overflows can cause, but it's a big one.

Not every bug can be prevented by the language (at compile time or runtime). You have to make some choices about which things are just in the programmer's hands, and I think Rust has struck a reasonable balance. Especially since the user can always choose differently in this case, and just compile their code with `overflow-checks = true`


I agree from a language standpoint but trapping overflows by default would be quite costly. If you get the branch predictor involved with every addition or multiplication then you'll very quickly slow down your program, especially in the hot code paths. I think most general purpose programs don't run into overflow anymore these days, and the performance benefit probably outweighs the safety for most people.


I think the opposite is true, most software projects benefit from safety and not performance


Most programs should be written in HLL's with unbounded integers and garbage collection. There are a few programs that need machine arithmetic and manual memory management, for performance or whatever reasons, and Rust is intended to allow writing those programs safely. So it seems like a mistake to not treat int overflow as a serious error.

Idk if branch prediction or whatever would have to be involved. Most architectures will already trap on divide by zero or things like that, and will set a condition code for int overflow. Making the condition code sticky like it is for IEEE floating point arithmetic would allow doing a software trap once a basic block finished. It is a shame that the current trendy architecture (Risc-V) goes the opposite way and has no condition codes at all.


Swift does this. Bare operators like + and * trap (crash) on overflow. You can use &+ and &* for 2s complement wrapping.


I like how this highlights the pitfalls of using `as` to cast between types. The article says to prefer using `try_into()` instead of `as`, which is a good suggestion when available, but `as` lets you cast between floats and ints, while `try_into` is not implemented for these type conversions. I think the reasoning is it is unclear `try_into` should fail when it had to round a decimal number down or only when the float was something above the bounds or `NaN`. Regardless it seems like a missed opportunity in the standard library.

https://play.rust-lang.org/?version=stable&mode=debug&editio...
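For reference, float-to-int `as` casts have defined saturating behavior (since Rust 1.45), which is part of why the semantics of a hypothetical `try_into` are debatable:

```rust
fn main() {
    // Float-to-int `as` saturates instead of being UB:
    assert_eq!(300.0f32 as u8, 255);  // clamped to u8::MAX
    assert_eq!(-1.0f32 as u8, 0);     // clamped to u8::MIN
    assert_eq!(f32::NAN as u8, 0);    // NaN becomes 0
    assert_eq!(2.9f32 as u8, 2);      // fraction truncated toward zero

    // For int-to-int conversions, the fallible TryFrom is available:
    assert!(u8::try_from(300u32).is_err());
}
```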


For floats you do end up having to use `as` which is unfortunate. However the amount of things that can go wrong is too vast for `TryInto`.

The article also says "If you upcast u32 to u64, you can use the keyword as". I would avoid this. In these cases you can just use `.into()`. This is infallible and will raise errors if refactoring causes that conversion to no longer be infallible. (In which case you can fix it or migrate to `.try_into()`)

I think the floating point case is similar. Keep `as` super suspicious, but sometimes you do want to do it.
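The `.into()` upcast the parent describes, as a small sketch:

```rust
fn main() {
    let x: u32 = 7;

    // Infallible widening: `.into()` instead of `as`.
    let wide: u64 = x.into();
    assert_eq!(wide, 7);

    // If a refactor later narrowed `wide` to u16, this `.into()` would
    // stop compiling (there is no From<u32> for u16), whereas `x as u16`
    // would silently truncate.
}
```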


Also, in generic code write try_into() and don't sweat it, for the cases which can't actually fail in the monomorphised results the error type is Infallible, that's an Empty Type (an enum with no variants, as distinct from a Zero Size Type which is a struct with no members) and so no error handling code is emitted.

The compiler can't even imagine what the code should look like, after all this type can't exist, does it fit in a register? How do we perform operations on it? If we try to actually realise an empty type by e.g. dereferencing a pointer to it that somebody gave us from a C API the Rust compiler will reject that as nonsense.
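A small sketch of the `Infallible` point (the `parse` function here is a made-up stand-in for any conversion that can't fail):

```rust
use std::convert::Infallible;

// Hypothetical conversion whose error type is the empty Infallible enum.
fn parse(s: &str) -> Result<String, Infallible> {
    Ok(s.to_string())
}

fn main() {
    let v = match parse("hi") {
        Ok(v) => v,
        // An empty match on an empty type: no arms needed, and the
        // compiler emits no code for this branch.
        Err(never) => match never {},
    };
    assert_eq!(v, "hi");
}
```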


cargo clippy is good at yelling at you for using `as`

I generally recommend starting with clippy set to pedantic and then disabling all the lints that are too pedantic (or that you don't agree with; some lint style choices are arguably not great, but sticking with them is still the best option).

Anyway, `as` is one of those "it's not perfect but we need something now" design choices, and most of its uses now have better alternatives.

So maybe in the far future we will even see `as` deprecated and linted against in rustc without clippy.


thank you for sharing this post. :)

can somebody give an example for the line:

  int len = slen;      // Here if slen is too big, it loops backs to 0
what happens if slen is too big? all bits in len are zero? (sorry for this noob question btw, just started with c programming) :)


Yes. Imagine you have a 2-bit integer. `11 + 1` in binary is `100`. The last `1` does not fit so only the remaining `00` gets included. The same applies for any N-bit integer.
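A concrete Rust sketch of the same effect (the C snippet assigns into a signed int, where overflow behavior is murkier; the unsigned Rust cast below just keeps the low bits, which shows the wrap the parent describes):

```rust
fn main() {
    let slen: u64 = u64::from(u32::MAX) + 1; // 4_294_967_296
    let len = slen as u32;                   // high 32 bits discarded
    assert_eq!(len, 0);
}
```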


thank you - got it this way! :)



