Wouldn't it be a good idea to remove pack support from gzip if it obviously doesn't get enough eyeballs to detect bugs in a timely manner? Poorly tested obscure format support is a goldmine for hackers looking for exploitable code.
A nice middle ground would be to have this unexpected code path require explicit invocation via a command-line flag or an environment variable. This would reduce the attack surface for the vast majority of users while still retaining support (albeit not entirely backwards compatible with existing scripts) for folks who want it.
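Concretely, the gating could sit at the magic-byte dispatch so the rarely exercised decoder is never entered unless explicitly asked for. A minimal sketch of the idea; the --enable-pack flag, the allow_pack variable and the magic values are assumptions for illustration, not gzip's actual code:

    /* Sketch: opt-in gating for a legacy format at the magic-byte dispatch.
       Flag name, option variable and magic values are assumed, not gzip's code. */
    #include <stdio.h>
    #include <string.h>

    static int allow_pack = 0;            /* set by a hypothetical --enable-pack flag */

    #define PACK_MAGIC "\037\036"         /* assumed magic header for pack'd files */
    #define GZIP_MAGIC "\037\213"         /* gzip magic, 1F 8B */

    static int dispatch(const unsigned char *hdr)
    {
        if (memcmp(hdr, GZIP_MAGIC, 2) == 0)
            return 0;                     /* normal, well-tested path */
        if (memcmp(hdr, PACK_MAGIC, 2) == 0) {
            if (!allow_pack) {
                fprintf(stderr, "pack format detected; rerun with --enable-pack\n");
                return -1;                /* refuse instead of entering rarely-used code */
            }
            return 1;                     /* legacy unpack path, explicitly requested */
        }
        fprintf(stderr, "unknown format\n");
        return -1;
    }

    int main(void)
    {
        unsigned char hdr[2];
        if (fread(hdr, 1, 2, stdin) != 2)
            return 1;
        return dispatch(hdr) < 0;
    }

Scripts that genuinely feed pack'd data through gunzip would then have to add the flag, which is exactly the (small) backwards-compatibility cost mentioned above.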
> Poorly tested obscure format support is a goldmine for hackers looking for exploitable code.
IMO the issue is when these code paths are automatically invoked. This is how you get gstreamer running 6502 opcodes for a codec no one has ever heard of when you download a file from the internet.
I don't see how a command line flag or environment variable helps to reduce the attack surface. If an exploit is found, any attacker can add the required flag or environment variable anyway.
Only if the attacker controls the command line or the environment.
Suppose you have a shell script which does something like "blah ... | gunzip > somewhere", where input to the "gunzip" step is under control of the attacker. Requiring a command line flag, or even an environment variable, would be enough to avoid exposing the code in question to the attacker-controlled input.
Usually, it would be too late to add a new command line flag, since people might have scripts which depend on being able to unpack files without passing that flag. In this case however, since it was broken for years and nobody else complained, it's very probable that nobody actually depends on this feature working, so requiring a new command line flag or even removing the feature would not cause many problems.
Not in all cases. For example, if you find a bug in some obscure imagemagick code, you can exploit it just by having the user download a malicious file and look at the download folder in a file explorer that produces thumbnails using imagemagick. If support for that obscure format required explicit enabling, the user would most likely see some generic icon instead of getting their hard disk encrypted.
Except that's not at all related. The 6502 support was added not because it was legacy, but because it was a feature. It wasn't added in 1973 and "kept" for decades.
What of having a pure ANSI C implementation of pack and unpack available?
I, too, am concerned with future data loss due to codec extinction. A solution seems like it’d take the shape of an archive of the codecs in a simple, pure C subset with an actively maintained compiler. Unused code should be refactored out of mainstream tools.
There's no test suite for "pure ANSI C". Even experts tend to write code that relies on undefined behaviour. I'd sooner trust e.g. Java bytecode to remain executable in the future.
What if it were pure “archival C” whereby the project maintainers defined the undefined behaviors? And I don’t think it would need SIMD optimizations and such; it’s not designed for performance so much as to keep code runnable.
It’d probably require some sort of VM specification too; maybe Java is a better choice. Keeping the code “alive” by keeping it in a project when it’s never used doesn’t seem like a strong guarantee that it will still work when you need it.
The C compatibility test comes when you build your kernel, your toolchain, your browsers, your shells, your desktop applications, and run their test suites after each build. ;- )
"ANSI C" is a bit of a misnomer, since a) there have been updates to the ANSI standard which are not considered "ANSI C", and b) all of the updates are sworn to be backward compatible.
> The C compatibility test comes when you build your kernel, your toolchain, your browsers, your shells, your desktop applications, and run their test suites after each build. ;- )
But if we're talking about obscure applications being preserved for posterity, they're not going to get built.
Really I was talking about the other direction though: a way to confirm that a given program is written using only ANSI C. There's no testsuite for that, all you can do is test with the compilers that exist - and then sometimes your program suddenly stops working in the next version of the compiler.
It can’t prove a program is 100% conformant, but it can detect a much broader range of undefined behaviors than regular compilers, and does other things more strictly (like, IIRC, limiting the declarations exposed by standard library includes to what the standard guarantees they expose).
> But if we're talking about obscure applications being preserved for posterity, they're not going to get built.
Well, they don't really need to be, that's the point.
> Really I was talking about the other direction though: a way to confirm that a given program is written using only ANSI C. There's no testsuite for that, all you can do is test with the compilers that exist - and then sometimes your program suddenly stops working in the next version of the compiler.
See the -std=c89 or -ansi flags on GCC or Clang. Another way to prevent other sources of rot is to avoid UB whose behaviour is frequently subject to change. Most UB is defined, at worst, by the constraints of real ISAs (e.g. the behaviour of idiv when dividing INT_MIN by -1 is an exception on x86, but the ARMv7 equivalent simply produces a number), and most UB-altering characteristics are standard across popular ISAs, and new (general purpose) ISAs tend to go with the status quo when in doubt.
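To make that concrete: INT_MIN / -1 overflows and is undefined (as is division by zero), and both cases are cheap to wall off so the result no longer depends on what the host ISA happens to do. A small sketch; saturating to INT_MAX is just one possible policy, picked here for illustration, and it builds cleanly with e.g. gcc -std=c89 -pedantic -Wall:

    /* Guarding the signed-division cases that are undefined in C. */
    #include <limits.h>
    #include <stdio.h>

    static int safe_div(int a, int b)
    {
        if (b == 0)
            return 0;                 /* caller-defined policy for division by zero */
        if (a == INT_MIN && b == -1)
            return INT_MAX;           /* -INT_MIN is not representable; saturate */
        return a / b;                 /* all remaining cases are well defined */
    }

    int main(void)
    {
        printf("%d\n", safe_div(INT_MIN, -1));   /* would trap on x86 without the guard */
        return 0;
    }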
And if all else fails, you can still run your code in an ISA simulator, and compile it with an old compiler, to figure out exactly what it meant in the past.
They won't help you if your code relies on behaviour not defined in the standard, which is almost always what happens in practice.
> Most UB is defined, at worst, by the constraints of real ISAs (e.g. the behaviour of idiv when dividing INT_MIN by -1 is an exception on x86, but the ARMv7 equivalent simply produces a number), and most UB-altering characteristics are standard across popular ISAs, and new (general purpose) ISAs tend to go with the status quo when in doubt.
ISA extensions aren't always so conservative: the x86 family used to have no alignment requirements, but some of the newer SSE-family instructions do. And modern C compilers will optimise code that invokes UB in surprising ways: code paths that do signed integer overflow used to silently wrap around in two's-complement fashion on x86, and long shifts used to behave the way the hardware did, but these days compilers may simply not generate code for such a path, since the behaviour is officially undefined.
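A familiar casualty is the wraparound-style overflow check. In the sketch below, the first function relies on signed overflow, so a modern GCC or Clang at -O2 is entitled to fold it to a constant 0, while the second expresses the same intent without ever overflowing:

    /* Sketch: an overflow test that relies on signed wraparound (UB) versus
       one that stays within defined behaviour. */
    #include <limits.h>
    #include <stdio.h>

    static int will_overflow_ub(int x)
    {
        return x + 1 < x;             /* UB on overflow; the optimizer may treat this as always false */
    }

    static int will_overflow_ok(int x)
    {
        return x == INT_MAX;          /* same intent, no overflow performed */
    }

    int main(void)
    {
        printf("%d %d\n", will_overflow_ub(INT_MAX), will_overflow_ok(INT_MAX));
        return 0;
    }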
> And if all else fails, you can still run your code in an ISA simulator, and compile it with an old compiler, to figure out exactly what it meant in the past.
Sure, but in that case you're not gaining anything from using ANSI C, and might as well stick with whatever the code is written in, or just keep the binaries and run them on an emulator.
> Sure, but in that case you're not gaining anything from using ANSI C, and might as well stick with whatever the code is written in, or just keep the binaries and run them on an emulator.
Not being able to compile and run your code with at most minor changes is an edge case. I frequently run unchanged or minimally changed code from before ANSI C. Sure, if you are definitely going to need a simulator to understand the code, you might as well just use all the original infrastructure; but if you'd like to run it in the interim, C is nice, and sticking to a restricted subset means you'll have fewer things to change if you want to adapt it to a new system.
> Not being able to compile and run your code with at most minor changes is an edge case. I frequently run unchanged or minimally changed code from before ANSI C.
Not my experience. A lot of pre-ANSI code would segfault if you didn't build with -fwritable-strings, for example, and there was certainly talk of removing that option from GCC entirely.
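For concreteness, the idiom -fwritable-strings papered over is writing through a string literal. A minimal sketch; the commented-out line is the pre-ANSI habit, and re-enabling it will typically crash with a modern toolchain:

    #include <stdio.h>

    int main(void)
    {
        char *s = "hello";      /* literal; modern compilers place it in read-only memory */
        char buf[] = "hello";   /* array copy of the literal: this one is writable */

        (void)s;                /* silence unused-variable warnings in this sketch */
        /* s[0] = 'H'; */       /* the old idiom: undefined behaviour, usually a segfault now */
        buf[0] = 'H';           /* the portable replacement */
        printf("%s\n", buf);
        return 0;
    }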
> if you'd like to run it in the interim, C is nice, and sticking to a restricted subset means you'll have fewer things to change if you want to adapt it to a new system.
I wouldn't call C nice, but if you are using it I would stick with the style today's popular compilers and programs use - popularity is the best guarantee of future compatibility. Using "ANSI C" might be understood to mean e.g. using trigraphs to replace special characters, which would probably be a net negative for long-term maintainability (I think some compilers are already removing support?).
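To illustrate the trigraph point: the braces below are spelled as the trigraphs ??< and ??>, which mean exactly { and }. This builds with e.g. gcc -ansi (GCC only translates trigraphs in conforming modes or with -trigraphs), and newer revisions of the C standard have dropped them entirely:

    #include <stdio.h>

    int main(void)
    ??<                           /* trigraph for '{' */
        printf("hello\n");
        return 0;
    ??>                           /* trigraph for '}' */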
This could be achieved by archiving the source tree of the compression software/version used to create the backups, or at least a version note. Nowadays, with version control, everyone can check out and revert to older versions when they need to.
I'm not convinced it's the compression software author's responsibility.
Without running software, no check is made that it still builds. And without a checked build, one day you will realize your machine can't build it anymore, and you will need to understand the algorithm, port it, and compile it yourself just to open a file.