I'm not criticizing the programmer's effort, but whenever I see a new project billed as a "drop-in replacement" and/or "50% to 3x faster", the first thing I always look for is the test data that verifies new bugs weren't introduced.
I went to the test subfolder:
https://github.com/klauspost/compress/tree/master/testdata
I saw 3 files.
The kind of test data I'd like to see for something like a compression library would be:
0 byte files
1 byte files
2,147,483,647 byte file
2,147,483,648 byte file
2,147,483,649 byte file
(to check for bugs around signed 32-bit integers)
4,294,967,295 byte file
4,294,967,296 ...
4,294,967,297 ...
(to check for bugs around unsigned 32-bit integers and see if 64bit was used when necessary for correctness)
Also add files of sizes that straddle the boundaries of the algorithm's internal block buffers (e.g. 64 KB or whatever). Add variants of the above files filled with all zeros (0x00) and all ones (0xFF). I'm sure I'm forgetting a bunch of other edge cases.
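As a rough sketch of what those round-trip checks could look like (assuming a hypothetical 64 KB internal block size, and using the standard library's compress/flate as a stand-in for the library under test), something table-driven like this would cover the fill patterns and the buffer-boundary sizes:

    package compresstest

    import (
        "bytes"
        "compress/flate"
        "io"
        "testing"
    )

    // roundTrip compresses buf and decompresses it again, failing the
    // test if the output differs from the input.
    func roundTrip(t *testing.T, buf []byte) {
        t.Helper()
        var compressed bytes.Buffer
        w, err := flate.NewWriter(&compressed, flate.DefaultCompression)
        if err != nil {
            t.Fatal(err)
        }
        if _, err := w.Write(buf); err != nil {
            t.Fatal(err)
        }
        if err := w.Close(); err != nil {
            t.Fatal(err)
        }
        out, err := io.ReadAll(flate.NewReader(&compressed))
        if err != nil {
            t.Fatal(err)
        }
        if !bytes.Equal(buf, out) {
            t.Fatalf("round-trip mismatch for %d bytes", len(buf))
        }
    }

    func TestBufferBoundaries(t *testing.T) {
        const block = 64 * 1024 // assumed internal block size; adjust to the real one
        sizes := []int{0, 1, block - 1, block, block + 1, 2*block - 1, 2 * block, 2*block + 1}
        for _, size := range sizes {
            for _, fill := range []byte{0x00, 0xFF} {
                roundTrip(t, bytes.Repeat([]byte{fill}, size))
            }
        }
    }

The multi-gigabyte cases would need to stream through an io.Reader rather than hold the whole buffer in memory, but the shape of the test is the same.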
The programmer may have done a wonderful job and all the code may be 100% correct. Unfortunately, I can't trust a library replacement unless I also trust the test bed of data that was used to check for defects. It's very common for performance optimizations to introduce new bugs so there has to be an extensive suite of regression tests to help detect them. Test data for bug detection has a different purpose than benchmark data showing speed improvements.
Those multi-gigabyte files are not git-repository friendly, so perhaps a compromise would be a small utility program that generates the test files as necessary.
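A minimal sketch of such a generator (the file names are made up, and the sparse-file shortcut only covers the all-zero variants; the 0xFF-filled files would need an actual buffered write loop):

    package main

    import (
        "fmt"
        "log"
        "os"
    )

    // Sizes straddling the signed and unsigned 32-bit boundaries.
    var sizes = []int64{
        0, 1,
        2147483647, 2147483648, 2147483649, // around int32 max
        4294967295, 4294967296, 4294967297, // around uint32 max
    }

    func main() {
        for _, size := range sizes {
            name := fmt.Sprintf("zeros_%d.bin", size) // hypothetical naming scheme
            f, err := os.Create(name)
            if err != nil {
                log.Fatal(err)
            }
            // Truncate extends the file with zeros; on most filesystems this
            // creates a sparse file, so no multi-gigabyte write is needed.
            if err := f.Truncate(size); err != nil {
                log.Fatal(err)
            }
            if err := f.Close(); err != nil {
                log.Fatal(err)
            }
        }
    }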
I agree he should use more test cases, but it looks like he used the same ones as Go's compression tests [1]. Note that he has other test cases in the subdirectories (also the same as Go's) [2].
However, I believe even better testing would be to fuzz it for a few hours [3].
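Whatever fuzzer [3] points at, the idea can also be sketched with Go's built-in fuzzing (Go 1.18+); again this uses the standard compress/flate as a stand-in, and you'd swap in the replacement library's writer/reader:

    package compresstest

    import (
        "bytes"
        "compress/flate"
        "io"
        "testing"
    )

    func FuzzFlateRoundTrip(f *testing.F) {
        // Seed corpus: empty input, text, and a buffer spanning a block boundary.
        f.Add([]byte{})
        f.Add([]byte("hello, world"))
        f.Add(bytes.Repeat([]byte{0xFF}, 70000))

        f.Fuzz(func(t *testing.T, data []byte) {
            var compressed bytes.Buffer
            w, err := flate.NewWriter(&compressed, flate.BestSpeed)
            if err != nil {
                t.Fatal(err)
            }
            if _, err := w.Write(data); err != nil {
                t.Fatal(err)
            }
            if err := w.Close(); err != nil {
                t.Fatal(err)
            }
            out, err := io.ReadAll(flate.NewReader(&compressed))
            if err != nil {
                t.Fatal(err)
            }
            if !bytes.Equal(data, out) {
                t.Fatalf("round-trip mismatch: %d bytes in, %d bytes out", len(data), len(out))
            }
        })
    }

Running "go test -fuzz=FuzzFlateRoundTrip -fuzztime=2h" would let it churn for the "few hours" suggested above.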
>It's a stream-oriented format. There are no file-global offsets or lengths. There are some relative lengths and offsets, but they're going to be 8 or 16 bit values. Most of the tests you're suggesting don't look particularly useful.
I don't think the streaming-vs-file distinction is relevant here. The deflate stream may have no global knowledge of the eventual file size, but the gzip container does: its trailer stores a CRC-32 and the uncompressed size modulo 2^32 (the ISIZE field), which is exactly where 32-bit size bugs can hide.
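To make that concrete: per RFC 1952, the last 4 bytes of a gzip member are ISIZE, the uncompressed length modulo 2^32. A throwaway sketch that prints it (assuming a single-member .gz file):

    package main

    import (
        "encoding/binary"
        "fmt"
        "io"
        "log"
        "os"
    )

    func main() {
        // Assumes a single-member .gz file given as the first argument.
        f, err := os.Open(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // The gzip trailer ends with ISIZE: the uncompressed size
        // modulo 2^32, stored little-endian in the last 4 bytes.
        if _, err := f.Seek(-4, io.SeekEnd); err != nil {
            log.Fatal(err)
        }
        trailer := make([]byte, 4)
        if _, err := io.ReadFull(f, trailer); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("ISIZE (uncompressed size mod 2^32): %d\n", binary.LittleEndian.Uint32(trailer))
    }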
Anyways, even with a cursory google search, we can find a compression bug related to file sizes:
It was also changed to only look for matches of 4 bytes or more. "Better compression" is claimed, but I don't see any benchmarks showing the compressed file sizes before and after.
Yes, you are not criticizing the programmer's effort. That's what a pull request is for. A pull request with a failing test in it would be some very concise criticism.