> But think: everything in those describe blocks had to be written by hand.
It also had to be thought about by the developer. Someone had to say "I want the code to do this under these conditions".
If your tests can be autogenerated then they aren't verifying expected behaviour, they're just locking in your implementation such that it can't change later. They are saying "hey look everyone, I got my coverage metric to 100% (despite any bugs I may have)."
One of the projects at a place where I worked was set up so that running the tests automatically and silently updated the expected values. Completely bonkers: the first time I contributed to the project I prepared the tests first and then started the implementation. While I was working on it I ran the tests, which at that point should have failed because I hadn’t finished writing the code, but instead all tests passed, because the test setup had helpfully overwritten the expected values I had prepared in my new tests with the bad data. Yeah great, very helpful >:(
Oh yeah and the whole test setup was also way too tied to the implementation rather than verifying behaviour. Complete trash the whole thing.
One person's "cargo cult behavior" is another person's "best practices". :P
My favorite example is automatically generated documentation. The kind that merely repeats the name of the method, the names and types of arguments, and the type of return value. The ironic part is that this is later used as evidence that all documentation is useless. Uhm, how about documenting the methods where something is not obvious, and leaving the obvious ones (getters, setters) alone? But then the documentation coverage checker would return a number smaller than 100% and someone would freak out...
I hate to dwell on this, but I've also seen it in real life and it boggles the mind.
Like "give review feedback that this code isn't doing the right thing" -> "change the test to make it pass, not change the code to make it work". And it wasn't really a small case where you could plausibly do that and still understand what you were trying to do.
Coincidentally that was a few weeks after I saw a comment here on HN about someone who hired someone from Facebook, and the guy would change the tests so he could push to production, rather than fixing the bug that the tests pointed out ...
>Coincidentally that was a few weeks after I saw a comment here on HN about someone who hired someone from Facebook, and the guy would change the tests so he could push to production, rather than fixing the bug that the tests pointed out ...
Can't blame him, he moved fast and broke things /s
Perhaps he's a Buddhist? "If the software is going to break, then the software will be broken." Then he adds a little wabi-sabi for good measure. https://en.wikipedia.org/wiki/Wabi-sabi
I remember once using some in-house software which, for god knows what reason, could not log its errors back to the IT department. Instead, they relied on users to call up IT, or email them with the error. To make it more fun for users, each error message contained a humorous haiku.
Chaos reigns within.
Reflect, repent, and reboot.
Order shall return.
Would it do this just the first time? It’s still bad it was doing this silently, but it’s pretty common to test web APIs in a similar way manually. Make a request, check the response you get back looks right (important step) and then save it as the expected value.
Edit: or after reading the article, like in the article.
I can somewhat understand, because this is kind of the goal of property based testing—the actual values themselves matter so little to the test that you’re willing to subject those inputs to randomness
That said, this doesn’t sound like a very good way to pull that off because the developer has no control over that randomness (where it’s needed greatly).
So long as the diffs get reviewed and checked in, this is a great form of testing called "regression testing". It doesn't replace unit testing, but it can be super valuable.
You are missing the point entirely. It’s actually discussed at length in the article btw if you had bothered reading it.
Regression tests are extremely useful because you don’t want working code to get broken but they are tedious to write. What the author is describing is pretty much how everyone does it if you want anything moderately complex in the test, you just run and then copy-paste. Having something do it for you in a frictionless way is a huge win.
Plus the way the framework works you can still test expected behaviours before writing the code if that’s what you actually want.
Think of it as manual testing where your work is captured so it can be run later in an automated fashion. There are many problems where verifying the answer is easier than coming up with the answer.
Asserting formatted output can also be really useful. A picture might be worth a thousand words, but when it comes to tests it can save you a thousand asserts. Writing those thousand asserts separately also would be so tedious that in practice you'd probably not write them all, leaving part of your output uncovered by tests.
When I wrote a LALR parser generator for fun, I added some code to print out a nicely formatted parsing table with debugging information. Besides being useful for debugging, it let me write simple yet powerful tests: I would feed the generator a grammar and then assert on the formatted parsing table. That made it easy to verify that I was asserting the right thing, and let me assert everything in one go.
Tests should define what the expectations are. If a change does not impact those expectations, then it should be allowed and not break any tests.
Locking your code such that all future changes require updating old tests tells me that your tests are just your code written a second time, with no thought about what the code's requirements are.
In many contexts, there's just no such thing as a safe behavior change which should be allowed without a specific decision from you to allow it. As a database systems guy, I've seen countless examples of customer breakages caused by a developer's decision that some behavior or another is so trivial it doesn't need to be tested.
When you're working on developing a random utility function (real example!), it's easy to say "come on, it's no big deal to return DECIMAL(14, 4) instead of DECIMAL(12, 3)". It feels like they're basically the same, updating the test is make-work, and the guidelines saying you must document it as a breaking change are pointless annoyances. It's hard, requiring substantial amounts of knowledge and expertise, to recognize that this change will cause a production outage because the schema of a customer's view is no longer write-compatible with their existing data.
It had, and that's precisely because of the lack of anything like the expect() tests described in the OP. It's laborious to reliably scan through a big test diff and identify when it's describing a user-facing change, and people are inevitably going to autopilot through it. If you have a golden file (the standard name in my area for an equivalent mechanism to expect() tests), the reviewer's work is a lot simpler: any non-append-only diff is a breaking change and must be either fixed or communicated broadly before deploying it.
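A minimal sketch of the golden-file mechanism described above, assuming Python. `check_golden` and `render_schema` are hypothetical helpers invented for illustration, not the API of any particular tool. The point is that the golden file only changes when someone explicitly asks for an update, so the change shows up as a reviewable diff in version control.

```python
import tempfile
from pathlib import Path


def render_schema(columns: dict[str, str]) -> str:
    """Hypothetical user-facing output: one 'name: type' line per column."""
    return "".join(f"{name}: {sql_type}\n" for name, sql_type in columns.items())


def check_golden(golden_path: Path, actual: str, update: bool = False) -> bool:
    """Compare `actual` with the stored golden text.

    With update=True the golden file is rewritten and the check passes.
    A missing golden file is a failure, never a silent auto-accept.
    """
    if update:
        golden_path.write_text(actual)
        return True
    if not golden_path.exists():
        return False
    return golden_path.read_text() == actual


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "schema.golden"
    # Explicit, deliberate acceptance of the current output.
    accepted = check_golden(golden, render_schema({"price": "DECIMAL(12, 3)"}), update=True)
    # Unchanged behaviour passes.
    unchanged = check_golden(golden, render_schema({"price": "DECIMAL(12, 3)"}))
    # The "harmless" type change is flagged instead of slipping through.
    changed = check_golden(golden, render_schema({"price": "DECIMAL(14, 4)"}))
```

Whether a non-append-only diff really counts as a breaking change is a policy decision for the reviewer; the helper only makes the diff impossible to miss.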
Implementation !== Behavior. You want to test the behavior, not the implementation. I'd expect tests to change when behavior changes, but when you reimplement the same behavior, the tests should still pass when you're done.
Yeah in their Fibonacci example if it printed out 510 instead of 610 you'd still have a bug and think you had tested it. Especially confusing for future people who will assume it works because there are passing tests!
How do you know if you have the right result though? You might know if you have a plausible result. Like if it output -1 then you know something is wrong I guess.
There's a much higher chance of detecting bugs that give plausible output if you aren't given the opportunity to say "eh looks plausible I won't bother double checking it".
Any programmer dumb enough to just blindly accept that their program is correct is also a dumb enough programmer not to have begun writing a test in the first place. If this gets the friction of writing a test at all so close to zero that these programmers start writing tests (albeit sometimes blindly accepting the output), then it's better than just trying their program on some inputs and calling it a day. It writes down the current output of the program. That's a big step up already. Now people evaluating the code can read some of its outputs without downloading anything.
I personally already use a similar cycle to expect-test when I write tests. A great place to start when writing test assertions is the debug output, just like this thing uses. Then you convert the output into assertions after you have thought through which parts are right or wrong. Just like you can do with expect-test, but without the automation. If you don't know whether the output is right or not, just add an assert(false, "hmm, not sure about this") aka todo!() and voilà, your test fails and future you can be prompted to check over it again.
Sometimes the output is obviously wrong, but you still don't know what the right output is. (At this point you know you're doing useful work!) The remedy is the same. Just make the test fail somehow.
> Any programmer dumb enough to just blindly accept that their program is correct is also a dumb enough programmer not to have begun writing a test in the first place.
Then what's the point of this methodology? It requires you to write tests and also blindly accept that your program is correct.
Maybe they should just rename it to "plausibility tests" or similar because that's what they're really testing. And while that does have some value, I think most of the value is negated by the fact that it sounds like they are properly vetted tests which they are not.
So a more appropriate name would help a lot. I still think it's a bad idea though.
> It requires you to write tests and also blindly accept that your program is correct.
No. You can say no. Just don’t accept it. You’re a human and it asks. Even if you do accept it you can modify it because you have eyes and a keyboard and it’s written right there where you wrote your test.
As I said before, it's probably better than nothing in that it will help you detect obviously implausible results. But it really needs to be labelled as such otherwise people will assume that these are properly curated "golden" tests.
Of course. The reason expect-testing is good is that you need significantly less vigilance writing/maintaining the tests than when you do them with assertions for everything you care about, in exchange for slightly more vigilance required on the actual output of your programs. Yes you need to pay attention to the output, but your attention can now be more focused instead of split between that job and the job of writing the test. It's possible to make mistakes when writing out your assertions, they are just generally more invisible and pernicious. Testing code is code like any other, and mistakes look like forgetting to test things, erroneous refactoring of the test or the code, mistakes copying tests around, mistakes writing out extrapolations, mistakes from sheer fatigue at the heft of the testing code you're trying to maintain. Further, the kind of vigilance required for expect-test is mostly not "Tesla kinda driving itself but driver is meant to watch the road". You are not checked out completely and talking to the other passengers or reading a book, but somehow legally responsible for taking control at any moment. You have your hands on the wheel and the car is offering turn-by-turn GPS directions.
Expect-testing is a good tradeoff in the short term (time to create tests) and in the long term (quality and size of test suites produced). The evidence for that is that there are pieces of software that need so many tests for their range of functionality, that you cannot test them any other way than in this style. I am talking about testing orders of magnitude more stuff than you could do manually. A great example is the Rust compiler UI test suite (https://github.com/rust-lang/rust/tree/master/tests/ui). It doesn't have to be that your tests have large amounts of noise, like compiler UI tests do. You can make focused and noise-free tests using this method, as the original post examined. The main thing is that writing the tests faster results in bigger test suites and more opportunity to look at the same code on different inputs. I would rather have two dozen tests that required me to look at their output, than three tests that made me think thoroughly about every single assertion. It's just a better use of your time. The rewards are compounded by the massively reduced cost of maintaining the test suite. The tests update themselves when the code does.
Overall, yes you have identified the negative part of the tradeoff. But you seem to have missed every single one of the benefits.
It's a repl, so you build the final output incrementally. Testing becomes part of the development workflow like you would do in languages that rely on the repl like lisps.
For example, you start with the inputs and you apply the first layer of transformations, then check that what it does makes sense. Then maybe you refactor it out into its own function and add the generated test for it. Then you move on to the next step and so on until you have the final result.
For Fibonacci (or indeed the result of most mathematical calculations) it makes no sense but I use this kind of thing all the time where the expected output is, for example, a templated string like an error message.
There are plenty of kinds of test outputs where rewriting the test and eyeballing the result is quicker, easier and ultimately better.
In many contexts there's value in ensuring the behavior doesn't change without being noticed. You're just moving the developer thinking about the expected behavior from when the test is written to when the test fails.
See the related memes "code never lies", "the code is the contract" and “when I use a word, it means just what I choose it to mean — neither more nor less."
> I think you’re supposed to write some nonsense, like assert fibonacci(15) == 8, then when the test says “WRONG! Expected 8, got 610”, you’re supposed to copy and paste the 610 from your terminal buffer into your editor.
> This is insane!
The sane approach is presumably either to expand the call tree and verify all the unique subsolutions, or to do every step with a calculator if you can’t expand the call tree.
> The %expect block starts out blank precisely because you don’t know what to expect. You let the computer figure it out for you. In our setup, you don’t just get a build failure telling you that you want 610 instead of a blank string. You get a diff showing you the exact change you’d need to make to your file to make this test pass; and with a keybinding you can “accept” that diff. The Emacs buffer you’re in will literally be overwritten in place with the new contents [1]:
Oh okay. The non-insane approach is to do the first thing but Emacs copies the result on your behalf.
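A rough sketch of the "show a diff, then let a human accept it" step from the quoted passage, using Python's `difflib` instead of the OCaml and Emacs tooling the article describes. `expect_diff` and `fibonacci` are hypothetical names made up here; the real workflow rewrites the source file in place, which this sketch does not attempt.

```python
import difflib


def fibonacci(n: int) -> int:
    """Iterative Fibonacci with fibonacci(0) == 0 (an assumed indexing)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def expect_diff(expected: str, actual: str) -> str:
    """Return a unified diff from expected to actual; empty string means pass."""
    return "".join(
        difflib.unified_diff(
            expected.splitlines(keepends=True),
            actual.splitlines(keepends=True),
            fromfile="expected",
            tofile="actual",
        )
    )


# The expectation starts out blank; the diff is the proposed change,
# which a person still has to read before accepting it.
proposed = expect_diff("", f"{fibonacci(15)}\n")
print(proposed)
```

Nothing here decides whether the proposed value is right; it only removes the copy-and-paste step.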
They also do that; the post refers to their Quickcheck library. But how do you property test the Fibonacci function? There isn't much to say about it...
It is non-decreasing monotonic.
fib(n) <= fib(n+1)
It is increasing monotonic after 1.
fib(n) < fib(n+1)
Its domain and codomain are non-negative integers.
fib(n) + fib(n+1) == fib(n+2)
Notice this is like the recursive solution except going the other way (addition not subtraction) and is missing the base case.
I believe the way to test it is to have a property like `n in integer | fib(n+2) == fib(n+1) + fib(n)`. This is close to the naive (but obviously correct?) implementation of fib and can be used to test the optimized version of the function.
You can also test that the sequence is increasing like `fib(n+2) > fib(n+1) > fib(n)`.
You use the naive implementation as a test oracle, limit `n` to something small (through the property tester), and use the test oracle on your efficient implementation.
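The properties and the oracle idea from the comments above can be sketched in plain Python. This deliberately uses a seeded random loop rather than a property-testing library, and `fib_naive`, `fib_fast` and `check_properties` are hypothetical names; indexing from fib(0) == 0 is also an assumption.

```python
import random


def fib_naive(n: int) -> int:
    """Direct transcription of the recurrence: slow but easy to trust."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)


def fib_fast(n: int) -> int:
    """Iterative O(n) version, standing in for the implementation under test."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def check_properties(trials: int = 200, seed: int = 0) -> None:
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(0, 20)  # keep n small so the naive oracle stays cheap
        assert fib_fast(n) == fib_naive(n)                       # oracle agreement
        assert fib_fast(n) + fib_fast(n + 1) == fib_fast(n + 2)  # recurrence
        assert fib_fast(n) <= fib_fast(n + 1)                    # non-decreasing
        if n >= 2:
            assert fib_fast(n) < fib_fast(n + 1)                 # strictly increasing


check_properties()
```

The recurrence alone does not pin down the base case, which is why the oracle comparison (or a couple of hard-coded values) is still needed.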
Unit testing elegant functions has no value.
(fib is often used as an example. But you asked how to test it.)
In combinatorics the adjusted Fibonacci sequence, which starts with 1 instead, is more commonly used as it aligns with many other results. One might want to document in the code, via a test, which sequence is of interest.
This is just an example of course but elegant functions might need to be tested.
In addition to what travisjungroth said, you can also check against a reference implementation.
Eg if you coded up an O(n) version of the Fibonacci calculation, you can check against the naive recursive one (or if you are feeling confident, you can check against the O(log n) solution via repeated squaring of matrices.)
The problem is that the closed-form solution is vulnerable to floating-point error. If the calculations are done in float32 (including all intermediate steps), then the 32nd Fibonacci number is erroneously given as 2178310, instead of the correct value of 2178309. Using float64 does better, but still has an error at the 71st Fibonacci number. (I made a quick plot of the error as a function of N at https://i.imgur.com/bbc9OFC.png. As soon as the error crosses ±0.5, rounding gives the wrong result.)
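For anyone who wants to reproduce the float64 part, here is a small sketch that compares a closed-form calculation against exact integer arithmetic. Python floats are float64, so the float32 case is not covered, and the exact index of the first mismatch can depend on how the formula is written; `fib_binet`, `fib_exact` and `first_mismatch` are hypothetical names.

```python
import math


def fib_exact(n: int) -> int:
    """Exact Fibonacci using Python's arbitrary-precision integers."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_binet(n: int) -> int:
    """Closed form evaluated in float64: round(phi**n / sqrt(5))."""
    sqrt5 = math.sqrt(5.0)
    phi = (1.0 + sqrt5) / 2.0
    return round(phi ** n / sqrt5)


def first_mismatch(limit: int = 100):
    """Smallest n below `limit` where the float result is wrong, else None."""
    for n in range(limit):
        if fib_binet(n) != fib_exact(n):
            return n
    return None


print(first_mismatch())
```

A mismatch is unavoidable well before n = 100, because odd values above 2**53 cannot be represented exactly in float64.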
These are fine for property-based testing, so long as you restrict yourself to the range in which you have a correct value. But at that point, you might as well just hard-code the first 93 fibonacci numbers (the most that will fit in a uint64_t) and be done with it.
Comparing the output of your system against an oracle is one property you can test.
But you don't always have an oracle. So other properties still make sense.
As a simple example: if you code up a quantum mechanics simulator, that's hard, and I wouldn't be able to code up an oracle for you straight away. But I can tell you that you probably want to check that things like momentum and energy are conserved.
Yes, I have difficulty understanding the point of a test-writing system that relies on your explicit assumption that whatever the code already does is correct.
A regression test is checking causality: Changes in new code, updating dependencies, updating the OS the software is running on, updating shared libraries, porting the code to a new platform, etc. aren't supposed to change the test results.
"I may not know what cos(x) means, but whatever it is shouldn't depend on what OS version I'm running"
> "I may not know what cos(x) means, but whatever it is shouldn't depend on what OS version I'm running"
Cosine is a terrible example to use for that idea. It's pretty likely to change, for certain x, in similar circumstances to your examples of "when test results should never change".
If it's likely to change, then you especially want the regression test so you can decide how to handle the divergence during your port. Maybe one library preserves the signal on NaNs and the other doesn't. Or maybe the CPU's default rounding mode is different when called in this context, and you're off by 1 ulp.
In either case, if the behavior is to change, it should change as an informed decision and not because nobody noticed.
This looks similar to snapshot testing in UI, where you save an output of UI components and test system notifies you when the output changes. This can be useful to detect changes in components that you didn’t intend to change.
- This here is akin to UI testing framework X where the test framework can compare an expected screenshot of the UI to a screenshot of the actual UI!
The last one basically requires automation if you want anyone to make use of it. The regression testing automation described in the OP is a nice-to-have, not a so-good-that-it-gets-a-new-name.
"Regression test" means something else, at least at the companies I've worked at: It means a test that was written after a defect was found in production, to ensure that the same defect doesn't happen again (that the fix doesn't "regress"). It can be a manual test or an automated test.
https://en.wikipedia.org/wiki/Regression_testing
That’s fine and I have no objective argument against it. But I don’t see much reason to need two different names for tests that do the same thing merely based on how they were introduced. Sometimes I add a regression test because I fixed a bug, and sometimes I add a regression test because I just implemented a feature that I don’t want my future self to ruin: six months from now they will co-exist in the same suite and serve the same function.
One reason to call bug-fix tests “regression tests” (and only those kinds of tests) is that someone might regress the code base through a merge conflict (maybe they effectively undo a commit?). So that’s one argument I suppose.
"Regression testing" can also refer to a process: When the QA team says they're doing regression testing, it means they're testing that existing functionality hasn't regressed (as opposed to testing a new feature).
I'm not particularly wedded to any of these terms, I'm just pointing out that "regression testing" has an established meaning, and it isn't snapshot testing (outside of certain industries, at least). I do find it amusing that one implementation of snapshot testing (https://pypi.org/project/pytest-regtest/) links to https://en.wikipedia.org/wiki/Regression_testing but that article doesn't describe snapshot testing at all! Maybe the article changed? Oh well, language changes too. ¯\_(ツ)_/¯
Snapshot testing is great, and I wish more test frameworks included first-class support for it. This means that snapshots can auto-update with a flag, and can be stored either inline in the source or in an external file (both modes have different use cases). Note that doc tests can also be a form of this, e.g. Python's doctest.
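To illustrate the doc test remark, here is a small sketch using Python's standard `doctest` module: the expected output sits inline next to the call, much like an inline snapshot. Note that `doctest` only checks the recorded output and has no built-in flag to rewrite it. `fib` and `run_fib_doctest` are hypothetical names.

```python
import doctest


def fib(n: int) -> int:
    """Return the n-th Fibonacci number, with fib(0) == 0.

    >>> fib(15)
    610
    >>> [fib(i) for i in range(8)]
    [0, 1, 1, 2, 3, 5, 8, 13]
    """
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def run_fib_doctest():
    """Run only fib's docstring examples; returns (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(fib.__doc__, {"fib": fib}, "fib", None, 0)
    return doctest.DocTestRunner(verbose=False).run(test)


results = run_fib_doctest()
print(results.failed, results.attempted)
```

In a normal project you would more likely call `doctest.testmod()` or use a test runner's doctest integration; the explicit parser and runner are used here only to keep the sketch self-contained.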
"Expect tests" seems like a bad name, since that covers all tests.
I find that snapshot testing gets overused in JavaScript. Mistakes can creep in easily, and if the snapshot is big and in a separate file, code review can miss it.
I much prefer property based testing over expectation based testing. You have to explicitly think about what properties hold true about the thing you're writing.
For example, fib(N+1) = fib(N) + fib(N-1), so this property can be tested for all N; primitive generators can easily generate the data, and a good composition framework can easily generate complex data from primitive data.
Of course, you have to have a property you can specify easily. Otherwise, it'd be exactly the same as expectation based testing.
Every single time I've introduced property based testing, even as a simple example, I've discovered a bug in either the code or the spec.
I've found a bug in a Haskell program about fib generation - your test would work (if fixed for the subtractions) but incorrectly as there was an overflow in the addition. A basic property of "fib(n+1) > fib(n)" for n>1 finds this.
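One way to read this: with fixed-width integers, the recurrence check overflows in exactly the same way as the implementation, so it keeps passing, while the monotonicity check catches the wraparound. The sketch below simulates that in Python by wrapping to signed 64 bits; the 64-bit width is an assumption (the comment does not say which integer type was involved), and all function names are hypothetical.

```python
MASK = (1 << 64) - 1


def wrap_int64(x: int) -> int:
    """Reduce x to a signed 64-bit value, mimicking fixed-width overflow."""
    x &= MASK
    return x - (1 << 64) if x >= (1 << 63) else x


def fib_int64(n: int) -> int:
    """Fibonacci where every addition silently wraps at 64 bits."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, wrap_int64(a + b)
    return a


def recurrence_holds(n: int) -> bool:
    """The recurrence property, computed with the same wrapping addition."""
    return fib_int64(n + 2) == wrap_int64(fib_int64(n + 1) + fib_int64(n))


def first_non_increasing(limit: int = 120):
    """Smallest n >= 2 where fib(n+1) > fib(n) fails, else None."""
    for n in range(2, limit):
        if not fib_int64(n + 1) > fib_int64(n):
            return n
    return None


print(all(recurrence_holds(n) for n in range(120)))
print(first_non_increasing())
```

The recurrence property stays green for every n because both sides wrap identically, which is the "works, but incorrectly" situation described above.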
I like this type of testing as it asks you to more generally consider what guarantees your code is making about its operation.
Edit - your example is a good one and necessary, I just wanted to add a bit extra as I really like property based testing
Snapshot testing works well for component systems, especially with storybook. There is a service called Chromatic that lets you diff component changes visually using storybook output.
The open source use pattern for expect tests in OCaml (via dune) is exactly as you describe (see https://dune.readthedocs.io/en/stable/tests.html) - you run the tests with `--auto-promote` to tell it to update. The editor integration is a very simple keybinding on top of more generic tooling.
I wonder if this has the same downsides as golden and screenshot type tests, where you end up over-asserting resulting in tests that break for unrelated changes?
Obviously that’s a risk for hand written tests too but it’s easier (today… who knows what copilot like systems will offer soon!) for a human to reason about what’s relevant.
Yes, that is definitely a downside for these tests. The worst is when the text of some exception is printed and it includes line numbers. It does still require some discipline to think about what you're printing and avoid output that will be very noisy. This problem is mitigated quite a bit by the ease of accepting changes when these tests fail for obviously nonsense reasons though (just hit a couple buttons in an emacs buffer).
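A small sketch of the kind of discipline meant here: scrub volatile details such as line numbers before the text goes into the expected output. The regular expression and the `scrub` helper are illustrative assumptions, not part of any expect-test library, and real output may need more or different rules.

```python
import re

# Matches the "line 42" fragment found in Python-style traceback lines.
_LINE_NO = re.compile(r"line \d+")


def scrub(output: str) -> str:
    """Replace concrete line numbers with a stable placeholder."""
    return _LINE_NO.sub("line <N>", output)


before = 'File "parser.py", line 42, in parse\nValueError: unexpected token'
after = 'File "parser.py", line 57, in parse\nValueError: unexpected token'

# An unrelated edit moved the code, but the scrubbed snapshots still match.
print(scrub(before) == scrub(after))
```

The trade-off is that anything you scrub is no longer covered by the test, so it pays to scrub narrowly.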
Yeah, the OP's counterargument is that you can filter down what goes into the test output. But at that point it seems not too different qualitatively from the traditional bottom-up approach where you just write assertions yourself, except that the framework does the job of populating the assertions' expected values.
If you are saying this approach would tend to produce a lot of change-detector tests, then that is an issue, but I think scotty79 is making a different point: this approach would seem to make it easy to overlook any regressions that the latest change has created.
Some years ago I wrote a Python function, "replace_me"[1], that edits the caller's source code. You can use it for code generation, inserting comments, generating fixed random seeds, etc.
And one more use case I found was exactly what TFA describes, but even easier:
import replace_me
replace_me.test(1+1)
Once executed, it evaluates the argument and becomes an assertion:
import replace_me
replace_me.test(1+1, 2)
I never actually used it for anything important, but it comes back to my mind once in a while.
I tend to think that tests should be carefully crafted for readability just like normal code. The “content of a REPL” is unlikely to be well-thought out enough to preserve meaningful invariants while remaining supple in the direction of likely changes. Perhaps in the hands of very good engineers this tool is net positive, but I shudder at giving junior engineers a tool that encourages less structure in tests.
A good set of fixture/helper functions should let you write really short and expressive tests (or tabular parametrized tests, if you prefer) which seems to me to resolve most of the pain points the author is complaining about.
One big advantage I do see with this approach is it seems to be a very compact rendering of a table of outputs; in Python+pytest+PyCharm if I run a 10-example parametrized test, I have to click through to see each failure individually. Perhaps there is a UX learning here that just rendering the raw errors into the code beside the test matrix could help visualize results faster.
As an aside, I have recently been enjoying the “write an ascii representation as your test assert” mode of testing, it can give a different way of intuiting what is going on.
I think this would suffer from the same problem as partial self-driving cars: it's human nature for vigilance to falter if it doesn't feel like you're the sole/primary one in control.
Of course, you can say "I won't let myself do that", but working against human nature is not a formula for success. If my back hurts, I can tell myself I'm just going to go lie down on the bed for 10 minutes but not take a nap, but then 30 minutes later I wake up feeling groggy.
I don’t really understand this. How is this different from just writing the code and just assuming that you got it correct, and then locking in a potentially wrong implementation?
> What does fibonacci(15) equal? If you already know, terrific—but what are you meant to do if you don’t?
> I think you’re supposed to write some nonsense, like assert fibonacci(15) == 8, then when the test says “WRONG! Expected 8, got 610”, you’re supposed to copy and paste the 610 from your terminal buffer into your editor.
Who does that? How do you know 610 is correct? That’s just assuming your implementation is right from the get go. For such a function, I’d independently calculate it, using some method I trust (maybe Wolfram Alpha). I’d do this for a handful of examples, trying to cover base and extreme cases. And then I’d do property testing if I really wanted good coverage. Further, this expect test library seems to just smoothen the experience of copying what the function returns into a test.
This whole “expect test” business seems to rely on the developer looking at what the function returns for a given input, evaluating if it’s correct or not and then locking that in as “this is what this function is supposed to do”. That seems backwards and no different from how one implements functions in the first place, so I don’t know what is actually being tested.
The entire point of testing is saying “this is what this function should do” and not “this is what the function did and thus that’s what it should always do”.
You're supposed to use it as a repl, so you start with a test for `fib(1) = 1`, then `fib(2)` and so on. Once you're confident of your implementation, you use quickcheck to test general properties of the system.
Similarly if you find a bug in the live system, you add a test for that and the initial output will be wrong. Then you fix your code until it prints the correct value and commit that so any regression will be caught.
I work with a language where all tests are expect tests (GAP). The biggest problem is you can basically never change how built-in types are printed, as you'll break all tests in every program. For example, someone wanted to improve how plurals are printed, but that would break every test.
Is there anything like this in Python or C#? I have worked with OCaml extensively in coursework, but there’s no chance I’ll be using it in prod any time soon and I’d love toying with this approach in my working languages.