“Expect tests” make test-writing feel like a REPL session (janestreet.com)
138 points by jsomers on Jan 14, 2023 | 92 comments



> But think: everything in those describe blocks had to be written by hand.

It also had to be thought about by the developer. Someone had to say "I want the code to do this under these conditions".

If your tests can be autogenerated then they aren't verifying expected behaviour, they're just locking in your implementation such that it can't change later. They are saying "hey look everyone, I got my coverage metric to 100% (despite any bugs I may have)."


One of the projects at a place where I worked was set up so that when you ran the tests, it automatically and silently updated the expected values. Completely bonkers, because the first time I contributed to the project I prepared the tests first and then started the implementation. While I was working on it I ran the tests, which at that point should have failed because I hadn’t finished writing the code, but instead all the tests passed. Because helpfully the test setup had overwritten the expected values that I had prepared in my new tests with the bad data. Yeah great, very helpful >:(

Oh yeah and the whole test setup was also way too tied to the implementation rather than verifying behaviour. Complete trash the whole thing.


I keep rereading this hoping I'm misunderstanding.

That is cargo cult level behaviour. They know that software with lots of tests tends to have few bugs, so let's automatically have lots of tests!

I just hope whatever you were building wasn't critical to human lives.

https://en.m.wikipedia.org/wiki/Cargo_cult


> That is cargo cult level behaviour.

One person's "cargo cult behavior" is another person's "best practices". :P

My favorite example is automatically generated documentation. The kind that merely repeats the name of the method, the names and types of arguments, and the type of return value. The ironic part is that this is later used as an evidence that all documentation is useless. Uhm, how about documenting the methods where something is not obvious, and leaving the obvious ones (getters, setters) alone? But then the documentation coverage checker would return a number smaller than 100% and someone would freak out...

This is just one of many examples, of course.


I hate to dwell on this, but I've also seen it in real life and it boggles the mind.

Like "give review feedback that this code isn't doing the right thing" -> "change the test to make it pass, not change the code to make it work". And it wasn't really a small case where you could plausibly do that and still understand what you were trying to do.

Coincidentally that was a few weeks after I saw a comment here on HN about someone who hired someone from Facebook, and the guy would change the tests so he could push to production, rather than fixing the bug that the tests pointed out ...

So yes it happens.


>Coincidentally that was a few weeks after I saw a comment here on HN about someone who hired someone from Facebook, and the guy would change the tests so he could push to production, rather than fixing the bug that the tests pointed out ...

Can't blame him, he moved fast and broke things /s


Perhaps he's a Buddhist? "If the software is going to break, then the software will be broken." Then he adds a little wabi-sabi for good measure. https://en.wikipedia.org/wiki/Wabi-sabi

I remember once using some in-house software which, for God knows what reason, could not log its errors back to the IT department. Instead, they relied on users to call up IT, or email them with the error. To make it more fun for users, each error message contained a humorous haiku.

  Chaos reigns within.
  Reflect, repent, and reboot.
  Order shall return.
Edit: Just found this from 2001: https://www.gnu.org/fun/jokes/error-haiku.en.html And my experience with haiku error messages at work was in '01 or '02.


Would it do this just the first time? It’s still bad it was doing this silently, but it’s pretty common to test web APIs in a similar way manually. Make a request, check the response you get back looks right (important step) and then save it as the expected value.

Edit: or after reading the article, like in the article.


It did this every time, not just the first time.


Well, you know what they say: Expect the unexpected!


I can somewhat understand, because this is kind of the goal of property based testing—the actual values themselves matter so little to the test that you’re willing to subject those inputs to randomness

That said, this doesn’t sound like a very good way to pull that off, because the developer has no control over that randomness (where it’s greatly needed).


So long as the diffs get reviewed and checked in, this is a great form of testing called "regression testing". It doesn't replace unit testing, but it can be super valuable.


What’s described in the OP (Jane Street) is regression testing.

What the commenter just described is tautology testing: whatever result of the computation I get is what I expected.


You are missing the point entirely. It’s actually discussed at length in the article btw if you had bothered reading it.

Regression tests are extremely useful because you don’t want working code to get broken, but they are tedious to write. What the author is describing is pretty much how everyone does it when you want anything moderately complex in the test: you just run the code and then copy-paste the output. Having something do it for you in a frictionless way is a huge win.

Plus the way the framework works you can still test expected behaviours before writing the code if that’s what you actually want.


Think of it as manual testing where your work is captured so it can be run later in an automated fashion. There are many problems where verifying the answer is easier than coming up with the answer.

Asserting formatted output can also be really useful. A picture might be worth a thousand words, but when it comes to tests it can save you a thousand asserts. Writing those thousand asserts separately also would be so tedious that in practice you'd probably not write them all, leaving part of your output uncovered by tests.

When I wrote a LALR parser generator for fun, I added some code to print out a nicely formatted parsing table with debugging information. Besides being useful for debugging, it let me write simple yet powerful tests: I would feed the generator a grammar and then assert on the formatted parsing table. That made it easy to verify that I was asserting the right thing, and let me assert everything in one go.
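For concreteness, here's a minimal Python sketch of that style (the table contents below are invented, not from my actual generator): render the structure to text once and assert on the whole rendering.

    def render(rows):
        # Deterministic, human-readable rendering of (state, action) pairs.
        return "\n".join(f"{state:>3}  {action}" for state, action in rows)

    def test_render_snapshot():
        rows = [(0, "shift 3"), (1, "accept"), (2, "reduce E -> T")]
        assert render(rows) == (
            "  0  shift 3\n"
            "  1  accept\n"
            "  2  reduce E -> T"
        )

One formatted blob is easy to read in review, and updating it when the format changes is one paste instead of a dozen edits.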


> locking in your implementation such that it can't change later

That's the whole point of tests. All tests do that.

This protects against later code changes that change behavior (output or side effects) unintentionally.

When you intend to change behavior then you need to change the tests too.


I disagree.

Tests should define what the expectations are. If a change does not impact those expectations, then it should be allowed and not break any tests.

Locking your code such that all future changes require updating old tests tells me that your tests are just your code written a second time, with no thought about what the code's requirements are.


In many contexts, there's just no such thing as a safe behavior change which should be allowed without a specific decision from you to allow it. As a database systems guy, I've seen countless examples of customer breakages caused by a developer's decision that some behavior or another is so trivial it doesn't need to be tested.

When you're working on developing a random utility function (real example!), it's easy to say "come on, it's no big deal to return DECIMAL(14, 4) instead of DECIMAL(12, 3)". It feels like they're basically the same, updating the test is make-work, and the guidelines saying you must document it as a breaking change are pointless annoyances. It's hard, requiring substantial amounts of knowledge and expertise, to recognize that this change will cause a production outage because the schema of a customer's view is no longer write-compatible with their existing data.


In your story though the hapless dev just changed the test. And the reviewers approved it.

This suggests that there are so many changes to tests that it's just become background noise.


It had, and that's precisely because of the lack of anything like the expect() tests described in the OP. It's laborious to reliably scan through a big test diff and identify when it's describing a user-facing change, and people are inevitably going to autopilot through it. If you have a golden file (the standard name in my area for an equivalent mechanism to expect() tests), the reviewer's work is a lot simpler: any non-append-only diff is a breaking change and must be either fixed or communicated broadly before deploying it.


Implementation !== Behavior. You want to test the behavior, not the implementation. I'd expect tests to change when behavior changes, but reimplementing the same behavior, the tests should pass when you're done.


Yeah in their Fibonacci example if it printed out 510 instead of 610 you'd still have a bug and think you had tested it. Especially confusing for future people who will assume it works because there are passing tests!


The title mentions writing tests as if they are repl sessions because you're supposed to iterate until you have the correct result.


How do you know if you have the right result though? You might know if you have a plausible result. Like if it output -1 then you know something is wrong I guess.

There's a much higher chance of detecting bugs that give plausible output if you aren't given the opportunity to say "eh looks plausible I won't bother double checking it".


Any programmer dumb enough to just blindly accept that their program is correct is also a dumb enough programmer not to have begun writing a test in the first place. If this gets the friction of writing a test at all so close to zero that these programmers start writing tests (albeit sometimes blindly accepting the output), then it's better than just trying their program on some inputs and calling it a day. It writes down the current output of the program. That's a big step up already. Now people evaluating the code can read some of its outputs without downloading anything.

I personally already use a similar cycle to expect-test when I write tests. A great place to start when writing test assertions is the debug output, just like this thing uses. Then you convert the output into assertions after you have thought through which parts are right or wrong. Just like you can do with expect-test, but without the automation. If you don't know whether the output is right or not, just add an assert(false, "hmm, not sure about this") aka todo!() and voilà, your test fails and future you can be prompted to check over it again.

Sometimes the output is obviously wrong, but you still don't know what the right output is. (At this point you know you're doing useful work!) The remedy is the same. Just make the test fail somehow.


> Any programmer dumb enough to just blindly accept that their program is correct is also a dumb enough programmer not to have begun writing a test in the first place.

Then what's the point of this methodology? It requires you to write tests and also blindly accept that your program is correct.

Maybe they should just rename it to "plausibility tests" or similar because that's what they're really testing. And while that does have some value, I think most of the value is negated by the fact that it sounds like they are properly vetted tests which they are not.

So a more appropriate name would help a lot. I still think it's a bad idea though.


> It requires you to write tests and also blindly accept that your program is correct.

No. You can say no. Just don’t accept it. You’re a human and it asks. Even if you do accept it you can modify it because you have eyes and a keyboard and it’s written right there where you wrote your test.

See https://github.com/rust-analyzer/expect-test for a demo gif of the rust version.


> No. You can say no. Just don’t accept it.

Yes you can except...

> You’re a human

Precisely. You're a human. Humans are lazy and bad at manually checking things are correct, especially if there's an "eh it's probably fine" option.

This is extremely well studied: https://en.wikipedia.org/wiki/Vigilance_(psychology)

As I said before, it's probably better than nothing in that it will help you detect obviously implausible results. But it really needs to be labelled as such otherwise people will assume that these are properly curated "golden" tests.


Of course. The reason expect-testing is good is that you need significantly less vigilance writing/maintaining the tests than when you do them with assertions for everything you care about, in exchange for slightly more vigilance required on the actual output of your programs. Yes, you need to pay attention to the output, but your attention can now be more focused instead of split between that job and the job of writing the test. It's possible to make mistakes when writing out your assertions; they are just generally more invisible and pernicious. Testing code is code like any other, and mistakes look like forgetting to test things, erroneous refactoring of the test or the code, mistakes copying tests around, mistakes writing out extrapolations, mistakes from sheer fatigue at the heft of the testing code you're trying to maintain.

Further, the kind of vigilance required for expect-test is mostly not "Tesla kinda driving itself but the driver is meant to watch the road". You are not checked out completely, talking to the other passengers or reading a book, yet somehow legally responsible for taking control at any moment. You have your hands on the wheel and the car is offering turn-by-turn GPS directions.

Expect-testing is a good tradeoff in the short term (time to create tests) and in the long term (quality and size of test suites produced). The evidence for that is that there are pieces of software that need so many tests for their range of functionality, that you cannot test them any other way than in this style. I am talking about testing orders of magnitude more stuff than you could do manually. A great example is the Rust compiler UI test suite (https://github.com/rust-lang/rust/tree/master/tests/ui). It doesn't have to be that your tests have large amounts of noise, like compiler UI tests do. You can make focused and noise-free tests using this method, as the original post examined. The main thing is that writing the tests faster results in bigger test suites and more opportunity to look at the same code on different inputs. I would rather have two dozen tests that required me to look at their output, than three tests that made me think thoroughly about every single assertion. It's just a better use of your time. The rewards are compounded by the massively reduced cost of maintaining the test suite. The tests update themselves when the code does.

Overall, yes you have identified the negative part of the tradeoff. But you seem to have missed every single one of the benefits.


It's a REPL, so you build the final output incrementally. Testing becomes part of the development workflow, like in languages that rely on the REPL, such as Lisps.

For example, you start with the inputs and you apply the first layer of transformations, then check that what it does makes sense. Then maybe you refactor it out into its own function and add the generated test for it. Then you move on to the next step and so on until you have the final result.


For Fibonacci (or indeed the result of most mathematical calculations) it makes no sense but I use this kind of thing all the time where the expected output is, for example, a templated string like an error message.

There are plenty of kinds of test outputs where rewriting the test and eyeballing the result is quicker, easier and ultimately better.


It makes sense in scenarios where it's easier to verify a provided solution than it is to create one.


If you’re autogenerating your tests from a specification and not an implementation then it can potentially be useful.


In many contexts there's value in ensuring the behavior doesn't change without being noticed. You're just moving the developer thinking about the expected behavior from when the test is written to when the test fails.


See the related memes "code never lies", "the code is the contract" and “when I use a word, it means just what I choose it to mean — neither more nor less."


> I think you’re supposed to write some nonsense, like assert fibonacci(15) == 8, then when the test says “WRONG! Expected 8, got 610”, you’re supposed to copy and paste the 610 from your terminal buffer into your editor.

> This is insane!

The sane approach is presumably either to expand the call tree and verify all the unique subsolutions, or to do every step with a calculator if you can’t expand the call tree.

> The %expect block starts out blank precisely because you don’t know what to expect. You let the computer figure it out for you. In our setup, you don’t just get a build failure telling you that you want 610 instead of a blank string. You get a diff showing you the exact change you’d need to make to your file to make this test pass; and with a keybinding you can “accept” that diff. The Emacs buffer you’re in will literally be overwritten in place with the new contents [1]:

Oh okay. The non-insane approach is to do the first thing but Emacs copies the result on your behalf.


Well, the non-insane thing is to do property-based testing. Instead of testing only a handful of examples.


They also do that, the post refers to their Quickcheck library. But how do you property test the Fibonacci function? There isn't much to say about it...


Properties of the Fibonacci function:

It is non-decreasing monotonic. fib(n) <= fib(n+1)

It is increasing monotonic after 1. fib(n) < fib(n+1)

Its domain and codomain are non-negative integers.

fib(n) + fib(n+1) == fib(n+2). Notice this is like the recursive solution, except going the other way (addition, not subtraction), and is missing the base case.
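Sketched in Python with the Hypothesis library (a stand-in `fib` implementation is included so the snippet runs on its own):

    from hypothesis import given, strategies as st

    def fib(n):
        # Stand-in implementation under test.
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    @given(st.integers(min_value=0, max_value=300))
    def test_fib_properties(n):
        assert fib(n) >= 0                        # codomain: non-negative integers
        assert fib(n) <= fib(n + 1)               # non-decreasing
        if n >= 2:
            assert fib(n) < fib(n + 1)            # strictly increasing past the repeated 1s
        assert fib(n) + fib(n + 1) == fib(n + 2)  # the defining recurrence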


The ratio of consecutive Fibonacci numbers also converges closer and closer to the golden ratio. Implementing a property test for that would be interesting.


    phi = (1 + 5 ** 0.5) / 2            # golden ratio
    a, b, c = fib(n), fib(n+1), fib(n+2)
    # successive ratios get strictly closer to phi (for small-ish n; in float they eventually tie)
    assert abs(c / b - phi) < abs(b / a - phi)


I believe the way to test it is to have a property like `n in integer | fib(n+2) == fib(n+1) + fib(n)`. This is close to the naive (but obviously correct?) implementation of fib and can be used to test the optimized version of the function.

You can also test that the sequence is increasing like `fib(n+2) > fib(n+1) > fib(n)`.


You use the naive implementation as a test oracle, limit `n` to something small (through the property tester), and use the test oracle on your efficient implementation.

Unit testing elegant functions has no value.

(fib is often used as an example. But you asked how to test it.)
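A rough sketch of that in Python, with the slow-but-obviously-correct version acting as the oracle for a faster one:

    def fib_naive(n):
        # Oracle: written straight from the definition.
        return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

    def fib_fast(n):
        # Implementation under test.
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    def test_fast_matches_oracle():
        for n in range(25):  # keep n small; the oracle is exponential
            assert fib_fast(n) == fib_naive(n)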


In combinatorics the adjusted Fibonacci numbers start with 1 instead, and that variant is more commonly used as it aligns with many other results. One might want to document in the code, via a test, which sequence is of interest.

This is just an example of course but elegant functions might need to be tested.


In addition to what travisjungroth said, you can also check against a reference implementation.

Eg if you coded up an O(n) version of the Fibonacci calculation, you can check against the naive recursive one (or if you are feeling confident, you can check against the O(log n) solution via repeated squaring of matrices.)
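A small Python sketch of that O(log n) reference (the top-right entry of the n-th power of [[1, 1], [1, 0]] is fib(n)):

    def mat_mul(a, b):
        return [[a[0][0]*b[0][0] + a[0][1]*b[1][0], a[0][0]*b[0][1] + a[0][1]*b[1][1]],
                [a[1][0]*b[0][0] + a[1][1]*b[1][0], a[1][0]*b[0][1] + a[1][1]*b[1][1]]]

    def fib_matrix(n):
        result = [[1, 0], [0, 1]]   # identity matrix
        base = [[1, 1], [1, 0]]
        while n:                    # repeated squaring
            if n & 1:
                result = mat_mul(result, base)
            base = mat_mul(base, base)
            n >>= 1
        return result[0][1]

    assert fib_matrix(10) == 55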


Actually, Fibonacci results can be fairly precisely tested with the golden ratio.


You could compare against a closed-form solution:

    // Binet's formula: fib n ≈ round (phi ** n / sqrt 5)
    let fib2 n =
        let sq5 = sqrt(5.0)
        ((1.0 + sq5)/2.0)**(float n)/sq5
        |> round |> int


The problem is that the closed-form solution is vulnerable to floating-point error. If the calculations are done in float32 (including all intermediate steps), then the 32nd fibonacci number is erroneously given as 2178310, instead of the correct value of 2178309. Using float64 does better, but still has an error at the 71st fibonacci number. (I made a quick plot of the error as a function of N at https://i.imgur.com/bbc9OFC.png. As soon as the error crossed ±0.5, the rounding results in the wrong result.)

These are fine for property-based testing, so long as you restrict yourself to the range in which you have a correct value. But at that point, you might as well just hard-code the first 93 fibonacci numbers (the most that will fit in a uint64_t) and be done with it.


I prefer "code it twice and hope you get it right once" testing.

Complex systems use that system everywhere. Why aren't we doing it for our code?


Comparing the output of your system against an oracle is one property you can test.

But you don't always have an oracle. So other properties still make sense.

As a simple example: if you code up a quantum mechanics simulator, that's hard, and I wouldn't be able to code up an oracle for you straight away. But I can tell you that you probably want to check that things like momentum and energy better be conserved.


Yes, I have difficulty understanding the point of a test-writing system that relies on your explicit assumption that whatever the code already does is correct.

What are you testing? Why?


A regression test is checking causality: Changes in new code, updating dependencies, updating the OS the software is running on, updating shared libraries, porting the code to a new platform, etc. aren't supposed to change the test results.

"I may not know what cos(x) means, but whatever it is shouldn't depend on what OS version I'm running"


> "I may not know what cos(x) means, but whatever it is shouldn't depend on what OS version I'm running"

Cosine is a terrible example to use for that idea. It's pretty likely to change, for certain x, in similar circumstances to your examples of "when test results should never change".


If it's likely to change, then you especially want the regression test so you can decide how to handle the divergence during your port. Maybe one library preserves the signal on NaNs and the other doesn't. Or maybe the CPU's default rounding mode is different when called in this context, and you're off by 1 ulp.

In either case, if the behavior is to change, it should change as an informed decision and not because nobody noticed.


This looks similar to snapshot testing in UI, where you save an output of UI components and test system notifies you when the output changes. This can be useful to detect changes in components that you didn’t intend to change.


Do you simply mean regression testing?


Yeah, weird to see all these variations

- Aha, an expect test!

- Oh, you mean a snapshot test!

- This here is akin to UI testing framework X where the test framework can compare an expected screenshot of the UI to a screenshot of the actual UI!

The last one basically requires automation if you want anyone to make use of it. The regression testing automation described in the OP is a nice-to-have, not a so-good-that-it-gets-a-new-name.


… And apparently also “change detector test” https://news.ycombinator.com/item?id=34379175


...and my favourite term, "characterization test": https://en.wikipedia.org/wiki/Characterization_test

"Regression test" means something else, at least at the companies I've worked at: It means a test that was written after a defect was found in production, to ensure that the same defect doesn't happen again (that the fix doesn't "regress"). It can be a manual test or an automated test. https://en.wikipedia.org/wiki/Regression_testing


That’s fine and I have no objective argument against it. But I don’t see much reason to need two different names for tests that do the same thing merely based on how they were introduced. Sometimes I add a regression test because I fixed a bug, and sometimes I add a regression test because I just implemented a feature that I don’t want my future self to ruin: six months from now they will co-exist in the same suite and serve the same function.

One reason to call bug-fix tests “regression tests” (and only those kinds of tests) is that someone might regress the code base through a merge conflict (maybe they effectively undo a commit?). So that’s one argument, I suppose.


"Regression testing" can also refer to a process: When the QA team says they're doing regression testing, it means they're testing that existing functionality hasn't regressed (as opposed to testing a new feature).

I'm not particularly wedded to any of these terms, I'm just pointing out that "regression testing" has an established meaning, and it isn't snapshot testing (outside of certain industries, at least). I do find it amusing that one implementation of snapshot testing (https://pypi.org/project/pytest-regtest/) links to https://en.wikipedia.org/wiki/Regression_testing but that article doesn't describe snapshot testing at all! Maybe the article changed? Oh well, language changes too. ¯\_(ツ)_/¯


Good points.


Lol I came here to post this but you beat me.


Snapshot testing is great, and I wish more test frameworks included first-class support for it. This means that snapshots can auto update with a flag, and can be stored either inline in the source or in an external file (both modes have different use cases). Note that doc tests can also be a form of this, e.g. Python's doctest module.
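For example, with Python's doctest the REPL transcript in the docstring is the expected output and `python -m doctest yourfile.py` replays it (doctest doesn't auto-promote new output the way expect tests do, but the shape is the same):

    def fib(n):
        """
        >>> [fib(i) for i in range(8)]
        [0, 1, 1, 2, 3, 5, 8, 13]
        """
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    if __name__ == "__main__":
        import doctest
        doctest.testmod()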

"Expect tests" seems like a bad name, since that covers all tests.


I find that snapshot testing gets overused in JavaScript. Mistakes can creep in easily, and if the snapshot is big and in a separate file, code review can miss it.

I much prefer property based testing over expectation based testing. You have to explicitly think about what properties hold true about the thing you're writing.

For example, fib(N+1) = fib(N) + fib(N-1), so this property can be tested for all N; primitive generators can easily generate the data, and a good composition framework can easily generate complex data from primitive data.

Of course, you have to have a property you can specify easily. Otherwise, it'd be exactly the same as expectation based testing.


Every single time I've introduced property based testing, even as a simple example, I've discovered a bug in either the code or the spec.

I've found a bug in a Haskell program about fib generation - your recurrence test would pass, but incorrectly, as there was an overflow in the addition. A basic property of "fib(n+1) > fib(n)" for n>1 finds this.

I like this type of testing as it asks you to more generally consider what guarantees your code is making about its operation.

Edit - your example is a good one and necessary, I just wanted to add a bit extra as I really like property based testing


Snapshot testing works well for component systems, especially with storybook. There is a service called Chromatic that lets you diff component changes visually using storybook output.


> update with a flag

Yes, this is the right level of automation, not whatever this article is going on about with the editor integration. Yuck.


The open source use pattern for expect tests in OCaml (via dune) is exactly as you describe (see https://dune.readthedocs.io/en/stable/tests.html) - you run the tests with `--auto-promote` to tell it to update. The editor integration is a very simple keybinding on top of more generic tooling.


I wonder if this has the same downsides as golden and screenshot type tests, where you end up over-asserting resulting in tests that break for unrelated changes?

Obviously that’s a risk for hand written tests too but it’s easier (today… who knows what copilot like systems will offer soon!) for a human to reason about what’s relevant.


Yes, that is definitely a downside for these tests. The worst is when the text of some exception is printed and it includes line numbers. It does still require some discipline to think about what you're printing and avoid output that will be very noisy. This problem is mitigated quite a bit by the ease of accepting changes when these tests fail for obviously nonsense reasons though (just hit a couple buttons in an emacs buffer).
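One mitigation I've used (a hedged sketch, not something from the article): scrub the volatile parts of the output before recording or comparing it, so the snapshot only contains the stable bits.

    import re
    import traceback

    def scrub(output):
        # Normalize details that change for unrelated reasons.
        output = re.sub(r"line \d+", "line <N>", output)       # source line numbers
        output = re.sub(r"0x[0-9a-fA-F]+", "0x<ADDR>", output)  # object addresses
        return output

    def test_error_snapshot():
        try:
            {}["missing"]
        except KeyError:
            rendered = scrub(traceback.format_exc())
        assert "KeyError: 'missing'" in rendered
        assert "line <N>" in rendered  # line numbers no longer leak into the snapshot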


Q: Why does this test assert the value X?

A: The value X was revealed to me by ChatGPT.


Doesn't this approach make you update results of failing tests wholesale and possibly miss where a new result of some test is actually wrong?

https://docs.rs/expect-test/latest/expect_test/


At Google the nickname for these kinds of tests was 'change detector tests'.


Yeah, the OP's counterargument is that you can filter down what goes into the test output. But at that point it seems not too different qualitatively from the traditional bottom-up approach where you just write assertions yourself, except that the framework does the job of populating the assertions' expected values.


If you are saying this approach would tend to produce a lot of change-detector tests, then that is an issue, but I think scotty79 is making a different point: this approach would seem to make it easy to overlook any regressions that the latest change has created.


Yes, and that's exactly the issue with change detector tests.



A similar approach with pytest and pdb https://simonwillison.net/2020/Feb/11/cheating-at-unit-tests...

This does get me writing tests sooner.


I do this with pytest-regtest's --regtest-reset command.


Some years ago I wrote a Python function, "replace_me"[1], that edits the caller's source code. You can use it for code generation, inserting comments, generating fixed random seeds, etc.

And one more use case I found was exactly what TFA describes, but even easier:

   import replace_me
   replace_me.test(1+1)
Once executed, it evaluates the argument and becomes an assertion:

   import replace_me
   replace_me.test(1+1, 2)
I never actually used it for anything important, but it comes back to my mind once in a while.

[1]: https://github.com/boppreh/replace_me


I tend to think that tests should be carefully crafted for readability just like normal code. The “content of a REPL” is unlikely to be well-thought out enough to preserve meaningful invariants while remaining supple in the direction of likely changes. Perhaps in the hands of very good engineers this tool is net positive, but I shudder at giving junior engineers a tool that encourages less structure in tests.

A good set of fixture/helper functions should let you write really short and expressive tests (or tabular parametrized tests, if you prefer) which seems to me to resolve most of the pain points the author is complaining about.
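For instance, in Python/pytest the tabular style looks roughly like this (fib is just a stand-in function here):

    import pytest

    def fib(n):
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    @pytest.mark.parametrize(
        "n, expected",
        [(0, 0), (1, 1), (2, 1), (7, 13), (15, 610)],
    )
    def test_fib_table(n, expected):
        assert fib(n) == expected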

One big advantage I do see with this approach is it seems to be a very compact rendering of a table of outputs; in Python+pytest+PyCharm if I run a 10-example parametrized test, I have to click through to see each failure individually. Perhaps there is a UX learning here that just rendering the raw errors into the code beside the test matrix could help visualize results faster.

As an aside, I have recently been enjoying the “write an ascii representation as your test assert” mode of testing, it can give a different way of intuiting what is going on.


Similar idea in Elixir, where the library itself handles the interactive bits: https://github.com/assert-value/assert_value_elixir


I think this would suffer from the same problem as partial self-driving cars: it's human nature for vigilance to falter if it doesn't feel like you're the sole/primary one in control.

Of course, you can say "I won't let myself do that", but working against human nature is not a formula for success. If my back hurts, I can tell myself I'm just going to go lie down on the bed for 10 minutes but not take a nap, but then 30 minutes later I wake up feeling groggy.


Here's an older post from 2015 (also from Jane Street) explaining the same process, https://blog.janestreet.com/testing-with-expectations/, but from the infancy of the method. It looks like they've heavily polished it since!

I like the approach, and I was indeed copy-pasting the result from my console...


I don’t really understand this. How is this different from just writing the code and just assuming that you got it correct, and then locking in a potentially wrong implementation?

> What does fibonacci(15) equal? If you already know, terrific—but what are you meant to do if you don’t?

> I think you’re supposed to write some nonsense, like assert fibonacci(15) == 8, then when the test says “WRONG! Expected 8, got 610”, you’re supposed to copy and paste the 610 from your terminal buffer into your editor.

Who does that? How do you know 610 is correct? That’s just assuming your implementation is right from the get-go. For such a function, I’d independently calculate it, using some method I trust (maybe Wolfram Alpha). I’d do this for a handful of examples, trying to cover base and extreme cases. And then I’d do property testing if I really wanted good coverage. Further, this expect test library seems to just smooth the experience of copying what the function returns into a test.

This whole “expect test” business seems to rely on the developer looking at what the function returns for a given input, evaluating if it’s correct or not and then locking that in as “this is what this function is supposed to do”. That seems backwards and no different from how one implements functions in the first place, so I don’t know what is actually being tested.

The entire point of testing is saying “this is what this function should do” and not “this is what the function did and thus that’s what it should always do”.


You're supposed to use it as a repl, so you start with a test for `fib(1) = 1`, then `fib(2)` and so on. Once you're confident of your implementation, you use quickcheck to test general properties of the system.

Similarly if you find a bug in the live system, you add a test for that and the initial output will be wrong. Then you fix your code until it prints the correct value and commit that so any regression will be caught.


I work with a language where all tests are expect tests (GAP). The biggest problem is you can basically never change how built-in types are printed, as you'll break all the tests in every program. For example, someone wanted to improve how plurals are printed, but that would break every test.


Is there anything like this in Python or C#? I have worked with OCaml extensively in coursework, but there’s no chance I’ll be using it in prod any time soon and I’d love toying with this approach in my working languages.


In C#: https://theramis.github.io/Snapper/#/pages/quickstart. I actually think there are others as well, but this is the one I'm trying out to see if I can use it somehow.


For python there is https://github.com/ezyang/expecttest which is modeled after the OCaml expect test library.
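From memory (so double-check against the README), usage looks roughly like this: you write an empty or stale expected string, and re-running with EXPECTTEST_ACCEPT=1 rewrites the literal in place with the actual output, much like the OCaml workflow.

    import unittest
    import expecttest

    def fib(n):
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    class TestFib(expecttest.TestCase):
        def test_fib(self):
            # The second argument is rewritten in place when accepting new output.
            self.assertExpectedInline(str(fib(15)), """610""")

    if __name__ == "__main__":
        unittest.main()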




