Pynguin – Generate Python unit tests automatically (github.com/se2p)
156 points by JNRowe on June 1, 2021 | 79 comments



Just in case you are looking for an alternative approach: if you write contracts in your code, you might also consider CrossHair [1] or icontract-hypothesis [2]. If your function/method does not need any pre-conditions, then the type annotations can be used directly.

(I'm one of the authors of icontract-hypothesis.)

[1] https://github.com/pschanely/CrossHair

[2] https://github.com/mristin/icontract-hypothesis


Coding with contracts has always been interesting to me, but I haven't had the option/time to try it seriously in a project. I assume you've had experience with it: how much productivity/code maintainability do you gain compared to not using it (or to using only type annotations)?


In my anecdotal experience, it takes very little time for juniors to pick up adding contracts to their code. You need to grasp implication, equivalence, exclusive or, and get used to declarative code, but then it's easy. (I often compare it to SQL.)

I find contracts personally super useful as I can express a lot of relationships in the code trivially and have them automatically verified. For example, when this input is None, then that output needs to be positive. Or if you delete the item, it should not exist in this and that registry, and some other related items should also not exist anymore.
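For instance, the first relationship above might be written as a minimal sketch with icontract's ensure decorator (the function itself is made up):

    from typing import Optional

    import icontract

    # Hypothetical function: if no factor is given, the result must be positive.
    @icontract.ensure(lambda factor, result: factor is not None or result > 0)
    def scale(factor: Optional[int]) -> int:
        return 10 if factor is None else 10 * factor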

My email is in the commits of the repository, feel free to contact me and we can have a chat if you are interested in a larger picture and more details.


P.S. I think the important bit is not to be religious about contracts and tests. Sometimes it's easy to write the contracts and have the function automatically tested, sometimes unit tests are clearer.

I tend to limit myself to simple contracts and do table tests for the complex cases, to reap the benefits of both approaches.


If a software module maybe has tests and maybe has contracts, I "reap the benefit" of not being able to rely on either.

Do you have in mind some technique to turn contracts into tests or tests into contracts automatically, in order to close the gaps in quality control?


Sorry, I did not express myself clearly. For certain functions you can express all the properties in contracts and have them automatically tested.

For other functions, you write some obvious contracts (so that those are also tested in integration tests or end-to-end tests), but since writing all the contracts would be too tedious or unmaintainable, you additionally test these functions using, say, table-driven tests, where you specifically test with data points for which you could not write the preconditions, or you check the post-conditions manually given the input data. For example, sometimes it is easier to generate the input, manually inspect the results, and write the expected results into the test table.
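To make the table-driven part concrete, a minimal sketch using pytest.mark.parametrize (the function and cases are made up):

    import pytest

    def count_words(text: str) -> int:
        return len(text.split())

    @pytest.mark.parametrize(
        "text, expected",
        [
            ("", 0),                # empty input
            ("a", 1),               # single word
            ("hello world", 2),     # manually inspected and recorded
        ],
    )
    def test_count_words(text: str, expected: int) -> None:
        assert count_words(text) == expected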

> [...] turn contracts into tests [...]

icontract-hypothesis allows you to ghostwrite Hypothesis strategies which you can further refine yourself.
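Illustrative only: the kind of Hypothesis test a ghostwriter might produce, which you could then refine yourself (e.g. by tightening the strategy):

    from hypothesis import given, strategies as st

    def clamp(value: int, low: int = 0, high: int = 100) -> int:
        return max(low, min(high, value))

    @given(value=st.integers())
    def test_clamp_stays_in_range(value: int) -> None:
        assert 0 <= clamp(value) <= 100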


I can see this being useful for refactoring legacy codebases. You create assertions for existing code, whether correct or wrong as a result of existing bugs, and then make sure nothing changes after your refactor.

For new code, not so much. The generated tests wouldn't prevent bugs by themselves, but they may uncover bugs if you see assertions that don't make sense. But you have to spend time reviewing the tests.

The tests also don't describe what they are testing in human terms, so you have to refactor them before committing.

They also would end up testing more than necessary. Sometimes the code's behavior is intentionally ambiguous: not expected to matter in the real world, likely to change in the future, and something you should not depend on today.


> Sometimes the code's behavior is intentionally ambiguous: not expected to matter in the real world, likely to change in the future, and something you should not depend on today.

That is a bad thing and should be addressed. It’s code like this that will inevitably cause you pain later and very likely become significant code debt.


There wasn't much info on the GitHub repo about how it actually works. Here's the paper from the authors: https://arxiv.org/abs/2007.14049

From what I gather from the paper, they frame the problem of test generation as a search problem. An evolutionary algorithm randomly mutates a randomly generated test suite. The evolutionary algorithm optimizes for greater branch coverage.

Excerpt from the abstract:

"Our experiments confirm that evolutionary algorithms can outperform random test generation also in the context of Python, and can even alleviate the problem of absent type information to some degree. However, our results demonstrate that dynamic typing nevertheless poses a fundamental issue for test generation, suggesting future work on integrating type inference"


I have trouble even understanding the point of this. To me "automatic test generation" seems as bad an idea as "automatic code generation".

The point of tests for me is to express intent for what the software should do. The code then expresses the details of how we do it.

I could see a use for evolutionary algorithms to probe code for hidden bugs. But even there it seems limited. I definitely see a use for things like Hypothesis, which makes test expression more powerful: https://hypothesis.readthedocs.io/en/latest/

It makes sense to me that this comes from an academic perspective, and not from people who actually make software for a living.


Automatic test generation in this case is not testing your hypothesis of how the code behaves, but rather records how the code actually does behave.

This can be very useful if you are dealing with legacy code bases or just don’t want to write tests yourself.

I agree that if you are going to write the tests, using Hypothesis is a very wise decision.


Even there, the code already records how the code behaves. The question is always which of those behaviors are intentional, meaningful. I am unable to fathom the utility of recording an arbitrary slice of behaviors, because those are very likely to be orthogonal or even contrary to the actual intent of the system. For example, every bug is part of how the code actually behaves.

I would much rather inherit a code base with zero tests than ones automatically generated by people who "just don't want to write tests".


A test fixes the behavior. When later the need to change the code arises, you have a meaningful way of comparing results.

Any test is better than none, in my opinion.


That's definitely not true. Tests that lock in a bug or an accidental epiphenomenon of code are demonstrably worse, because they actively discourage improvement.


Sometimes automated testing is the only way to go.

Consider a case where you have code that takes about 100 input parameters and makes a decision on whether something should or should not happen. The logic that determines the decision is about 1,000 LOC that doesn't divide easily into smaller chunks. The code consists of a lot of different conditions and calculations based on the input parameters and some constants.

Nobody knows for sure or can explain how every single line works or why it is there, but it seems that everything works correctly based on the decisions it produces.

Writing tests manually might be feasible, but it is very impractical considering that you don't know whether the code is correct in the first place. The code is being modified on a regular basis: new inputs are added and the logic is changed. Any such change would break the manual tests. In my experience, nobody will spend an hour studying why a test broke and do the "right" fix. It will most likely be the easiest fix that makes the test pass.


This all sounds like a race to the bottom, where automated testing will drive further kludge upon kludge and the code will get less understood and harder to maintain and write tests for.

Whilst an automated test suite can at least check if the code may have broken - you'll never know if the break is significant or just an inconsequential side effect of a new piece of data. And if both the code and tests aren't understood, as you say the "right" fix won't be done and something will be added quickly to fix the test.


If no one knows whether the code is correct a generated test could easily do more harm than good.

If no one is able/willing to refactor the code into understandable and testable bits, there's not much left to guarantee the correctness of it.


In that situation, why do you WANT tests?

If, for example, someone intends to refactor the code into something more maintainable and wants tests to prevent any accidental changes in functionality, then this is useful.

If the goal is to catch bugs or to facilitate future development or to meet some test coverage standard then I argue you might be better off not having tests rather than having auto-generated tests that have not been carefully reviewed.


I'm intrigued by this example. What's an example of a domain or problem set that produces a function with 100 input parameters and a boolean result that can't be broken down cleanly?


The decision is actually 5 different variables. The types are a boolean, a decimal percentage, and an integer representing a monetary value. The domain is fintech: making a decision on how much money to give, plus some other aspects of the loan.


> The domain is fintech

What you describe is the opposite of what I expect from (modern) fintech.

* no tests

* god method with no proper documentation/requirements

* no one actually understanding the function deciding how much money to send! Isn't this one of the core functionalities of fintech products?


Just plain scary, especially with money, if they can't at the very least test the extremes and some middle values. Good gravy! What is the name of this company so I can avoid it...


That sounds like no one knows what the correct result is.


The whole thing does not sound very competent. It's like no one could possibly guarantee that the result is not off by a factor of 2 or 3, or perhaps even 10 or more in some cases. What would happen in that case? It also sounds like a domain where mistakes could be very expensive or lead to lawsuits and similar things.


Things are not that bad. While the whole thing is kind of a monolith, there are at least broad sections that split the code logically. With some practice, it is possible to narrow down changes and any issues to 10-20 LOC. Later, breaking the code up along those sections could be a starting point for refactoring. In the end, the decision is manually reviewed. There's also a log that tells how and why the code arrived at the numbers. So if you know that a client cannot be X but the log says X, then you know there's a bug, for example.

This code was around before I joined. It started small, but more and more requirements were added until it became like that. Not much effort went into structuring it. Now it's my responsibility, and I'm trying to make it better for me and anyone who will have to deal with it in the future.


Does this mean there are at least 2^5 = 32 paths that a procedure can take? Seriously, is it even documentable?


This might not be quite the same, but logic systems -- we might have a logic expression on a hundred variables, and need to check if it is true.

While one would unit test with tiny expressions, often bugs only show up in large integration tests, as systems use many optimisations to cache and speed up behaviour, which can interact poorly.


Adding a product with many attributes to a database, where some of the attributes depend on the values of others, might be one example.


Years ago, in a similar situation, I wrote a tool to capture parameters and results for the algorithm while QA ran their full application test suite. Ended up with a few thousand cases and expected results; results where QA had verified the software worked as expected.

Used that to create a test suite that we could then maintain normally.

Much better than random.
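A minimal sketch of that capture idea (all names made up): a decorator that records each call's arguments and result to a JSON-lines file, which can later be turned into a test table.

    import functools, json

    def record_calls(path):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                result = func(*args, **kwargs)
                with open(path, "a") as fh:
                    fh.write(json.dumps({"args": args, "kwargs": kwargs,
                                         "result": result}) + "\n")
                return result
            return wrapper
        return decorator

    @record_calls("decision_calls.jsonl")
    def decide(amount: float, score: int) -> bool:  # stand-in for the real logic
        return score > 600 and amount < 10_000

    decide(5_000, 700)  # captured cases become expected results in the test suite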


What you're describing is testing without understanding the code. I've not yet encountered incomprehensible code, and can't conceive of what it might look like, though I've encountered code that I'd rather not try to understand.


So:

- you don't know whether the code is correct or not

- code is being modified on regular basis... and logic is changed

If someone knows enough about the code to regularly change it, then they should be able to provide some documentation? How else are they able to modify a "correct" function and know that it's still correct?

Can I guess that the code modifiers are quants, or something?


Pynguin executes the module under test! As a consequence, depending on what code is in that module, running Pynguin can cause serious harm to your computer, for example, wipe your entire hard disk! We recommend running Pynguin in an isolated environment, for example, a Docker container, to minimise the risk of damaging your system.


This warning quoted from the README doesn't seem to suggest anything unexpected or unusual. Obviously running the tests would execute the module under test.

(Although I suppose if someone wanted to generate tests without ever executing them the warning would be relevant.)


This warning is very important. It’s absolutely not expected that generating tests means executing module code, side effects and all.


Was anyone thinking it would "break" their computer just generating the tests? I would have thought that such a thing could only happen later, particularly if your Python code was doing a bunch of file-related stuff.


You wouldn't expect it to load the module to inspect it? Loading a module executes top-level code.
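A toy illustration (the module name is made up): merely importing this module runs its top-level statements, before any generated test ever calls a function in it.

    # risky_module.py
    print("this runs at import time")  # any other top-level side effect runs too

    def add(a: int, b: int) -> int:
        return a + b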


You’re confusing loading the module vs executing every method inside. I would not expect loading a module that has dangerous methods to do the destructive thing by default.


I admit to being surprised by that ... I assumed it generated tests (you know, like it says) that I could then inspect and run. If you know how it works, then of course it is creating tests by running the code.


What kind of "reassurance" does a tool like Pynguin provide? In my head, tests should be written from requirements, before - or at least alongside - the code.


If you do TDD, yes.

But there are a lot of projects out there that don't, and write tests afterward.

I do.

It's usually much, much faster and easier for me to write the tests after I figure out how to make what I want work. I sometimes need to change the API a little to be more testable, but it's still faster.

In fact, if I write a test first, I have to write it with an abstract idea of how my code will work and need to understand the problem perfectly, which I rarely do. My understanding of the problem and the solution grows as I write the code. It helps me ask the right questions, it reveals issues I didn't think about, it shows parts of the API I didn't know by heart, and above all, it cleans up my initial idea of the workflow for this part of the code.

So if I write the test first, it will take a lot of time to think it through, and I will realize later I got it wrong and have to rewrite it anyway. You could argue I need to better define my problem or get better specs, but my experience is that they are always wrong as long as you haven't written any code: "no plan survives contact with the enemy".

Not to mention you may not have written the untested code in the first place.

So I'd love a tool that can output test boilerplate for me.


Slightly off-topic but I thought it might be interesting: you can easily modify tests to adapt to your changing architecture as well.

Personally I use TDD to have an easy entry point to run through my code, so the first instance of a test for an API might only check the status code.

It makes the process very iterative, as you can continuously run through your code just by saving
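For example, that first status-code-only test might look like this (a sketch assuming a Flask app with a hypothetical /health endpoint):

    import flask

    app = flask.Flask(__name__)

    @app.route("/health")
    def health():
        return {"status": "ok"}

    def test_health_returns_200() -> None:
        # First pass: only assert the status code, refine later.
        assert app.test_client().get("/health").status_code == 200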


> you can easily modify tests to adapt to your changing architecture as well.

Can you? Surely it depends; most systems end up with more LOC in the tests than the tested code, and the tests multiply the cost of refactoring.

I've had success with the "happy path first" kind of TDD, where you write a use case and implement to that and only then come back and fill in other tests.


I might've been too unclear in my comment. I was indeed speaking specifically about the issue the parent comment had with writing detailed tests before the application logic is implemented, only to be forced to change them because the initial idea for the API had to be changed.

Changing existing APIs comes with a very different set of challenges.


Let's say you test an API, since you are talking about status codes. You have to test all kinds of cases with wrong inputs, various user states, and so on.

You can of course write the test to simulate all that, but it's way harder than just using your UI, which lets you discover things, and putting a breakpoint in the endpoint to inspect. I may not remember where to get the header data in this framework, or maybe the client sends me a weird query param and I must figure out why, or my validation is failing in some case and I must check it out. I can't simulate something I don't know about or don't understand yet.

Plus, using the UI will make you realize you forgot to limit input length, or what happens if the user doesn't have any phone number, or if it's Tuesday.

The minimal use case you will write your code against will likely only be good in theory, unless you're doing something very basic. You will probably throw it away in the end in exchange for real code learned from feedback, and you'll have written all that for nothing.

This is where TDD proponents say you should write the minimal case and discover all that stuff incrementally, but again, it's way easier and faster to just fiddle with the UI and write the code I need, then test each case, than to imagine each case incrementally and find the proper incantation to make the whole thing work.

I'm exaggerating for the sake of the argument, but that's the result I get every time I work on a project where the team requires TDD. Incidentally, those are the projects with the worst UI as well, because the devs don't use it enough.


Your whole argument is a straw man.

I don't think that anyone would disagree that you can get caught up in writing too many tests too early, consequently losing sight of your objective and ending up with a bad implementation.

My point was that just writing super easy and happy-path tests at the start is a valid strategy for TDD as well, and does not cause any of your imaginary issues.


Aggressive tone and a throw away account, this is the end of the thread for me.


Rereading my comment, it does indeed sound very aggressive. It wasn't intentional and I'm sorry for that. I have to work on that.


One reason to use a tool like this is to supplement hand-written unit tests when doing a refactor. Ultimately, a lot of unit tests' jobs are just to catch unexpected side-effects of a code change.

> tests should be written from requirements

TDD might be too big of a debate to fit in an HN comment thread :)


I can't talk about this specific tool because I never used it and they are a bit short on the details. But if they ensure code coverage then you know at least that every line of code has been run at least once. Tools may also inspect every "if" and "while" and look for typical boundary conditions (when applicable) such as underflow, overflow, off-by-one, None, etc.

Those two checks alone can take you far, especially when compared to the alternative of "I tried it once and it worked".
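A hand-written illustration of the boundary checks such a tool might aim for automatically (the function and limits are made up):

    def within_quota(used: int, limit: int = 100) -> bool:
        return 0 <= used <= limit

    def test_quota_boundaries() -> None:
        assert within_quota(0)        # lower bound
        assert within_quota(100)      # upper bound (off-by-one trap)
        assert not within_quota(-1)   # just below the range
        assert not within_quota(101)  # just above the range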


Well yes, but that kind of testing has also proven to be a waste of resources(1).

So an automated tool may be the only path forward as far as testing goes. Whether this one adds value or not is hard to say as their documentation is extremely sparse (probably why they made the tool in the first place).

1) http://www.knosof.co.uk/ESEUR/


Which kind of testing has been proven to be a waste of resources? TDD?

I'm no fan of TDD -- I think there's a lot of cargo culting about it -- but I wasn't aware anything had been "proven" about it. I've seen papers claiming it helps, papers claiming it sort of helps, and some people claiming it mostly doesn't help (or that what actually helps are some practices that go along with it).

I thought there was no industry consensus that it doesn't help.


In my head requirements only provide an overview, but don't dictate specific structure e.g. class/function structure.

A majority of test suites require knowledge of the code structure in order to write the tests.

To write the code before tests you either need to dictate some strict form of input/output, and test based on those only; OR, use a test suite that is very flexible.

BDD seems to cheat by writing "tests" that aren't fully executable until you fill in the functional specifics later.


The best use case I see for this is when you inherit a testless codebase. It's a way to get bootstrapped, and go from there.

Sometimes you have to honor bugs as part of the (undocumented/evolved/inferred) interface. It could help to discover these, walk through them and see where they can be fixed or where they need to be honored.

It could also help prevent the "rewrite from scratch because I don't understand it" problem. Heck, I've even done that to myself picking up a codebase I haven't touched in 3+ years, sometimes validly, sometimes not so much.

I'll have to see how it works next time this happens to me, because this is pure conjecture.


Just skimmed through the paper: https://arxiv.org/pdf/2007.14049.pdf. It seems like they heuristically mutate a test suite until it achieves full branch coverage, based on calculating a `branch distance` for each predicate. Why isn't an SMT solver like Z3 used here to solve the predicates (generating inputs that evaluate them to true/false)? It's getting so powerful, and Python container/dict/string (regex) operations can also be modeled conveniently.
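A tiny sketch of the idea: ask Z3 for an input that drives a specific branch, instead of searching for it by mutation (assumes the z3-solver package; the predicate is made up).

    from z3 import Int, Solver, sat

    x = Int("x")
    s = Solver()
    s.add(x > 10, x % 7 == 3)  # the branch condition we want to cover
    if s.check() == sat:
        print(s.model()[x])    # a concrete input that takes this branch, e.g. 17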

And I'm also wondering whether there is a return-based approach to automatic test generation that starts from the return value, resolves all variables used, gathers all possible return values with their constraints, and feeds those to Z3 to generate covering inputs. It seems like that would help with branch explosion by eliminating unused branches and focusing only on the branches that are actually used.

Edit: it looks like CrossHair [0] is a similar tool that uses Z3 to find counterexamples for predicates.

[0] https://github.com/pschanely/CrossHair


It would be nice if the README included use cases for a tool like this.

In my experience with unit testing, the process of writing the unit tests is where most of the value is derived. It forces developers to examine use cases and potential inputs for the unit under test more carefully (the latter being especially important for Python). Therefore I would never consider something like this for testing new code.

The rest of the value of unit tests mostly comes from validating the scope of a change's impact. If a unit test which once passed fails after I make a change, it could mean my change affected some other part of the code in a way I did not expect. But in my experience it usually just means I overlooked something in the test suite - not that I broke something in the software itself. Auto-generated unit tests might be useful for validating the scope of a change's impact on old code, but I'd expect most failures to boil down to issues with the tests not getting updated correctly.

Where I imagine auto-generated unit tests being particularly useful is for ensuring functionality is not affected when working on bug fixes, security patches, optimizations, cruft busting, or simple refactoring.


>> Pynguin can cause serious harm to your computer, for example, wipe your entire hard disk! We recommend running Pynguin in an isolated environment, for example, a Docker container, to minimise the risk of damaging your system.

That sounds ominous.

Why do they execute the generated tests though? I mean I'd expect this to go only as far as generating the tests. Also that means that at least someone needs to review the tests.


It provides arguments to your code, runs it, and records the output.

Based on the input types, it creates a hyperdimensional surface which is then searched for possible output values.

So, it runs your code, and runs it a lot. If your code deletes /, it will delete it a lot.


> mature tools exist—for statically typed languages, such as Java

Is that really a thing Java developers use? I am one; I remember doing a PoC with Jtest (now Parasoft) in 2008, and it was utterly useless, and I've never come across any such tool since. I'd be genuinely interested to learn more (I'll go duckduckgo now).


They probably refer to EvoSuite, which is/was mainly developed by one of the authors of the paper: https://www.evosuite.org/


Maybe it's a reference to the compiler or the plethora of internal and paid tools out there in the enterprise world.


The README lacks an example of input and output.


It does, but their documentation's Quickstart has an example:

https://pynguin.readthedocs.io/en/latest/user/quickstart.htm...


This looks like they combined the exploration strategy of property-based testing with unit testing. This is not even testing; it just tries to find some examples that result in different code paths. There is no concept of "expected result, given a certain input", and it does not try to find extreme cases or equivalence classes. Using it will result in a heap of buggy code that has a useless "but tested" sticker.


I think you are looking at the use case wrong. In my work I come across a lot of untested legacy code. Autogenerated tests would be like guardrails when working with these code bases. I am not going to use them blindly, and it doesn't even do variable names yet, but it's boilerplate I would have written anyway. If this gives me extensive coverage, adding test cases for specific things becomes trivial.


Agreed. I went digging and found one in the docs. https://pynguin.readthedocs.io/en/latest/user/quickstart.htm...


If I write faulty code, the generated test will make sure I keep the code faulty right?


Yup, and their example shows it: https://pynguin.readthedocs.io/en/latest/user/quickstart.htm...

The example is a function which reports the kind of triangle. The generated tests include one that tests that a triangle with sides 12, 12 and... er... None is "isosceles".
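Illustrative only, not Pynguin's literal output: a toy version of that situation, where the generated test simply locks in the unchecked third side.

    def triangle(a, b, c) -> str:  # toy stand-in for the documented example
        if a == b or b == c or a == c:
            return "isosceles"
        return "scalene"

    def test_triangle_12_12_none_is_isosceles() -> None:
        assert triangle(12, 12, None) == "isosceles"  # the bug is now "expected"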


That kinda sucks, since the function is type hinted...


preprint available for those curious about how this works: https://arxiv.org/abs/2007.14049


> Please consider reading the online documentation to start your Pynguin adventure.

Nice wink to Penguin Adventure, brings back very old memories.


This looks great; I'm looking forward to trying it out on my next Python project. That being said, that's why I prefer statically typed languages: the compiler will do "automatic tests" by checking types, and I don't have to write tests that check a function with different input types to validate that it can handle them all.


I once found these names funny: replace a letter after a "P" in a word with a "y" to make it a Python library name. It has become a bit trite over time though. What other funny "strategies" are there in Python or other programming language ecosystems?


Add `N` before any Java lib to name a .net library. If `NFoo` is taken, feel free to append `Sharp` at the end. Special case if the library is written in F#, you have to start your project by the letter F to show your love to the functional world.


This is a small excerpt from a draft blog post I am currently writing:

- Python forces the `py` affix in words: scrapy, scapy, rpyc

- Python does wordplay on standard library modules' names: pickle => dill

- Emacs and Julia append the file extension to the end of the name: restclient.el, HTTP.jl, JSON.jl

- Emacs does wordplay on the package the extension is based on: git => magit

These are the ones I was most familiar with. If the replies to your comment provide more insights, I may be able to extend that part of the blog post :)


https://github.com/JuliaSIMD/Polyester.jl is a cheap threading model in Julia.


Adding oxide to the name in Rust community comes to mind.


Can anyone recommend a good online course on software testing (preferably using python)?


Harry Percival, famous for "Obey the Testing Goat", also has a new book, "Cosmic Python". This guy is a genius at teaching :) Both books are available at the respective sites:

https://www.obeythetestinggoat.com/pages/book.html#toc

http://www.cosmicpython.com/

If you feel that TDD is too much, there is also a book about pytest, though I never really focused on it.

Hope it helps,

PS: Ty Harry :)


> Testing software is a tedious task.

I disagree with this premise. Testing your code is a great way to (among other benefits) think about the external interface to your software components, and so can help identify whether you have a quality abstraction or not.

(I don't advocate for TDD, even though I get that it might be a workflow that helps some people. But you don't need TDD to get the benefits of writing your own tests.)

Of course, testing is something you have to get "good at", too...



