I think data science is a perfect example in favor of types -- the code is often terrible because of the lack of typing. Pandas has notoriously poor developer ergonomics, and I recall painfully poring over type errors across the board -- lists, dataframes, numpy arrays, etc. are all iterables, so they can be interchanged in some contexts, but not in others.
Had I had MyPy back when I was working in data science, I would've saved countless hours and headaches.
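To make that concrete, here's a rough sketch of the kind of mix-up mypy catches (assuming pandas-stubs and numpy's bundled stubs are installed; the function itself is made up):

    import numpy as np
    import pandas as pd

    def mean_score(scores: list[float]) -> float:
        # Expects a plain Python list and nothing else.
        return sum(scores) / len(scores)

    df = pd.DataFrame({"score": [1.0, 2.0, 3.0]})

    # All three calls "work" at runtime because Series, ndarray and list are
    # all iterable, but mypy flags the first two as incompatible argument types:
    mean_score(df["score"])             # Series is not list[float]
    mean_score(np.array([1.0, 2.0]))    # ndarray is not list[float]
    mean_score([1.0, 2.0, 3.0])         # fine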
What Pandas does is notoriously hard to fit into a compile-time type system. Certainly too hard to go into the brains of scientists who didn't grow up coding.
No, the code in data science isn't bad because of the lack of typing. The code is "bad" mostly because those writing it are relatively new to programming. There is also more pressure to just make things work, often for a single run, and to neglect repeatability or scaling to larger code bases. Different emphasis. That doesn't mean an experienced full stack developer would do Data Science better, because he might lack a lot of skills that matter more in that domain.
> That doesn't mean an experienced full stack developer would do Data Science better, because he might lack a lot of skills that matter more in that domain.
This resonates with my experience. I had the opportunity to work on a DS codebase written entirely in Scala with all the typing, parallelism, actor model, whatnot. Basically I joined the company because of this technical factor. It was fun until I figured out that DS was "typed IF-THEN-ELSE written by Java devs in Scala returning stuff the users complain about with high reliability within milliseconds". Now I am happy to be back in the single-threaded untyped Python world. Still no bugs in production, because we validate all requests to death and have unit tests and integration tests running on real data, not on mockups. Basically we follow the principle: if the integration test passes, our typing is just right, or at least good enough. Funnily enough, all the typing errors we catch are caused by wrongly typed data coming from the production system written in a typed programming language... what a strange world.
> integration tests running on real data, not on mockups.
I can see you are enjoying the life outside of a highly regulated industry. Having certain kinds of production data in tests (or feeding that to test environment) would be a major audit finding in any finance or healthcare company.
...thereby destroying the production-like features that made it useful for testing in the first place, which you then need to recreate and reintroduce. So you might as well just synthesize test data from the start, since that's what you end up doing anyway, in effect.
I know you probably just want to be annoying, but really, there is a world of difference between completely synthetic and anonymized data. Without getting into specifics, anonymizing is just the operation of making deanonymization a lot harder. "Hard enough" is usually specified in some form by the regulator. You can identify an individual by their ECG data, for example; it's just really hard...
No, in actual practice you don't scrub the stuff you actually need to test.
> I know you probably just want to be annoying, but really, there is a world of difference between completely synthetic and anonymized data.
No, I’ve spent ~20 years in healthcare, where this has been a frequently recurring issue.
> No, in actual practice you don't scrub the stuff you actually need to test.
In actual practice, the stuff you really need to test often overlaps with the minimum you'd have to scrub to legally de-identify the data. The most common outcome I’ve seen when teams try this is that they end up doing most of the work of generating synthetic data anyway, while still failing to legally de-identify the source data.
What I wrote was meant in the context of data science. You cannot do ML without access to real data, not even in highly regulated industries. Obviously you won't touch PII. But whether the real data sits in your train/test/validation sets or is used for integration tests makes no difference from the perspective of an audit.
This is the biggest difference. When there's no expectation that code will last beyond a very short lifespan, why go through the effort of future-proofing it, improving maintainability, better ergonomics, etc.?
Sure. And once you land on something worth keeping, you clean it up then. But between time point 0 and then, a whole lot of code gets written to be run very few times.
A pandas dataframe types to something like Iterable[SomeKindOfProductType[int, str, str, ... (78 other columns)]]. The formal type of a dataframe in the middle of a transformation is... not very useful to know.
You wouldn't typically type tabular data in any language. Give it (at most) a DataFrame<StoreTransaction> type if you must, where StoreTransaction describes the structure of a row - maybe declaring only the columns the model-generation code actually does typed operations on (e.g. numerics vs strings) to avoid the need for reflection.
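If you did want that in Python, one rough way to sketch it today is pandera's class-based schemas (assuming a recent pandera with DataFrameModel; StoreTransaction and its columns here are purely illustrative):

    import pandera as pa
    from pandera.typing import DataFrame, Series

    class StoreTransaction(pa.DataFrameModel):
        # Declare only the columns the downstream code does typed operations on.
        store_id: Series[int]
        amount: Series[float]
        currency: Series[str]

    @pa.check_types  # validates the incoming frame against the schema at runtime
    def total_revenue(df: DataFrame[StoreTransaction]) -> float:
        return float(df["amount"].sum())

The annotation mostly serves as documentation for static tools; the actual column checks still happen at runtime when the decorator fires.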
Either you type the tabular data at compile time or you don't get type checking of tabular data at compile time.
The number and types of the columns aren't necessarily known at compile time, which leads to runtime errors. Even in a statically typed language, such dataframes are a kind of "dynamic typing escape hatch". As the complexity of a piece of software increases, such mechanisms of dynamic typing and runtime errors creep in all over the place.
Sure. If you're dealing with untyped data at runtime you can't type it at compile time. Not a new issue, and handled all the time in otherwise typed languages.
> What Pandas does is notoriously hard to fit into a compile-time type system. Certainly too hard to go into the brains of scientists who didn't grow up coding.
I'm not sure that's true. Doesn't Pandas handle ETL and some analysis? There is nothing inherent to ETL that makes it a hard problem for compiled languages.
In your opinion, what does Pandas do that's hard to do with compile-time languages?
I didn't say "with compile-time languages" but "with compile-time type systems". And many similar tools in a statically typed language will necessarily create a way to have one static type that doesn't care what the data inside actually looks like.
This even starts with basic Numpy and handling tensor objects. It's not easy for a type checker to understand what operations you can do with what shape of tensor. Worse, most often you don't know (or want to know) some of the dimensions, or even the dimensionality, of some of the objects. Then it is impossible to check all of this at compile time.
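A small illustration of what I mean, using the most precise annotations stock numpy typing offers:

    import numpy as np
    import numpy.typing as npt

    def project(points: npt.NDArray[np.float64],
                basis: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
        # Nothing in the annotations says points should be (n, 3) and basis (3, 2);
        # any float64 array of any shape satisfies the type checker.
        return points @ basis

    project(np.zeros((10, 3)), np.zeros((3, 2)))   # fine
    project(np.zeros((10, 4)), np.zeros((3, 2)))   # passes the type check, fails at runtime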
> This even starts with basic Numpy and handling tensor objects. It's not easy for a type checker to understand what operations you can do with what shape of tensor.
That doesn't sound like a Python problem.
Instead, it sounds like the natural consequence of numpy's data types not being organized into subtypes, leaving that information as runtime properties. This is a natural reflection of numpy's take on vectors, matrices, and tensors, which in terms of types are just big arrays with runtime properties.
To put things in perspective, in C++, Eigen supports static dense vectors and matrices whose sizes are specified and known at compile time. I'm sure Python doesn't impose stricter static type constraints than C++.
Of course it's not a Python problem; all similar tools have the same "problem" that they can't easily fit that stuff into their type systems, so they invent some way to not care about it.
It isn't a matter of "compile time": explicit type declarations and definitions can often be formally sound but practically worthless.
Significant types in ETL-style applications typically come from outside (e.g. a certain CSV column in the input file contains a date in YYYYMMDD format, or maybe YYYYDDMM, figure it out, and don't forget time zones or your accounting will go wrong).
Then types are mostly complex but obvious and easily deduced (e.g. multiplying matrices of compatible shapes necessarily gives a matrix of a certain shape, so why should the program say anything more detailed or lower-level than "do a matrix multiplication" or "do a tensor product"?); they are often a dynamic and unpredictable property of the data, not a useful abstraction.
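Concretely, the work that matters is validating those external assumptions at the boundary; a minimal sketch (the file name, column name and date format are made-up examples):

    from datetime import datetime, timezone
    import pandas as pd

    def parse_posting_date(raw: str) -> datetime:
        # The real "type" question lives here: is the column YYYYMMDD or YYYYDDMM,
        # and which time zone did the source system use? No compile-time check knows.
        return datetime.strptime(raw, "%Y%m%d").replace(tzinfo=timezone.utc)

    df = pd.read_csv("transactions.csv", dtype={"posted_on": str})
    df["posted_on"] = df["posted_on"].map(parse_posting_date)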
The source code shouldn't need to say anything about the type of the resulting matrix explicitly, perhaps. But why shouldn't the type system keep track of shapes and deduce the accurate type for the result of said multiplication?
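There are attempts at exactly that, e.g. jaxtyping-style shape annotations, which also work on plain numpy arrays (the checking happens at runtime via something like beartype, not in mypy):

    import numpy as np
    from jaxtyping import Float

    def matmul(a: Float[np.ndarray, "m n"],
               b: Float[np.ndarray, "n k"]) -> Float[np.ndarray, "m k"]:
        # The symbolic dimensions m, n, k carry the shape algebra: the result's
        # shape is deduced from the inputs' annotations.
        return a @ b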
Totally agree, I would love for data scientists to use types and a lot of other good practices. I work primarily in MLOps and have been trying to push this for some time; it just mostly feels like a losing game.
The output of most analytics is a report. For longer-lived processes it seems easier to just hand off.