I think data science is a perfect example in favor of types -- the code is often terrible because of the lack of typing. Pandas has notoriously poor developer ergonomics, and I recall painfully poring over type errors across the board -- lists, dataframes, numpy arrays, etc. are all iterables, so they can be interchanged in some contexts, but not in others.
Had I had MyPy back when I was working in data science, I would've saved countless hours and headaches.
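To make that concrete, here's a rough sketch of the kind of mix-up mypy catches (assuming pandas-stubs and numpy's bundled stubs are installed; the function itself is made up):

    import numpy as np
    import pandas as pd

    def mean_score(scores: list[float]) -> float:
        # Expects a plain Python list and nothing else.
        return sum(scores) / len(scores)

    df = pd.DataFrame({"score": [1.0, 2.0, 3.0]})

    # All three calls "work" at runtime because Series, ndarray and list are
    # all iterable, but mypy flags the first two as incompatible argument types:
    mean_score(df["score"])             # Series is not list[float]
    mean_score(np.array([1.0, 2.0]))    # ndarray is not list[float]
    mean_score([1.0, 2.0, 3.0])         # fine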
What Pandas does is notoriously hard to fit into a compile-time type system. Certainly too hard to go into the brains of scientists who didn't grow up coding.
No, the code in data science isn't bad because of the lack of typing. The code is "bad" mostly because those writing it are relatively new to programming. There is also more pressure to just make things work, often for a single run, and to neglect repeatability or scaling to larger code bases. Different emphasis. That doesn't mean an experienced full stack developer would do Data Science better, because he might lack a lot of skills that matter more in that domain.
> That doesn't mean an experienced full stack developer would do Data Science better, because he might lack a lot of skills that matter more in that domain.
This resonates with my experience. I had the opportunity to work on a DS codebase written entirely in Scala with all the typing, parallelism, actor model, whatnot. Basically I joined the company because of this technical factor. It was fun until I figured out that DS was "typed IF-THEN-ELSE written by Java devs in Scala returning stuff the users complain about with high reliability within milliseconds". Now I am happy to be back in the single-threaded untyped Python world. Still no bugs in production, because we validate all requests to death and have unit tests and integration tests running on real data, not on mockups. Basically we follow the principle: if the integration test passes, our typing is just right, or at least good enough. Funnily enough, all the typing errors we catch are caused by wrongly typed data coming from the production system written in a typed programming language... what a strange world.
> integration tests running on real data, not on mockups.
I can see you are enjoying the life outside of a highly regulated industry. Having certain kinds of production data in tests (or feeding that to test environment) would be a major audit finding in any finance or healthcare company.
...thereby destroying the production-like features that made it useful for testing in the first place, which you then need to recreate and reintroduce. So you might as well just synthesize test data from the start, since that's what you end up doing anyway, in effect.
I know you probably just want to be annoying, but really, there is a world of difference between completely synthetic and anonymized data. Without getting into specifics, anonymizing is just the operation of making deanonymization a lot harder. "Hard enough" is usually specified in some form by the regulator. You can identify an individual by their ECG data, for example; it's just really hard...
No, in actual practice you don't scrub the stuff you actually need to test.
> I know you probably just want to be annoying, but really, there is a world of difference between completely synthetic and anonymized data.
No, I’ve spent ~20 years in healthcare, where this has been a frequently recurring issue.
> No, in actual practice you don't scrub the stuff you actually need to test.
In actual practice, the stuff you really need to test often overlaps with the minimum you'd have to scrub to legally de-identify the data. The most common outcome I’ve seen when teams try this is that they end up doing most of the work of generating synthetic data anyway, while still failing to legally de-identify the source data.
What I wrote was meant in the context of data science. You cannot do ML without access to real data, not even in highly regulated industries. Obviously you won't touch PII. But whether the real data sits in your train/test/validation sets or is used for integration tests makes no difference from the perspective of an audit.
This is the biggest difference. When there's no expectation that code will last beyond a very short lifespan, why go through the effort of future-proofing it, improving maintainability, better ergonomics, etc.?
Sure. And once you land on something worth keeping, you clean it up then. But between time point 0 and then, a whole lot of code gets written to be run very few times.
A pandas dataframe types to something like Iterable[SomeKindOfProductType[int, str, str, ... (78 other columns)]]. The formal type of a dataframe in the middle of a transformation is... not very useful to know.
You wouldn't typically type tabular data in any language. Give it (at most) a DataFrame<StoreTransaction> type if you must, where StoreTransaction describes the structure of a row - maybe declaring only the columns the model-generation code actually does typed operations on (e.g. numerics vs strings) to avoid the need for reflection.
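If you did want that in Python, one rough way to sketch it today is pandera's class-based schemas (assuming a recent pandera with DataFrameModel; StoreTransaction and its columns here are purely illustrative):

    import pandera as pa
    from pandera.typing import DataFrame, Series

    class StoreTransaction(pa.DataFrameModel):
        # Declare only the columns the downstream code does typed operations on.
        store_id: Series[int]
        amount: Series[float]
        currency: Series[str]

    @pa.check_types  # validates the incoming frame against the schema at runtime
    def total_revenue(df: DataFrame[StoreTransaction]) -> float:
        return float(df["amount"].sum())

The annotation mostly serves as documentation for static tools; the actual column checks still happen at runtime when the decorator fires.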
Either you type the tabular data at compile time or you don't get type checking of tabular data at compile time.
The number and types of the columns aren't necessarily known at compile time, which leads to runtime errors. Even in a statically typed language, such dataframes are a kind of "dynamic typing escape hatch". As the complexity of a piece of software increases, such mechanisms of dynamic typing and runtime errors creep in all over the place.
Sure. If you're dealing with untyped data at runtime you can't type it at compile time. Not a new issue, and handled all the time in otherwise typed languages.
> What Pandas does is notoriously hard to fit into a compile-time type system. Certainly too hard to go into the brains of scientists who didn't grow up coding.
I'm not sure that's true. Doesn't Pandas handle ETL and some analysis? There is nothing inherent to ETL that makes it a hard problem for compiled languages.
In your opinion, what does Pandas do that's hard to do with compile-time languages?
I didn't say "with compile-time languages" but "with compile-time type systems". And many similar tools in a statically typed language will necessarily create a way to have one static type that doesn't care what the data inside actually looks like.
This even starts with basic Numpy and handling tensor objects. It's not easy for a type checker to understand what operations you can do with what shape of tensor. Worse, most often you don't know (or want to know) some of the dimensions, or even the dimensionality, of some of the objects. Then it is impossible to check all of this at compile time.
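A small illustration of what I mean, using the most precise annotations stock numpy typing offers:

    import numpy as np
    import numpy.typing as npt

    def project(points: npt.NDArray[np.float64],
                basis: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
        # Nothing in the annotations says points should be (n, 3) and basis (3, 2);
        # any float64 array of any shape satisfies the type checker.
        return points @ basis

    project(np.zeros((10, 3)), np.zeros((3, 2)))   # fine
    project(np.zeros((10, 4)), np.zeros((3, 2)))   # passes the type check, fails at runtime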
> This even starts with basic Numpy and handling tensor objects. It's not easy for a type checker to understand what operations you can do with what shape of tensor.
That doesn't sound like a Python problem.
Instead, it sounds like the natural consequence of numpy's data types not being organized into subtypes, leaving that information as runtime properties. This is a natural reflection of numpy's take on vectors, matrices, and tensors, which in terms of types are just big arrays with runtime properties.
To put things in perspective, in C++, Eigen supports static dense vectors and matrices whose sizes are specified and known at compile time. I'm sure Python doesn't impose stricter static type constraints than C++.
Of course it's not a Python problem; all similar tools have the same "problem" that they can't easily fit that stuff into their type systems, so they invent some way to not care about it.
It isn't a matter of "compile time": explicit type declarations and definitions can often be formally sound but practically worthless.
Significant types in ETL-style applications typically come from outside (e.g. a certain CSV column in the input file contains a date in YYYYMMDD format, or maybe YYYYDDMM, figure it out, and don't forget time zones or your accounting will go wrong).
Then types are mostly complex but obvious and easily deduced (e.g. multiplying matrices of compatible shapes necessarily gives a matrix of a certain shape, so why should the program say anything more detailed or lower-level than "do a matrix multiplication" or "do a tensor product"?); they are often a dynamic and unpredictable property of the data, not a useful abstraction.
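Concretely, the work that matters is validating those external assumptions at the boundary; a minimal sketch (the file name, column name and date format are made-up examples):

    from datetime import datetime, timezone
    import pandas as pd

    def parse_posting_date(raw: str) -> datetime:
        # The real "type" question lives here: is the column YYYYMMDD or YYYYDDMM,
        # and which time zone did the source system use? No compile-time check knows.
        return datetime.strptime(raw, "%Y%m%d").replace(tzinfo=timezone.utc)

    df = pd.read_csv("transactions.csv", dtype={"posted_on": str})
    df["posted_on"] = df["posted_on"].map(parse_posting_date)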
The source code shouldn't need to say anything about the type of the resulting matrix explicitly, perhaps. But why shouldn't the type system keep track of shapes and deduce the accurate type for the result of said multiplication?
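There are attempts at exactly that, e.g. jaxtyping-style shape annotations, which also work on plain numpy arrays (the checking happens at runtime via something like beartype, not in mypy):

    import numpy as np
    from jaxtyping import Float

    def matmul(a: Float[np.ndarray, "m n"],
               b: Float[np.ndarray, "n k"]) -> Float[np.ndarray, "m k"]:
        # The symbolic dimensions m, n, k carry the shape algebra: the result's
        # shape is deduced from the inputs' annotations.
        return a @ b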
Totally agree, I would love for data scientists to use types and a lot of other good practices. I work primarily in MLOps and have been trying to push this for some time; it just mostly feels like a losing game.
The output of most analytics is a report. For longer-lived processes it seems easier to just hand off.