The Data Science Manifesto (2020) (datasciencemanifesto.org)
67 points by alexmolas on Sept 17, 2023 | 51 comments



I've been a data scientist for over a decade and have delivered on a wide range of high-impact projects and products. I don't recognize this as meaningful or helpful to my field. More of a me-too from an inexperienced perspective.

The bulleted list at the top seems to be a random collection of irrelevant, almost unintelligible thoughts, stuffed cargo cult-style into the template of the agile manifesto. What does "APIs over databases" mean? It sounds like "oranges over apples" to me.

The numbered list of principles is better, but still not always that helpful. There's a strange spirit of perfectionism and inflexibility, e.g. in #3 and #5. Maybe some data scientists have time to automate everything they do, but I don't think it's a good general principle. #4, 6, and 7 are better. But overall there is a disorganized, random, unmotivated feel to both lists.

Perhaps something better could be developed, but for now, I think data scientists looking for a manifesto would be best served by going back to the original agile manifesto (https://agilemanifesto.org/) and reflecting on how it applies to our field. Which it doesn't, at least not universally. But it didn't apply universally to software engineering either. Just replace "software" with "data analysis" or "predictive analytics" or what have you and it all carries over pretty well.


Thanks. I was about to ask what "APIs over databases" meant. I design DBs and write APIs and I was struggling to imagine a scenario in which a data consumer - or I as a consumer - wouldn't want to understand both. I'm not even sure to whom they're interchangeable terms. Possibly to a front end coder who doesn't want to craft queries?

I don't call myself a data scientist or even a software engineer (though I'd be well within my rights to, under the current understanding of those descriptions). I'm a coder who writes a lot of business logic and analyzes results. But I have noticed a creeping tendency recently for people who futz around with large data dashboards or LLM conversational strategies to try to distance themselves from the people who write code, as if somehow we who actually think through the logistics of performance and structure are just road workers while they're driving the shiny Tesla of Science, coming to the Big Answers. It's ridiculous, because anyone who's built a NN or a large DB knows how to query one, and is usually moderately bored by the results.


I don't know what the author thought, but to me it sounds like adding a layer for the sake of adding layers.

E.g., your data pipeline shouldn't be doing a SQL query. It should be asking an API to do the query for you.


Which is funny, because SQL is an API. Security notwithstanding, it's often better than whatever API you can throw in front of it for the sake of "clean architecture" or whatever.


Hell, you could argue that SQL directly is clean architecture, and adding APIs is making it less clean and more complicated.
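To make the "SQL is an API" point concrete, here is a rough sketch using Python's built-in sqlite3 module. The `orders` table and the `total_by_customer` wrapper are made up for illustration and are not from the manifesto. Both paths run the same query; the wrapper only hides it.

```python
import sqlite3

# Hypothetical in-memory table, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)],
)

QUERY = "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"

# Option 1: the pipeline talks SQL directly.
direct = conn.execute(QUERY).fetchall()


# Option 2: the same query behind a hypothetical "API" function.
def total_by_customer(connection: sqlite3.Connection) -> list:
    """Run the same aggregate query; callers no longer see any SQL."""
    return connection.execute(QUERY).fetchall()


via_api = total_by_customer(conn)
print(direct == via_api)
```

Whether the extra function earns its keep depends on things the sketch leaves out, such as access control and schema versioning.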


I’m a biochemist and software engineer, and FWIW I agree with you. This article, in its current form, is nonsense.


Yeah, I'd expand on the "API over DBs" based on my experience. I could make it nest much deeper: DBs over Excel spreadsheets, and Excel spreadsheets over PDFs or scraped HTMLs, yada, yada, yada. But before I waste any time creating an API, my client wants to know if the project is even viable.


I'm in the same boat (in terms of both field and experience) and I find the original post bordering on absolutely nonsensical gibberish. It's like a technical version of an astrology column.


Builds on https://agilemanifesto.org/, maybe overly so.

I disagree with some of these points:

- Minimal Viable Products over prototypes: depends on use case. Prototypes within a timebox evaluation are helpful and don't have the overhead of delivery. Maybe better: prototypes during discovery

- APIs over databases: nah, use both.

- Clever use of computation over convenient assumptions : if the assumptions are well founded, calibrated from external references, etc., then no issue. For example, you don't need to perform raw research to understand how many joules heat water by a degree Celsius

- Dashboards over reports: depends on the use case. Dashboards generally limit user choice.

- Validation, scrutiny and repeatability over convention and ad verecundiam: Reasonable (though "argument from authority" is the more common name for ad verecundiam).


I lean towards your viewpoint as well. Their assumptions (axioms, postulates?) are highly controversial, while the actual principles seem quite sound to me.

The only issue I can see is with #5. I would argue that for decision making, you absolutely need a single metric, otherwise the process collapses into bickering over which measure is more important at the time (often for political or interpersonal reasons). The point is a bit vague on what exactly is being evaluated (product quality, which means what?). For launching products or running A/B tests, aim for a single metric as your decision framework. If you must have more than one, then be explicit about the tradeoffs in a flowchart: e.g., "if X > 0, we launch. If X <= 0 but Y > 2%, we launch. Otherwise, no launch".
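That flowchart is small enough to write down as code. A minimal sketch, assuming X is the measured lift on the primary metric and Y is the lift on a secondary metric expressed as a fraction (so 2% is 0.02); the function and parameter names are made up for illustration.

```python
def should_launch(x_lift: float, y_lift: float, y_threshold: float = 0.02) -> bool:
    """Explicit launch rule: the primary metric decides first, the secondary is a fallback."""
    if x_lift > 0:
        # Primary metric improved: launch regardless of the secondary metric.
        return True
    # Primary metric flat or negative: launch only if the secondary clears its threshold.
    return y_lift > y_threshold


print(should_launch(x_lift=0.01, y_lift=0.0))
print(should_launch(x_lift=-0.01, y_lift=0.03))
print(should_launch(x_lift=-0.01, y_lift=0.0))
```

Writing the rule down before the test runs is the point: nobody gets to argue afterwards about which metric mattered.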


I don’t see any value in these principles.

Minimal Viable Products over prototypes

What’s the difference?

APIs over databases

I feel like this is a terrible decision that has nothing to do with data science.

Clever use of computation over convenient assumptions

Why not both?

Dashboards over reports

Dislike this as well. Dashboards don’t contain analysis.

Validation, scrutiny and repeatability over convention and ad verecundiam

Sure, but this is not controversial


> Why not both?

A financial example would be that some models are designed to be particularly tractable, when those in the know could instead use a tool like autodiff to make a more sophisticated model just as usable at the same scale.


Who knows if the way you interpreted these 6 random words is even in line with what the author had in mind though? It so needs fleshing out that it borders on meaningless.


Also, I think the API thing, if done right, is the right way. Databases can become an unversioned blob far too quickly.

I don't think raw APIs are really it either, maybe a persistent airflow-y DAG type thing that's secretly backed by a proper database?


I was about to comment about how terrible this entire thing is, but then saw that literally every comment here is a negative take on the article.

It would be nice if someone said something at least semi-positive, so here's my attempt!

I think, although it misses the mark, something like this is really needed. I've worked in the data science field for about 5 years now, and see the same problems at every organisation to varying degrees:

- data scientists struggle to balance long term research efforts with immediate value propositions

- communicating the business value of data science tasks (such as forecasting) often means struggling to be both rigorous and understandable to non-data-scientists

- it's hard to build data science teams with both good statistical foundations and good software engineering practices, and both of these are table stakes for delivery of anything meaningful.

I think as an industry, data science is getting there slowly. Part of the problem is that tools are touted as the answer much more than practice. So, although I don't agree with all the sentiments here, I think we need more thought, like this, on the principles behind success for data teams.


My take is that it is attempting a similar “intervention” as the Agile Manifesto did, but less well thought through or phrased.

The data science/engineering domain often appears to produce outcomes like a good deal of 1990s software projects: projects that don’t produce good or expected results [0].

I posted a counterpoint recently [1] that said we should stop approaching data engineering like software engineering. Given what appears to be an almost endless number of data projects that produce little to no value to businesses, what should data professionals do to address the problem? This manifesto at least attempts to make a statement.

[0] I contributed to many of these at the beginning of my career!

[1] https://betterprogramming.pub/data-engineering-is-not-softwa...


In your article, your points about managing stateful data operations vs stateless tools are right on point. It's the eternal and ever-repeating argument that has to be made in every data shop I've ever been in when inevitably a manager or senior person suggests moving to scrum, because of this propagated lie that data engineers are just like software engineers. Glad to see it in writing.


Data engineering isn't software engineering, but it needs better software tools big time.


They have an interesting links page: http://datasciencemanifesto.org/links/

This one seems like a better manifesto actually: https://statisticsblog.com/manifesto/

APIs over Databases links to Martin Fowler's microservices https://martinfowler.com/articles/microservices.html


Holy smokes that is sending me down a rabbit hole haha. Really liked this one: https://www.gamedeveloper.com/programming/in-depth-functiona...


My god that second link is amazing. Brilliant.

> Morality needs probability.

Have never seen this stated so well and so explicitly!


> 2. Data science is about solving problems, not models or algorithms.

I don't like this. How about:

You may not have a problem. If you have a problem, you may not be able to detect it. If you detect it, it may be a false positive. Your intervention may make things worse. You might not detect that you made things worse. The whole operation may have been a statistical illusion made up of small sample size, insufficient blinding, and insufficient control.
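To put a number on the false-positive part, here's a rough simulation sketch. It assumes two groups drawn from the same normal distribution with known unit variance and a two-sided z-test at the usual 1.96 cutoff; the function name, sample size, and trial count are arbitrary choices for illustration. Since there is no real difference between the groups, every "detection" is an illusion.

```python
import random
import statistics


def aa_false_positive_rate(trials: int = 2000, n: int = 20, seed: int = 0) -> float:
    """Fraction of A/A comparisons (no true effect) that a z-test flags as significant."""
    rng = random.Random(seed)
    standard_error = (2 / n) ** 0.5  # difference of two means, unit variance assumed
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        z = (statistics.fmean(a) - statistics.fmean(b)) / standard_error
        if abs(z) > 1.96:
            hits += 1
    return hits / trials


rate = aa_false_positive_rate()
print(f"flagged {rate:.1%} of comparisons with no real effect")
```

By construction the rate should land near 5%, which is exactly the problem if you run enough checks and only report the ones that fire.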


I think it needs a more memorable intro. It's a manifesto! It should express more emotion if you are issuing proclamations like this otherwise you are just giving bullet points.


I think pretty much all manifestos, especially in software, are worse than useless and do more harm than good.

Can anyone point to a counter example of a helpful manifesto?


The agile manifesto was excellent. The authors could not foretell that the MBAs would co-opt and neuter the ideas.


I've greatly enjoyed Agile, personally, though have heard and read about horror stories


is there a list anywhere of other manifestos?

i am not sure if it counts, but i always found https://12factor.net/ really helpful.


> Clever use of computation over convenient assumptions

What does that mean?


Nothing


> Validation, scrutiny and repeatability over convention and ad verecundiam

Does repeatability mean you can't change your product based on the outcome of an A/B test? Because you probably won't get a chance to repeat those conditions.


> Clever use of computation over convenient assumptions

Making falsifiable assumptions is legit. I don't know what "clever use of computation" is supposed to mean.


Every time I hear "Data Scientist" my skin crawls.

Why is "statistician" not a good enough title? That is what "Data Science" is, statistics.


People invent new words to avoid the baggage of old words all the time. Statistics has a lot of old associations and negative connotations; data science sounded fresh.


The Eat Fresh Refresh.


The title "data scientist" was massively overloaded over the past decade, so while it might mean "statistician" on one team, it might mean business analyst, visualization designer, data engineer, or a number of other things that are outside the core skillset of a statistician on another team.


I’ve always felt that “data science” is for people who want to do the job of a statistician but who don’t know what a statistician does haha


I don’t think I’ve ever seen a job listing for a “statistician” outside government roles.


That's quite likely seeing how the term originally designated "the analysis of data about the state" [1].

[1]: https://en.m.wiktionary.org/wiki/statistics


The only one of these that I think is even remotely valid is:

> Validation, scrutiny and repeatability over convention and ad verecundiam

The rest I just flat-out disagree with. This sort of feels like a Project Manager's Data Science Manifesto. Which...is fine if it's titled as such. Otherwise...no thanks.


Wonder if anyone has ever heard of a Mathematician's Manifesto… if not, why not?


Why API over database if compute is cheap?


From now on you are not doing the real data science. But you can hire me to tell you how the real data science is done.


Add 2020 to title


I'm not trying to be a dick but I'm having a hard time relating this to data science or agile and then specific mentions of things like APIs over Databases. What's the goal of this?


It reminds me of https://en.m.wikipedia.org/wiki/Financial_Modelers%27_Manife...

Does it not make sense? A lot of people doing data "science" are spraying bullshit out through their teeth, unknowingly even (there are too many people in the field). Far too often the dominant approach seems to be jamming a model onto unfamiliar data.


Once you become the "dashboard guy" at your organization, it's game over - that is all you'll be doing - there is literally a never-ending demand for dashboards in any org that works with lots of data, and those dashboards will grow into full-blown apps, sooner or later.


Is "game over" good or bad in this context? Maybe good for job security, bad for career?


good for job security, but you'll get bogged down with creating and maintaining/updating dashboards.


Your tone implies that you think that is a bad thing. But if you have become the dashboard guy, you must have been good at it... which typically only happens if you enjoy it. And being "bogged down" in stable work that you enjoy is most people's goal.


It depends on what you want to do. If you're a regular analyst or data scientist, working on reports, dashboards, tooling, R&D, etc. - most in those roles (me included) have many balls in the air.

As far as the technical hierarchy goes, these organizations are usually somewhat like this:

1. Excel (everyone knows Excel)

2. PowerBI / Tableau / etc. (few know these, because they are either a bit more complex than Excel, or a bit simpler than R/Python/Julia/etc. + your favorite viz and analysis packages. The people that only use Excel don't bother learning them, and the guys who only program won't either)

3. Programming (some know these things)

If you love working with either 1) or 3), then becoming the Tableau/Power BI expert can be a drag.


An HTTPS site would be advisable. Check out Let's Encrypt.



