I've been a data scientist for over a decade and have delivered on a wide range of high-impact projects and products. I don't recognize this as meaningful or helpful to my field. More of a me-too from an inexperienced perspective.
The bulleted list at the top seems to be a random collection of irrelevant, almost unintelligible thoughts, stuffed cargo cult-style into the template of the agile manifesto. What does "APIs over databases" mean? It sounds like "oranges over apples" to me.
The numbered list of principles is better, but still not always that helpful. There's a strange spirit of perfectionism and inflexibility, e.g. in #3 and #5. Maybe some data scientists have time to automate everything they do, but I don't think it's a good general principle. #4, 6, and 7 are better. But overall there is a disorganized, random, unmotivated feel to both lists.
Perhaps something better could be developed, but for now, I think data scientists looking for a manifesto would be best served by going back to the original agile manifesto (https://agilemanifesto.org/) and reflecting on how it applies to our field. Which it doesn't, at least not universally. But it didn't apply universally to software engineering either. Just replace "software" with "data analysis" or "predictive analytics" or what have you and it all carries over pretty well.
Thanks. I was about to ask what "APIs over databases" meant. I design DBs and write APIs and I was struggling to imagine a scenario in which a data consumer - or I as a consumer - wouldn't want to understand both. I'm not even sure to whom they're interchangeable terms. Possibly to a front end coder who doesn't want to craft queries?
I don't call myself a data scientist or even a software engineer (though I'd be well within my rights to, under the current understanding of those descriptions). I'm a coder who writes a lot of business logic and analyzes results. But I have noticed a creeping tendency recently for people who futz around with large data dashboards or LLM conversational strategies to try to distance themselves from the people who write code, as if somehow we who actually think through the logistics of performance and structure are just road workers while they're driving the shiny Tesla of Science, coming to the Big Answers. It's ridiculous, because anyone who's built a NN or a large DB knows how to query one, and is usually moderately bored by the results.
Which is funny, because SQL is an API. Security notwithstanding, it's often better than whatever API you can throw in front of it for the sake of "clean architecture" or whatever.
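To put it concretely: a parameterized query is already a perfectly serviceable interface, and the wrapper often just re-exposes it with extra steps. A sketch (the table and numbers are invented for illustration):

```python
# The SQL interface itself, via Python's stdlib sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EU", 9.5), (2, "US", 12.0), (3, "EU", 3.25)])

# The "API call": a parameterized query is the contract.
def revenue_by_region(region: str) -> float:
    row = conn.execute(
        "SELECT COALESCE(SUM(total), 0) FROM orders WHERE region = ?",
        (region,),
    ).fetchone()
    return row[0]

print(revenue_by_region("EU"))  # 12.75
```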
Yeah, I'd expand on the "APIs over DBs" point based on my experience. I could make it nest much deeper: DBs over Excel spreadsheets, and Excel spreadsheets over PDFs or scraped HTML, yada, yada, yada.
But before I waste any time creating an API, my client wants to know if the project is even viable.
I'm in the same boat (in terms of both field and experience) and I find the original post bordering on absolutely nonsensical gibberish. It's like a technical version of an astrology column.
- Minimal Viable Products over prototypes: depends on the use case. Prototypes within a timeboxed evaluation are helpful and don't have the overhead of delivery. Maybe better: prototypes during discovery.
- APIs over databases: nah, use both.
- Clever use of computation over convenient assumptions: if the assumptions are well founded, calibrated from external references, etc., then there's no issue. For example, you don't need to do original research to know how many joules it takes to heat water by a degree Celsius (see the sketch after this list).
- Dashboards over reports: depends on the use case. Dashboards generally limit user choice.
- Validation, scrutiny and repeatability over convention and ad verecundiam: Reasonable (though "argument from authority" is the more common name than ad verecundiam).
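To make the water example concrete, here's a back-of-the-envelope sketch, assuming water's textbook specific heat of ~4.186 J/(g·°C):

```python
# Specific heat of water: a calibrated external reference, not something
# you need to re-derive with "raw research".
SPECIFIC_HEAT_WATER = 4.186  # J/(g*C)

def joules_to_heat_water(mass_g: float, delta_c: float) -> float:
    """q = m * c * dT -- the convenient, well-founded assumption."""
    return mass_g * SPECIFIC_HEAT_WATER * delta_c

# Heating 1 litre (~1000 g) of water by 1 degree C takes ~4186 J.
print(joules_to_heat_water(1000, 1))  # 4186.0
```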
I lean towards your viewpoint as well. Their assumptions (axioms, postulates?) are highly controversial, while the actual principles seem quite sound to me.
The only issue I can see is with #5. I would argue that for decision making you absolutely need a single metric, otherwise the process collapses into bickering over which measure is more important at the time (often for political or interpersonal reasons). The point is also a bit vague on what exactly is being evaluated (product quality, which means what?). For launching products or running A/B tests, aim for a single metric as your decision framework. If you must have more than one, then be explicit about the tradeoffs in a flowchart: e.g., "if x > 0, we launch; if x <= 0 but y > 2%, we launch; otherwise no launch".
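Something like this sketch, where x, y, and the thresholds are placeholders (x and y might stand in for, say, revenue lift and engagement lift):

```python
# A hypothetical launch decision rule, written down explicitly so the
# tradeoff between the two metrics can't be re-litigated after the fact.
def should_launch(x: float, y: float) -> bool:
    if x > 0:
        return True   # primary metric improved: launch
    if y > 0.02:
        return True   # primary flat/negative, but secondary up >2%: launch
    return False      # otherwise: no launch

assert should_launch(x=0.5, y=0.0)
assert should_launch(x=-0.1, y=0.03)
assert not should_launch(x=-0.1, y=0.01)
```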
A financial example: some models are chosen mainly because they're analytically tractable, when those in the know could use a tool like autodiff to make a more sophisticated model just as usable at the same scale.
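A rough sketch of the idea, using JAX as the autodiff tool; the Black-Scholes call here just stands in for the "tractable" model, and all the parameters are made up:

```python
# Autodiff gives the sensitivity (delta) without a hand-derived formula.
# The same jax.grad call would work on a more sophisticated differentiable
# model with no closed-form greeks at all.
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def bs_call(spot, strike=100.0, rate=0.01, vol=0.2, tau=1.0):
    d1 = (jnp.log(spot / strike) + (rate + 0.5 * vol**2) * tau) / (vol * jnp.sqrt(tau))
    d2 = d1 - vol * jnp.sqrt(tau)
    return spot * norm.cdf(d1) - strike * jnp.exp(-rate * tau) * norm.cdf(d2)

delta = jax.grad(bs_call)(105.0)  # sensitivity to spot, derived automatically
print(delta)
```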
Who knows if the way you interpreted these 6 random words is even in line with what the author had in mind though? It so needs fleshing out that it borders on meaningless.
I was about to comment about how terrible this entire thing is, but then saw that literally every comment here is a negative take on the article.
It would be nice if someone said something at least semi-positive, so here's my attempt!
I think, although it misses the mark, something like this is really needed. I've worked in the data science field for about 5 years now, and see the same problems at every organisation to varying degrees:
- data scientists struggle to balance long term research efforts with immediate value propositions
- communicating the business value of data science tasks (such as forecasting) often involves a struggle between being rigorous and being understandable to non-data-scientists
- it's hard to build data science teams with both good statistical foundations and good software engineering practices, and both of these are table stakes for delivery of anything meaningful.
I think as an industry, data science is getting there slowly. Part of the problem is that tools are touted as the answer much more than practice. So, although I don't agree with all the sentiments here, I think we need more thought, like this, on the principles behind success for data teams.
My take is that it is attempting a similar “intervention” as the Agile Manifesto did, but less well thought through or phrased.
The data science/engineering domain appears to often produce outcomes like a good deal of 1990s software projects - projects that don't produce good or expected results [0].
I posted a counterpoint recently [1] that said we should stop approaching data engineering like software engineering. Given what appears to be an almost endless number of data projects that produce little to no value to businesses, what should data professionals do to address the problem? This manifesto at least attempts to make a statement.
[0] I contributed to many of these at the beginning of my career!
In your article, your points about managing stateful data operations vs stateless tools is right on point. It's the eternal and ever-repeating argument that has to be made in every data shop I've ever been in when inevitably a manager or senior person suggests moving to scrum, because of this propagated lie that data engineers are just like software engineers. Glad to see it in writing.
> 2. Data science is about solving problems, not models or algorithms.
I don't like this. How about:
You may not have a problem. If you have a problem, you may not be able to detect it. If you detect it, it may be a false positive. Your intervention may make things worse. You might not detect that you made things worse. The whole operation may have been a statistical illusion made up of small sample size, insufficient blinding, and insufficient control.
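That last failure mode is easy to demonstrate with a toy simulation (sample sizes and the threshold are arbitrary choices, not from the article): both groups are identical by construction, yet small samples plus a p < 0.05 cutoff still manufacture "effects".

```python
# No-effect A/B test, repeated many times with a tiny sample:
# both groups are drawn from the same distribution, yet ~5% of runs
# will look "significant" at p < 0.05 purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_per_group = 10_000, 10
false_positives = 0
for _ in range(n_trials):
    a = rng.normal(size=n_per_group)  # "control"
    b = rng.normal(size=n_per_group)  # "intervention" (identical distribution)
    _, p = stats.ttest_ind(a, b)
    false_positives += p < 0.05

print(false_positives / n_trials)  # ~0.05: illusory "effects" from noise alone
```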
I think it needs a more memorable intro. It's a manifesto! It should express more emotion if you are issuing proclamations like this otherwise you are just giving bullet points.
> Validation, scrutiny and repeatability over convention and ad verecundiam
Does repeatability mean you can't change your product based on the outcome of an A/B test? Because you probably won't get a chance to repeat those conditions.
People invent new words to avoid the baggage of old words all the time. Statistics has a lot of old associations and negative connotations, data science sounded fresh.
The title "data scientist" was massively overloaded over the past decade, so while it might mean "statistician" on one team, it might mean business analyst, visualization designer, data engineer, or a number of other things that are outside the core skillset of a statistician on another team.
The only one of these that I think is even remotely valid is:
> Validation, scrutiny and repeatability over convention and ad verecundiam
The rest I just flat-out disagree with. This sort of feels like a Project Manager's Data Science Manifesto. Which...is fine if it's titled as such. Otherwise...no thanks.
I'm not trying to be a dick, but I'm having a hard time relating this to data science or agile, or to the specific mentions of things like APIs over databases. What's the goal of this?
Does it not make sense? A lot of people doing data "science" are spraying bullshit out through their teeth, unknowingly even (there are too many people in the field). Far too often the dominant approach seems to be jamming a model onto unfamiliar data.
Once you become the "dashboard guy" at your organization, it's game over - that is all you'll be doing - there is literally never-ending demand for dashboards in any org that works with lots of data, and those dashboards will grow into full-blown apps, sooner or later.
Your tone implies that you think that is a bad thing. But if you have become the dashboard guy, you must have been good at it... which typically only happens if you enjoy it. And being "bogged down" in stable work that you enjoy is most people's goal.
It depends on what you want to do. If you're a regular analyst or data scientist, working on reports, dashboards, tooling, R&D, etc. - most in those roles (me included) have many balls in the air.
As far as the technical hierarchy goes, these organizations are usually somewhat like this:
1. Excel (everyone knows excel)
2. PowerBI / Tableau / etc. (few know these, because they are either a bit more complex than Excel, or simpler than R/Python/Julia/etc. plus your favorite viz and analysis packages. The people who only use Excel don't bother learning them, and the people who only program won't either)
3. Programming (some know these things)
If you love working with either 1) or 3), then becoming the Tableau/Power BI expert can be a drag.