The code written by a couple of "data scientists" I was working with is the worst code I have ever seen. They don't care; they just want experimental results. The problem starts when their experimental "code" needs to be used in production or they are asked to describe how it works. Why cannot we just get good programmers and train them as data scientists?
I've worked with a few people who are software engineers -> data scientists. They were great at bringing good coding / database practices into the team. That said, their lack of formal statistics training was definitely a problem from time to time, and they seemed to show as much disregard for it as data scientists have for engineering practices (at least the ones talked about on this thread). This can be equally damaging.
Dealing with code (including tests and versioning) is on the same level as knowing basic math notation. It should be embarrassing not to apply these practices.
The only problem is that these good coding practices aren't that exact, and tend to go on and on all the way to infinity.
I agree. Many (most?) data scientists spend most of their time coding; it's ridiculous how little time they're willing to spend honing this skill.
Data science practices are similarly inexact: a lot of good decision making comes from experience, knowing when to apply each tool to a specific problem, when to just throw in a hack, etc.
> Why cannot we just get good programmers and train them as data scientists?
For the same reason we can't just get good programmers and train them in biology or chemistry or structural engineering. Sure, they exist, as do data scientists who are really good programmers; it's just that they're rarer and in very high demand.
Often much easier to find a domain expert and a programmer and have the programmer rework the code done by the domain expert. In fact that used to be my job for a while (working with physicists), and it was actually quite fun.
> Often much easier to find a domain expert and a programmer and have the programmer rework the code done by the domain expert. In fact that used to be my job for a while (working with physicists), and it was actually quite fun.
It's basically what people are now calling 'Research Software Engineers'
Sounds like this is a process issue! E.g., why data scientists and software engineers are both needed, and how they can work together to produce quality code that in turn produces quality data analysis.
I work in data science (came in via a maths background) but I agree 1000%. My software engineering is pretty much self-taught, but I make an effort to follow best coding practices at all times.
That said, I think it goes both ways: people from the sciences tend to be cavalier coders, and people from a software background tend to be cavalier about the underlying mathematics.
Seems to me that the solution needs to be a stronger culture of both scientific and software engineering rigor.
So there is hope for you :) You care, so you will learn it. There is nothing wrong with not knowing something.
My main point is that too many people in data science don't care at all. They don't care about repeatable results, about code quality, or even about units (they will use `mb`, `Mb` and `MB` for megabytes in the same document).
I hope companies will learn that if you have code, it really matters that it is good-quality code, not a randomly gathered set of lines.
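To make the units point concrete, here is a minimal sketch of why those three spellings are not interchangeable. It assumes the usual convention (SI prefixes `m` = milli and `M` = mega, `b` = bit, `B` = byte); the names `UNIT_TO_BITS`, `to_bits` and `transfer_seconds` are made up for illustration and don't come from any library.

```python
BITS_PER_BYTE = 8

# Multipliers to bits, assuming SI prefixes (m = milli, M = mega)
# and the common convention b = bit, B = byte.
UNIT_TO_BITS = {
    "mb": 1e-3,                  # millibit: almost certainly a typo in practice
    "Mb": 1e6,                   # megabit
    "MB": 1e6 * BITS_PER_BYTE,   # megabyte
}


def to_bits(value: float, unit: str) -> float:
    """Convert a value in the given unit to bits; reject unknown spellings."""
    if unit not in UNIT_TO_BITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * UNIT_TO_BITS[unit]


def transfer_seconds(size: float, size_unit: str,
                     rate: float, rate_unit: str) -> float:
    """Time to move `size` at `rate` per second, with explicit units on both."""
    return to_bits(size, size_unit) / to_bits(rate, rate_unit)


if __name__ == "__main__":
    # A 100 MB file over a 100 Mb/s link: mixing up the units is off by 8x.
    print(transfer_seconds(100, "MB", 100, "Mb"))
```

Reading `MB` as `Mb` here changes the answer by a factor of eight, and `mb` is off by nine orders of magnitude, which is why the spelling shouldn't drift within one document.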