
would you accept “very badly”?

a colleague of mine attempted to share a Dropbox link to a git repo and working directory he had helpfully zipped up, including 4 datasets.

So in order to get the 50 lines of code I was meant to merge in, he thought it was reasonable to have me download 4 GB.

I told him no.




It's good to hear that people are reaching for these sorts of things, which are obvious to most devs (this is something I did once early in my career, with a container, and was told "never do this again") but maybe not so obvious to ML devs.


It comes down to one central point: the product is not simply software that processes data; the product is a product of the data.

Imagine writing a mobile application that stops working if the mood of the user changes. How much of a headache would you have developing, deploying, and maintaining that kind of app?

Concrete example: you work on a churn problem. You're good and you have support, so you get the data fast and produce a great model. But that model is perishable: the market changes, and the model you trained on the data your client gave you goes stale and loses its predictive power. In the simplest scenario, you must get fresh data and do it all over again: training, deployment, etc.

One other difference: in normal software development, your stack is pretty much set, and you spend most of your time using it to develop, test, and push. In ML, a lot of the effort is in exploration. You want to try a new paper or a new algorithm, but that algorithm is only implemented in one library and not another, and that library conflicts with yet another one. You want to try as many combinations as possible. This doesn't really happen in standard software development, where components change relatively slowly.

There's also the data problem. Unless you're doing Kaggle competitions, you don't get clean JSON or CSV in projects. In most cases, you get whatever the client has: emails, PowerPoints, files, archives, audio, video, esoteric third-party systems you have to interface with without vendor support. There's no API to tap into, and there isn't a single source of data you can build one interface for and call it a day. Hence a lot of custom code to process it all.

There are many problems like these. We spend time with applicants who do competitions and imagine that the job is building models, explaining to them that we're not there yet.


Hey it's slightly off topic, but if you work with Datasets, a struggle I often have is sharing them, so I built a service for this that you can find here: https://Joule.Host


I recently had to work on a git project that was created by taking a zip file of an existing git repo from another team. A junior engineer, unfamiliar with how git works, stomped on and corrupted the existing git metadata, then committed the codebase to a new Bitbucket-managed repo sans the 7+ years of commit history from the other team. Fun times.


so, did you use a haddock or a skate when you slapped them in the face?


It didn't immediately occur to me to try to preserve the history, tbh. As I would find out, the codebase had so many other problems that I doubt the history was worth anything at all. New job and all, I just didn't realize I'd been thrown into a boat that was taking on water until it was too late. Managing to get out intact, but wow, what a hilarious project.



