Surviving Data Science at the Speed of Hype (john-foreman.com)
224 points by mistermcgruff on Jan 30, 2015 | 43 comments



I'm a data scientist who works with companies on their analytics problems every day. This article is spot on.

By far the biggest factor influencing the success of an analytics project is whether the company has a human who has the time and inclination to think and reason about the business. They figure out which questions are important to ask and then go look at the data to see what they find. Collecting the data is the easy part. There is no analytics product that asks & answers your most important business questions for you.

I enjoyed the jab at predictive modeling; it's almost comical how many companies dream about predictive when they haven't yet got basic tracking in place for what's _already_ happening in their business.

Love the post, thanks for sharing.


Exactly - the human with domain knowledge is vital. I get scared when I see people trump up black boxes. Black boxes don't help with "Which questions should we be asking?" and "What are the missing variables?"


Domain knowledge is also really useful for spotting bugs. I recently worked on a project where I had very little domain knowledge. So anyway I wrote my code, ran my tests, crunched the data, double-checked that all the results seemed reasonable, produced the pretty pictures, and everything looked spot on. However, once I started showing the results to a domain expert, it took him 30 seconds to point to one of the outputs and go "that's impossible, you have a bug in your code". Sure enough I did. As a generalist the results looked fine to me (right size, seemingly reasonable relationship to surrounding values, etc.), but to a domain expert the error stuck out like a sore thumb.


True. The ability to sanity check is very important.


Not doing data analytics, but selling software that does forecasting with a model that we build and calibrate. We have fairly good performance: recalibrating the same model takes a few seconds, but building the model or changing it is never quick.

The effect of these marketing campaigns on would-be clients is terrible. They start going after crazy crackpot solutions to gain revenue while they haven't addressed the simplest, easy-to-reach, low-risk revenue gains. In a lot of cases integrating complex side-effect data costs a lot and provides only marginal revenue gains.


Good article. The author is completely correct that people often underestimate the fragility of predictive models, and that summary analyses (which I group under a more general concept I call "insights") are simpler and more robust. I think the article is a little harsh towards predictive models, though.

The primary difference between a model and an insight is that insights require a human to process - anything more automatic is a model. Insights are easy to implement and are great for finding patterns and anomalies (the human mind is basically designed to pick these out). But the human element makes insights less scalable with significantly higher latency. For some problems these are unacceptable tradeoffs, and this has little to do with how stable a company's environment is. It's purely a product/strategy question, and about understanding all the tradeoffs.


Good to know the model/insight difference. You have drawn a clearer line for me; I always struggled to think it through, even though I was quite aware that they are quite different.


I once worked at a major big box retailer where somebody came up with a visualization that purported to show, for a given product category, purchases made in other categories. One surprising purchase correlation was customers bought TV stands after buying DVD players. So, this nugget was trumpeted at countless meetings about the value of big data analytics. Multiple marketing campaigns were designed around this discovery.

Of course, that made no sense, so I checked a little deeper. You know what else people also buy when they buy DVD players? TVs. The DVD/furniture relationship was an artifact of the high degree of correlation between TVs and DVD players, which the visualization tool failed to account for.
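
To make the mechanism concrete, here's a minimal sketch with entirely made-up purchase probabilities: a shared driver (the TV purchase) manufactures a DVD-player/TV-stand association that largely disappears once you condition on it.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 100_000

    # Hypothetical purchase data: a TV purchase drives both DVD-player and
    # TV-stand purchases; there is no direct DVD-player -> TV-stand effect.
    tv = rng.random(n) < 0.20
    dvd = rng.random(n) < np.where(tv, 0.60, 0.05)
    stand = rng.random(n) < np.where(tv, 0.50, 0.02)
    df = pd.DataFrame({"tv": tv, "dvd": dvd, "stand": stand})

    # Naive view: buying a DVD player "predicts" buying a TV stand.
    print("P(stand)       =", round(df["stand"].mean(), 3))
    print("P(stand | dvd) =", round(df.loc[df["dvd"], "stand"].mean(), 3))

    # Condition on the confounder: within TV buyers (and within non-buyers)
    # the DVD player adds almost nothing.
    for bought_tv, grp in df.groupby("tv"):
        p_with = grp.loc[grp["dvd"], "stand"].mean()
        p_without = grp.loc[~grp["dvd"], "stand"].mean()
        print(f"tv={bought_tv}: P(stand | dvd)={p_with:.3f}, P(stand | no dvd)={p_without:.3f}")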

I brought this up immediately, but received a tepid response. Of course, months later, I was still hearing about DVD players and furniture. It had become part of the institutional lore, and no facts were going to replace that.


> One surprising purchase correlation was customers bought TV stands after buying DVD players. So, this nugget was trumpeted at countless meetings about the value of big data analytics. Multiple marketing campaigns were designed around this discovery.

> Of course, that made no sense, so I checked a little deeper.

And there was me thinking it made sense, because I'd done the same thing. A TV can stand on many surfaces, but once you've got a DVD player (some thin, wide rectangle) it makes a lot more sense to get a TV cabinet to put the DVD player in.

Perhaps not so much sense now, but 8-9 years ago I went through this logic.


This is absolutely hilarious. One of the things I talk about in my presentations is that the hardest part of "data analytics" and especially "advanced visualization" is that it's hard to know when you're sucking. It's actually pretty easy to come up with some basic interesting things, but then what do you benchmark against? If you don't do the hard work to evaluate significance, you can start convincing yourself that you're more insightful than you actually are.
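
One cheap way to benchmark against chance is a permutation test: shuffle one series and see how often randomness alone produces a correlation as strong as the one you found. A rough sketch, with made-up data:

    import numpy as np

    def permutation_pvalue(x, y, n_perm=10_000, seed=0):
        """How often does shuffling one series produce a correlation at least as strong?"""
        rng = np.random.default_rng(seed)
        observed = abs(np.corrcoef(x, y)[0, 1])
        hits = sum(
            abs(np.corrcoef(rng.permutation(x), y)[0, 1]) >= observed
            for _ in range(n_perm)
        )
        return hits / n_perm

    # Made-up example: a weak but real relationship between two metrics.
    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = 0.1 * x + rng.normal(size=200)
    print("permutation p-value:", permutation_pvalue(x, y))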

What the business people don't appreciate is that the ML models don't know they're looking at DVD players and TV stands. They just know that vector elements 27 and 291 have the strongest correlation. It takes a human in the loop to say, "item 291 is technically just TV Stands, but we've done dimensional reduction to pool TVs, TV Stands, and Projectors all into cluster 14, which then correlates to cluster 12, consisting of DVD players and Xboxes".


Whenever you say "that made no sense", I think that you are using too much bias and not giving enough credit to what the data is telling you.

If you look at the most "controversial" data science paper from 2013, where a study correlated intelligence with Liking the Facebook pages "Curly Fries" and "Thunderstorms" (here is a summary: http://www.wired.com/2013/03/facebook-like-research/), there were a lot of people saying that there was no causation, that the correlation was unfounded, etc.

Of course, you would say the study "makes no sense". Intelligence can't be predicted by Facebook Likes. There is no correlation there, etc. But why not? If you read the paper (http://www.pnas.org/content/110/15/5802.full.pdf) their logic is sound. Are the marketing campaigns that the company bought based on the TV Stand<>DVD Player connection any different from other marketing campaigns? Facebook does all of its ad display based on similar data analysis, and it seems to be working for them.

Note: There is the not-so-hidden machine learning feedback loop now (explained better here: http://www.john-foreman.com/blog/the-perilous-world-of-machi...), where people Like the 'Curly Fries' and 'Thunderstorms' pages because of the research.


> Whenever you say "that made no sense", I think that you are using too much bias and not giving enough credit to what the data is telling you.

What? If a data scientist sees something that seems illogical, there is no reason not to investigate it and see if he/she can find a more logical explanation. Sure, if the effect seems real but unexplained, you can accept and use it, but advocating a kind of big-data mysticism ("don't investigate, just accept") seems to be buying into the senseless hype. And if you read the post, you'll notice the parent actually discovered the association was just an artifact of an easily explained correlation.

And, no, there's not much reason for companies to advertise just a TV stand and a DVD player. Common sense tells you what the data actually says: those two items, by themselves, aren't and weren't the insight many people were dreaming about.


How are association rules "big data analytics"?

The article is very refreshing and I bookmarked the site. What I am more frustrated with is that a lot of people use this stupid term "big data" for things which do not fit the description. If it's structured, it's not big data. If it comes at 2MB/s it's not big data. If it fucking fits in your RAM, it most certainly is not big data.


What on earth are you talking about?

(a) Association rules are big data when you are doing them on large data sets with many variables. I work at a company that sells tens of thousands of different products to tens of millions of customers. It definitely takes us a while to compute those rules (a toy sketch of the quantities involved is at the end of this comment).

(b) The majority of big data is structured. For most big data projects the data is typically stored in old-school Oracle/Teradata/etc. data warehouses and shipped into a Hadoop cluster. It may not be consolidated, but it is definitely structured.

(c) The total RAM of our Hadoop cluster is 4TB and ours is small. I would consider that to be big data in the sense that it overwhelms any applications that directly try to access the raw data.
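
To make (a) concrete, here is a toy, brute-force version of the quantities behind those rules (support, confidence, lift) on made-up baskets; at catalog scale you'd need a proper frequent-itemset algorithm rather than anything like this:

    from itertools import combinations
    from collections import Counter

    # Made-up baskets; a real catalog with tens of thousands of SKUs needs a
    # proper frequent-itemset algorithm (Apriori/FP-growth), not brute force.
    baskets = [
        {"tv", "dvd_player", "tv_stand"},
        {"tv", "dvd_player"},
        {"tv", "tv_stand"},
        {"dvd_player"},
        {"headphones"},
    ]

    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(frozenset(p) for b in baskets for p in combinations(sorted(b), 2))

    for pair, count in pair_counts.items():
        a, b = sorted(pair)
        support = count / n                      # P(a and b)
        confidence = count / item_counts[a]      # P(b | a)
        lift = confidence / (item_counts[b] / n) # P(b | a) / P(b)
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")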


You can stick 6TB in a single 4U Proliant from HP: http://www8.hp.com/us/en/products/servers/proliant-servers.h...

If you need a few PBs of spindle storage, hook that server up to a DDN or Panasas rack.


Very good post. Refreshing.

I think that the hype and buzzwords around Big Data and data science cause more than just bad business decisions. I believe they are also damaging the industry and creating a larger sense of disillusionment (I'm mostly thinking of "deep learning"). Not sure what this means for data science in the long term though, just thinking out loud.

I'll also add that I frequently see sledgehammers being used to hang a picture frame. By that I mean using huge clusters to run algorithms that would actually run in Tableau, Excel, etc.


I had a conversation with a 'big data consultant' some time back. He mentioned one of his clients needed to set up a Hadoop cluster and wanted me to work with him. I said, 'why do they need a cluster, they probably don't have that much data'. His response was 'if a client wants to jump off a building, you don't say don't do it, you ask them what floor'.


Firstly, someone needs to explain to me why smart people get worked up over vendor marketing. Since the beginning of time it has been about exaggerated claims and bold, specific numbers (e.g. "80% better"), and it always targets those who make purchasing decisions. Do people really expect them to say, "Hey, our product is great, but you know, you probably don't need it. But maybe buy it anyway?"

Secondly, the author seems to have conflated two different parts of the data science picture. Yes, great analysts who do amazing work are important. But that work relies on (a) having data available and (b) having it in the right format. For those of us doing significant-volume ingestions, this is not trivial. Hadoop is painfully slow, and overall the end-to-end data science tooling is slow, fragmented, and incomplete. Some of us do need vendors to be bold and come up with new technologies/approaches.

And the point about IBM is just stupid. Did you ever think that maybe Watson DID help them slow their sales losses? Weird that a data scientist would make predictions based on inadequate data.


You are 100% correct that data availability is always the first problem to solve. However, I think this is addressed indirectly in his thesis that advanced analytics are brittle in a rapidly changing business. Any change that breaks your data by definition breaks your models.


>Secondly, the author seems to have conflated two different parts of the data science picture. Yes, great analysts who do amazing work are important. But that work relies on (a) having data available and (b) having it in the right format. For those of us doing significant-volume ingestions, this is not trivial. Hadoop is painfully slow, and overall the end-to-end data science tooling is slow, fragmented, and incomplete. Some of us do need vendors to be bold and come up with new technologies/approaches.

I think you are doing Hadoop wrong, or confusing the current technical reality with "Hadoop". Hadoop is very cheap, and it allows all the data to be in one place. This is huge for large-scale data science, because in the past we had to pull data across networks, fiddle, sample, and chuck. The business case for single enterprise data warehouses was difficult to make (because of the cost), and maintaining them when a CIO with vision did make the case was impossible: it took about 10 minutes for some genius to start running a tactical operational system on it, followed (in about 10 minutes more) by a howling call of rage from an MD about why his operational system was locked up due to someone doing stupid queries, followed by a lockdown on queries in the warehouse.

If your Hadoop cluster is slow, then:

1) Move to CDH5 and use Spark; use Impala; upgrade to 40GbE throughout and make sure that you have balance in your architecture. For god's sake do not be telling people Hadoop is slow if you are using AWS.

2) Brew your own cluster with GPUs and the various crazy infrastructures supporting said architecture (good luck).

3) Go talk to an FPGA vendor or a supercomputer vendor and upgun (but you must be rich). Exalytics or Yark might work for you.

>And the point about IBM is just stupid. Did you ever think that maybe Watson DID help them slow their sales losses? Weird that a data scientist would make predictions based on inadequate data.

Every IBM rep I have met for the last 3 years has told me that Watson will deal with churn and provide better offer management. I have repeatedly tried to get POCs and always, always failed. Then we saw the Watson tools on Bluecloud, and all our suspicions of what Watson is and was were confirmed. Kudos to the Watson team: they spotted that Jeopardy questions can be rewritten as search queries, and that search responses can be rewritten as Jeopardy answers.

BTW, did anyone get far with DeepDive?


> If your hadoop cluster is slow...

You're right, there is a lot of misinformation and hope about Hadoop out there, and I think there is a lot of value in Hadoop as a cheap data integration archive. But I think the parent poster's point still stands. A Hadoop-based infrastructure currently has a lot of impedance mismatch for full end-to-end advanced analytics involving stats, linear algebra, or graph work built on native code rather than Java.

I would love to see a TCO analysis on Hadoop+analytics versus buying a more traditional "supercomputer" stack with infiniband or one of the nifty Cray/SGI NUMA systems. Current data warehouse and BI folks are fixated on cost per PB of storage, and Hadoop is very cheap based on that single metric. I suspect that if enough human factors and accuracy/agility of modeling results are considered, the latter may be quite cost effective. It's just that the "big iron" vendors are still in the middle of retooling their marketing for the BI/DW/ETL crowd. When they finally figure it out, it's going to be a bloodbath.

For instance, SGI UVs can give me 24TB-64TB of RAM in a single "system". I still have to make sure I do multithreading/multiprocessing well, but the interconnects are lower latency than 40GbE. https://www.sgi.com/products/servers/uv/

HP ProLiants now can fit 48-60 cores and 6TB in a single 4U system: http://www8.hp.com/us/en/products/servers/proliant-servers.h...

Buying a few of these scale-up systems is a LOT cheaper than hundreds of nodes of Hadoop sitting around maxing out I/O while their expensive Xeons have 10% CPU load. Especially given that you can hire anyone out of science/engineering grad school and they can program these scale-up systems, whereas writing a bunch of Java MR jobs for Hadoop is quite foreign to them.


I think that the disruptions are:

- Twill with everything (inc. unikernels) under YARN (or Mesos)

- The Machine (if it's real)

- Datacentre-scale integration (so things like 500 different processors in each U, which are powered up by the fabric manager to efficiently meet the workload at hand)

I think any vendor who wants to compete with the open-source/commodity world will need to do as well as, or better than, the above to get anywhere!

Programming MR by hand is done; I wrote MR in Java from 2008 to 2012 and never will again. It's RDDs, transformations, and actions now, and it's dead easy (MR is too, but the API wasn't)!
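
For anyone who hasn't made the jump, a minimal sketch of what that RDD style looks like in PySpark; the file path and column position are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(appName="product-counts")  # assumes a working Spark install

    # One chain of lazy transformations plus a single action replaces the old
    # mapper/reducer/driver boilerplate. Path and column position are placeholders.
    counts = (
        sc.textFile("hdfs:///data/transactions.csv")
          .map(lambda line: line.split(",")[1])   # hypothetical: column 1 = product id
          .map(lambda product: (product, 1))
          .reduceByKey(lambda a, b: a + b)        # still lazy
    )

    print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # action: triggers the job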


This is perhaps the first halfway sensible post on "big data" or "analytics" that I've seen hit the front page of HN in a long time.


Timely. I did a "big data" presentation yesterday and hoped to convey how important it was to read original source materials to form opinions and avoid the hype.

Since slide decks get busy I moved my bibliography of links to a gist. So, while it didn't factor into my presentation I've now added this blog post. :-)

https://gist.github.com/JayCuthrell/8bcd9597d37a8602c639


I just love the way this guy writes. His book, Data Smart, is hands down the most approachable intro to data science you could ever possibly read if you don't have the math background to dive into full-on textbooks. And it's hilarious too.


I have mixed feelings about that book. I enjoyed his writing style and humor, but the amount of beating on Excel he has to do to manipulate all that data hurts my head. I kept thinking of how much easier it would be to do with code.

Maybe it's just not a good book for developers? shrug I would love to have a copy of that book that doesn't use Excel.


I share your yearning for a code equivalent of the book. However, I think writing the book using only Excel was a smart move on his part, simply because:

1) Excel is "visual" in the sense that you can watch the data change as you tweak things. There is no command line or program to execute, it's all happening live

2) For programmers, there's no "well I'm a python guy and this book is written in Java so it's not for me." None of us as coders really depend on Excel for writing code (basically) so it's kind of a way to take the technology decisions out of the equation. It's just the techniques.

All that being said, it's not trivial to port the logic of a spreadsheet over to code, and I think if anything that would make a great followup book.


I agree with both your points. For #2, the only decent option may be to make the book more focused on R, instead of just chapter 10.


Write it!


It looks like an interesting book. Anyone have experience with the ebook version? Does it hold up with the illustrations, or should I hold out for a hardcopy?


I have the ebook version, it's fine if you're ok with ebooks for this sort of stuff.


Can you recommend some other books that are good introduction to analytics/data science? I often get asked for recommendations, but can't seem to find something suitable for beginners.

Dickrolls to be avoided if possible, but secretly kinda wishing...


I unfortunately don't know too many others. I have this book as well: http://www.amazon.com/Machine-Learning-Science-Algorithms-Se... and because I'm so far removed from any academic math study now in my life, by about page 3 I was lost.

My plan is to dive into linear algebra and statistics books and courses first before I head into ML again.


Doing Data Science is a pretty good intro. http://www.amazon.com/Doing-Data-Science-Straight-Frontline/...


More on the presentation of data, but I like Stephen Few's work...


First of all, John Foreman is great. Read his book "Data Smart" and http://analyticsmadeskeezy.com/blog/

(disclaimer: I am in no way tied to John Foreman. Also, I work at a company that provides a data processing/collaboration SaaS...for big data! http://www.treasuredata.com)

A quote from the OP:

>If your business is currently too chaotic to support a complex model, don't build one. Focus on providing solid, simple analysis until an opportunity arises that is revenue-important enough and stable enough to merit the type of investment a full-fledged data science modeling effort requires.

This is consistent with what we see in our customers. The use cases we see most for processing big data boil down to generating reports.

Generating reports may sound really prosaic, but as I learned from our customers, most organizations are very, very far from providing access to their data in a cogent, accessible manner. Just to generate reports/summaries/basic descriptive statistics, incredibly complex enterprise architectures have been proposed, built by a cadre of enterprise architects and deployed with obscenely high maintenance subscription fees billed by various vendors. That's the reality at many companies.
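
Once the data actually is in one clean, accessible place, a typical "report" is often little more than a sketch like this (the file and column names are hypothetical):

    import pandas as pd

    # Hypothetical export of order events; getting the data into one clean,
    # accessible table like this is usually the genuinely hard part.
    orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

    report = (
        orders
        .assign(month=orders["order_date"].dt.to_period("M"))
        .groupby(["month", "product_category"])
        .agg(orders=("order_id", "nunique"), revenue=("amount", "sum"))
        .reset_index()
    )
    print(report.to_string(index=False))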

As bad and confusing as the buzzword "big data" is, one good byproduct is that it has forced slow-moving enterprises to rethink their data collection/storage/management/reporting systems.

Finally, I am starting to see folks do meaningful predictive modelling on top of large-ish data (on the order of terabytes). Some of them are our customers at Treasure Data, some aren't, but they are definitely not "build[ing] a clustering algorithm that leverages storm and the Twitter API" but actually doing the hard work of thinking through how (or if) the data they collect is meaningful and useful.

And that's a good thing.


An important distinction is that the author's experience is mostly with the businessy side of data science, and his jab is at people who reach for buzzword tools that add complexity instead of simple solutions.

In defense of the hype, many tools like storm are worth their hype many times over when used for the right application.

The author makes this distinction, but it can easily be lost in the post.


There's a lot written in the credit-scoring space that I think other industries could look at, especially when it comes to calibration of models. It doesn't matter if the prediction is weak, just as long as it is consistent over time periods. Banks rely on this consistency to ensure they are provisioning properly for losses.
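
A rough sketch of that kind of consistency check: bucket the scores, then compare observed bad rates per bucket across time periods (the column names are hypothetical):

    import pandas as pd

    def calibration_by_period(df, score_col="score", outcome_col="defaulted",
                              period_col="quarter", n_bins=10):
        """Observed bad rate per score bucket, per period; similar columns across
        periods suggest the model's calibration is holding up."""
        binned = df.assign(
            bucket=pd.qcut(df[score_col], n_bins, labels=False, duplicates="drop")
        )
        return (binned.groupby(["bucket", period_col])[outcome_col]
                      .mean()
                      .unstack(period_col))

    # Usage with hypothetical data: one row per loan, with a model score, an
    # observed outcome, and the quarter it was booked in.
    # print(calibration_by_period(loans))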


I'm a working scientist, rather than someone in the corporate world, but this rings true for me as well. During a recent outbreak, we've had very fast turnaround demands, and while we've done great work in that time, I think some of our best ideas have come from being able to slow the hell down and think.


I view IBM or especially HP jumping on a bandwagon as a strong negative signal for that technology.


All this will change with the internet of things. Once every "thing" is networked, these optimization platforms won't need to wait for some human to input info about altered environments; the platform will "sense" it.


Amen


>And that is not primarily a tool problem.

>A lot of vendors want to cast the problem as a technological one. That if only you had the right tools then your analytics could stay ahead of the changing business in time for your data to inform the change rather than lag behind it.

Many people, like the author, just don't get it, and that's fine. The same way people didn't get search before Google.

>But how do I feel good about my graduate degree if all I'm doing is pulling a median?

The graduate degree is what allows you to receive $N x 10^5/year (for a respectable value of N) for that pulling of a median.

>If your goal is to positively impact the business, not to build a clustering algorithm that leverages storm and the Twitter API, you'll be OK.

On the other hand, if your goal is power(OK, OK) instead of just OK, then the clustering algorithm/Storm/Twitter route is the way to go.



