Foundations of Data Science [pdf] (cornell.edu)
419 points by Anon84 on Oct 7, 2016 | 122 comments



There's a great quote by some bodybuilder. "Everybody wants to get big, but no one wants to lift heavy weights."

Paraphrasing this to data science: "Everybody wants to have software provide them insights from data, but no one wants to learn any math."

The top two comments here illustrate this perfectly. Anyone who is serious about learning data science will read this book and will not shy away from learning math. You can also learn about data pipelines, but that's not a substitute for what's in this book.

There are also a variety of other algorithmically focused machine learning books. They are also not a substitute for this book.


Also consider that those who shy away from learning this material aren't exactly wrong. They're making an optimal choice given their skillset. I've been a DS for the past 4 years, and was a risk quant at a bank before that. So the territory is familiar, especially as a math major. But my colleagues who are enamoured of this stuff often underestimate how dry and boring it can get for a non-math person. I took 2 of my colleagues under my wing and, after getting their feet wet with a few DS projects, asked them what they'd like to do. Their response was basically: we want to get back to engineering. The thing with software engineering is that you can iteratively make progress. You can come in knowing zero Ruby and zero JS on day one, cut and paste from Stack Overflow and get by, and slowly, over the next month or so, figure out enough to do a good job. Unfortunately for DS that's not an option. The fundamentals are best learnt as theorems sitting in a classroom, not as Spark APIs. So the ability to invoke a classifier API or do an SVD with Spark isn't actually giving you any intuition as to what's going on under the hood. Soon people realize this, and say hey, this stuff is fine but it isn't really my thing. Which is ok.


I feel that what you describe is largely a pedagogical fault of the community though: most of the people who understand the topic gained their education in a particular way, and just haven't thought very hard about how they might make that process better using new technology. This isn't to call anyone out, of course, because that exact process happens to literally every field, and even within mature fields as new paradigms or technologies are developed.

I think there's a lot of room for mathematics education to be revamped to be more iterative and exploratory, which would make it appeal to a wider audience. The mathematics community is actually beginning to pick up on this, publishing Python notebooks that contain increasingly sophisticated models as you read into them and provide controls for readers to manipulate the parameter space of the models to explore the consequences of claims, theorems, etc. I think this kind of exploration of the numbers is a better way to learn a lot of ideas than the way I was taught in class, particularly things like statistics, where you can see distributions change as you manipulate parameters.
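To make that concrete, here's a minimal sketch of the kind of notebook cell I mean (my own illustration, assuming a Jupyter environment with numpy, matplotlib, and ipywidgets installed): drag the sliders and the distribution redraws.

```python
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider

def plot_normal(mu=0.0, sigma=1.0):
    # Redraw the normal density for the current slider values.
    x = np.linspace(-10, 10, 400)
    pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    plt.plot(x, pdf)
    plt.title(f"Normal(mu={mu:.1f}, sigma={sigma:.1f})")
    plt.show()

# interact() turns the keyword arguments into live sliders in the notebook.
interact(plot_normal,
         mu=FloatSlider(min=-5, max=5, step=0.5, value=0.0),
         sigma=FloatSlider(min=0.2, max=4, step=0.2, value=1.0))
```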

I figure that in a few years, as people my age (mostly taught using older methods, but with a little exposure to explore-via-computer learning in university) mature, we'll see an increasingly "internet-y" version of teaching methods appear, with a mix of video lectures, reading materials, discussion posts/blogs, and interactive models.

Computer programming and hobby crafts are among some of the first fields to make that transition, traditional academics are somewhere in the middle of the pack, and specialized topics like data science are keeping pace with the wider academic community.

tl;dr: I think your point is really just a temporary pedagogical fluke, as we take time to transition our teaching methods.


I completely disagree. Dr. Neal Koblitz (co-inventor of elliptic curve cryptography) spent a good bit of his life fighting this obsession with bringing the computer into math so things can be visualized & manipulated. You should check out what he has to say: https://www.math.washington.edu/~koblitz/mi.html

btw, I used to think like you before I got into math. It's a very valid albeit flawed viewpoint - to think that hey, if only there were a JavaScript doodad somewhere so I could move a slider and manipulate the parameter space, I would instantly "feel" the CLT in my blood instead of working through the CLT as a dry formal math proof. The reality is, this sort of dumbing things down works maybe in 2D and at best 3D, but beyond that its utility rapidly diminishes. Also, the results you get in the lower dimensions don't carry over to higher-dimensional spaces. As yummyfajitas points out elsewhere, in a reasonably high-dimensional space, all pairs of random points will be fairly equidistant, which is simply not the case in, say, 2D. So even the little bit of useful intuition you learn in the lower dimensions becomes quite useless as you scale up.
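That high-dimensional claim is easy to check empirically, by the way; here's a minimal numpy sketch (my illustration, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 1000):
    pts = rng.standard_normal((200, d))        # 200 random Gaussian points in R^d
    diffs = pts[:, None, :] - pts[None, :, :]  # all pairwise difference vectors
    dists = np.linalg.norm(diffs, axis=-1)[np.triu_indices(200, k=1)]
    print(f"d={d:4d}: mean distance {dists.mean():6.2f}, "
          f"relative spread {dists.std() / dists.mean():.3f}")

# d=2 shows a large relative spread; at d=1000 the distances cluster tightly
# around sqrt(2 * d), i.e. random points really are "fairly equidistant".
```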

Proofs are best learnt by doing proofs. Math is not cinema.


You responded to a strawman of what I said, made no points that even remotely address or refute what I said, and further, decided that you were going to personally attack me. In fact, your link generally agreed with my point! (In summary: traditional reading and lecture methods have benefits; videos and other newer media can help (some) people, and give a new perspective; computers have a few useful behaviors; engaging imagination and exploration are better than rote learning.)

That's really kind of pathetic, and a disservice to your view.

I'd be happy to speak with you, though, if you want to actually read my comment and address a (charitable) interpretation of what I said, rather than your strawman one.


Traditional math education: a professor explains proofs of selected theorems on a blackboard and students are then required to memorize the proofs and reconstruct them during the examination. They are also required to solve a lot of problems that involve proving some consequences of the main theorems or doing some rote calculations using nothing but pencil and paper.

The advantage of this approach is that it works. You can acquire solid intuition this way. But it is arduous as hell and also pretty one-sided - problems are limited to what an average undergraduate can do by hand.

Now, when I think about using computers in math education I don't think of some "javascript doodad" with moving sliders. I imagine instead implementing an algorithm or doing a numerical experiment, learning about possible pitfalls in the implementation, running the algorithm on some real data, seeing it fail, etc. I think this experience is pretty valuable too and can nicely complement the traditional approach.
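As a concrete example of the kind of pitfall I mean (my own sketch, assuming numpy): the textbook one-pass variance formula E[X^2] - E[X]^2 falls apart in floating point when the mean is large relative to the spread - something you'd never discover from a pencil-and-paper derivation alone.

```python
import numpy as np

rng = np.random.default_rng(1)
x = 1e8 + rng.standard_normal(10**6)        # true variance is 1

naive = np.mean(x**2) - np.mean(x)**2       # one-pass textbook formula
stable = np.mean((x - np.mean(x))**2)       # two-pass, shifted formula

print(f"naive:  {naive:.4f}")    # garbage from catastrophic cancellation
print(f"stable: {stable:.4f}")   # close to 1
```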


That's why that approach has been offered in courses traditionally called "Numerical analysis" that have been available to math students for... how long?


Taking the sarcasm out, your point is that such a thing exists. Agreed. But usually it is just a single course, and it does not cover everything. And why is it a separate course at all? Why should you study some field with pencil and paper and then (possibly, a couple of years later) study a subset of that field "numerically"?

As an example of what I am talking about, the book "Structure and Interpretation of Classical Mechanics" (a cousin to SICP) tries to teach classical mechanics with the help of computer programs. But this approach remains unconventional.


I should probably not have been sarcastic, especially when I agree completely about the importance of looking at math with a more constructive/computational point of view.

Now, I'm not a mathematician or a scientist of any kind, merely a dilettante textbook reader, but I'm not sure that approach is sufficient, or even most efficient to teach all subjects.

I think it arises naturally when you approach subjects where it's directly relevant: engineering, complexity theory, type theory, etc. and people who will need those insights will get them in due time.

But yeah, broadening the audience could help people get a better appreciation of the connections between pure and applied math.


This is one of the reasons that I really dislike the term "data scientist." Essentially the job is Analyst, which is something we've had for many years. It's not a bad thing, but there's no reason to embellish it into something it's not.

The primary reason I dislike this term, though, is because all scientists are data scientists. Data is how the whole thing works.


The problem is that industry has co-opted the term data analyst to mean something more like data reporter or data describer. There often isn't a difference between business intelligence and data analyst. Many data analysts are using tools like Excel, and don't have good programming skills. A data scientist will generally be involved with prediction, and use more sophisticated tools than Excel. This isn't to knock data analysts (or Excel); they do good and useful work. The proof is in the numbers - a data scientist makes a significantly higher salary than a data analyst. Similarly, quantitative analyst got taken over by finance. The term statistician is actually pretty accurate (data science is basically applied statistics), but usually people don't call themselves statisticians unless they have a degree in statistics. With that said, the term data scientist is not at all well-defined, and the field is quite diverse.


Originally "data scientist" meant a statistician who could write distributed systems to implement their algorithm in production.

Now that the term is diluted, I propose we adopt the banking industry term "quant developer" to refer to that.


I'm sorry, but I'm not sure that you're correct that a data scientist is a statistician who can write distributed systems. Problems have been solved for decades by people with quantitative training, from well before distributed systems (or computers) existed. Very roughly, a data scientist is a person with a combination of quantitative training and some programming skills, who is able to effectively analyze (and explain) different kinds of data, be they audio, text, census information, or whatever it might be. What we call data scientists today are the engineers, computer scientists, and mathematicians of yesterday.


How about "data monkey"? At least then when the corporations post job ads and bring up buzzwords in their meetings, we can silently laugh at them.


I have no experience in Data Science. How much need is there for a distributed system? How many data points would one need to necessitate it? What if we had a billion data points. Would it be sufficient to run on a fast system overnight to crunch data?


According to me (note zintinio's reply - he disagrees), the etymology of the term is the following.

Back in the day, if you had small data you didn't need a data scientist. You needed a statistician. He'd do some shit in SAS/Python/etc, reading one CSV and writing another. Then the developers could run that on a beefy server with cron and push the CSV output someplace else.

At some point during the 2000s this stopped working - things became complicated and tangled enough that you couldn't just munge CSVs like this. You needed folks who understood the math well enough to come up with algorithms, and who also understood the computer science well enough to scale out. These folks were termed "data scientists". Banks call them "quant developers".

Very often you don't need a distributed system, or anything that isn't trivially parallelizable. Most of the time when I make money it's based on a CSV < 100GB - often it fits in RAM. Nowadays I don't think it's misleading to call yourself a "data scientist" if you can only handle such data sets.

I recently learned that Facebook calls their business analysts data scientists. I guess it's just sexier.


Science is more than just data. A theoretical physicist is a scientist who does not necessarily care about data collection. Some anthropologists focus on case studies, rather than data, especially if they're just doing descriptive study rather than predictive.


100% agree. When I made the switch from technical business analyst to head down the road of data science, I had the option to use a year of postbac study on computer science or math. The CS dept at the school wasn't fantastic, it was mostly in Java, and DS is moving so fast I wasn't certain what percentage of what I studied would be of any use in 5 years' time. So I studied math. All the math I learned is still good as new. Still true. Still useful. Many DS tools have come out since I finished that year. Many are Python based.

One thing I am constantly frustrated by is the lack of mathematical rigor that pervades DS tools, and programming in general. It seems like everyone is happy to be given some programmatic tool along with all the mistakes that were programmed into it. "You just need to deal with that." "You'll get used to the syntax." Yeah, but I'd rather not, since it's often completely arbitrary and specific to a given tool. And, well, it sometimes just doesn't make sense in the way something with mathematical rigor built in might.

But, I often feel when I bring this issue up I'm shouting down an empty hallway.


Part of the problem with DS is that there are known engineering solutions without good theoretical explanation; that is, we just brute-forced some basic engineering solutions and keep hacking at them to get more performance, but don't have a good theoretical model of why those things do what they do. (Or we have a theoretical model too complex to compute useful insight from, which is another kind of problem.)

I think that this kind of approach is going to get us pretty far -- see how far we got on building bridges with no more sophisticated insight into gravity than "things fall down; some things are heavy", or into materials science than "some things hard and not bend" -- but I think that ultimately, our longer-term goals like AGI will require that we build theoretical models capable of explaining our current engineering progress and predicting a new, deeper valley of progress to move to.

Of course, that's a lot like saying making progress on physics paradigms requires explaining high temperature superconductors (or other currently open problems). It's actually pretty normal for a field to be a mix of engineers being ahead and theorists being ahead.

Mathematical rigor is really a sign that you're in established territory and explorers have moved on, and much of data science is still under heavy exploration.


There are tons and tons of examples of "established territory" in CS that do not exhibit "mathematical rigor". When something works in CS people seem to say "good enough" and use it. In mathematics (or, as you pointed out, in physics) it becomes an "open question" and people begin the hard work of explaining the phenomenon by applying rigor. In maths, it's a constant march to push the boundary, where the boundary is defined as that which is interesting but not yet well defined. In CS it seems that the boundary is that which is not yet a solved problem, where a "problem" is something that needs getting done. Once, in CS, we can get it done, people move on. It's not often in the enterprise CS process (academia is a whole other beast) that anyone assumes there should be an application of rigor, or an attempt to well-define "solutions", before moving on. There is simply the accumulation of technical debt.

But, in my opinion, technical debt has amassed at the systemic level to such a point that it's starting to look like a Ponzi scheme. Sure, it works, so long as we keep throwing good money after bad. But stop investing, and one can see it for what it is. In maths, on the other hand, step away from it all for a year or two, and when you go back everything is still just as valuable and beautiful.

edit: What I would love to see is a "category theory" for programming languages/paradigms such that moving between them is well defined. It boggles my mind that translating between two programming languages isn't trivial. If both are well defined, one should be able to translate one to the other precisely, given one is willing to define the translations. There should be zero guesswork or heuristics in the process.


It is baffling that machine translation of human language is approaching human accuracy but is still far behind on programming languages, given that they are supposed to be more closed and closer to the metal/machine.


I find that the opposite of baffling: human language is information-sparse and redundant; computer languages are information-dense and, by choice, made as non-redundant as possible. It just seems like trying to hit a big fluffy cloud on one side and a bald tree on the other. Obviously the big fluffy thing is easier to hit than the spindly, narrow one.

How does machine translation do at poetry, in the sense of capturing the figurative meaning and stylistic elements, not just the literal meaning of the tokens?


I love this quote. For accuracy purposes, it is from Ronnie Coleman, former Mr. Olympia, and goes more like: "Everybody wants to be a bodybuilder, but nobody wants to lift no heavy-ass weights... I do it though."


Great quote, but sadly lifting weights was not enough - heavy doses of HGH were involved too, if you take a look at how Ronnie looks these days.


That's true of literally every bodybuilder he competes with.

There's natty bodybuilding for people who don't want to use PEDs.


I never said to not read the book or learn the maths. They are obviously important. I apologize if I sounded dismissive. But I've met people in my field that are great at the maths (which should be a given), but can't execute the analysis because they can't do simple "janitorial" tasks. Perhaps this is a problem more prevalent in academia. I think there should be equal attention paid to both aspects of the job.


I absolutely agree that a data scientist must be a good software engineer also. But there are lots of books on software engineering. There are lots of books on algorithms.

As far as I'm aware this book is one of a kind. This book covers ground that nothing else does, at least not in any comprehensive way.


> Anyone who is serious about learning data science will read this book and will not shy away from learning math.

The problem is not math. The problem is the way math is explained. Most books do not go into the intuition and are very dry. People generally shy away due to the dryness of the material, not the content.


One of the major points that this book is trying to convey is that your intuition is wrong. You need to burn it down, learn the formalism, and maybe develop new intuition based on that. Even then you need to recognize that intuition can be wrong.

An N-dimensional vector space is not like R^3. A monad is not like a burrito, and >>= is not like Chipotle forgetting to put chicken in your burrito, opening it up, adding chicken, and wrapping it in a new tortilla.

Math is a new thing rather than the same old thing with new syntax. It needs to be understood on its own terms. If you are serious you'll learn it.


I agree with your sentiment that math is important, but it is also true that math is not taught well. Just briefly glancing at the first part of chapter 2, the authors present the law of large numbers out of the blue, then just go on to present proofs of it. There is no clear discussion of why this law is important, when you can use it, and so on. Contrast how Wikipedia explains it:

> In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

This book:

> If one generates random points in d-dimensional space using a Gaussian to generate coordinates, the distance between all pairs of points will be essentially the same when d is large. The reason is that the square of the distance between two points y and z ...
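To see what the Wikipedia description means in practice, here is a minimal simulation (my own sketch, not from either source): the running average of die rolls tends toward the expected value 3.5.

```python
import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=100_000)                # fair six-sided die
running_mean = np.cumsum(rolls) / np.arange(1, len(rolls) + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>6} rolls: mean = {running_mean[n - 1]:.4f}")
# The printed means drift toward 3.5 as n grows - that convergence is the LLN.
```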


I always get an eerie feeling about people who excessively stress formalism. It almost goes to the point where I feel they don't want to share their knowledge or intuition.


Clearly the guys who wrote the only comprehensive introduction to a topic don't want to share their knowledge.


The previous comment was not directed towards the authors of this book.


They are trying to help you get a correct intuition. If you have never approached things this way, there are almost certainly very large holes in your understanding. Not to say that everyone needs to be formal all the time, but you need to sometimes.


> If you have never approached things this way, there are almost certainly very large holes in your understanding.

When anyone tells me that X is the ONLY way to do it, almost exclusively I have found them wrong - beyond data science. Formalism is a means of communication, not the end. You can always communicate ideas without formalism.

FYI, I have a Ph.D. in Computer Science and have been a practicing Data Scientist for many years now.


You are essentially saying - the way I think of math is how it should be learned. Sorry but if you're serious, you will find a way to explain your mathematical intuition.

The problem is detailed in this wonderful essay by Paul Lockhart:

https://www.maa.org/external_archive/devlin/LockhartsLament....


We can explain mathematical intuition. We just can't explain it as quickly, or in as interesting a way, as you seem to want.

Here's a simple concept: every non-constant polynomial with coefficients in a number system (a field) has a root, possibly in a larger field.

Do you know how hard it is to convey what exactly this means and what the consequences are? I like Lockhart's essay. It resonates with me. However, I don't see his ideas being useful in terms of learning advanced topics. At some point one has to get their hands dirty and slog through the material. Intuition will come with experience.
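A tiny illustration of that concept (my sketch): x^2 + 1 has no real roots, but it does have roots once you enlarge the field from the reals to the complex numbers.

```python
import numpy as np

coeffs = [1, 0, 1]          # the polynomial x^2 + 1
roots = np.roots(coeffs)

print(roots)                # [0.+1.j  0.-1.j]: roots exist in C, not in R
print(np.isreal(roots))     # [False False]
```

Conveying why such an enlargement is always possible, and what it costs, is exactly the hard part.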


> Do you know how hard it is to convey what exactly this means and what the consequences are?

I never said it is easy.

> Intuition will come with experience.

And this experience that some people have already reached should be shared, not limited to a select coterie of people.


It is shared. It's just that everyone has to go through the mental work to truly understand. It takes me a semester to convince my students that 3x+5x=8x, and that this is true because of the distributive property. And this is why ax+2x = (a+2)x. And the reason we can't simplify further is because our language does not have a word for (a+2) but it does have a word for (3+5). But I know that after a semester of teaching this, most still don't actually understand the distributive property. It takes a lot of effort on the part of students for the concept to click.
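For what it's worth, a computer algebra system makes the same move (a sketch in sympy, my example): it rewrites a*x + 2*x as x*(a + 2) via the distributive property, precisely because the notation has no single name for a + 2.

```python
from sympy import symbols, factor

a, x = symbols("a x")

print(factor(3*x + 5*x))   # 8*x       -- (3+5) has a name, so it collapses
print(factor(a*x + 2*x))   # x*(a + 2) -- (a+2) has no name, so it stays factored
```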


> It is shared. It's just that everyone has to go through the mental work to truly understand.

Except, I can hardly see that. At least, I see you trying - most people do not make an attempt at all.

FYI, I am a Computer Scientist.


Most math textbooks focus too much on proofs, as does this book. Data science is a broad field, but for most use cases formal proofs are not really needed. I would prefer a text that focuses on examples and how/when to apply the theorems, rather than here is a theorem and here is a proof. Oh and here is another theorem and here is another proof. That isn't to say proofs are not important, but the focus of data science is usually on applied statistics.


I would prefer a text that focuses on the java standard library and shows how/when to apply a java.util.List<T>. That's far better than a Data Structures and Algorithms book that teaches you how you'd implement a linked list or other complex data structure.

90% of the time I use the java standard lib. Sometimes I need something almost the same, but a bit different. The difference between a middling developer and a good one is what happens in that 10% of the time.

The same applies to proofs. What happens when you need something that's almost like the theorem, but not exactly? Can you tweak the proof to apply to your case, or alternately recognize that it can't be fixed and you need to do something different?


I agree that knowing and understanding the fundamental math is important, but I think proof based texts are not a good way to learn math. A lot of people get turned off from math because of how it is taught. Personally I learn best by doing problems and seeing examples. Proofs are useful, but I want to know why the theorem is important before I learn how to prove it is true.


I don't think anybody is saying that proofs don't matter, or that they aren't important. But it looks like you're choosing to focus on uncommon edge cases. I think a lot of people would question that line of thinking. It's almost like the old saw about how you don't need a degree in mechanical engineering in order to effectively use a car. But if you need something "almost like a car, but also kinda like a bulldozer" then you're out of luck. OK, how often does that scenario come up? Enough to care about? For most people, the answer is a resounding "no".

Likewise, one can certainly apply a lot of mathematical techniques in a lot of situations, without proving new theorems. So there are edge cases... not the end of the world. Deal with it when it comes up.


> Most math textbooks focus too much on proofs, as does this book.

My problem with (most) maths texts isn't the proofs in and of themselves, it's just that they make so many assumptions about what you already know, and don't always state those assumptions clearly. And then too many maths texts, to me, fail to include enough expository text to explain the math and provide more of an intuition around what's going on.

That said, I also agree with you that more examples in terms of applications would be nice. Learning really abstract stuff in isolation is OK, but it's always nice to have a few (or more than a few) examples of how to apply the technique to something concrete.


Most well-written textbooks list the assumed background knowledge in the preface/introduction.

Just skimming that part for the word "prerequisites" or "background" will usually find it.


> Most well-written textbooks list the assumed background knowledge in the preface/introduction.

Most actually-existing books simply write that the student should be "mathematically mature", by which they mean that the student should know all of multivariable and vector calculus, most of differential equations, much of optimization, and some amount of analysis. And also be able to do proofs, and also have good intuition.

In short, most "applied" math textbooks aim themselves at Master's degree students or beginning PhD students in math itself, to teach material which can actually be taught to anyone with a bachelor's in science or engineering (except possibly computer scientists, whose continuous maths degrade because we never use them).

Ironically, the first math textbook I've ever picked up that didn't assume far too much background was my real analysis textbook, because analysis teachers assume that they are teaching the gateway course to "real math" and have to educate little babies who only just got done with calculus.


I don't see the problem?

If you want to learn something difficult that's above your level, and learning that something includes learning a prerequisite subject that's also above your level, you simply start at the node in the DAG that's at your level and work through the prerequisites if you're serious.
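A toy rendering of that DAG idea (the subject names and edges here are made up for illustration): topologically sort the prerequisite graph and work upward from where you stand.

```python
from graphlib import TopologicalSorter  # Python 3.9+

prereqs = {
    "calculus":       {"algebra"},
    "linear algebra": {"algebra"},
    "real analysis":  {"calculus"},
    "probability":    {"calculus", "linear algebra"},
    "data science":   {"probability", "real analysis"},
}
print(list(TopologicalSorter(prereqs).static_order()))
# e.g. ['algebra', 'calculus', 'linear algebra', 'real analysis',
#       'probability', 'data science']
```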


I sometimes jump in at the place above my level and read both books simultaneously. You often need only a few things from the prerequisite class.

But yes, there is no substitute for learning all the things you need.


> Most well-written textbooks list the assumed background knowledge in the preface/introduction.

Exactly. And a surprisingly large number of books aren't well-written, exactly because they don't include that bit.


Over 90% of any medium is garbage, which is why you have to take the time to make sure you're spending it on the best material.

For most well-studied topics, finding the books that are good enough is not that difficult.


If you understand a mathematical theory, then you should be able to figure out on your own when that theory is applicable and how to apply it. So I can't help but read your comment as “I don't want to have to learn”.


You are being condescending and oversimplifying the issues; try not to rush to judgment. I have spent a lot of work and time learning how to apply machine learning, so your comment was offensive and wrong. Applying mathematical theory is not just a matter of understanding the theory - in the real world there are a whole lot of reasons why you won't be able to apply the theory, even if you understand it perfectly. You will always have to make assumptions and approximations. I like learning, but I prefer learning for a purpose. Proofs don't really give you a practical understanding of the theory.


> You will always have to make assumptions and approximations.

The ability to determine when an approximation is good enough comes from a solid understanding of the underlying theory. For example, we understand that special relativity doesn't invalidate classical mechanics, because, in the limit, as speeds and energies become arbitrarily small, their predictions coincide. But, if all you have is a bunch of computational recipes, even the tiniest unexpected thing renders you unable to solve problems.

Anyway, I'm not saying that examples of applications are a bad thing. But proofs are indispensable. Even for “intuitive” people, proofs are necessary confirmation that their calculations will match their intuitions.


You bring up an interesting example in special relativity. I am a physicist, and there isn't really a "proof" of special relativity in the mathematical sense. Googling it turns up this Quora answer: https://www.quora.com/What-is-the-proof-for-Special-Relativi...

There is no proof of special relativity, we believe it because it makes experimentally verified predictions. Your example of limiting cases of special relativity is how I wish statistics texts were taught - it isn't based on the "proof" of special relativity. Of course I know when special relativity is applicable and when it isn't (when beta = v / c is close to one, special relativistic effects are important, and gamma = 1 / sqrt(1 - beta*beta) indicates how good the approximation is).
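To put numbers on that rule of thumb (a quick sketch of my own): gamma stays essentially 1 at everyday speeds, which is why classical mechanics works until beta approaches 1.

```python
import math

for beta in (1e-6, 0.01, 0.1, 0.5, 0.9, 0.99):
    gamma = 1 / math.sqrt(1 - beta * beta)
    print(f"beta = {beta:<8} gamma = {gamma:.6f}")
# beta = 1e-06 gives gamma = 1.000000; even beta = 0.1 gives only 1.005038,
# while beta = 0.99 gives 7.088812 - relativistic effects dominate.
```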

But I do agree that proofs are useful and needed, I think that complicated proofs could easily be placed in appendices and looked at after understanding why the theorem is relevant. Of course this is just my personal preference.


> There is no proof of special relativity, we believe it because it makes experimentally verified predictions.

Right, that's because special relativity is a scientific theory. A scientific theory has two components: a mathematical theory (in which you can perform a priori physically meaningless calculations) and a physical interpretation (which turns the results of such calculations into predictions).

However, statistics isn't a scientific theory. It's just math, and like all math, it's about itself and nothing else. It doesn't make sense to “experimentally confirm a mathematical theory”, because, without a physical interpretation, math doesn't make any predictions about the real world.

> I think that complicated proofs could easily be placed in appendices and looked at after understanding why the theorem is relevant.

Then you're looking for books on applications. That's fine. But a book whose subject matter is a mathematical theory (it even has “foundations” in the title!) can't relegate proofs to appendices.


Are there any MOOCs that cover this math? I suspect Coursera and edX can only get so far from introductory material and still have enough students to justify the costs.


I have a PhD in Biomedical Engineering, and have been working as a DS for the past several years, and I can tell you that there is a hard ceiling for many of us out there, because we don't have enough of a math background (I'm looking at you, "Real Analysis") to ever really understand the "deep theory" that someone with a PhD in Statistics or Math (or hell, even a BS in Math) can have.

However, having said that, I still consider myself a very productive DS, and I get stuff done. I'm not going to ever be on a research team at Google, but those aren't the only kinds of DS jobs out there.

And I'm not just plotting stuff.


To your point about analysis, the folks with a CS education here (myself included) should ponder this quote from the intro: "...we have written this book to cover the theory likely to be useful in the next 40 years, just as an understanding of automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods." Some of the key stuff in statistical learning isn't even real analysis, it's functional analysis (usually two courses later).


You are right, but an important part of providing insights from data is actually providing the insights, meaning the interface problem between data science and users is crucial.

For many industries, this problem is not well-solved at all. And it's not trivial. It's one thing to have great models, it's another to have a good way of making people smarter with it.


I really appreciate your viewpoint, a friend and I had a discussion about this exact problem last night. Nobody wants to dirty themselves in the details, but that's precisely where you need to go. Also, you have a very interesting blog, you've now gained another reader.


yummyfajitas:

THANK YOU for posting this comment. I absolutely LOVE these two quotes, and can't tell you how many times I could have used them over the years.

These two quotes are now a permanent part of my arsenal for responding to busy individuals who say they "want to understand" a new technology or idea, but what they actually want is for the new technology or idea to be explained to them using only simple concepts with which they are already familiar, so they don't have to learn anything new.


The high theory stuff is great, but a significant portion of the job is being a data janitor. Being experienced and fast at manipulating data structures, recognizing patterns in text datasets, understanding common formats used in the field and just having domain knowledge in what you are analyzing should be more emphasized in my opinion.


I sort-of agree but the "data janitor" knowledge can be learned "on the job" ad-hoc and as-needed.

Mastering basic theory, on the other hand, needs a coherent and structured study-plan which requires extended focus and single-minded emphasis (at least for most folks).


The on the job learning of "data janitoring" may be a contributing factor to why so much janitor action is required!


>I sort-of agree but the "data janitor" knowledge can be learned "on the job" ad-hoc and as-needed.

Logically, yes. But it's amusing how many sciency types (PhDs et al.) who know all the fundamentals in theory can't do such practical tasks if their life depended on it.

It's like they thought the theorems and abstract objects they've learned would never be encountered in the wild.


Yet PhDs are disproportionately more likely to get the interesting jobs that require stats: quants, experimental positions at Google, etc., etc.


Listening to Software Engineering Daily, I heard that there is a trend toward developing 'data engineering' as a discipline that handles sanitizing the data pipeline, so that data scientists can spend more time working at business abstraction layers.

There's probably a scale at which that works better - two-pizza data science teams, for example.


Yeah, DE seems to have emerged as the title for those who manage large databases -- who regularly import raw data, cleanse, normalize, update, construct analysis pipelines, scale up and parallelize processes, and validate results. It lies somewhere between a DBA, sysadmin, and HPC engineer, with an awareness of basic stats -- where mastery of Hadoop's many components might converge.

It seems like a natural evolution of DB admin for very large scale noSQL and DBs, esp. those with unstructured data and often non-commodity architecture. I've seen numerous companies looking for such folks and I suspect demand will rise.

IMO, it's not a job you'd want to outsource. The role is too mission critical, the skills not predictable enough to be a commodity, and the penalty for screwing up is too great.


My issue with these sanitizing pipelines is that they only really work in fields where the data generation is relatively stable. Meaning, whatever instrument/method is used in generating the data doesn't change dramatically every year. It's extremely challenging to design an all-purpose sanitizing pipeline.

So I can imagine a pipeline developed for use with well-established social media APIs or standard scientific experiments that have been in use for decades. But it is hard to imagine a pipeline that can handle amorphous, emerging high-throughput instruments/methods.


I have to disagree on this. Data scientist is a very fluid and fast-changing term. I know companies that now advertise what would traditionally have been "machine learning researcher", "machine learning engineer", or "research scientist in <any data-related field - data mining, computer vision, signal processing, machine learning, statistics>" as "data scientist" positions. Because it is sexy; in many places, management and job hunters get a kick out of the term "data science" even if it is actually a renaming of other traditional roles. This book perfectly captures the diverse nature of the term.


As someone shooting for a 'Data Janitor' (or what I think of as a digital plumber) position upon graduation, it would be cool to speak to the Data Scientists in their own language and understand what the hell they mean when they mention higher-dimensional vectors, etc. But I'd much rather leave the high math to an expert and have them leave the plumbing to me!


You can read something similar to chapter two, on high-dimensional spaces, at https://jeremykun.com/2016/02/08/big-dimensions-and-what-you...

Also, chapter three, about SVD, is covered in https://jeremykun.com/2016/04/18/singular-value-decompositio...

and https://jeremykun.com/2016/05/16/singular-value-decompositio...

The advantage is that you have the Python code available.

For Markov chain Monte Carlo, see https://jeremykun.com/2015/04/06/markov-chain-monte-carlo-wi...

The book seems to be interesting.


Indeed, the Monte Carlo post was inspired by some discussions after reading that chapter. I also borrowed and adapted a proof or two from the SVD chapter for that second post.


Your Monte Carlo post is interesting, but it doesn't consider the context of Markov networks; for example, Norvig and Russell's AI: A Modern Approach has good examples. There is a subtle point that, given probabilities defined locally, you must prove that the global structure is a probability distribution, and this requires some order on the vertices to propagate the information from the root to the leaves. Also, Markov models can have higher degree, and then they are not like random walks. Anyway, the context of matrices and eigenvalues is interesting.


I was surprised to see so little attention paid to regression. Regression is quite powerful and a core part of a DS's toolkit.

For example:

- What is the distribution of the residuals, and how does it change over time as data comes in? How Gaussian are they (or not)? Analyzing the weird/odd cases, especially around the tails.

- Which features offer the most significant signal to the model, and which do not?

These skills are even applicable to SVM and other classification analysis.
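As a minimal sketch of that residual check (my example data, using numpy and scipy): fit a line, then ask how Gaussian the residuals look.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 500)
y = 2.5 * x + 1.0 + rng.standard_normal(500)    # linear signal + Gaussian noise

slope, intercept = np.polyfit(x, y, 1)          # ordinary least squares line
residuals = y - (slope * x + intercept)

print(f"residual mean {residuals.mean():.3f}, std {residuals.std():.3f}")
stat, p = stats.shapiro(residuals)              # Shapiro-Wilk normality test
print(f"Shapiro-Wilk p = {p:.3f}  (a small p hints at non-Gaussian tails)")
```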


Agree with this - simple concepts taught in depth, with lots of applications and practice, is very useful. Knowing 10 ML algorithms superficially will never be as good as knowing 1 with all of its caveats.


"Background material needed for an undergraduate course has been put in the appendix."

So just being honest, the Appendix is still rather terse and advanced for me. Does anyone have suggestions for prerequisite readings that would help getting someone prepared for this text?


Introduction to statistical learning http://www-bcf.usc.edu/~gareth/ISL/ is a great text for beginners interested in machine learning. It is designed to be accessible (there is a more advanced book covering the same topics) but is still quite comprehensive, in terms of machine learning basics.


Have you taken the standard undergraduate engineering math coursework?

- Calculus, covering differentiation, integration, and multivariable

- Linear Algebra

- Differential Equations (May not be relevant here)

'Discrete math' is also useful.

Some of the derivations in the book you can take on faith and not fully prove out to save time, but you should feel 100% comfortable/confident with the notation used.

Let me know what's confusing you and we can try to figure out what you are missing.


No remember this is Hacker News. Most people here learned to code "in a weekend" and they can pick up a new framework "in a weekend". They also see no use wasting money on a 4 year computer science curriculum if you can just learn to program "in a weekend".

Ok I'm being extremely facetious here but it still shocks me when people comment in machine learning or data science threads asking about how they'd go about picking this up but you can clearly tell they have no formal background in the sciences. I guess the only reason it angers me is because if they had just done a CS degree instead of trying to "hack it" none of this would seem like magic.


I'm in this category. I guess my excuse is that I did Bio instead. What's the best way to get up to speed without having to go back to school? I stopped at linear algebra in undergrad but I probably need to refresh that as well. Would appreciate any thoughts you have.


Without getting bewildered, I would suggest going through these 3:

1. Book of Proof by Hammack (http://www.people.vcu.edu/~rhammack/BookOfProof/)

2. Calculus by Spivak

3. Linear Algebra Done Right by Axler

Be prepared to work through all (or at the very least only the odd numbered) exercises. If you can't stomach that or find that life gets in the way of you completing even these very basic books, you do not have the time or discipline required to advance in mathematics.


Might be worth noting that "Calculus by Spivak" is famously mistitled and is what people today call an intro analysis text. (It says so right in the intro :-)

There are now "warm up" books (Alcock) as well as even more basic real analysis books (Abbott, about half the length of Spivak).


Thank you!


If you make it through all of those, pick up Hubbard and Hubbard's Vector Calculus for a unified treatment of multivariable calculus and linear algebra.


The first two years of math in any engineering curriculum should cover these topics. Look up the books, lecture materials and videos from various universities online. It is important to try out the homework exercises to absorb the knowledge.


Thanks, I've started on a couple MOOCs, but the ones that sound interesting end up being over my head.


Yeah that is why the coursework is spread over two years to help students absorb the knowledge. You just have to put a lot of time into it.


Schaum's Outlines are also good if all you really need is a refresher, or to use concurrently with learning the material fresh.


Try a statistics class and a linear algebra class from Coursera.


Thanks!


What sort of foundational mathematics is required to fully consume this book?

I assume multivariate calculus and linear algebra?


Skimming it, that's the impression I got. Discrete math is always super helpful, some rigorous stats pops up here and there, and then there are things that you just need to Wikipedia.


Despite being familiar with most of the material here and agreeing that it is generally useful to know, I still don't know if I'd call this book "Foundations of Data Science". It feels more like "Assorted topics in algorithms, machine learning, and optimization": data science from the perspective of a computer scientist.

Notably missing are causal inference, experiment design, and many topics in statistics--causal inference being one of the primary things we'd want to do with data.


Wow, book from Avrim Blum. Definitely worth reading.


Would working through this book be sufficient to be able to start doing data science/analysis work?

Some context: I did my undergraduate degree in Economics (in a pretty math intensive university), have been working in marketing for the last 2 years and want to go back to do work in something more analysis centered.


No - while it covers the theoretical foundations, it doesn't give you any of the practical skills you actually need to work with data.


I'd say it'd be plenty. You could get away without this book but for new applications or research level work with private datasets a company would really want someone that knows what's in this book.


I was planning on going through Pattern Recognition and Machine Learning by Bishop, for those who have gone through this PDF, which do you think is more useful for learning data science?


It's a book, a real book, quite long.

It has a lot of topics on applied math.

Some of its main topics are linear algebra, probability theory, and Markov processes.

Really, the book just touches on such topics. Usually in college each of those topics is worth a course of a semester or more. So, what the book has on such topics is much less than such a course. E.g., for linear algebra, the book gets quickly to the singular value decomposition but leaves its treatment of eigenvalues for an appendix and otherwise leaves out about 80% of a one semester course on linear algebra. Similarly for probability and Markov processes.
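For readers who haven't met it, the factorization in question looks like this in numpy (a sketch of mine, not from the book): any real matrix A factors as U diag(s) V^T, with orthonormal columns in U and orthonormal rows in V^T.

```python
import numpy as np

A = np.random.default_rng(0).standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(s)                                      # singular values, descending
print(np.allclose(A, U @ np.diag(s) @ Vt))    # True: exact reconstruction
print(np.allclose(U.T @ U, np.eye(3)))        # True: orthonormal columns
```

A full linear algebra course covers why this factorization exists and how it relates to eigenvalues, which is roughly the material the book leaves to its appendix.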

Some of the topics the book has or touches on are unusual with, likely, few other sources in book form. E.g., early on the book has Gaussian distributions on finite dimensional vector spaces where the dimension is larger than is common.

So, for the topics rarely covered in book form, the book could be a good reference.

For topics such as from linear algebra, a reader might get misled without an actual course in linear algebra from any of the long popular books, e.g., Halmos, Strang, Hoffman and Kunze, Nering or more advanced books by Horn, Bellman, or others.

Usually in universities, probability and Markov processes quickly get into graduate material with a prerequisite in measure theory and, hopefully, some on functional analysis, e.g., to discuss some important cases of convergence.

So, the book seems to have some good points and some less good ones. A good point is that the book is a source of a start on some topics rarely in book form. A less good point is that the book gives very brief coverage of topics otherwise usually covered in full courses from popular texts.

A student with a good math background could use the book as a reference and maybe at times get some value from the coverage of some of the topics rarely covered elsewhere. But I would suspect that students without courses in linear algebra, probability, etc. would need more background in math to find the book very useful.

E.g., early in my career, I jumped into various applied math topics using very brief treatments. Later when I did careful study of good texts with relatively full coverage, I discovered that the brief treatments had been misleading. E.g., no one would try to learn heart surgery in a weekend and then try to apply it to a real person. Well, for applied math, maybe learning singular value decomposition, etc. in a weekend might not be enough to make a serious application.

It is good to see a book on applied math try to be a little closer to real, recent applications than has been traditional in applied math texts. I'm not sure that the being closer is crucial or even very useful for making real applications, but maybe it will help.


Does anyone know if this is available as a hard copy?


You can use a service like printme1.com.


439 pages is $22.29 at printme1.com.


That's not too bad considering the content is equivalent to a first year graduate course textbook.


Which would likely cost $200 or so new in the current textbook market. So $22 is a bit better than "not too bad".


It's awesome that it's freely available.


Heads up - you can use the promo code `print-is-good-F15` to get 10% off. I chose to get a nicer binding and with the code it came to about $23.20


No. I attended a presentation by John Hopcroft (author of the book) and he told that China is the only place where you'd find a hard copy.


That seems more about computer science and graphs than what the average analyst would be doing.


Table of Contents:

* High Dimensional Space

* Best Fit Subspace & SVD

* Random Walks & Markov Chains

* Machine Learning

* Massive Data: Streaming, Sketching, Sampling

* Clustering

* Topic Models, Hidden Markov Process, Graphical Models, and Belief Propagation


I'm new to the field (which is why the book interests me), so I have to ask: is your TOC a confirmation or a contradiction of the parent comment?


The table of contents is pretty much the core topics of serious data science (as opposed to "learn to use data science libraries in $lang in 21 days").

So yes, this book, available for download from a Turing Award winner's web page, IS a computer science book for data science.


Avrim Blum is not a Turing award winner; his father Manuel is.


This book was posted on John E. Hopcroft's webpage - https://www.cs.cornell.edu/jeh/ - and he has won a Turing Award.


A contradiction - data science is the place where statistics and computer science meet, but this book definitely isn't characterised as a book about computer science and graphs. From the TOC, it is heavily based in statistics.


Data Scientist here. I use these, often.


Hi, sorry for responding in this thread, but I saw your profile and was wondering if you had contact info I could reach you at? I would be really interested in learning more about your work.


See my profile.

Edit: I'm not sure why email is not showing up. Here is an old site with contact info: https://sites.google.com/site/thomasroderick/


How many dimensions would you prefer your hardcopy to span?


There is a difference between Data Scientist, Data Engineer, and Data Analyst.

"Data Scientist" is for Ph.D.s or sat least Masters, those are the less common jobs.

At least that's the theory; we all know what happens with titles... Anyway, the point is that we can safely assume that this particular article, "Foundations of Data Science", refers to the actual "scientist" role.

https://www.datacamp.com/community/tutorials/data-science-in...

https://bigdatauniversity.com/blog/data-scientist-vs-data-en...


That would be a very useful separation. Unfortunately, job listings and titles used in industry do not separate themselves nearly so cleanly as that. I managed a "data science" team for a few years; nearly all our work was analysis. We still called ourselves "data scientists" because it elevated our prestige within the company, which made our results more respected and listened to. It's a political thing, and it also helps in hiring.



