The university I work at just reorganized its administrative structure to put institutional research at its core. That mission involves data mining, statistics, and assessment efforts. We (the IR folk) have been pushing data for years, and it has only recently been getting proper attention at the decision-making/VP/COO administrator level.
I perceive the ability to perform predictive modeling on large data sets as increasingly valuable. A basic example: scoring how likely each student in a set of 4,000 high-school seniors in a county is to enroll at a given university and complete a degree.
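For concreteness, here's a minimal sketch of what such an enrollment-scoring model might look like. The file, feature names, and label are all hypothetical placeholders; the real thing would need far more care:

    # A minimal enrollment-scoring sketch (hypothetical data and features).
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("seniors.csv")  # one row per high-school senior (invented file)
    X = df[["gpa", "campus_visits", "distance_miles"]]  # invented features
    y = df["enrolled"]  # 1 if the student enrolled and finished a degree

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    model = LogisticRegression().fit(X_train, y_train)

    # Score every senior: estimated probability of enrolling.
    df["enroll_score"] = model.predict_proba(X)[:, 1]
    print("holdout accuracy:", model.score(X_test, y_test))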
It is still early stages, but online marketing and optimization will move from simple reports and metrics to sophisticated modeling of visitor trends and behaviors. Large ad companies already do a ton of data mining and modeling, but increasingly small- and medium-sized businesses will start leveraging statistics to optimize their businesses and websites.
Right on. I work in healthcare IT, and our most revered department is clinical analytics. These people are not technically savvy, but tools like Cognos can turn anyone with some basic math skills into a data analyst. You can literally drag and drop fields from any SQL database and make pivot tables out of them and generate reports. Of course, there is a huge difference between someone who can drag and drop fields and someone who knows enough about the underlying data to actually generate good analysis.
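For the curious, here's roughly what those drag-and-drop tools are doing under the hood, sketched in Python with pandas; the database, table, and column names are all invented:

    # Pull rows from a SQL database and pivot them, pandas-style.
    # Database, table, and column names are invented for illustration.
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("clinic.db")
    df = pd.read_sql_query("SELECT clinic, month, visits FROM encounters", conn)

    # One row per clinic, one column per month, total visits in each cell:
    report = df.pivot_table(index="clinic", columns="month",
                            values="visits", aggfunc="sum")
    print(report)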
+1, especially to your last sentence. That right there is my biggest medical informatics pet peeve: a lot of hospitals and clinics have "analysts" who are good at using the tools, and maybe even have a good grasp of the data schema of their repositories... but who have very limited clinical knowledge, and even more limited knowledge of the workflows that generated their data in the first place.
So, what happens is that they generate some numbers without fully understanding the "story" behind them. For example, they might get a request for data concerning the frequency with which patients with condition "X" are treated at their hospital. In most EHR systems, the way to answer questions like this is by using ICD codes... however, there is rarely a 1:1 relationship between what we might think of as a "diagnosis" and a code. Depending on how it's defined, even something seemingly simple such as "Asthma" might be represented in an EHR by (for example) a dozen different codes, and which codes are used can depend heavily on a wide variety of factors: how the EHR's designers implemented the diagnosis system and user interface, how the clinicians were trained to use the system, specifics of how the patient's symptoms presented themselves, how the billing department coded the clinicians' diagnoses, the phase of the moon, etc. etc. etc.

As a result, instead of a simple query ("find all patients with ICD code A"), the query ends up looking like "find all patients with codes A, B, C, D, E... or J; or code K, but only if it co-occurs with L, M, or N; or code O, if the patient was seen in clinic number 4 or 5 between such-and-such dates; etc. etc. etc." And that's for a simple and straightforward clinical question. Imagine if it were something more complex, like "how many patients with condition X also develop condition Y after having treatment Z".
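To make the point concrete, here's a sketch of what the "simple" asthma query turns into. All the code sets, clinic numbers, dates, and file names below are invented for illustration:

    # What the "simple" asthma query turns into. The ICD code sets,
    # clinic numbers, and dates are all invented for illustration.
    import pandas as pd

    dx = pd.read_csv("diagnoses.csv")  # patient_id, icd_code, clinic, visit_date

    PRIMARY = {"A", "B", "C", "D", "E", "J"}  # codes that count on their own
    NEEDS_CONTEXT = {"K"}                     # counts only alongside L, M, or N
    CONTEXT = {"L", "M", "N"}

    hits = set(dx.loc[dx.icd_code.isin(PRIMARY), "patient_id"])

    # Code K counts only if the same patient also carries L, M, or N.
    k_pts = set(dx.loc[dx.icd_code.isin(NEEDS_CONTEXT), "patient_id"])
    ctx_pts = set(dx.loc[dx.icd_code.isin(CONTEXT), "patient_id"])
    hits |= k_pts & ctx_pts

    # Code O counts only in clinics 4-5, within a date window (ISO date strings).
    o = dx[(dx.icd_code == "O") & dx.clinic.isin([4, 5])
           & dx.visit_date.between("2009-01-01", "2009-12-31")]
    hits |= set(o.patient_id)

    print(len(hits), "patients matched")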
Coming up with a query like that takes significant clinical knowledge, but, more importantly, it requires intimate knowledge of the organization that created the data in the first place. It also requires some pretty serious "people skills": the clinicians the analyst will be working with to formulate the question will know virtually nothing about computers or databases, so it will fall on the analyst to work with the clinicians to elucidate the implications and edge cases of the original question. It's kind of like being a detective. This, by the way, is a big part of why it's so hard to get good (as in reliable, valid, and comparable) quality measures from large health care organizations. The data is often way more complex and ambiguous than novices realize, and (speaking from personal experience here) it often takes people who come from non-clinical backgrounds and are used to more straightforward analytical questions quite a while to realize just how far down the rabbit hole they've gone. What might seem like a simple question rarely is.
Of course, the fun doesn't stop once our analyst has finally generated some numbers. Whoever wanted the numbers in the first place usually doesn't think much about where they came from (cf. "automation bias"), and as such can go on to make ill-informed decisions as a result of some subtle mistake in the data (e.g., unbeknownst to anybody, the analyst's query missed a whole block of patients coming from a particular clinic, thereby underestimating the prevalence of asthma). This is doubly true when the consumers of the data are generic statisticians (as in, not specialist biostatisticians who are experienced in clinical data analysis). The first commandment of statistics is "Know thy data", and medical data is one of those areas where that's a trickier problem than usual.
You raise some excellent points. Our software mainly shows providers and payers how to minimize waste, fraud, and abuse of the system, so it requires a huge amount of customization because of each organization's use of coding. The clinical analytics comes in where they can analyze historical data and tell them "you could have saved X amount of money by coding this procedure differently" or "there is no medical reason to do procedure X if you've already done procedures Y and Z, thereby saving XX money." Analyzing this data requires not only a statistician's grasp of math, but medical knowledge and organizational knowledge as well.
If you were smart enough to be a data/stats geek and also had an MD, plus years of experience working as a doctor in a hospital, I'm sure you'd be worth your weight in gold, as this skill set is very rare.
It sounds like you guys make really useful software!
As you say, MDs who have the skills and are inclined to do this sort of stuff are few and far between. My grad program in medical informatics has a master's track whose graduates are mostly MDs, and they would be quite well qualified for this sort of thing... except that most of them go on to either be CIOs or implementation consultants, and typically make far more than analysts do.
I think it's something that they're going to have to start teaching in medical schools, however. As more and more places start taking quality improvement seriously, being able to think systematically about clinical data is going to become a very important skill for doctors to possess. Of course, our experience thus far with trying to get it into the curriculum has not been very encouraging. It's amazing: doctors love trying out new gadgets and drugs, so they're clearly not inherently afraid of technology or of change... but try to get them to modify their curricula, and they look at you like you're crazy.
I like this line: "The rising stature of statisticians, who can earn $125,000 at top companies in their first year after getting a doctorate, is a byproduct of the recent explosion of digital data."
So... someone who majored in math/physics/hard sciences, got the grades and test scores to gain admission to a top university, and went through a program with a high attrition rate that you're doing well to complete in 6 years can earn a bit less than a JD (in half the time) or an MBA (in a third of the time).
As Right Said Fred said:
I'm too sexy for this field,
too sexy for this field,
holds no ap-peal...
(OK, I'm a data geek, and this sort of thing actually sounds like far more fun than corporate law, and $125k starting is decent... but let's recognize that the reward-to-effort ratio is still not quite comparable with the professions, and the journey is quite a bit harder.)
Here is a question from a layman's (Stat 101) perspective: why don't we see many Statistics PhDs, armed only with a computer, access to free public data, and some programming skills, CONSISTENTLY make a killing on Wall St.?
Is it because statistics fails once the number of variables and the complexity increase to reflect real life?
They are making a killing, they just don't talk about it. I know several people who are doing this. Currency trading, predicting dog and horse races, options trading: all one-man shows, each with a tiny little edge that nobody else has. Each is making a couple million a year while trying to figure out the next little trick (all the straightforward tricks are owned by Goldman's super-fast computers). More than once I have seen one second of lag cost them big, but overall they are maxing out the opportunities their little (<20k LOC, usually in VB of all things, sometimes in Python or Ruby) stats programs have found.
Furthermore, I work as a business intelligence quant for the online tech space. I've DRASTICALLY increased ROI rates for online ads, as well as conversion metrics, with my formulas and clustering models. There is so much low hanging fruit out there it is crazy.
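Not the real models, obviously, but here's the flavor of the approach, sketched with invented file and feature names: cluster visitors on behavior, then compare conversion rates per cluster to decide where the ad spend goes:

    # Not the real models, just the flavor: cluster visitors on a couple of
    # behavioral features, then compare conversion rates per cluster.
    # File and feature names are invented.
    import pandas as pd
    from sklearn.cluster import KMeans

    visitors = pd.read_csv("visitors.csv")  # pages_viewed, seconds_on_site, converted
    features = visitors[["pages_viewed", "seconds_on_site"]]

    km = KMeans(n_clusters=4, n_init=10, random_state=0)
    visitors["cluster"] = km.fit_predict(features)

    # Which segments convert best (and so deserve more ad spend)?
    print(visitors.groupby("cluster")["converted"].mean())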
If anyone wants to get into this field I'd be more than happy to point you in the right direction.
"But how much of that data will be privately owned?" ... and how much of that data will leak out anyway? Or maybe people will stop caring about privacy altogether? Even data acquired in dubious ways will require statisticians for analysis, maybe even more so.
It's not the raw skills; those have been in demand for a long time. The next big thing is going to be the marriage of analytics with intuitive interfaces. The standard model currently is pretty much PowerPoint presentations. PowerPoint sucks, and yet thousands of businesses use it as the main way of communicating important statistics.
Completely agree. My company is currently doing a lot in analytics and data analysis. A background in statistics is highly desired, but I find many of the computer engineers lack the math skills to take on the analytics problems alone.
I don't think it should be taken literally, because statisticians come in different guises. Nobody ever says that being a statistician is hot, yet there is a lot of interest in quantitative analysis and HFT (even here on HN), which is all about statistics.
Looks like the article might provide some 'luster' to IBM; they are going to put 4000 people on this. Hmm ....
What the article predicts would, could, and should happen, but it won't. Here's the problem:
Let's start with the 'status' of statistics:
Academic Teaching: In academics, the courses available rarely go beyond just some Stat 101, experimental design, or applied regression analysis. The teachers rarely have much expertise in statistics; e.g., they rarely understand the strong law of large numbers, the Radon-Nikodym theorem and its connection with sufficient statistics, or the Lindeberg-Feller version of the central limit theorem. Net, the teaching sucks.
Academic Research: The quantity of good academic research in statistics is meager. The applied statistics research such as in the article would not be regarded as solid research. The grant support is far behind that for physics (theory, particle, applied), biomedical, computer science, engineering, or pure math. Net, the research sucks.
Ph.D. Programs. One can count all the good Ph.D. programs in statistics with one's shoes on. So, over the past 40 years one might count Berkeley, Stanford, Chicago, Cornell, Yale, Hopkins, and UNC.
Computer Science. Yup, to do much in statistics, one needs computing. So, much of the public and academic computer science community swallows the idea that computer science has expertise in statistics. No, it doesn't; not while its people can't state the strong law of large numbers, and nearly no one in computer science can: they just didn't take the right courses in grad school. About all CS can do is pull equations it doesn't really understand from cookbook statistics and try intuitive heuristics, and that is similar to medicine in the days of snake oil cooked up on wood stoves. Suckage.
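For the record, the statement in question fits on one line; in its standard (Kolmogorov) form:

    % Strong law of large numbers, standard form: for $X_1, X_2, \ldots$
    % i.i.d. random variables with $E|X_1| < \infty$,
    \[
      \frac{1}{n} \sum_{i=1}^{n} X_i \longrightarrow E[X_1]
      \quad \text{almost surely as } n \to \infty.
    \]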
Professionalism. Law, medicine, and parts of engineering are 'professions' with certifications, licensing, liability, and strong professional societies. Statistics isn't a profession in this sense. Uh, such 'professionalism' ranges from important to crucial for 'branding' and credibility with customers outside the profession. Medicine has it; statistics doesn't. Indeed, in academics, a suggestion that statistics should be 'professional' is anathema. Students who want to get their fellowships renewed will keep their mouths SHUT and never say such things. Suckage.
So, net, the status of the field sucks.
Okay, now we can move on to why the field won't catch on in business:
We have to notice that nearly no one high in business now or on the way to being high in business knows more than just some elementary applied statistics, from long ago, that they never understood very well, never really used, and was likely poorly taught. Also they have not seen much of significance in business from anything at all serious in statistics. They know about the importance of computing, the Internet, and maybe some of assembly line robots, supply chain optimization, comparisons among planes, trains, trucks, biomedical research, even efforts in applied nuclear fusion, but they nearly never attribute significant importance to statistics.
So, suppose there is a good statistician, in a business, with some good data and with some powerful techniques in statistics that can convert that data into new information valuable for the business. Suppose this statistician writes an internal memo to his supervisor and proposes that the company fund the statistician to work on delivering the value to the business.
Here's what happens: The memo goes up the management chain of the statistician to the first manager who doesn't have much respect for statistics. Given the status of statistics, don't expect the memo to go up very far.
Then this manager sees two cases:
(1) The project fails. Then the manager will have a black mark on his record for sponsoring some contemptible, risky, wasteful, 'blue sky, far out, ivory tower, intellectual self-abuse, academic research project'. Bummer.
(2) The project is successful. Quickly everyone in the management chain who does not understand statistics will feel threatened. There is a rumor that a woman in the office complained that once, from 100 feet away, the statistician looked at her in a way that made her feel "uncomfortable", and the statistician is GONE.
So, the manager sees only disaster whether the project is successful or not, and the project doesn't get funded. If the statistician proposes a second such project, then he's a 'loose cannon on the deck', out of control, insubordinate, not a 'team player', and gone.
Or: a middle manager in a big organization can fund big projects he doesn't understand in computing, supply chain optimization, assembly line robots, etc., but, due to the status of the field of statistics, he can't fund a project in statistics.
There is really only one way for statistics to come forward in business now:
The guy with the valuable work in statistics starts his own business and sells just the results. The customers like the value of the results for their businesses and don't have to address anything else.
But, for this business the statistician is totally on his own: There isn't an 'information technology' venture partner anywhere in the US who would touch his project with a 10 foot pole, again, for much the same reason as the business manager.
The statistician MIGHT get some seed funding if he shows a good user interface, or Series A funding if he shows good ComScore or revenue numbers, but he may well be advised to keep quiet about the role of 'statistics'.
Or, the venture partners believe in Markov processes: The future of the business given ComScore numbers is conditionally independent of the statistics in the 'secret sauce'! So, look at the ComScore numbers and f'get about any 'statistics' in the 'secret sauce'. This Markov assumption is not fully justified, and likely not a single venture partner in the country could give a solid definition of conditional independence, but this is still the situation.
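For reference, the definition being invoked here, in its usual form:

    % Conditional independence: $X$ and $Y$ are conditionally independent
    % given $Z$ if and only if, for all measurable sets $A$ and $B$,
    \[
      P(X \in A,\ Y \in B \mid Z) = P(X \in A \mid Z)\, P(Y \in B \mid Z)
      \quad \text{almost surely.}
    \]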
And that's the way it is.
So, it's tough to make statistics applied; call this situation a 'problem'. Then, for someone with some new, powerful, difficult-to-duplicate work in statistics, work that can take some of the oceans of data available now, deliver valuable results, and offer a clear path, with just a bootstrapped company, to high profit margins and rapid organic growth, the flip side of this 'problem' is an opportunity.
A lot of "applied statistics" is now performed under the headings of "machine learning" and "data mining". Both fields are thriving.
Furthermore, Bayesian methods have come back to the fore in the past ~15 years. They are quite likely the future of statistics, especially in academia.
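To make "Bayesian methods" concrete, here is the textbook conjugate example: a Beta prior on a success probability yields a closed-form posterior after n Bernoulli trials:

    % Beta-Binomial conjugate update, the standard textbook example:
    \[
      p \sim \mathrm{Beta}(\alpha, \beta), \qquad
      k \mid p \sim \mathrm{Binomial}(n, p)
      \;\Longrightarrow\;
      p \mid k \sim \mathrm{Beta}(\alpha + k,\ \beta + n - k).
    \]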
You can't solve all problems with machine learning and data mining. How would you apply those methods to, e.g., psychological experiments, or to test planning where every single test costs a considerable amount of money?
Situations like that are not why statistics is becoming the new in-demand skill. Expensive trials have existed for a long time. Petabytes of barely-structured data haven't.
I wasn't commenting on the article but on the statement above, that applied statistics is dominated by ML and that Bayesian statistics is the future.
<quote>We have to notice that nearly no one high in business now or on the way to being high in business knows more than just some elementary applied statistics, from long ago, that they never understood very well, never really used, and was likely poorly taught. Also they have not seen much of significance in business from anything at all serious in statistics. They know about the importance of computing, the Internet, and maybe some of assembly line robots, supply chain optimization, comparisons among planes, trains, trucks, biomedical research, even efforts in applied nuclear fusion, but they nearly never attribute significant importance to statistics.</quote>
You've never worked in finance, have you? Modern economics, especially the portions dealing with finance, is indistinguishable from applied statistics. The Black-Scholes option pricing formula that is at the heart of almost all trading algorithms is statistical. VaR and other risk metrics all rely heavily on statistics. Now, it's an open question whether such a reliance on statistical methods is a good thing for the industry. However, you cannot say that managers in finance companies don't appreciate statistics. They appreciate it very much, since those statistical models determine their bonus at the end of the year.
From all I know, your rosy picture for statistics, etc. on Wall Street is not realistic.
The Black-Scholes option pricing formula is simple to program and has been programmed many times by people with essentially no knowledge of statistics, or of stochastic differential equations, or of the connections between probability theory and potential theory.
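To illustrate just how little is involved, here's the standard call-pricing formula in a few lines of Python (the textbook formula, nothing proprietary; the sample parameter values are arbitrary):

    # Black-Scholes price of a European call. S = spot, K = strike,
    # T = years to expiry, r = risk-free rate, sigma = volatility.
    from math import log, sqrt, exp, erf

    def norm_cdf(x):
        """Standard normal CDF via the error function."""
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def black_scholes_call(S, K, T, r, sigma):
        d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
        d2 = d1 - sigma * sqrt(T)
        return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

    print(black_scholes_call(S=100, K=105, T=0.5, r=0.01, sigma=0.2))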
I could go down a long list of evidence, but I will mention just two points:
I still have a letter back to me from Fischer Black saying that he saw no such opportunities.
Jim Simons has stated that he doesn't use math in his work.
I'm not sure which schools you're referencing when you write about the status of academic teaching. If a university has a statistics program, it must offer something beyond a course in basic statistics.
Grants for statistics research do seem to be relatively lacking. Possibly the increased emphasis on statistics in data analysis and machine learning will change that.
The listing of good statistics programs is missing at least a handful of schools: UCLA, UMich, UWisc, CMU (machine learning dept), and GMU (computational stats PhD).
I'm coming from computer science, but it appears the state of statistics in industry and academia is improving. Off the top of my head, here are some tech companies that depend on and employ statisticians: Facebook, Google, bit.ly, FlightCaster, Twitter, BackType, OkCupid, and Microsoft. The title might be Data Scientist or Search Engineer, but it's still statistics and probability.
Yes, work in machine learning and computational statistics, whether the title is data scientist or search engineer, should count as 'statistics'.
But your generous attribution that this work is actually statistics has a serious failing: the backgrounds of both the students and the professors rarely include the prerequisites for making serious progress with statistics.
Maybe their work NEEDS 'statistics' but their backgrounds rarely permit them to make progress in 'statistics' even on the problems they are addressing.
E.g., I worked in one of the world's better artificial intelligence groups, and coming through were bright graduate students (part time, summer, etc.), all awash in 'machine learning'. The work was junk, and here's why:
There were some people in computer science departments who wanted some progress in something roughly related to computer software that 'learned' in some sense. So, they tried things. Mostly they tried just heuristics, especially ones they got from just maybe guessing at how humans did things. Or for a while they were all hot on 'neural nets' -- promising for simulating a few neurons in an earthworm. The criterion of progress was mostly just, did the resulting software appear to do something good?
Basically they were just starting with a blank slate and had next to no significant background in anything powerful. About the deepest thing they knew was, maybe, LALR parsing. They knew how to program computers but didn't know what to program.
In addition, and much more serious, there was a 'methodological gap' the width of the Pacific Ocean. That criterion of, does the resulting software appear to do something good, is, in the history of statistics, applied math, pure math, and mathematical applications in physical science and engineering, just JUNK. Such software might be this and that, but it's NOT 'statistics'.
Finally I had to explain to one of those intuition pushers:
Here's how to exploit mathematics to do applied problem solving. In a transportation analogy, we start with a real problem at point A and want to get to a real solution at point D. For this trip we can start out walking, without a map, across deserts, rivers, swamps, and oceans, and maybe get to D.
But here's what we SHOULD do: first get a taxi ride to a local airport, Assumptions, at B. At B, get a plane trip on Mathematics Airlines to another airport, Conclusions, at C, close to D. Then take a taxi to D.
The math connects the assumptions to the conclusions. Here the math is based on theorems and proofs. The taxi trips are supposed to be SHORT.
Then for the logical validity of the work, we get to check the math and then argue over the taxi trips.
The key is the math, and there we look carefully at assumptions, theorems, proofs, and conclusions.
The 'research' content is new math for new connections for a new pair of assumptions and conclusions and, usually, a new pair of problem and solution.
To do such math, one needs a good background in math, typically an undergraduate major in pure math and about two years of graduate school in focused applicable math. Of high importance will be the 'mathematical sciences': probability (based on measure theory), statistics (also based on measure theory and functional analysis, e.g., for weak convergence and sufficient statistics), stochastic processes, especially Markov processes and martingales, and optimization, including combinatorial and stochastic.
E.g., without the first half of Rudin, 'Real and Complex Analysis', just F'GET about it; that material is not sufficient but it is NECESSARY.
As an example, once I took an important problem in practical computing and executed the 'methodology' above. I got some nice results for the real problem and, really, a nice step forward for computer science. My paper connected carefully with instances of the real problem and with real data, but the core of the work was the new theorems and proofs.
So, I went to get the paper reviewed. From two chaired professors of computer science at famous research universities, editors in chief of top computer science journals, I got back essentially the same wording: "Neither I nor anyone on my board of editors has the background to review the math in your paper." For a third such person, I wrote tutorial notes for two weeks before they gave up. At one journal, the editor gave up, but the editor in chief stepped in and handled the paper himself. Likely the computer science people he asked said, "It's nice for practical computing and computer science, but I can't say if the math is correct," and likely the math professors said, "The math is correct, but I don't know what it means for computer science."
Broadly, research computer science can so far rarely do research in applied math or mathematical statistics. A Ph.D. in computer science just does not provide the right prerequisites for such research; those of a focused applied math Ph.D. do. Sorry about that. So, yes, for broad areas of needed research progress in computer science, the computer science community is essentially irrelevant.
As I said, for 'statistics', the computer science people are limited to intuitive heuristics and picking formulas they don't understand out of elementary stat books, and that is like snake oil cooked up on a wood stove. Again: The reason is, they don't have the prerequisites.
Uh, my paper could have been called 'machine learning' or 'artificial intelligence', but I just called it some theorems and proofs for some progress in computer science for an important problem in practical computing. It would also be appropriate to call the work some nice progress in mathematical statistics.
We have seen this at KU, but now it's changing due to receding budgets. The culture has moved away from creating some project and sorta 'guessing' that it works based on anecdotal evidence and student surveys, since there simply isn't the money to keep the programs going. As an entire organization, we've had to focus our efforts on proving that spending money in such a way is worth it in the long run.
The Institutional Research department(s) here have a long-standing track record of trust because they work so closely with the faculty; I'm not sure that the same environment exists in the private sector. In fact, if pressed, I'd say that if the trust between the faculty and the IR statisticians/data miners didn't already exist, we would be focusing our efforts elsewhere.
Maybe this loose-cannon statistician should make the first couple of pages of his proposal a refresher course on the dependence that a couple of big industries, agriculture and pharmaceuticals, have on statistics, and on how SAS is installed in nearly every Fortune 500 company.
"Academic Teaching: In academics, the courses available rarely go beyond just some Stat 101"
Not if you study statistics. The problem with learning only "applied statistics" is that people lack the mathematical foundation to understand what they are actually doing.