I’ve worked in the field for 7 years now, so not that long, but long enough to build some heuristics. The best data scientists are just people who try to understand the ins and outs of business processes and look at problems with healthy suspicion and curiosity. The ability to explain the nuances of manifolds in SVMs doesn't come into it outside these contrived interviews. I prefer to ask candidates how they would approach solving a problem I’m facing at that moment rather than these cookie-cutter tests, which are easy to game and tell me nothing.
>> I prefer to ask candidates how they would approach solving a problem
Word. Totally off topic, but: I've worked in information technology for more than 20 years now, more or less. Not always the same focus, not always full time, but always IT-related. I consider myself a good problem solver because of my self-learning and analytical skills.
I recently applied for a job as a BI developer. The interview consisted of 10 questions about SQL. I more or less answered them, getting just 1 or 2 wrong. Not wrong as in "incorrect" but rather "not exactly what we expected" or "you did not see the little traps".
Turns out they didn't take me because of my lack of SQL skills. I do not understand how this kind of recruiting process helps anyone get skilled people, or how it is still common practice. It's frustrating for people like me, who do not have the complete SQL syntax in mind but are flexible in choosing their problem-solving approaches. A couple of years ago I started at a big data company, having never heard of MongoDB before and with little skill in Bash. If they had just asked me questions about those, hiring me would have been a total no-go. They did hire me. I improved processes, measurably, and mastered MongoDB. Nothing that one could predict from a questionnaire.
A second interview, same outcome. They didn't even ask me detailed questions, they just wanted to know what my SQL skills were. I answered: intermediate, but I'm good at learning. They did not take me either.
Although I understand that it's hard to evaluate this kind of skill, I'm really frustrated when I face those "hiring techniques". Or maybe I'm just not good at SQL, and they anticipated it... ;)
I think this must happen in all computer engineering fields. It's happened to me numerous times when applying for DevOps positions.
Interviewers don't seem to realize that knowledge and fluency are a trade-off. If I'm amazing at SQL, I'll have gaps elsewhere, and vice versa. There's just too much to learn and stay on top of.
My takeaway when I fail an interview due to nonsense like this is that these aren't places I would've been happy working at anyway so they did me a favor by not hiring me.
There's another possibility but only because you mentioned SQL specifically. People often use SQL as shorthand for "understand how to manipulate data," and if a new hire doesn't have this skill it can really set a data team back, so interviewers are touchy about SQL. It would be helpful if they clarified if it was the SQL syntax skills or the data manipulation skills they had a problem with. But generally I agree with you that companies should prioritize problem solving skills over technical minutiae.
The - to my mind - most absurd version of this was a well-known Swedish fintech I interviewed with where the one barrier they had set up was one of those online IQ tests (rotating, mirroring etc. blobs on a grid).
I was interviewing for a senior data science role and the other points of contact I had in the process were surprisingly non-technical conversations with a VP of data and another senior data scientist.
Alas, my skills in rotating blobs on a grid failed me that day so that was that.
Yeah, I instantly know which company you're talking about...
I know a couple of developers there and as far as I understood all applicants have to do the IQ test. They also thought it was ridiculous, but the CEO really really likes them, so it stays.
The way I see it, an interview goes both ways. They try to assess your fit in the company, and you're also trying to assess if the company is something you want to spend your time on.
I call this the "game show style" interview. You know the answer, you get $10k. Next question. You didn't know the answer? You're out of the show. Next contestant.
To me this is very disrespectful of your time and your capabilities. Once I was pumped to go into an interview with a very well known company in Sydney, but was dismayed to learn that they do this game show interview (with IQ test, no less!). I wouldn't want to waste my most precious resource (i.e. time) on a company that doesn't respect me.
I can try to help you pass the SQL round. I’m going through DS/Analytics Eng interviews myself. I’ve passed all of my technical rounds, but failed the final rounds so far. Email me at aok1425 at gmail.
I completely agree! I run a data science department at a corporation and I give applicants real world data and ask them to answer real world questions. "Here's a dataset of the dates we sent an email and the open rate. Does one day of the week have a higher open rate than the others?" People have to know how to turn a string into a date, turn a date into the day of week, and then apply statistical reasoning and then give me a one sentence answer. That kind of thing is far more common at work than explaining the kernel trick.
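Something like this pandas sketch is roughly what I have in mind (the file name and the 'send_date'/'open_rate' columns are made up for illustration):

```python
import pandas as pd
from scipy import stats

# Hypothetical columns: 'send_date' (string) and 'open_rate' (float).
df = pd.read_csv("email_campaigns.csv")

# Turn the string into a date, then the date into a day of week.
df["send_date"] = pd.to_datetime(df["send_date"])
df["day_of_week"] = df["send_date"].dt.day_name()

# Eyeball the group means first.
print(df.groupby("day_of_week")["open_rate"].agg(["mean", "count"]))

# One-way ANOVA as a rough check on whether any day differs;
# a pairwise or non-parametric follow-up may be more appropriate.
groups = [g["open_rate"].values for _, g in df.groupby("day_of_week")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.3f}")
```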
Agree. In addition to curiosity another quality I think is very important is persistence. A surprising amount of being an effective data scientist comes down to being able to learn new things quickly and being able to make the computer do what you want it to, especially when the first few things you try don't work.
Success is often a trial and error process, involving slowly building a deep understanding of the problem you're trying to solve (business and technical), hitting lots of problems, and not giving up too easily (at least, not giving up due to surmountable technical hurdles).
This often means spending many hours banging your head into a brick wall; but in my experience these are often the times I'm learning quickest even if it doesn't feel like it at the time.
Diving deep into a domain and interfacing with a subject matter expert goes a long, long, way. We've been building custom data products for enterprise for about seven years, and a lot of the work is listening and grokking large amounts of content about whatever domain we were helping with in general, and the specifics of our clients. Retail, banking, telcos, energy, communication.
Having a background in acoustics, reservoir characterization, telecom networks, opens up clients because you 'get it' or at least you work hard to get it, which improves buy-in of the experts to sit down with you and answer your questions. You did your homework.
If you don't and just storm in talking about something something neural nets, they'll see it as a waste of time, won't bother explaining nuances, will delay sending data you desperately need. You won't have their cooperation even if you have executive support. There's no data in CSV form or an API to hit in most real world projects, so you need their help getting data, and their expertise to understand it.
Another major point is specifying the metrics. The real world metrics, not AUC or F1 scores. You need collaboration to get there, too.
There's so much to be done before there's data to work with, let alone good data. And there's so much after the model building step.
It can drive people to quit. One reason is that when you storm in and consider that people are morons, you get frustrated rapidly.
It should be taught. Lack of humility almost cost us a project. In that instance, humility unscrewed the project and unblocked 400k by simply sliding a sheet of paper and a pen to the client across a table, and asking them to draw the dataflow they thought we were going to build. A couple of boxes and arrows made it clear what the problem was. We drew the actual dataflow. The security person said "Oh, I thought... OK.. if it's like that then we're good to go". Legal said they're OK with that. Data people said OK. A dozen people were relieved.
The previous person on the project, although brilliant technically, thought they were "idiots who didn't understand crypto", as if it were the end goal. All it took was to keep quiet for a second and listen to what they had to say, and let them talk about what was problematic, instead of snark.
I am actually starting to believe that (too) smart developers can actually be a hindrance in the wrong circumstances. We have a prima donna on my team and, while smart, he doesn't make things easy to follow for the next dev.
> The best data scientists are just people who try to understand the ins and outs of business processes and look at problems with healthy suspicion and curiosity
I guess the unsaid part of this is “... and curiosity AND are often able to leverage this in data-driven solutions to business problems”. Because nobody cares whether you’re curious or not if you don’t bring any value to the company. With that said, the part you mentioned almost seems like an innate ability, while the part I filled in could probably be trained.
Suppose you were still in college and wanted to become a data scientist. Based on your current knowledge and values, how would you go about it, being a student? Which skills would you hone and how?
Yes, it’s a given you need to understand what the different techniques are doing and when to apply them, but rarely to the level of depth expected in these white-board interviews that are often just thinly-disguised ways to grill and belittle nervous candidates.
If I could redo my twenties, I would tell myself to choose some topic I cared about, find publicly available data on that topic, and just start exploring the dataset with basic pivot tables and graphs. Ideally write up what you found and publish it somewhere. As you find interesting things in your data you'll naturally start asking more questions, you'll learn modeling techniques as a function of those questions, and writing about it will help you become a clear communicator (this matters far more than technical knowledge).
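For example, a first pass in pandas can be as simple as this (the dataset and column names here are purely hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example: a public dataset of city bike trips.
trips = pd.read_csv("bike_trips.csv", parse_dates=["start_time"])
trips["hour"] = trips["start_time"].dt.hour
trips["weekday"] = trips["start_time"].dt.day_name()

# A basic pivot table: average trip duration by weekday and hour.
pivot = trips.pivot_table(index="weekday", columns="hour",
                          values="duration_min", aggfunc="mean")
print(pivot.round(1))

# A basic graph: trips per hour of day.
trips["hour"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("hour of day")
plt.ylabel("number of trips")
plt.show()
```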
This is why I'm switching from DS to SWE. The communication hurdles with the business people are so hard for me. I've talked with other nerds my whole life and struggle to connect with and dissect the other side. That and the pay is better.
SWE has its own problems. Business will want X done by Y time and doesn’t speak Agile/sprint/tech debt/etc. Also, you’re going to deal with resume-driven developers who want to use some shiny new tech when there’s no valid reason over something plain and boring that works. The grass is never greener.
The quality and depth of answers here is pretty inconsistent. But this in particular is a pet peeve of mine:
> Plot a histogram out of the sampled data. If you can fit the bell-shaped "normal" curve to the histogram, then the hypothesis that the underlying random variable follows the normal distribution can not be rejected.
This is commonly taught in undergrad stats, but you shouldn't do this. I'm of the opinion that normality testing in general is usually a red herring, but this is specifically not a productive way of doing it. Use the other methods. A visual test that relies on how much the histogram approximates a bell curve is very prone to error, because a sample from a variety of other distributions can look visually normal even though it isn't.
More broadly speaking, the reason I don't like this is because it's an example of the kind of formulaic, cargo-culted recipes that are often used in statistics without critical thinking. You should strive to obtain a deep understanding of your data and its distribution, and you should be deeply skeptical if the sample you happen to have looks normal. Nature abhors normality, and the central limit theorem can only promise a tendency towards normality as n approaches infinity. It says nothing about what size sample you'll practically need for your specific data to be able to treat it as normal.
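To make that concrete, here is a rough sketch using scipy, with a heavy-tailed t sample as a stand-in for "looks normal but isn't": the histogram can look bell-shaped while a Q-Q plot or a formal test tells a different story.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# A t-distribution with few degrees of freedom: bell-shaped histogram,
# but heavier tails than a normal distribution.
sample = rng.standard_t(df=4, size=500)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(sample, bins=30)                       # looks roughly "normal" to the eye
stats.probplot(sample, dist="norm", plot=axes[1])   # Q-Q plot shows the tails peeling away
plt.show()

# A formal test; note that with large n even tiny departures get flagged,
# so the p-value alone is no substitute for understanding the data.
stat, p = stats.shapiro(sample)
print(f"Shapiro-Wilk: W={stat:.3f}, p={p:.4f}")
```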
> You should strive to obtain a deep understanding of your data and its distribution, and you should be deeply skeptical if the sample you happen to have looks normal.
Although normality testing is useless in many situations, the parent comment somewhat overstates the degree of caution required. In many contexts the exact distribution doesn't matter; sort-of-normal is good enough. For example, the t-test is used ubiquitously. It assumes normality, so we would expect possible non-normality to be a major problem, right? Not so. The t-test is extremely robust to departures from normality given equal sample sizes [1,2,3]. Or you can use a so-called non-parametric test. Rather than investing great effort in specifying exactly what distribution you're dealing with, it's more productive to simply use a test that is robust against your unknowns and move on to pursuing your actual objectives.
It's true that if you are interested in predicting events in the tails of the distribution, you really do need to study the distribution in detail. Predicting rare events is very difficult. But if you're just interested in differences between group means, don't overthink it.
[1] http://www.jerrydallal.com/LHSP/student3.htm
[2] Posten, H.O., Yeh, H.C., and Owen, D.B. (1977). Robustness of the two-sample t-test under violations of the homogeneity of variance assumption. Communications in Statistics - Theory and Methods 11, 109–126.
[3] Posten, H.O. (1992). Robustness of the two-sample t-test under violations of the homogeneity of variance assumption, part ii. Communications in Statistics - Theory and Methods 21, 2169–2184.
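As a rough illustration of "pick a robust test and move on": running a Welch t-test and a non-parametric alternative side by side costs almost nothing (the distributions below are just made-up examples).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two equally sized samples from a mildly skewed distribution.
a = rng.gamma(shape=5, scale=1.0, size=100)
b = rng.gamma(shape=5, scale=1.1, size=100)

# Parametric: Welch's t-test (no equal-variance assumption).
t, p_t = stats.ttest_ind(a, b, equal_var=False)

# Non-parametric alternative: Mann-Whitney U.
u, p_u = stats.mannwhitneyu(a, b, alternative="two-sided")

print(f"t-test p={p_t:.3f}, Mann-Whitney p={p_u:.3f}")
```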
> The t-test is extremely robust to departures from normality given equal sample sizes
That's overselling it. It's very sensitive to fat tails and skew. That's the reason robust testing and estimation is a thing. Wilcox's research would be a good near-contemporary place to start [0][1].
Abstract: Conventional hypothesis-testing methods such as Student's t, the ANOVA F test, and methods based on the ordinary least squares regression estimator, are not robust to violations of assumptions. In fact, there are general conditions under which these methods can provide poor control over the probability of a type I error and inaccurate confidence intervals, no matter how large the sample sizes might be. Relatively poor power is yet another concern.

Conventional statistical methods have a very serious flaw. They routinely miss differences among groups or associations among variables that are detected by more modern techniques - even under very small departures from normality. Hundreds of journal articles have described the reasons standard techniques can be unsatisfactory, but simple, intuitive explanations are generally unavailable. Improved methods have been derived, but they are far from obvious or intuitive based on the training most researchers receive. Situations arise where even highly nonsignificant results become significant when analyzed with more modern methods. Without assuming any prior training in statistics, Part I of this book describes basic statistical principles from a point of view that makes their shortcomings intuitive and easy to understand.
> It's very sensitive to fat tails and skew. That's the reason robust testing and estimation is a thing.
Yeah, I shouldn't have said "extremely". It's much more robust than is commonly perceived, but that does not make it "extremely" robust. Thank you for the correction. But please note that I specified equal sample sizes, so the fact that "there are general conditions under which these methods can provide poor control over the probability of a type I error" is not really a refutation; I already implied that the t-test is not generally (i.e., in all cases) robust. I also mentioned non-parametric tests as a potential alternative. I do not wish to imply that the t-test is always the right choice. But I stand by the assertion that approximate knowledge of the distribution, such as obtained by inspecting a histogram, is perfectly adequate to choose a test. The main point remains: list what you know about your data, and pick a test that tells you what you want to know with a tolerable error level for your application.
You can spend an awful lot of time picking the "correct" statistical test (if there is such a thing; tradeoffs exist) with little gain. Worse, making a lot of decisions about what test to use after looking at the data leads to p-hacking, potentially leaving you with more bias than if you naively used a slightly-wrong test from the start.
To elaborate on the t-test discussion:
From your ref [1], "If sampling is from nonnormal distributions that are absolutely identical, so in particular the variances are equal, the probability of a Type I error will not exceed 0.05 by very much, assuming the method is applied with the desired probability of a Type I error set at α = 0.05. These two results have been known for some time and have been verified in various studies conducted in more recent years" (page 79). I think Wilcox makes this statement under the equal sample size caveat, but the text is unclear on this point. This does not guarantee robustness under nonnormality, but it does refute the idea that nonnormality is always fatal to the t-test.
Yes, a t-test can be affected by skew even with equal sample sizes. Whether this is a problem depends on the application. The example in [1] uses a lognormal distribution and ends up with actual α = 0.15 for n = 20 at a desired α = 0.05. Which isn't great, but is somewhat tolerable. Also, a lognormal distribution is strongly skewed right and left-truncated, which is easily noticeable on a plot and can be transformed to a normal distribution. So this does not refute the idea that a sort-of-normal distribution is good enough for the t-test.
You can look up the potential α level and β level errors from violating the assumptions of a particular test—they're tabulated. E.g., applying the t-test to a Pearson distribution produces a typical α level error of 0.005 at desired α = 0.05 [2]. That's definitely tolerable. This is why I stated that sort-of-normal is good enough for a t-test. It's also fairly straightforward to calculate the expected errors yourself for a particular situation, if reassurance is needed. If your statistics are intended to support a high-stakes decision this may be a good use of time. Working to validate your finding by other means is better, though.
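For example, a quick Monte Carlo sketch of that kind of check (the distribution and sample size here are arbitrary; every rejection is a type I error because both groups come from the same distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, trials, alpha = 20, 20_000, 0.05

# Both groups drawn from the SAME lognormal distribution, so every
# rejection is a type I error. Compare the realized rate to the nominal 0.05.
rejections = 0
for _ in range(trials):
    a = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    b = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    _, p = stats.ttest_ind(a, b)
    rejections += p < alpha

print(f"actual type I error rate ≈ {rejections / trials:.3f} (nominal {alpha})")
```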
I'm not sure what your ref [0] (the "Robustness of Standard Tests" book chapter) is arguing for or against from the abstract alone. I don't have a copy of that book. The abstract mentions most flavors of maximum likelihood ratio tests, which is too broad a set of topics to discuss effectively. The implication seems to be that null hypothesis statistical tests are bad and something else (Bayesian analysis? visual inspection?) is better, which I don't necessarily disagree with. If you could please clarify its contents and if it is worth tracking down, I would appreciate it.
[0] Wilcox, Robustness of Standard Tests
[1] Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy
[2] Posten, H.O. (1984). Robustness of the Two-Sample T-Test. In Robustness of Statistical Methods and Nonparametric Statistics, D. Rasch, and M.L. Tiku, eds. (Springer, Dordrecht), pp. 92–99.
No worries, and I totally agree that t-tests are surprisingly robust to some benign deviations from Gaussianity, a lot more than one would have thought. In my line of work I have had to watch out for fat tails (ridiculously common) and skew -- they can be potent t-test killers.
In the book Wilcox champions robust estimators and tests (a la Huber, Tukey) because efficiency of MLE is very brittle.
I'm guessing his concern is visually looking at a histogram and gauging its normality vs doing so with a q-q plot, which will likely show deviations from normality much better.
There's a test for normality. Well, I knew one, Kolmogorov, but Wikipedia already lists 8. What data scientist doesn't know of such a test?
And how would you fit the data? There's not one, unique way to fit. And as you say, a small deviation can mean a lot: if you use L2 distance, the errors at the outer regions of a normal distribution are probably dwarfed by any deviation that occurs more towards the center.
There are myriad ways for two things to be unequal. In fact, it's fair to say that unless you are taking two sets of measurements of some fundamental physical constant or measuring artificially generated data, they are not going to come from the same exact distribution -- "you cannot bathe in the same river twice" is very true.
The way to address the problem you speak of is to think hard about the kinds of deviations from equality that would be most damaging to the application (were they to slip in undetected). Once you know what the most damaging deviation is, you can select a test that is very powerful at detecting that specific type of deviation.
For example, you mention the Kolmogorov(-Smirnov) test. It is nearly blind at the tails of the distribution. So if you need to catch deviations at the tails, you can safely skip the KS test; it's no good there. On the other hand, KS tests are very powerful around the median; if your application requires high resolution there, KS is the test you want.
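A rough simulation sketch of that point (the contamination fraction and shift size are arbitrary choices): differences confined to the tails tend to leave the KS statistic unimpressed, while a modest shift near the median is picked up much more readily.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 2000

# Case 1: identical in the body, contaminated only in the tails
# (98% N(0,1) plus 2% wild outliers). The KS statistic is driven by
# the middle of the distribution, so this kind of difference is easy to miss.
normal = rng.normal(size=n)
contaminated = np.where(rng.random(n) < 0.98,
                        rng.normal(size=n),
                        rng.normal(scale=10, size=n))
print("tail contamination:", stats.ks_2samp(normal, contaminated))

# Case 2: a modest location shift, which moves mass near the median.
shifted = rng.normal(loc=0.15, size=n)
print("location shift:   ", stats.ks_2samp(normal, shifted))
```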
The weirdest thing about data science interviews (when I was actively interviewing) is SQL gotcha questions. Especially with window functions. Here's a long HN thread a few months ago about an annoying situation with an interviewer asserting uncommon syntax is common: https://news.ycombinator.com/item?id=23053981
Another related SQL gotcha I saw multiple times is finding the top n records of each group in a table. It's a know-it-or-you-don't implementation, and the interviewer can still be a jerk if they want by slamming the interviewee for including ties, or for not including them (RANK vs. ROW_NUMBER).
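For what it's worth, the same keep-ties-or-not distinction shows up in the pandas version of that question; a small sketch with made-up data:

```python
import pandas as pd

# Hypothetical table: one row per (category, product, sales).
df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b", "b"],
    "product":  ["p1", "p2", "p3", "p4", "p5", "p6"],
    "sales":    [100, 90, 90, 80, 70, 60],
})

n = 2

# ROW_NUMBER-style: exactly n rows per group, ties broken arbitrarily.
top_n_no_ties = (df.sort_values("sales", ascending=False)
                   .groupby("category")
                   .head(n))

# RANK-style: keep all rows that tie for a place in the top n.
df["rank"] = df.groupby("category")["sales"].rank(method="min", ascending=False)
top_n_with_ties = df[df["rank"] <= n]

print(top_n_no_ties)
print(top_n_with_ties)
```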
It's telling that there aren't any SQL window function examples in this repo.
Another fun aspect of SQL interviews is dialect-specific questions, particularly around how dates/times are handled. Years ago, a company famous/infamous for primarily using MySQL explicitly noted in a take-home assignment problem definition that the database was PostgreSQL, which allowed them to ask the aforementioned window function problem, use the AT TIME ZONE syntax for filtering, and require a specific definition of "beginning of week", which forced me to download the database and test it manually.
These kinds of stories seem insane to me. I currently have to do some data science-y stuff in the more generalist consulting dev role I have. I used to do a lot of SQL work years ago, but have forgotten most of the syntax beyond the basics. That being said, all it took was some minor googling and reading a few blog posts to solve some middling-hard problems for my client.
What kind of companies make people jump through these nerd hoops? Do they actually have real work that needs doing or is this all just posturing by interviewers?
I think an element of it is that you can have unambiguous questions with SQL, which taken to the extreme do fall into gotcha territory.
I was an interviewer for a business analyst & data science team recently, and we needed to hire several analysts & data scientists who had strong SQL skills because 90% of the data manipulation/processing used SQL. I was definitely aware that this limits who we hire, but we were very short-staffed and needed people, so it was easiest for us to hire people who had a good understanding of SQL. That said, we did not care about syntax subtleties, just whether you could generally answer questions using SQL.
And for the data-scientist role there was a lot more than just SQL but it was a useful check.
Linear regression does not require errors to be iid, normal, and homoscedastic for it to "work". One of the ways to separate candidates is to push on what (and how) assumptions can be weakened, what the consequences are for estimation and inference, and what sort of corrections can be incorporated for maintaining consistency, improving efficiency, correcting biases, etc. An entry level candidate may not have (nor need to have) a complete understanding of asymptotic theory, but they should know what the purpose of robust standard errors is and how to use them.
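For example, a minimal statsmodels sketch on synthetic, purely illustrative data: the robust fit leaves the coefficients unchanged and only adjusts the standard errors.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, size=n)
# Heteroscedastic noise: error variance grows with x.
y = 2.0 + 0.5 * x + rng.normal(scale=0.5 + 0.3 * x, size=n)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                    # classical standard errors
robust = sm.OLS(y, X).fit(cov_type="HC3")   # heteroscedasticity-robust (HC3) standard errors

print(ols.bse)     # conventional SEs
print(robust.bse)  # robust SEs; the point estimates are identical
```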
Interesting and perhaps shows the cultural differences between ML and stats people. I took a machine learning course in my bachelor's and two more ML courses in my master's (CS). These weren't some "deep learning lite", mess-around-in-Keras courses, because DL wasn't even big back then. We covered lots of stuff, Bayesian linear regression, Gaussian processes, Gibbs sampling, Metropolis-Hastings, hierarchical Dirichlet processes, SVMs, multi-class SVM, PCA, kernel PCA, perceptrons, CMAC neural nets, Hebbian learning, AdaBoost, Fisher vectors, EM algorithms for various distributions, fuzzy logic, optimization methods like conjugate gradients etc etc.
But not once were the "Gauss-Markov conditions" mentioned. Frequentist theory was only marginally addressed. I taught myself some of that stuff from the Internet, such as hypothesis testing theory, p-values, t statistic, ANOVA, etc.
Also, I'd say I'm good with data structures and algorithms, complexity theory, graph theory etc.
I thought these skills would be a good fit for data science jobs, but I guess it's really such a wide umbrella term, that probably you're more looking for people trained in the frequentist, statistical side of it. What application field are you in, if it's no secret?
By tribe I am firmly in the machine learning camp but I have serious doubts that one can be a good hands-on data-science practitioner if one does not have a good foundation in statistics.
Statistics is not really in focus in most CS programs. Indeed I'm not sure where it is. Perhaps in applied math programs. Stats is kinda too dirty and realworld for pure math types, and in science programs and medicine it's usually just taught as a bunch of magic formulas to memorize and rules of thumb passed down from generation to generation without understanding. Perhaps physics departments do have both the necessary math skills and the need for stats so they may provide a good education in this.
But having studied in 3 universities in different countries, CS just doesn't care about stats. Probability theory yes, but frequentist topics like statistical tests not really.
This was my experience at a high ranking engineering public state school in the US. Stats is delegated to the applied math program usually (in fact, my degree was titled "Applied Mathematics & Statistics"). You can choose to concentrate in subjects like algorithms, operations research, fin stuff, statistics, etc. CS as well as other sciences had one required intro to prob & stats, but that's it outside electives.
Further, despite having a fantastic reputation, my program only discussed frequentist ideas with near 0 mention of Bayesian reasoning/methods (outside the same Bayes rule questions asked in the first weeks of every stats class). Overall the education felt too traditional, I would have liked to seen mention of more modern methods like the bootstrap and certainly mention of Bayesian.
> Stats is delegated to the applied math program usually
Which is not unreasonable, considering that these things aren't really related to computers (although they happen to involve computation). I think it's just an artifact of history and how things happened that ML is associated with CS and EE departments, but really it's applied math, not a core CS topic like, say, compilers, formal languages, and complexity.
Data Science jobs cover a wide gamut from theoretical math to applied machine learning to statistics to data engineering to analytics. You can generally tell which one they're aiming for from the job description and requirements. Some are basically looking for a statistician while others look for a CS machine learning engineer. If the title is ML Engineer, then that generally indicates it's a lot more focused on what you studied than on stats.
Not really sure why your comment is marked dead when it is correct. Maybe it's a new account thing? As you say, the Gauss-Markov conditions are necessary (and sufficient) to make OLS BLUE, but OLS still works fine under a variety of pathological conditions, and many of those conditions can be tested for and adjusted for using common techniques.
I only read the first question in theory.md and think the answer is quite weak.
> What is regression? Which models can you use to solve a regression problem?
The current answer only lists some names that have "regression" in them, and the description of what regression is doesn't say anything that distinguishes it from classification.
It fails to mention that regression (in ML terminology) is the prediction of a continuous variable, and that almost any method can be used to do regression: kNN, neural networks, random forests, SVMs.
If the other answers are of a similar nature, you might fail the interview.
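To the point that almost any method can do regression, here is a quick scikit-learn sketch (synthetic data) fitting three very different regressors to the same problem:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three very different model families, all doing "regression".
for model in (KNeighborsRegressor(), RandomForestRegressor(random_state=0), SVR()):
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(type(model).__name__, round(mae, 3))
```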
One of my favorite data science factoids is how "regression" (to return to a former state) came to mean "prediction of a continuous variable".
In the late 1800s, Sir Francis Galton noticed that extremely tall or short parents usually had children that were not as tall or short as themselves, i.e. the children's heights were regressing (returning) to the mean. He collected hundreds of data points, graphed them, and estimated a coefficient describing this relationship, thereby inventing "linear regression."
We call them "regression" models simply because the first linear regression model was created to demonstrate the concept of regression to the mean.
The error here is ML terminology; the article answers correctly, if shallowly, the statistical definition of regression.
Classical regression techniques can be (and are correctly) used on binary, ordinal, or categorical dependent variables. I know we teach people doing ML that the two forms of supervised ML are classification and regression, but that does a disservice mostly in order to make visual examples in teaching easier by making every topic a binary classification question and then saying "oh yeah, this works for regression too".
Granted in an interview you'd probably want to use context in case the hiring people were trained on a specific vocab, but that maybe speaks more to the folly of using these dial-an-answer systems in place of actually learning the material.
Sure, I’m not saying that you have to correct it. Just that the project presumably doesn't assume it has the correct answers for everything and expects some participation from “the community” at large.
Sorry, I should have put it in my original reply. If you search for where it asks for an implementation of the standard deviation, you'll see that it asks for an implementation including Bessel's correction (using N-1 instead of N for the normalization, used to correct for the one degree of freedom lost by the calculation of the mean). However, when you look at the answer provided, you'll notice that it divides by N, and does not return 'NaN' for the condition N=1, as was asked.
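For reference, a minimal implementation of what the question seems to ask for (N-1 normalization, NaN when N = 1):

```python
import math

def sample_std(xs):
    """Standard deviation with Bessel's correction (divide by N - 1)."""
    n = len(xs)
    if n <= 1:
        return float("nan")  # undefined for a single observation, as the question asks
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(var)

print(sample_std([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # ~2.138
# numpy equivalent of the corrected version: np.std(xs, ddof=1)
```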
Do people realize that these interview question collections do not help? I think there are two things to address here:
- Interviewers will know "what is known" by every candidate (with the help of these pages), and harder questions will be asked.
- If these questions are asked at above-junior levels, then RUN! The work will not satisfy you. The interview should be fun and show the creativity of the candidate. These questions could be answered by anyone who has read them a few times. I would not like to work with somebody who only knows the answers to these questions and nothing more.
I think it's a bit far-fetched to assume that recruiters tailor interview questions based on these GitHub repos. For me, as a junior data scientist, it's really useful to test my own knowledge and highlight areas I need to study more.
Lots of tech and finance companies (particularly those with standardized interview processes) will blacklist questions if they're found online. Those companies will constantly check GitHub, GeeksForGeeks and Leetcode to see if their questions are listed there with solutions.
This probably won't be the case for a question as basic as, "what is regression?" But for any intermediate to advanced interview question involving regression, I would expect companies to jealously guard it.
If you're earnestly interested in building and testing your knowledge, I would recommend you read The Elements of Statistical Learning and Data Analysis Using Regression and Multilevel/Hierarchical Models. Also a good upper undergrad textbook in probability, like A First Course in Probability.
A couple recommendations piggybacking off of yours:
A First Course in Probability has a lot of problems (with solutions) and worked examples, but it’s light on intuition and pedagogy. It’s not an easy book to learn from, on its own. I highly recommend listening to Joe Blitzstein’s STAT 110 lectures and reviewing the wealth of problems/notes. The greater mastery of probability theory that you have, the easier studying ML and stats is. https://projects.iq.harvard.edu/stat110/home
Elements of Statistical Learning is a true textbook—a comprehensive bible that could occupy you for many thousands of hours. ISLR is the better book for a crash course: http://faculty.marshall.usc.edu/gareth-james/ISL/
Also, Regression and Other Stories is the new edition of the Regression with Multilevel models book, and it's much, much better (especially for n00bs).
A bit off topic but how much of data science work requires this probability/statistics knowledge on the job? I've heard you basically need a PhD to do modelling and "real" data science
From time to time I'm a hiring manager. I absolutely watch for these kinds of things, be it codegolf type challenges or large banks of interview questions. Depending on the role this can be helpful or hurtful to the expectations I would have on a candidate.
If the goals is to have a skill hire, typically someone who can maintain an existing, well-documented system, then having them know trivial details and banked information can be quite helpful. On the other hand, talent hires I would take in a different direction. If the candidate's only stand out quality is a clear memorization of banked answers I would wonder whether they could work from fundamentals.
Having somewhat standard "objective" questions helps even with senior candidates. You'd be surprised how many seniors struggle with the basics, or aren't as senior as their resume would lead you to believe.
Very interesting. I wonder in what type of interviews would these answers be considered ideal? Do people at OpenAI ask these sorts of questions, or is this more targeted toward other industries that require data scientists but are not populated by "stats" experts?
For example, consider "What is regression?".
The answer given is one way to go, but if the job involves causal analysis or more statistical know-how, for example, it would probably be insufficient. I would want the candidate to speak about linear projections, sampling assumptions, the probability models underlying the process, and so on.
On the other hand, I could imagine that it would not be a good strategy to start to lecture about minute details of regression analysis when applying for a standard data scientist position when sitting on front of "applied data scientists" or even HR folks.
My experience is that interviews like these are for supporting role data science jobs. E.g. company x has a product (tech or not), they have some data, and they want to hire someone to make that data useful in improving their product or selling more of it.
The general data science process is that when faced with a problem, you 1) select the appropriate algorithmic tool(s) for the problem and 2) apply the tool(s) to the data. One of the challenges in data science generally is that the tools can get pretty fucking complicated.
The point of theory interview questions in general is to assess the first point, whether or not a candidate has the capacity to pick the right tool for a given problem. They want to hear that you understand some of the standard tools and what sorts of problems they are good for. If you "get" the common tools, you'll likely be able to reason about the application of new/different tools for weird problems as you face them, or so the logic goes.
Everything I just said was more or less objective. This is my opinion: bad companies that do not know how to hire data scientists usually do what you're describing. They ask theory questions to assess whether or not the candidate already understands the specific tools they expect them to use. Good companies tend to pay more attention to whether the candidate is capable of understanding the universe of tools in general and are less worried about their specific application area.
I should note that this is less relevant when hiring consultants or "plug and play" senior people. Of course for those roles you want to know that the candidate has done something similar already and is primed for success.
I have to agree that these interview questions function more as a cheatsheet review than anything practical that would be seen in an interview. Data science interviews don't function like a biology test where you're just rattling off memorized answers about how neural networks or linear models work.
Ultimately, questions like "What is feature selection?" are more likely to be wrapped into case studies where the expected answer is, in effect, to use feature selection.
For example: "Let's say you have thousands of categorical features for an anonymized dataset involving human traits, how would you figure out which predictors are the most important?"
>I have to agree that these interview questions function more as a cheatsheet review than anything practical that would be seen in an interview. Data science interviews don't function like a biology test where you're just rattling off memorized answers about how neural networks or linear models work.
My co-workers and I have been asked exactly that by top companies, although it was more for machine learning engineer/applied scientist positions. Many textbook questions asked one after the other. It's not the only interview type they did, but it definitely mattered and was often the first filter. So if you didn't answer well enough, you were out.
These are at best machine learning questions, not data science. ML is definitely a sub-category of data science, but if you see a job posting for data science, you're 100% not going to be doing machine learning. That would have been labeled as a machine learning job posting.
For Machine Learning theory related to Data Science I'd highly recommend "The Hundred-Page Machine Learning Book" by Andriy Burkov [1].
According to the author, by reading one hundred pages of the book (plus some extra bonus pages), you will be ready to build complex AI systems, pass an interview or start your own business.
It is a read-first book: buy it later if you think it is good enough.
I've seen multiple companies ask candidates to write working code for machine learning end-to-end from scratch. As in, write a stochastic gradient descent logistic regression model with training, inference, etc. without any libraries beyond pandas/numpy. If you're lucky they'll provide you the equations or let you google them. So it's something to memorize, including the various numpy/pandas gotchas.
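For anyone facing that kind of round, a bare-bones numpy sketch of SGD logistic regression (not any particular company's expected answer) looks roughly like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_sgd(X, y, lr=0.1, epochs=50, seed=0):
    """Logistic regression trained with plain SGD (one example per update)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):   # shuffle each epoch
            p = sigmoid(X[i] @ w + b)
            grad = p - y[i]            # gradient of the log loss w.r.t. the logit
            w -= lr * grad * X[i]
            b -= lr * grad
    return w, b

def predict_proba(X, w, b):
    return sigmoid(X @ w + b)

# Tiny smoke test on separable synthetic data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
w, b = fit_logreg_sgd(X, y)
acc = ((predict_proba(X, w, b) > 0.5) == y).mean()
print(f"training accuracy: {acc:.2f}")
```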
I wish. I saw it asked by otherwise great teams run by very competent people. It's done in place of a leetcode algo question and those are even more divorced from day to day work. System design questions make more sense but I've noticed that you score better on them if you study the area beforehand but convincingly lie that you barely know it (and are just that clever and fast on your feet). Interviews are a shit show in general.
Years back someone asked me an interview question to explain sd and confidence intervals. I just gave the wrong answers and left. Especially in nascent fields such as ML and DS, it’s very important to work with people who actually know what they are doing. Otherwise you will have wasted years of your precious twenties doing absolutely nothing productive.
That response applies to 80% of modern interview questions at tech companies. So if you want a job there you smile, keep quiet and answer the problem. Since the job at large companies usually involves putting up with BS it's probably not a bad filter either for candidates.