Practical advice for analysis of large, complex data sets (2016)

sgt101 · on May 2, 2021

There is a split in the comments below as I write this.

Some people are coming from the Kaggle / Text book/ Statistics perspective - the data set is ready for analysis so I think we should use analytical tool x or y to give us some information about the conclusions that we are likely to be trying to demonstrate from it - or the models that we will build.

Other people are looking from a Data Science pov. The data is made of bits and is likely to have pathological features due to it's history and the biases of the different collection and assembly methods. The task is to find these and mitigate them - at that point we have a decent data set and can move to analysis and modelling.

heresie-dabord · on May 2, 2021

For any small data set, the inefficiencies and Library Hell of $tooling don't matter... much.

For any non-trivial data set, the efficiency and internal consistency of $tooling is critical. Throwing stupendous hardware resources at a problem because the $tooling is unstable rubbish is the wrong approach.

For any non-trivial data set, the data must have a cohesive reason to exist. In other words... Big Data Garbage In, Big Data Garbage Out. Telemetry is often BDGI.

zmmmmm · on May 2, 2021

Interesting that PCA or any kind of dimensionality reduction does not rate a mention (at at least, not obviously ...).

If I had one piece of advice to give on the subject it would be PCA the crap out of everything and understand what the top components are doing. 9 times out of ten they will turn out to be very significant confounders and in the cases where they are not they will be confirming the structure of your data set in very useful and significant ways.

vatican_banker · on May 2, 2021

>If I had one piece of advice to give on the subject it would be PCA the crap out of everything and understand what the top components are doing

There are at least four issues with this advice (wrt doing data analysis):

1. How do you link your PCA components to the original data? Let's say you are tasked to find the main drivers of sales on a given city. You run PCA on the data and find two main components on the dataset. What do you do next? How do you make this information actionable?

2. How do you treat categorical variables? There are PCA methods for dealing with categorical variables but by the time you apply these methods plus the issues in 1) your data has lost all actionable meaning.

3. PCA is _very_ difficult to explain to business stakeholders. The more difficulty business stakeholders have to understand the analysis, the less they will use it.

4. Data-driven business stakeholders will favour clarity and simplicity over sophistication (somewhat linked to 3)

zmmmmm · on May 2, 2021

The goal isn't to present PCA directly to stakeholders. It is something you do at the start to understand your data. The premise is that it is extremely likely that there are significant batch effects or other statistical correlations in the data that you are probably unaware of at the outset. You should aim to discover these early on. To do this you need to use an unsupervised method because the whole point is you don't what they are.

> you are tasked to find the main drivers of sales on a given city. You run PCA on the data and find two main components on the dataset

Obviously it depends what comes out. But in all likelihood you will see some significant clusterings / divisions in PC1 and PC2, so you will try to interpret what properties of the points are driving those. You can do it in a data driven way (what are the significant coefficients in principal component vectors) or you can often do it in an exploratory way ... are they related to geography, are they related to age demographics, are they seasonal ... you color the data points by different possible explanatory variables to see what group things together. And you will very likely see things jump out (eg: you could find that the main reason a particular month was down in sales was due to a technical problem with the web site and you'll want to put that aside, because it doesn't have any predictive value).

efavdb · on May 2, 2021

Shameless plug: I previously wrote up a blog post on how to use an unsupervised feature selection analog of PCA to avoid many of the issues you point out here, and an associated python package to carry it out ("linselect", which you can pip install):

https://www.efavdb.com/unsupervised-feature-selection-in-pyt...

sgt101 · on May 2, 2021

I think that this is something to do when you have a "good" data set in hand - when you've filtered and checked and wrangled the data into something approximating a training table. Or if the data is set to you after that process or the collection process has produced good clean data (for example survey results).

Until then I don't think that PCA really tells you what you need to know - which is "is this data set what I think it is " and "is the information that I need to extract actually in here"?

Galanwe · on May 2, 2021

Agree with you on the general idea that dimension reduction is a very important tool. But from my experience performing a PCA first and then trying to make sense of the coefficients is a level 100 reverse engineering task.

The coefficients will most likely be numerous and noisy, making sense of them will be impossible. You need to have an hypothesis first of what the principal components might be, and only then compare your idea with the PCA coefficients to see if that works.

Der_Einzige · on May 2, 2021

Don't use PCA, use modern dimensionality reduction techniques like UMAP or Ivis.

zmmmmm · on May 3, 2021

Despite the usefulness of those other techniques, I still prefer to start with PCA. Its direct interpretability in terms of well understood statistical attributes of the data mean that for first line data interpretation it's better (in my mind) for the "QC" portion of analysis. The other methods are great to do after your PCA is looking sane.

Der_Einzige · on May 3, 2021

I don't think even starting with PCA is the right approach anymore. Ivis has better interpretability than PCA (e.g. built in features to watch it as it separates the data points out, feature importance explanations just like PCA, and it also supports inverse transforms (so does UMAP) just like PCA. The only thing that PCA has on top of UMAP or Ivis is estimated explained variance calculations which has almost no value.

There will be many problems where you can't easily get a "PCA that looks sane", but you can definitely get a UMAP/Ivis that looks sane. I guess I see where you're coming from in regards to like massive datasets and making sure you're not doing silly things (e.g. misreading an ordinal as continuous) - but outside of this I think PCA is antiquated.

The antiquation of PCA yet it's still ubiquitous usage is most likely one of the many factors holding back fields like bioinformatics dramatically. Please switch over to modern techniques. Your data will thank you!

zmmmmm · on May 3, 2021

Interesting - will admit I use UMAP but have not tried Ivis.

Thanks for the suggestion!

fighterpilot · on May 2, 2021

The two best titbits

(1) Slice your data, and check metrics in each slice

(2) Check for consistency over time, which is actually a special case of (1)

This is a type of as hoc bootstrapping. If your estimator is stable over various subgroups then it implies the estimator variance is low and you can be more confident in an observed effect.

sgt101 · on May 2, 2021

Or (in more pragmatic and plainer terms) it's a way of seeing that there isn't a huge gap or disjunction in the data due to some IT related event such as a data migration, someone introducing a bug into the feed that's been unnoticed or a deeper failure in the sensors that are providing the data.

js8 · on May 2, 2021

Train yourself on the data before you try to train computer on it?

plaidfuji · on May 2, 2021

This is such a great read not because there’s any individual piece of advice that’s novel or mind-blowing, but because it’s both concise and comprehensive... I’ve never seen anything approaching a true “SOP” for big data analysis, especially one that manages to stay as technology- and solution-agnostic as this one.. not one mention of the author’s favorite algorithm for X!

But man, if there was one section that makes this a crucial read, it’s this:

> Be both skeptic and champion

Nothing makes me tune out someone’s presentation/proposal faster than hearing only upside. Everything has trade-offs and no analysis has perfect information. If you can’t acknowledge this, I can’t trust you.

antman · on May 2, 2021

Slice your data and try to predict the slices with random forest, see what feature importance tells you. e.g. instead of looking at the distributions parameters of time streams to see if they changes label them to old data and new data and see if you can predict which is which. Or if you have missing data try to predict the rows with missing data to see if they are missing at random. Explore through small decision trees.

fighterpilot · on May 2, 2021

Once you have your decision tree that sort-of predicts the missing row of data, what then?

antman · on May 5, 2021

You see in what way it is related to the features, so if the missingness of a feature is correlated to some other feature you don't have Missing at Random (as in statistics) so you might have a biased sampling procedure that you can fix, that is very common.

anaselkhaloui · on May 2, 2021

Cool ! any advice for time series ? I've been working on those for a while (prediction, anomaly detection), and I find it hard to set good rules or best practices to follow.

sillysaurusx · on May 2, 2021

xarray has some wonderful docs for that. http://xarray.pydata.org/en/stable/

Well, mostly about time series in general, not anomaly detection specifically. For more general advice about best practices for this type of ML, you might like “rules of ML”: https://developers.google.com/machine-learning/guides/rules-...

fighterpilot · on May 2, 2021

What kind of time-series?