
I have to agree with your point about EDA. The library is neat, but even the example of covariance matrix animation is a bit contrived.

Every pixel has a covariance with every other pixel, so sliding through the rows of the covariance matrix generates as many faces on the right as there are pixels in a photograph of a face. However, pixels that strongly co-vary will produce very similar right-hand "face" pictures. To get a sense of how many distinct behaviours there are, one would look for eigenvectors of this covariance matrix. And then 10 or so static eigenvectors of the covariance matrix (eigenfaces [1]) would be much more informative than the thousands of animated faces displayed in the example.
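For concreteness, this is roughly what I mean by "10 or so static eigenvectors of the covariance matrix" in NumPy (the image size and the data here are just placeholders, not the library's actual example):

    import numpy as np

    # X: one flattened face image per row, shape (n_images, n_pixels)
    X = np.random.rand(400, 64 * 64)        # placeholder data for illustration

    X_centered = X - X.mean(axis=0)         # centre each pixel across images
    cov = np.cov(X_centered, rowvar=False)  # (n_pixels, n_pixels) pixel covariance

    # eigh works here because the covariance matrix is symmetric;
    # eigenvalues come back in ascending order, so take the last k columns
    eigvals, eigvecs = np.linalg.eigh(cov)
    k = 10
    eigenfaces = eigvecs[:, -k:][:, ::-1].T  # top-k eigenvectors, one per row

    first_eigenface = eigenfaces[0].reshape(64, 64)  # reshape for display

Ten static images like that summarise the dominant pixel covariation patterns directly.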

Sometimes a big interactive visualisation can be a sign of not having a concrete goal or not knowing how to properly summarise. After all, that's the purpose of a figure: to highlight insights, not to find ways to display the entire dataset. Pictures that try to display the whole dataset end up shifting the job of exploratory analysis into a visual space and leaving it for somebody else.

Though of course there are exceptions.

[1]: https://en.wikipedia.org/wiki/Eigenface



Hi, one of the other devs here. As the poster below pointed out, what you're missing is that in this case we already know an eigendecomposition or PCA will be useful. However, if you're working on matrix decomposition algorithms like us, or trying to design new forms of summary matrices because a covariance matrix isn't informative for your type of data, then these types of visualizations are useful. We broadly work on designing new matrix decomposition algorithms, so it's very useful to look at the matrices and then try to determine what types of decompositions we want to do.


I've also worked on designing new matrix decompositions, and I've never found the need for anything but `imshow`...


OK, different libraries have different use cases; the type of data we work with absolutely necessitates dynamic visualization. You wouldn't view a video with imshow, would you?


Every time I've needed to scrub through something in time like that, dumping a ton of frames to disk using imshow has been good enough. Usually, the limiting factor is how quickly I can generate a single frame.
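For what it's worth, my workflow is roughly the following (the frame data is a stand-in, obviously):

    import numpy as np
    import matplotlib.pyplot as plt

    frames = np.random.rand(100, 128, 128)   # stand-in for whatever I'm computing

    for i, frame in enumerate(frames):
        fig, ax = plt.subplots()
        ax.imshow(frame, cmap="gray")
        ax.set_title(f"frame {i}")
        fig.savefig(f"frame_{i:04d}.png")
        plt.close(fig)                        # don't accumulate open figures

    # then flip through the PNGs in an image viewer, or stitch them with ffmpeg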

It's hard for me to imagine what you're doing that necessitates such fancy tools, but I'm definitely interested to learn! My failure of imagination is just that.


The example from the article with the subtitle "Large-scale calcium imaging dataset with corresponding behavior and down-stream analysis" is a good example. We have brain imaging video that is acquired simultaneously with behavioral video data. It is absolutely essential to view the raw video at 30-60Hz.


Aren't you missing the entire point of exploratory data analysis? Eigenfaces are an example of what you can come up with as the end product of your data exploration, after you've tried many ways of looking at the data and determined that eigenfaces are useful.

Your whole third paragraph seems to be criticizing the core purpose of exploratory data analysis as though one should always be able to skip directly to the next phase of having a standardized representation. When entering a new problem domain, somebody needs to actually look at the data in a somewhat raw form. Using the strengths of the human vision system to get a rough idea of what the typical data looks like and the frequency and character of outliers isn't dumping the job of exploratory data analysis onto the reader, it's how the job actually gets done in the first place.


> Using the strengths of the human vision system to get a rough idea of what the typical data looks like and the frequency and character of outliers isn't dumping the job of exploratory data analysis onto the reader, it's how the job actually gets done in the first place.

Yup, this is a good summary of the intent. We also have to remember that the eigenfaces dataset is a very clean/toy example. Real datasets never look this good, and going straight to an eigendecomposition or PCA isn't informative without first taking a look at things. Often you may want to do something other than an eigendecomposition or PCA: get an idea of your data first, and then think about what to do with it.

Edit: the point of that example was to show that we can visually judge what the covariance matrix is producing in the "image space". Sometimes a covariance matrix isn't even the right statistic to compute from your data, and interactively looking at your data in different ways can help.


As a whole, of course you have a point: big visualisations, when done properly, should help with data exploration. However, in my experience they rarely (but not never) do. I think it's specific to the type of data you work with and the visualisation you employ. Let me give an example.

Imagine we have some big data, like an omics dataset about chromatin modification differences between smokers and non-smokers. Genomes are large, so one way to visualise them might be a Manhattan plot (mentioned here in another comment). Let's (hypothetically) say the pattern in the data is that chromatin in the vicinity of genes related to membrane function has more open-chromatin marks in smokers compared to non-smokers. A Manhattan plot will not tell us that. And in order to detect that in our visualisation, we would have to already know what we were looking for in the first place.

My point in this example is the following: in order to detect that, we would have to know what to visualise first (i.e. visualise the genes related to membrane function separately from the rest). But once we are looking for these kinds of associations, the visualisation becomes unnecessary. We can capture the comparison of interest with a single number (i.e. the average difference between smokers and non-smokers within this group of genes). And then we can test all kinds of associations by running a script with a for-loop that checks every group of genes we care about and returns a number for each. It's much faster than visualisation. After this type of EDA is done, the picture would be produced as a result, displaying the effect and highlighting the insights.
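Schematically, the for-loop version I have in mind looks something like this (the signal matrix, labels, and gene sets are all made up for illustration):

    import numpy as np

    # hypothetical inputs: a signal matrix (samples x genes), a smoker label
    # per sample, and a dict mapping gene-set names to column indices
    signal = np.random.rand(60, 20000)
    is_smoker = np.random.rand(60) > 0.5
    gene_sets = {"membrane_function": np.arange(0, 150),
                 "dna_repair": np.arange(150, 300)}

    results = {}
    for name, idx in gene_sets.items():
        smoker_mean = signal[is_smoker][:, idx].mean()
        nonsmoker_mean = signal[~is_smoker][:, idx].mean()
        results[name] = smoker_mean - nonsmoker_mean   # one number per gene set

    # rank gene sets by effect size; the figure comes afterwards, to display
    # whatever this loop turns up
    for name, diff in sorted(results.items(), key=lambda kv: -abs(kv[1])):
        print(name, round(diff, 3))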

I understand your point about visualisation being an indispensable part of EDA. But the example I provided above is much closer to my lived experience.


Yeah, I agree with the general sentiment of what you're saying.

Re: wtallis, I think my original complaint about EDA per se is indeed off the mark.

Certainly creating a 20x20 grid of live-updating GPU plots and visualizations is a form of EDA, but it seems to suggest a complete lack of intuition about the problem you're solving. Like you're just going spelunking in a data set to see what you can find... and that's all you've got; no hypothesis, no nothing. I think if you're able to form even the meagerest of hypotheses, you should be able to eliminate most of these visualizations and focus on something much, much simpler.

I guess this tool purports to eliminate some of this, but there is also a degree of time-wasting involved in setting up all these visualizations. If you do more thinking up front, you can zero in on a smaller and more targeted subset of experiments. Simpler EDA tools may suffice. If you can prove your point with a single line or scatter plot (or number?), that's really the best case scenario.


Eigendecomposition of the covariance matrix, essentially PCA, is probably the first non-trivial step in the analysis of any dataset. The idea in the comment above seems to be that it's more useful to combine some basic knowledge of statistics with simpler visualisation techniques than to quickly generate thousands of shallower plots. Being able to generate thousands of plots is useful, of course, but I would agree that promoting good data-analysis culture is more beneficial.


> Eigendecomposition of the covariance matrix, essentially PCA, is probably the first non-trivial step in the analysis of any dataset

For a sufficiently narrow definition of "dataset", perhaps. I don't think it's the obvious step one when you want to start understanding a time series dataset, for example. (A Fourier transform would be a more likely step two, after step one of actually looking at some of your data.)
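By "Fourier transform as step two" I mean nothing fancier than this (toy signal, numbers picked arbitrarily):

    import numpy as np

    fs = 100.0                                   # 100 Hz sampling, toy signal
    t = np.arange(0, 10, 1 / fs)
    x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)
    x += 0.3 * np.random.randn(t.size)

    # "step two": look at the spectrum
    freqs = np.fft.rfftfreq(x.size, d=1 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    peak_freqs = freqs[np.argsort(power)[-2:]]   # should land near 5 and 12 Hz
    print(sorted(peak_freqs))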


I agree, but: the technique of “singular spectrum analysis” is pretty much PCA applied to a covariance matrix resulting from time-lagging the original time series. (https://en.wikipedia.org/wiki/Singular_spectrum_analysis)

So this is not unheard of for time series analysis.
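For anyone unfamiliar, the core of SSA really is just that; something like the following sketch (window length chosen arbitrarily, toy data):

    import numpy as np

    x = np.cumsum(np.random.randn(500))          # toy series
    L = 50                                       # window length, a free choice
    K = x.size - L + 1

    # trajectory (Hankel) matrix: each column is a length-L lagged window
    traj = np.column_stack([x[i:i + L] for i in range(K)])

    # PCA on the lag-covariance matrix -- the heart of SSA
    lag_cov = traj @ traj.T / K
    eigvals, eigvecs = np.linalg.eigh(lag_cov)
    order = np.argsort(eigvals)[::-1]
    leading_patterns = eigvecs[:, order[:3]]     # dominant length-L temporal patterns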


Exactly, that's a good example!



