> The accuracy of any of the selection methods we use in education is very poor. It just is.
The SAT when combined with the high school GPA (HSGPA) has an adjusted correlation coefficient of 0.56 with first-year GPA, meaning the combined measurement accurately predicts how a potential college applicant will perform in their first year of college 56% of the time. [1]
That's actually pretty good. What other proposed metrics can say their signals match outcomes with 56% validity? How much you liked their essay?
Lower SAT scores have about a 63% retention rate for first-year students, whereas high SAT scores have about a 95% retention rate [2]. That is, high schoolers with poor SATs drop out of college nearly 40% of the time in their first year.
Standardized tests have many problems -- obviously -- but no one has developed a less unfair system.
When colleges abandon standardized tests what else are they relying on? Random signals made up by admissions officers? That's worse than job interviewing.
I have no problem criticizing standardized testing, but I feel everyone who does should be obligated to propose a better alternative method with a higher validity rate than 56%.
> The SAT when combined with the high school GPA (HSGPA) has an adjusted correlation coefficient of 0.56 with first-year GPA, meaning the combined measurement accurately predicts how a potential college applicant will perform in their first year of college 56% of the time.
I'm summarizing for a general audience. I could say r is "the strength of the linear relationship between two variables on a graph," but I'm not sure that helps the average person understand the connection.
If you have a better description, it's more helpful to chime in with that instead of "You're wrong!"
A better summary would be that those two quantities explain about half of the variation, not that they predict accurately half the time.
If you took a random sample of cases, it's not that half of them would exhibit a direct relationship between SAT and first-year GPA and the other half nothing (unless the data is _super_ weird). Instead, SAT would be instructive-ish in predicting first-year GPA for all of those cases.
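A minimal simulated sketch of that distinction (numpy, with made-up data standing in for SAT and FYGPA -- not the study's numbers):

```python
# Sketch: at r ~ 0.56 the predictor is somewhat informative for every
# case, rather than exactly right for half the cases and useless for the rest.
import numpy as np

rng = np.random.default_rng(0)
n, r = 100_000, 0.56

x = rng.standard_normal(n)                              # stand-in for standardized SAT
y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)  # stand-in for FYGPA, corr ~ r

print(np.corrcoef(x, y)[0, 1])                   # ~0.56

pred = r * x                                     # best linear guess of y from x
print(np.mean(np.isclose(pred, y, atol=1e-6)))   # ~0.0: no case is predicted exactly
print(np.mean(np.abs(pred - y)))                 # ~0.66: typical error using x ...
print(np.mean(np.abs(y)))                        # ~0.80: ... vs. error ignoring x entirely
```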
Explaining half the variation, and what about the other half?
The point was to draw a connection for the general audience, not present the most scientifically accurate description of a relationship between two variables -- that's what the links to the research are for.
It's good to communicate for a general audience, but your presentation misleads rather than simplifies.
> meaning the combined measurement accurately predicts how a potential college applicant will perform in their first year of college 56% of the time.
"accurately predicts...56% of the time" implies that half of predictions are 'accurate', which most readers would interpret as 'correct' i.e. knowing SAT + HSGPA allows you to state FYGPA _exactly_ for about half of cases. That's not what the research you cited says. Rather, the square of the multiple correlation R (which is exactly R^2, the coefficient of determination) indicates how much of the variance in the output variable is explained by the input variables. That quantity _must_ be communicated in terms of the strength of the relationship, not accuracy for a given or share of cases as it doesn't tell us anything about a given case. One could say it tells us about 30% (0.56^2, correction from my statement above) of the information we'd need to know to perfectly predict the outcome, or that the relationship is better than random, but doesn't predict perfectly, or ...
Additionally, table 5 of the link you cited indicates the adjusted correlation coefficient between FYGPA and the combination of HSGPA and SAT is 0.62. None of the numbers in that table are 0.56, so I'm not sure where you pulled that exact number from. I've used 0.56/56% above to be clear which quantity I'm referring to.
Uh...this is exactly what I mean. Your description is 100% scientifically accurate but probably way beyond the average reader.
Again, if you can simplify this in a way more accurate than I have, great, be my guest -- I look forward to reading it.
R^2, coefficient of determination, output variable variance, etc. -- most readers aren't going to go that deep into the math. For those who do, like you, the links to the actual research are provided.
But so far all I see are data scientists complaining about how my description is not 100% statistically accurate without providing any alternative explanation that doesn't devolve into variance of output variables.
Again, be my guest to show me I'm wrong, but what you wrote above is not something that would be easy to understand for the general audience, IMHO.
The concerning (mis)interpretation of your statement is what I said on the third line above:
> "accurately predicts...56% of the time" implies that half of predictions are 'accurate', which most readers would interpret as 'correct' i.e. knowing SAT + HSGPA allows you to state FYGPA _exactly_ for about half of cases.
This interpretation is easy to arrive at, and clearly does not correspond to a reasonable understanding of the source, even for a general audience.
I provide two suggestions above:
> One could say it tells us about 30% of the information we'd need to know to perfectly predict the outcome, or that the relationship is better than random, but doesn't predict perfectly
OK, but that only makes it more confusing. You say it tells us about 30% of the information we'd need to know...which makes it sound (to the lay person) like there's no connection because 70% of the information is elsewhere!
I appreciate your commitment to academic rigor but sometimes oversimplifying things, even at the cost of mathematical accuracy, is enough for a general audience who aren't going to compute variance of output variables.
This isn't about academic or mathematical rigor - this is about responsible communication of statistics.
You're right that a general audience isn't going to look at the source, nor think about variance of output variables. Therefore, it's the responsibility of us as communicators of statistics to relate the conclusions that can be drawn from the data in a way that first and foremost is not wrong or misleading, and secondly captures the concept as accurately as possible for the audience.
The first principle is the overriding obligation. Your simplification can capture as little of the information and conclusions supported by the data as you want, but it cannot imply or state conclusions that are not supported.
You're getting this response to your statement, from me and others, because your interpretation of the source can easily be read as drastically overstating the character (and strength) of the relationship supported by the data - even if that's not what you intended.
To you this is the major responsibility, but not to everyone.
I'm getting this response to my statement from you and many other data scientists who can't accept oversimplified math, but all of whom have failed to produce a general description in plain English.
In some ways it's like the test -- everyone hates it but no one has a better alternative. You've complained maybe 7 or 8 times in this thread about how scientifically inaccurate my general summary is but have not produced a description that a regular person could understand in 5 words or less, with no technical jargon.
"Something is partially explained by a combination of numbers"?
That says nothing of semantic value. Better to exaggerate the causal relationship and give a sense of meaning than to offer meaningless generics like that, because at least the general reader comes away with an intuitive sense of the shape of the relationship. The above implies nothing.
1) Your description is 0% correct; and
2) It's not a complaint. Maybe you don't know what a correlation coefficient is. That's okay -- I don't know what polyfinite rings are.
The problem is not plain English, the problem is that you want to make claims that are more specific than your numbers allow you to.
"Correlation is a number in [-1,1] where -1 means predictably inversely related and 1 means predictably directly related". What is correlation 0.5? There isn't even an unique definition of "correlation", man.
The best thing you could do is to produce a table of correlation coefficients for comparison. That would communicate some qualitative nuance as to what a 0.5 correlation really means.
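As a sketch of both points, here's a toy comparison (scipy, invented data): two standard definitions of "correlation" disagree on the same data, and the side-by-side values give a little of that comparison-table flavor:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)
x = rng.uniform(0, 3, 500)

datasets = {
    "linear + noise":      x + rng.normal(0, 1.0, x.size),
    "monotone, nonlinear": np.exp(x) + rng.normal(0, 1.0, x.size),
}

# Same word, "correlation", two different numbers for each dataset.
for name, y in datasets.items():
    print(f"{name:22s} pearson={pearsonr(x, y)[0]:.2f}  spearman={spearmanr(x, y)[0]:.2f}")
```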
You can also try something like a bootstrap or kernel density estimate, or use a prior distribution, to claim that the probability that an increment dx leads to an increment of dy or more is... That's more technically involved, but it satisfies your desire to say something specific that correlation coefficients alone do not allow you to say.
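A rough sketch of that bootstrap idea, assuming we make "increment dx leads to an increment of dy" precise via the uncertainty in a fitted slope (numpy, simulated data; the dx and dy values are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 2_000, 0.56
x = rng.standard_normal(n)
y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)

dx = 1.0   # hypothetical increment in the predictor (say, +1 SD of SAT)
dy = 0.5   # question: how often does the fitted model say y rises by at least dy?

slopes = []
for _ in range(2_000):                  # resample (x, y) pairs with replacement
    idx = rng.integers(0, n, n)
    slopes.append(np.polyfit(x[idx], y[idx], 1)[0])

print(np.mean(np.array(slopes) * dx >= dy))   # bootstrap estimate of P(increment >= dy)
```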
Kernel density, bootstrap, prior distributions, increments of dx and dy...all this stuff is so far above and beyond the ordinary person.
You and a couple of others have provided plenty of context for the serious data scientist who wants to understand the exact nature of the relationship between these two variables, but 99% of readers do not understand much less want to read about "kernel density".
Again, my challenge remains open: instead of complaining about inaccuracy, try to render the relationship in plain English.
Lol. I was giving a basic intro for the 99% of readers who don't care about "kernel density" or controlling for output variables -- the fact that you conflate such an attempt with a "quantitative analysis" (which it was not), coupled with your unwillingness or inability to give a plain English description, just proves my point.
That's not summarizing. "It's the strength of the relationship" is summarizing. "The combined measurement accurately predicts how a potential college applicant will perform in their first year of college 56% of the time" is just wrong. See Anscombe's quartet for a great example of why it's just plain wrong.
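For reference, Anscombe's quartet is small enough to check directly. These are the standard published values; all four datasets share a correlation of about 0.816 despite looking nothing alike when plotted:

```python
import numpy as np

# Anscombe's quartet: four (x, y) sets with nearly identical summary statistics.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    print(name, round(np.corrcoef(x, y)[0, 1], 3))   # all ~0.816 -- yet plot them and see
```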
"It's how perfectly you can fit a straight line to them."
You can be mathematically accurate without being mathematically precise. Better imprecise but correct than incorrect but precise.
If you're trying to give a lay picture of what exactly a 0.56 linear correlation means, you still need to be quantitatively right, while the above are qualitative. Pictures and examples can help. "For perspective, 0.56 is about the correlation between <example> and <example>"
Saying there is a quantitative strength to a relationship is, to a regular person, meaningless. Am I .56 in love with my wife?
Can I fit in a straight line to her?
These are not good descriptions. Of course HN is full of data scientists who wildly object to oversimplifying statistical relationships -- luckily you are here to give the detailed mathematical context. But these are not simplified descriptions for a general audience.
> Saying there is a quantitative strength to a relationship is, to a regular person, meaningless.
That the average person would not understand a particular accurate description of a subject does not, in and of itself, make a completely inaccurate alternative description less wrong or even a good simplification.
> The SAT when combined with the high school GPA (HSGPA) has an adjusted correlation coefficient of 0.56 with first-year GPA, meaning the combined measurement accurately predicts how a potential college applicant will perform in their first year of college 56% of the time.
This is a totally incorrect interpretation of what correlation is.
Again, I'm summarizing for a general audience. If you have a better way to describe it that doesn't devolve into polynomials and linear relationships between variables on a graph, it's more helpful to do so than to just say, "You're totally incorrect!"
Correlation is not probability. You can't compare them at all. Flipping a coin for each student would produce a correlation of 0, far lower than the correlation of 0.56 cited above. Have a look at some plots of data [1] with different correlation coefficients to see how dramatic it can be. Note the difference between r = 0.00 and r = 0.60. That's about what we're dealing with here.
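A quick simulated version of that comparison (numpy, invented data):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

gpa = rng.normal(3.0, 0.5, n)          # hypothetical first-year GPAs
coin = rng.integers(0, 2, n)           # one fair coin flip per student

print(np.corrcoef(coin, gpa)[0, 1])    # ~0.00: the coin carries no information

# versus a predictor engineered to correlate with GPA at ~0.56
z = (gpa - gpa.mean()) / gpa.std()
sat_like = 0.56 * z + np.sqrt(1 - 0.56**2) * rng.standard_normal(n)
print(np.corrcoef(sat_like, gpa)[0, 1])   # ~0.56: visibly informative on a scatter plot
```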
[1] https://files.eric.ed.gov/fulltext/ED563202.pdf
[2] https://files.eric.ed.gov/fulltext/ED563471.pdf