You're right. I had a look at the methods sections of three of the four
studies cited in the article. It seems the way they collect data is to have people fill in questionnaires and then calculate personality scores from the responses.
For example, in the Mac Giolla and Kajonius study [1] the data came from an
online questionnaire.
I'm finding it very hard to criticise the methodology of this sort of study
without [edited out] being extremely rude and dismissive to a whole
field of research. But, I really don't understand how all this is supposed to
make sense. I look at the plot in the middle of the Scientific American
article that is marked "Agreeableness" on the x-axis (with a score from 1 to
5). I look at the papers cited. There's plenty of maths there, but what is
anything calculating? What is "agreeableness" and why is it measured on a
scale of 1 to 5? It seems to be a term that has a specific meaning in
psychology and sociology, and that ultimately translates to "a score of n in
this standardised questionnaire".
But, if you can just define whatever quantities you like and give them a
commonly used word for a name- then what does anything mean anymore? You can
just define anything you like as anything you like and measure it any way you like- and claim anything you want at all.
At the end of the day, is all this measuring anything other than trends in
filling up questionnaires? Can we really draw any other conclusions about the
differences of men and women than that the samples of the studies filled in
their questionnaires in different ways? Is even that a safe conclusion? If you
take statistics on a random process you can always model it- but you'll be
modelling noise. How is this possibility excluded here?
All the statistics are quantifying the answers that respondents gave to
questionnaires and how the researchers rated them - without any attempt to
blind or randomise anything, to protect from participant or researcher bias,
as far as I can tell (I'm searching for the word "blind" in the papers and
finding nothing). In the study I link below, participants found the website by
internet searches and word-of-mouth. The study itself points out that this is
a self-selecting sample, but what have they done to exclude the possibility of
adversarial participation (people purposefully filling in a form in a certain
way to confuse results)?
There is so much that is extremely precarious about the findings of those
studies and yet the Scientific American article jumps directly to d values.
Yes, but d-values on what? What is being measured? What do the numbers stand
for?
This is just heartbreaking to see that such a contentious issue is treated
with such frivolity. If it's not possible to lay to rest such hot button
issues with solid scientific work- then don't do it. It will just make matters
worse.
There are whole fields of research devoted to the questions you're raising. As such, it's hard to reply with anything that would do justice to them. This isn't to say your questions aren't important, just that your lack of answers reflects your ignorance more so than that of the researchers. I say this not antagonistically but to suggest that it's important to understand that what you see is not always all there is to say.
It is true that these are self-report questionnaires, but as such they are small samples of behaviors of the people in question. Samples of how they perceive themselves, how they think about life, how they think about others, and what they value.
The Big Five, and the measures used in studies such as this, have been validated (in the sense that the ratings have been associated concurrently and predictively) over decades in many ways: with regard to daily reports of behavior, emotion, and life events, diagnoses, work ratings, performance on tests, ratings by peers and colleagues, ratings by strangers, just about everything you can imagine. These self-reports aren't perfect, but they do provide a fuzzy snapshot of someone at a given moment in time. Yes, it would be better to obtain all sorts of other measures of behavior, but they would be too expensive to obtain on large enough samples to be representative.
A major paradox in understanding human behavioral differences is that the more specific and "real world" you get, the less and less they generalize. That is, you can get a very concrete measure of a real-world behavior, but it ceases to be representative of that person across a large number of contexts and situations. Say you want to measure theft, for example. Do you set up a honeypot? Is that representative of that person? Do you use police reports or records? Is that representative? It turns out self-report on online questionnaires is a very good measure of things like this, because people are less self-conscious and report things that don't go on the official record.
Faking is also controversial in this area. You're right to bring it up as an issue, but to understand the research on it it's important to think about why someone would fake. That is, what's the motivation for large proportions of people to systematically fake in one direction? And if they do go to the trouble of doing that, what's "real" and what's "fake"? That is, let's say people make themselves look more dominant than they really are -- what does it mean if one person does that and another does not? It turns out that the person who wants to make themselves look more dominant often is more dominant, all other things equal, because it means they value that.
Also, strangely enough, it turns out that people who are callous and aggressive don't really care about that, especially on online questionnaires, because they are callous and aggressive.
This has all been very thoroughly researched and it turns out to be much more complicated than it seems at first glance. It doesn't mean things can't be better, but it does mean that over very large samples of persons answering questions on a low-stakes questionnaire (in the sense there aren't real consequences to them answering one way or another), a lot of these things average out. It's not the end of the story, but it's not something to be dismissed either.
In the end, questions of sex differences in behavior are about sex differences in behavior. And that's what this research addresses.
If there are all these ways to verify that the answers to questionnaires are accurate you'd think those ways would have been used instead of questionnaires as proof in this highly controversial and inflammatory subject. Extraordinary claims require extraordinary evidence, don't they?
It's getting past my bed time and your comment deserves a more thorough
answer that I'll try to write tomorrow, but for the time being this is what
strikes me the most about your reply:
>> Also, strangely enough, it turns out that people who are callous and
aggressive don't really care about that, especially on online questionnaires,
because they are callous and aggressive.
How do you know that someone who looks callous and aggressive on online
questionnaires is actually callous and aggressive? The obvious answer seems to
be that you know because you've given them another questionnaire separately.
Is that the case?
I'm not trying to catch you out, so I'll spell it out: if that is the case
then I don't see how you can ever know that someone is callous and aggressive
in any objective sense of the word. Like I say in another comment, that would
be "questionnaires all the way down". This is a really strong signal that I
get from discussions like this and it makes me very suspicious of assurances
that it's all been studied and it's all based on solid evidence.
I mean, I'm sorry, I don't want to sound like a square but "how [people]
perceive themselves" is exactly the opposite of what I'd think of as an
objective measure of how they really are. For example- I perceive myself as
pretty (I like myself, that is) but I am not always perceived as pretty by
others. What value is there in asking me how pretty I am?
Edit: I get that some of your comment addresses this. But it still seems to me like the solution is to try to second-guess the participant. That also doesn't sound like it should make for objective observations.
> How do you know that someone who looks callous and aggressive on online questionnaires is actually callous and aggressive?
You don't, but it's also not necessary. It's impossible to objectively assess someone's subjective experience; the best we can do is look at groups of people and attempt to find reliable indicators.
The point is that some people will over-emphasize any given trait, and others will under-emphasize it, so on average it evens out.
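To put a number on the "evens out" part: here is a toy simulation (all values invented, not taken from any of the cited studies) showing that idiosyncratic over- and under-reporting washes out of a group mean as the sample grows, while a bias everyone shares in the same direction does not.

    import numpy as np

    rng = np.random.default_rng(0)

    # hypothetical "true" trait levels for a large sample, on a rough 1-5 scale
    true_trait = rng.normal(loc=3.0, scale=0.5, size=100_000)

    idiosyncratic_error = rng.normal(scale=1.0, size=true_trait.size)  # random self-perception error
    shared_bias = 0.3                                                  # everyone inflating the same way

    reported_random = true_trait + idiosyncratic_error
    reported_biased = true_trait + idiosyncratic_error + shared_bias

    print(reported_random.mean() - true_trait.mean())  # ~0: individual errors cancel in the average
    print(reported_biased.mean() - true_trait.mean())  # ~0.3: a shared systematic bias does not cancel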
Think of color perception for a similar conundrum. How can you be sure that the red you see is the same as everybody else is seeing?
>> How can you be sure that the red you see is the same as everybody else is seeing?
I can't, but my understanding is that if we all agree to call a certain frequency of visible light "red", the frequency won't change because some people perceive it in a different way than others. Neither will measuring the frequency depend on how people perceive it.
That seems to me to be a more consistent definition of "red" than the definitions of personality traits that are discussed here.
>> There are whole fields of research devoted to the questions you're raising. As
such, it's hard to reply with anything that would do justice to them. This
isn't to say your questions aren't important, just that your lack of answers
reflects your ignorance more so than that of the researchers. I say this not
antagonistically but to suggest that it's important to understand that what
you see is not always all there is to say.
Another comment brought up the term "construct validity" and it seems to match
my concerns exactly. I am glad there is debate on that.
I study for a PhD in AI and I have similar concerns about research in my
field. For instance, in AI, research often claims to have modelled human
abilities such as "reasoning", "emotion" or "intuition". I'm personally
uncomfortable even with well-established terms like "learning" (as in "machine
learning") and "vision" (as in "machine vision")- because we don't really know
what it means to "see" or to "learn" in human terms so we shouldn't be hasty
to apply that terminology to machines.
This tendency has been criticised from the early days of the field but we seem
to have regressed in recent years, with the success of machine learning for
object classification in images and speech processing taking the field by
storm and leaving no room for careful study anymore, it seems. But that's a
conversation for another thread.
In AI, I'm worried that calling what algorithms do "attention" or "learning to
learn" etc, gives a false impression to people outside the field about the
progress of the field, and, in the end, about what we know and what we don't
know. This is certainly not advancing the science.
I think the same about psychology and studies like the ones we're discussing
here. If psychologists are happy measuring the correlations of the answers in
their questionnaires, and they label the quantities measured in this way with names like "agreeableness" and "sensitivity"- doesn't that just give the
entirely wrong impression to people outside the field who have a very
different concept of what "agreeableness" etc means?
I say that this is "not advancing the science". You could argue that the
science is doing fine, thank you, even if lay people don't get it. But, if the
way the science is carried out creates confusion and influences real behaviour
and decisions, as studies like the ones discussed above have the potential to
do- is that really a beneficial outcome of research?
To put it plainly: as a researcher I don't aspire to create confusion, but to
bring clarity in subjects that are hard to understand. Isn't that the whole
point?
>> In the end, questions of sex differences in behavior are about sex differences
in behavior. And that's what this research addresses.
I understand this. But, my concern here is that asking people "what do you
think about sex differences in behaviour" is likely to return results tainted
by ungodly amounts of cultural bias that would be impossible to disentangle
from any other results. How is this addressed in such studies? How do you
account for people answering questions about sex differences in behaviour
based on what they are used to thinking about sex differences in behaviour,
rather than what they actually observe?
P.S. Hey, your answer does do justice to my questions. Thanks for your
patience, again.
Haven't seen this in the comments yet, so I'll offer some extra information on the Big Five approach.
Essentially, it is a linguistic approach to personality. The original 5 categories were found by asking people if they would describe themselves with a certain adjective. They did this for hundreds of adjectives. After doing a Factor Analysis, surprisingly, they found 5 independent groups of adjectives. In each group, the adjectives correlate with each other. So, for instance, someone you would describe as assertive you would also be likely to describe as proactive. Both of these traits happen to correlate with describing someone as extroverted. The particular group is then given the Big Five name. So technically speaking, extroversion is a whole class of attributes you would be likely to describe someone as.
To me, the amazing thing about the Big Five is that we can reliably extract information about types of people using ordinary language.
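To make the procedure concrete, here is a minimal sketch on synthetic data (invented numbers and sklearn's FactorAnalysis; the real studies use hundreds of adjectives and much larger, messier samples): generate self-ratings driven by a few latent traits, then check that factor analysis recovers the groupings.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    n_people, n_traits, items_per_trait = 2000, 5, 6
    n_items = n_traits * items_per_trait

    # each person has 5 latent trait scores; each adjective loads on exactly one trait
    latent = rng.normal(size=(n_people, n_traits))
    loadings = np.zeros((n_traits, n_items))
    for t in range(n_traits):
        loadings[t, t * items_per_trait:(t + 1) * items_per_trait] = 0.8

    ratings = latent @ loadings + rng.normal(scale=0.6, size=(n_people, n_items))

    fa = FactorAnalysis(n_components=n_traits, rotation="varimax").fit(ratings)

    # each recovered factor should load mostly on one block of adjectives
    print(np.round(fa.components_, 1))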
To be honest though I don't find it surprising that factor analysis would find high correlations between some traits. Actually, this is really concerning if that's the basis of the whole thing. You can find correlations anytime you look for them. How were these correlations validated? I mean- how do we know that the big-5 don't just model noise in the analysed datasets?
I don't know the details of the approach (though someday I would love to study them) but I gather that the correlations are very strong and that it has been replicated many times (including in different languages). Though what is key is that the correlations are weak between groups of traits. You can find large question banks they've used too. I think that even using disjoint sets of attributes/questions you will still get the same groupings.
EDIT: To add just a bit more: of all the replication scandals in psychology today, the Big Five is one of the few frameworks that has withstood the storm.
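For what it's worth, the standard guard against "just modelling noise" is to compare against data whose correlation structure has been deliberately destroyed. Below is a rough sketch of Horn's parallel analysis (assuming a people x items `ratings` matrix like the synthetic one above): keep only factors whose eigenvalues beat what column-shuffled data produces.

    import numpy as np

    def parallel_analysis(ratings, n_shuffles=50, seed=0):
        rng = np.random.default_rng(seed)
        real = np.linalg.eigvalsh(np.corrcoef(ratings, rowvar=False))[::-1]
        noise = []
        for _ in range(n_shuffles):
            # shuffling each column independently keeps the item distributions
            # but destroys any real correlation between items
            shuffled = np.column_stack([rng.permutation(col) for col in ratings.T])
            noise.append(np.linalg.eigvalsh(np.corrcoef(shuffled, rowvar=False))[::-1])
        # rough count of factors beating the noise baseline
        # (the usual rule stops at the first factor that fails)
        return int(np.sum(real > np.mean(noise, axis=0)))

    # print(parallel_analysis(ratings))  # ~5 on the synthetic data sketched above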
For the purpose of the “big 5” personality studies, things like “agreeableness” or “openness to experience” are more or less marketing terms that allow professionals to discuss a real and complex topic in a shared language. The “big 5 personality theory” is well established, based on peer-reviewed and repeatable experiments. The foundation of these studies is real, not fuddy-duddy, irreparable nonsense (the big 5 comes from the newer corner of psych that doesn’t have a reproducibility crisis and has solid scientific grounding and a rationale for its experimentation and data analysis). I think you should do a bit more research into the context of the paper instead of doing a drive-by analysis based on what I would consider to be an ill-informed reading of the results.
If you have some time, most of your questions are answered in this set of lectures about the subject from someone in the field: https://www.youtube.com/watch?v=pCceO_D4AlY&list=PL22J3VaeAB.... If you don't have a lot of time, the 10-20 minutes or so from where I linked are also useful and answer a lot of the points you brought up.
One way of thinking about this study is in terms of semantic relations. Ask yourself the following questions, they might help with understanding the value of a study like this:
Will different people answer these personality questions differently?
Do some people have similar personality traits?
Can specific viewpoints be predicted, to some reasonable degree of error, based on how they answer these questions?
Assuming the people answering do not get any feedback from answering in a specific way (i.e. the questionnaire is a black box), and correlations can be found in the data, and predicted viewpoints can be tested against the answers, then the scientific method here is sound. From an engineering point of view, it makes sense that the results of such a questionnaire can be useful in understanding the way that people think. Of course some people will be fuzzy or erratic and not conform to a regression, but we see that all the time in every practical scientific branch (excluding most maths, in my experience).
If you don't like using the words like "agreeableness", "conscientiousness" etc then just replace with "Trait A", "Trait B" etc.
In either case you will detect consistent differences between the sexes for the different traits, and the predictive power of the results remains unchanged.
Quote from the article https://www.frontiersin.org/articles/10.3389/fpsyg.2011.0017...: "Agreeableness comprises traits relating to altruism, such as empathy and kindness. Agreeableness involves the tendency toward cooperation, maintenance of social harmony, and consideration of the concerns of others (as opposed to exploitation or victimization of others). Women consistently score higher than men on Agreeableness and related measures, such as tender-mindedness (Feingold, 1994; Costa et al., 2001)."
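As for what a d value in studies like this is computed on: it is just a standardized difference between two groups' trait scores, measured on the same 1-5 questionnaire scale. A minimal sketch with invented numbers, not the study's data:

    import numpy as np

    def cohens_d(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
        return (a.mean() - b.mean()) / np.sqrt(pooled_var)  # mean difference in pooled-SD units

    rng = np.random.default_rng(0)
    women = np.clip(rng.normal(3.9, 0.6, 5000), 1, 5)  # hypothetical agreeableness scores
    men = np.clip(rng.normal(3.6, 0.6, 5000), 1, 5)
    print(cohens_d(women, men))  # ~0.5: the groups' means differ by about half a standard deviation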
> But, if you can just define whatever quantities you like and give them a commonly used word for a name- then what does anything mean anymore? You can just define anything you like as anything you like and measure it any way you like- and claim anything you want at all.
> At the end of the day, is all this measuring anything other than trends in filling up questionnaires? Can we really draw any other conclusions about the differences of men and women than that the samples of the studies filled in their questionnaires in different ways? Is even that a safe conclusion? If you take statistics on a random process you can always model it- but you'll be modelling noise. How is this possibility excluded here?
Until you've learned the whole field, don't assume it is as superficial as you imagine.
> All the statistics are quantifying the answers that respondents gave to questionnaires and how the researchers rated them - without any attempt to blind or randomise anything, to protect from participant or researcher bias, as far as I can tell (I'm searching for the word "blind" in the papers and finding nothing).
Your searching for the word "blind" means you don't know anything about psychology research. We don't use this word in our research. In psychology studies, we care about "reliability" and "validity" and we have extensive methods to test those.
> In the study I link below, participants found the website by internet searches and word-of-mouth. The study itself points out that this is a self-selecting sample, but what have they done to exclude the possibility of adversarial participation (people purposefully filling in a form in a certain way to confuse results)?
Maybe there are a few people who try to do that. But with a sample size of 130,602, those responses wouldn't impact the research findings at all, unless there were an organized effort to influence the research.
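A back-of-the-envelope version of that point (all numbers invented except the sample size, which the study reports): even if 1% of respondents answered as extremely as the scale allows in one direction, the group mean on a 1-5 scale barely moves.

    n = 130_602               # sample size reported in the linked study
    honest_mean = 3.6         # hypothetical mean if everyone answered honestly
    frac_adversarial = 0.01   # suppose 1% of respondents answer adversarially
    adversarial_answer = 1.0  # and all of them pick the most extreme opposite end of the scale

    print(round(n * frac_adversarial))  # ~1306 people would have to coordinate for even this
    shifted_mean = (1 - frac_adversarial) * honest_mean + frac_adversarial * adversarial_answer
    print(shifted_mean - honest_mean)   # and the group mean still only moves by about -0.026 on a 1-5 scale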
> There is so much that is extremely precarious about the findings of those studies and yet the Scientific American article jumps directly to d values. Yes, but d-values on what? What is being measured? What do the numbers stand for?
> This is just heartbreaking to see that such a contentious issue is treated with such frivolity. If it's not possible to lay to rest such hot button issues with solid scientific work- then don't do it. It will just make matters worse.
Maybe the Scientific American article is not flawless, but I think you need to calm down a little bit.
> I'm finding it very hard to criticise the methodology of this sort of study without channeling Feynman and being extremely rude and dismissive to a whole field of research.
Can you elaborate on the Feynman thing? I haven't heard anything about that before.
Probably refers to the "Uncle Sam Doesn't Need You!" chapter in "Surely You're Joking, Mr. Feynman", in which Feynman is given a psychiatric evaluation as part of the medical exam for the draft. Anecdotally, he came away with an even lower opinion of psychiatry (rightly or wrongly) than he had when he went in.
I seem to recall Feynman having a similar attitude towards philosophy.
It's really sad and disappointing to see that sort of close-minded attitude come from such a talented person towards entire fields he knows nearly nothing about.
That's a matter of opinion, one which I personally disagree with, though I am not a fan of some of the philosophy and psychology of the 20th Century.
Furthermore, philosophy extends back thousands of years and spans across many societies and cultures. To dismiss all of it in a handwavy way from a position of ignorance, as Feynman did, speaks of nothing but narrow-minded bigotry.
I think the social sciences are useful and in fact indispensable. I disagree with their methodology and with the practice of copying methods from other fields that are really not suitable to the subject matter of the social sciences.
For instance, if you define a set of answers to a questionnaire as "agreeableness" and assign it a score, you can do maths with it, just like physics can define the measurement on a thermometer as "temperature" and do maths with that. But there are no thermometers in the social sciences, and the maths seem to only be measuring the researchers' intuitions (and of course, their cultural biases). [Edit: don't ask me what a thermometer is actually measuring- but I know that if a thermometer shows the water in the kettle is 100°C then the water is boiling. If my agreeableness is 1, what does that do? Does it have a consistent effect? Can I measure the effect? With what? Another questionnaire? So it's questionnaires all the way down? Well, I know for sure that whatever thermometers are measuring- it's not thermometers all the way down.]
It would be a lot more informative to hear what the researchers think, their intuitions and conclusions from their careful observations of human behaviour _without_ any attempt to quantify the unquantifiable. We would learn a lot more about the human mind by listening to the _opinions_ of people who have spent their life studying it, if it wasn't for all the maths that (to me anyway) are measuring made-up quantities. If nothing else, there would be more space left in their papers to explain their intuition.
Obviously, there is an entire literature that measures what it does in terms of life outcomes other than questionnaires.
If you'd like to learn something very elementary about a field you're completely unfamiliar with, you might be better off picking up a textbook, rather than borderline trolling of the "this entire field is nonsense, prove me wrong" variety.
It means you are more likely to answer other questions in a certain way. Other studies might even show that you might be likely to behave in a certain way.
> Does it have a consistent effect?
Yes, but like thermometers only work on groups of molecules, the questionnaires are consistent on groups of humans.
> Can I measure the effect? With what? Another questionnaire? So it's questionnaires all the way down?
No, you could easily do a follow up study by finding groups of people that answered the questionnaires in a certain way, and then have them participate in behavioral experiments.
> Well, I know for sure that whatever thermometers are measuring- it's not thermometers all the way down.
The manner in which molecules bump into each other randomly. The harder they do it, the more space they take up. The more people in your group with an agreeableness score of 1, the more space they might take up ;)
Opinions are not science, maths is. We don't improve the social sciences with more vague intuitions. If you want to read intuitions then maybe read a glossy instead. Science is about making quantifiable statements, and maths is the way you turn samples into those. We certainly don't need any more room for so-called experts to tell us how we should behave in their papers. We tried that, and it was awful.
>> Opinions are not science, maths is. We don't improve the social sciences with more vague intuitions. If you want to read intuitions then maybe read a glossy instead. Science is about making quantifiable statements, and maths is the way you turn samples into those. We certainly don't need any more room for so-called experts to tell us how we should behave in their papers. We tried that, and it was awful.
That's a very well structured passage, thanks for the comment.
However, what I see is that psychology is trying very hard to make quantifiable statements about things that it can't make quantifiable statements about. Yes, maths can be used to turn observations into quantifiable statements. But just because someone is using maths, it doesn't mean they're turning observations into quantifiable statements. You can use maths to quantify non-existent quantities that you have never observed and the maths themselves won't stop you. I will mention Daryl Bem and his measurements of ESP now, but I don't mean that psychology is like parapsychology, only that you can misuse maths if you're not very, very careful. And just because you have maths it doesn't mean you're being careful.
And I think intuitions and personal expertise with a subject are the basis of scientific knowledge. The maths are there as a common language to communicate the intuitions gained, in a manner that makes them accessible to others who do not have the same expertise. Maths is the language of science because it's used to communicate scientific knowledge, not because it's a set of magickal formulae that transform everything to solid science.
> you can misuse maths if you're not very, very careful
That's what nobody ever addresses in these studies.
Even assuming all this linguistic questionnaire stuff passes for a measure of something (certainly not biology; it really falls apart on indigenous populations), the further mathematics gives the joke away. Factor analysis is done just wrong. Questionnaire items are mostly positively correlated, and no thought is spared for how the Perron-Frobenius theorem produces spurious factors, which are also dimensionally invalid to boot (which, one imagines, is not unwelcome, as scaling the data may give a stronger result). Then the methodology manages to fail confirmatory factor analysis on its own terms anyway: https://sci-hub.tw/10.1007/s11336-006-1447-6 Clustering validation is not even attempted beyond trying different numbers of clusters (anywhere from 4 to 13 results in fits only marginally worse than 5).
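For readers who want to see the Perron-Frobenius point rather than take it on faith, here is a small synthetic illustration (invented data, not a reanalysis of any cited study): once every item correlates positively with every other item (here induced by nothing more than an acquiescence-style response bias), the leading eigenvector of the correlation matrix is guaranteed to be all of one sign, i.e. a big "general factor" pops out regardless of the underlying trait structure.

    import numpy as np

    rng = np.random.default_rng(1)
    n_people, n_traits, items_per_trait = 2000, 5, 6
    n_items = n_traits * items_per_trait

    # five genuinely separate trait clusters...
    latent = rng.normal(size=(n_people, n_traits))
    loadings = np.zeros((n_traits, n_items))
    for t in range(n_traits):
        loadings[t, t * items_per_trait:(t + 1) * items_per_trait] = 0.8

    # ...plus a per-person "yes-saying" tendency that touches every item
    acquiescence = rng.normal(size=(n_people, 1))
    ratings = latent @ loadings + 0.5 * acquiescence + rng.normal(scale=0.6, size=(n_people, n_items))

    corr = np.corrcoef(ratings, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # ascending order; last column is the leading eigenvector
    leading = eigvecs[:, -1]

    print((corr[~np.eye(n_items, dtype=bool)] > 0).mean())   # ~1.0: essentially all inter-item correlations positive
    print(bool(np.all(leading > 0) or np.all(leading < 0)))  # True: the first component is an all-one-sign "general factor"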
Denunciations of the Big Five (and friends) go back decades. Then there's a flood of reassertions as if nothing happened, and again some refutations of that new wave. The ascent of data science has made things comical. One year they do a metastudy with one million respondents, some dude asks some basic questions, the next year they do it with two million as if this answers anything. It is an endless war of attrition and not worth anyone's time.
I cite a large passage from the paper you linked to because it's an excellent example of the kind of "misuse of maths" I meant:
Consider, for instance, the personality literature, where people have discovered that executing a PCA of large numbers of personality subtest scores, and selecting components by the usual selection criteria, often returns five principal components. What is the interpretation of these components? They are “biologically based psychological tendencies,” and as such are endowed with causal forces (McCrae et al., 2000, p. 173). This interpretation cannot be justified solely on the basis of a PCA, if only because PCA is a formative model and not a reflective one (Bollen & Lennox, 1991; Borsboom, Mellenbergh, & Van Heerden, 2003). As such, it conceptualizes constructs as causally determined by the observations, rather than the other way around (Edwards & Bagozzi, 2000). In the case of PCA, the causal relation is moreover rather uninteresting; principal component scores are “caused” by their indicators in much the same way that sumscores are “caused” by item scores. Clearly, there is no conceivable way in which the Big Five could cause subtest scores on personality tests (or anything else, for that matter), unless they were in fact not principal components, but belonged to a more interesting species of theoretical entities; for instance, latent variables. Testing the hypothesis that the personality traits in question are causal determinants of personality test scores thus, at a minimum, requires the specification of a reflective latent variable model (Edwards & Bagozzi, 2000). A good example would be a Confirmatory Factor Analysis (CFA) model.
Now it turns out that, with respect to the Big Five, CFA gives Big Problems. For instance, McCrae, Zonderman, Costa, Bond, & Paunonen (1996) found that a five factor model is not supported by the data, even though the tests involved in the analysis were specifically designed on the basis of the PCA solution. What does one conclude from this? Well, obviously, because the Big Five exist, but CFA cannot find them, CFA is wrong. “In actual analyses of personality data [...] structures that are known to be reliable [from principal components analyses] showed poor fits when evaluated by CFA techniques. We believe this points to serious problems with CFA itself when used to examine personality structure” (McCrae et al., 1996, p. 563).
If I'm not prying too much- what is your relation with the field?
Psychology talks about 'agreeable' in the same way as Physics talks about 'hot'. In our everyday context, 'hot' is quite vague, and people will have widely varying opinions about something being hot, depending on the context and on their personal experience.
To cope with that problem, science needs to detach the term from everyday use and put an artificial definition in its place, one which allows us to make repeatable statements. In the process, the term loses a lot of its everyday meaning. That is the price of precision.
Imagine there was a unit for agreeableness, so you could say 'John has an agreeableness of 4.4 Ag'. Then it would be clear that this statement refers to a formal definition (based on a standardized questionnaire), instead of our common vague understanding.
Now you could still argue that this new definition is so detached from what we usually mean with agreeableness that it becomes useless. However, you can't simply dismiss the method, you need to bring concrete arguments why the proposed definition does not capture what it is supposed to capture. For example, you could show that people with agreeableness below 2 Ag are married happily more often than people with agreeableness above 4 Ag. Do you have such concrete objections?
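For what it's worth, the 1-5 number itself usually comes from nothing more exotic than averaging Likert-type item responses, with reverse-keyed items flipped first. A minimal sketch (item names and keys invented for illustration; the actual studies use standardized inventories):

    def trait_score(responses, reverse_keyed, scale_max=5):
        """responses: dict of item -> answer in 1..scale_max; returns the 1-5 trait score."""
        adjusted = [
            (scale_max + 1 - answer) if item in reverse_keyed else answer
            for item, answer in responses.items()
        ]
        return sum(adjusted) / len(adjusted)

    # e.g. a toy "agreeableness" scale with one reverse-keyed item
    answers = {"sympathizes_with_others": 4, "is_interested_in_people": 5, "insults_people": 2}
    print(trait_score(answers, reverse_keyed={"insults_people"}))  # 4.33...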
Maybe I'm missing something. What is it specifically about this psychology research that somehow makes it incompatible with making quantifiable statements? I'm not saying some random person is doing some random maths which would randomly make it science. I'm saying these particular persons in this particular study are performing scientific research and making quantifiable statements through maths that seem to be trivially and intuitively applicable. How is this different from say particle physics? If anything applying maths to particle physics is more dangerous because it's so easy to repeat the experiment until you've got the result you want.
You have something specifically against psychology research, I haven't seen a single argument from you that could not be applied against any other scientific field. They applied a methodology that you didn't initially understand, and now you're refusing to understand it because you're committed to arguing against it. Weird thing is it's not even a controversial finding, just a confirmation of something everyone knows to be true.
>> How is this different from say particle physics?
It is different in the sense that particle physics quantifies concepts that do not depend on how someone feels about them, or on how someone answers questions in a questionnaire.
And I don't see how I misunderstood the studies we're discussing. They handed people questionnaires asking them how they think or feel about things. I don't see how any concrete evidence about anything can be found in this way, other than how people fill questionnaires.
But the claim is never "X people fill this questionnaire in this way". It's always along the lines of "X people are more agreeable" etc. This is misleading.
We've traded replies here on HN a couple of times. We have a lot of interest areas in common — theorem proving, software synthesis, that kind of thing.
- If so, are the social sciences really sciences at all?
And one philosophical question:
- What happens if an entire field of "research" is dismissed as wholly subjective and not repeatable?
This would be much bigger than the debunking of phrenology or astrology as those don't have university departments, journals, or attempt to set social policy.
They are measuring the differences between male and female responses to questionnaires. I don't get why this is so hard to understand, or worse, heartbreaking to you. Who cares what "agreeable" means; it's just some concept that they've got correlated questions for.
It's just proper science, no one was hurt, so why get emotional about it? Everybody has the intuition that there's a general difference between the psychology of men and women. They developed a methodology that shows and quantifies that phenomenon. That's good and proper science.
That the concept is not well understood doesn't diminish its value. When Volta measured electric charge without really grasping what was happening, and even got the direction of charge wrong, that might have been heartbreaking, but it was also world-changing work that he definitely should not have just not done. Even if he made matters a little bit worse.
It is not useless. Besides, the whole idea of science is that all of it is, or could be, useful in some way. This research establishes a platform on which you could build more specific research that might be helpful in determining whether something is a personality trait, a sexual trait, or some kind of disorder.
Or if you want to go even more specific, maybe research like this could be used to convince governments that deny transsexuality that it in fact is possible to scientifically show that a person aligns more with a different sex, which might give them access to subsidy or insurance for surgery. Some people might be helped just because the science acknowledges the reality of their feelings.
In any case, your anti-intellectual disposition is shameful.
__________________
[1] https://onlinelibrary.wiley.com/doi/epdf/10.1002/ijop.12529?