You don't need a larger mutex for fairness. For recursive locking, maybe. But most mutexes aren't recursive.
That's one of the big problems with pthread mutexes: they try to pack too much functionality into a single type. A pthread mutex can be recursive, so every mutex needs space to store a recursion count, even though most mutexes aren't recursive. A pthread mutex can have an arbitrary priority ceiling (pthread_mutexattr_setprioceiling), so every mutex needs space to store the priority ceiling, even though most mutexes don't set that attribute. A pthread mutex can be "robust" (pthread_mutexattr_setrobust), which at least under glibc means that every mutex needs previous/next pointers in it so it can be stored in a per-thread linked list, even though most mutexes are not robust. That's an excessive level of generality, when different kinds of mutexes could have just been different types!
The other big problem with pthread mutexes is just that they are old. I could be wrong, but my impression is that much of the interest in smaller mutexes has come about relatively recently, at least compared to the age of these APIs. It might be possible to shrink pthread_mutex_t to some extent despite the aforementioned issues, but it's impossible to do so on any existing operating system (at least on existing architectures) because changing the size of pthread_mutex_t would break ABI.
Another point about glibc's implementation of pthread_mutex_t (and, IIRC, even the glibc-internal lll_t, which implements a simple mutex as a few lines of assembly around futex(2)) is that on platforms with both a 32-bit and a 64-bit ABI, the layout of the structure is compatible between the two. The reason is that a pthread mutex can have the pshared attribute, and it is also used in the implementation of higher-level POSIX IPC primitives (any sane implementation of which just places the particular synchronization struct in shared memory rather than somehow translating it into SysV IPC).
You'd need extra space to store the fact that the mutex is fair rather than unfair. (I agree with you that the API tries to cram too much into one interface, but that's really on POSIX rather than glibc.)