message batching and locking can be thought of as orthogonal pieces. In practice, yes there are advantages still of using thread per core, but perhaps not for the reasons you think. It has to do with core-local metadata materialization for a subset of the requests. THere is effectively essential complexity (i.e.: to replicate data, one must replicate data), but the TpC is strictly an optimization for flatter tail latencies.
Depends on use cases. We are working with a couple of security companies doing intrusion detection and for them, writing to disk on anomaly + notification seems to matter. Also financial services wanting to leverage open source tooling like spark or tensor flow still care about latency.