I mostly agree (and use void* myself), but I don't 100% agree on the last point. There are a bunch of algorithms that are conceptually generic but get a significant speed-up when specialized for particular types, mostly numeric ones. For example, I think the STL's type-specializing sort is a pretty reasonable default, while C suffers from having only a non-specializing sort built in, which ends up much slower if you're sorting, say, an array of integers. The same goes for a bunch of machine-learning algorithms that work on generic comparable types but run much faster if you specialize them to integer or double, the way C++ templates would. That's the main use case I see those terrible C macro approaches used for. An alternative I've seen is to generate the type-specialized variants ahead of time using some external templating or macro system, or even a hacked-up Perl script.
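To make the qsort vs. std::sort point concrete, here is a minimal sketch (assuming any C++11 compiler; exact timings will obviously vary by platform): qsort has to call an opaque comparator through a function pointer for every pair, while the int instantiation of std::sort can inline the comparison.

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    // qsort's comparator: called through a function pointer for every
    // comparison, so the compiler generally cannot inline the integer compare.
    static int cmp_int(const void* a, const void* b) {
        int x = *static_cast<const int*>(a);
        int y = *static_cast<const int*>(b);
        return (x > y) - (x < y);
    }

    int main() {
        std::vector<int> a(1000000);
        for (int& x : a) x = std::rand();
        std::vector<int> b = a;

        // Generic C sort: element size and comparison are runtime parameters.
        std::qsort(a.data(), a.size(), sizeof(int), cmp_int);

        // Type-specialized sort: instantiated for int iterators, so the
        // comparison is inlined and the inner loop works on ints directly.
        std::sort(b.begin(), b.end());
        return 0;
    }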
Fast support for atomic types is indeed a good reason for this sort of hackery - that's one of the reasons it is used in kernels. The performance difference between void*-based and int-specialized hashmaps/vectors/sets can be very significant (easily an order of magnitude or more in some real cases).
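A rough sketch of where that gap comes from, using hypothetical node layouts rather than any particular library's: the void*-based node keeps its value behind an extra pointer into separately allocated memory, so every access pays another dereference, while the int-specialized node stores the value inline.

    #include <cstdio>

    // Hypothetical node layouts, just to show the extra indirection.
    struct NodeGeneric {            // generic: value lives behind a void*
        void* value;                // points at a separately allocated int
        NodeGeneric* next;
    };

    struct NodeInt {                // specialized: value stored inline
        int value;
        NodeInt* next;
    };

    int sum_generic(const NodeGeneric* n) {
        int s = 0;
        for (; n; n = n->next)
            s += *static_cast<const int*>(n->value);  // extra pointer chase
        return s;
    }

    int sum_int(const NodeInt* n) {
        int s = 0;
        for (; n; n = n->next)
            s += n->value;                            // value already in the node
        return s;
    }

    int main() {
        int x = 1, y = 2;
        NodeGeneric g2{&y, nullptr}, g1{&x, &g2};
        NodeInt     i2{2, nullptr},  i1{1, &i2};
        std::printf("%d %d\n", sum_generic(&g1), sum_int(&i1));
        return 0;
    }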
I wish there were a language (not C++) that could help with this kind of thing. Unfortunately, it becomes difficult very fast. One interesting approach is K, as suggested by some FreeBSD hackers, but it never went into production AFAIK (http://wiki.freebsd.org/K).
I'd question how much of that impact comes from specializing the collection code, and how much of it comes from allocation and locality. The temptation with void* collections is to malloc everything, which is lethal for performance, and one of the very few non-bug things I've come across whose fix actually yields a 10x performance improvement.
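Something like this is what I mean, as a hedged sketch rather than code from any of the libraries discussed: the void* container pushes you toward one malloc per element, each int landing in its own heap chunk, while the specialized container keeps everything in one contiguous buffer, so a lot of the measured gap can come from the allocator and cache behaviour rather than from genericity itself.

    #include <cstdlib>
    #include <vector>

    // void*-based container: each element tends to get its own heap
    // allocation, scattering data across memory and adding allocator
    // overhead per element.
    void fill_boxed(std::vector<void*>& out, int n) {
        for (int i = 0; i < n; ++i) {
            int* p = static_cast<int*>(std::malloc(sizeof(int)));
            if (!p) return;          // allocation-failure handling kept minimal
            *p = i;
            out.push_back(p);
        }
    }

    // Specialized container: elements live contiguously in one growing
    // buffer, so iteration is cache-friendly and there is no per-element malloc.
    void fill_inline(std::vector<int>& out, int n) {
        for (int i = 0; i < n; ++i)
            out.push_back(i);
    }

    int main() {
        std::vector<void*> boxed;
        std::vector<int> inlined;
        fill_boxed(boxed, 1000);
        fill_inline(inlined, 1000);
        for (void* p : boxed) std::free(p);
        return 0;
    }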
Memory allocation can indeed be amortized with a specialized allocator, but that's not what I had in mind. The context I usually work in is numerical computation, and the indirection cost is very high there, especially compared to accessing memory in blocks, which a specialized allocator makes possible. Doing this well would certainly be very hard. It is well known that the allocator is one of the main weaknesses of the STL (one of the reasons for the existence of the Electronic Arts STL).
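For illustration, a minimal sketch of the amortization I mean - a hypothetical fixed-type arena, not the STL allocator interface or EASTL's: allocations are served out of large contiguous blocks, so the per-element allocator overhead disappears and consecutive elements tend to end up next to each other in memory.

    #include <cstddef>
    #include <vector>

    // Hypothetical arena for a single element type T: carves elements out of
    // large blocks instead of calling the general-purpose allocator per element.
    template <typename T, std::size_t BlockSize = 4096>
    class Arena {
    public:
        T* allocate() {
            if (blocks_.empty() || used_ == BlockSize) {
                blocks_.push_back(new T[BlockSize]);  // one big allocation, amortized
                used_ = 0;
            }
            return &blocks_.back()[used_++];
        }
        ~Arena() {
            for (T* b : blocks_) delete[] b;
        }
    private:
        std::vector<T*> blocks_;
        std::size_t used_ = 0;
    };

    int main() {
        Arena<int> arena;
        // Thousands of "allocations", but only a handful of real ones
        // underneath, and consecutive ints end up adjacent in memory.
        for (int i = 0; i < 10000; ++i)
            *arena.allocate() = i;
        return 0;
    }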
I am certainly not advocating doing this in general - I think the need for atomic-type support in generic collections is quite low (I have been investigating the issue recently, to add fast and generic support for sparse matrices in scipy). I am pretty sure the macro-based, specialized ones used in FreeBSD (tree.h/queue.h) and Linux (rbtree, list) have been benchmarked to hell, though, and I would trust them more than most STL implementations.
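For anyone who has not looked at those headers, the general technique looks roughly like this (a hypothetical DEFINE_LIST macro for illustration, not the real queue.h or list.h API): the preprocessor stamps out a fully type-specialized node type and functions per element type, so the compiled code is as monomorphic as hand-written code.

    #include <cstdlib>

    // Hypothetical macro-based specialization in the spirit of the BSD/Linux
    // headers (not their real API): expands to a node type and push function
    // specialized for the given element type.
    #define DEFINE_LIST(name, type)                                            \
        struct name##_node {                                                   \
            type value;                                                        \
            struct name##_node* next;                                          \
        };                                                                     \
        static struct name##_node* name##_push(struct name##_node* head,      \
                                               type v) {                      \
            struct name##_node* n =                                            \
                (struct name##_node*)std::malloc(sizeof *n);                   \
            n->value = v;                                                      \
            n->next = head;                                                    \
            return n;                                                          \
        }

    DEFINE_LIST(intlist, int)     // generates intlist_node / intlist_push
    DEFINE_LIST(dbllist, double)  // generates dbllist_node / dbllist_push

    int main() {
        struct intlist_node* ints = nullptr;
        ints = intlist_push(ints, 1);
        ints = intlist_push(ints, 2);   // no void*, no function pointers

        struct dbllist_node* dbls = nullptr;
        dbls = dbllist_push(dbls, 3.14);
        return 0;                        // cleanup omitted for brevity
    }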
Seriously, why not C++? It's just C with some stricter type checking. Pretend it's C, then judiciously grab a pair of angle brackets when you need them. No macros required.
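As a sketch of what the angle brackets buy you compared to the macro approach upthread (plain templates, nothing fancier assumed): the compiler generates the same kind of per-type specialization, but with type checking and no preprocessor.

    #include <vector>

    // Template equivalent of the macro trick: the compiler generates a fully
    // specialized push/sum per element type, with type checking included.
    template <typename T>
    struct List {
        std::vector<T> items;            // contiguous, no per-element boxing
        void push(const T& v) { items.push_back(v); }
    };

    template <typename T>
    T sum(const List<T>& l) {
        T total = T{};
        for (const T& v : l.items) total += v;   // inlined, works on T directly
        return total;
    }

    int main() {
        List<int> ints;        // "C with angle brackets": instantiation for int
        ints.push(1);
        ints.push(2);

        List<double> dbls;     // and a second, independent specialization
        dbls.push(3.14);

        return sum(ints) == 3 ? 0 : 1;
    }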
C++ has portability issues (where portability means more than just supporting g++ and MSVC), poor interoperability, and is much harder to maintain in a distributed team of people with varying ability in the language (e.g. open source).
When those are not issues, C++ is appropriate. Otherwise, it is a pain.