Just in case this isn't clear, BLAKE3 breaks a single large input up into many chunks, and it hashes those chunks in parallel. The caller doesn't need to provide mulitple separate inputs to take advantage of the SIMD optimizations. (If you do have multiple separate inputs, you can actually use similar SIMD optimizations with almost any hash function. But because this situation is rare, libraries that provide this sort of API are also rare. Here's one of mine: https://docs.rs/blake2s_simd/1.0.0/blake2s_simd/many/index.h....)
Majority of Blake3's perf benefit manifests from merkle-tree structure and SIMD processing of multiple input streams at the same time.