It works the *exact* same way. You process the input one byte at a time, build a histogram, construct the Huffman Tree and encode the input.
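The pipeline above (histogram → tree → code assignment) can be sketched in a few lines of Python. This only computes the code *lengths*, not the actual bit-packing, and `huffman_code_lengths` is just a name I made up for the sketch:

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Return the Huffman code length (in bits) for each byte value."""
    freq = Counter(data)
    if len(freq) == 1:                       # degenerate: a single symbol
        return {next(iter(freq)): 1}
    # Heap entries: (count, tiebreaker, symbols in this subtree).
    heap = [(n, i, [s]) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    depth = {s: 0 for s in freq}
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, syms1 = heapq.heappop(heap)   # two least frequent subtrees
        n2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:              # every symbol sinks one level
            depth[s] += 1
        heapq.heappush(heap, (n1 + n2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return depth

lengths = huffman_code_lengths(b"abracadabra")
# 'a' is the most frequent byte, so it gets the shortest code.
assert lengths[ord("a")] < lengths[ord("r")]
```

The only subtlety is the tiebreaker in the heap entries: it keeps the comparison from ever reaching the symbol lists when two subtrees have equal counts.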
Why should it work differently? Text files are just regular files where the bytes only use a specific subset of the possible values they could have.
If you do have some knowledge about the data you are encoding, you can be a little bit smarter about it: e.g. for the text section of an executable, you might work on individual instructions instead of bytes, or use common, prepared Huffman trees so you don't have to encode the tree itself.
On a side note: IIRC the Intel Management Engine does exactly that, using a proprietary Huffman tree baked into the hardware itself, as an obfuscation technique[1].
To circle back to your question: PNG simply feeds the pixel data into the zlib deflate() function as-is.
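To illustrate that last point: with Python's zlib bindings, "feed the data in as-is" really is a single call. This is just a sketch with stand-in data; a real PNG encoder streams scanline by scanline and prepends a filter byte to each row:

```python
import zlib

pixels = bytes(range(256)) * 16       # stand-in for raw pixel data
compressed = zlib.compress(pixels, level=9)

# Round-trips losslessly, exactly as PNG requires.
assert zlib.decompress(compressed) == pixels
```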
You ask why it should work differently but then give a good reason why it should work differently: sometimes splitting by bytes is not the best unit.
And it actually does often work differently for PNG! PNG has a handful of filter (preprocessing) options for the pixels. With filter type 2 ("Up"), for example, deflate ends up encoding the difference between each pixel and the pixel above it. More or less.
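Here's a rough demonstration of why that filtering helps, using a synthetic "image" where each row is the row above plus one. The filter arithmetic follows PNG's Up filter (per-byte difference mod 256), but the data and sizes are made up for the sketch:

```python
import random
import zlib

random.seed(0)
width, height = 64, 64
base = [random.randrange(256) for _ in range(width)]
# A noisy vertical gradient: each row is the row above plus 1 (mod 256).
rows = [bytes((b + y) % 256 for b in base) for y in range(height)]
raw = b"".join(rows)

# PNG filter type 2 ("Up"): store each byte minus the byte directly
# above it, mod 256; the first row is stored unchanged.
filtered = bytearray(rows[0])
for y in range(1, height):
    filtered += bytes((rows[y][i] - rows[y - 1][i]) % 256
                      for i in range(width))

# After filtering, everything below row 0 is the constant 1,
# which deflate compresses far better than the raw pixels.
assert len(zlib.compress(bytes(filtered))) < len(zlib.compress(raw))
```

The raw gradient has almost no repeated byte sequences for LZ77 to match, while the filtered version is one short random row followed by thousands of identical bytes.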
[1] https://en.wikipedia.org/wiki/Intel_Management_Engine#Design