This just seems so crazy, that one would compute a three-input boolean function by first describing instructions for computing it in this peculiar format, and then executing those instructions, rather than just doing some simple bit operations to perform a "table lookup"! I wish he went into the rationale in more detail at the end.
It's effectively a very early version of what we would do with "shaders" today - a massively repeated operation across an entire bitmap. Targeting the 16-bit single-processor era of 4MHz machines with 1MB RAM.
A more modern version might be some sort of AVX filter-instruction generator. Although these days we try to avoid letting the CPU dirty its hands with bitmaps and leave them to the GPU.
The Windows documentation for BitBlt dates it to Windows 2000, which is definitely wrong. I wonder if anyone has a 3.0 or even 1.0 API reference lying around?
> The Windows documentation for BitBlt dates it to Windows 2000, which is definitely wrong. I wonder if anyone has a 3.0 or even 1.0 API reference lying around?
This is just an inner loop optimization, really. The thing to realize about the machines Windows 1.0 was originally written to run on is that they were very short on both memory and speed. Almost unimaginably so, by modern standards.
So here, where there's a need to perform operations in bulk as quickly as possible, this sort of extreme optimization makes a bunch of sense. Conditionals inside the loop would blow your performance, and the other options buy back performance with memory that was itself in short supply. So... dynamic compilation threads the needle between these two constraints at the cost of some up front development complexity. (Probably ongoing complexity too, as additional types of graphics boards were released with their own sorts of peculiar optimization strategies.)
To pick the bit operation being used. They also have varying number of operands (ranging from just filling a target region with solid black to three operand stuff that merges images, etc.)
No, I mean... look, you've got four inputs: p, s, d, and f. p, s, and d are one-bit values. f is an 8-bit value that encodes the input->output mappings here. The output is 1-bit.
The output is found simply as f&(1<<((p<<2)|(s<<1)|d)) != 0.
Yeah, if you use the index to map to a bunch of ands and ors and whatnot to execute, you have a conditional. But if you just use the index directly and do bit operations, there's no conditional. What am I missing?
Different ROP's do different amounts of work (both logical operations and memory accesses). An ROP like BLACKNESS just fills the target with black, where an ROP like SRCAND does logical operations on multiple source values. The dynamic compilation let them truly reduce the inner loop of the BitBlt operation to about as simple as it could be with the given ROP.
There are also format issues. Circa Windows 1.0, you might be likely to have a CGA, Hercules graphics board, or EGA. All three have different memory formats and architectures. (CGA was fairly ordinary, Hercules used an odd line ordering, and EGA had multiplexing hardware on board that let you operate on 32-bits at a time over an 8-bit interface.)
Any approach based on lookup tables would've both taken a lot of memory that wasn't available and introduced extra memory references in the inner loop of what was probably the most performance critical code in the OS from a GUI PoV. (Keep in mind that unlike the Mac, these machines were not originally designed to run a GUI.)
This story is set ten or more years after the design of BitBlt was done, but I think it does nicely illustrate how people were thinking about memory back then. The reason Windows 95 didn't update the seconds on the taskbar clock was to allow it to page out font rendering code on small machines. If you're rendering text every second, you can't really do that.
As described in the earlier article [1], in the original BitBlt implementations the blitter loop was generated on the fly based on a template and executed as a subroutine. Looks like the opcode mini-language was meant to be straightforward to translate into machine code.