It's amazing that this would run halfway well on a 33 MHz 486. Doom had a 35 fps cap, and ran at 320x240 (square pixels):
2.7 million pixels per second at 35 fps (the cap).
1.4 million pixels per second at 18 fps (~50% of cap).
At the more realistic target of 18 fps, you have 24 clock cycles per pixel. A 486 averaged about 0.8 instructions per clock, so you're looking at 19 instructions per pixel. With a 33 MHz memory bus and the DRAM of the day, you're looking at about 5 clocks for memory latency. That looks like an upper bound of no more than 4 memory operations per pixel.
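The arithmetic above can be sanity-checked mechanically. A minimal sketch in C, using only the figures quoted in the comment (33 MHz clock, 320x240 frame, ~0.8 instructions per clock, ~5 clocks per uncached memory access); the function names are made up:

```c
/* Back-of-the-envelope check of the figures above. All inputs come
 * straight from the comment; nothing here is measured. */
static double clocks_per_pixel(double clock_hz, int w, int h, double fps)
{
    return clock_hz / ((double)w * h * fps);
}

static double insns_per_pixel(double clocks, double ipc)
{
    return clocks * ipc;        /* ~19 at 18 fps */
}

static double mem_ops_per_pixel(double clocks, double mem_clocks)
{
    return clocks / mem_clocks; /* ~4.8, i.e. at most 4 full accesses */
}
```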
A convincing 3d renderer averaging 19 instructions and 4 memory operations per pixel. And we're not even counting blit/video delays here. Good lord is that savage optimization work. Carmack is famous for a reason.
P.S. The really scary thought is that Doom would hypothetically run on any 386 machine -- can you imagine painting e.g. 160x120 on a cacheless 20 MHz 386 laptop?
IIRC, the 486 was not pipelined, so you didn't get 19 instructions, 4 of which could be memory instructions. You got 19 arithmetic instructions or 4 memory instructions, or something in between (without checking your math, but as I recall, that's roughly right).
There are basically two inner-loop types in a game like Doom: the wall loop and the floor loop. The wall loop renders a vertical strip of pixels on a wall and the floor loop renders a horizontal strip of floor. Each of these loops is actually rather trivial to write and just walks over the pixels, reading an input texel, modulating it by lighting, and then writing it out to the screen buffer. (Actually, Doom had transparent textures for some walls, so that's a third type of loop.) The wall loop is slightly simpler because it can be an axis-aligned walk through the input texture.
There really isn't very much room to optimize these loops. You can unroll them. You can play with how you do the adds and carries and maybe shave off another instruction. For walls, you can organize your texture data so that vertical neighbors are consecutive in memory. You can be very choosy about your lighting function. The hard work was setting everything up for your inner loop.
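For a sense of how little room there is, here's a hypothetical sketch of such a wall-column inner loop in C, assuming 16.16 fixed-point texture stepping, 64-texel-high columns stored contiguously, and a colormap lookup for lighting. None of these names are from the released source:

```c
#include <stdint.h>

/* Hypothetical sketch of a wall-column inner loop: 16.16 fixed-point
 * stepping down one texture column, with a 256-entry colormap folding
 * the light level into a palette-index lookup. */
void draw_wall_column(uint8_t *dest,           /* top pixel of the strip   */
                      int dest_stride,         /* screen width in bytes    */
                      const uint8_t *texcol,   /* one 64-texel column      */
                      const uint8_t *colormap, /* light level -> palette   */
                      uint32_t frac,           /* 16.16 texture coordinate */
                      uint32_t fracstep,       /* 16.16 step per pixel     */
                      int count)               /* pixels in the strip      */
{
    while (count-- > 0) {
        /* read texel, modulate by lighting, write to the frame buffer */
        *dest = colormap[texcol[(frac >> 16) & 63]];
        dest += dest_stride;
        frac += fracstep;
    }
}
```

Per pixel that's one texel read, one colormap read, one framebuffer write, and a couple of adds -- there's genuinely almost nothing left to remove.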
You're right. What I really meant was that instructions were processed 'in order'. An uncached memory access slowed things down a particular amount and there was no way to cover the latency with other instructions.
The figure I found suggested that a 486 averages ~0.8 instructions per clock running full-tilt. That seems impossible unless one hardly ever hits memory.
I'm not 100% certain, but I think the pipelining would allow you to execute (some) register-only instructions in the x86 equivalent of a delay slot.
According to the specs, the original 80486 could read or write 16 bits per clock. That's probably on a DX - on a DX2 it's probably 2 clocks. And that's with in order execution - the bus was running at the same frequency as the CPU.
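For scale, that claim works out to a peak of about 66 MB/s on a 33 MHz bus moving 16 bits per clock (a quick check using only the numbers above):

```c
/* Peak bus bandwidth implied above: bytes per clock times frequency. */
static double peak_bus_bytes_per_sec(double clock_hz, int bytes_per_clock)
{
    return clock_hz * (double)bytes_per_clock; /* 66e6 at 33 MHz, 2 bytes */
}
```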
If I recall correctly, you had to shrink the view port on lower-end 486s for playable framerates. Also, the status bar shrank the area to be rendered to some extent.
Your point still stands, just pointing out that the target area for rendering was often smaller than 320x240.
I remember playing multiplayer Doom over a modem back in the day. I had to shrink the view port down to the size of a postage stamp to maintain reasonable performance; I felt like a T-Rex playing like that (if you didn't move, I probably wouldn't notice you [especially if you were indigo]).
I remember it was very playable on the 33 MHz, 8 MB 486DX machines I had at my high school. A friend had a 50 MHz, 12 MB 486SX-2, which was wonderful for Doom. I sadly only had a 16 MHz 386SX with 2 MB of RAM. I amazingly did get Doom to "run" on this machine through some virtual memory program for Win 3.1. I had to shrink the screen size down to the smallest setting for it to run at about 1 fps.
I also remember a couple of other people that had "486" upgrades for their 386 based systems and Doom was perfectly playable on those systems as well. RAM was the big limiting factor that I remember.
I also did that on a friend's 25 MHz 386SX machine that had only 2 MB of RAM. Doom required at least 4 MB (sounds ridiculous as I type this), and although it used the DOS/4GW DOS extender, which was capable of swapping, it never worked.
On the other hand, Win 3.1 was also capable of swapping and ran Doom. It was absolutely unplayable though :)
You know, I still have that 386 in my mom's attic. I'm tempted to break it out and see what it can do.
Ultima VII required a 33 MHz 386DX w/ 4 MB of RAM, and my computer ran it fine with a boot disk. If memory serves (and I was 7 at the time, so it probably doesn't), a DOS 6.22 boot disk contained fields for page size, which leads me to believe it had support for virtual memory. That said, I do remember the massive stink that was made about virtual memory when Windows 95 came out, so I could be wrong.
I'm highly tempted to break out that old box and play around with it. This conversation tickles my nostalgia bone.
Doom needed 4 MB of free memory. If you had 4 MB total, you had to create a boot disk. 8 MB machines worked fine without a boot disk.
I also, amazingly, remember running some early version of Pagemaker on that machine. When I went to link text, I would click the mouse to start the process, go get a sandwich, watch some TV and come back 10 minutes later when it finished.
"The really scary thought is that Doom would hypothetically run on any 386 machine -- can you imagine painting e.g. 160x120 on a cacheless 20 MHz 386 laptop?"
I used to play Doom on an AMD 386SX/33 with a ULSI coprocessor (an i387 clone).
The first (and currently only) comment brought me back years! A major performance trick back then was to ensure that code and data stayed in the (very small) cache, and to prefer structures that would be read in order.
====================
"Because walls were rendered as columns, wall textures were stored in memory rotated 90 degrees to the left. This was done to reduce the amount of computation required for texture coordinates"
The real reason is faster memory access when reading linearly on old machines: fewer CPU cache misses. It's an old trick, used for smooth rotozoomer effects in the demo scene years ago.
Actually, rotating a texture 90 degrees in memory when making a rotozoomer still results in large amounts of cache misses depending on the current rotation angle. In order to get a smooth rotozoomer effect you have to use a technique called block rendering. This means you divide the image into square cells and render these one by one. Because the pixels in a cell are always near each other, you can reduce cache misses. There is an old snippet from Niklas Beisert floating around the web that explains exactly how this works.
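A minimal sketch of that block-rendering idea, assuming a 256x256 wrapping source texture and 16.16 fixed-point stepping. The tile size and all names are illustrative, not from Beisert's snippet:

```c
#include <stdint.h>

/* Tile-based rotozoomer sketch: the screen is walked in small square
 * tiles so that each tile's source-texture reads stay close together
 * in memory, reducing cache misses at awkward rotation angles. */
void rotozoom_blocks(uint8_t *dst, int dst_w, int dst_h,
                     const uint8_t *src,           /* 256x256 texture, wraps */
                     int32_t du_dx, int32_t dv_dx, /* texel step per screen x */
                     int32_t du_dy, int32_t dv_dy, /* texel step per screen y */
                     int tile)                     /* tile edge in pixels */
{
    for (int by = 0; by < dst_h; by += tile) {
        for (int bx = 0; bx < dst_w; bx += tile) {
            int h = (by + tile < dst_h) ? tile : dst_h - by;
            int w = (bx + tile < dst_w) ? tile : dst_w - bx;
            for (int y = 0; y < h; y++) {
                int32_t u = (int32_t)bx * du_dx + (int32_t)(by + y) * du_dy;
                int32_t v = (int32_t)bx * dv_dx + (int32_t)(by + y) * dv_dy;
                uint8_t *p = dst + (by + y) * dst_w + bx;
                for (int x = 0; x < w; x++) {
                    /* 256x256 wrap: high byte of v is the row, of u the column */
                    *p++ = src[((v >> 8) & 0xFF00) | ((u >> 16) & 0xFF)];
                    u += du_dx;
                    v += dv_dx;
                }
            }
        }
    }
}
```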
Actually, the real reason was that Doom ran in "Mode X", where it was actually faster to rasterize vertical strips rather than horizontal ones. This worked out especially well in Wolf3D and Doom, since perspective correction in the vertical direction was never needed.
Actually, it didn't; it ran in the bog-standard 320x200 256-color mode, mostly because "Mode X" back then wasn't as widely supported on lots of low-end machines.
Looking at the Doom source, it seems that the actual 3D game view was rendered into a linear buffer, and subsequently blitted onto the mode X surface. Some UI elements went straight to the screen.
(The DOS Doom source was never publicly released, just the Linux port, but there are remnants of some DOS code in the archive. I'm specifically looking at R_DrawSpan in README.asm and V_DrawPatchDirect in v_video.c)
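For anyone wondering why that blit is nontrivial: Mode X is planar, so a linear buffer can't be copied byte-for-byte. A sketch of just the address arithmetic (the actual plane select goes through the VGA sequencer's map-mask register, port 0x3C4 index 2, which is omitted here):

```c
/* In Mode X, pixel (x, y) lives on plane x mod 4, at byte offset
 * y * 80 + x / 4 within that plane (320 pixels / 4 planes = 80 bytes
 * per scanline per plane). A blit from a linear buffer selects each
 * plane in turn and copies every 4th pixel. */
typedef struct { int plane; int offset; } modex_addr;

static modex_addr modex_map(int x, int y)
{
    modex_addr a;
    a.plane  = x & 3;
    a.offset = y * 80 + (x >> 2);
    return a;
}
```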