
Yeah, it was that way for all previous ARM processors too, for exactly that reason. Adding special cases would have increased the transistor count, for no great benefit.

The only downside was that it exposed internal details of the pipelining, IIRC. In the ARM2, a read of the PC would give the current instruction's location + 8, rather than its actual location, because by the time the instruction 'took place' the PC had moved on. So if/when you change the pipelining in future processors, you either break older code or have to special-case the current behaviour of returning +8.
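To make that concrete, a tiny sketch of my own (not from the thread) showing the effect:

    here    MOV   R0, PC        ; R0 = address of 'here' + 8, not 'here' itself
            SUB   R0, R0, #8    ; adjust back to the actual address of 'here'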

Anyway, I don't like their reaction. What they mean is 'this decision makes writing an emulator more tricky' but the author decides that this makes the chip designers stupid. If the author's reaction to problems is 'the chip designers were stupid and wrong, I'll write a blog post insulting them' then the problem is with the author.




Hey, I'm the author, sorry it came off that way, I was really just poking fun. I should've definitely phrased that better!

But no, I really think that making the program counter a GPR isn't a good design decision - there are pretty good reasons why no modern arches do things that way anymore. I admittedly was originally in the same boat when I first heard of ARMv4T - I thought putting the PC in as a GPR was quite clean, but I soon realized it just wastes instruction space, makes branch prediction slightly more complex, and decreases the number of available registers (increasing register pressure), all while providing marginal benefit to the programmer.


I think I misread your tone, sorry.

It's a good article though, the explanation of how multiplies work is nicely written.


No worries, I'm glad you brought it up so I could amend the article to be more respectful to those that I look up to :)


Anyway, I took the time to rewrite that paragraph a bit to be more respectful. :P Hopefully that'll come off better. The website takes a few minutes to update.


The reason is that it makes subroutine return and stack-frame cleanup simpler.

You know this, but background for anyone else:

ARM's subroutine calling convention places the return address in a register, LR (which is itself a general purpose register, numbered R14). To save memory cycles - ARM1 was designed to take advantage of page mode DRAM - the processor features store-multiple and load-multiple instructions, which have a 16-bit bitfield to indicate which registers to store or load, and can be set to increment or decrement before or after each register is stored or loaded.

The easy way to set up a stack frame (the way mandated by many calling conventions that need to unwind the stack) is to use the Store Multiple, Decrement Before instruction, STMDB. Say you need to preserve R8, R9, R10:

STMDB SP!, {R8-R10, LR}

At the end of the function you can clean up the stack and return in a single instruction, Load Multiple with Increment After:

LDMIA SP!, {R8-R10, PC}
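Put together, a function using that convention looks roughly like this (my own sketch, with made-up labels):

    my_func STMDB  SP!, {R8-R10, LR}   ; prologue: push the callee-saved registers and the return address
            MOV    R8, R0              ; body is free to clobber R8-R10...
            BL     some_helper         ; ...and nested calls can overwrite LR, since a copy is on the stack
            LDMIA  SP!, {R8-R10, PC}   ; epilogue: restore the registers and return, all in one instruction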

This seemed like a good decision to a team producing their first ever processor, on a minimal budget, needing to fit into 25,000 transistors and to keep the thermal design power cool enough to use a plastic package, because a ceramic package would have blown their budget.

Branch prediction wasn't a consideration because the ARM1 didn't have any, and register pressure likely wasn't either for a team coming from the 3-register 6502, where the registers are far from orthogonal.

Also, it doesn't waste instruction space: you already need 4 bits to encode 14 registers, so letting the PC take one of the 16 slots costs nothing, and it means you don't need a 'branch indirect' instruction (you just do MOV PC,Rn) or a separate 'return' (MOV PC,LR if there's no stack frame to restore).
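It also gives you the classic indirect-call idiom used before BLX existed, which leans on the +8 behaviour mentioned upthread (again my own sketch):

    MOV   LR, PC        ; LR = this instruction's address + 8, i.e. the instruction after the next one
    MOV   PC, R4        ; jump to the routine whose address is in R4; a MOV PC,LR there returns just below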

There is a branch instruction, but only so that it can accommodate a 24-bit immediate (implicitly left-shifted by 2 bits so that it actually addresses a 26-bit range, which was enough for the original 26-bit address space). The MOV immediate instruction can only manage up to 12 bits (14 if doing a shift-left with the barrel shifter), so I can see why Branch was included.
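(If my arithmetic is right, that's 2^24 word offsets x 4 bytes per word = 2^26 bytes = 64 MB of reach, spanning the whole original address space.)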

Indeed, mentioning the original 26-bit address space: this was because the processor status flags and mode bits were also available to read or write through R15, along with the program counter. A return (e.g. MOV PC,LR) has an additional bit, written as an S suffix, indicating whether to restore the flags and processor state. If you were returning from an interrupt it was necessary to write "MOVS PC, LR" to ensure that the processor mode and flags were restored.
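For anyone who hasn't seen it, the 26-bit R15 packed everything into one word; roughly (from memory, so treat as a sketch):

    ; R15 on the original 26-bit ARMs:
    ;   bits 31-28   N Z C V   condition flags
    ;   bits 27-26   I F       interrupt disable bits
    ;   bits 25-2    PC        word-aligned program counter
    ;   bits  1-0    M1 M0     processor mode (User, FIQ, IRQ, SVC)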

It was acceptable in the 80s, it was acceptable at the time...

Ken Shirriff has a great article "Reverse engineering the ARM1" at https://www.righto.com/2015/12/reverse-engineering-arm1-ance....

Getting back to multipliers:

ARM1 didn't have a multiply instruction at all, but experimenting with the ARM Evaluation System (an expansion for the BBC Micro) revealed that multiplying in software was just too slow.
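To give a feel for why that was slow: without a MUL instruction you're looking at a shift-and-add loop along these lines (a rough sketch of mine), which burns several cycles per bit of the multiplier:

    ; unsigned multiply, R0 := R0 * R1 (low 32 bits), clobbers R1, R2
            MOV   R2, #0              ; running total
    loop    MOVS  R1, R1, LSR #1      ; shift multiplier right; shifted-out bit lands in carry
            ADDCS R2, R2, R0          ; if that bit was set, add the (shifted) multiplicand
            MOV   R0, R0, LSL #1      ; double the multiplicand for the next bit
            CMP   R1, #0
            BNE   loop                ; keep going until no multiplier bits remain
            MOV   R0, R2              ; result back in R0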

ARM2 added the multiply and multiply-accumulate instructions to the instruction set. The implementation just used Booth recoding, performing the additions through the ALU, and took up to 16 cycles to execute. In other words, it handled only one Booth chunk per clock cycle, exiting early when there was no more work to do. And as in your article, it used the carry flag as an additional bit.

I suspect the documentation says 'the carry is unreliable' because the carry behaviour could differ between the ARM2 implementation and the ARM7TDMI when given the same operands. Or indeed between implementations of ARMv4, because the fast multiplier was an optional component, if I recall correctly. The 'M' in ARM7TDMI indicates the presence of the fast multiplier.


> Adding special cases would have increased the transistor count, for no great benefit.

No, it's exactly backwards: supporting the PC as a GPR requires special circuitry, especially in the original ARM, where the PC was not even fully part of the register file. In "VLSI RISC Architecture and Organization", section 4.1 "Instruction Set and Datapath Definition", Stephen Furber describes quite a lot of additional activity that happens when the PC is involved (which can affect instruction timings and require additional memory cycles).



