
Alignment requirements are and have historically been very common -- you can see them on the PDP-11, the 680x0, and so on. It's only because a few very popular architectures like x86 have had very loose or no alignment requirements that we've ended up with a lot of code that assumes there is no alignment requirement, and this has dragged other architectures down the "we need to support this" path. If your architecture faults on misaligned accesses it's really not hard to deal with -- you usually have to be doing something a bit odd to even run into the problem.
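To make it concrete, here's a minimal C sketch (function names invented) of how the problem usually appears -- reading a multi-byte field out of a byte buffer at an odd offset -- plus the memcpy idiom that sidesteps it on strict-alignment hardware:

    #include <stdint.h>
    #include <string.h>

    /* buf + 1 is almost certainly not 4-byte aligned, so this
       cast is undefined behaviour in C, and the load faults on
       strict-alignment architectures. */
    uint32_t parse_in_place(const uint8_t *buf)
    {
        return *(const uint32_t *)(buf + 1);  /* misaligned load */
    }

    /* Portable version: memcpy into an aligned local. Compilers
       for lax-alignment targets (e.g. x86) lower this to a single
       load; strict-alignment targets get byte loads instead. */
    uint32_t parse_portable(const uint8_t *buf)
    {
        uint32_t v;
        memcpy(&v, buf + 1, sizeof v);
        return v;
    }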



I can understand the historical requirements for alignment, the necessary transistors, and whatnot. But, much like branch-delay slots, there is no modern reason to expose this to the programmer. Of course, I made an exception for atomics, but if you will, they're like memory-mapped communication, and now that all I/O is memory-mapped, with no concept of ports, the (ordering) semantics of memory access become really important.
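For what it's worth, here's a minimal C11 sketch of the ordering point: handing data to another thread takes release/acquire semantics on the flag, not just an aligned store (the spin loop is deliberately naive):

    #include <stdatomic.h>

    int payload;            /* plain data being handed over */
    atomic_int ready;       /* flag carrying the ordering   */

    void producer(void)
    {
        payload = 42;
        /* release: prior writes become visible before the flag */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        /* acquire: pairs with the release store above */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;               /* spin (sketch only) */
        return payload;     /* guaranteed to observe 42 */
    }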

I'm also the weirdo that feels process isolation, memory management, and I/O mechanisms need a rethink. But that's something that would take me forever to get into.

One thing I will say, though, is that alignment issues "infect" everything. Assume your architecture doesn't allow misaligned access. Now all your data has to be naturally aligned. Your structs have to be aligned to the alignment of their most strictly aligned member. This is all because code is alignment-sensitive. Given a pointer to a struct, generic code is unnecessarily larger. And why would we care? Communication, of course. If we're exchanging data between systems, then idiosyncrasies such as this suddenly become globally visible.
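A small C illustration of that "infection" (field names invented): the most strictly aligned member dictates the padding and the overall size, and that layout leaks straight into any wire format that just dumps the struct:

    #include <stdio.h>
    #include <stdalign.h>

    struct message {
        char   tag;    /* offset 0                          */
        /* 7 bytes of compiler-inserted padding             */
        double value;  /* offset 8: needs 8-byte alignment  */
        short  count;  /* offset 16                         */
        /* 6 bytes of tail padding to keep arrays aligned   */
    };

    int main(void)
    {
        /* Typically prints "24 8" on an LP64 system, even
           though the members hold only 11 bytes of data. */
        printf("%zu %zu\n", sizeof(struct message),
               alignof(struct message));
        return 0;
    }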

Endianness must be little. Byte alignment should be a non-issue, and network bit order should run from bit zero up, with any upper-layer need, say for cut-through forwarding, expressed as a data-ordering requirement: so, for example, an IPv4 address is not a blind 32-bit word, but specifies the structure of those 32 bits.
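In code terms, this is the usual portable-serialization sketch: encode by shifts, so the host's byte order and alignment rules never reach the wire:

    #include <stdint.h>

    /* Write a 32-bit value little-endian, byte by byte; the
       output is identical on any host, whatever its own
       endianness or alignment rules. */
    void put_le32(uint8_t *out, uint32_t v)
    {
        out[0] = (uint8_t)v;
        out[1] = (uint8_t)(v >> 8);
        out[2] = (uint8_t)(v >> 16);
        out[3] = (uint8_t)(v >> 24);
    }

    uint32_t get_le32(const uint8_t *in)
    {
        return (uint32_t)in[0]
             | (uint32_t)in[1] << 8
             | (uint32_t)in[2] << 16
             | (uint32_t)in[3] << 24;
    }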


Even today, allowing unaligned accesses is still not free -- there is an implementation cost in transistors and in design complexity. There's a tradeoff here, as usual. There are a lot of places within a CPU architecture where there's a choice of "do we handle this in hardware, at the cost of having more hardware, or do we say it's software's job to deal with this, and hardware provides either nothing or just some helpful tools". You can see this for instance in whether software has to perform icache/dcache maintenance vs the CPU doing a lot of snooping to present the illusion of a completely coherent system; in whether hypervisor virtual-machine switching is done with a single "switch all my state" operation or by letting hypervisor software switch register state itself; and in many other places. x86 has in my view generally ended up on the "handle things in hardware and make software's life easier" side, which it's been able to do because its natural territory is desktop/server, where extra transistors don't hurt much. Other architectures tend towards different points on this spectrum because their constraints differ -- in embedded systems the extra power and area cost of more transistors can really matter. "Tend to prefer that software do something" is also a strand of the original RISC philosophies.

Practically speaking, the world is not going to converge on a single endianness or on a no-alignment-restrictions setup any time soon, so we have to deal with the world as it is. If you're programming in a sensible high-level language, it will deal with this kind of low-level nit for you. If you're programming in a low-level language (like C), well, I think you wouldn't be doing that if you didn't have fun at some level in feeling like you had a mastery of the low-level nits :-)


You sound like an architecture person (I'm not, btw), so maybe you can give the lowdown on this.

Why registers? I haven't studied the Tomasulo algorithm in any detail, but if you're going to do "register renaming", why have registers at all? You could, for example, treat memory as an if-needed-only backing store, and then add a "commit" instruction that commits memory (takes an address, or a range). Sure, you'd need to change how you do memory-mapped I/O and protection, but at a basic level: why registers?

I'm glad FPGAs are becoming a thing, and I think we're about a decade or two away from ASICs as a service, because if you're not beholden to tradition, you really can work some magic. Of course I'll be pretty rusty by then, but who knows, maybe medicine will keep me feisty.


Because that would require longer instructions and thus more memory.

Instructions on a CPU look something like the following (this is based on MIPS, since x86 is a mess). The first 6 bits are the opcode; the rest is instruction-specific. For an add, the next 15 bits are 5 bits each for the two source registers and the destination register, and the remaining bits select the exact operation (for example, whether the add traps on overflow).
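As a sketch, that R-type layout packs into a 32-bit word like this (add is opcode 0 with function code 0x20; $t0/$t1/$t2 are registers 8/9/10):

    #include <stdint.h>

    /* MIPS R-type: opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6) */
    uint32_t encode_rtype(uint32_t opcode, uint32_t rs, uint32_t rt,
                          uint32_t rd, uint32_t shamt, uint32_t funct)
    {
        return (opcode & 0x3Fu) << 26
             | (rs     & 0x1Fu) << 21
             | (rt     & 0x1Fu) << 16
             | (rd     & 0x1Fu) << 11
             | (shamt  & 0x1Fu) << 6
             | (funct  & 0x3Fu);
    }

    /* add $t2, $t0, $t1 */
    uint32_t add_t2_t0_t1(void)
    {
        return encode_rtype(0, 8, 9, 10, 0, 0x20);
    }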

If instructions instead worked only on memory, each one would have to carry all three memory addresses (you'd also have room for far more opcodes, but there isn't enough room on a CPU to implement that many instructions anyway, so who cares). This means every CPU instruction needs to read several times as much memory before doing anything: with 64-bit addresses, a three-operand add balloons from a 4-byte instruction to something on the order of 25 bytes. Worse, most of those operands are really pointers: when you compile the code you don't know the final locations, so in most cases you read the instruction from the program, then go back to the stack to read the addresses of the values, then read those locations. That is a lot of memory accesses, and memory access is expensive. Of course you can say "just use caching", but cache is expensive, and now you need roughly three times as much -- too big for the fast level-one cache, so now you're expanding level-two and seeing a lot more misses in level one.

The above would all be okay, but it turns out that, given enough registers (x86 fails here), in most cases you are operating on the same set of values all the time (indeed, the stack locations each of the above refers to are probably a small set of variables), so if the compiler is careful it can manage all that. The compiler has better information on when things need to be committed to memory anyway, so let it handle that.
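A quick C illustration of that last point: with optimization on, the accumulator below lives in a register for the whole loop; marking it volatile (a stand-in here for "every operation must hit memory") forces a load and store per iteration, which is roughly what an all-memory-operand ISA would pay on every instruction:

    /* With -O2, `sum` is register-resident; memory is only
       touched to read a[i] and for the final return value. */
    long sum_array(const long *a, int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* volatile forces a memory load and store of `sum` on
       every iteration -- the all-memory-operand behaviour. */
    long sum_array_volatile(const long *a, int n)
    {
        volatile long sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }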


Not really. The TMS9900 [1] used memory as registers and had a fairly compact instruction set. Yes, it does have registers, but only three (a program counter, a status register, and a pointer to the current "register set" in memory). At the time, it was regarded as a slow machine, probably because of all the memory-to-memory operations.

[1] https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900


Sure, this is one technique. You could also, akin to jump instructions, have a concept of data locality versus instruction locality. You can do this in a lot of ways without resorting to something like segmentation, which everybody hates. A trivial version would be something like a current "data pointer", which would see useful implicit updates as well as explicit ones (akin to a long jump).


Not all CPUs do register renaming, and almost all architectures will have started out being defined for a CPU which didn't do renaming. Even today, lower end CPUs (think the embedded market) don't do register renaming. If you want your architecture to be able to cover down to the low end then anything that drops the idea of a register file is a non-starter. Also, a register-based architecture is well understood, in terms of how to implement it effectively, how to exploit it in compiler design, and how to hand-code assembly for it when necessary. You need a really strong argument to justify taking the weird and innovative route, usually.



