RiscIn2022

Some thoughts on hardware instruction sets.

Instruction sets are an API for talking to the hardware. Having this API at the right level of abstraction is Important. There are lots of features that can go into this API that make it “better” or “worse” for particular use cases.

RISC was a set of design principles developed in the 1980’s that enabled hardware to get much faster and more efficient. We tend to still call modern-looking instruction sets “RISC-y”, but really, a bunch of the original design principles of RISC CPU’s have not stood the test of time. So let’s look at the things that have worked and not worked between the 1980’s and 2022.

Note I am not a CPU designer, nor am I really an expert in compilers and assembly language. I’m a software dude who writes compilers and OS’s as a hobby, and likes weird architectures.

Last updated in July 2023, despite the date of the article. Hey I tried.

things to fix: conversation here, https://lobste.rs/s/fgfxvu/risc_2022. Also this wrt subregisters.

Case studies

Old systems that started in the 1980’s and evolved:

  • x86_64
  • ARM32
  • PPC/POWER

New systems made from whole cloth in the 2010’s:

  • Aarch64
  • RISC-V

Historical things that currently appear to have little future:

  • MIPS
  • Itanium
  • x86 (32-bit-only version)

Things I don’t know much about:

  • SPARC
  • PA-RISC

Out of scope:

  • 68K (stopped evolving in the 90’s afaik)
  • Alpha (maybe? it also stopped evolving in the 90’s, but later than 68k)
  • SH4 (does well in embedded space but never really wanted to grow above that)

My best sources are Raymond Chen’s computer architecture articles, reading instruction set references, and writing code.

Things that work

32ish general purpose registers

Having more registers is good because it gives you more working space without needing to touch slow memory. Having more registers is good because it gives the compiler more options for where to put your program’s data without needing to add instructions to shuffle things around.

Having more registers is bad because it means on every function call or OS context switch, you need to save and restore these registers, adding overhead to these very common operations. Having more registers is bad because each instruction needs to specify which registers it operates on, and if you have tons of registers you are spending more bits per instruction storing just which registers its arguments are in.

32 registers is a decent, pragmatic sweet spot for 32-bit 3-address instructions. (More on that below.) Encoding 3 registers per instruction takes 15 bits, leaving you 17 bits for Other Stuff, which is a pretty useful amount of space. 16 registers is pretty good; it’s common on smaller systems that want more compact instructions (ARM32, SuperH), but it’s not too hard to write programs that could really use more registers if they were available. 64 registers with 3 registers per instruction would require 18 bits out of 32 just for register addressing, which is a bit much, and results in a greater proportion of programs which don’t really need all the available scratch space.

This does mean that it is difficult to create immediate values larger than the spare 17/21/whatever bits that you can fit into one instruction. Creating full 32 or 64 bit literals will generally take multiple instructions. This includes creating addresses/offsets for both data and function calls/jumps! There are many patterns for doing this efficiently, depending on the architecture. You may need to load the high and low 16 bits of a 32 bit value separately in sequential instructions. You may be able to easily create values relative to the program counter, and use these for function calls or to load constants from tables the compiler sprinkles between function code. You may have a “global register” that the calling convention reserves to point to the program’s data segment, so you can always load values relative to that register. You may have instructions that create larger bit patterns out of smaller ones, such as a single instruction that takes 0x1F and loads it into a 32-bit register as 0x1F1F1F1F. And so on. Nobody really likes this, as far as I can tell, but so far it seems to mostly be worth the extra instructions in exchange for a simpler and more regular instruction format. Having variable-length instructions that can have a full 32 or 64 bit value stuck into them would add complexity to the CPU’s instruction decoder and would not really save that much space. Assemblers even tend to have shortcuts to handle the construction of large constants for you, so you can write load_immediate r1, 0x1234_5678_9ABC_DEF0 and it will find somewhere to stash the value 0x1234_5678_9ABC_DEF0 and fabricate the instruction sequence to get it for you. So the only people it really bothers in practice are compiler writers.
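As a concrete sketch of the arithmetic hiding behind a load_immediate pseudo-instruction, here’s the RISC-V-style split of a 32-bit constant into a “load upper 20 bits” chunk plus a sign-extended 12-bit “add” chunk (the helper name is mine, not anything from a real toolchain):

    #include <stdint.h>
    #include <stdio.h>

    /* Split a 32-bit constant into the two pieces a RISC-V-style
     * lui (load upper 20 bits) + addi (add sign-extended low 12 bits)
     * pair would use.  Because addi sign-extends its immediate, the
     * upper chunk has to be rounded up whenever bit 11 is set. */
    static void split_imm32(uint32_t value, uint32_t *hi20, int32_t *lo12) {
        *hi20 = (value + 0x800u) & 0xFFFFF000u; /* what lui supplies */
        *lo12 = (int32_t)(value - *hi20);       /* in [-2048, 2047]  */
    }

    int main(void) {
        uint32_t hi;
        int32_t lo;
        split_imm32(0x12345FFFu, &hi, &lo);
        /* Prints hi=0x12346000 lo=-1: lui puts 0x12346000 in the register,
         * then addi adds -1 to land back on 0x12345FFF. */
        printf("hi=0x%08X lo=%d\n", (unsigned)hi, (int)lo);
        return 0;
    }

On a 64-bit target the real sequences get longer (shifts plus more adds, or a PC-relative load from a constant table), but the flavor is the same.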

TODO: Apparently ARM did an internal analysis showing that, for most code, 16 registers performed about as well as 32. This was apparently internal-only but might have external references somewhere. You can also easily modify LLVM to only use 16 registers and see what that does to various code. Here are some sources that investigate different register file sizes: https://doi.org/10.1109/ACAC.2001.903365 and https://doi.org/10.1109/ISCA.1993.698564. Indeed, both show that 32 registers is only a few percent better than 16; I thought it was a bigger difference! The newer paper is from 2001 though; I don’t expect backend compiler tech to have changed that much since then, but it would be nice to update the findings.

3-address codes

This is not a hard and fast rule, but having most operations of the form rD = rA + rB is a pretty good way to structure your instruction set. It just gives the compiler a lot of flexibility for which registers to put intermediate values into, which means the compiler doesn’t often have to do a lot of register shuffling to get things to where they need to be. Even if register renaming makes register-to-register moves super fast, they’re still instructions that have to be fetched and decoded, and ideally they wouldn’t be in your program at all.

The other common instruction encoding is a 2-address code of the form rD = rD + rA, ie, the result overwrites one of the operands. This is a lot less flexible, but it’s a common enough special case that some instruction sets use it to get smaller code. Having 2 registers instead of 3 forces the compiler to do more shuffling and copying of data when arranging large expressions, so an expression written as 2-address codes may be slower, or occasionally take more instructions, than one using 3-address codes.
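Here’s a toy illustration of the difference, as a little made-up IR (register numbers and struct names are mine): both sequences compute (a + b) * (c + d) while keeping a through d live afterwards, but the 2-address version needs two extra copies so the adds don’t clobber inputs it still needs.

    #include <stdio.h>

    /* Tiny made-up IR, just to count instructions. */
    enum op { MOV, ADD, MUL };
    struct insn3 { enum op op; int dst, a, b; }; /* 3-address: dst = a OP b     */
    struct insn2 { enum op op; int dst, src; };  /* 2-address: dst = dst OP src */

    static const struct insn3 three_addr[] = {
        { ADD, 5, 1, 2 },  /* r5 = r1 + r2 */
        { ADD, 6, 3, 4 },  /* r6 = r3 + r4 */
        { MUL, 7, 5, 6 },  /* r7 = r5 * r6 */
    };

    static const struct insn2 two_addr[] = {
        { MOV, 5, 1 },     /* r5 = r1   (copy so r1 survives) */
        { ADD, 5, 2 },     /* r5 += r2 */
        { MOV, 6, 3 },     /* r6 = r3   (copy so r3 survives) */
        { ADD, 6, 4 },     /* r6 += r4 */
        { MUL, 5, 6 },     /* r5 *= r6 */
    };

    int main(void) {
        long r[8] = { 0, 2, 3, 4, 5, 0, 0, 0 }; /* a=2 b=3 c=4 d=5 */
        long s[8] = { 0, 2, 3, 4, 5, 0, 0, 0 };

        for (int i = 0; i < 3; i++) {
            const struct insn3 *in = &three_addr[i];
            r[in->dst] = (in->op == ADD) ? r[in->a] + r[in->b]
                                         : r[in->a] * r[in->b];
        }
        for (int i = 0; i < 5; i++) {
            const struct insn2 *in = &two_addr[i];
            if (in->op == MOV)      s[in->dst]  = s[in->src];
            else if (in->op == ADD) s[in->dst] += s[in->src];
            else                    s[in->dst] *= s[in->src];
        }
        printf("3-address: %ld in 3 insns; 2-address: %ld in 5 insns\n",
               r[7], s[5]);
        return 0;
    }

When the inputs are dead afterwards the two forms come out the same length, which is why 2-address encodings are a tolerable trade for smaller instructions.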

Not every instruction fits into the 3-address format, of course. Loads and stores usually only need 2 register operands, and use the extra space to make offsets or literals larger. Aarch64 (and probably others) has a few 4-register instructions, such as rD = rA + (rB * rC), but those get their own specific encodings, so they’re deliberately a special case. Apparently a useful enough special case to exist though.

32 bit instructions, but also small code size

Canonically, it’s Best if all instructions are 32 bits long and on a 32-bit boundary. This makes your instruction decoder very simple: you load 4 bytes from a word boundary and that’s one instruction, no logic involved. If you have out-of-order execution then you can load 8 or 16 bytes or whatever at a time and still need to do exactly zero work to figure out where instructions are before starting to decode them. 4 bytes per instruction is a nice size for most things, it has enough space for 3-address encoding, thousands of instructions, and usually immediate values bigger than a byte.

On the other hand, instructions take up i-cache space and memory bandwidth, so the smaller your instructions are, the better. 4 bytes per instruction is actually slightly on the large side, historically speaking. 8 bytes per instruction is excessively flabby; most instructions really don’t need that much space.

On the third hand, if you have small instructions then it’s difficult for them to do as much as you can do with bigger instructions (see the section on 3-address codes), so you may need more small instructions to do the same work as fewer large instructions. 2 bytes per instruction feels pretty cramped a lot of the time; you need either 2-address codes or fewer registers, or both.

So in reality it seems like you want a balance: a variable-length instruction set that’s compact but still easy to decode. Both ARM32 and RISC-V have an optional “compressed” instruction format (Thumb-2 and the C extension respectively) that essentially allows 16-bit instructions to be mixed in with 32-bit ones, so that seems to be the current sweet spot. Both started with the approach that the base instruction set is all 32-bit, and then the most common and useful instructions get duplicate 16-bit encodings. (Thumb-2 seems to diverge from this a bit more than RISC-V; I don’t understand all the differences though.) In both, instructions must be 16-bit aligned, but 32-bit instructions may span 32-bit boundaries, so it’s a little more complicated to load instructions and find instruction boundaries, but not very much.
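For a sense of how cheap that decode is, here’s a sketch of RISC-V-style instruction sizing (if I’m reading the encoding scheme right): the length is determined entirely by the lowest bits of the first 16-bit chunk, so finding boundaries is a couple of bit tests.

    #include <stdint.h>
    #include <stdio.h>

    /* If the two lowest bits of the first halfword are not 0b11, it's a
     * 16-bit compressed instruction; if they are 0b11 (and bits 4:2 aren't
     * the reserved longer-encoding pattern), it's a 32-bit instruction. */
    static int insn_length(uint16_t first_halfword) {
        if ((first_halfword & 0x3) != 0x3)
            return 2;   /* compressed (C extension) */
        if ((first_halfword & 0x1C) != 0x1C)
            return 4;   /* standard 32-bit encoding */
        return 0;       /* reserved longer encodings, not handled here */
    }

    int main(void) {
        printf("%d %d\n",
               insn_length(0x0001),  /* c.nop: 16-bit */
               insn_length(0x0013)); /* low half of a 32-bit nop (addi x0,x0,0) */
        return 0;
    }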

(ARM’s Thumb-1 has different rules that are more restrictive, so I’m not going to consider it here. RIP Gameboy Advance programmers though.)

Aarch64 does not have variable length instructions (yet); I’m not sure whether it will grow them. However, from reading Raymond Chen’s blog posts about it, it looks like even with 32-bit instructions only, Aarch64 spends a lot of effort packing more work into fewer instructions. For example there’s no standalone integer multiplication instruction: every integer multiplication is a multiply-and-add, rD = rA + (rB * rC), and a plain multiply is just that with the zero register supplying the addend. There are also many instructions that do odd little bit-twiddling or slightly offbeat combinations of operations. So it could be that the Aarch64 designers have deliberately decided not to have variable-length instructions, but to have more complex instructions instead. A departure from RISC’s origins, indeed.
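As a small illustration of that packing (the function name is made up): compilers targeting Aarch64 will typically turn the whole expression below into a single madd, and a bare multiply is the same instruction with the zero register as the addend.

    #include <stdint.h>

    /* On Aarch64 this typically compiles to one madd instruction,
     * d = a + b * c, rather than a separate mul and add. */
    int64_t fused_example(int64_t a, int64_t b, int64_t c) {
        return a + b * c;
    }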

Load-store architecture

This is a biggie. Back in the day, it was common to have instructions that touched multiple memory addresses at once.

This made life harder for instruction decode and execution units (if I understand correctly), so RISC ditched it. You had enough silicon for a lot of registers (compared to the then-common 16 or 8 or even fewer), and so it made life easier to have every instruction operate on registers and only a few instructions move stuff from registers to memory and back.

Memory kept getting slower relative to the CPU, and caching and instruction reordering kept getting more important to keep the CPU fed with useful work, and so this decision worked out really really well in the long term. Maybe even by accident? Either way, the CPU has to reach out to memory relatively rarely and does a lot more of its work entirely in registers. Adding logic to the CPU to deal with caching intelligently is also much easier when only one memory location is touched at a time.

Few addressing modes

TODO, addressing modes are boring. Apparently Aarch64 has richer addressing modes than usual, which apparently pays off on larger systems. “The goal is to maximise the work done per instruction, not to minimise the encoding space for a single instruction. RISC-V optimises the other way, which is why it’s a good fit for microcontrollers but not as good for laptop or server class systems.” Modern compilers also seem to like lots of addressing modes, based on anecdote at least, so this might actually belong in the “Things that don’t work” category.

Big flat address spaces

This is more general technological evolution than RISC specific, but it kinda started with RISC and is a trend I see continuing. Addressing hardware and traces are now inexpensive enough that even if you have a microcontroller with only 64 KB of actual physical RAM (a 16-bit address space), having your CPU use a full 32-bit address space with 32-bit registers and pointers is usually not a big cost. Even if the actual chip has fewer than 32 address lines coming out of it, the internals are still all 32-bit. The “empty” address space gets used for memory-mapped devices, useful address tricks like virtual memory, etc. Sure, you could get a little more use out of your transistor budget by having the hardware not pretend it can have more than 16 bits per pointer. But we are wealthy enough now to waste some of that capability for the sake of simplicity.

Weak memory model

x86 is the only big architecture made since the 90’s(?) with a strong memory model. ARM32, Aarch64, and RISC-V all have weak memory models. MIPS, Itanium, SPARC and POWER all have weak memory models. ’Nuff said, really. Which is good, ’cause I don’t know much more about the details of why this is a good thing, other than “simpler data caches and instruction speculation”. From @dave_chisnall: “In a TSO architecture, you have to keep things in store queues for a lot longer because more things that happened on other cores can invalidate your memory access.”

Afaict the “memory model” of a processor is what promises it gives you about how different cores interact with the cache and other various layers of the memory hierarchy. If core A writes to memory and core B reads from the same location, how is that resolved? Again without trying to get too deep into it, “weak” means that the processor is free to reorder the reads and writes from separate cores however it feels like, unless you use a memory fence instruction to explicitly tell it not to. If each core is reading and writing to different sections of memory, each will see its own reads and writes happen in the order they were issued, even if reordering happens under the hood. But when they both modify the same place in memory, you could get abject nonsense – the archetypical race condition between threads. The two cores are not even guaranteed to see the same thing, until some later date when the memory system sorts its shit out.

On the other hand a “strong” memory model is one where different cores reading and writing to memory are guaranteed to all see the same sequence of read/write events, even if they’re interleaved in ways you didn’t intend. x86 appears to be the only common large processor that uses a strong memory model, though apparently SPARC had an optional operating mode that was almost as strong.
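The practical consequence for software is that portable code has to say what ordering it needs. A minimal sketch of the classic message-passing pattern using C11 atomics: on a weak-model machine the release/acquire pair is what keeps the flag from becoming visible before the data, while on x86’s strong model the same source compiles down to plain loads and stores anyway.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Writer publishes data, then raises a flag; reader waits for the
     * flag, then reads the data.  The release/acquire pair forbids the
     * compiler and a weakly ordered CPU from letting the flag "pass"
     * the data on its way to other cores. */
    static int64_t     payload;
    static atomic_bool ready;

    void writer(int64_t value) {
        payload = value;                                            /* 1: write data */
        atomic_store_explicit(&ready, true, memory_order_release);  /* 2: publish    */
    }

    int64_t reader(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                      /* spin until the writer publishes */
        return payload;            /* guaranteed to see the value from step 1 */
    }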

The DEC Alpha had a “super weak” memory model that apparently could sometimes reorder loads and stores for the same core unless you explicitly told it not to. This seems to have been more trouble than it was worth.

The strong memory model is such a pain in the ass to emulate on a weak-model chip that the Apple M series of chips has a special operating mode that enforces a strong memory model, just so they can emulate x86_64 code more efficiently. TBH we may never be free of it.

So basically, at some point memory reads and writes actually hit the physical RAM, and get resolved into a single definite sequence when they do. (At least, as far as us software people can tell.) However, to get there you have a chain of registers -> per-core cache -> shared cache -> actual RAM. You have multiple cores and multiple per-core caches, and you could even have multiple levels of shared cache shared between different cores (I think early Ryzen processors did this with their CCX’s). The memory model defines where in this chain this ordering into a single definite sequence occurs. In a weak memory model it occurs in the shared cache -> actual RAM step, so the memory system has the freedom to reorder reads and writes anywhere before that. In a strong memory model it happens in the per-core cache -> shared cache step, so either less reordering can happen or the system has to work harder to cover it up.

It’s more complicated than that in reality, because of course it is – different interactions of reads and writes can be re-ordered differently. More details on the guts of this here.

Things that don’t work

These are things that older RISC chips tended to do which seem to have become less common through time. As far as I know none of these are such fatal mistakes they have killed architectures, but they certainly make life less convenient for compiler writers, system programmers and/or hardware designers.

Reduced instruction sets

This one is a little cheeky but I had to put it in. The original Berkeley RISC I CPU had 31 instructions. I didn’t count carefully, but MIPS IV seems to have about a hundred instructions, not including floating-point. Alpha also has something like 100 integer instructions, at least in the v4 manual.

On the other hand, ARM32 has thousands of instructions. Aarch64 has hundreds of instructions if not thousands. RISC-V has a very small core of about 40 instructions, but if you want useful extensions that let you run an OS with memory protection, atomics, useful bit manipulation instructions, and an FPU, you are soon looking at more like 200-300 instructions. RISC-V vector instructions (SIMD) add another 200 or so.

Basically, there’s tons of little fiddly things you can add to a CPU that are worth the effort if you have the transistors to spend, especially when it comes to bit manipulations and floating point math. These days we usually have the transistors to spend on special-case performance. There’s a pretty strong incentive to add more instructions to do better on certain benchmarks or provide certain features, and not a whole lot of incentive not to. The difference between post-1990 “RISC” and pre-1990 “CISC” is that the emphasis has shifted from “add instructions that are useful for humans hand-writing assembly” to “add instructions that are useful for compilers to reason about”.

That said, the core instruction set does still tend to remain quite small. No capability is lost if you don’t have rD = rA and (not rB) as its own instruction, just performance. So, like variable-length instruction sets, large complex instruction sets are due to us spending transistors to save memory bandwidth or add special-case optimizations, which stretches instruction caches further and keeps pipelines smoother. If we could fetch instructions from memory faster than we could execute them, then having fewer and simpler instructions would probably still be the way to go.

Single-purpose instructions

Very related to the previous point. Code density is too important, and memory is too slow. Aarch64 and ARM32 both have many instructions that do multiple things at once, as long as they only do one memory read/write at a time. The best example I’ve seen is that Aarch64 has “load-and-increment” instructions for array indexing: they load data from an address stored in a register, then increment that address in the register. RISC-V has far fewer but they still exist: fused multiply-add, jump-and-link, some of the bit-manipulation instructions, and the proposed push/pop memory stack instructions. So this is not a hard and fast rule, but the trend seems to be that it’s worth having more complicated instruction execution to make very common sequences of operations smaller, within reason.

This surprised me when I learned about it, ’cause load-and-increment instructions used to exist in the old-school CISC instruction sets of the 70’s and 80’s, and were used as an example of the “do multiple things at once” philosophy that RISC eschewed. And now it’s back! The difference is that Aarch64’s multi-function instructions never read a value from memory, modify it, and write it back to memory all in one go. Nobody does that anymore. Having an instruction do multiple things at once does make it harder to reorder and needs more complicated execution machinery, but that’s less important than the fact that it sticks to the load-store model.
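To make the load-and-increment point concrete: walking an array with a pointer that gets bumped after every load is exactly the pattern those instructions exist for. In this sketch, an Aarch64 compiler can usually fold the load and the pointer bump into one post-indexed load (something like ldr x0, [x1], #8), where a strict one-job-per-instruction ISA needs a separate add.

    #include <stddef.h>
    #include <stdint.h>

    /* Sum an array by walking a pointer.  The "load, then advance the
     * pointer" step in the loop body is one post-indexed load on Aarch64;
     * on a stricter ISA it's a load plus an add. */
    int64_t sum_walk(const int64_t *p, size_t n) {
        int64_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += *p++;   /* load from p, then bump p by 8 bytes */
        return total;
    }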

A comment on these multi-purpose instructions from David Chisnall: “The key goal is that there should not be two instruction sequences of the same length, with no shorter version, that achieve the same thing. … Things like the bitfield extract and insert instructions replace a longer shift and mask sequence. In older CISC instruction sets, there were many equivalent-length instruction sequences for doing the same thing, which made it hard for compiler writers to know which they should pick.”

Complicated data dependencies

Flags registers, hardware traps, and instructions that modify multiple registers (a la x86’s MUL, which spreads its double-width result across rdx:rax) all make life a lot harder for the CPU, because there are more potential conflicts between one instruction writing data to a location and the next one reading from it.

This is basically the same problem as instructions that read or modify several things in memory, except more so, because things like flag register changes are implicit and happen on each instruction. In Ye Olden Days, instructions were executed strictly in the order they were given, you had one thread of execution per CPU, and it didn’t matter that instruction A had to write its results to the flags register before instruction B could read from it. Now it does matter. These sorts of “modify global state” operations were a lot easier to design before out-of-order and speculative execution became common. So older instruction sets like ARM32, MIPS and Alpha have them, and RISC-V does not. Aarch64 seems to straddle the line a little bit; it has a flags register but uses it more lightly than ARM32, and fewer operations affect it.

From David Chisnall again: “Flags are still there on everything except RISC-V, and even there they exist in the floating-point ISA. They are annoying to implement but they’re a huge win for software. Compilers are really good at if conversion now and you can get the same performance on an architecture with conditional moves as one without if you have about less than half as much branch predictor state. The saving in the branch predictor more than offsets the cost of flags. They’re also essential for any kind of constant-time code to be fast, which is increasingly important with modern systems and side channels.” So that one is still a tug-of-war between “easy for the CPU” and “easy for the compiler”, and also frankly easy for different parts of the CPU. Also, POWER has an actually quite interesting design where there’s 8 sets of flags registers. Explicit compares can use any of them, and most integer/float instructions can have a flag that tells them to set a default flag register according to the results (cr0 for integer, cr1 for floating point) or not set the flags at all ’cause they won’t be used. That’s a pretty cool design IMO. Apparently Itanium had something similar too.
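As an aside on the constant-time point in that quote: branchless selection is the kind of thing flags plus a conditional-select instruction make cheap. A sketch of the usual mask trick, which compilers will often lower to a csel/cmov-style instruction instead of a branch:

    #include <stdint.h>

    /* Constant-time select: returns a if cond is nonzero, else b, with
     * no data-dependent branch.  Compilers commonly turn this (or a
     * plain cond ? a : b) into a conditional-select instruction. */
    uint64_t ct_select(uint64_t cond, uint64_t a, uint64_t b) {
        uint64_t mask = (uint64_t)0 - (uint64_t)(cond != 0); /* all ones or all zeros */
        return (a & mask) | (b & ~mask);
    }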

Note that this avoidance of implicit side effects goes so far that things like integer divide by zero no longer cause a trap. Traps require stopping what the CPU is doing, waiting for all the pipeline and speculation machinery to notice, context switching to the OS, and handling the trap. That’s a lot of extra work to potentially have to do on any instruction, so newer CPU’s have as few instructions that can potentially trap as they can manage. In RISC-V and Aarch64, if you want to check for divide by zero, you do it explicitly; the CPU does not stop what it’s doing for you, it just gives you a garbage (but well-defined) result and moves on with life. In ARM32, trapping on divide by zero is an optional flag the CPU may not support.
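So on those targets, a language runtime that wants an error on division by zero has to emit the check itself. A hypothetical helper (the name is mine) might look like this; without the branch, the bare num / den with a zero divisor would just quietly produce RISC-V’s defined all-ones/-1 result and keep going:

    #include <stdbool.h>
    #include <stdint.h>

    /* The branch here is the "explicit check" described above; the
     * hardware itself won't trap on these targets. */
    bool checked_div(int64_t num, int64_t den, int64_t *out) {
        if (den == 0)
            return false;   /* caller decides how to report the error */
        if (den == -1 && num == INT64_MIN)
            return false;   /* the other division whose result doesn't fit */
        *out = num / den;
        return true;
    }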

Exposing too many hardware details

The first MIPS CPU had a pipelined instruction decoder, which meant that by the time the instruction decoder figured out whether the current instruction was a jump, the next instruction after it had already been loaded. This presents a problem for the CPU: it can either throw that loaded instruction away and accept a 1-instruction pipeline bubble on every jump, or it can use some sort of fanciness such as speculative execution to find something hopefully-useful to do during that pipeline bubble. Or you can do what MIPS did, which is execute that next instruction after the jump anyway, as if it were the first instruction at the destination of the jump. This is called a “branch delay slot”. Raymond Chen has some opinions about them. They’re weird, and cause some odd special cases, but compilers don’t really care; the only damage branch delay slots do is to the minds of people who have to read or write assembly code.

My understanding of this one is a little more vague, but afaict Itanium had instruction formats that packed 2 integer ops together, because it had 2 integer units. That way the instruction stream could, in the best case, keep the CPU fed with exactly what it needed to keep running at max utilization. No speculative execution or reordering or such needed; the compiler just knew what the CPU wanted and arranged the instructions to fit. (They still do this to an extent.)

Both of these features work fine for specific models of CPU. Exposing these details means that the CPU doesn’t have to do extra work to hide these behaviors from the program’s execution model. The problem is that the instant you change the hardware implementation, this suddenly becomes a liability where you have to paper over the differences from the old implementation anyway. If the latency before a jump is 2 cycles instead of 1, you suddenly need to add in all that reordering/speculative execution/whatever hardware to cover it up and make it look like 1 cycle. The branch delay slot suddenly saves you absolutely nothing. If you make a CPU that has 4 integer units instead of 2, you need to load and dispatch and track more instructions at a time, just as if your instructions could only have 1 integer operation at a time.

This is a great demonstration of how a good instruction set that scales across many devices and system sizes is not a literal description of how the computer works, but rather is an API for talking to it. Like any API it needs to be at a useful level of abstraction.

On the other hand, smol things like DSP’s use these sorts of compromises heavily, on the assumption that rewriting/recompiling a program for a new model of hardware is easier than spending more transistors to make the hardware more flexible.

Register windows (minor)

These aren’t so much of a bad design as just a design that didn’t pay off. Many CPU’s have had register windows: RISC-I, SPARC, Itanium are the ones I know of best. The idea seems good: instead of explicitly spilling registers to a stack and restoring them every function call, just have multiple sets of registers and have each function call switch from one bank to another. You can even make the banks partially overlap and use the overlapping portions for passing args and return values between functions. Neat, huh?

The thing is that once you fill up however many registers you have in your CPU, on every function call you have to spill/unspill registers anyway, so as programs get larger the benefits get smaller, and depending on how they’re implemented you can hit a sudden performance cliff when you run out of empty register banks. Register windows also probably make it more complicated to walk the stack for backtraces and stuff; if you need to handle a software exception or have a debugger inspect a program, you now need to be able to spill all your CPU’s registers onto the stack on demand. So, nobody seems to use register windows for general purpose CPU’s anymore. I’ve never heard anyone condemn them, they’ve just quietly died out. It seems like it was more effective to just have a bunch of registers and a calling convention that says which of them get saved to the stack across function calls.

On the other hand, apparently they’re still somewhat common in DSP’s and embedded devices, so maybe they’ll come back someday. I have been informed that “register windows remain great if you have SPARC-like circular view of them and a microarchitecture that does asynchronous background spilling. Oh, and a memory model weak enough that nothing on the spill stack is guaranteed visible to anything other than the spill / reload logic without an explicit barrier.” So basically they can still be beneficial if your CPU is smart enough to spill stuff in the background while the rest of the program is still doing stuff, which sounds potentially complicated but also like it’d solve the problem. I could certainly imagine a dedicated cache for spilling registers; Itanium also had a second stack pointer for this purpose, so the CPU can certainly be designed around this idea. Is it worth it? I dunno! These may come back someday.

Instruction condition codes (minor)

ARM32 had a condition code on every instruction, so you could write add.gt or whatever and it would only execute the add instruction if the greater-than flag was set. It’s a really cool idea. Unfortunately, it also does not seem to have been worth the costs in terms of instruction encoding space and execution complexity; it’s gone in Aarch64. It’s far easier to have only some instructions be conditional – mainly jumps and maybe a few loads/stores. x86’s “conditional move” instructions seem to be one of the few x86 instructions that compiler writers really like, though the exact tradeoffs involved in using them seem to have wobbled back and forth a few times since they were created.

In-order-execution + VLIW (added 2023)

Awright, let’s talk about Itanium! Itanium is fun and weird, and a great example of a billion-dollar boondoggle that the company doing it somehow managed to survive. The whole idea of Itanium was to get rid of all the fancy instruction-mongling hardware like branch prediction, out-of-order execution/speculation, the ability to pause a pipeline or switch to a different thread of execution when a memory load took too long, etc. You just made the compiler do all this sort of analysis in software and output The Right Instructions, and all the CPU had to do was dumbly and rapidly zip down them and do exactly what they said. Part of this was the “Very Long Instruction Word” style format, which is simple in principle: each “instruction packet” was 16 bytes long and contained 3 instructions with a variety of layouts, and these layouts exactly matched what the CPU’s execution hardware was capable of doing. You get reasonably dense instructions (Itanium instructions in the packet were each 41 bits, but they were also 3-address codes with 128 registers) with much, much simpler decoding. The processor doesn’t have to do much if any of the work of taking the instruction apart, seeing what registers it touches, seeing what execution units are available, trying to do something clever if it touches data that isn’t available yet, and dispatching stuff appropriately. Just load and go.

This post has a great summary of how this works, as well as of why it failed: “the fundamental idea of Itanium – parallel instruction decoding – does not work.” I’ll recount the reasons and add one or two of my own:

  1. It’s not always possible to fit instructions into groupings of three. Having the compiler pack instruction packets optimally is a knapsack-style bin-packing problem, which is NP-hard in general. So you end up with either super slow compilers, or approximations that fill out lots of space with noop’s – or both, if the code the compiler is handed fundamentally doesn’t pack well. There’s a popular quote from a Donald Knuth interview in 2008 saying that the hoped-for compilers for Itanium were “basically impossible to write”. It’s vague and a bit flame-y, ’cause frankly Knuth knows a lot about computers but I don’t think he ever tried to write an Itanium compiler, but it was repeated a lot anyway. Honestly I think that these days compilers can probably throw enough brute force at the instruction scheduling problem to do a half-decent job of it, but that wasn’t the case in 1999 or whatever when Intel was actually trying to make the damn thing.
  2. Furthermore, this also overlaps with “Exposing too many hardware details”, as previously mentioned. VLIW formats are designed to match the execution hardware the CPU actually has available. How many memory loads/stores it can dispatch at once, how many ALU instructions it can do at once, how many FPU instructions, etc. If this execution hardware changes, then your VLIW format is no longer a perfect match, and will only become more imperfect over time. So sooner or later you have to add in all the instruction disassembly-and-dispatch hardware anyway that this was supposed to avoid!
  3. The bigger problem, though: Cache hierarchies. “With a multi-core, multi-process, modern operating system computing environment, it’s practically impossible to know how physically close your data is to your execution unit [at any given time].” Basically if you’re a compiler and you have a load r1, (r2) instruction to dereference some random pointer, you usually have no idea how long it will actually take. Is (r2) in cache? Which cache? How long do you have to wait for it to load before you actually use it? A modern CPU will try to find something else useful to do while it is waiting for the memory bus to package (r2)’s value in a crate and send it via DHL 3-day shipping, or it will just stall if it truly has nothing else it can work on. An Itanium’s VLIW format is entirely based around the idea that this should never happen, the compiler can figure it out for you. So your compiler will often have to be pessimistic about what data is actually in cache, and either load things more often than necessary or wait for loads more often than necessary.
  4. This goes even further though: Is the address r2 points to an atomic value that another core is currently tinkering with? Or the result of some earlier computation that hasn’t finalized yet? Is r1’s current value actually still being used by an earlier computation and you need to wait for that to finish or register-rename it to something else? Every time a branch lands a program’s execution at the start of a new block, the compiler can not know exactly where the branch came from; it may have always come from the same place, or one of 2 or 3 or infinity different places that may have left the CPU in different states. So again, the compiler often has to be pessimistic about what CPU resources it can actually use.

This last point is the real killer: every time a jump lands execution somewhere new in the program, the CPU and cache may actually be in any of several different states. The compiler has to make an educated guess at what the CPU’s inner guts might be doing at any given point in time and how it is supposed to schedule instructions to use the hardware optimally with a minimum of waiting, but it has to be a conservative guess, because if it fucks up the CPU won’t stop and sort it out, it will just keep going with invalid results. A CPU that is smart enough to analyze and re-order and predict instructions has more information than the compiler does about what is actually going on for real, and so it can make more intelligent decisions! This doesn’t absolve a compiler from trying to give the CPU the best guess it has to work with, because the CPU can only look at a small number of instructions at a time, but the two working together can do far better than either of them working separately.

So a simple in-order CPU zipping through straight-line code with predictable memory access patterns can crunch numbers real fast and be very efficient in terms of power and silicon used– but that goes out the fucking window every time a jump occurs. And apparently in most code a jump occurs on average every 7 instructions, give or take. So this approach will never work out well for general-purpose CPU’s that are used to do a lot of decision-making. Are there special-purpose problems where jumps are uncommon, and ideally where you always compile code for a specific CPU architecture? Why yes, this is common in digital signal processing and graphics code – and indeed embedded DSP’s and GPU’s not-uncommonly use VLIW instruction sets! Though as GPU’s become more general-purpose and have to be better at executing branch-y code, apparently VLIW instruction sets are becoming less common. Use the right tool for the right job!

(Full discussion where I actually figure half this shit out here.)

Jury is still out

Load/store multiple values at a time

Very useful for function prologues and epilogues, where you have to spill or restore all those registers. Are they worth it? idk. You’d think the answer is an obvious “yes”, but they re-introduce a lot of the problems that get taken away when you have instructions only do single loads or stores at a time. For example, what happens if you do a multiple-register load/store that crosses a cache line? What happens if an interrupt occurs in the middle of it? If you have a single instruction that dumps 8 values to memory all at once, and 3 of those get sent to the memory bus and then an interrupt occurs, does the CPU need to be able to roll back and restart 3 memory transactions? Or pause and remember the 5 that are pending and then restart them when the interrupt finishes? Or something else? It adds a lot more extra state that the CPU has to juggle. See this discussion for more about the design tradeoffs.

ARM32 especially has these sorts of instructions, and they’re great for code density but make life real hard as your CPU gets more powerful. RISC-V (currently) does not, but may acquire them; there’s a proposal for it that looks like it will pass. Aarch64 has a kinda weird little compromise: there are instructions that load and store two registers at a time, which makes it easier to enforce some restrictions on the process, like “destination must be 16-byte aligned”, that cause fewer edge cases. (See the discussion linked under “load-store architecture” for talk about said edge cases.) (As of 2023, Intel is acquiring load2/store2 instructions too! I wonder if/when we’ll have instructions that load/store 4 registers at once?)

General purpose registers with special purposes

Old-school CISC processors usually had many different register types: integer registers, address registers, floating point registers, string pointer registers, etc. They had different instructions that operated specifically on those registers. Best example is the x86 IDIV instruction: “The IDIV instruction divides the contents of the 64-bit integer EDX:EAX by the specified operand value. The quotient result of the division is stored into EAX, the remainder is placed in EDX.” Don’t want the remainder in EDX ’cause there’s something else there already? Too bad. It made the hardware simpler to implement.

RISC got rid of all of this and just made all registers usable for any instruction, which resulted in fewer special cases for software to juggle. Fewer special cases also made the hardware simpler to implement – presumably with different tradeoffs that became more favorable as time went on, but I don’t know details. But some hardwired special cases for general-purpose registers crept back in.

r0 as a zero register: it’s nice but not essential. RISC-V and MIPS have it. ARM32 does not. Aarch64 and POWER are context dependent – on Aarch64, register number 31 is the zero register for most instructions but the stack pointer for a few others, and on POWER, “register 0” means a literal zero in some addressing contexts – because using the zero register with those instructions would be a no-op anyway. Basically, by having a zero register you spend 1 GPR forever in exchange for removing some special cases from your instruction set. Sometimes you can reuse these for useful things; for example RISC-V reserves various no-op instructions with r0 as a destination register to be used as hints or custom instructions for various purposes.
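A concrete example of the “removes special cases” payoff, sketched as a tiny encoder (the helper is mine; the field layout is RISC-V’s standard I-type, if I have the bit positions right): with a hardwired zero register you don’t need dedicated move, no-op, or small load-immediate opcodes, because they all fall out of addi.

    #include <stdint.h>
    #include <stdio.h>

    /* Encode a RISC-V addi rd, rs1, imm:
     * imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011. */
    static uint32_t addi(unsigned rd, unsigned rs1, int32_t imm) {
        return ((uint32_t)(imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13;
    }

    int main(void) {
        /* With a zero register these are all just addi in disguise: */
        printf("nop       = %08x\n", addi(0, 0, 0));  /* addi x0, x0, 0  */
        printf("mv x5,x6  = %08x\n", addi(5, 6, 0));  /* addi x5, x6, 0  */
        printf("li x5,42  = %08x\n", addi(5, 0, 42)); /* addi x5, x0, 42 */
        return 0;
    }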

Having the program counter be a GPR is another option. ARM32 is the only arch we’re looking at which does this. It’s kinda weird, but enables some cute tricks such as jumps really just being normal instructions that modify a particular register, or function returns being a pop instruction that pops the return address from the stack into the program counter. But having one more GPR might be more important than those tricks. It also makes speculation and other hardware design difficult if any instruction might end up being a jump, including memory loads; ARM32 has an idiom where it uses its “pop multiple things from the stack” instruction and one of them is the address of the function to return to, which just gets slapped straight into the program counter. It’s great for code size because most of your function prologue/epilogue gets done by a single instruction, but it’s terrible for speculation and out-of-order execution because any time an instruction writes to a register the CPU has to go “oh shit, is this actually a jump? Gotta check.” Good design for small cheap CPU’s, less so for big fast ones.

Having a link register seems to be nice; it may or may not need to be special-cased in the instruction set though. In RISC-V every jump writes a return address (the address of the following instruction) into a register, and a jump-with-no-link just uses r0 as its link register. The flip side of this is that you’ll hardly ever use a “jump and link” instruction with a register besides the default link register, so you can squeeze more space out of your instruction encoding by not needing to specify which link register to use. Per David Chisnall, letting any register be a link register is a false generalization, for these reasons: “RISC-V is the only ISA that has a link register and doesn’t special case it and this was a huge mistake. The JAL and JALR instructions are used almost exclusively with either the zero register or the link register as targets. … RISC-V uses them for the functions for outlined spill slots, but that is mostly necessary because they wasted too much of the encoding space to have efficient prologues. There’s a proposal for a 16-bit spill-multiple instruction, which will fix this. There’s another wrinkle with the link register, which means that it ends up being special on RISC-V, even though they try to pretend it isn’t: return-address prediction.”

Dedicated stack pointers are another common one that have gone in and out of vogue repeatedly. x86_64 has a dedicated stack register. In RISC-V and ARM32 the stack pointer is just a general-purpose register; it appears to be defined only by convention. POWER has a stack pointer that is defined by convention, but I think there are also some instructions that treat it specially. AArch64 goes back to having a stack pointer that is not a general-purpose register. Support for stack frame base registers is even more uncommon in hardware; I think only x86_64 has it. However, base registers can be omitted; they’re mostly used to make it easier to unwind the stack and get a stack trace while debugging. So their cost is just saving/loading an extra register on function calls and the use of 1 GPR; there’s not as much to be gained by making them special purpose. (Software people seem to be divided into two camps: those who have been through this process and say “always have base registers”, and those who have not and say “never have base registers, so your code goes faster”.)

Apparently there are some actual benefits to a dedicated stack register though. “Take a look at the disassembly for any program (for any ISA) and count the instructions in an addressing mode that use two registers to calculate the address for a load or store. Now count how many of those use the stack pointer as one of the operands. If the number that use the stack pointer is more than zero, your program has some huge stack frames (not including on-stack arrays, which are not addressed by the stack pointer directly, instead they move the base of the array to another register with a stack-pointer-relative add or subtract and then use that). Now, in the other direction, look at the loads and stores that use immediate addressing. These will be quite similar, but the stack-pointer-relative ones will almost always be able to take advantage of a shift because stack spill slots are register-width aligned. In addition, about the only non-memory operations that use the stack pointer are add and subtract (for creating / destroying stack frames) and occasional masking (to realign the stack).” Someone else mentioned that Intel CPU’s have a dedicated “stack engine” that speeds up its stack ops.

Floating point registers are their own special case, and seem to basically always be separate from GPR’s. I thiiiiiink the first implementation of hardware floating point for ARM32 processors reused GPR’s for floating point math, and it sucked so much they rapidly obsoleted it?

Various quotes, mostly from David Chisnall of course:

Loading or storing multiple values is still a win and modern vector ISAs all now have scatter/gather loads. The AArch32 version had some exciting consequences if you took a fault in the middle, which caused some pain. The other thing here that the article misses that made this painful: PC was a GPR. This was a huge mistake because you needed to decode enough of the instruction to know its destination register to know whether it would be a jump.

Multiple operating modes

x86 and ARM both grew multiple operating modes via accretion – real vs protected mode, normal vs Thumb mode. Aarch64 and RISC-V both deliberately eschew them. Whether Aarch64 and RISC-V will be forced to grow new operating modes as software and hardware evolves has yet to be seen, but my impression is the hardware designers really want to avoid it.

SIMD vs. vector instructions

Being able to push more data at a time through a processor’s SIMD unit seems like it should usually be a win. But the debacle of AVX-512’s initial release by Intel shows the risks of blindly increasing SIMD width: the greater bandwidth doesn’t help much if it makes your core run so hot it has to downclock itself. In contrast, taking a more conservative approach can work better, letting the CPU keep its clock speed high even if it’s not actually pumping as many instructions through per clock cycle. Either way, fixed-width SIMD instructions have the problem that widening your SIMD registers/lanes involves adding new instructions, which compilers need to then support, which software then needs to be compiled with, and you’ll still need shims to handle architectures that don’t use those new instructions. And every time you double your SIMD lane width, you also reduce the number of programs out there that can really take advantage of it all.

So a new approach is gaining ground where the software has instructions to give the CPU a numerical algorithm, tell it the bounds of the data to run it on, and then let it perform that algorithm with however much parallelism it has to offer. (This is actually a fairly old approach, but new for commodity hardware afaik.) So when you write a loop to sum an array, if the CPU can handle 2 SIMD operations at a time it will step through an array of data loading and adding 2 elements at once, and if it can handle 16 operations at a time the CPU executing the same code will load and add 16 elements at a time.
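The shape of the code is the same strip-mined loop either way; only the chunk size changes. Here’s a scalar C sketch of the idea, with vector_lanes() standing in for whatever width the hardware reports at run time (the role RISC-V’s vsetvli or SVE’s predication plays):

    #include <stddef.h>
    #include <stdint.h>

    /* Stand-in for "ask the hardware how wide it is"; on a real
     * vector-length-agnostic ISA the same binary works whether this
     * reports 2 or 16. */
    static size_t vector_lanes(void) { return 8; }

    int64_t sum_vla(const int64_t *data, size_t n) {
        int64_t total = 0;
        size_t i = 0;
        while (i < n) {
            /* "Set vector length": do up to vector_lanes() elements this
             * pass, fewer on the final partial pass. */
            size_t vl = (n - i < vector_lanes()) ? n - i : vector_lanes();
            for (size_t lane = 0; lane < vl; lane++)  /* one "vector op" */
                total += data[i + lane];
            i += vl;
        }
        return total;
    }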

It’s a promising idea if it works, and I hope it does. The way I think of it is basically having your SIMD instructions be a bytecode for your CPU’s vector processing unit, and the CPU will execute that bytecode in whatever way is most efficient for the hardware it has available. This offers the ability to have SIMD code be far more backwards (and forwards) compatible, as well as hopefully easier to write. The down side, I suspect, is that you need to then have this specialized sub-processor kinda doing its own thing vs the rest of the CPU, and for small programs the overhead to set up and execute these vector programs may be greater than just chomping down a short run of SIMD instructions. I dunno.

There are two main implementations of this idea that I know of in the world so far: ARM’s Scalable Vector Extension and RISC-V’s V extension. Both of them exist in real hardware… but only quite new hardware, so I don’t think we have enough data yet about how this approach will stand the test of time.