RiscIn2022
Some thoughts on hardware instruction sets.
Instruction sets are an API for talking to the hardware. Having this API at the right level of abstraction is Important. There’s lots of features that can go together into this API which make it “better” or “worse” for particular use cases.
RISC was a set of design principles developed in the 1980’s that enabled hardware to get much faster and more efficient. We tend to still call modern-looking instruction sets “RISC-y”, but really, a bunch of the original design principles of RISC CPU’s have not stood the test of time. So let’s look at the things that have worked and not worked between the 1980’s and 2022.
Note I am not a CPU designer, nor am I really an expert in compilers and assembly language. I’m a software dude who writes compilers and OS’s as a hobby, and likes weird architectures.
things to fix: conversation here, https://lobste.rs/s/fgfxvu/risc_2022. And also here: https://wiki.alopex.li/@RiscIn2022. Done some reading about PPC and it’s a mess but also has some cool stuff, esp. with condition registers.
Case studies
Old systems that started in the 1980’s and evolved: x86_64, ARM32, PPC/POWER
New systems made from whole cloth in the 2010’s: Aarch64, RISC-V
Historical things that currently appear to have little future: MIPS, Itanium, x86 (32-bit-only version), PPC/POWER (debatable but I don’t want to debate it)
Things I don’t know much about: SPARC, PA-RISC
Out of scope: 68K (stopped evolving in the 90’s afaik), Alpha (maybe? it also stopped evolving in the 90’s, but later than 68k)
My best sources are Raymond Chen’s computer architecture articles, and reading references, and writing code.
Things that work
32ish general purpose registers
Having more registers is good because it gives you more working space without needing to touch slow memory. Having more registers is good because it gives the compiler more options for where to put your program’s data without needing to add instructions to shuffle things around.
Having more registers is bad because it means on every function call or OS context switch, you need to save and restore these registers, adding overhead to these very common operations. Having more registers is bad because each instruction needs to specify which registers it operates on, and if you have tons of registers you are spending more bits per instruction storing just which registers its arguments are in.
32 registers is a decent, pragmatic sweet spot for 32-bit 3-address instructions. (More on that below.) Encoding 3 registers per instruction takes 15 bits, leaving you 17 bits for Other Stuff which is a pretty useful amount of space. 16 registers is pretty good, it’s common on smaller systems that want more compact instructions (ARM32, SuperH), but it’s not too hard to write programs which could really use more registers if they were available. 64 registers with 3 registers per instruction would require 18 bits out of 32 just for register addressing, which is a bit much, and results in a greater proportion of programs which don’t really need all the available scratch space.
This does mean that it is difficult to create immediate values larger than the spare 17/21/whatever bits that you can fit into one instruction. Creating full 32 or 64 bit literals will generally take multiple instructions. This includes creating addresses/offsets for both data and function calls/jumps! There are many patterns for doing this efficiently, depending on the architecture. You may need to load the high and low 16 bits of a 32 bit value separately in sequential instructions. You may be able to easily create values relative to the program counter, and use these for function calls or to load constants from tables the compiler sprinkles between function code. You may have a “global register” that the calling convention reserves to point to the program’s data segment, so you can always load values relative to that register. You may have instructions that create larger bit patterns out of smaller ones, such as a single instruction that takes 0x1F and loads it into a 32-bit register as 0x1F1F1F1F. And so on. Nobody really likes this, as far as I can tell, but so far it seems to mostly be worth the extra instructions in exchange for a simpler and more regular instruction format. Having variable-length instructions that can have a full 32 or 64 bit value stuck into them would add complexity to the CPU’s instruction decoder and would not really save that much space. Assemblers even tend to have shortcuts to handle the construction of large constants for you, so you can write load_immediate r1, 0x1234_5678_9ABC_DEF0 and it will find somewhere to stash the value 0x1234_5678_9ABC_DEF0 and fabricate the instruction sequence to get it for you. So the only people it really bothers in practice are compiler writers.
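As a sketch of what this looks like in practice, here’s the usual RISC-V idiom for building a 32-bit constant (lui fills bits 31:12, addi fills in the rest; the li pseudo-instruction hides the dance for you):

    lui  t0, 0x12345         # t0 = 0x12345000
    addi t0, t0, 0x678       # t0 = 0x12345678
    # (if the low 12 bits were 0x800 or more, the lui constant would need a
    #  +1 adjustment, because addi sign-extends its immediate)

    li   t0, 0x12345678      # assembler shorthand that expands to a pair like the above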
TODO: Apparently ARM did an analysis of 32 registers that showed that 16 performed just as well for most code. This was apparently internal-only but might have external references somewhere. You can also easily modify LLVM to only use 16 registers and see what that does to various code. Here are some sources that investigate different register counts: https://doi.org/10.1109/ACAC.2001.903365 and https://doi.org/10.1109/ISCA.1993.698564. Indeed, both show that 32 registers is only a few % better than 16; I thought it was a bigger difference! The newer paper is from 2001 though; I don’t expect backend compiler tech to have changed that much since then, but it would be nice to update the findings.
3-address codes
This is not a hard and fast rule, but having most operations of the form rD = rA + rB is a pretty good way to structure your instruction set. It just gives the compiler a lot of flexibility for which registers to put intermediate values into, which means the compiler doesn’t often have to do a lot of register shuffling to get things to where they need to be. Even if register renaming makes register-to-register moves super fast, they’re still instructions that need to be fetched and decoded, and ideally they wouldn’t be in your program at all.
The other common instruction encoding is a 2-address code of the form rD = rD + rA, i.e. the result overwrites one of the operands. This is a lot less flexible, but it’s a common enough special case that some instruction sets use it as an optimization for small code size. Having 2 registers per instruction instead of 3 forces the compiler to do more shuffling and copying of data when arranging large expressions, so an expression written as 2-address codes may be slower or occasionally use more instruction space than one using 3-address codes.
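To make that concrete, here’s x = (a + b) * (c - d) compiled both ways, assuming a through d live in a0 through a3 and are still needed afterwards (the 3-address half is real RISC-V; the 2-address half is a hypothetical encoding of the same ISA where results must overwrite an operand):

    # 3-address: results go wherever is convenient, no copies needed
    add t0, a0, a1           # t0 = a + b
    sub t1, a2, a3           # t1 = c - d
    mul t2, t0, t1           # t2 = (a + b) * (c - d)

    # 2-address (hypothetical): each result clobbers an operand,
    # so live values have to be copied out of the way first
    mv  t0, a0               # t0 = a
    add t0, a1               # t0 = a + b
    mv  t1, a2               # t1 = c
    sub t1, a3               # t1 = c - d
    mul t0, t1               # t0 = (a + b) * (c - d)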
Not every instruction fits into the 3-address format, of course. Loads and stores usually only need 2 register operands, and use the extra space to make offsets or literals larger. Aarch64 (and probably others) has a few 4-register instructions, such as rD = rA + (rB * rC), but those get their own dedicated encodings, so they’re deliberately a special case. Apparently a useful enough special case to exist, though.
32 bit instructions, but also small code size
Canonically, it’s Best if all instructions are 32 bits long and on a 32-bit boundary. This makes your instruction decoder very simple: you load 4 bytes from a word boundary and that’s one instruction, no logic involved. If you have out-of-order execution then you can load 8 or 16 bytes or whatever at a time and still need to do exactly zero work to figure out where instructions are before starting to decode them.
On the other hand, instructions take up i-cache space and memory bandwidth, so the smaller your instructions are, the better.
On the third hand, if you have small instructions then it’s difficult for them to do as much as you can do with bigger instructions (see the section on 3-address codes), so you may need more small instructions to do the same work as fewer large instructions.
So in reality it seems like you want a balance: a variable-length instruction set that’s compact but still easy to decode. Both ARM32 and RISC-V have an optional “compressed” instruction format (Thumb-2 and the C extension respectively) that essentially allows 16-bit instructions to be mixed in with 32-bit ones, so that seems to be the current sweet spot. Both started with the approach that the base instruction set is all 32-bit, and then the most common and useful instructions get duplicate 16-bit encodings. (Thumb-2 seems to diverge from this a bit more than RISC-V; I don’t understand all the differences though.) In both, instructions must be 16-bit aligned, but 32-bit instructions may span 32-bit boundaries.
(ARM’s Thumb-1 has different rules that are more restrictive, so I’m not going to consider it here. RIP Gameboy Advance programmers though.)
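To give a feel for what the compressed encodings buy you, here are a few common RISC-V instructions next to their 16-bit forms (assemblers will normally pick the short encoding automatically when the C extension is enabled and the operands fit):

    addi sp, sp, -16         # 4-byte encoding; compressed: c.addi16sp sp, -16  (2 bytes)
    lw   a0, 8(sp)           # 4-byte encoding; compressed: c.lwsp a0, 8(sp)    (2 bytes)
    add  a0, a0, a1          # 4-byte encoding; compressed: c.add a0, a1        (2 bytes)

The compressed forms only cover the most common operations, registers, and immediate ranges; anything that doesn’t fit just uses the full 32-bit encoding.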
Aarch64 does not have variable length instructions (yet); not sure whether it will grow them. However, from reading Raymond Chen’s blog posts about it, it looks like even with 32-bit instructions only, Aarch64 spends a lot of effort packing more work into fewer instructions. For example there’s no standalone integer multiplication instruction: every integer multiply is really a multiply-and-add, rD = rA + (rB * rC), and a plain MUL is just that with the zero register as the addend. So it could be that the Aarch64 designers have deliberately decided to not have variable-length instructions, but to have more complex instructions instead. Departure from RISC’s origins, indeed.
Load-store architecture
This is a biggie. Back in the day, it was common to have instructions that touched multiple memory addresses at once.
This made life harder for instruction decode and execution units (if I understand correctly), so RISC ditched it. You had enough silicon for a lot of registers (compared to the then-common 16 or 8 or even fewer), and so it made life easier to have every instruction operate on registers and only a few instructions move stuff from registers to memory and back.
Memory kept getting slower, and caching and instruction reordering kept getting more important to keep the CPU fed with useful work, and so this decision worked out really really well in the long term. Maybe even by accident? Either way, the CPU has to reach out to memory relatively rarely and does a lot more of its work entirely in registers.
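For illustration, here’s roughly what adding two values that live in memory looks like on a load-store machine (RISC-V syntax; the single memory-to-memory instruction a 1970s CISC might have offered is shown only as a comment):

    # Hypothetical CISC-style instruction: add [a1], [a0]
    #   (one instruction, two memory operands)

    # Load-store style: all arithmetic happens in registers
    lw  t0, 0(a0)            # load the first operand
    lw  t1, 0(a1)            # load the second operand
    add t1, t1, t0           # add entirely in registers
    sw  t1, 0(a1)            # store the result back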
Few addressing modes
TODO: addressing modes are boring. Apparently Aarch64 has richer addressing modes than usual, and they pay off for larger systems. “The goal is to maximise the work done per instruction, not to minimise the encoding space for a single instruction. RISC-V optimises the other way, which is why it’s a good fit for microcontrollers but not as good for laptop or server class systems.”
Big flat address spaces
This is more general technological evolution than RISC specific, but it kinda started with RISC and is a trend I see continuing. It’s now inexpensive enough to put addressing hardware and traces in chips that even if you have a microcontroller with only 64 KB of actual physical RAM (a 16-bit address space), having your CPU use a full 32-bit address space with 32-bit registers and pointers is usually not a big cost. The “empty” address space gets used for memory-mapped devices, useful address tricks like virtual memory, etc. Sure, you could get a little more use out of your transistor budget by having the hardware not pretend it can have more than 16 bits per pointer. But we are wealthy enough to waste some of that capability for the sake of simplicity.
Weak memory model
x86 is about the only major architecture still going with a strong memory model. ARM32, Aarch64, and RISC-V all have weak memory models, and so do MIPS, Itanium and POWER. (SPARC is the odd one out: it defines weaker modes, but in practice it runs TSO, which is about as strong as x86.) ’Nuff said, really. Which is good, ’cause I don’t know much more about the details of why this is a good thing, other than “simpler data caches and instruction speculation”. From @dave_chisnall: “In a TSO architecture, you have to keep things in store queues for a lot longer because more things that happened on other cores can invalidate your memory access.”
Things that don’t work
These are things that older RISC chips tended to do which seem to have become less common through time. As far as I know none of these are such fatal mistakes they have killed architectures, but they certainly make life less convenient for compiler writers, system programmers and/or hardware designers.
Reduced instruction sets
This one is a little cheeky but I had to put it in. The original RISC I CPU had 31 instructions. I didn’t count carefully, but MIPS IV seems to have about a hundred instructions, not including floating-point. Alpha also has something like 100 integer instructions, at least in the v4 manual.
On the other hand, ARM32 has thousands of instructions. Aarch64 has hundreds of instructions if not thousands. RISC-V has a very small core of about 40 instructions, but if you want useful extensions that let you run an OS with memory protection, atomics, useful bit manipulation instructions, and an FPU, you are soon looking at more like 200-300 instructions. RISC-V vector instructions (SIMD) add another 200 or so.
Basically, there’s tons of little fiddly things you can add to a CPU that are worth the effort if you have the transistors to spend, especially when it comes to bit manipulations and floating point math. These days we usually have the transistors to spend on special-case performance. There’s a pretty strong incentive to add more instructions to do better on certain benchmarks or provide certain features, and not a whole lot of incentive not to.
That said, the core instruction set does still tend to remain quite small. No capability is lost if you don’t have rD = rA and (not rB) as its own instruction, just performance. So, like variable-length instruction sets, large complex instruction sets are due to us spending transistors to save memory bandwidth, which makes instruction caches smaller and pipelines smoother. If we could fetch instructions from memory faster than we could execute them, then having fewer and simpler instructions would probably still be the way to go.
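For example, RISC-V without the bit-manipulation extension needs two instructions for rA and (not rB), and the Zbb extension adds andn to do it in one:

    # Base RV32I/RV64I: two instructions
    not  t0, a1              # t0 = ~a1   (pseudo-instruction for xori t0, a1, -1)
    and  t0, a0, t0          # t0 = a0 & ~a1

    # With the Zbb extension: one instruction
    andn t0, a0, a1          # t0 = a0 & ~a1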
Single-purpose instructions
Very related to the previous point. Code density is too important, and memory is too slow. Aarch64 and ARM32 both have many instructions that do multiple things at once, as long as they only do one memory read/write at a time. The best example I’ve seen is that Aarch64 has “load-and-increment” instructions for array indexing: they load data from an address stored in a register, then increment that address in the register. RISC-V has far fewer of these but they still exist: fused multiply-add, jump-and-link, some of the bit-manipulation instructions, and the proposed push/pop memory stack instructions. So this is not a hard and fast rule, but the trend seems to be that it’s worth having more complicated instruction execution to make very common sequences of operations smaller, within reason.
This surprised me when I learned about it, ’cause load-and-increment instructions used to exist in the old-school CISC instruction sets of the 70’s and 80’s, and were used as an example of the “do multiple things at once” philosophy that RISC eschewed. And now it’s back! The difference is that Aarch64’s multi-function instructions never read a value from memory, modify it, and write it back to memory all in one instruction. Nobody does that. Having an instruction do multiple things at once does make it harder to reorder and needs more complicated execution machinery, but that’s less important than the fact that it sticks to the load-store model.
TODO: “The key goal is that there should not be two instruction sequences of the same length, with no shorter version, that achieve the same thing. … Things like the bitfield extract and insert instructions replace a longer shift and mask sequence. In older CISC instruction sets, there were many equivalent-length instruction sequences for doing the same thing, which made it hard for compiler writers to know which they should pick.”
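For example, extracting the byte in bits 15:8 of a register is a shift-and-mask pair on RISC-V, where an ISA with a bitfield-extract instruction (like Aarch64’s ubfx) does it in one:

    srli t0, a0, 8           # shift the field down to bit 0
    andi t0, t0, 0xff        # mask off everything above it
    # Aarch64, for comparison: ubfx w0, w1, #8, #8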
Complicated data dependencies
Flags registers, hardware traps, and instructions that modify multiple registers (a la x86’s MUL, which stores its results in rax/rdx or something like that) all make life a lot harder for the CPU, because there are more potential conflicts between one instruction writing data to a location and the next one reading from it.
This is basically the same problem as instructions that read or modify several things in memory, except more so, because things like flag register changes are implicit and happen on each instruction. In Ye Olden Days, instructions were executed strictly in the order they were given, you had one thread of execution per CPU, and it didn’t matter that instruction A had to write its results to the flags register before instruction B could read from it. Now it does matter. These sorts of “modify global state” operations were a lot easier to design before out-of-order and speculative execution became common. So older instruction sets like ARM32, MIPS and Alpha have them, and RISC-V does not. Aarch64 seems to straddle the line a little bit; it has a flags register but uses it more lightly than ARM32, and fewer operations affect it.
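As a tiny illustration of the dependency question: a flags-based ISA splits “branch if less than” into a compare that writes the flags register and a branch that reads it, so those two instructions are coupled through global state, while RISC-V folds the comparison into the branch itself (somewhere is a placeholder label; the ARM32-style version is shown only as comments for comparison):

    # Flags-style, ARM32-ish:
    #   cmp r0, r1          ; writes N/Z/C/V flags
    #   blt somewhere       ; reads the flags written above

    # RISC-V style: no flags register anywhere
    blt a0, a1, somewhere    # compare and branch in one instruction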
TODO: from @dave_chisnall again: “Flags are still there on everything except RISC-V, and even there they exist in the floating-point ISA. They are annoying to implement but they’re a huge win for software. Compilers are really good at if conversion now and you can get the same performance on an architecture with conditional moves as one without if you have about less than half as much branch predictor state. The saving in the branch predictor more than offsets the cost of flags. They’re also essential for any kind of constant-time code to be fast, which is increasingly important with modern systems and side channels.” Also, POWER has an actually quite interesting design where there’s 8 sets of flags registers. Explicit compares can use any of them, and most integer/float instructions can have a flag that tells them to set a default flag register according to the results (cr0 for integer, cr1 for floating point) or not bother. Apparently Itanium had something similar too.
Note that this goes so far that things like integer divide by zero no longer cause an interrupt. Interrupts require stopping what the CPU is doing, waiting for all the pipeline and speculation machinery to notice, context switching to the OS, and handling the trap. That’s a lot of extra work to potentially have to do on any instruction. In RISC-V and Aarch64, if you want to check for divide by zero, you do it explicitly; the CPU does not stop what it’s doing for you. In ARM32 trapping on divide by zero is an optional flag the CPU may not support.
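So a checked division on RISC-V looks something like this sketch (div_by_zero is a placeholder for whatever your error path is; the hardware itself just returns an all-ones quotient for division by zero and carries on):

    # Divide a0 by a1, checking the divisor ourselves.
    beqz a1, div_by_zero     # software decides what a zero divisor means
    div  a0, a0, a1          # the divide itself never traps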
Exposing too many hardware details
The first MIPS CPU had a pipelined instruction decoder, which meant that by the time the decoder figured out whether the current instruction was a jump, the next instruction had already been loaded. This presents a problem for the CPU: it can either throw that loaded instruction away and accept a 1-instruction pipeline bubble on every jump, or it can use some sort of fanciness such as speculative execution to find something hopefully-useful to do during that pipeline bubble. Or you can do what MIPS did, which is execute that next instruction after the jump anyway, as if it were the first instruction at the destination of the jump. This is called a “branch delay slot”. Raymond Chen has some opinions about them. They’re weird, and cause some odd special cases, but compilers don’t really care; the only damage branch delay slots do is to the minds of people who have to read or write assembly code.
My understanding of this one is a little more vague, but afaict Itanium had instruction formats that packed 2 integer ops together, because it had 2 integer units. That way the instruction stream could, in the best case, keep the CPU fed with exactly what it needed to keep running at max utilization. No speculative execution or reordering or such needed; the compiler just knew what the CPU wanted and arranged the instructions to fit. (They still do this to an extent.)
Both of these features work fine for specific models of CPU. Exposing these details means that the CPU doesn’t have to do extra work to hide these behaviors from the program’s execution model. The problem is that the instant you change the hardware implementation, this suddenly becomes a liability, because you have to paper over the differences from the old implementation anyway. If the latency before a jump is 2 cycles instead of 1, you suddenly need to add in all that reordering/speculative execution/whatever hardware to cover it up and make it look like 1 cycle. The branch delay slot suddenly saves you absolutely nothing. If you make a CPU that has 4 integer units instead of 2, you need to load and dispatch and track more instructions at a time, just as if your instructions could only have 1 integer operation at a time.
This is a great demonstration of how a good instruction set that scales across many devices and system sizes is not a literal description of how the computer works, but rather is an API for talking to it. Like any API it needs to be at a useful level of abstraction.
On the other hand, smol things like DSP’s use these sorts of compromises heavily, on the assumption that rewriting/recompiling a program for a new model of hardware is easier than spending more transistors to make the hardware more flexible.
Register windows (minor)
These aren’t so much of a bad design as just a design that didn’t pay off. Many CPU’s have had register windows: RISC-I, SPARC, Itanium are the ones I know of best. The idea seems good: instead of explicitly spilling registers to a stack and restoring them every function call, just have multiple sets of registers and have each function call switch from one bank to another. You can even make the banks partially overlap and use the overlapping portions for passing args and return values between functions. Neat, huh?
The thing is that once you fill up however many registers you have in your CPU, on every function call you have to spill/unspill registers anyway, so as programs get larger the benefits get smaller, and depending on how they’re implemented you can hit a sudden performance cliff when you run out of empty register banks. Register windows also probably make it more complicated to walk the stack for backtraces and stuff; if you need to handle a software exception or have a debugger inspect a program, you now need to be able to spill all your CPU’s registers onto the stack on demand. So, nobody seems to use register windows for general purpose CPU’s anymore. I’ve never heard anyone condemn them, they’ve just quietly died out. It seems like it was more effective to just have a bunch of registers and a calling convention that saved a fixed number of them to the stack on every function call.
On the other hand, apparently they’re still somewhat common in DSP’s and embedded devices, so maybe they’ll come back someday. I have been informed that “register windows remain great if you have SPARC-like circular view of them and a microarchitecture that does asynchronous background spilling. Oh, and a memory model weak enough that nothing on the spill stack is guaranteed visible to anything other than the spill / reload logic without an explicit barrier.” So basically they can still be beneficial if your CPU is smart enough to spill stuff in the background while the rest of the program is still doing stuff, which sounds potentially complicated but also like it’d solve the problem. I could certainly imagine a dedicated cache for spilling registers; Itanium also had a second stack pointer for this purpose, so the CPU can certainly be designed around this idea. Is it worth it? I dunno!
Instruction condition codes (minor)
ARM32 had a condition code on every instruction, so you could do add.gt or whatever and it would only execute the add instruction if the greater-than flag was set. It’s a really cool idea. Unfortunately, it also does not seem to have been worth the costs in terms of instruction encoding space and execution complexity; it’s gone in Aarch64. It’s far easier to have only some instructions be conditional – mainly jumps and maybe a few loads/stores.
Jury is still out
Load/store multiple values at a time
Very useful for function prologues and epilogues, where you have to spill or restore all those registers. Are they worth it? idk.
ARM32 especially has these. RISC-V (currently) does not, but may acquire them; there’s a proposal for it that looks like it will pass. Aarch64 has a kinda weird little compromise: instructions that load or store two registers at a time (ldp/stp), which makes it easier to enforce some restrictions on the process, like “destination must be 16-byte aligned”, that cut down on edge cases.
General purpose registers with special purposes
Old-school CISC processors usually had many different register types: integer registers, address registers, floating point registers, string pointer registers, etc. They had different instructions that operated specifically on those registers. Best example is the x86 IDIV instruction: “The IDIV instruction divides the contents of the 64-bit integer EDX:EAX by the specified operand value. The quotient result of the division is stored into EAX, the remainder is placed in EDX.” Don’t want the remainder in EDX ’cause there’s something else there already? Too bad. It made the hardware simpler to implement.
RISC got rid of all of this and just made all registers usable for any instruction, which resulted in fewer special cases for software to juggle. Fewer special cases also made the hardware simpler to implement – presumably with different tradeoffs that became more favorable as time went on, but I don’t know details. But some hardwired special cases for general-purpose registers crept back in.
r0 for zero register: it’s nice but not essential. RISC-V has it. ARM32 does not. On Aarch64 and POWER it’s context dependent??? Basically you spend 1 GPR forever in exchange for removing some special cases from your instruction set. For example, IIRC pre-load hints to the processor in RISC-V take the form of “load some value from memory and store it into the zero register”. (TODO: Double check this.)
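Some of the special cases it removes, using RISC-V as the example: a bunch of common operations are just standard pseudo-instructions that lean on x0 (label is just a placeholder target):

    mv   a0, a1              # really addi a0, a1, 0
    neg  a0, a1              # really sub  a0, x0, a1
    nop                      # really addi x0, x0, 0
    j    label               # really jal  x0, label  (jump, discard the return address)
    beqz a0, label           # really beq  a0, x0, label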
Having the program counter be a GPR is another option. It’s kinda weird, but enables some cute tricks such as jumps really just being normal instructions that modify a particular register, or function returns being a pop instruction that pops the return address from the stack into the program counter. But having one more GPR might be more important than those tricks. It might also make speculation and other hardware design difficult if any instruction might end up being a jump, idk for sure. Only ARM32 does this afaict.
Having a link register seems to be nice; it may or may not need to be special-cased in the instruction set though. In RISC-V every jump-and-link stores the return address (the address of the instruction after the jump) in a register, and a jump-with-no-link just uses r0 as its link register, throwing the return address away.
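Concretely, the same RISC-V jump-and-link instructions cover calls, plain jumps, and returns depending on which register you hand them (some_function and some_label are placeholders; ret and j are just assembler shorthands for the forms below):

    jal  ra, some_function   # call: save the return address in ra, then jump
    jal  x0, some_label      # plain jump: the link goes to the zero register and vanishes
    jalr x0, 0(ra)           # return: jump to the address in ra, discard the new link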
Nobody besides x86_64 seems to have dedicated stack registers anymore, they’re defined only by convention. Stack frame base registers are even more uncommon. TODO: Apparently there might be some actual benefits though. “Take a look at the disassembly for any program (for any ISA) and count the instructions in an addressing mode that use two registers to calculate the address for a load or store. Now count how many of those use the stack pointer as one of the operands. If the number that use the stack pointer is more than zero, your program has some huge stack frames (not including on-stack arrays, which are not addressed by the stack pointer directly, instead they move the base of the array to another register with a stack-pointer-relative add or subtract and then use that). Now, in the other direction, look at the loads and stores that use immediate addressing. These will be quite similar, but the stack-pointer-relative ones will almost always be able to take advantage of a shift because stack spill slots are register-width aligned. In addition, about the only non-memory operations that use the stack pointer are add and subtract (for creating / destroying stack frames) and occasional masking (to realign the stack).” Someone else mentioned that Intel CPU’s have a dedicated “stack engine” that speeds up its stack ops.
Floating point registers are their own special case, and seem to basically always be separate from GPR’s. I thiiiiiink the first implementation of hardware floating point for ARM32 processors reused GPR’s for floating point math, and it sucked so much they rapidly obsoleted it?
TODO: “Loading or storing multiple values is still a win and modern vector ISAs all now have scatter/gather loads. The AArch32 version had some exciting consequences if you took a fault in the middle, which caused some pain. The other thing here that the article misses that made this painful: PC was a GPR. This was a huge mistake because you needed to decode enough of the instruction to know its destination register to know whether it would be a jump.” And also, “RISC-V is the only ISA that has a link register and doesn’t special case it and this was a huge mistake. The JAL and JALR instructions are used almost exclusively with either the zero register or the link register as targets. … RISC-V uses them for the functions for outlined spill slots, but that is mostly necessary because they wasted too much of the encoding space to have efficient prologues. There’s a proposal for a 16-bit spill-multiple instruction, which will fix this.” And also, “There’s another wrinkle with the link register, which means that it ends up being special on RISC-V, even though they try to pretend it isn’t: return-address prediction.”
Multiple operating modes
x86 and ARM both grew multiple operating modes via accretion – real vs protected mode, normal vs Thumb mode. Aarch64 and RISC-V both deliberately eschew them. Whether Aarch64 and RISC-V will be forced to grow new operating modes as software and hardware evolves has yet to be seen, but my impression is the hardware designers really want to avoid it.
SIMD vs. vector instructions
Being able to push more data at a time through a processor’s SIMD unit seems like it should usually be a win. But the debacle of AVX-512’s initial release by Intel shows the risks of blindly increasing SIMD width: the greater bandwidth doesn’t help much if it makes your core run so hot it has to downclock itself. In contrast, taking a more conservative approach can work better, letting the CPU keep its clock speed high even if it’s not actually pumping as many instructions through per clock cycle. Either way, fixed-width SIMD instructions have the problem that widening your SIMD registers/lanes involves adding new instructions, which compilers need to then support, which software then needs to be compiled with, and you’ll still need shims to handle architectures that don’t use those new instructions. And every time you double your SIMD lane width, you also reduce the number of programs out there that can really take advantage of it all.
So a new approach is gaining ground where the software has instructions to give the CPU a numerical algorithm, tell it the bounds of the data to run it on, and then let it perform that algorithm with however much parallelism it has to offer. (This is actually a fairly old approach, but new for commodity hardware afaik.) So when you write a loop to sum an array, if the CPU can handle 2 SIMD operations at a time it will step through an array of data loading and adding 2 elements at once, and if it can handle 16 operations at a time the CPU executing the same code will load and add 16 elements at a time.
It’s a promising idea if it works, and I hope it does. The way I think of it is basically having your SIMD instructions be a bytecode for your CPU’s vector processing unit, and the CPU will execute that bytecode in whatever way is most efficient for the hardware it has available. This offers the ability to have SIMD code be far more backwards (and forwards) compatible, as well as hopefully easier to write. The down side, I suspect, is that you need to then have this specialized sub-processor kinda doing its own thing vs the rest of the CPU, and for small programs the overhead to set up and execute these vector programs may be greater than just chomping down a short run of SIMD instructions. I dunno.
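For a flavor of what this looks like in practice, here’s a sketch of a vector-length-agnostic loop using RISC-V’s V extension (assuming RVV 1.0 assembler syntax; the register assignments in the comments are my own convention). The same code runs unchanged whether the hardware handles 2 or 16 elements per pass, because vsetvli reports how many it will actually do each time around:

    # c[i] = a[i] + b[i] for n 32-bit ints.  a0 = n, a1 = &a, a2 = &b, a3 = &c.
    vec_add:
        vsetvli t0, a0, e32, m1, ta, ma   # t0 = number of elements this pass
        vle32.v v0, (a1)                  # load a chunk of a
        vle32.v v1, (a2)                  # load a chunk of b
        vadd.vv v2, v0, v1                # element-wise add
        vse32.v v2, (a3)                  # store a chunk of c
        sub     a0, a0, t0                # elements left to do
        slli    t1, t0, 2                 # bytes consumed this pass
        add     a1, a1, t1
        add     a2, a2, t1
        add     a3, a3, t1
        bnez    a0, vec_add               # keep going until nothing is left
        ret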
There’s two main implementations of this idea that I know of in the world so far: ARM’s Scalable Vector Extension and RISC-V’s V extension. Both of them exist in real hardware… but only quite new hardware, so I don’t think we have enough data yet about how this approach will stand the test of time.