RISC-V Notes

Just scribbling down my thoughts on RISC-V as I read the references. References used are the 20191213 version of the unprivileged spec and the 20190608 version of the privileged spec.

Good

  • Zero is an illegal instruction
  • They do variable-size instructions. Right, more or less: there’s encoding space reserved for instructions longer than 32 bits, but nothing uses it yet. When 16-bit instructions are enabled, 32-bit instructions only need to be 16-bit aligned. Not what I expected, but they say it helps code density a lot, and I can believe it. You can intermix 16- and 32-bit instructions freely; contrast with ARM’s Thumb instruction set, where entire functions/blocks basically have to be one or the other. This means that even in 32-bit code, if you just write your code with compact idioms (keep your most commonly used scratch values in the s0 and s1 registers, for example) then your code is magically smaller.
  • The spec explains a lot of the rationale behind the decisions made, which is neat ’cause in my experience that stuff is mostly never written down anywhere. (At least nowhere that software people ever read? Are there textbooks and papers that talk about that sort of thing?)
  • Canonical 2-instruction sequences for constructing immediates or doing jumps to any 32-bit value, including PC-relative jumps.
  • No flags register, all comparisons are compare-and-jump instructions that compare two registers
  • Separate register file for floating point registers, plus a CSR for FPU status/exceptions/etc. (More on CSR’s in a sec.)
  • No floating point operations trap; they all just set flags. Floating point operations do not preserve NaN payloads: all NaN’s produced by instructions are the canonical quiet NaN 0x7fc00000. AFAICT signaling NaN’s are never produced as results; even a signaling NaN input just raises the invalid flag and yields the canonical quiet NaN.
  • FMA instruction exists.
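For example, the canonical lui+addi pair for loading an arbitrary 32-bit constant can be sketched in Python (helper names are mine; the adjustment to the upper part is needed because addi sign-extends its 12-bit immediate):

```python
# Split a 32-bit constant for the lui/addi sequence.
# lui loads the upper 20 bits; addi then adds a sign-extended 12-bit
# immediate, so when bit 11 of the constant is set the upper part must
# be bumped by one to compensate for addi subtracting.

def split_imm32(value):
    """Return (upper20, lower12) such that lui+addi reconstructs value."""
    value &= 0xFFFFFFFF
    lower = value & 0xFFF
    if lower >= 0x800:                     # addi will effectively subtract
        lower -= 0x1000
    upper = (value - lower) & 0xFFFFFFFF   # now a multiple of 0x1000
    return upper >> 12, lower

def lui_addi(upper20, lower12):
    """Simulate `lui rd, upper20 ; addi rd, rd, lower12` on RV32."""
    return ((upper20 << 12) + lower12) & 0xFFFFFFFF

# Round-trips for easy and awkward constants alike:
for x in (0, 1, 0x7FF, 0x800, 0xDEADBEEF, 0xFFFFFFFF):
    up, lo = split_imm32(x)
    assert lui_addi(up, lo) == x
```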
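And the canonical NaN bit pattern above is easy to check directly: a quiet single-precision NaN has an all-ones exponent and the top mantissa bit (the “quiet” bit) set. A small sketch, not from the spec:

```python
import struct

# The canonical single-precision NaN mentioned above: sign 0,
# exponent all ones, MSB of the mantissa (the quiet bit) set.
CANONICAL_NAN_BITS = 0x7FC00000

# Reinterpret the bit pattern as a float; NaN is the only value
# that compares unequal to itself.
value = struct.unpack("<f", struct.pack("<I", CANONICAL_NAN_BITS))[0]
assert value != value

exponent = (CANONICAL_NAN_BITS >> 23) & 0xFF
quiet_bit = (CANONICAL_NAN_BITS >> 22) & 1
assert exponent == 0xFF and quiet_bit == 1
```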

Bad

  • ermergherd can we stop using “word” to mean “whatever size we feel like”? Knuth already solved this.
  • Single instructions for conditional-branch offsets get 12-bit immediates. I’m no expert but it feels a bit smol… though since the low bit of an offset is always zero, branches implicitly multiply the immediate by 2, so it’s actually +/- 4 kb, which is quite a bit better. Unconditional jal jumps get a 20-bit immediate (also multiplied by 2), so direct function calls reach +/- 1 MB. The register-indirect jalr keeps an unscaled 12-bit immediate, so it only covers +/- 2 kb around its base register, and anything further needs the 2-instruction auipc+jalr sequence.
  • Smol gripe from OS land: It would be kinda nice if the CPU gave us more than one register to work with during interrupts. Automatically saving sp somewhere or such would be convenient. Otherwise you just have to juggle things around yourself.
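The reach numbers for the various immediates are just two’s-complement arithmetic; a generic sketch (helper name is mine, not tied to any particular encoding):

```python
# Reach of an n-bit signed immediate, optionally scaled.
# scale=2 models offsets whose low bit is implicit (always zero).

def reach(bits, scale=1):
    """Return (min_offset, max_offset) for an n-bit signed immediate."""
    return (-(1 << (bits - 1)) * scale, ((1 << (bits - 1)) - 1) * scale)

assert reach(12) == (-2048, 2047)                 # unscaled 12-bit: +/- 2 kb
assert reach(12, scale=2) == (-4096, 4094)        # halfword-scaled: +/- 4 kb
assert reach(20, scale=2) == (-1048576, 1048574)  # 20-bit scaled: +/- 1 MB
```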

Boring but good things

ie, good but conventional decisions made or historical mistakes avoided

  • Endianness is kinda “whichever you want”, but instructions are always little-endian. So it’s kinda de-facto little-endian by default, which is fine.
  • 32 integer registers, x0 is zero, x1 is link register, x2 is stack pointer
  • IEEE floating point
  • Canonical NOP instruction: addi x0, x0, 0
  • No damn branch delay slots
  • Backwards branches are predicted taken (ie, end of a loop), forwards predicted not taken (ie, whichever part of an if the compiler thinks is less likely).
  • Standard system call instruction
  • Reserved chunk of instruction space for hint instructions, means that implementations that don’t understand particular hints can ignore them
  • There’s a separate space of 4096 32-bit machine-specific registers (CSR’s, Control and Status Registers) which have a few reserved for machine counters and are mostly left implementation-defined

Other

Observations, weirdnesses, or things that might be good or bad but I don’t have the background to judge.

  • The spec is written in a very prose-y style in general. This makes it quite good as a teaching doc, and kinda crap as a reference doc.
  • The authors are more interested in the details of instruction opcode formats than anyone I’ve ever seen. Hence there’s a lot of nuance to instruction encodings for even basic instructions, for Good Hardware Reasons that I don’t care about. Similarly, you can tell they’re not systems/software people, ’cause they spend as little time talking about traps, interrupts and memory spaces as they can manage.
  • HARdware Threads are called “hart”’s, for some damn reason.
  • No conditional-move instruction
  • 32-bit values in 64-bit registers are always maintained in sign-extended form, which means that sign-extend is a no-op, addresses are always canonical (if it even has a concept of such a thing, I think it does?), and adding 1 to 0xFFFFFFFFu32 always wraps properly for comparison purposes and stuff like that.
  • The 32-bit ISA is not a subset of the 64-bit ISA, it is an entirely separate ISA that just happens to be mostly the same. Their rationale for this seems reasonable, basically “if you make 32-bit a subset of 64-bit, you can write code in a 32-bit ISA running on a 64-bit system, but you still need the 64-bit ABI to talk to anything so it doesn’t actually make anything better”.
  • Multiplication has separate instructions for “multiply x*y and get the low word” and “multiply x*y and get the high word”. If you want both you’re supposed to just do both in sequence, and the instruction decoder will recognize the sequence and handle it efficiently. Division+remainder is similar. Is this the “macro-op fusion” people like to complain about? Seems relatively innocuous to a software guy. I don’t see any other instruction sequences that require special handling to be efficient, and multiplication and division are both kinda chonky operations to begin with, so idk if making them harder for small CPU’s to handle efficiently will matter much ’cause those CPU’s will want to avoid them anyway.
  • Division by zero does not trap, it just returns a register full of 1’s. Kinda weird, but it makes it so that arithmetic instructions never trap and so removes a special case.
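The always-sign-extended convention for 32-bit values in 64-bit registers can be modelled in a few lines; here’s a sketch of a W-style add under that convention (helper names are mine):

```python
# Model of RV64's convention: 32-bit values live in 64-bit registers
# in sign-extended form, and the "W" instructions re-sign-extend
# their 32-bit result.

MASK64 = (1 << 64) - 1

def sext32(x):
    """Sign-extend the low 32 bits of x to a 64-bit register value."""
    x &= 0xFFFFFFFF
    return x | (0xFFFFFFFF00000000 if x & 0x80000000 else 0)

def addw(a, b):
    """addw-style add: 32-bit wrap, then sign-extend to 64 bits."""
    return sext32((a + b) & 0xFFFFFFFF)

# 0xFFFFFFFF is held as all ones (i.e. -1), so adding 1 wraps to 0:
assert sext32(0xFFFFFFFF) == MASK64
assert addw(sext32(0xFFFFFFFF), 1) == 0
# Sign-extending an already sign-extended value is a no-op:
assert sext32(sext32(0x80000000)) == sext32(0x80000000)
```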
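The mulh/mul split and the divide-by-zero convention can both be modelled directly (a sketch for RV32-sized registers; helper names are mine, not from the spec):

```python
# Model of signed mulh/mul and div on 32-bit registers.

def to_s32(x):
    """Interpret the low 32 bits of x as a signed integer."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x

def mulh(a, b):
    """High 32 bits of the full 64-bit signed product."""
    return ((to_s32(a) * to_s32(b)) >> 32) & 0xFFFFFFFF

def mul(a, b):
    """Low 32 bits of the product."""
    return (to_s32(a) * to_s32(b)) & 0xFFFFFFFF

def div(a, b):
    """Signed division, truncating toward zero; b == 0 returns all ones."""
    if b == 0:
        return 0xFFFFFFFF
    q = abs(to_s32(a)) // abs(to_s32(b))
    if (to_s32(a) < 0) != (to_s32(b) < 0):
        q = -q
    return q & 0xFFFFFFFF

# The two halves recombine into the full 64-bit product:
a, b = 123456789, -987654321
full = (mulh(a, b) << 32) | mul(a, b)
assert full == (to_s32(a) * to_s32(b)) & ((1 << 64) - 1)

assert div(7, 0) == 0xFFFFFFFF           # all ones, no trap
assert div(-7, 2) == (-3) & 0xFFFFFFFF   # truncates toward zero
```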

Operating system stuff

  • There’s something called the SBI, “supervisor binary interface”, that separately documents the API that a CPU (or VM) presents to the operating system. It’s basically “the instructions that a software hypervisor needs to trap and emulate instead of executing directly”, which is a very nice way of thinking about it. It can also serve as a bootloader and portability layer to abstract away implementation-defined things like “how do you start up a hart”.
  • Four privilege modes. One is reserved; the others are User, Supervisor, and Machine. CPU’s are only required to implement Machine mode. Not sure what the difference between Machine and Supervisor is yet, presumably it’s “OS” vs. “hypervisor”.
  • Basically all non-User instructions that modify CPU state are implemented by reading/writing CSR’s. Some of the high bits of the CSR number encode read/write ability and privilege level required. Looks like Supervisor mode can set/manipulate/handle interrupts and traps and such, but nothing more, while Machine mode can also do things like manipulate memory protection, do CPUID type stuff, handle cycle counters/performance timers, etc.
  • There’s space for a semi-nonstandard Debug instruction space too.
  • mscratch is a machine-mode scratch register that sounds like I should know what to do with it already. Thread-local storage stuff? “Typically, it is used to hold a pointer to a machine-mode hart-local context space and swapped with a user register upon entry to an M-mode trap handler.” MIPS apparently has two such registers, k0/k1. Yep, mscratch (and its Supervisor-mode twin sscratch) is how you get information like “here is your interrupt stack address” into interrupt handlers.
  • There’s a “wait for interrupt” instruction that is basically a hint that you are in a busy-wait loop; implementations are allowed to just treat it as a NOP.
  • Where the program counter starts on reboot is implementation-defined.
  • Lots of stuff about how memory access to different regions of memory are controlled (normal, I/O, etc)
  • There’s a Physical Memory Protection functionality (PMP) where you can define RWX access levels to up to 16 contiguous blocks of memory. Might be fun to play with. Embedded-level ARM chips have something similar, but idk how it works. PMP looks basically as simple as you could wish for.
  • Interrupt handler set by the mtvec register. Can just be a single interrupt handler that then figures out what kind of interrupt caused it, or it may be a vector of things where each different interrupt goes to a different place.
  • There’s a separate interrupt handler vector for Supervisor mode, stvec.
  • satp register holds paging context flags and the address of the root of the page table. There’s also an address space ID in there; presumably that’s for tagging TLB entries so you don’t have to flush them all on every context switch.
  • Paging layers on top of the PMP functionality, one way or another. Each page also has the usual RWX bits.
  • On RV64 there are two paging modes, 39-bit virtual addressing (Sv39) and 48-bit virtual addressing (Sv48), with reservations for bigger ones. On RV32 (Sv32) the page table is 2 levels deep. Bit hard to grok; I like how hardware spec writers always seem to be utterly immune to drawing a tree structure comprehensible to software people. If I’m understanding properly, Sv39 uses 3-level tables, and Sv48 is a very straightforward extension with 4-level tables.
  • Smallest page size is 4 kb. 39-bit addressing also allows 2 MB and 1 GB page sizes, same as x86_64. 48-bit addressing also supports 512 GB pages, for when you wanna go hard. 32-bit addressing tops out at 4 MB megapages.
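The CSR address convention described above (top bits encoding read/write vs read-only and the minimum privilege level) decodes in a couple of lines. The CSR numbers below are the standard ones (cycle, mstatus, stvec); the helper is mine:

```python
# Decode the convention bits of a 12-bit CSR address:
# bits [11:10] == 0b11 means read-only, bits [9:8] give the lowest
# privilege level that may access the register.

PRIV = {0b00: "user", 0b01: "supervisor",
        0b10: "hypervisor", 0b11: "machine"}   # 0b10 is reserved-ish

def csr_info(addr):
    """Return (read_only, min_priv) for a 12-bit CSR address."""
    read_only = ((addr >> 10) & 0b11) == 0b11
    min_priv = PRIV[(addr >> 8) & 0b11]
    return read_only, min_priv

assert csr_info(0xC00) == (True, "user")         # cycle counter
assert csr_info(0x300) == (False, "machine")     # mstatus
assert csr_info(0x105) == (False, "supervisor")  # stvec
```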
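And the Sv39 address split is simple once you see it laid out: a 12-bit page offset plus three 9-bit VPN fields, one per table level (a sketch; the helper name is mine):

```python
# Split an Sv39 virtual address into its page-table indices.
# 9 + 9 + 9 + 12 = 39 bits; each level's table has 512 8-byte entries,
# so one table exactly fills a 4 kb page (512 * 8 = 4096).

def sv39_split(va):
    """Return (vpn2, vpn1, vpn0, offset) for a 39-bit virtual address."""
    offset = va & 0xFFF
    vpn0 = (va >> 12) & 0x1FF
    vpn1 = (va >> 21) & 0x1FF
    vpn2 = (va >> 30) & 0x1FF
    return vpn2, vpn1, vpn0, offset

# The pieces recombine into the original (39-bit-truncated) address:
va = 0x1234567ABC
vpn2, vpn1, vpn0, off = sv39_split(va)
assert (vpn2 << 30) | (vpn1 << 21) | (vpn0 << 12) | off == va & ((1 << 39) - 1)
```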