RISC-V Load Immediate `li` Pseudo-Instruction

The pseudo-instruction li breaks down to a complex algorithm depending on the variables. Before presenting the algorithm, some terms have to be defined.

Terms and Concepts

Numbers and Sign

First, an unsigned number has its raw value in binary and cannot be negative. There is no guess on how to read the value, each bit will only represent a power of two.

 1   0   1   1
2^3 2^2 2^1 2^0
 8 + 0 + 2 + 1  = 11

Next, a signed integer will have its most significant bit set to a sign bit. A naive representation would be to note if the sign bit is one, in which case the number is negative, then interpret the rest of the number. However, this makes the value 0 and -0 have a different representation.

 1   0   1   1
 -  2^2 2^1 2^0
 - ( 0 + 2 + 1 ) = -3

Another issue with this representation is the fact that you have to take extra care of operations. For example:

   1011 (-3)
 + 1010 (-2)
 ______
   0101 (+5)  X incorrect

The real answer is 10101 but if we are limited to 4 bits, the extra digit is simply lost…

The solution to this representation is two’s complement, inverting all bits of the positive integer and add 1 to get its negative representation.

7 --> 0111 --> 1000 --> 1001 --> -7
       (7)   (2s ct)    (+1)

If we now try another operation:

	1001 (-7)
  + 0100 (+4)
  ______
    1101  sign bit up so the value is negative!
   	      remove bit sign and subtract 1 -> 100 = 4
      => -4  CORRECT!

In Pharo

While these are basic number representation concepts, in Pharo, writing -16rF22 is possible and the inspector will tell you it is equal to -3874. However, if what you mean is perform the two’s complement of that number, you can use a function such as:

computeSignedValueOf: aValue ofSize: aSize
	"If the number is negative, returns two's complement, otherwise return the value"
	aValue < 0
		ifTrue: [ ^ aValue twoComplementOfBitSize: aSize ]  "Compute two's complement"
		ifFalse: [ ^ aValue ]

Sign Extension

Another concept that is important regarding signed and unsigned numbers is the concept of sign extension and zero extension. When moving a number around or extracting a number from a bit array, it is important to extend it to the destination size. How should you extend it then? Add 0s as the higher bits?

Two methods coexist: the first one is called sign-extension and will write the sign bit of the number as all the higher bits while zero-extension will write 0s. The first one has to be used when dealing with signed numbers while the latter is preferred for unsigned numbers.

For example, moving the signed value -222 (16r22 with the smallest size of 8 bits) should be sign-extended to 16rFFFFFF22 when moved in a 32-bits register. This will not change the value due to how two’s complement work! On the other hand, when moving an unsigned number it is important to keep the bits raw as they directly represent the number.

In C or C++, sign-extension is usually performed as follows:

uint64_t x;
// Extract the lower n bits of x and sign-extend them
int64_t low12 = (int64_t)((x << (64 - n)) >> (64 -n))

Using the fact that causing an overflow in a left shift will truncate the number, shifting left then shifting right the same amount will truncate the value at the given shift amount. The cast to int64_t from uint64_t will perform the sign extension!

Load Immediate `li` pseudo-instruction

Loading an immediate in RISC-V means loading a raw value in a register. While this can be done with a single load operation on architectures such as x86, the path is more convoluted for RISC-V.

Instructions Available

We have four different instructions that we can combine to load a particular immediate. Those are:

Load Upper Immediate lui: Writes the sign-extended 20-bit immediate, left-shifted by 12 bits to x[rd], zeroing the lower 12 bits. This translates to:
```
x[rd] = sext(immediate[31:12] << 12)
```
Add Immediate addi: Adds the sign-extended 12-bits immediate to register x[rs1] and writes the result to x[rd]. This translates to:
```
x[rd] = x[rs1] + sext(immediate)
```
Add Word Immediate addiw: Adds the sign-extended 12-bits immediate to register x[rs1], truncates the result to 32 bits, and writes the sign-extended result to x[rd]. This translates to:
```
x[rd] = sext((x[rs1] + sext(immediate))[31:0])  
```
Shift Left Logical Immediate slli: Shifts register x[rs1] left by shamt positions. The vacated bits are filled with zeros, and the result is written to x[rd]. This translates to:
```
x[rd] = x[rs1] << shamt
```

Using these four instructions, we can easily (naively) figure how we could load an immediate. Depending on the size of the immediate, we can use from 1 to 8 instructions: addi if it fits less than 12 bits or up to lui, addiw, slli, addi, slli, addi, slli, addi. Note that the first two instructions lui + addiw can contribute up to 32 bits while the following addi contribute up to 12 bits each. We first emit the 32 most significant bits with lui+addiw then work our way with shifts and adds.

However, this idea does not work due to the fact that EACH addi and addiw performs a sign-extension. This means that if a big constant (let’s say 0x7000000000000800 or 8070450532247930880) has to be loaded, the final part where you perform the add will get understood as a signed integer due to the most significant bit set to 1.

ASM Exploration

Using clang and llvm-objdump, let’s look at how the compiler handles the translation!

Using clang makes it possible to generate an object file from the asm with:

$ clang -c -target riscv64 -march=rv64g -g load_immediate.s -o load_immediate.o

load_immediate.s is composed of:

    .org 0x1000
 main:
    li x7, 12341234

The generated object file can be read with objdump:

$ llvm-objdump -S load_immediate.o

A simple Python script automates the process in the command line:

import sys
import subprocess

if len(sys.argv) != 2:
    print("Wrong arguments! Use as python immediate.py <your 64-bits immediate value>")
    sys.exit(1)

with open("temp.s", "w") as file:
    file.write("""    .org 0x1000
 main:
    li x7, {}""".format(sys.argv[1]))

subprocess.run(['clang', '-c', '-target', 'riscv64', '-march=rv64g',
                         '-g', 'temp.s', '-o', 'temp.o'])

result = subprocess.run(['llvm-objdump', '-S', 'temp.o'], stdout=subprocess.PIPE)
print(result.stdout.decode('utf-8'))

ASM Examples

Let’s look at some examples of the result of objdump:

7FF:

0000000000001000 main:
;     li x7, 0x7FF
1000: 93 03 f0 7f          	addi	t2, zero, 2047  # 0x7FF

FF0:

0000000000001000 main:
;     li x7, 0xFF0
1000: b7 13 00 00          	lui	t2, 1           # 0x1
1004: 9b 83 03 ff          	addiw	t2, t2, -16 # 0xFF0

The first example loads the value within 11 bits (sign bit 0) so a simple addi is enough. On the other hand, the second value, FF0 will be understood as the two’s complement of 16 and the emitted instruction will be addi rd, rs, -16. Leaving this instruction alone will sign-extend the value -16 to the size of the register. Instead of having only the value FF0 in the register, we would have FFFFFFF0 (for 32 bits). One way to solve the problem is to add 1 through lui so that the sign bit will be reset:

    0000 1000
  + FFFF FFF0
  ___________
    0000 0FF0

A simple check on the sign bit of the 12 bits immediate should work!

But what happens when the value lui should manage gets negated too?

Let’s look into some more ASM examples:

7FF00FF0:

0000000000001000 main:
;     li x7, 0x7FF00FF0
1000: b7 13 f0 7f          	lui	t2, 524033       # 0x7FF01
1004: 9b 83 03 ff          	addiw	t2, t2, -16  # 0xFF0

FFF00FF0:

0000000000001000 main:
;     li x7, 0xFFF00FF0
1000: b7 03 10 00          	lui	t2, 256           # 0x100
1004: 9b 83 13 f0          	addiw	t2, t2, -255  # 0xF01
1008: 93 93 c3 00          	slli	t2, t2, 12
100c: 93 83 03 ff          	addi	t2, t2, -16   # 0xFF

While the first example is straightforward, the second one is more convoluted, we will look into this one. To handle the sign extension lui performs, a new instruction is added! The first two will do an addition between 0x00100000 (lui loads 256 then shifts it left by 12 bits) and 0xFFFFFF01 (addiw uses the sign-extended value -255). The result is:

   0010 0000
 + FFFF FF01
 ___________
   000F FF01

which corresponds to our value 0xFFF00 with the added 1 since the next value has the sign bit set and has to be corrected as presented in the examples earlier.

An additional instruction can handle the shift the issue and handle the problem of sign-extension for lui!

LLVM Solution

How does LLVM handles this rather complicated algorithm (and even better than that, it can also adapt the shifts in case of scarce values)?

(https://github.com/llvm/llvm-project/blob/4c3d916c4bd2a392101c74dd270bd1e6a4fec15b/llvm/lib/Target/RISCV/MCTargetDesc/RISCVMatInt.cpp)

static void generateInstSeqImpl(int64_t Val,
                                 const FeatureBitset &ActiveFeatures,
                                 RISCVMatInt::InstSeq &Res) {
   // Check for 64 bits
   bool IsRV64 = ActiveFeatures[RISCV::Feature64Bit];  
   assert(IsRV64 && "Can't emit >32-bit imm for non-RV64 target");

   // Check if the value fits in 32-bits
   if (isInt<32>(Val)) {
     // Depending on the active bits in the immediate Value v, the following
     // instruction sequences are emitted:
     //
     // v == 0                        : ADDI
     // v[0,12) != 0 && v[12,32) == 0 : ADDI
     // v[0,12) == 0 && v[12,32) != 0 : LUI
     // v[0,32) != 0                  : LUI+ADDI(W)
     int64_t Hi20 = ((Val + 0x800) >> 12) & 0xFFFFF;
     int64_t Lo12 = SignExtend64<12>(Val);

     if (Hi20)
       Res.push_back(RISCVMatInt::Inst(RISCV::LUI, Hi20));

     if (Lo12 || Hi20 == 0) {
       unsigned AddiOpc = (IsRV64 && Hi20) ? RISCV::ADDIW : RISCV::ADDI;
       Res.push_back(RISCVMatInt::Inst(AddiOpc, Lo12));
     }
     return;
   }

   // In the following, constants are processed from LSB to MSB but instruction
   // emission is performed from MSB to LSB by recursively calling
   // generateInstSeq. In each recursion, first the lowest 12 bits are removed
   // from the constant and the optimal shift amount, which can be greater than
   // 12 bits if the constant is sparse, is determined. Then, the shifted
   // remaining constant is processed recursively and gets emitted as soon as it
   // fits into 32 bits. The emission of the shifts and additions is subsequently
   // performed when the recursion returns.
   int64_t Lo12 = SignExtend64<12>(Val); // (int64_t)((Val << (64 -12)) >> (64 -12))
   int64_t Hi52 = ((uint64_t)Val + 0x800ull) >> 12;
   int ShiftAmount = 12 + findFirstSet((uint64_t)Hi52);
   Hi52 = SignExtend64(Hi52 >> (ShiftAmount - 12), 64 - ShiftAmount);
   // If the remaining bits don't fit in 12 bits, we might be able to reduce the
   // shift amount in order to use LUI which will zero the lower 12 bits.
   if (ShiftAmount > 12 && !isInt<12>(Hi52) && isInt<32>((uint64_t)Hi52 << 12)) {
     // Reduce the shift amount and add zeros to the LSBs so it will match LUI.
     ShiftAmount -= 12;
     Hi52 = (uint64_t)Hi52 << 12;
   }
   // Recursive call
   generateInstSeqImpl(Hi52, ActiveFeatures, Res);
   // Generation of the instruction
   Res.push_back(RISCVMatInt::Inst(RISCV::SLLI, ShiftAmount));
   if (Lo12)
     Res.push_back(RISCVMatInt::Inst(RISCV::ADDI, Lo12));
 }

Let’s understand this piece of recursive code step by step. Note that the given value is a 64-bits signed integer!

If the value fits in 32 bits, it splits it in 20 + 12 and performs the usual lui+addi(w) *if needed*

Note: Since this code is used recursively, this is the last part that will execute.

Otherwise:
Extract the lowest 12 bits from the constant and sign-extended to 64 bits.
Extract the highest 52 bits by first adding 0x800 then shifting left by 12 bits.

Note: the addition will automatically verify the bit sign, if it is 1 it will be propagated as we want otherwise it will not do anything.
Find the optimal shift amount by looking into the highest 52 bits for the first set bit, if it is greater than 12, shift the Hi52 as well so it takes it in consideration.
If the remaining bits do not fit in 12 bits, the shift amount can be reduced following certain conditions.
Finally, call recursively the function and push back the FF0

Note: The constant is processed from LSB to MSB (right to left) but instruction emission is performed from MSB to LSB (right to left).

Notes:

LLVM Comments:

In the worst case, for a full 64-bit constant, a sequence of 8 instructions (i.e., LUI+ADDIW+SLLI+ADDI+SLLI+ADDI+SLLI+ADDI) has to be emitted. Note that the first two instructions (LUI+ADDIW) can contribute up to 32 bits while the following ADDI instructions contribute up to 12 bits each.

On the first glance, implementing this seems to be possible by simply emitting the most significant 32 bits (LUI+ADDIW) followed by as many left shift (SLLI) and immediate additions (ADDI) as needed. However, due to the fact that ADDI performs a sign extended addition, doing it like that would only be possible when at most 11 bits of the ADDI instructions are used. Using all 12 bits of the ADDI instructions, like done by GAS, actually requires that the constant is processed starting with the least significant bit.

In the following, constants are processed from LSB to MSB but instruction emission is performed from MSB to LSB by recursively calling generateInstSeq. In each recursion, first the lowest 12 bits are removed from the constant and the optimal shift amount, which can be greater than 12 bits if the constant is sparse, is determined. Then, the shifted remaining constant is processed recursively and gets emitted as soon as it fits into 32 bits. The emission of the shifts and additions is subsequently performed when the recursion returns.

recursiveLoadImmediate: anImmediate inRegister: aRegister andEmitInstructionsIn: aCollection

	"LLVM uses a clever recursive way to determine the best combination of instructions that
	 are needed by the pseudo instruction li.
	https://github.com/llvm/llvm-project/blob/4c3d916c4bd2a392101c74dd270bd1e6a4fec15b/llvm/lib/Target/RISCV/MCTargetDesc/RISCVMatInt.cpp"

	| hi20 lo12 hi52 shiftAmount signedImmediate |
	self flag: #TODO.
	signedImmediate := anImmediate signedIntFromLong64.
	"Special case if the value is 16rFFFFFFFFFFFFFFFF (max)"
	(signedImmediate = -1)
		ifTrue: [ aCollection add: (self addImmediate: -1 toRegister: X0 inRegister: aRegister).
					 "End the recursion!"
					 ^ 0 ].
	"Check if the immediate can be contained in 32 bits"		
	(self value: signedImmediate isContainedIn: 31)
		ifTrue: [
			"Depending on the value of the immediate, the following instructions are emitted:
				imm == 0                          : ADDI
				imm[0,12) != 0 && imm[12,32) == 0 : ADDI
			   imm[0,12) == 0 && imm[12,32) != 0 : LUI
				imm[0,32) != 0                    : LUI+ADDIW
			"
			hi20 := (((self computeSignedValue64Bits: signedImmediate) + 16r800) >> 12) bitAnd: 16rFFFFF.
			lo12 := self computeSignedValue64Bits: (signedImmediate bitAnd: 16rFFF).

			"lui instruction"
			hi20 ~= 0
				ifTrue: [ aCollection add: (self loadUpperImmediate: hi20 inRegister: aRegister) ].

			((lo12 ~= 0) or: [hi20 = 0])
				ifTrue: [ hi20 ~= 0
						ifTrue: [ aCollection add: (self addWordImmediate: lo12 toRegister: aRegister inRegister: aRegister)]
						ifFalse: [ aCollection add: (self addImmediate: lo12 toRegister: X0 inRegister: aRegister) ]
					].
			"Return to end the recursion"			
			^ 0
		].
		"In the case the value does not fit in 32 bits"
		lo12 := self computeSignedValue64Bits: (signedImmediate bitAnd: 16rFFF).
		hi52 := (((self computeSignedValue64Bits: signedImmediate) + 16r800) bitAnd: 16rFFFFFFFFFFFFFFFF) >> 12.
		"Process the optimal shift amount"
		shiftAmount := 12 + (self trailingZerosOf: hi52).
		hi52 := self computeSignedValueOf: hi52 >> (shiftAmount - 12) ofSize: (64 - shiftAmount).

		"Check if the shift amount can be reduced to fit in a 32 bit variable"
		((shiftAmount > 12) and: [((hi52 bitAnd: 16rFFF) = 0) and: [self value: (hi52 << 12) isContainedIn: 32]])
			ifTrue: [
				"Reduce the shift amount"  
				shiftAmount := shiftAmount - 12.
				hi52 := self computeSignedValueOf: hi52 << 12 ofSize: 64.
			].

		" Recursive call"
		self recursiveLoadImmediate: hi52 inRegister: aRegister andEmitInstructionsIn: aCollection.
		"Add shift and add"		
		aCollection add: (self shiftLeftValueInRegister: aRegister byShiftAmount: shiftAmount intoRegister: aRegister).
		lo12 ~= 0
			ifTrue: [ aCollection add: (self addImmediate: lo12 toRegister: aRegister inRegister: aRegister) ].

		^ 0

RISC-V Load Immediate li Pseudo-Instruction