Good asm style is pretty universal across ISAs (and different dialects of asm for the same CPU). Compiler output (like gcc/clang) typically does all of the things I mention below, so is a good guideline. (And C compiler output is often a good starting-point for optimizing a small function.)
Generally indent instructions one level deeper than labels and assembler directives.
Indent operands to a consistent column (so varying lengths of mnemonics don't leave your code ragged, and it's easy to scan down a block and see the destination register of every instruction as the first operand; see Footnote 1).
Indent comments on instruction lines to a consistent column on the right, well past the operands to avoid visual noise.
Group blocks of related instructions together, with a blank line to separate them. (Or if you're optimizing for in-order CPUs by scheduling instructions, you can't really do this and need to use comments to keep track of which part of the problem each instruction is working on. Using different levels of indents for the comments can be helpful then.)
Footnote 1:
Except for MIPS store instructions, like sw $t0, 1234($t1), where the first operand is actually a source; they chose to make the asm source use the same operand order for both loads and stores, maybe because they're both I-type instructions in the machine code. This is typical of asm for RISC load/store architectures, though, so it's something to get used to coming from a CISC where mov eax, [rdi] is a load and mov [rdi], eax is a store. And add [rdi], eax is both.
Example: an atoi function for unsigned integers, for a real MIPS with branch-delay slots. It does not target MIPS I, so it assumes no load-delay slots, although I tried to avoid load-use stalls anyway. (Godbolt for a C version)
    # unsigned decimal ASCII string to integer
    # inputs: char* in $a0 - ASCII string that ends with a non-digit character
    # outputs: integer in $v0
    # clobbers: $t0, $t1
    atoi:
        # peel the first iteration to avoid a 0 * 10 multiply
        lbu     $v0, 0($a0)
        addiu   $v0, $v0, -'0'          # digit = *p - '0'
        sltu    $t0, $v0, 10
        bnez    $t0, .Lloop_entry       # if (digit < 10) as unsigned: valid first digit
        nop                             # doing work for the next iteration here hurts ILP for in-order CPUs
        #addu   $t2, $v0, $v0           # total * 2   (branch delay slot)

        # invalid non-digit input
        jr      $ra                     # return 0
        move    $v0, $zero              # (branch delay slot)

    .Lloop:                             # do {
        addu    $v0, $v0, $v0           # total *= 2
        addu    $t0, $t0, $t1           # total*8 + digit
        addu    $v0, $v0, $t0           # total*10 + digit = total*2 + (total*8 + digit)

    .Lloop_entry:
        lbu     $t0, 1($a0)
        addiu   $a0, $a0, 1             # t0 = *(p++ + 1)

        addiu   $t0, $t0, -'0'          # t0 = digit
        sltu    $t1, $t0, 10
        bnez    $t1, .Lloop             # } while (digit < 10);
        sll     $t1, $v0, 3             # total * 8   (branch delay slot)

        jr      $ra
        nop
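For readers who find the control flow easier to follow in C, here is a rough sketch of the same structure: a peeled first iteration, an unsigned range check, and total*10 built from total*2 and total*8. The function name atoi_unsigned is made up for this sketch, and this is not necessarily the C version from the Godbolt link.

```c
/* Hypothetical C rendering of the asm loop structure; the name
   atoi_unsigned is an assumption for illustration only. */
unsigned atoi_unsigned(const char *p)
{
    /* Peeled first iteration: no 0 * 10 multiply needed.
       The cast mirrors lbu (zero-extending byte load). */
    unsigned total = (unsigned char)*p - '0';
    if (total >= 10)            /* unsigned compare also rejects chars below '0' */
        return 0;               /* invalid non-digit input */

    for (;;) {
        unsigned digit = (unsigned char)*++p - '0';
        if (digit >= 10)        /* any non-digit (including the terminator) ends the number */
            break;
        /* total*10 + digit = total*2 + (total*8 + digit) */
        total = (total << 1) + ((total << 3) + digit);
    }
    return total;
}
```

Calling atoi_unsigned("1234x") should return 1234, and a string that starts with a non-digit returns 0, matching the early-out path in the asm.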
This is probably not optimal for any specific MIPS implementation; an in-order superscalar would probably benefit from putting more of the shifts / adds between the load and the branch, even though that means more redundant work done on the last iteration. It's probably good for an out-of-order CPU like the R10000 (r10k). A modern MIPS32r6 would use lsa to left-shift-accumulate, like gcc does with -march=mips32r6, and would use the no-branch-delay versions of branch instructions.
This might be pretty good on an early scalar MIPS, though. The pointer-increment fills the slot after the load, avoiding a stall inside the loop. (The immediate offset of 1 is because we avoided the increment in the peeled first iteration).
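To make the lsa remark concrete, here is a small C model of the two ways to compute total*10 + digit. It assumes lsa computes (rs << sa) + rt; lsa_model, step_shift_add and step_lsa are names invented for this sketch, not anything from an assembler or compiler.

```c
#include <stdint.h>

/* Assumed semantics of a shift-and-add instruction: (rs << sa) + rt. */
static uint32_t lsa_model(uint32_t rs, uint32_t rt, unsigned sa)
{
    return (rs << sa) + rt;
}

/* The decomposition used in the loop above: three adds plus one shift. */
uint32_t step_shift_add(uint32_t total, uint32_t digit)
{
    uint32_t t2  = total + total;           /* total * 2 */
    uint32_t t8d = (total << 3) + digit;    /* total * 8 + digit */
    return t2 + t8d;                        /* total * 10 + digit */
}

/* With a shift-and-add primitive, two dependent operations are enough. */
uint32_t step_lsa(uint32_t total, uint32_t digit)
{
    uint32_t t5 = lsa_model(total, total, 2);   /* total * 4 + total = total * 5 */
    return lsa_model(t5, digit, 1);             /* total * 5 * 2 + digit */
}
```

Both functions give the same result modulo 2^32. The lsa form uses fewer instructions, but its two operations form a serial chain through total, whereas the shift/add form exposes more independent work.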
Filling the delay-slot for the startup branch to .Lloop_entry would be possible if we wanted to compute more stuff for the next iteration after the addu $v0, $v0, $t0 inside the main loop. But that would require a dependency on $v0, hurting ILP for superscalar in-order CPUs. (Currently the top two addu instructions can run in parallel, then the addu that produces the new total can run in parallel with the lbu.)
It would be fine on scalar in-order (like MIPS I / MIPS II), or on out-of-order CPUs.
(Although I'm not sure if early MIPS needs to stall when a conditional branch reads its input from the previous ALU instruction; branch decision is in the ID stage, 1 cycle before EX. But probably not because MIPS I literally didn't have pipeline interlocks for RAW hazards; that's why it had a load delay slot.)