
[otbn] Define SIMD instructions #29231

Open
etterli wants to merge 3 commits into lowRISC:master from etterli:otbn-pqc-simd-insn

Conversation

Contributor

@etterli etterli commented Feb 2, 2026

This PR defines the new SIMD (vectorized) bignum instructions for OTBN, operating on the WDRs. The new instructions operate on eight 32-bit elements per WDR:

  • bn.addv(m): Vectorized addition with optional pseudo-modulo reduction.
  • bn.subv(m): Vectorized subtraction with optional pseudo-modulo reduction.
  • bn.mulv(l): Vectorized multiplication with optional 'by lane' mode.
  • bn.mulvm(l): Vectorized Montgomery multiplication (without Montgomery's conditional final subtraction), with optional 'by lane' mode.
  • bn.trn1 / bn.trn2: Instructions to interleave vector elements. This is especially useful for the NTT and INTT to shuffle vector elements when the stride between elements is smaller than what two WDRs provide. These instructions also operate on 64-bit and 128-bit vector elements.
  • bn.shv: Vectorized logical shift instruction.
  • bn.pack / bn.unpk: Instructions to pack / unpack vectors of 32-bit elements into / from vectors of 24-bit elements, which saves DMEM space.

The instructions are described in more detail in hw/ip/otbn/data/bignum-insns.yml (first commit). For bn.pack / bn.unpk and bn.trn, see also the programmer's manual. Alternatively, one can build the documentation locally and view the generated ISA manual (use util/site/build-docs.sh serve).
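To make the vector view concrete, here is a minimal Python sketch of a 256-bit WDR split into and rebuilt from eight unsigned 32-bit elements. The helper names and the lane ordering (element 0 in the least significant bits) are my assumptions, not something defined by this PR; the authoritative definitions live in bignum-insns.yml.

    # Sketch only: models the ".8s" view of a 256-bit WDR as eight unsigned
    # 32-bit elements, with element 0 taken from bits [31:0].
    def wdr_to_elems(wdr_val, elen=32):
        mask = (1 << elen) - 1
        return [(wdr_val >> (i * elen)) & mask for i in range(256 // elen)]

    def elems_to_wdr(elems, elen=32):
        wdr_val = 0
        for i, e in enumerate(elems):
            wdr_val |= (e & ((1 << elen) - 1)) << (i * elen)
        return wdr_val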

When this is merged, various OTBN DV tests will fail: the Random Instruction Generator (RIG) parses bignum-insns.yml and will generate programs that include the new instructions, which are not yet implemented in the simulator or RTL and are therefore treated as illegal instructions.

This PR will have minor conflicts with #29025 (only some documentation files), so we should agree on which to merge first (probably #29025).

Add two WDR registers interpreted as vectors elementwise and reduce the results by the MOD WSR.

The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of size given by `elen`.
The vectors are summed elementwise and if an individual result is equal to or larger than MOD, MOD is subtracted from it.
Contributor

Nit: If we're using commas, we need one before "if" as well. But I'd probably split into two sentences, and you need to say something about the SIMD behaviour of MOD as well.

Something like this maybe?

    The vectors are summed elementwise.
    If an individual result is equal to or larger than the bottom lane of MOD then that lane is subtracted from it.

I think the text needs a bit of re-ordering to make sure we talk clearly about the bottom lane of MOD (which I think is what you mean by the penultimate para) a bit earlier.

Contributor Author

I agree on splitting the sentence.

Further below it is defined which part of the MOD WSR is used to do the reduction.
"The modulus is taken from the MOD WSR at the corresponding element length bits. For example, for .8s the modulus is taken from MOD[31:0] and replicated for all element pairs."

Does this clarify the situation?

Contributor

@rswarbrick rswarbrick Feb 3, 2026

Thanks, this looks good. I wonder whether it makes sense to give a name to the modulus we're using (a name that doesn't also happen to be the name of a WSR).

Something like this maybe?

    Add two WDR registers interpreted as vectors and reduce the results using the MOD WSR.

    The modulus by which the addition results are reduced comes from the bottom word of the MOD WSR at the corresponding element length.
    For example, for `.8s` the modulus is taken from `MOD[31:0]`.
    Write M for this modulus.

    The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of the same size.
    The vectors are summed elementwise.
    If an individual result is equal to or larger than M then M is subtracted from it.
    The final results are truncated to the element length and the whole vector is stored in `wrd`.
    ...

Contributor Author

I took q instead of M to be consistent with the bn.mulvm instruction.
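As a reference for the wording settled on above, here is a minimal Python sketch of the .8s reduction semantics. It assumes q = MOD[31:0] and element 0 in the least significant bits; both modelling choices are mine, not the spec.

    # Sketch of the bn.addvm .8s semantics as worded above: q = MOD[31:0],
    # each lane is reduced by at most one subtraction of q and then truncated.
    def bn_addvm_8s(wrs1, wrs2, mod):
        q = mod & 0xFFFFFFFF                    # bottom word of the MOD WSR
        result = 0
        for i in range(8):
            a = (wrs1 >> (32 * i)) & 0xFFFFFFFF
            b = (wrs2 >> (32 * i)) & 0xFFFFFFFF
            s = a + b
            if s >= q:
                s -= q                          # pseudo-modulo: a single subtraction only
            result |= (s & 0xFFFFFFFF) << (32 * i)
        return result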

@etterli etterli force-pushed the otbn-pqc-simd-insn branch from f0c84b5 to 81053bf on February 2, 2026 at 20:19
Member

@nasahlpa nasahlpa left a comment

Thanks Pascal, this looks really good to me. Just a couple of nits/questions.

@etterli etterli force-pushed the otbn-pqc-simd-insn branch 4 times, most recently from 5997928 to 1c3cca1 on February 4, 2026 at 08:01
@etterli etterli added the CI:Rerun (Rerun failed CI jobs) label on Feb 4, 2026
@github-actions github-actions bot removed the CI:Rerun (Rerun failed CI jobs) label on Feb 4, 2026
Member

@nasahlpa nasahlpa left a comment

Thanks for addressing the feedback!

Contributor

@h-filali h-filali left a comment

Thanks @etterli this looks very good. I just have some minor comments and questions.

The vectors are summed elementwise and each sum is truncated to the element length.
The final vector is stored in `wrd`.

Flags are not used or saved.
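For comparison with the reducing variant, a minimal Python sketch of the behaviour described in this hunk, again assuming 32-bit lanes with element 0 in the least significant bits:

    # Sketch of the non-reducing vectorized add: elementwise sum, truncated to 32 bits.
    def bn_addv_8s(wrs1, wrs2):
        result = 0
        for i in range(8):
            a = (wrs1 >> (32 * i)) & 0xFFFFFFFF
            b = (wrs2 >> (32 * i)) & 0xFFFFFFFF
            result |= ((a + b) & 0xFFFFFFFF) << (32 * i)   # wrap on overflow
        return result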
Contributor

Not sure, but do we need an iflow section here?

Contributor Author

Thanks for spotting this.

Contributor

There's a second instruction without an iflow section btw. Just in case you didn't catch that :)

Contributor Author

Even more than one, thanks!

The vectors are multiplied elementwise and each product is truncated to the element length.
The final vector is stored in `wrd`.

This instruction stalls OTBN for 4 cycles and writes the final result to the ACC WSR.
Contributor

Is the final result added to the accumulate WSR or just stored there? Or why are we storing to wrd and ACC?

Contributor Author

The instruction operates over 4 cycles, and after each cycle there is one vector chunk result which must be stored somewhere. It cannot be written to the WDR before the instruction finishes (otherwise the lane version won't work properly), so we store the partial results in ACC. This way we "construct" the result over multiple cycles and use ACC as a buffer. At the end we still must perform a write to the desired WDR.

I agree this sentence should be extended, similar to how the mulvm instructions are described.

Contributor Author

Do you think there should be more remarks on how the partial results are accumulated?

Contributor

Okay that makes sense. That was also what I had in mind. I think what tripped me up a bit is the fact that we are using the accumulate register even though we are not accumulating. Maybe we can mention explicitly that ACC is overwritten and we do not accumulate.

Contributor Author

"writes the final result to the ACC WSR" I think this includes these points?

Even-numbered vector elements from `wrs1` are placed into even-numbered elements of `wrd`.
Even-numbered vector elements from `wrs2` are placed into odd-numbered elements of `wrd`.

This can be described as `wrd[i] = wrs1[i]` and `wrd[i+1] = wrs2[i]` for `i in [0, 2, .., 256//size]` which results in `wrd = {..., wrs2[2], wrs1[2], wrs2[0], wrs1[0]}`.
Contributor

Not sure this looks right. So for .2q we would have i in [0, 2] and this would give us 4 elements {wrs2[2], wrs1[2], wrs2[0], wrs1[0]} instead of just 2 {wrs2[0], wrs1[0]}.

Also if you use size as the number of elements given by elen then we could write something simpler like:
i in [0, 2, .., elen-2]?

Not sure I understood correctly though.

Contributor Author

@etterli etterli Feb 6, 2026

You are right: for .2q the example and pseudocode are not correct; there should only be two elements in the example result.

I honestly don't know how to describe this example such that it covers all elens. What do you think about making the example specific to the .8s case? I think the other cases are then easy to understand.
Like this:

For .8s, this can be described as wrd[i] = wrs1[i] and wrd[i+1] = wrs2[i] for i in [0, 2, .., (256//32)-2] which results in wrd = {..., wrs2[2], wrs1[2], wrs2[0], wrs1[0]}.

I explicitly wrote 256//32 to highlight that it depends on the element size.

Using elen directly does not work because .8s is encoded as 0.

Contributor

Maybe you could introduce vlen as the vector length instead of size. This way we can simplify the notation to just i in [0, 2, ..., vlen-2].

Sure this is also not perfect for vlen = 2 but I think this would be the easiest.

Contributor Author

As the explanation is now only for .8s, size is not used anymore. I think the textual description is the main part, and the concrete example for .8s makes it more understandable, so there is no need for abstract / generalized examples.
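For completeness, a small Python sketch of the .8s behaviour agreed on above for bn.trn1, with element 0 in the least significant bits; the bn.trn2 mirror is flagged as an assumption in the comment.

    # Sketch of bn.trn1 for .8s: even-numbered elements of wrs1 go to even positions
    # of wrd, even-numbered elements of wrs2 go to the following odd positions.
    def bn_trn1_8s(wrs1, wrs2):
        wrd = 0
        for i in range(0, 8, 2):                # i in [0, 2, .., (256//32)-2]
            e1 = (wrs1 >> (32 * i)) & 0xFFFFFFFF
            e2 = (wrs2 >> (32 * i)) & 0xFFFFFFFF
            wrd |= e1 << (32 * i)               # wrd[i]   = wrs1[i]
            wrd |= e2 << (32 * (i + 1))         # wrd[i+1] = wrs2[i]
        return wrd

    # Assumption: bn.trn2 mirrors this with the odd-numbered source elements,
    # i.e. wrd[i] = wrs1[i+1] and wrd[i+1] = wrs2[i+1].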

@etterli etterli force-pushed the otbn-pqc-simd-insn branch from 1c3cca1 to 6505711 on February 6, 2026 at 08:45
@etterli etterli added the CI:Rerun (Rerun failed CI jobs) label on Feb 6, 2026
@github-actions github-actions bot removed the CI:Rerun (Rerun failed CI jobs) label on Feb 6, 2026
Contributor

vogelpi commented Feb 6, 2026

@andreaskurth this PR will break the OTBN regressions as it introduces instruction definitions which at the moment neither the RTL nor the simulator knows. And that's kind of okay as we need to split up the work into multiple PRs.

However, I think we should increase the version number and set back the design and verification stages in otbn.hjson to D1 and V1 to reflect this. In terms of version number increase: OTBN binaries and Ibex code interfacing OTBN should not change as long as the IMEM does not increase to more than 16 KiB. Right now, it's not entirely clear if 16 KiB will be enough. So, I would suggest increasing the version from 1.1.0 to 1.2.0 for now. We can always go to 2.0.0 if breaking changes get introduced. WDYT?

This introduces vectorized (SIMD) big number instructions operating on
the 256-bit WDRs. These instructions interpret WDRs as vectors of
unsigned elements. The element width is 32 bits for most instructions;
a few instructions also support larger widths.

Signed-off-by: Pascal Etterli <[email protected]>
…lanations

Adds a simple example of how to use the bn.pack and bn.unpack instructions.

Signed-off-by: Pascal Etterli <[email protected]>
This illustrates the functionality of the bn.trn1 and bn.trn2
instructions.

Signed-off-by: Pascal Etterli <[email protected]>
@etterli etterli force-pushed the otbn-pqc-simd-insn branch from 6505711 to 134ff78 on February 6, 2026 at 14:14