Hi there,
First off, thank you for open-sourcing this very interesting work!
I've been looking at this repo alongside the paper "Context-Aware Learning to Rank with Self-Attention", and I think there's a bug: the code doesn't seem to match what's described in the paper:
https://github.com/allegro/allRank/blob/master/allrank/models/transformer.py#L105
Here you seemingly apply LayerNorm to the inputs of the MultiHeadAttention and fully-connected sublayers (the "pre-norm" convention, i.e. x + Sublayer(LayerNorm(x))).
However, the paper states (and, as far as I can tell, this is also the original Transformer's convention) that LayerNorm is applied to the outputs of those sublayers, i.e. "post-norm": LayerNorm(x + Sublayer(x)).
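For clarity, here's a minimal PyTorch sketch of the two conventions as I understand them (class names, shapes, and the Linear stand-in sublayer are my own illustration, not the repo's actual code; the repo's wrapper is on the linked line):

```python
import torch
import torch.nn as nn


class PostNormSublayer(nn.Module):
    """Post-norm, as described in the paper: LayerNorm(x + Sublayer(x))."""

    def __init__(self, size: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Normalize AFTER the residual sum, i.e. the sublayer's output.
        return self.norm(x + self.dropout(sublayer(x)))


class PreNormSublayer(nn.Module):
    """Pre-norm, which is what the linked code appears to do: x + Sublayer(LayerNorm(x))."""

    def __init__(self, size: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Normalize BEFORE the sublayer; the residual branch is left untouched.
        return x + self.dropout(sublayer(self.norm(x)))


if __name__ == "__main__":
    x = torch.randn(2, 5, 8)   # (batch, slate_length, d_model), arbitrary sizes
    ff = nn.Linear(8, 8)       # stand-in for an attention or feed-forward sublayer
    print(PostNormSublayer(8)(x, ff).shape)  # torch.Size([2, 5, 8])
    print(PreNormSublayer(8)(x, ff).shape)   # torch.Size([2, 5, 8])
```

Both variants are used in practice, so this may be intentional, but either way it differs from what the paper says.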
It seems this repo isn't actively maintained, but I thought I'd let you know nonetheless.
Thanks & best wishes,
Patrick