Hi there,
First off, thank you for open-sourcing this very interesting work!
I've been looking at this repo alongside the paper "Context-Aware Learning to Rank with Self-Attention", and I think there's a bug: the code doesn't seem to match what's described in the paper:
https://github.com/allegro/allRank/blob/master/allrank/models/transformer.py#L105
Here you seemingly apply LayerNorm to the inputs of the MultiHeadAttention and fully-connected sublayers (the "pre-norm" convention, i.e. x + Sublayer(LayerNorm(x))).
However, the paper states (and, as far as I can tell, this is also the original Transformer's convention) that LayerNorm is applied to the outputs of those sublayers, i.e. "post-norm": LayerNorm(x + Sublayer(x)).
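For clarity, here's a minimal PyTorch sketch of the two conventions as I understand them (class names, shapes, and the Linear stand-in sublayer are my own illustration, not the repo's actual code; the repo's wrapper is on the linked line):

```python
import torch
import torch.nn as nn


class PostNormSublayer(nn.Module):
    """Post-norm, as described in the paper: LayerNorm(x + Sublayer(x))."""

    def __init__(self, size: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Normalize AFTER the residual sum, i.e. the sublayer's output.
        return self.norm(x + self.dropout(sublayer(x)))


class PreNormSublayer(nn.Module):
    """Pre-norm, which is what the linked code appears to do: x + Sublayer(LayerNorm(x))."""

    def __init__(self, size: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Normalize BEFORE the sublayer; the residual branch is left untouched.
        return x + self.dropout(sublayer(self.norm(x)))


if __name__ == "__main__":
    x = torch.randn(2, 5, 8)   # (batch, slate_length, d_model), arbitrary sizes
    ff = nn.Linear(8, 8)       # stand-in for an attention or feed-forward sublayer
    print(PostNormSublayer(8)(x, ff).shape)  # torch.Size([2, 5, 8])
    print(PreNormSublayer(8)(x, ff).shape)   # torch.Size([2, 5, 8])
```

Both variants are used in practice, so this may be intentional, but either way it differs from what the paper says.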
It seems this repo isn't actively maintained, but I thought I'd let you know nonetheless.
Thanks & best wishes,
Patrick