Ten years ago, Dzmitry Bahdanau from Yoshua Bengio’s group recognized a flaw in RNNs and the information bottleneck of a fixed length hidden state. They put out a paper introducing attention to rectify this issue. Not long after that, a group of researchers at Google found that you can just get rid of the RNN altogether and you still get great results with improved training performance, giving us the transformer architecture in their Attention Is All You Need paper. But transformers are expensive at inference time and scale poorly with increasing context length, unlike RNNs. Clearly, the solution is to just use RNNs. Two days ago, we got Were RNNs All We Needed?