> Furthermore, by rotating the vector, we have absolutely zero impact on the norm of the vector, which encodes the semantic information of our token.
Doesn’t the angle encode semantic information? Cosine similarity works for embeddings after all.
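A quick numerical check of the point being discussed (my own toy example, not the article's code): the rotation leaves the norm untouched, and although each vector's absolute angle does change, rotating query and key by position-dependent angles makes their dot product (and hence the cosine similarity) depend only on the original vectors and the relative position.

```python
import math
import torch

def rotate_2d(v, theta):
    # Rotate a 2-D vector by angle theta (RoPE does this to each pair of dimensions).
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s], [s, c]]) @ v

q = torch.tensor([1.0, 2.0])
k = torch.tensor([0.5, -1.0])
theta = 0.1        # per-position angle step
m, n = 7, 3        # query position, key position

q_rot = rotate_2d(q, m * theta)
k_rot = rotate_2d(k, n * theta)

print(q.norm(), q_rot.norm())                        # norm is unchanged
print(torch.dot(q_rot, k_rot))                       # depends only on q, k and (m - n)...
print(torch.dot(rotate_2d(q, (m - n) * theta), k))   # ...same value
```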
Thanks to the author for clarifying something that's been a mystery to me for a few years. The positional encoding scheme in the "Attention Is All You Need" paper is only given half a page and the construction appears to come out of nowhere.
Thank you! Seemed like voodoo to me too, hence this post!
One of the things I really love about RoPE is that it allows for a lot of interesting encoding schemes at inference time without model retraining. I've had a lot of fun playing with different relative positions. You can elicit a lot of interesting behaviors from the model when you use different rotations for keys vs. queries; they don't always have to match.
For example, exact position doesn't matter too much when tokens are spaced out. Let's say you use token position 100 for your query: you can shift all the keys around position 100, and the further back they are in the context, the more freedom you have to play with the value.
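A rough sketch of what that decoupling can look like (my own toy RoPE, not the article's code): the position indices fed to the query rotation and to the key rotation are just arguments, so nothing forces them to match at inference time.

```python
import torch

def apply_rope(x, positions, base=10000.0):
    # x: (seq_len, dim) with dim even; positions: (seq_len,) float tensor.
    # Standard RoPE: rotate consecutive dimension pairs by position-dependent angles.
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))  # (dim/2,)
    angles = positions[:, None] * inv_freq[None, :]                     # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
seq, dim = 8, 16
q = torch.randn(seq, dim)
k = torch.randn(seq, dim)

# Nothing forces the key positions to match the query positions at inference:
q_pos = torch.full((seq,), 100.0)           # treat every query as position 100
k_pos = torch.arange(seq).float() + 92.0    # keys shifted to sit just behind it

scores = apply_rope(q, q_pos) @ apply_rope(k, k_pos).T / dim ** 0.5
print(scores.shape)  # (8, 8); positions enter only through the offsets q_pos[i] - k_pos[j]
```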
I don't think the first code example should work (it indeed says False here).
When given a permuted sequence, the attention output will also be permuted, not identical. The need for positional encodings comes from two tokens producing the same value in the final attention matrix regardless of their absolute and relative positions; that alone is enough to lose a lot of meaning.
The first code example says False because of high precision; I've updated the example.
But u/imjonse's reasoning seems right. I haven't run either version of the code, but when reading it I expected that to be False. The output is still a list with an order.
the dog chased the cat: position 1 in the output is attention(dog, everything)
the cat chased the dog: position 1 in the output is attention(cat, everything)
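A minimal sketch of that point (my own toy code, not the article's): with no positional information, permuting the input tokens permutes the rows of the attention output rather than leaving them identical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy embeddings for the 5 tokens of "the dog chased the cat" (d_model = 8),
# with no positional information added.
x = torch.randn(5, 8)

def self_attention(x):
    # Bare single-head attention, no projections, to isolate the permutation behaviour.
    scores = x @ x.T / x.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ x

perm = torch.tensor([0, 4, 2, 3, 1])   # "the cat chased the dog": swap dog and cat
out = self_attention(x)
out_perm = self_attention(x[perm])

print(torch.allclose(out, out_perm))        # False: the outputs are not identical...
print(torch.allclose(out[perm], out_perm))  # True: ...they are the same rows, permuted
```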
Run the code and look at the values!
I'm effectively a complete layman in this (although I do see some parallels to physical positional encoders, which is interesting), so at first read this entire thing went WAAAAY over my head. At first glance it seemed way overcomplicated just to encode position, so I figured I was missing something. ChatGPT was super helpful in explaining spiking neural networks to me, so I spent 20 minutes asking ChatGPT to explain this to me, and I feel like I actually learned something.
Then at the end I asked ChatGPT how this all relates to how it operates and it was interesting to see things like:
>Tokens as Subword Units: I use a tokenization method called Byte Pair Encoding (BPE), which breaks text into subword units.
I don't know if it's accurate or not, but it's wild seeing it talk about how it works.
The context includes that "it" is ChatGPT. The fact that ChatGPT uses Byte Pair Encoding is widely published. It is to be expected that an LLM can regurgitate this kind of information; nothing wild about that.
Note if you don't have a good system prompt, other LLMs will also tell you they're ChatGPT or Claude.
100% accurate
How about context encoding more generally? Are there techniques to do that? I.e., during training, I want the string "Dubito ergo cogito, cogito ergo sum, sum ergo Deus est." to have René Descartes embedded as its main author, 1637 as its year of writing, and "Discours de la méthode" as the global context of writing.
So that when trained on another part of the same book, the model can learn they came from the same context.
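One widely used approach (not something mentioned in the thread, just an illustration) is to prepend the metadata as a plain-text, control-code-style prefix to every training chunk from the same source, so chunks that share a context also share a prefix. The field names below are made up.

```python
def add_context_prefix(chunk: str, author: str, year: int, work: str) -> str:
    # Hypothetical helper: encode document-level metadata as a textual prefix
    # so the model sees it alongside every chunk from the same source.
    prefix = f"[author: {author}] [year: {year}] [work: {work}]\n"
    return prefix + chunk

example = add_context_prefix(
    "Dubito ergo cogito, cogito ergo sum, sum ergo Deus est.",
    author="René Descartes",
    year=1637,
    work="Discours de la méthode",
)
print(example)
```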
Maybe someone could answer this for me: it seems like encoding the positional embeddings as augmentations to the "natural" activations, instead of as their own inputs (concatenated onto the activations), makes things like sliding a window much harder... I guess obviously the drawback is that you have somewhat less textually derived information.
I recall an early transformers video where they tried both, and it turned out that adding the position onto the existing vectors was no worse, so they went with it... No further discussion of the motivations happened in that video.
Is it maybe worth revisiting that now that activations have a gobsmackingly large dimension?
They are not concatenated, but summed. I think concatenation wouldn’t work, as you indicate.
I think you mean the line in the original paper where they say they compared learned positional embeddings with the predefined sinusoidal encoding, and it made no difference.
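For concreteness, a minimal sketch of the summed scheme from "Attention Is All You Need" (standard sinusoidal encoding, written from memory rather than taken from the article): the encoding has the same dimensionality as the token embedding and is simply added to it, so the activation width doesn't grow.

```python
import torch

def sinusoidal_encoding(seq_len, d_model, base=10000.0):
    # PE[pos, 2i]   = sin(pos / base^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / base^(2i / d_model))
    pos = torch.arange(seq_len).float()[:, None]        # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()[None, :]    # (1, d_model/2)
    angles = pos / base ** (i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

seq_len, d_model = 10, 512
token_embeddings = torch.randn(seq_len, d_model)

# Summed, not concatenated: the result keeps the same shape as the embeddings.
x = token_embeddings + sinusoidal_encoding(seq_len, d_model)
print(x.shape)  # torch.Size([10, 512])
```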
> I think concatenation wouldn’t work, as you indicate.
Why do you say that?
Does anyone know why 2D RoPE implementations apply two separate 1D rotations to pairs, instead of applying a 2D rotation to triplets?
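For reference, a rough sketch of the pairwise scheme the question refers to (my own toy code, assuming the usual axial setup for image patches): the channel dimension is split in half, one half is rotated by the row position and the other half by the column position, each with ordinary 1D RoPE on pairs of channels.

```python
import torch

def rope_1d(x, positions, base=10000.0):
    # Ordinary 1D RoPE: rotate consecutive channel pairs by position-dependent angles.
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    # Axial 2D RoPE: half of the channels see the row coordinate,
    # the other half see the column coordinate.
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows),
                      rope_1d(x[..., half:], cols)], dim=-1)

# Example: 4 image patches on a 2x2 grid, 16 channels each.
x = torch.randn(4, 16)
rows = torch.tensor([0., 0., 1., 1.])
cols = torch.tensor([0., 1., 0., 1.])
print(rope_2d(x, rows, cols).shape)  # torch.Size([4, 16])
```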
The binary coding example would have been much better with Gray codes.
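In case it helps anyone, a tiny illustration of what that suggestion buys you (the standard binary-reflected Gray code, not anything from the article): consecutive integers differ in exactly one bit under Gray coding, whereas plain binary can flip many bits at once.

```python
def gray(n: int) -> int:
    # Binary-reflected Gray code: consecutive values differ in exactly one bit.
    return n ^ (n >> 1)

for n in range(8):
    print(f"{n}: binary={n:03b}  gray={gray(n):03b}")

# binary 3 -> 4 flips all three bits (011 -> 100),
# while gray(3) -> gray(4) flips only one (010 -> 110).
```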
Similarly, "you" could have designed state of the art LLM sampling: https://openreview.net/forum?id=FBkpCyujtS&referrer=%5BTasks...