I'm wondering if it would make sense to use an H.264/5/6/AV1 encoder as the tokenizer, and then find some set of embeddings that correspond to the data in the resulting bitstream. The tokenization they're doing is morally equivalent to what video codecs already do.
This was already done in JPEG-LM [0] and it did work.
[0] https://arxiv.org/abs/2408.08459
Cryptomnesia!
Interestingly, they managed to train and run inference on the JPEG bitstream directly. I thought they'd need to at least build embeddings for those bitstream features or something.
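For what "directly on the bitstream" could look like in practice, here is a minimal sketch: treat the raw JPEG bytes as a 256-symbol vocabulary and give each byte value a learned embedding. The layer sizes and file name are illustrative assumptions of mine, not the JPEG-LM paper's actual setup.

```python
# Sketch: byte-level tokenization of a JPEG bitstream (assumed setup, not JPEG-LM's exact recipe).
import torch
import torch.nn as nn

def jpeg_to_tokens(path: str) -> torch.Tensor:
    """Read the compressed file and use each byte as a token id in [0, 255]."""
    with open(path, "rb") as f:
        data = f.read()
    return torch.tensor(list(data), dtype=torch.long)

# Each of the 256 possible byte values gets its own embedding; the LM then
# models the byte sequence autoregressively, just like ordinary text tokens.
vocab_size, d_model = 256, 512          # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)

tokens = jpeg_to_tokens("example.jpg")  # shape: (num_bytes,)
x = embed(tokens)                       # shape: (num_bytes, d_model)
```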
Would event camera input data be useful here?
https://en.wikipedia.org/wiki/Event_camera
“Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.”
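For intuition, an event stream can be roughly emulated offline by thresholding per-pixel log-brightness changes between consecutive frames. The log transform and the threshold below are illustrative choices of mine, not any camera vendor's spec; real event cameras do this asynchronously in hardware.

```python
# Sketch: emulating event-camera output from two consecutive grayscale frames.
import numpy as np

def frames_to_events(prev: np.ndarray, curr: np.ndarray, threshold: float = 0.15):
    """Return (y, x, polarity) for pixels whose log-brightness changed enough."""
    eps = 1e-3
    delta = np.log(curr.astype(np.float32) + eps) - np.log(prev.astype(np.float32) + eps)
    ys, xs = np.nonzero(np.abs(delta) > threshold)
    polarity = np.sign(delta[ys, xs]).astype(np.int8)  # +1 brighter, -1 darker
    return ys, xs, polarity
```

A static background produces no events at all, which is the "staying silent otherwise" property in the quote.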
Interestingly, biological vision for reptiles (and probably other species) works largely on the same principle. It tends to filter out static background.
Most people believe this because it is said twice in the Jurassic Park movie (the idea being taken from the book), but it is not true. It is somewhat true for amphibians with very simple visual systems and limited hunting strategies, like certain frogs, which would at least be an in-universe explanation for Jurassic Park's haphazardly cloned dinos. But in the movie Dr. Grant claims it for the first time before even learning of the existence of the park, so they don't get any points for that. In reality, T-Rex, for example, is believed to have had incredibly good vision - much better than a human's:
https://bioone.org/journals/journal-of-vertebrate-paleontolo...
Isn't this like Differential Transformers, which also work based on differences?
As far as I can tell, though the core idea is the same (focus on the differences), the implementation is different. The Differential Transformer 'calculates attention scores as the difference between two separate softmax attention maps', so it must still process the redundant areas. This approach removes them altogether, which would significantly reduce compute. Very neat idea.
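To make the contrast concrete, here is a rough sketch of the differential-attention idea as that quote describes it: two softmax maps, subtracted. The single-head form and the fixed lambda are simplifications of mine, not the paper's exact formulation. The point of the comparison stands: both maps are still computed over every token, whereas dropping unchanged patches removes them from the sequence entirely.

```python
# Sketch: attention as the difference of two softmax maps (simplified, single head).
import torch
import torch.nn.functional as F

def diff_attention(q1, k1, q2, k2, v, lam: float = 0.5):
    """q*/k*: (seq, d), v: (seq, d_v). Every token still attends over the full
    sequence -- redundant regions are down-weighted, not skipped."""
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.T / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.T / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v

seq, d = 16, 32
q1, k1, q2, k2, v = (torch.randn(seq, d) for _ in range(5))
out = diff_attention(q1, k1, q2, k2, v)   # shape: (16, 32)
```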
However, I do think that background information can sometimes be important. I reckon a mild improvement on this model would be to leave the background in the first frame, and perhaps every x frames, so that the model gets better context cues. This would also more accurately replicate video compression.
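A small sketch of that keyframe-style suggestion, assuming a simple patch-difference test: emit every patch on a periodic keyframe and only the changed patches in between. The interval, patch size, and threshold are arbitrary values for illustration, not anything from the paper.

```python
# Sketch: keyframe-style token scheduling -- keep the full frame every N frames
# and only the changed patches in between.
import numpy as np

KEYFRAME_INTERVAL = 30   # illustrative value for "every x frames"

def patches_to_tokenize(frame_idx: int, curr: np.ndarray, prev: np.ndarray,
                        patch: int = 16, threshold: float = 4.0):
    """Return the (row, col) coordinates of patches worth tokenizing."""
    h, w = curr.shape[:2]
    coords = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    if frame_idx % KEYFRAME_INTERVAL == 0:
        return coords                      # keyframe: keep every patch, background included
    changed = []
    for r, c in coords:
        diff = np.abs(curr[r:r+patch, c:c+patch].astype(np.float32)
                      - prev[r:r+patch, c:c+patch].astype(np.float32))
        if diff.mean() > threshold:
            changed.append((r, c))
    return changed                         # delta frame: only patches that changed
```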
Actually, I was misled by the video example. They do actually keep the background information: they use a temporal encoding so that the information is propagated through. Very interesting and well thought out.
That was my feeling too for the most part, but the run length is a significant source of information, and if it enables tokens to be skipped, it is essentially gaining performance by working with a smaller but denser form of the same information. My instinct is that run-length would be just the most basic case of a more generalized method for storing token information that encompasses both time and area, so that the density of information across tokens is more even: the area and duration would be variable, but the token stream would contain a series of tokens carrying similar quantities of semantic data.
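A minimal sketch of the basic case being described: collapse runs of identical tokens into (token, run_length) pairs so that each element of the stream carries roughly one "new" piece of information. The generalized version imagined above would replace the scalar run length with a variable extent over both area and duration.

```python
# Sketch: run-length coding of a token stream, the "most basic case" described above.
from itertools import groupby

def run_length_encode(tokens):
    """[7, 7, 7, 3, 3, 9] -> [(7, 3), (3, 2), (9, 1)]"""
    return [(tok, sum(1 for _ in group)) for tok, group in groupby(tokens)]

def run_length_decode(pairs):
    return [tok for tok, count in pairs for _ in range(count)]

stream = [7, 7, 7, 3, 3, 9]
encoded = run_length_encode(stream)       # [(7, 3), (3, 2), (9, 1)]
assert run_length_decode(encoded) == stream
```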
I feel like this is very much like the early days of data compression, where a few logical but somewhat ad-hoc principles are being investigated in advance of a more sophisticated theory that integrates what is being attempted, how to identify success, and how to recognize pathways toward the optimal solution.
These papers are the foundations of that work.
For training, would it be useful to stabilize the footage first?
Stabilization appears to be a subset of a (literally) wider, and more rewarding, challenge: reconstructing the whole area scanned by the camera. It could be better to work on that challenge rather than on simple stabilization.
That's similar to how the human visual system 'paints' a coherent scene from quite a narrow field of high-resolution view, filling the rest in with educated guesses and assumptions.
https://vidpanos.github.io/
There are other recent ones that synthesize a new camera from an arbitrary vantage point, not just the rotation+FOV changes of the above. But they still might want stabilized video as the baseline input if they don't already use it.
Besides saccades and tracking, your eyes also do a lot of stabilization, even counter-rotating on the roll axis as you lean your head to the side. I'm not sure if they roll when tracking a subject that rolls; I'd guess that's not common enough to need to be a thing.
Thanks - that link is very interesting. You can see some distortion and 'hallucination', which would be a risk with my suggestion. Their video output is great work, but the far end of the fence on the right-hand side glitches and vanishes at about the 4-5 second mark, for instance.
I guess yes. Having worked on video processing, it's always better if you can stabilize, because it significantly reduces the number of unique tokens, which would be even more useful for the present method. However, you probably lose some generalization performance, and not all videos can be stabilized.
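As a rough illustration of why stabilization cuts the token count: estimate the global camera motion, warp each frame onto the previous one, and only genuinely moving content survives the difference. The OpenCV feature-tracking approach below is one common way to do this, not necessarily what the paper (or any particular pipeline) uses.

```python
# Sketch: global stabilization via an estimated affine warp, so that only
# genuinely moving content shows up as changed patches/tokens afterwards.
import cv2
import numpy as np

def stabilize_to_previous(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Warp curr_gray onto prev_gray using sparse feature tracking."""
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                       qualityLevel=0.01, minDistance=20)
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None)
    good = status.ravel() == 1
    matrix, _ = cv2.estimateAffinePartial2D(pts_curr[good], pts_prev[good])
    h, w = curr_gray.shape
    return cv2.warpAffine(curr_gray, matrix, (w, h))

# After warping, |stabilized - prev_gray| is small everywhere except where
# objects actually moved, which is what reduces the number of unique tokens.
```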
What would be the applications of this that are different from regular transformers? Perhaps a stupid question.