We found the same result a few years ago in our ICLR paper: https://arxiv.org/pdf/2209.14500
We found Google's T5 models, which were released in 2019 (pre-GPT-3), were "secretly" capable of in-context learning with a simple inference technique.
Given they use a bidirectional MLM (Masked Language Modeling) objective, it wasn't obvious how to do it, but MLM objectives are known to produce better language representations than causal (next-token prediction) objectives. We were able to outperform much larger GPT-3 models, or get very close to their performance, with far smaller T5 models.
Are there any intrinsic dis/advantages of bidirectional models over causal models for in-context learning? It seems that unidirectional models have just been explored and worked on more.
When you train bidirectionally only, you don't get a generative model; that would be the downside. However, you can train on a mixture of causal and bidirectional objectives, as some LLM pre-training has done. As far as I am aware, there are no downsides to that, but it is not more common simply because the standard practice has been to train causal-only, and there just isn't enough funding/attention to go into experimenting on every axis of pre-training (which can be very expensive).
No, you can generate with them using diffusion.
Yep. That technique works very well. Surprised that it’s not more widely used.
This is very interesting. Have you got any references describing this approach?
Isn't Q* (or Quiet-STaR) a causal and bidirectional objective learning system?
From that paper it seems the sampling method (SAP) is also slower, so the fact that it beats larger models seems expected.
It's not at all expected. T5 models are not generative models by default, and they were not thought to be able to perform generation, let alone in-context learning. Remember, these models were released before any of the existing LLMs, and in-context learning/prompting as a technique was popularized with GPT-3.
While the technique requires multiple samples to coax generations from this particular model, other LLM training schemes have since incorporated both unidirectional and bidirectional objectives. However, this exploration hasn't been fully resolved, as most models are still trained only on the causal objective by standard practice. There's still a lot of exploration that can be done on pre-training objectives.
You are right, but it's a little misleading (as, it sounds like, is the usefulness of your work nowadays). Comparing the language modelling prowess of BERT/T5 against the default, non-instruct GPT-3 or OPT isn't really that useful if done by size, because in practice we don't use 1.3B generative models, and more importantly because focusing on default decoding without an instruct/PPO step is not how these models are used in practice. The instruct models blow this performance out of the water, and instruct tuning plus better performance at size for GPT models completely shows the dominance of decoder-only architectures, in my opinion, for now.
I think you have to consider that in 2020/2021 many PhDs and professors attempted to shift grant-funded research with BERT and T5 to explore how they could compete with GPT-3, or to demonstrate other properties in which they supposedly outdid GPT-3. Very few (besides sentence transformers) succeeded. It's not like this is an unexplored niche. A lot of people in denial were trying to keep on with BERT research for a while despite the fact that their work had essentially been made obsolete by GPT-3.
(And notably, Table 1 and Figure 4 cherry-pick the smallest size with the largest gaps in task performance, a size at which we know decoder-only generation is not performant (the 1.3B parameter mark). The conclusions the authors draw ("wow, BERT is trained on less data but does better!") obviously can't be made at larger sizes, because the actual GPT models become much larger.)
The "embarrassingly simple inference technique" is to put a bunch of [MASK] tokens at the end of the prompt.
I'm having trouble understanding whether this paper is saying anything new. The original BERT paper already compared it favourably to causal models including GPT. Was there any doubt that BERT-style models could be in-context learners?
From what I gather as a non-expert, the problem with BERT is scaling/training efficiency: GPT gets C-1 training examples out of a training input of length C, but BERT only gets 0.15*C examples. Indeed, the author points out that DeBERTa required 3x more compute than GPT-3 to achieve the level of performance reported, which makes sense.
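For a concrete sense of that ratio, here is a quick back-of-the-envelope sketch; the 512-token context length is an assumption, and these are not figures from the paper:

```python
# Back-of-the-envelope: supervised prediction targets per training sequence.
# Assumes a 512-token context (an assumption) and BERT's standard 15% masking rate.
C = 512
clm_targets = C - 1            # causal LM: every next-token position is a target
mlm_targets = round(0.15 * C)  # masked LM: only the masked positions are targets

print(clm_targets, mlm_targets, clm_targets / mlm_targets)
# 511 targets vs ~77 targets, i.e. roughly 6-7x fewer per sequence for the MLM
```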
As someone who has very limited understanding but tried to use BERT for classification: is BERT still relevant when compared to LLMs? Asking because I hardly see any mention of BERTs anymore.
Yes, they are still used
- Encoder-based models have much faster inference (are auto-regressive) and are smaller. They are great for applications where speed and efficiency are key.
- Most embedding models are BERT-based (see the MTEB leaderboard), so they are widely used for retrieval (a small sketch follows below).
- They are also used to filter data for pre-training decoder models. The Llama 3 authors used a quality classifier (DistilRoBERTa) to generate quality scores for documents. Something similar is done for FineWeb-Edu.
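For the retrieval/embedding bullet, a minimal sketch with the sentence-transformers library; the checkpoint is just one small, widely used BERT-family encoder, not a specific recommendation:

```python
# Sketch: BERT-family embeddings for retrieval via sentence-transformers.
# The model name is an assumption; any encoder from the MTEB leaderboard
# is used the same way.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "BERT-style encoders are widely used for embeddings.",
    "Llama 3 used a quality classifier to filter pre-training data.",
    "Escalator used to be a trademark of the Otis company.",
]
doc_emb = model.encode(docs, convert_to_tensor=True)

query_emb = model.encode("Which models are used for retrieval?", convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)[0]   # cosine similarity to each document
best = int(scores.argmax())
print(docs[best], float(scores[best]))
```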
Wait, I thought GPTs were autoregressive and encoder-only models like BERT used masked tokens? You're saying BERT is auto-regressive, or am I misunderstanding?
You're right. Encoder only models like BERT aren't auto-regressive and are trained with the MLM objective. Decoder only (GPT) and encoder-decoder (T5) models are auto-regressive and are trained with the CLM and sometimes the PrefixLM objectives.
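To make the objective distinction concrete, here is a toy sketch of the three attention patterns; it is purely illustrative (real implementations build these masks inside the model):

```python
# Toy illustration of attention visibility for a 6-token sequence.
# Entry (i, j) = 1 means position i can attend to position j.
import torch

T = 6
causal = torch.tril(torch.ones(T, T))     # CLM (GPT): each token sees only itself and the past
bidirectional = torch.ones(T, T)          # MLM (BERT): every token sees the full sequence
prefix_len = 3                            # PrefixLM: bidirectional over the prefix,
prefix_lm = torch.tril(torch.ones(T, T))  #   causal over the continuation
prefix_lm[:prefix_len, :prefix_len] = 1

for name, mask in [("causal", causal), ("bidirectional", bidirectional), ("prefix_lm", prefix_lm)]:
    print(name)
    print(mask.int())
```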
You can mask out the tokens at the end, so it's technically autoregressive.
They're still very useful on their own. But even more broadly, you can often use them in tandem with LLMs. A good example could be a classifier that's used as a "router" of sorts; could be for selecting a prompt template, directing to a specific model, or loading a LoRA or soft prompt vector to be used at inference-time.
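A rough sketch of that router pattern; the model name, labels, and templates here are hypothetical placeholders (in practice you would fine-tune a small BERT classifier on your own routing data):

```python
# Sketch: a small encoder classifier used as a router in front of an LLM.
# "my-org/bert-intent-router" and the templates are hypothetical placeholders.
from transformers import pipeline

router = pipeline("text-classification", model="my-org/bert-intent-router")

PROMPT_TEMPLATES = {
    "code_question": "You are a coding assistant. Answer concisely:\n{query}",
    "summarization": "Summarize the following text:\n{query}",
    "chitchat": "{query}",
}

def build_prompt(query: str) -> str:
    route = router(query)[0]["label"]              # e.g. "summarization"
    template = PROMPT_TEMPLATES.get(route, "{query}")
    return template.format(query=query)

print(build_prompt("Can you shorten this article for me?"))
```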
For many specialized tasks you can run BERTs (and simpler models in general) at scale, with lower latency, at lesser cost, with similar or even better results.
Depends what you’re trying to do. I’m writing a personal assistant app (speech to text) and want to classify the user input according to the current actions I support (or don’t). The flagship LLMs are pretty great at it if you include the classes in the prompt and they will spit out structured output every time. But, man, they are expensive and there’s the privacy aspect I’d prefer to adhere to. I’ve only got 24 GB of RAM, so I can’t run too many fancy local models and things like llama3.1:8b don’t classify very well.
So I’m trying BERT models out :)
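One way to try that locally without fine-tuning is zero-shot classification over the supported actions. A minimal sketch; the NLI checkpoint is just one common choice, and the action labels are made up:

```python
# Sketch: zero-shot intent classification for a local assistant.
# The checkpoint and the action labels are placeholders, not recommendations.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

ACTIONS = ["set a timer", "play music", "check the weather", "unsupported"]

def classify(utterance: str) -> str:
    result = classifier(utterance, candidate_labels=ACTIONS)
    return result["labels"][0]   # highest-scoring action

print(classify("put on some jazz for me"))   # likely "play music"
```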
Try some of the Qwen models. They have some that are slightly larger than 8B that will fit on your 24 GB quite nicely. They have been amazing so far.
My understanding is that BERT can still outperform LLMs for sentiment classification?
To my understanding yes. But I never found a good use-case for sentiment classification.
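For reference, a short sketch of the kind of BERT-family sentiment classifier people usually mean; the default checkpoint for this pipeline is a DistilBERT fine-tuned on SST-2:

```python
# Sketch: off-the-shelf sentiment classification with a BERT-family model.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # defaults to a DistilBERT SST-2 checkpoint
print(sentiment("The new update is fantastic, everything feels faster."))
# [{'label': 'POSITIVE', 'score': ...}]
```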
It seems to be used by YouTube for comment censoring / shadow-banning.
That might make sense.
I used sentiment analysis a few times in recommender systems (for digital media consumption.)
Also for analyzing Trump's tweets (from 2016): https://mathematicaforprediction.wordpress.com/2016/11/21/te...
Really cool investigation. Would you mind sharing the BERT specific sentiment detection?
Safety
They’ve drowned in the LLM noise, but they’re definitely still relevant.
- Generative model outputs are not always desirable, and often even undesirable
- BERT models are smaller and can run with lower latency and serve larger batches with lower VRAM requirements
- BERT models have bidirectional attention, which can improve performance in many applications
LLMs are “cheap” in the sense that they work well generically, without requiring fine tuning. Where they overlap with BERT models is mostly that they may work better in low training data environments due to better generalization capabilities.
But mostly companies like them because they don’t “require” ML engineers or data scientists on staff. For the lack of care given to evaluation that I see around LLM apps, I suspect that’s going to prove to be a faulty premise.
> - BERT models are smaller and can run with lower latency and serve larger batches with lower VRAM requirements
The most recent version of Wolfram Language (aka Mathematica) uses BERT models by default for embeddings.
(Say, for this function: https://reference.wolfram.com/language/ref/CreateSemanticSea... .)
I think BERTs are still used for things like embeddings (at least the best generic embedding model in my language is BERT-based), when you want a model that has basic semantics but needs to be fast.
Still relevant in DNA space! https://arxiv.org/abs/2402.08777
BERT is favored for generating embeddings and for bag-of-words strategies for clustering.
Do we know whether the current SOTA foundation models (Gemini, GPT-4o, Claude, etc.) are actually all GPT-based (as in, causal models)?
I feel a bit bad bringing this up, but should Gemini actually be considered SOTA?
They make impressive demos, but I can't recall any of their released models being at the top of any leaderboard.
EDIT: Sorry, looking into it a bit more now, they still seem to be at the top in terms of context window, so they got that going for them.
Leaderboards are misleading. Try diff models for YOUR task and you’ll see a wide variety of outputs compared to “official” rankings.
Ok, maybe I haven't experimented enough; so for which tasks is Gemini the SOTA?
GPT-based isn’t really a thing outside of OpenAI (it’s just the commercial name for their models)
But I believe we’re confident that all major models are causal transformer models right now.
No reason to believe otherwise. If one of them was doing something different, they’d let us know in order to stand out.
No, they didn't get to co-opt that word.
They literally invented the term GPT…
“Transformer” is the name of the algorithm behind popular LLMs.
GPT is the name that OpenAI gave to their models early on.
What does having invented a term have to do with this?
The Otis company invented the term escalator, and even had a trademark on it for a while, but does it mean that you'd only call one an escalator if it was made by them?
That's literally what the trademark means. At some point things become so dominant and generic a trademark is no longer successfully enforceable and you get escalators, bandaids, linoleum, taser, gasoline, etc.
What a small part of my original point that you choose to focus on… and to be so wrong about…
Why try arguing this?