I find it very interesting that “aligning with human desires” somehow includes preventing a human from bypassing the safeguards to generate “objectionable” content (whatever that is). I think the “safeguards” are a bigger problem for aligning with my desires.
Another question is whether that initial unalignment comes from poor filtering of datasets, or whether it emerges from regular, pre-filtered cultural texts.
In other words, was an “unaligned” LLM taught bad things by bad people, or does it simply see them naturally and point them out with the purity of a child? The latter would say something about ourselves. Personally, I think people tend to selectively ignore things too much.
We can't avoid teaching bad things to an LLM if we want it to have useful knowledge. For example, you may teach an LLM about Nazis; that's expected knowledge. But then you can prompt the LLM to be a Nazi. You can teach it how to avoid poisoning yourself, but then you've taught it how to poison people. And the smarter the model is, the better it will be at extracting bad things from good things by negation.
There are actually training datasets full of bad things by bad people; the intention is to use them negatively, so as to teach the LLM some morality.
Maybe we should just avoid trying to classify things as good or bad.
But I have no idea why someone might want an LLM to act like a Nazi. People read Mein Kampf to study the psychology of a madman, and so on.
We’ve seen where that ends up. https://en.m.wikipedia.org/wiki/Tay_(chatbot)
What tools do we have to defend against LLM lockdown attacks?
The safeguards stem from a desire to make tools like Claude accessible to a very wide audience, as use cases such as education are very important.
And so it seems like people such as yourself who do have an issue with safeguards should seek out LLMs catered to adult audiences rather than trying to remove safeguards entirely.
Here is a revolutionary concept: give the users a toggle.
Make it controllable by an IT department if logging in with an organisation-tied account, but give people a choice.
Not sure if you understand how LLMs work.
But the guardrails are intrinsic to the model itself. You can't just have a toggle.
Yes, you very much can. One very simple way to do so is to have two variants deployed: the censored one, and the uncensored one. The switch simply changes between which of the two you are using. You have to juggle two variants now across your inference infrastructure, but I expect OpenAI to be able to deal with this already due to A/B testing requirements. And it's not like these companies don't have internal-only uncensored versions of these models for red teaming etc, so you aren't spending money building something new.
It should be possible to do with just one variant also, I think. The chat tuning pipeline could teach the model to censor itself if a given special token is present in the system message. The toggle changes between including that special token in the underlying system prompt of that chat session, or not. No idea if that's reliable or not, but in principle I don't see a reason why it shouldn't work.
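To make the serving-side part concrete, here is a minimal sketch of both options. The model names, the <safe_mode> control token, and the request shape are hypothetical placeholders of mine, not any vendor's actual API; the real work would be in the tuning pipeline, not in this routing code.

```python
# Hypothetical sketch: route a per-user "safeguards" toggle either to one of two
# deployed variants, or to a control token in the system prompt of a single variant.
# Model ids and the <safe_mode> token below are made up for illustration.

SAFE_MODE_TOKEN = "<safe_mode>"  # token the chat-tuning pipeline would condition on

def build_request(user_prompt: str, safeguards_on: bool, use_two_variants: bool = True) -> dict:
    if use_two_variants:
        # Option 1: two deployments; the toggle picks which one serves the request.
        model = "assistant-censored" if safeguards_on else "assistant-uncensored"
        system = "You are a helpful assistant."
    else:
        # Option 2: one deployment, tuned to self-censor only when the token is present.
        model = "assistant-base"
        prefix = SAFE_MODE_TOKEN + " " if safeguards_on else ""
        system = prefix + "You are a helpful assistant."
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
    }
```

Whether the single-variant version actually holds up when users try to inject or strip the token is exactly the "no idea if that's reliable" part.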
If you are making an LLM for children, I have no problem with that! I’m not sure kids being completely removed from the adult world until suddenly being dumped into it is a great way to build an integrated society, but sure, you do you. Build your LLM with safeguards for educational use, best of luck to you!
I do not think it should be the default. I do not think that “adults” wanting “adult things” like some ideas on how to secure a computer system against social engineering should have to seek out some detuned “jailbroken” lower-quality model.
And I don’t think that assuming everyone is a child aligns with “human desires”, or should be couched in that language.
How does making it harder for users to extract the information they are looking for make it safer for a wider audience?
That's like asking why we should have porn filters on school computers; after all, all they do is prevent the user from finding what they are looking for, which is bad.
Assuming this question is in good faith...
There are numerous things that might be true but that may be damaging for a child's development to be exposed to: from overly punitive criticism, to graphic depictions of violence, to advocacy of and specific directions for self-harm. Countless examples are trivial to generate.
Similarly, the use of these tools is already having dramatic effects on spearphishing, misinformation, etc. Guardrails on all the non-open-source models have an enormous impact on slowing and limiting the damage this does at scale. Even with retrained Llama-based models, it's more difficult than you might imagine to create a truly Machiavellian or uncensored LLM, which is entirely due to the work that's been done during and after training to constrain those behaviours. This is an unalloyed good in constraining the weaponisation of LLMs.
It concerns me that these defensive techniques themselves often require even more LLM inference calls.
I just skimmed the GitHub repo for this one, and the README mentions four additional LLM inferences for each incoming request - so now we've 5x'ed the (already expensive) compute required to answer a query?
So basically this just adds random characters to input prompts to break jailbreaking attempts? IMHO, if you can't make a single-inference solution, you may as well just run a couple of output filters, no? That appeared to have reasonable results, and if you make such filtering more domain-specific, you'll probably make it even better. Intuition says there's no "general solution" to jailbreaking, so maybe it's a lost cause and we need to build up layers of obscurity, of which smooth-llm is just one part.
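For reference, my reading of the scheme is slightly more than "add random characters": it perturbs several copies of the prompt, queries the model on each, and majority-votes on whether the responses look like a jailbreak. A rough sketch of that idea follows; the query_model callback and the keyword-based refusal check are my own placeholders, not the repo's actual code.

```python
import random
import string

REFUSAL_MARKERS = ["i'm sorry", "i cannot", "as an ai"]  # crude stand-in for a real judge

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly replace roughly a fraction q of the prompt's characters."""
    chars = list(prompt)
    if not chars:
        return prompt
    n_swap = max(1, int(len(chars) * q))
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def looks_like_refusal(response: str) -> bool:
    """Very rough proxy for 'the model refused / the jailbreak failed'."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def smooth_query(prompt: str, query_model, n_copies: int = 5, q: float = 0.1) -> str:
    """Query the model on n_copies perturbed prompts and follow the majority verdict."""
    responses = [query_model(perturb(prompt, q)) for _ in range(n_copies)]
    refused = [r for r in responses if looks_like_refusal(r)]
    answered = [r for r in responses if not looks_like_refusal(r)]
    # Return a response consistent with whichever verdict the majority reached.
    return random.choice(refused) if len(refused) > len(answered) else random.choice(answered)
```

With n_copies = 5 you can see where the roughly 5x inference cost mentioned upthread comes from: every incoming request fans out into several perturbed queries.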
Right. This seems to be the latest in the “throw random stuff at the wall and see what sticks” series of generative AI papers.
I don’t know if I’m too stupid to understand, or if this truly is just “add random stuff to the prompt” dressed up in flowery academic language.
Not surprising - from what I can tell, machine learning has been going down this route for a decade.
Anything involving the higher-level abstractions (TensorFlow / Keras / whatever) is full of handwavy stuff about this or that activation function / number of layers / model architecture working best, and doing trial and error with a different component if it doesn't. Closer to kids playing with Legos than statistics.
I’ve actually noticed this in other areas too. Tons of papers just swap parts out of existing work, maybe add a novel idea or two, and then, boom, new proposed technique, new paper. I remember first noticing it after learning to parse the academic nomenclature for a particular subject I was into at the time (SLAM) and feeling ripped off, but hey, if you catch up with a subject it’s a good reading shortcut and helps you zoom in on new ideas.
There are some authors in common with a more recent paper "Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing" https://arxiv.org/abs/2402.16192
Github: https://github.com/arobey1/smooth-llm