Instead of learning the latest workarounds for the kinks and quirks of a beta AI product, I'm going to wait 3 weeks for the advice to become completely obsolete.
What people are discovering with the latest models is that often their errors are due to entirely reasonable choices and assumptions... which happen to be wrong in your specific case. They call a library you don't have installed, or something like that. Short of inventing either telepathy or spice that lets LLMs see the future, it will increasingly be the case that you cannot use the best models efficiently without giving them extensive context. Writing 'reports' where you dump in everything even tangentially relevant is the obvious way to do so, and so I would expect future LLMs to be even more context-hungry than o1-preview/pro.
There was a debate over whether to integrate Stable Diffusion into the curriculum in a local art school here.
Personally, while I consider AI a useful tool, I think it's quite pointless to teach it in school, because whatever you learn will be obsolete next month.
Of course some people might argue that the whole art school (it's already quite a "job-seeking" type of program, mostly digital painting/Adobe After Effects) will be obsolete anyway...
To be fair, the article basically says "ask the LLM for what you want in detail"
The churn is real. I wonder if so much churn due to innovation in a space can prevent enough adoption such that it actually reduces innovation
Great summary of how AI compresses the development (and hype) product cycle
Modern AI both shortens the useful lifespan of software and increases the importance of development speed. Waiting around doesn’t seem optimal right now.
The reality is that o1 is a step away from general intelligence and back towards narrow AI. It is great for solving the kinds of math, coding and logic puzzles it has been designed for, but for many kinds of tasks, including chat and creative writing, it is actually worse than 4o. It is good at the specific kinds of reasoning tasks that it was built for, much like AlphaGo is great at playing Go, but that does not actually mean it is more generally intelligent.
LLMs will not give us "artificial general intelligence", whatever that means.
So-so general intelligence is a lot harder to sell than narrow competence.
This is kind of true. I feel like the reasoning power of o1 is really only truly available on the kinds of math/coding tasks it was trained on so heavily.
Yes, I don't understand their ridiculous AGI hype. I get it, you need to raise a lot of money.
We need to crack the code for updating the base model on the fly or daily / weekly. Where is the regular learning by doing?
Not over the course of a year, spending untold billions to do it.
Which sounds like... a very good thing?
I have a lot of luck using 4o to build and iterate on context and then carry that into o1. I’ll ask 4o to break down concepts, make outlines, identify missing information and think of more angles and options. Then at the end, switch on o1 which can use all that context.
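A minimal sketch of that hand-off through the API (a sketch only, assuming the standard openai Python SDK and API access to gpt-4o and o1; in ChatGPT you would just switch the model picker on the same thread):
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Stage 1: use gpt-4o to iteratively build context (concepts, outline, gaps, angles).
    messages = [{"role": "user", "content": "Break down the concepts involved in <my problem> and outline the options."}]
    for follow_up in ["What information is still missing from this outline?",
                      "List more angles and trade-offs I should consider."]:
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        messages.append({"role": "assistant", "content": reply.choices[0].message.content})
        messages.append({"role": "user", "content": follow_up})
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})

    # Stage 2: switch models, carrying the whole conversation into a single o1 call.
    messages.append({"role": "user", "content": "Using everything above, produce a complete, final answer."})
    final = client.chat.completions.create(model="o1", messages=messages)
    print(final.choices[0].message.content)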
People agreeing and disagreeing about the central thesis of the article, which is fine because I enjoy the discussion...
No matter where you stand in the specific o1/o3 discussion, the concept of "question entropy" is very enlightening.
What is the theoretically minimal question that still gets your problem solved adequately? Or, for a specific model, are its users capable of supplying the minimum intellectual complexity the model needs?
Would be interesting to quantify these two and see if our models are close to converging on certain task domains.
FWIW: OpenAI provides advice on how to prompt o1 (https://platform.openai.com/docs/guides/reasoning/advice-on-...). Their first bit of advice is to "Keep prompts simple and direct: The models excel at understanding and responding to brief, clear instructions without the need for extensive guidance."
The article links out to OpenAI's advice on prompting, but it also claims:
> OpenAI does publish advice on prompting o1, but we find it incomplete, and in a sense you can view this article as a "Missing Manual" to lived experience using o1 and o1 pro in practice.
To that end, the article does seem to contradict some of the advice OpenAI gives. E.g., the article recommends stuffing the model with as much context as possible... while OpenAI's docs note to include only the most relevant information to prevent the model from overcomplicating its response.
I haven't used o1 enough to have my own opinion.
I think there is a distinction between “instructions”, “guidance” and “knowledge/context”. I tend to provide o1 pro with a LOT of knowledge/context, a simple instruction, and no guidance. I think TFA is advocating same.
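As a rough, hypothetical illustration of that shape (the file names and task are invented), the prompt ends up being almost entirely context, closed by one short instruction and no step-by-step guidance:
    # Hypothetical: lots of knowledge/context, one simple instruction, no guidance.
    context_files = ["schema.sql", "handlers.py", "incident_report.md"]  # made-up file names
    context = "\n\n".join(f"=== {name} ===\n{open(name).read()}" for name in context_files)
    prompt = context + "\n\nInstruction: find the root cause of the timeout described in the incident report and propose a fix."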
So in a sense, being an early adopter for the previous models makes you worse at this one?
The advice is wrong
But the way they did their PR for o1 made it sound like it was the next step, while in reality it was a side step: a branch off the current path towards AGI.
coauthor/editor here!
we recorded a followup conversation after the surprise popularity of this article breaking down some more thoughts and behind the scenes: https://youtu.be/NkHcSpOOC60?si=3KvtpyMYpdIafK3U
Thanks for sharing this video, swyx. I learned a lot from listening to it. I hadn’t considered checking prompts for a project into source control. This video has also changed my approach to prompting in the future.
I made a tool for manually collecting context. I use it when copying and pasting multiple files is cumbersome: https://pypi.org/project/ggrab/
I created thisismy.franzai.com for the same reason.
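The core of this kind of tool is tiny; here is a rough sketch of the general idea (not ggrab's or thisismy's actual interface), just concatenating files into one labelled blob you can paste in as context:
    #!/usr/bin/env python3
    # Sketch only: bundle files into one labelled context blob.
    # Usage: python bundle.py src/*.py docs/spec.md > context.txt
    import sys
    from pathlib import Path

    def bundle(paths):
        return "\n\n".join(f"===== {p} =====\n{Path(p).read_text()}" for p in paths)

    if __name__ == "__main__":
        print(bundle(sys.argv[1:]))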
The buggy nature of o1 in ChatGPT is what most prevents me from using it.
Waiting is one thing, but waiting to return to a prompt that never completes is frustrating. It’s the same frustration you get from a long running ‘make/npm/brew/pip’ command that errors out right as it’s about to finish.
One pattern that’s been effective (rough sketch below):
1. Use Claude Developer Prompt Generator to create a prompt for what I want.
2. Run the prompt on o1 pro mode
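Roughly, in code (a sketch only: the prompt generator lives in the Anthropic Console and o1 pro mode is ChatGPT-only, so this approximates the pattern with plain API calls, and the model names are assumptions):
    from anthropic import Anthropic
    from openai import OpenAI

    claude = Anthropic()   # ANTHROPIC_API_KEY in env
    oai = OpenAI()         # OPENAI_API_KEY in env

    task = "Design a migration plan from REST to gRPC for our payments service."  # made-up task

    # Step 1: have Claude expand the terse task into a detailed, structured prompt,
    # approximating what the Console prompt generator produces.
    meta = claude.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=2000,
        messages=[{"role": "user", "content": f"Write a detailed, well-structured prompt that asks a model to do the following, spelling out context, constraints, and desired output format:\n\n{task}"}],
    )
    generated_prompt = meta.content[0].text

    # Step 2: run the generated prompt on the reasoning model (o1 here, standing in for o1 pro mode).
    result = oai.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": generated_prompt}],
    )
    print(result.choices[0].message.content)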
o1 appears to not be able to see its own reasoning traces. Or its own context is potentially being summarized to deal with the cost of giving access to all those chain-of-thought traces and the chat history. This breaks the computational expressivity of chain of thought, which supports universal (general) reasoning if you have reliable access to the things you've thought, but degrades to a threshold-circuit (TC0) bounded parallel pattern matcher when you don't.
My understanding is that o1's chain-of-thought tokens are in its own internal embedding, and anything human-readable the UI shows you is a translation of these CoT tokens into natural language.
I'd love to see some examples of good and bad prompting of o1.
I'll admit I'm probably not using o1 well, but I'd learn best from examples.
This echoes my experience. I often use ChatGPT to help with D&D module design and I found that o1 did best when I told it exactly what I required, dumped in a large amount of info and did not expect to use it to iterate multiple times.
Work with chat bots like a junior dev, work with o1 like a senior dev.
Can you provide prompt/response pairs? I'd like to test how other models perform using the same technique.
One thing I'd like to experiment with is "prompt to service". I want to take an existing microservice of about 3-5kloc and see if I can write a prompt to get o1 to generate the entire service, proper structure, all files, all tests, compiles and passes etc. o1 certainly has the context window to do this at 200k input and 100k output - code is ~10 tokens per line of code, so you'd need like 100k input and 50k output tokens.
My approach would be:
- take an exemplar service, dump it in the context
- provide examples explaining specific things in the exemplar service
- write a detailed formal spec
- ask for the output in JSON to simplify writing the code - [{"filename":"./src/index.php", "contents":"<?php...."}]
The first try would inevitably fail, so I'd provide errors and feedback, and ask for new code (i.e. the complete service, not diffs or explanations), plus have o1 update and rewrite the spec based on my feedback and errors. A sketch of unpacking that JSON output is below.
Curious if anyone's tried something like this.
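For the JSON-output step, the unpacking side is simple; a minimal sketch (assuming the model returns the bare array, possibly wrapped in a code fence, and that you trust the paths it emits):
    import json
    import re
    from pathlib import Path

    def write_service(model_output: str, root: str = "generated_service") -> None:
        # Write a [{"filename": ..., "contents": ...}, ...] payload out as files.
        cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", model_output.strip())  # strip an optional code fence
        for entry in json.loads(cleaned):
            path = Path(root) / entry["filename"]  # no path sanitization: assumes trusted output
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(entry["contents"])

    # Example with the format described above:
    write_service('[{"filename": "./src/index.php", "contents": "<?php echo 1;"}]')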
> To justify the $200/mo price tag, it just has to provide 1-2 Engineer hours a month
> Give a ton of context. Whatever you think I mean by a “ton” — 10x that.
One step forward. Two steps back.
this is hilarious
This is a bug, and a regression, not a feature.
It's odd to see it recast as "you need to give better instructions [because it's different]" -- you could drop the "because it's different" part, and it'd apply to failure modes in all models.
It also raises the question of how it's different, and that's where the rationale gets circular: you have to prompt it differently because it's different, because you have to prompt it differently.
And where that really gets into trouble is the "and that's the point" part -- as the other comment notes, it's expressly against OpenAI's documentation and thus intent.
I'm a yuge AI fan. Models like this are a clear step forward. But it does a disservice to readers to leave the impression that the same techniques don't apply to other models, and recasts a significant issue as design intent.
Looking at o1's behavior, it seems there's a key architectural limitation: while it can see chat history, it doesn't seem able to access its own reasoning steps after outputting them. This is particularly significant because it breaks the computational expressivity that made chain-of-thought prompting work in the first place—the ability to build up complex reasoning through iterative steps.
This will only improve when o1's context windows grow large enough to maintain all its intermediate thinking steps; we're talking orders of magnitude beyond current limits. Until then, this isn't just a UX quirk; it's a fundamental constraint on the model's ability to develop thoughts over time.
It's different because a chat model has been post-trained for chat, while o1/o3 have been post-trained for reasoning.
Imagine trying to have a conversation with someone who's been told to assume that they should interpret anything said to them as a problem they need to reason about and solve. I doubt you'd give them high marks for conversational skill.
Ideally one model could do it all, but for now the tech is apparently being trained using reinforcement learning to steer the response towards a singular training goal (human feedback gaming, or successful reasoning).
I wouldn't be so harsh - you could have a 4o-style LLM turn vague user queries into precise constraints for an o1-style AI - this is how a lot of Stable Diffusion image generators work already.
It does seem like individual prompting styles greatly affect the performance of these models. Which makes sense of course, but the disparity is a lot larger than I would have expected. As an example, I'd say I see far more people in the HN comments preferring Claude over everything else. This is in stark contrast to my experience, where ChatGPT has been and continues to be my go-to for everything. And that's on a range of problems: general questions, coding tasks, visual understanding, and creative writing. I use these AIs all day, every day as part of my research, so my experience is quite extensive. Yet in all cases Claude has performed significantly worse for me. Perhaps it just comes down to the way that I prompt versus the average HN user? Very odd.
But yeah, o1 has been a _huge_ leap in my experience. One huge thing, which OpenAI's announcement mentions as well, is that o1 is more _consistently_ strong. 4o is a great model, but sometimes you have to spin the wheel a few times. I much more rarely need to spin o1's wheel, which mostly makes up for its thinking time. (Which is much less these days compared to o1-preview). It also has much stronger knowledge. So far it has solved a number of troubleshooting tasks that there were _no_ fixes for online. One of them was an obscure bug in libjpeg.
It's also better at just general questions, like wanting to know the best/most reputable store for something. 4o is too "everything is good! everything is happy!" to give helpful advice here. It'll say Temu is a "great store for affordable options." That kind of stuff. Whereas o1 will be more honest and thus helpful. o1 is also significantly better at following instructions overall, and inferring meaning behind instructions. 4o will be very literal about examples that you give it whereas o1 can more often extrapolate.
One surprising thing that o1 does that 4o has never done is that it _pushes back_. It tells me when I'm wrong (and is often right!). Again, part of that comes from being less happy and compliant. I have had scenarios where it's wrong and it's harder to convince it otherwise, so it's a double-edged sword, but overall it has been an improvement in the bot's usefulness.
I also find it interesting that o1 is less censored. It refuses far less than 4o, even without coaxing, despite its supposed ability to "reason" about its guidelines :P What's funny is that the "inner thoughts" it shows say that it's refusing, but its response doesn't.
Is it worth $200? I don't think it is, in general. It's not really an "engineer" replacement yet, in that if you don't have the knowledge to ask o1 the right questions it won't really be helpful. So you have to be an engineer for it to work at the level of one. Maybe $50/mo?
I haven't found o1-pro to be useful for anything; it's never really given better responses than o1 for me.
(As an aside, Gemini 2.0 Flash Experimental is _very_ good. It's been trading blows with even o1 for some tasks. It's a bit chaotic, since its training isn't done, but I rank it at about #2 between all SOTA models. A 2.0 Pro model would likely be tied with o1 if Google's trajectory here continues.)
Oh god, using an LLM for medical advice? And maybe getting 3/5 right? Barely above a coin flip.
And that Warning section? "Do not be wrong. Give the correct names." That this is necessary to include is an idiotic product "choice" since its non-inclusion implies the bot is able to be wrong and give wrong names. This is not engineering.
Not if you're selecting out of 10s or 100s of possible diagnoses - chance level there is more like 1-10%, not 50%.