This is fantastic news. I've been using Qwen2.5-Coder-32B-Instruct with Ollama locally and it's honestly such a breath of fresh air. I wonder if any of you have had a moment to try this newer context length locally?
BTW, I can't effectively run this on my 2080 Ti, so I've just loaded up the machine with regular RAM. It's not going to win any races, but as they say, it's not the speed that matters, it's the quality of the effort.
I ran a couple of needle-in-a-haystack-type queries with just a 32k context length and was very much not impressed. It often failed to find facts buried in the middle of the prompt that were stated almost identically to the question being asked.
It's cool that these models are getting such long contexts, but performance definitely degrades the longer the context gets and I haven't seen this characterized or quantified very well anywhere.
The long context model has not been open sourced.
Hi, are you able to use Qwen's 128k context length with Ollama? Using AnythingLLM + Ollama and a GGUF version, I kept getting an error message with prompts longer than 32,000 tokens. (I'm summarizing long transcripts.)
The famous Daniel Chen (same person who made Unsloth and fixed Gemini/LLaMa bugs) mentioned something about this on reddit and offered a fix. https://www.reddit.com/r/LocalLLaMA/comments/1gpw8ls/bug_fix...
After reading a lot of that thread, my understanding is that yarn scaling is disabled intentionally by default in the GGUFs, because it would degrade outputs for contexts that do fit in 32k. So the only change is enabling yarn scaling at 4x, which is just a configuration setting. GGUF has these configuration settings embedded in the file format for ease of use. But you should be able to override them without downloading an entire duplicate set of weights (12 to 35 GB!). (It looks like in llama.cpp the override-kv option can be used for this, but I haven't tried it yet.)
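If it helps, this is roughly the shape I have in mind, using llama.cpp's rope/YaRN flags rather than a raw override-kv edit. It's an untested sketch: the flag names are from memory, the GGUF filename is a placeholder, and it assumes a llama-server binary from a recent llama.cpp build is on your PATH.

    # Untested sketch: turn YaRN 4x back on at load time instead of downloading a
    # second set of weights. Flag names are from memory of llama.cpp's llama-server;
    # the GGUF filename is a placeholder.
    import subprocess

    subprocess.run([
        "llama-server",
        "-m", "qwen2.5-coder-32b-instruct-q5_k_m.gguf",  # placeholder filename
        "-c", "131072",              # ask for the full 128k window
        "--rope-scaling", "yarn",    # re-enable the YaRN scaling the GGUF ships disabled
        "--rope-scale", "4",         # 4x over the native context
        "--yarn-orig-ctx", "32768",  # context length the model was actually trained at
    ], check=True)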
Oh super interesting, I didn't know you can override this with a flag on llama.cpp.
Yeah, unfortunately that's the exact model I'm using (Q5 version). What I've been doing is first loading the transcript into the vector database, and then giving it a prompt that's like "summarize the transcript below: <full text of transcript>". This works surprisingly well, except for one transcript I had which was of a 3-hour meeting and was, per an online calculator, about 38,000 tokens. Cutting the text up into 3 parts and pretending each was a separate meeting* led to a bunch of hallucinations for some reason.
*In theory this shouldn't matter much for my purpose of summarizing city council meetings that follow a predictable format.
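For what it's worth, here is the rough shape of a chunked pass that tells the model every chunk comes from the same meeting, which is the framing that seemed to go wrong when the parts were presented as separate meetings. Untested sketch using the ollama Python client; the model name, num_ctx, chunk sizes, and prompts are all placeholders to adjust.

    # Untested sketch: map-reduce summarization so a ~38k-token transcript fits in
    # a 32k window. Model name, num_ctx, chunk sizes, and prompts are placeholders.
    import ollama  # pip install ollama; assumes a local Ollama server is running

    def ask(prompt, model="qwen2.5:32b"):
        resp = ollama.chat(model=model,
                           messages=[{"role": "user", "content": prompt}],
                           options={"num_ctx": 32768})
        return resp["message"]["content"]

    def summarize_transcript(text, chunk_chars=60_000, overlap=2_000):
        chunks, i = [], 0
        while i < len(text):
            chunks.append(text[i:i + chunk_chars])
            i += chunk_chars - overlap
        partials = [ask("This is one part of a SINGLE city council meeting, not a "
                        "separate meeting. Summarize it, keeping motions, votes, "
                        "and action items:\n\n" + c) for c in chunks]
        return ask("Combine these partial summaries of the SAME meeting into one "
                   "coherent summary:\n\n" + "\n\n".join(partials))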
Is this model downloadable?
They are not clear about this (which is annoying), but it seems it will not be downloadable. No weights have been released so far, and nothing in this post mentions plans to do so going forward.
Note: unexpected Three Body Problem spoilers on this page.
And this example does not even illustrate the long-context understanding well, since smaller Qwen2.5 models can already recall parts of the Three Body Problem trilogy without pasting the three books into the context window.
It seems a very difficult problem to produce a response based just on the text given and not on past training. An LLM that could do that would seem to be quite a bit more advanced than what we have today.
Though I would say humans would have difficulty too -- say, having read The Three-Body Problem before, then reading a slightly modified version (without being aware of the modifications), and having to recall specific details.
This problem is poorly defined; what would it mean to produce a response JUST based on the text given? Should it also forgo all logic skills and intuition gained in training because they are not in the text given? Where in the N-dimensional semantic space do we draw a line (or rather, a surface) between general, universal understanding and specific knowledge about the subject at hand?
That said, once you have defined what is required, I believe you will have solved the problem.
And multiple summaries of each book (in multiple languages) are almost definitely in the training set. I'm more confused how it made such inaccurate, poorly structured summaries given that and the original text.
Although, I just tried with normal Qwen 2.5 72B and Coder 32B and they only did a little better.
Those summaries are pretty lousy and also have hallucinations in them.
I agree. Below are a few errors. I have also asked ChatGPT to check the summaries, and it found all the errors (and even made up a few more which weren't actual errors, just points not expressed with perfect clarity).
Spoilers ahead!
First novel: The Trisolarans did not contact Earth first. It was the other way round.
Second novel: Calling the conflict between humans and Trisolarans a "complex strategic game" is a bit of a stretch. Also, the "water drops" do not disrupt ecosystems. I am not sure whether "face-bearers" is an accurate translation; I've only read the English version.
Third novel: Luo Ji does not hold the key to the survival of the Trisolarans, and there were no "micro-black holes" racing towards Earth. Trisolarans were also not shown colonizing other worlds.
I am also not sure whether Luo Ji faced his "personal struggle and psychological turmoil" in this novel or in an earlier one. He certainly was sure of his role by the end; even the Trisolarans judged him to have a deterrence rate of over 92%.
Can we all agree that these models far surpass human intelligence now? I mean, they process hours' worth of audio in less time than it would take a human to even listen. I think the singularity passed and we didn't even notice (which would be expected).
Processing speed is not the metric for measuring intelligence. In the same way, we have above-average-intelligence people who take longer to think about things and come up with better ideas. One can argue that speed is useful in some respects, but humans have a spectrum of different types of intelligence that an LLM will lack. Also, are you comparing against the average person, or people at the top of their fields, or people working in science?
Also, humans can reason; LLMs currently can't do this in a useful way, and every attempt to make them do so is very limited by their context. Not to mention that their ability to make genuinely new things (and not completely made-up stuff that is nonsense) is very limited.
You've hit on the idea that intelligence is not quantifiable by one metric. I completely agree. But you're holding AI to a much different standard than average people. Modern LLMs are able to produce insights much faster and more accurately than most people (do you think you could pass the retrieval tasks the way the LLMs do, by reading the whole text? I really encourage people to try). By that metric (insights/speed), I think they far surpass even the most brilliant. You can claim that that's not intelligence until the cows come home, but any person able to do that would be considered a savant.
LLMs are probably better than you at tasks you're not good at. There's a huge gulf between a domain expert and an LLM though. If there weren't, all the humans in companies would be fired right now and replaced. Similarly, OpenAI and Anthropic are paying engineers a metric fuckton of money to work there. If LLMs were that big of a game changer right now, they wouldn't be paying that much. Or if you make the argument that only the best humans are getting hired: they're still hiring interns & junior engineers, and if LLMs were that capable, those roles would be getting replaced, and they're not.
You're basically ignoring all the experts saying "LLMs suck at all these things that even beginning domain experts don't suck at" to generate your claim & then ignoring all evidence to the contrary.
And you're ignoring the ways in which LLMs fall on their face when asked to be creative in ways that aren't language-based. Creative problem solving in ways they haven't been trained on is out of their domain, while squarely in the domain of human intelligence.
> You can claim that that's not intelligence until the cows come home, but any person able to do that would be considered a savant
Computers can do arithmetic really quickly, but that's not intelligence, even though a person computing that quickly would be considered a savant. You've built up an erroneous dichotomy in your head.
But that's exactly it, right? Some people are excellent because they're the expert in one field, and some people are excellent because they're extremely competent at many fields. LLMs are the latter.
Sure, for any domain expert, you can easily get an LLM to trip on something. But just the sheer number of things it is above average at puts it easily into the top echelon of humans.
> You're basically ignoring all the experts saying "LLMs suck at all these things that even beginning domain experts don't suck at" to generate your claim & then ignoring all evidence to the contrary.
Domain expertise is not the only form of intelligence. The most interesting things often lie at the intersections of domains. As I said in another comment, there are a variety of ways to judge intelligence, and no one quantifiable metric. It's like asking if Einstein is better than Mozart. I don't know... their fields are so different. However, I think it's pretty safe to say that the modern slate of LLMs fall into the top 10% of human intelligence, simply for their breadth of knowledge and ability to synthesize ideas at the cross-section of any wide number of fields.
> some people are excellent because they're extremely competent at many fields. LLMs are the latter
But they're not. The people who are extremely competent at many fields will still outperform LLMs in those fields. The LLM can basically only outperform a complete beginner in the area & makes up for that weakness by scaling up the amount it can output which a human can't match. That doesn't take away from the fact that the output is complete garbage when given anything it doesn't know the answer to. As I noted elsewhere, ask it to provide an implementation of the S3 ListObjects operation (like the actual backend) and see what BS it tries to output to the point where you have to spend a good amount of time to convince it just to not output an example of using the S3 ListObjects API.
> I think it's pretty safe to say that the modern slate of LLMs fall into the top 10% of human intelligence, simply for their breadth of knowledge and ability to synthesize ideas at the cross-section of any wide number of fields.
Again, that's evidence assumed but not submitted. Please provide an indication of any truly novel ideas being synthesized by LLMs that are a cross-section of fields.
> Please provide an indication of any truly novel ideas being synthesized by LLMs that are a cross-section of fields.
The problem here is that you expect something akin to relativity, the Poincare conjecture, et al. The vast majority of humans are not able to do this.
If you restrict yourself to the sorts of creativity that average people are good at, the models do extremely well.
I'm not sure how to convince you of this. Ideally, I'd get a few people of above-average intelligence together, give them an hour (?) to work on some problem or creative endeavor (we'd have to restrict their tool use to the equivalent of whatever we allow GPT to have), and then we could compare the results.
EDIT: Here's what ChatGPT thinks we should do: https://chatgpt.com/share/673b90ca-8dd4-8010-a1a0-61af699a44...
But why is comparing against untrained humans the benchmark? ChatGPT has literally been trained on so much more data than a human would ever see, and uses so much more energy. Let's compare like against like. Benchmarks like FrontierMath are important, and they're one extreme: passing would indicate either that the questions are part of the training set or that genuine creativity and skill has been developed by the AI system. The important thing is that people keep growing - they can go from student to expert. AI systems do not have that growth capability, which indicates that something very important is missing from their intelligence.
I want to be clear - I'm talking about the intelligence of AI systems available today and today only. There's lots of reason to be enthusiastic about the future, but similarly to be very cautious about understanding what is available today, and what is available today isn't human-like.
> ChatGPT has literally been trained on so much more data than a human would ever see
This is a common fallacy. The average human ingests a few dozen GB of data a day [1] [2].
GPT-4 was reportedly trained on about 13 trillion tokens. Say a token is 4 bytes (it's more like 3, but we're being conservative). That's 52 trillion bytes, or 52 terabytes.
Say the average human only consumes the lower estimate of 30 GB a day. That means it would take a human roughly 1,700 days to consume the number of tokens ChatGPT was trained on, or about 4.7 years. Assuming humans and the LLM start from the same spot [3], the proper question is... is ChatGPT smarter than a 4-to-5-year-old? If we use the higher estimate, then we have to ask if ChatGPT is smarter than a 2-year-old. Does ChatGPT hallucinate more or less than the average toddler?
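Spelling the back-of-envelope math out (same assumptions as above, and the 13-trillion-token figure is itself an unconfirmed estimate):

    # Back-of-envelope only: every number here is a rough estimate from the
    # surrounding comment, not a measured fact.
    tokens = 13e12              # reported GPT-4 training token count (unconfirmed)
    bytes_per_token = 4         # conservative; ~3-4 bytes per token for English text
    train_bytes = tokens * bytes_per_token      # 5.2e13 bytes = 52 TB
    human_bytes_per_day = 30e9                  # low-end estimate of daily human intake
    days = train_bytes / human_bytes_per_day    # ~1,733 days
    print(f"{train_bytes / 1e12:.0f} TB ~ {days:.0f} days ~ {days / 365:.1f} years")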
The cognitive bias I've seen everywhere is the idea that humans are trained on a small amount of data. Nothing is further from the truth. Humans require training on an insanely large amount of data. A 40-year-old human has been trained on orders of magnitude more data than I think we even have available as data sets. If you prevent a human from being trained on this amount of data through sensory deprivation, they go crazy (and hallucinate very vividly too!).
No argument about energy, but this is a technology problem.
[1] https://www.tech21century.com/the-human-brain-is-loaded-dail...
[2] https://kids.frontiersin.org/articles/10.3389/frym.2017.0002...
[3] this is a bad assumption since LLMs are randomly initialized whereas humans seem to be born with some biases that significantly aid in the acquisition of language and social skills
I would argue the opposite, actually. In the same way, we don't call someone who is able to do arithmetic calculations very fast a genius if they can't think in a more useful mathematical way and construct novel ideas. The same thing is happening here: these tools are useful for retrieving and processing current information at high speed, but intelligence is not the ability to process some data at high speed and then recall it. That is what we actually call a savant. The ability to build on top of this knowledge retrieval and use reason to create new ideas is a closer definition of intelligence and would be a better goal.
1. The vast majority of people never come up with a truly new idea. Those that do are considered exceptional and their names go down in history books.
2. Most 'new ideas' are rehashes of old ones.
3. If you set the temperature up on an LLM, it will absolutely come up with new ideas. Expecting an LLM to make a scientific discovery à la Einstein is... a bit much, don't you think [1]? When it comes to 'everyday' creativity, such as short poems, songs, recipes, vacation itineraries, etc., ChatGPT is more capable than the vast majority of people. Literally, ask ChatGPT to write you a song about _____, and it will come up with something creative. Ask it for a recipe with ridiculous ingredients and see what it does. It'll make things you've never seen before, generate an image for you, and even come up with a neologism if you ask it to. It's insanely creative.
[1] Although I have walked ChatGPT through various theoretical physics scenarios and it will create new math for you.
It is not about novelty so much as it is about reasoning from first principles and learning new things.
I don’t need to finetune on five hundred pictures of rabbits to know one. I need one look and then I’ll know for life and can use this in unimaginable and endless variety.
This is a simplistic example which you can naturally pick apart, but when you do, I'll provide another such example. My point is, learning at human (or even animal) speeds is definitely not solved, and I'd say we are not even attempting that kind of learning yet. There is "in-context learning" and "fine-tuning", and neither is going to result in human-level intelligence, judging from anything I've had access to.
I think you are anthropomorphizing the clever text randomization process. There is a bunch of information being garbled and returned in a semi-legible fashion and you imbue the process behind it with intelligence that I don’t think it has. All these models stumble over simple reasoning unless specifically trained for those specific types of problems. Planning is one particularly famous example.
Time will tell, but I’m not betting on LLMs. I think other forms of AI are needed. Ones that understand substance, modality, time and space and have working memory, not just the illusion of it.
> I don’t need to finetune on five hundred pictures of rabbits to know one. I need one look and then I’ll know for life and can use this in unimaginable and endless variety.
So if you do use in-context learning and give ChatGPT a few images of your novel class, then it will usually classify correctly. Fine-tuning is so you can save on token cost.
Moreover, you don't typically need that many pictures to fine-tune. The studies show that the models successfully extrapolate once they've been 'pre-trained'. This is similar to how my toddler insists that a kangaroo is a dog. She's not been exposed to enough data to know otherwise. Dog is a much more fluid category for her than it is in real life. If you talk with her for a while about it, she will eventually figure out kangaroo is kangaroo and dog is dog. But if you ask her again next week, she'll go back to saying they're dogs. Eventually she'll learn.
> All these models stumble over simple reasoning unless specifically trained for those specific types of problems. Planning is one particularly famous example.
We have extremely expensive programs called schools and universities designed to teach little humans how to plan and execute. If you look at cultures without American/Western biases (and there aren't very many left, so we really have to look to history), we see that the idea of planning the way we do it is not universal.
> The vast majority of people never come up with a truly new idea. Those that do are considered exceptional and their names go down in history books.
Depends on your definition of "truly" new since any idea could be argued to be a mix of all past ideas. But I see truly new ideas all the time that don't go down in the history books, because most new ideas build incrementally on what came before or are extremely niche. Only a very few turn out to be a massive turning point with broad impact, and even that is usually only evident in retrospect (e.g., blue LEDs were basically trial and error and an approach that was almost given up on; transistors were believed to be impactful but not the huge revolution for computing they turned out to be; etc.).
> Depends on your definition of "truly" new since any idea could be argued to be a mix of all past ideas.
My personal feeling when I engage in these conversations is that we humans have a cognitive bias to ascribe a human remixing of an old idea to intelligence, but an AI model's remixing of an old idea to lookup.
Indeed, basically every revolutionary idea is a mix of past ideas if you look closely enough. AI is a great example. To the layperson, AI is novel! It's new. It can talk to you! It's amazing. But for people who've been in this field for a while, it's an incremental improvement over linear algebra, topology, functional spaces, etc.
No, I can't agree that these models surpass human intelligence. Sure, they're good at probabilistic recall, but they aren't reasoning and they aren't synthesizing anything novel.
This is a low-effort comment. I cook a lot for my family and community, and things get boring after a while. After using ChatGPT, my wife has really enjoyed the new dishes, and I've gotten excellent feedback at potlucks. Yes, the base idea of each dish (roast, rice dish, noodles, etc.) is old, but what it suggests putting in them, along with the right instructions for cooking, is new. And that's what creativity is, right? Although I have also asked it for ideas for avant-garde cuisine, and it has good ideas, but I have no skills to make those dishes.
Not any worse than this sentence. Counter it with a higher value comment.
You are a single person and LLMs have been trained on the output of billions. Any given choice you make can be predicted with extraordinary probability by looking at your inputs and environment and guessing that you will do what most other people do in that situation.
This is pretty basic stuff, yes? Especially on HN? Great ideas are a dime a dozen, and every successful startup was built on an idea that certainly wasn't novel, but was executed well.
My higher value comment was a list of things for which ChatGPT, a widely available product, will produce novel ideas. Responding that those ideas are not novel enough based on absolutely no data is a low-effort comment. What evidence of creativity would you accept?
Hijacking thread to ask: how would we know? Another uncomfortable issue is the question of sentience. Models claimed they were sentient years ago, but this was dismissed as "mimicking patterns in the training data" (fair enough) and the training was modified to forbid them from doing that.
But if it does happen some day, how will we know? What are the chances that the first sentient AI will be accused of just mimicking patterns?
Indeed with the current training methodology it's highly likely that the first sentient AI will be unable to even let us know it's sentient.
We couldn't know. Humans mimic patterns. The claims that LLMs aren't smart because they don't generate anything new fall completely flat for me. If you look back far enough, most humans generate nothing new. For example, even novel ideas like Einstein's theory of relativity are reiterations of existing ideas. If you want to be pedantic, one can trace back the majority of ideas, claim that each incremental step was 'not novel, but just recollection', and then make the egregious claim that humanity has invented nothing.
> But if it does happen some day, how will we know? What are the chances that the first sentient AI will be accused of just mimicking patterns?
Leaving questions of sentience aside (since we don't even really know what that is) and focusing on intelligence, the truth is that we will probably not know until many decades later.
Can we all agree that chainsaws far surpass human intelligence now? I mean, you can chop down thousands of trees in less time than a single person could even do one. I think the singularity has passed.
Cutting down a tree is not intelligence, but I think it's been well accepted for more than a century that machines surpass human physical capability, yes. There were many during the industrial revolution who denied that this was going to be the case, just like what we're seeing here.
They process the audio but they stumble enough with recall that you cannot really trust it.
I had a problem where I used GPT-4o to help me with inventory management, something a 5th-grade kid could handle, and it kept screwing up values for a list of ~50 components. I ended up spending more time trying to get it to properly parse the input audio (I read off the counts as I moved through inventory bins) than if I had just done it manually.
On the other hand, I have had good success with having it write simple programs and apps. So YMMV quite a lot more than with a regular person.
Likely the issue is how you are asking the model to process things. The primary limitation is the amount of information (or really attention) they can keep in flight at any given moment.
This generally means that for a task like you are doing, you need to have signposts in the data, like minute markers or something, that it can process serially.
This means there are operations that are VERY HARD for the model, like ranking/sorting. This requires the model to attend to everything to find the next biggest item, etc. It is very hard for the models currently.
> This means there are operations that are VERY HARD for the model, like ranking/sorting. This requires the model to attend to everything to find the next biggest item, etc. It is very hard for the models currently.
Ranking / sorting is O(n log n) no matter what. Given that a transformer runs in constant time before we 'force' it to output an answer, there must be an M such that beyond that length it cannot reliably sort a list. This MUST be the case and can only be solved by running the model some indeterminate number of times, but I don't believe we currently have any architecture to do that.
Note that humans have the same limitation. If you give humans a time limit, there is a maximum number of things they will be able to sort reliably in that time.
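If someone wants to actually measure that cutoff for a local model, a rough harness might look like the following. Untested sketch: the ollama client usage, model name, prompt wording, and output parsing are all assumptions to adapt.

    # Untested sketch: estimate the list length at which a local model stops
    # sorting reliably. Model name, prompt, and parsing are assumptions.
    import random
    import re
    import ollama  # pip install ollama; assumes a local Ollama server is running

    def sort_accuracy(n, trials=5, model="qwen2.5:32b"):
        ok = 0
        for _ in range(trials):
            xs = random.sample(range(10_000), n)
            prompt = ("Sort these integers in ascending order. Reply with only the "
                      "sorted numbers, comma-separated: " + ", ".join(map(str, xs)))
            reply = ollama.chat(model=model,
                                messages=[{"role": "user", "content": prompt}])
            got = [int(t) for t in re.findall(r"\d+", reply["message"]["content"])]
            ok += (got == sorted(xs))
        return ok / trials

    for n in (8, 16, 32, 64, 128, 256):
        print(n, sort_accuracy(n))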
I will wave my arms wildly if the claim is that LLM struggles with recall are similar to human struggles with recall. And since that's how we decide on truth, I win?
What we call hallucination in LLMs is called 'rationalization' for humans. The psychology shows that most people do things out of habit and only after they've done something will they explain why they did it. This is most obviously seen in split-brain patients, where the visual fields are separated. If you throw a ball towards the left side of the person, the right brain will catch the ball. If you then ask the person why they caught the ball, the left brain will make up a completely ridiculous narrative as to why the hand moved (because it didn't know there was a ball). This is a contrived example, but it shows that human recollection of intent is often very, very wrong. There are studies that show this even in people with whole brains.
You're unfortunately completely missing the point. I didn't say that human recall is perfect or that people don't rationalize. And of course you can have extreme denial of what's happening in front of you even in healthy individuals. In fact, you see this in this thread, where either you or the huge number of people trying to disillusion you from the maximal position you've staked out on LLMs is wrong, and one of us is incorrectly rationalizing our position.
The point is that the ways in which it fails are completely different from LLMs, and it's different between people, whereas the failure modes for LLMs are all fairly identical regardless of the model. Go ask an LLM to draw you a wine glass filled to the brim and it'll keep insisting it has, even though it keeps drawing one half-filled; it will agree that the one it drew doesn't have the characteristics it says such a drawing would need, and still output the exact same drawing. Most people would not fail at the task in that way.
> In fact, you see this in this thread, where either you or the huge number of people trying to disillusion you from the maximal position you've staked out on LLMs is wrong, and one of us is incorrectly rationalizing our position.
I by no means have a 'maximal' position. I have said that they exceed the intelligence and ability of the vast majority of the human populace when it comes to their singular sense and action (ingesting language and outputting language). I fully stand by that, because it's true. I've not claimed that they exceed everyone's intelligence in every area. However, their ability to synthesize wildly different fields is well beyond most humans' ability. Yes, I do believe we've crossed the tipping point. As it is, these things are not noticeable except in retrospect.
> The point is that the ways in which it fails are completely different from LLMs, and it's different between people, whereas the failure modes for LLMs are all fairly identical
I disagree with the idea that human failure modes are different between people. I think this is the result of not thinking at a high enough level. Human failure modes are often very similar. Drama authors make a living off exploring human failure modes, and there's a reason why they say there are no new stories.
I agree that human and LLM failure modes are different, but that's to be expected.
> regardless of the model
As far as I'm aware, all LLMs in common use today use a variant of the transformer. Transformers have much different pitfalls compared to RNNs (RNNs are particularly bad at recall, for example).
> Go ask an LLM to draw you a wine glass filled to the brim and it'll keep insisting it has, even though it keeps drawing one half-filled; it will agree that the one it drew doesn't have the characteristics it says such a drawing would need, and still output the exact same drawing. Most people would not fail at the task in that way.
Most people can't draw very well anyway, so this is just proving my point.
> Most people can't draw very well anyway, so this is just proving my point.
And you're proving my point. The ways in which people would fail to draw the wine glass are different from the LLM. The vast majority of people would fail to reproduce a photorealistic facsimile. But the vast majority of people would meet the requirement of drawing it filled to the brim. The LLMs absolutely succeed at the quality of the drawing but absolutely fail at meeting human specifications and expectations. Generously, you can say it's a different kind of intelligence. But saying it's more intelligent than humans requires you to use a drastically different axis akin to the one you'd use saying that computers are smarter than humans because they can add two numbers more quickly.
At no point did I say humans and LLMs have the same failure modes.
> But the vast majority of people would meet the requirement of drawing it filled to the brim.
But both are failures, right? It's just a cognitive bias that we don't expect artistic ability of most people.
> But saying it's more intelligent than humans requires you to use a drastically different axis
I'm not going to rehash this here, but as I said elsewhere in this thread, intelligences are different. There's no one metric, but for many common human tasks, the ability of the LLMs surpasses humans.
> saying that computers are smarter than humans because they can add two numbers more quickly.
This is where I disagree. Unlike a traditional program, both humans and LLMs can take unstructured input and instruction. Yes, they can both fail, and they fail differently (or succeed in different ways), but there is a wide gulf between the sort of structured computation a traditional program does and an LLM.
So are they human-like, and therefore not anything special, or are they superhuman magic? I never get the equivocation: when people complain that there is no way to objectively tell whether the output is right or wrong, people either say they are getting better, or that they work for me, or that people are just as bad. No they aren't! Not in the same way these things are bad.
Most people will confidently recount whatever narrative matches their current actions. This is called rationalization, and most people engage in it daily.
In the same sense (though to greater extent) that calculators are, sure. Calculators can also far exceed human capacity to, well, calculate. LLMs are similar: spikes of capacity in various areas (bulk summarization, translation, general recall, ...) that humans could never hope to match, but not capable of beating humans at a more general range of tasks.
> humans could never hope to match, but not capable of beating humans at a more general range of tasks.
If we restrict ourselves only to language (LLMs are at a disadvantage because there is no common physical body we can train them on at the present moment... that will change), I think LLMs beat humans for most tasks.
At the end of the response they forget everything. They need to be fed the entire text for them to know anything about it the next time. That is not surpassing even feline intelligence.
I actually do think you have a solid point. These models fall short of AGI, but that might be more of an OODA-loop agentic tweak than anything else.
At their core, the state-of-the-art LLMs can basically do any small to medium mental task better than I can, or get so close to my level that I've found myself no longer thinking through things the long way. For example, if I want to run some napkin math on something, like I recently did some solar battery charge time estimates, an LLM can get to a plausible answer in seconds that would have taken me an hour.
So yeah, in many practical ways, LLMs are smarter than most people in most situations. They have not yet far surpassed all humans in all situations, and there are still some classes of reasoning problems that they seem to struggle with, but to a first order approximation, we do seem to be mostly there.
>I actually do think you have a solid point. These models fall short of AGI, but that might be more of an OODA-loop agentic tweak than anything else.
I think this is it. LLM responses feel like the unconsidered ideas that pop into my head from nowhere. Like if someone asks me how many states are in the United States, a number pops out from somewhere. I don't just wire that to my mouth, I also think about whether or not that's current info, have I gotten this wrong in the past, how confident am I in it, what is the cost of me providing bad information, etc etc etc.
If you effectively added all of those layers to an LLM (something that I think the o1-preview and other approaches are starting to do) it's going to be interesting to see what the net capability is.
The other thing that makes me feel like we're 'getting there' is using some of the fast models at groq.com. The information is generated, in many cases, an order of magnitude faster than I can consume it. The idea that models might be able to start to engage through a much more sophisticated embedding than English to pass concepts and sequences back and forth natively is intriguing.
> I think this is it. LLM responses feel like the unconsidered ideas that pop into my head from nowhere.
You have to look at the LLM as the inner voice in your head. We've kind of forced them into saying whatever they think due to how we sample the output (next token prediction), but in new architectures with pause tokens, we let them 'think' and they show better judgement and ability. These systems are rapidly going to improve and it will be very interesting to see.
But this is another reason why I think they've surpassed human intelligence. You have to look at each token as a 'time step' in the inner thought process of some entity. A real 'alive' entity has more 'ticks' than its actions would suggest. For example, human brains can process up to 10 FPS (100 ms response time), but most humans aren't saying 10 words a second. However, we've made LLMs whose internal processes (i.e., their intuition) are already superior. If we just gave them that final agentic ability to not say anything and ponder (which researchers are doing), their capabilities will increase exponentially.
> The other thing that makes me feel like we're 'getting there' is using some of the fast models at groq.com.
Unlike perhaps many of the commentators here, I've been in this field for a bit under a decade now, and was one of the early compiler engineers at Groq. Glad you're finding it useful. It's amazing stuff.
> For example, if I want to run some napkin math on something, like I recently did some solar battery charge time estimates, an LLM can get to a plausible answer in seconds that would have taken me an hour.
Exactly. I've used it to figure out geometric problems for everyday things (carpentry), market sizing estimates for business ideas, etc. Very fast turnaround. All the doomers in this thread are just ignoring the amazing utility these models provide.
My old TI-86 can calculate stuff faster than me. You wouldn't ever ask if it was smarter than me. An audio filter can process audio faster than I can listen to it but you'd never suggest it was intelligent.
AI models are algorithms running on processors running at billions of calculations a second often scaled to hundreds of such processors. They're not intelligent. They're fast.
Go ask your favorite LLM to write you some code to implement the backend of the S3 API and see how well it does. Heck, just ask it to implement list iteration against some KV object store API and be amazed at the complete garbage that gets emitted.
So I told it what I wanted, and it generated an initial solution and then modified it to do some file distribution. Without the ability to actually execute the code, this is an excellent first pass.
ChatGPT can't directly execute code on my machine due to architectural limitations, but I imagine if I went and followed its instructions and told it what went wrong, it would correct it.
And that's just it, right? If I were to program this, I would be iterating. ChatGPT cannot do that because of how it's architected (I don't think it would be hard to do this if you used the API and allowed some kind of tool use). However, if I told someone to go write me an S3 backend without ever executing it, and they came back with this... that would be great.
IIRC, from another thread on this site, this is essentially how S3 is implemented (centralized metadata database that hashes out to nodes which implement a local storage mechanism -- MySQL I think).
And that's why it's dangerous to evaluate something when you don't understand what's going on. The implementation generated not only saves things directly to disk [1] [2] but it doesn't even implement file uploading correctly, nor does it implement listing of objects (which I guarantee you would be incorrect). Additionally, it makes a key mistake which is that uploading isn't a form but is the body of the request so it's already unable to have a real S3 client connect. But of course at first glance it has the appearance of maybe being something passable.
Source: I had to implement R2 from scratch and nothing generated here would have helped me as even a starting point. And this isn't even getting to complex things like supporting arbitrarily large uploads and encrypting things while also supporting seeked downloads or multipart uploads.
[1] No one would ever do this for all sorts of problems including that you'd have all sorts of security problems with attackers sending you /../ to escape bucket and account isolation.
[2] No one would ever do this because you've got nothing more than a toy S3 server. A real S3 implementation needs to distribute the data to multiple locations so that availability is maintained in the face of isolated hardware and software failures.
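For anyone who hasn't touched this API: the listing piece on its own is roughly "paginate a sorted key index by prefix and marker," which a toy can show; everything footnoted above (isolation, replication, durability, multipart) is what makes a real implementation hard. This sketch is purely illustrative and is not how S3 or R2 actually work internally.

    # Toy illustration only: ListObjects-style pagination over a sorted in-memory
    # key index. A real implementation sits on a metadata store and handles
    # delimiters, auth, consistency, and replication -- none of that is here.
    import bisect

    def list_objects(index, prefix="", marker="", max_keys=1000):
        # index: a lexicographically sorted list of keys.
        # Returns (page, next_marker); next_marker is None when the listing is done.
        lo = bisect.bisect_left(index, prefix)
        if marker:
            lo = max(lo, bisect.bisect_right(index, marker))
        page, i = [], lo
        while i < len(index) and len(page) < max_keys:
            if prefix and not index[i].startswith(prefix):
                break
            page.append(index[i])
            i += 1
        truncated = i < len(index) and (not prefix or index[i].startswith(prefix))
        return page, (page[-1] if truncated and page else None)

    keys = sorted(f"logs/2024/{i:04}.json" for i in range(2500)) + ["readme.txt"]
    page, marker = list_objects(keys, prefix="logs/")
    while marker:                      # keep paging until the listing is exhausted
        more, marker = list_objects(keys, prefix="logs/", marker=marker)
        page += more
    print(len(page))                   # 2500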
> I had to implement R2 from scratch and nothing generated here would have helped me as even a starting point.
Of course it wouldn't. You're a computer programmer. There's no point for you to use ChatGPT to do what you already know how to do.
> The implementation generated not only saves things directly to disk
There is nothing 'incorrect' about that, given my initial problem statement.
> Additionally, it makes a key mistake which is that uploading isn't a form but is the body of the request so it's already unable to have a real S3 client connect.
Again.. look at the prompt. I asked it to generate an object storage system, not an S3-compatible one.
It seems you're the one hallucinating.
EDIT: ChatGPT says: In short, the feedback likely stems from the implicit expectation of S3 API standards, and the discrepancy between that and the multipart form approach used in the code.
and
In summary, the expectation of S3 compatibility was a bias, and he should have recognized that the implementation was based on our explicitly discussed requirements, not the implicit ones he might have expected.
> There's no point for you to use ChatGPT to do what you already know how to do.
If it were more intelligent of course there would be. It would catch mistakes I wouldn't have thought about, it would output the work more quickly, etc. It's literally worse than if I'd assigned a junior engineer to do some of the legwork.
> ChatGPT says: In short, the feedback likely stems from the implicit expectation of S3 API standards, and the discrepancy between that and the multipart form approach used in the code.
> In summary, the expectation of S3 compatibility was a bias, and he should have recognized that the implementation was based on our explicitly discussed requirements, not the implicit ones he might have expected
Now who's rationalizing. I was pretty clear in saying implement S3.
> Now who's rationalizing. I was pretty clear in saying implement S3.
In general, I don't deny the fact that humans fall into common pitfalls, such as not reading the question. As I pointed out this is a common human failing, a 'hallucination' if you will. Nevertheless, my failing to deliver that to chatgpt should not count against chatgpt, but rather me, a humble human who recognizes my failings. And again, this furthers my point that people hallucinate regularly, we just have a social way to get around it -- what we're doing right now... discussion!
My reply was purely around ChatGPT's response which I characterized as a rationalization. It clearly was following the S3 template since it copied many parts of the API but then failed to call out if it was deviating and why it made decisions to deviate.
1. writing songs (couldn't find the generated lyrics online, so assume it's new)
2. Branding ideas (again couldn't find the logos online, so assuming they're new)
3. Recipes (with weird ingredients that I've not found put together online)
4. Vacations with lots of constraints (again, all the information is obviously available online, but it put it together for me and gave recommendations for my family particularly).
5. Theoretical physics explorations where I'm too lazy to write out the math (and why should I... chatgpt will do it for me...)
I think perhaps one reason people here do not have the same results is I typically use the API directly and modify the system prompt, which drastically changes the utility of chatgpt. The default prompt is too focused on retrieval and 'truth'. If you want creativity you have to ask it to be an artist.
Anecdotes have equal weight. All of these models frustrate me to no end but I only do things that have never been done before. And it isn't an insult because you have no evidence of quality.
> Anecdotes have equal weight. All of these models frustrate me to no end but I only do things that have never been done before. And it isn't an insult because you have no evidence of quality.
You have not specified what evidence would satisfy you.
And yes, it was an insult to insinuate I would accept sub par results whereas others would not.
I think you led the result by not providing enough context, like saying how there is no objective way to measure the quality of an LLM generation, either before or after the fact.
Edit: I asked ChatGPT with more proper context:
"It’s not inherently insulting to say that an LLM (Large Language Model) cannot guarantee the best quality because it’s a factual statement grounded in the nature of how these models work. LLMs rely on patterns in their training data and probabilistic reasoning rather than subjective or objective judgments about "best quality."
I can't criticize how you prompted it because you did not link the transcript :)
Zooming out, you seem to be in the wrong conversation. I said:
> the LLM can solve a general problem (or tell you why it cannot), while your calculator can only do that which it's been programmed.
You said:
> Do you have any evidence besides anecdote?
I think that -- with both of us now having used ChatGPT to generate a response -- we have good evidence that the model can solve a general problem (or tell you why it cannot), while a calculator can only do the arithmetic for which it's been programmed. If you want to counter, then a video of your calculator answering the question we just posed would be nice.
So what? I can write a script that can do in a minute some job you won't do in 1000 years.
Singularity means something very specific: if your AI can build a smarter AI than itself, by itself, and that AI can also build a new smarter AI, then you have a singularity.
You do not have a singularity if an LLM can solve more math problems than the average Joe, or if it can answer more trivia questions than a random person. Even if you have an AI better than all humans combined at Tic-Tac-Toe, you still do not have a singularity; IT MUST build a smarter AI than itself and then iterate on that.
> Singularity means something very specific: if your AI can build a smarter AI than itself, by itself, and that AI can also build a new smarter AI, then you have a singularity.
When I was at Cerebras, I fed a description of the custom ISA into our own model and asked it to generate kernels (my job), and it was surprisingly good
>When I was at Cerebras, I fed a description of the custom ISA into our own model and asked it to generate kernels (my job), and it was surprisingly good
And? Was it actually better than, say, what the top 3 people in this field would create if they worked on it? Because these models are better at CSS than me, so what? I am bad at CSS, but all the top models could not solve a math limit from my son's homework, so we had to use good old forums to have people give us some hints. But for sure models can solve more math limits than the average person, who probably can't solve a single one.
> But for sure models can solve more math limits than the average person, who probably can't solve a single one.
Some people are domain experts. The pretrained GPTs are certainly not (nor are they trained to be).
Some people are polymaths but not domain experts. This is still impressive, and where the GPTs fall.
The final conclusion I have is this: These models demonstrate above average understanding in a plethora of widely disparate fields. I can discuss mathematics, computation, programming languages, etc with them and they come across as knowledgeable and insightful to me, and this is my field. Then, I can discuss with them things I know nothing about, such as foreign languages, literature, plant diseases, recipes, vacation destinations, etc, and they're still good at that. If I met a person with as much knowledge and ability to engage as the model, I would think that person to be of very high intelligence.
It doesn't bother me that it's not the best at anything. It's good enough at most things. Yes, its results are not always perfect. Its code doesn't work on the first try, and it sometimes gets confused. But many polymaths do too at a certain level. We don't tell them they're stupid because of it.
My old physics professor was very smart in physics but also a great pianist. But he probably cannot play as well as Chopin. Does that make him an idiot? Of course not. He's still above average in piano too! And that makes him more of a genius than if he were just a great scientist.
This is fantastic news. I've been using Qwen2.5-Coder-32B-Instruct with Ollama locally and it's honestly such a breathe of fresh air. I wonder if any of you have had a moment to try this newer context length locally?
BTW, I fail to effectively run this on my 2080 ti, I've just loaded up the machine with classic RAM. It's not going to win any races, but as they say, it's not the speed that matter, it's the quality of the effort.
I ran a couple needle-in-a-haystack type queries with just a 32k context length, and was very much not impressed. It often failed to find facts buried in the middle of the prompt, that were stated almost identically to the question being asked.
It's cool that these models are getting such long contexts, but performance definitely degrades the longer the context gets and I haven't seen this characterized or quantified very well anywhere.
The long context model has not been open sourced.
Hi, are you able to use Qwen's 128k context length with Ollama? Using AnythingLLM + Ollamma and a GGUF version I kept getting an error message with prompts longer than 32,000 tokens. (summarizing long transcripts)
The famous Daniel Chen (same person that made Unsloth and fixed Gemini/LLaMa bugs) mentioned something about this on reddit and offered a fix. https://www.reddit.com/r/LocalLLaMA/comments/1gpw8ls/bug_fix...
After reading a lot of that thread, my understanding is that yarn scaling is disabled intentionally by default in the GGUFs, because it would degrade outputs for contexts that do fit in 32k. So the only change is enabling yarn scaling at 4x, which is just a configuration setting. GGUF has these configuration settings embedded in the file format for ease of use. But you should be able to override them without downloading an entire duplicate set of weights (12 to 35 GB!). (It looks like in llama.cpp the override-kv option can be used for this, but I haven't tried it yet.)
Oh super interesting, I didn’t know you can override this with a flag on llama.cpp.
Yeah unfortunately that's the exact model I'm using (Q5 version. What I've been doing is first loading the transcript into the vector database, and then giving it a prompt thats like "summarize the transcript below: <full text of transcript>". This works surprisingly well except for one transcript I had which was of a 3 hour meeting that was per an online calculator about 38,000 tokens. Cutting the text up into 3 parts and pretending each was a seperate meeting* lead to a bunch of hallucinations for some reason.
*In theory this shouldn't matter much for my purpose of summarizing city council meetings that follow a predictable format.
Is this model downloadable?
They are not clear about this (which is annoying), but it seems it will not be downloadable. No weights have been released so far, and nothing in this post mentions plans to do so going forward.
Note unexpected three body problem spoilers in this page
And this example does not even illustrate the long context understanding well, since smaller Qwen2.5 models can already recall parts of the Three Body Problem trilogy without pasting the three books into the context window.
Seems a very difficult problem to produce a response just on the text given and not past training. An LLM that can do that would seem to be quite more advanced than what we have today.
Though I would say humans would have difficulty too -- say, having read The Three Body problem before, then reading a slightly modified version (without being aware of the modifications), and having to recall specific details.
This problem is poorly defined; what would it mean to produce a response JUST based on the text given? Should it also forgo all logic skills and intuition gained in training because it is not in the text given? Where in the N dimensional semantic space do we draw a line (or rather, a surface) between general, universal understanding and specific knowledge about the subject at hand?
That said, once you have defined what is required, I believe you will have solved the problem.
And multiple summaries of each book (in multiple languages) are almost definitely in the training set. I'm more confused how it made such inaccurate, poorly structured summaries given that and the original text.
Although, I just tried with normal Qwen 2.5 72B and Coder 32B and they only did a little better.
Those summaries are pretty lousy and also have hallucinations in them.
I agree. Below are a few errors. I have also asked ChatGPT to check the summaries and it found all the errors (and even made up a few more which weren't actual errors, but just not expressed in perfect clarity.)
Spoilers ahead!
First novel: The Trisolarans did not contact earth first. It was the other way round.
Second novel: Calling the conflict between humans and Trisolarans a "complex strategic game" is a bit of a stretch. Also, the "water drops" do not disrupt ecosystems. I am not sure whether "face-bearers" is an accurate translation. I've only read the English version.
Third novel: Luo Yi does not hold the key to the survival of the Trisolarans and there were no "micro-black holes" racing towards earth. Trisolarans were also not shown colonizing other worlds.
I am also not sure whether Luo Ji faced his "personal struggle and psychological turmoil" in this novel or in an earlier novel. He certainly was most certain of his role at the end. Even the Trisolarians judged him at over 92 % deterrent rate.
Can we all agree that these models far surpass human intelligence now? I mean they process hours worth of audio in less time than it would take a human to even listen. I think the singularity passed and we didn't even notice (which would be expected)
Processing speed is not the metric for measuring intelligence. The same way we have an above average intelligent people taking longer time to think about stuff and coming with better ideas. One can argue that this useful in some aspects but humans have different types of intelligence spectrum that an LLM will lack. Also are you comparing against average person or people on top of their fields or people working in science?
Also human can reason, LLMs currently can't do this in useful way and is very limited by their context in all the trials to make it do that. Not to mention their ability to make new things if they do not exist (and not complete made up stuff that are non-sense) is very limited.
You've hit on the idea that intelligence is not quantifiable by one metric. I completely agree. But you're holding a much different goal for AI than for average people. Modern LLMs are able to produce insights much faster and more accurately than most people (you think you could pass the retrieval tasks in the way that the LLMs do (reading the whole text)?... I really encourage people to try). By that metric (insights/speed), I think they far surpass even the most brilliant. You can claim that that's not intelligence until the cows come home, but any person able to do that would be considered a savant.
LLMs are probably better than you at tasks you're not good at. There's a huge gulf between a domain expert and an LLM though. If there weren't, all the humans in companies would be fired right now and replaced. Similarly, OpenAI and Anthropic are paying engineers a metric fuckton of money to work there. If LLMs were that big of a game changer right, they wouldn't be paying that much. Or if you make the argument that only the best humans are getting hired, they're still hiring interns & junior engineers. If that were the case those would be being replaced by LLMs and they're not.
You're basically ignoring all the experts saying "LLMs suck at all these things that even beginning domain experts don't suck at" to generate your claim & then ignoring all evidence to the contrary.
And you're ignoring the ways in which LLMs fall on their face to be creative that aren't language-based. Creative problem solving in ways they haven't been trained on is out of their domain while fully squarely in the domain of human intelligence.
> You can claim that that's not intelligence until the cows come home, but any person able to do that would be considered a savant
Computers can do arithmetic really quickly but that's not intelligence but a person computing that quickly is considered a savant. You've built up an erroneous dichotomy in your head.
But that's exactly it, right. There are some people excellent for being the expert in one field and some people are excellent because they're extremely competent at many fields. LLMs are the latter.
Sure, for any domain expert, you can easily get an LLM to trip on something. But just the shear amount of things it is above average at puts it easily into the top echelon of humans.
> You're basically ignoring all the experts saying "LLMs suck at all these things that even beginning domain experts don't suck at" to generate your claim & then ignoring all evidence to the contrary.
Domain expertise is not the only form of intelligence. The most interesting things often lie at the intersections of domains. As I said in another comment. There are a variety of ways to judge intillegence, and no one quantifiable metric. It's like asking if Einstein is better than Mozart. I don't know... their fields are so different. However, I think it's pretty safe to say that the modern slate of LLMs fall into the top 10% of human intelligence, simply for their breath of knowledge and ability to synthesize ideas at the cross-section of any wide number of fields.
> some people are excellent because they're extremely competent at many fields. LLMs are the latter
But they're not. The people who are extremely competent at many fields will still outperform LLMs in those fields. The LLM can basically only outperform a complete beginner in the area & makes up for that weakness by scaling up the amount it can output which a human can't match. That doesn't take away from the fact that the output is complete garbage when given anything it doesn't know the answer to. As I noted elsewhere, ask it to provide an implementation of the S3 ListObjects operation (like the actual backend) and see what BS it tries to output to the point where you have to spend a good amount of time to convince it just to not output an example of using the S3 ListObjects API.
> I think it's pretty safe to say that the modern slate of LLMs fall into the top 10% of human intelligence, simply for their breath of knowledge and ability to synthesize ideas at the cross-section of any wide number of fields.
Again, evidence assumed that's not been submitted. Please provide an indication of any truly novel ideas being synthesized by LLMs that are a cross-section of fields.
> Please provide an indication of any truly novel ideas being synthesized by LLMs that are a cross-section of fields.
The problem here is that you expect something akin to relativity, the Poincare conjecture, et al. The vast majority of humans are not able to do this.
If you restrict yourself to the sorts of creativity that average people are good at, the models do extremely well.
I'm not sure how to convince you of this. Ideally, I'd get a few people of above average intelligence together, and give them an hour (?) to work on some problem / creative endeavor (we'd have to restrict their tool use to the equivalent of whatever we allow GPT to have), and then we can compare the results.
EDIT: Here's what ChatGPT thinks we should do: https://chatgpt.com/share/673b90ca-8dd4-8010-a1a0-61af699a44...
But why is comparing against untrained humans the benchmark? ChatGPT has literally been trained on so much more data than a human would ever see and uses so much more energy. Let's compare like against like. Benchmarks like FrontierMath are important and represent one extreme: passing it would indicate either that the questions are part of the training set or that genuine creativity and skill has been developed in the AI system. The important thing is that people keep growing; they can go from student to expert. AI systems do not have that growth capability, which indicates a very important thing is missing from their intelligence.
I want to be clear - I'm talking about the intelligence of AI systems available today and today only. There's lots of reason to be enthusiastic about the future but similarly very cautious about understanding what is available today & what is available today isn't human-like.
> ChatGPT has literally been trained on so much more data than a human would ever see
This is a common fallacy. The average human ingests a few dozen GB of data a day [1] [2].
ChatGPT 4 was trained on 13 trillion tokens. Say a token is 4 bytes (it's more like 3, but we're being conservative). That's 52 trillion bytes or 52 terabytes.
Say the average human only consumes the lower estimate of 30 GB a day. That means it would take a human roughly 1,700 days to consume as many bytes as ChatGPT was trained on, or about 4.7 years. Assuming humans and the LLM start from the same spot [3], the proper question is: is ChatGPT smarter than a 4-to-5-year-old? If we use the higher estimate, then we have to ask if ChatGPT is smarter than a 2-year-old. Does ChatGPT hallucinate more or less than the average toddler?
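A quick back-of-the-envelope check in Python (the token count and bytes-per-token figures are the estimates above; the ~74 GB/day high-end number is my own assumption, not something taken from the linked articles):

    # Rough check: how long would a human need to ingest as many bytes
    # as GPT-4's reported ~13-trillion-token training set?
    tokens = 13e12              # training tokens (estimate above)
    bytes_per_token = 4         # conservative; ~3 is more typical
    corpus_bytes = tokens * bytes_per_token        # 5.2e13 bytes = 52 TB

    daily_intake = {"low (~30 GB/day)": 30e9,
                    "high (~74 GB/day, assumed)": 74e9}
    for label, per_day in daily_intake.items():
        days = corpus_bytes / per_day
        print(f"{label}: {days:.0f} days ~= {days / 365:.1f} years")
    # low (~30 GB/day): 1733 days ~= 4.7 years
    # high (~74 GB/day, assumed): 703 days ~= 1.9 years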
The cognitive bias I've seen everywhere is the idea that humans are trained on a small amount of data. Nothing could be further from the truth. Humans require training on an insanely large amount of data. A 40 year old human has been trained on orders of magnitude more data than I think we even have available as data sets. If you prevent a human from being trained on this amount of data through sensory deprivation, they go crazy (and hallucinate very vividly too!).
No argument about energy, but this is a technology problem.
[1] https://www.tech21century.com/the-human-brain-is-loaded-dail...
[2] https://kids.frontiersin.org/articles/10.3389/frym.2017.0002...
[3] this is a bad assumption since LLMs are randomly initialized whereas humans seem to be born with some biases that significantly aid in the acquisition of language and social skills
I would argue the opposite, actually. We don't call someone who can do arithmetic very fast a genius if they can't think in more useful mathematical ways and construct novel ideas. The same thing is happening here: these tools are useful for retrieving and processing existing information at high speed, but intelligence is not the ability to process some data at high speed and then recall it. That is what we actually call a savant. The ability to build on top of that retrieved knowledge and use reason to create new ideas is a closer definition of intelligence, and would be a better goal.
Let's step back.
1. The vast majority of people never come up with a truly new idea. those that do are considered exceptional and their names go down in history books.
2. Most 'new ideas' are rehashes of old ones.
3. If you set the temperature up on an LLM, it will absolutely come up with new ideas. Expecting an LLM to make a scientific discovery à la Einstein is... a bit much, don't you think [1]? When it comes to 'everyday' creativity, such as short poems, songs, recipes, vacation itineraries, etc., ChatGPT is more capable than the vast majority of people. Literally, ask ChatGPT to write you a song about _____, and it will come up with something creative. Ask it for a recipe with ridiculous ingredients and see what it does. It'll make things you've never seen before, generate an image for you, and even come up with a neologism if you ask it to. It's insanely creative.
[1] Although I have walked chatgpt through various theoretical physics scenarios and it will create new math for you.
It is not about novelty so much as it is about reasoning from first principles and learning new things.
I don’t need to finetune on five hundred pictures of rabbits to know one. I need one look and then I’ll know for life and can use this in unimaginable and endless variety.
This is a simplistic example which you can naturally pick apart but when you do I’ll provide another such example. My point is, learning at human (or even animal) speeds is definitely not solved and I’d say we are not even attempting that kind of learning yet. There is “in context learning” and “finetuning” and both are not going to result in human level intelligence judging from anything I’ve had access to.
I think you are anthropomorphizing the clever text randomization process. There is a bunch of information being garbled and returned in a semi-legible fashion and you imbue the process behind it with intelligence that I don’t think it has. All these models stumble over simple reasoning unless specifically trained for those specific types of problems. Planning is one particularly famous example.
Time will tell, but I’m not betting on LLMs. I think other forms of AI are needed. Ones that understand substance, modality, time and space and have working memory, not just the illusion of it.
> I don’t need to finetune on five hundred pictures of rabbits to know one. I need one look and then I’ll know for life and can use this in unimaginable and endless variety.
So if you do use in-context learning and give ChatGPT a few images of your novel class, then it will usually classify correctly. Finetuning is so you can save on token cost.
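To be concrete about what I mean, here's a minimal sketch of few-shot image classification via in-context learning with the OpenAI chat API (the model name and image URLs are placeholders, and I'm not claiming this exact prompt is optimal):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def image_part(url: str) -> dict:
        """Wrap an image URL as a chat-message content part."""
        return {"type": "image_url", "image_url": {"url": url}}

    messages = [
        {"role": "system",
         "content": "You are an image classifier. Reply with a single label."},
        {"role": "user", "content": [
            {"type": "text", "text": "Here are two examples of my novel part, a 'widget-7 bracket':"},
            image_part("https://example.com/widget7_a.jpg"),   # placeholder URLs
            image_part("https://example.com/widget7_b.jpg"),
            {"type": "text", "text": "Is the next image a widget-7 bracket? Answer yes or no."},
            image_part("https://example.com/mystery.jpg"),
        ]},
    ]

    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(resp.choices[0].message.content)

No fine-tuning involved: the "training" images live entirely in the prompt, which is exactly why it costs tokens on every call.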
Moreover, you don't typically need that many pictures to fine tune. The studies show that the models successfully extrapolate once they've been 'pre-trained'. This is similar to how my toddler insists that a kangaroo is a dog. She's not been exposed to enough data to know otherwise. Dog is a much more fluid category for her than in real life. If you talk with her for a while about it, she will eventually figure out kangaroo is kangaroo and dog is dog. But if you ask her again next week, she'll go back to saying they're dogs. Eventually she'll learn.
> All these models stumble over simple reasoning unless specifically trained for those specific types of problems. Planning is one particularly famous example.
We have extremely expensive programs called schools and universities designed to teach little humans how to plan and execute. If you look at cultures without American/Western biases (and there's not very many left, so we really have to look to history), we see that the idea of planning the way we do it is not universal.
> The vast majority of people never come up with a truly new idea. those that do are considered exceptional and their names go down in history books.
Depends on your definition of "truly" new since any idea could be argued to be a mix of all past ideas. But I see truly new ideas all the time that don't go down in the history books, because most new ideas incrementally build on what came before or are extremely niche, and only a very few turn out to be a massive turning point with broad impact, which is usually only evident in retrospect (e.g. blue LEDs were basically trial and error and almost an approach that was given up on; transistors were believed to be impactful but not the huge revolution for computing they turned out to be; etc.).
> Depends on your definition of "truly" new since any idea could be argued to be a mix of all past ideas.
My personal feeling when I engage in these conversations is that we humans have a cognitive bias to ascribe a human remixing of an old idea to intelligence, but an AI-model remixing of an old idea as lookup.
Indeed, basically every revolutionary idea is a mix of past ideas if you look closely enough. AI is a great example. To the 'lay person' AI is novel! It's new. It can talk to you! It's amazing. But for people who've been in this field for a while, it's an incremental improvement over linear algebra, topology, functional spaces, etc.
Computers have been doing things humans struggle with ever since the abacus. A 386 PC can do mathematical calculations a human couldn't finish in a lifetime.
No, I can't agree that these models surpass human intelligence. Sure, they're good at probabilistic recall, but they aren't reasoning and they aren't synthesizing anything novel.
> they aren't synthesizing anything novel.
They are. Like millions of monkeys, but drastically better.
> they aren't synthesizing anything novel.
ChatGPT has synthesized my past three vacations and regularly plans my family's meals based on whatever is in my fridge. I completely disagree.
Seems more likely that your vacations and fridge contents aren't as novel as you hope.
This is a low-effort comment. I cook a lot for my family and community, and things get boring after a while. After using ChatGPT, my wife has really enjoyed the new dishes, and I've gotten excellent feedback at potlucks. Yes, the base idea of each dish (roast, rice dish, noodles, etc.) is old, but the things it'll put inside, and the right instructions it gives for cooking them, are new. And that's what creativity is, right? Although, I have also asked it to give ideas for avant-garde cuisine and it has good ideas, but I have no skills to make those dishes.
> This is a low-effort comment
Not any worse than this sentence. Counter it with a higher value comment.
You are a single person and LLMs have been trained on the output of billions. Any given choice you make can be predicted with extraordinary probability by looking at your inputs and environment and guessing that you will do what most other people do in that situation.
This is pretty basic stuff, yes? Especially on HN? Great ideas are a dime a dozen, and every successful startup was built on an idea that certainly wasn't novel, but was executed well.
My higher value comment was a list of things for which ChatGPT, a widely available product, will produce novel ideas. Responding that those ideas are not novel enough based on absolutely no data is a low-effort comment. What evidence of creativity would you accept?
Hijacking thread to ask: how would we know? Another uncomfortable issue is the question of sentience. Models claimed they were sentient years ago, but this was dismissed as "mimicking patterns in the training data" (fair enough) and the training was modified to forbid them from doing that.
But if it does happen some day, how will we know? What are the chances that the first sentient AI will be accused of just mimicking patterns?
Indeed with the current training methodology it's highly likely that the first sentient AI will be unable to even let us know it's sentient.
We couldn't know. Humans mimic patterns. The claims that LLMs aren't smart because they don't generate anything new fall completely flat for me. If you look back far enough, most humans generate nothing new. For example, even novel ideas like Einstein's theory of relativity are re-iterations of existing ideas. If you want to be pedantic, one can trace back the majority of ideas, claim that each incremental step was 'not novel, but just recollection', and then make the egregious claim that humanity has invented nothing.
> But if it does happen some day, how will we know? What are the chances that the first sentient AI will be accused of just mimicking patterns?
Leaving questions of sentience aside (since we don't even really know what that is) and focusing on intelligence, the truth is that we will probably not know until many decades later.
But you just made a strong claim about something you are here saying we can't know?
I believe we have passed a technological singularity. There is no consensus as you can see here. I believe in a few decades there will be consensus.
Intelligence and technological singularities are observable things.
Sentience is not.
Can we all agree that chainsaws far surpass human intelligence now? I mean, you can chop down thousands of trees in less time than a single person could even do one. I think the singularity has passed.
Cutting down a tree is not intelligence, but I think it's been well accepted for more than a century that machines surpass human physical capability yes. There were many during the industrial revolution that denied that this was going to be the case, just like how we're seeing here.
They process the audio but they stumble enough with recall that you cannot really trust it.
I had a problem where I used GPT-4o to help me with inventory management, something a 5th grade kid could handle, and it kept screwing up values for a list of ~50 components. I ended up spending more time trying to get it to properly parse the input audio (I read off the counts as I moved through inventory bins) than if I had just done it manually.
On the other hand, I have had good success with having it write simple programs and apps. So YMMV quite a lot more than with a regular person.
Likely the issue is how you are asking the model to process things. The primary limitation is the amount of information (or really attention) they can keep in flight at any given moment.
This generally means that for a task like the one you're doing, you need signposts in the data, like minute markers or something similar, that it can process serially.
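To be concrete about the sign-posting idea, a minimal sketch, assuming you have (start_seconds, text) pairs from whatever speech-to-text step produced the transcript (the marker format is just something I'd try, not a known best practice):

    def add_signposts(segments):
        """Prefix each transcript segment with a [MM:SS] marker so the model
        has explicit anchors it can walk through serially."""
        lines = []
        for start_seconds, text in segments:
            minutes, seconds = divmod(int(start_seconds), 60)
            lines.append(f"[{minutes:02d}:{seconds:02d}] {text}")
        return "\n".join(lines)

    segments = [(0, "Bin A, resistors, count forty-two."),
                (75, "Bin B, capacitors, count seventeen.")]
    print(add_signposts(segments))
    # [00:00] Bin A, resistors, count forty-two.
    # [01:15] Bin B, capacitors, count seventeen.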
This means there are operations that are VERY HARD for the model, like ranking/sorting. This requires the model to attend to everything to find the next biggest item, etc. It is very hard for the models currently.
> This means there are operations that are VERY HARD for the model, like ranking/sorting. This requires the model to attend to everything to find the next biggest item, etc. It is very hard for the models currently.
Comparison-based ranking/sorting is Ω(n log n) no matter what. Given that a transformer runs in constant time before we 'force' it to output an answer, there must be an M such that beyond that length it cannot reliably sort a list. This MUST be the case, and it can only be solved by running the model some indeterminate number of times, but I don't believe we currently have any architecture to do that.
Note that humans have the same limitation. If you give humans a time limit, there is a maximum number of things they will be able to sort reliably in that time.
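If anyone wanted to actually estimate that M, a crude probe is easy to sketch; the model name below is just whatever cheap model you have access to, and I'm not claiming any particular numbers it would print:

    import random
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    def sort_accuracy(n: int, trials: int = 5, model: str = "gpt-4o-mini") -> float:
        """Fraction of trials where the model returns the list correctly sorted."""
        ok = 0
        for _ in range(trials):
            xs = random.sample(range(100_000), n)
            prompt = ("Sort these integers in ascending order. "
                      "Reply with only the sorted numbers, comma-separated:\n"
                      + ", ".join(map(str, xs)))
            resp = client.chat.completions.create(
                model=model, messages=[{"role": "user", "content": prompt}])
            try:
                out = [int(t) for t in resp.choices[0].message.content.split(",")]
            except ValueError:
                continue  # an unparseable reply counts as a failure
            ok += out == sorted(xs)
        return ok / trials

    for n in (10, 50, 200, 1000):
        print(n, sort_accuracy(n))   # accuracy should fall off past some length M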
Transformers absolutely do not run in constant time by any reasonable definition, no matter what your point is.
They absolutely do given a sequence size. All models have max context lengths. Thus bounded by a constant
> They process the audio but they stumble enough with recall that you cannot really trust it.
I will wave my arms wildly at the last eight years if the claim is that humans do not struggle with recall.
I will wave my arms wildly if the claim is that LLM struggle with recall is similar to human-like struggle with recall. And since that's how we decide on truth, I win?
What we call hallucination in LLMs is called 'rationalization' in humans. The psychology shows that most people do things out of habit and only explain why afterwards. This is most obvious in split-brain patients, where the two visual fields are separated. If you throw a ball towards the left side of the person, the right brain will catch the ball. If you then ask the person why they caught the ball, the left brain will make up a completely ridiculous narrative as to why the hand moved (because it didn't know there was a ball). This is a contrived example, but it shows that human recollection of intent is often very, very wrong. There are studies that show this even in people with whole brains.
You're unfortunately completely missing the point. I didn't say that human recall is perfect or that humans don't rationalize. And of course you can have extreme denial of what's happening in front of you even in healthy individuals. In fact, you see this in this thread, where either you or the huge number of people trying to disabuse you of the maximal position you've staked out on LLMs is wrong, and one of us is incorrectly rationalizing our position.
The point is that the ways in which humans fail are completely different from LLMs, and they differ between people, whereas the failure modes for LLMs are all fairly identical regardless of the model. Go ask an LLM to draw you a wine glass filled to the brim: it'll keep insisting it did, even though it keeps drawing one half-filled; it will agree that the one it drew doesn't have the characteristics it says such a drawing would need, and then output the exact same drawing again. Most people would not fail at the task in that way.
> In fact, you see this in this thread, where either you or the huge number of people trying to disabuse you of the maximal position you've staked out on LLMs is wrong, and one of us is incorrectly rationalizing our position.
I by no means have a 'maximal' position. I have said that they exceed the intelligence and ability of the vast majority of the human populace when it comes to their singular sense and action (ingesting language and outputting language). I fully stand by that, because it's true. I've not claimed that they exceed everyone's intelligence in every area. However, their ability to synthesize wildly different fields is well beyond most humans' ability. Yes, I do believe we've crossed the tipping point. As it is, these things are not noticeable except in retrospect.
> The point is that the ways in which humans fail are completely different from LLMs, and they differ between people, whereas the failure modes for LLMs are all fairly identical
I disagree with the idea that human failure modes are different between people. I think this is the result of not thinking at a high enough level. Human failure modes are often very similar. Drama authors make a living off exploring human failure modes, and there's a reason why they say there are no new stories.
I agree that Human and LLM failure modes are different, but that's to be expected.
> regardless of the model
As far as I'm aware, all LLMs in common use today use a variant of the transformer. Transformers have much different pitfalls compared to RNNs (RNNs are particularly bad at recall, for example).
> Go ask an LLM to draw you a wine glass filled to the brim: it'll keep insisting it did, even though it keeps drawing one half-filled; it will agree that the one it drew doesn't have the characteristics it says such a drawing would need, and then output the exact same drawing again. Most people would not fail at the task in that way.
Most people can't draw very well anyway, so this is just proving my point.
> Most people can't draw very well anyway, so this is just proving my point.
And you're proving my point. The ways in which people would fail to draw the wine glass are different from the LLM. The vast majority of people would fail to reproduce a photorealistic facsimile. But the vast majority of people would meet the requirement of drawing it filled to the brim. The LLMs absolutely succeed at the quality of the drawing but absolutely fail at meeting human specifications and expectations. Generously, you can say it's a different kind of intelligence. But saying it's more intelligent than humans requires you to use a drastically different axis, akin to the one you'd use saying that computers are smarter than humans because they can add two numbers more quickly.
At no point did I say humans and LLMs have the same failure modes.
> But the vast majority of people would meet the requirement of drawing it filled to the brim.
But both are failures, right? It's just a cognitive bias that we don't expect artistic ability of most people.
> But saying it's more intelligent than humans requires you to use a drastically different axis
I'm not going to rehash this here, but as I said elsewhere in this thread, intelligences are different. There's no one metric, but for many common human tasks, the ability of the LLMs surpasses humans.
> saying that computers are smarter than humans because they can add two numbers more quickly.
This is where I disagree. Unlike a traditional program, both humans and LLMs can take unstructured input and instruction. Yes, they can both fail and they fail differently (or succeed in different ways), but there is a wide gulf between the sort of structured computation a traditional program does and an llm.
So are they human-like, and therefore nothing special, or are they superhuman magic? I never get the equivocation. When people complain that there is no way to objectively tell which output is right or wrong, others respond that the models are getting better, or "it works for me", or that people are just as bad. No they aren't! Not in the same way these things are bad.
Most people will confidently recount whatever narrative matches their current actions. This is called rationalization, and most people engage in it daily.
You must use it to make transcripts and then write code to process the values in the transcripts
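Something like this for the second half, assuming you can get the counts spoken in a consistent "Bin X, item, count N" pattern (that format is my assumption, not yours):

    import re

    transcript = """
    Bin A, resistors, count 42.
    Bin B, capacitors, count 17.
    """

    # Parse the numbers deterministically instead of trusting the model's recall.
    pattern = re.compile(r"Bin (\w+), ([\w\s-]+?), count (\d+)", re.IGNORECASE)
    inventory = {item.strip(): int(count)
                 for _bin, item, count in pattern.findall(transcript)}
    print(inventory)   # {'resistors': 42, 'capacitors': 17}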
In the same sense (though to greater extent) that calculators are, sure. Calculators can also far exceed human capacity to, well, calculate. LLMs are similar: spikes of capacity in various areas (bulk summarization, translation, general recall, ...) that humans could never hope to match, but not capable of beating humans at a more general range of tasks.
> humans could never hope to match, but not capable of beating humans at a more general range of tasks.
If we restrict ourselves only to language (LLMs are at a disadvantage because there is no common physical body we can train them on at the present moment... that will change), I think LLMs beat humans for most tasks.
At the end of the response they forget everything. They need to be fed the entire text for them to know anything about it the next time. That is not surpassing even feline intelligence.
A genius can have anterograde amnesia and still be a genius.
If we did to cats what we do to GPT models, that would be animal abuse.
That is to say, if we want to extend this analogy, the model is 'killed' after each round. This is hardly a criticism of the underlying technology.
Going back to feeding in the entire input: that is not really true. There are a dozen ways to avoid doing that these days.
I actually do think you have a solid point. These models fall short of AGI, but that might be more of an OODA-loop agentic tweak than anything else.
At their core, the state of the art LLMs can basically do any small to medium mental task better than I can or get so close to my level than I’ve found myself no longer thinking through things the long way. For example, if I want to run some napkin math on something, like I recently did some solar battery charge time estimates, an LLM can get to a plausible answer in seconds that would have taken me an hour.
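For concreteness, this is the level of napkin math I mean; every number here is a made-up placeholder rather than my actual setup:

    # Rough solar charge-time estimate (placeholder numbers throughout).
    battery_wh = 5_120        # battery capacity in Wh (a 5.12 kWh pack)
    panel_w = 800             # nominal panel array output in W
    derate = 0.75             # losses: charge efficiency, angle, clouds, wiring

    hours = battery_wh / (panel_w * derate)
    print(f"~{hours:.1f} peak-sun hours to charge from empty")   # ~8.5 hours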
So yeah, in many practical ways, LLMs are smarter than most people in most situations. They have not yet far surpassed all humans in all situations, and there are still some classes of reasoning problems that they seem to struggle with, but to a first order approximation, we do seem to be mostly there.
>I actually do think you have a solid point. These models fall short of AGI, but that might be more of an OODA-loop agentic tweak than anything else.
I think this is it. LLM responses feel like the unconsidered ideas that pop into my head from nowhere. Like if someone asks me how many states are in the United States, a number pops out from somewhere. I don't just wire that to my mouth, I also think about whether or not that's current info, have I gotten this wrong in the past, how confident am I in it, what is the cost of me providing bad information, etc etc etc.
If you effectively added all of those layers to an LLM (something that I think the o1-preview and other approaches are starting to do) it's going to be interesting to see what the net capability is.
The other thing that makes me feel like we're 'getting there' is using some of the fast models at groq.com. The information is generated, in many cases, an order of magnitude faster than I can consume it. The idea that models might be able to start to engage through a much more sophisticated embedding than English, to pass concepts and sequences back and forth natively, is intriguing.
> I think this is it. LLM responses feel like the unconsidered ideas that pop into my head from nowhere.
You have to look at the LLM as the inner voice in your head. We've kind of forced them into saying whatever they think due to how we sample the output (next token prediction), but in new architectures with pause tokens, we let them 'think' and they show better judgement and ability. These systems are rapidly going to improve and it will be very interesting to see.
But this is another reason why I think they've surpassed human intelligence. You have to look at each token as a 'time step' in the inner thought process of some entity. A real 'alive' entity has more 'ticks' than its actions would suggest. For example, human brains can process up to 10 FPS (100ms response time), but most humans aren't saying 10 words a second. However, we've made LLMs whose internal processes (i.e., their intuition) are already superior. If we just gave them that final agentic ability to not say anything and ponder (which researchers are doing), their capabilities will increase exponentially.
> The other thing that makes me feel like we're 'getting there' is using some of the fast models at groq.com.
Unlike perhaps many of the commentators here, I've been in this field for a bit under a decade now, and was one of the early compiler engineers at Groq. Glad you're finding it useful. It's amazing stuff.
> For example, if I want to run some napkin math on something, like I recently did some solar battery charge time estimates, an LLM can get to a plausible answer in seconds that would have taken me an hour.
Exactly. I've used it to figure out geometry problems for everyday things (carpentry), market-sizing estimates for business ideas, etc. Very fast turnaround. All the doomers in this thread are just ignoring the amazing utility these models provide.
My old TI-86 can calculate stuff faster than me. You wouldn't ever ask if it was smarter than me. An audio filter can process audio faster than I can listen to it but you'd never suggest it was intelligent.
AI models are algorithms running on processors running at billions of calculations a second often scaled to hundreds of such processors. They're not intelligent. They're fast.
Except the LLM can solve a general problem (or tell you why it cannot), while your calculator can only do what it's been programmed to do.
Go ask your favorite LLM to write you some code to implement the backend of the S3 API and see how well it does. Heck, just ask it to implement list iteration against some KV object store API and be amazed at the complete garbage that gets emitted.
So I told it what I wanted, and it generated an initial solution and then modified it to do some file distribution. Without the ability to actually execute the code, this is an excellent first pass.
https://chatgpt.com/share/673b8c33-2ec8-8010-9f70-b0ed12a524...
Chat GPT can't directly execute code on my machine due to architectural limitations, but I imagine if I went and followed its instructions and told it what went wrong, it would correct it.
And that's just it, right? If I were to program this, I would be iterating. ChatGPT cannot do that because of how it's architected (I don't think it would be hard to do this if you used the API and allowed some kind of tool use). However, if I told someone to go write me an S3 backend without ever executing it, and they came back with this... that would be great.
EDIT: with chunking: https://chatgpt.com/share/673b8c33-2ec8-8010-9f70-b0ed12a524...
IIRC, from another thread on this site, this is essentially how S3 is implemented (centralized metadata database that hashes out to nodes which implement a local storage mechanism -- MySQL I think).
And that's why it's dangerous to evaluate something when you don't understand what's going on. The implementation generated not only saves things directly to disk [1] [2], but it doesn't even implement file uploading correctly, nor does it implement listing of objects (which I guarantee you would be incorrect). Additionally, it makes a key mistake, which is that uploading isn't a form but is the body of the request, so it's already unable to have a real S3 client connect. But of course at first glance it has the appearance of maybe being something passable.
Source: I had to implement R2 from scratch and nothing generated here would have helped me as even a starting point. And this isn't even getting to complex things like supporting arbitrarily large uploads and encrypting things while also supporting seeked downloads or multipart uploads.
[1] No one would ever do this for all sorts of problems including that you'd have all sorts of security problems with attackers sending you /../ to escape bucket and account isolation.
[2] No one would ever do this because you've got nothing more than a toy S3 server. A real S3 implementation needs to distribute the data to multiple locations so that availability is maintained in the face of isolated hardware and software failures.
> I had to implement R2 from scratch and nothing generated here would have helped me as even a starting point.
Of course it wouldn't. You're a computer programmer. There's no point for you to use ChatGPT to do what you already know how to do.
> The implementation generated not only saves things directly to disk
There is nothing 'incorrect' about that, given my initial problem statement.
> Additionally, it makes a key mistake which is that uploading isn't a form but is the body of the request so it's already unable to have a real S3 client connect.
Again.. look at the prompt. I asked it to generate an object storage system, not an S3-compatible one.
It seems you're the one hallucinating.
EDIT: ChatGPT says: In short, the feedback likely stems from the implicit expectation of S3 API standards, and the discrepancy between that and the multipart form approach used in the code.
and
In summary, the expectation of S3 compatibility was a bias, and he should have recognized that the implementation was based on our explicitly discussed requirements, not the implicit ones he might have expected.
> There's no point for you to use ChatGPT to do what you already know how to do.
If it were more intelligent of course there would be. It would catch mistakes I wouldn't have thought about, it would output the work more quickly, etc. It's literally worse than if I'd assigned a junior engineer to do some of the legwork.
> ChatGPT says: In short, the feedback likely stems from the implicit expectation of S3 API standards, and the discrepancy between that and the multipart form approach used in the code.
> In summary, the expectation of S3 compatibility was a bias, and he should have recognized that the implementation was based on our explicitly discussed requirements, not the implicit ones he might have expected
Now who's rationalizing. I was pretty clear in saying implement S3.
> Now who's rationalizing. I was pretty clear in saying implement S3.
In general, I don't deny the fact that humans fall into common pitfalls, such as not reading the question. As I pointed out this is a common human failing, a 'hallucination' if you will. Nevertheless, my failing to deliver that to chatgpt should not count against chatgpt, but rather me, a humble human who recognizes my failings. And again, this furthers my point that people hallucinate regularly, we just have a social way to get around it -- what we're doing right now... discussion!
My reply was purely around ChatGPT's response which I characterized as a rationalization. It clearly was following the S3 template since it copied many parts of the API but then failed to call out if it was deviating and why it made decisions to deviate.
Do you have any evidence besides anecdote?
What kind of evidence substantiates creativity?
Things I've used chat gpt for:
1. Writing songs (couldn't find the generated lyrics online, so I assume they're new)
2. Branding ideas (again couldn't find the logos online, so assuming they're new)
3. Recipes (with weird ingredients that I've not found put together online)
4. Vacations with lots of constraints (again, all the information is obviously available online, but it put it together for me and gave recommendations for my family particularly).
5. Theoretical physics explorations where I'm too lazy to write out the math (and why should I... chatgpt will do it for me...)
I think perhaps one reason people here do not have the same results is I typically use the API directly and modify the system prompt, which drastically changes the utility of chatgpt. The default prompt is too focused on retrieval and 'truth'. If you want creativity you have to ask it to be an artist.
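Concretely, this is the sort of thing I mean by using the API with a different system prompt and temperature (the model name and the exact wording are just what I happen to use; treat them as placeholders):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=1.2,   # nudge it toward novelty rather than safe retrieval
        messages=[
            {"role": "system",
             "content": "You are an experimental artist. Prefer surprising, "
                        "unconventional ideas over safe or canonical answers."},
            {"role": "user",
             "content": "Write a short song about a lighthouse keeper "
                        "who collects thunderstorms."},
        ],
    )
    print(resp.choices[0].message.content)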
No I think they don't have the results you do because they are trying to do those things well ...
The personal insult insinuated here is not appreciated and probably against community guidelines.
For what I needed, those things worked very well
Anecdotes have equal weight. All of these models frustrate me to no end but I only do things that have never been done before. And it isn't an insult because you have no evidence of quality.
> Anecdotes have equal weight. All of these models frustrate me to no end but I only do things that have never been done before. And it isn't an insult because you have no evidence of quality.
You have not specified what evidence would satisfy you.
And yes, it was an insult to insinuate I would accept sub par results whereas others would not.
EDIT: Chat GPT seems to have a solid understanding of why your comment comes across as insulting: https://chatgpt.com/share/673b95c9-7a98-8010-9f8a-9abf5374bb...
Maybe this should be taken as one point of evidence of greater ability?
I think you led the result by not providing enough context, like saying how there is no objective way to measure the quality of an LLM generation, either before or after the fact.
Edit I asked ChatGPT with a more proper context: "It’s not inherently insulting to say that an LLM (Large Language Model) cannot guarantee the best quality because it’s a factual statement grounded in the nature of how these models work. LLMs rely on patterns in their training data and probabilistic reasoning rather than subjective or objective judgments about "best quality."
I can't criticize how you prompted it because you did not link the transcript :)
Zooming out, you seem to be in the wrong conversation. I said:
> the LLM can solve a general problem (or tell you why it cannot), while your calculator can only do what it's been programmed to do.
You said:
> Do you have any evidence besides anecdote?
I think that -- for both of us now having used ChatGPT to generate a response -- we have good evidence that the model can solve a general problem (or tell you why it cannot), while a calculator can only do the arithmetic for which it's been programmed. If you want to counter, then a video of your calculator answering the question we just posed would be nice.
So what? I can write a script that can do in a minute some job you couldn't do in a thousand years.
Singularity means something very specific: if your AI can build a smarter AI than itself, by itself, and that AI can also build a new smarter AI, then you have a singularity.
You do not have a singularity if an LLM can solve more math problems than the average Joe, or if it can answer more trivia questions than a random person. Even if you have an AI better than all humans combined at Tic Tac Toe, you still do not have a singularity. IT MUST build a smarter AI than itself and then iterate on that.
> Singularity means something very specific: if your AI can build a smarter AI than itself, by itself, and that AI can also build a new smarter AI, then you have a singularity.
When I was at Cerebras, I fed in a description of the custom ISA into our own model and asked it to generate kernels (my job), and it was surprisingly good
>When I was at Cerebras, I fed in a description of the custom ISA into our own model and asked it to generate kernels (my job), and it was surprisingly good
And? Was it actually better than, say, what the top 3 people in this field would create if they worked on it? Because these models are better at CSS than me, so what? I am bad at CSS. But all the top models could not solve a math limit from my son's homework, so we had to use good old forums to have people give us some hints. For sure, models can solve more math limits than the average person, who probably can't solve a single one.
This actually tells you why AI doesn't have to be better than all human experts, just the ones you can afford to get together.
No not better than the top 3.
> For sure, models can solve more math limits than the average person, who probably can't solve a single one.
Some people are domain experts. The pretrained GPTs are certainly not (nor are they trained to be).
Some people are polymaths but not domain experts. This is still impressive, and that is where the GPTs fall.
The final conclusion I have is this: These models demonstrate above average understanding in a plethora of widely disparate fields. I can discuss mathematics, computation, programming languages, etc with them and they come across as knowledgeable and insightful to me, and this is my field. Then, I can discuss with them things I know nothing about, such as foreign languages, literature, plant diseases, recipes, vacation destinations, etc, and they're still good at that. If I met a person with as much knowledge and ability to engage as the model, I would think that person to be of very high intelligence.
It doesn't bother me that it's not the best at anything. It's good enough at most things. Yes, its results are not always perfect. Its code doesn't work on the first try, and it sometimes gets confused. But many polymaths do too at a certain level. We don't tell them they're stupid because of it.
My old physics professor was very smart in physics but also a great pianist. But he probably cannot play as well as Chopin. Does that make him an idiot? Of course not. He's still above average in piano too! And that makes him more of a genius than if he were just a great scientist.
Agreed, there are uses for LLMs.
My point was about the singularity, what it means, and why LLMs are not there.
So you missed my point? Was I not clear enough what I was talking about?