Qwen2.5-Coder-32B is an LLM that can code well that runs on my Mac

(simonwillison.net)

160 points | by pabs3 a day ago ago

60 comments

mythz a day ago ago
Qwen2.5 Coder 32B is great for an OSS model, but in my testing (ollama) Sonnet 3.5 yields noticeably better results, a lot more than what the provided benchmarks suggest.
Best thing about it is that it's an OSS model that can be hosted by anyone, resulting in an open competitive market bringing hosting costs down, currently sitting at $0.18/$0.18 M tok/s [1] making it 50x cheaper than Sonnet 3.5 and ~17x cheaper than Haiku 3.5.
[1] https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct
[-]
- Vetch a day ago ago
  Claude Sonnet 3.5s are bars too high to clear. No other model comes close, with the occasional exception of o1-preview. But o1-preview is always a gamble, your rolls are limited and it will either be the best answer possible from an LLM or it returns after a wild goose chase, having talked itself into a tangled mess of confusion.
  I'd personally rank the Qwen2.5 32B model only a little behind GPT4o at worst, and preferable to gemini 1.5 pro 002 (at code only, Gemini is a model that's surprisingly bad at code considering its top class STEM reasoning).
  This makes Qwen2.5-coder-32B astounding all considered. It's really quite capable and is finally an accessible model that's useful for real work. I tested it on some linear algebra, discussed pros and cons of a belief propagation based approach to SAT solving, had it implement a fast simple approximate nearest neighbor based on the near orthogonality of random vectors in high dimensions (in OCaml, not perfect with but close enough to useful/easily correctable), simulate execution of a very simple recursive program (also Ocaml) and write a basic post processing shader for Unity. It did really well on each of those tasks.
  [-]
  - dragonsh a day ago ago
    Not really tried the Claude 3.5, later tried o1-preview on github models and recently Qwen2.5 32B for a prompt to generate a litestar[0] app to manage a wysiwyg content using grapesjs[1] and use pelican[2] to generate static site. It generated very bad code and invented many libraries in import which didn't exist. Cluade was one of the worst code generator, later tried sieve of atkin to generate primes to N and then use miller-rabin test to test each generated prime both using all the cpu core available. Claude completely failed and could never get a correct code without some or the other errors especially using multiprocess, o1-preview got it right in first attempt, Qwen 2.5 32B got it right in 3'rd error fix. In general for some very simple code Claude is correct but when using something new it completely fails, o1-preview performs much better. Give a try to generate some manim community edition visualization using Claude, it generates something not working correct or with errors, o1-preview does much better job.
    In most of my test o1-preview performed way better than Claude and Qwen was not that bad either.
    [0] https://github.com/litestar-org/litestar
    [1] https://grapesjs.com/
    [3] https://getpelican.com/
- faangguyindia a day ago ago
  Too bad zed editor doesn't have code completion via custom LLM
  So far I am using zed editor and can't switch to somethints else untill zed editor gets the update which support FIM directive via custom LLMs
  [-]
  - aargh_aargh a day ago ago
    This question keeps popping up but I don't get it. Everyone and their dog has an OpenAI-compatible API. Why not just serve a local LLM and put api.openai.com 127.0.0.1 in your hosts file?
    [-]
    - tbocek a day ago ago
      There is a difference between chat and code completion. While with chat, you can use localhost with llama.cpp, but code completion you cannot do that: https://github.com/zed-industries/zed/issues/12519.
      The config for chat, you can do:
      "language_models": { "openai": { "version": "1", "api_url": "http://localhost:8080", "low_speed_timeout_in_seconds": 120, "available_models": [ { "provider": "openai", "name": "Qwen2.5-Coder-7B-Instruct-Q8_0.gguf", "display_name": "llama.cpp", "max_tokens": 131072 } ] } },
      While for code completion, you have two choices atm: supermaven and copilot: https://zed.dev/docs/completions.
      [-]
      - aargh_aargh a day ago ago
        Thank you, makes sense. I haven't used code completion yet.
  - satvikpendem a day ago ago
    You could...switch editors? Why not do that until Zed gets that support?
- Tiberium a day ago ago
  Yeah, benchmarks are one thing but when you actually interact with the model it becomes clear very fast how "intelligent" the model actually is, by doing or noting small things that other models won't. 3.5 Sonnet v1 was great, v2 is already incredible.
- LeoPanthera a day ago ago
  ...but you can't run it locally. Not unless you're sitting on some monster metal. It's tiresome when people compare enormous cloud models to tiny little things. They're completely different.
  [-]
  - mythz a day ago ago
    > ...but you can't run it locally. Not unless you're sitting on some monster metal.
    I'm getting a very usable ~18 tok/s running it on 2x NVIDIA A4000 (32GB VRAM).
    Both GPUs cost less than USD $1,400 on eBay.
    qwen2.5-coder:32b is 19GB on ollama [1]
    [1] https://ollama.com/library/qwen2.5-coder:32b
    [-]
    - menaerus a day ago ago
      I believe that parent comment was referring to the point that Sonnet 3.5 cannot be run locally which it obviously cannot but not because of locally available compute but because it's not OSS.
    - FloatArtifact a day ago ago
      I'm curious if the M4 Max will be good enough.
      [-]
      - simonw a day ago ago
        That will run Qwen 2.5 Coder 32B just fine. I'm using a M2 Max with 64GB of RAM.
  - a day ago ago
    [deleted]
  - guerrilla a day ago ago
    Yeah, over 65GB VRAM... that'd be expensive but not impossible. I think three RTX 4090's could do it, with their 24GB each.
    [-]
    - csomar a day ago ago
      Only 21.3GB is required.
      [-]
      - guerrilla a day ago ago
        Is this wrong? It says 65.8GB. If it's wrong, what source should I be using instead?
        https://llm.extractum.io/model/Qwen%2FQwen2.5-Coder-32B-Inst...
        [-]
        exe34 a day ago ago
        the ollama one is probably quantised.
      - tmikaeld a day ago ago
        ... with 4-bit quant and 4000 token context.
manamorphic a day ago ago
I heard conflicting things about it. Some claim it was trained so it can do well on benchmarks and in real world scenarios it's lacking. Can somebody deny/confirm ?
[-]
- dodslaser a day ago ago
  What else should you train for? If the benchmark dosn't represent real world scenarios, isn't that a problem with the benchmark rather than the model?
  [-]
  - isoprophlex a day ago ago
    If your benchmark covers all possible programming tasks then you dont need an llm, you need search over your benchmark.
    Hypothetically let's say the benchmark contains "test divisibility of this integer by n" for all n of the form 3x+1. An extremely overfit llm won't be able to code divisibility for all n not of the form 3x+1, and your benchmark will never tell.
    [-]
    - YetAnotherNick a day ago ago
      No, because solving a well defined problem with well defined right or wrong is generally not what people use llm for. Most of the times my query to llm is underspecified, and lot of time I figure out the problem when chatting with LLM. And benchmark by definition only measures just right/wrong answer.
  - LeoPanthera a day ago ago
    This is called Goodhart's law, who said: "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."
    But in modern usage it is often rephrased to: "When a measure becomes a target, it ceases to be a good measure"
    https://en.wikipedia.org/wiki/Goodhart%27s_law
  - exitb a day ago ago
    Overfitting is a concern.
  - ithkuil a day ago ago
    It's more subtle than this:
    https://en.m.wikipedia.org/wiki/Training,_validation,_and_te...
- anonzzzies a day ago ago
  It's small... And for that size, it does very well. Been using it a few days and it's quite good for it's size and the fact you can run it locally. So not sure if it's true what you say; for us it works really well.
- csomar a day ago ago
  I tried the Qwen2.5 32B a couple weeks ago. It was amazing for a model that can run on my laptop but far from Claude/GPT-4o. I am downloading the coder tuned version now.
- tyler33 a day ago ago
  i tried qwen and it is surprinsingly good, maybe not as good as claude but could replaced it
fareesh a day ago ago
I like the idea of offline LLMs but in practice there's no way I'm wasting battery life on running a Language Model.
On a desktop too, I wonder if it's worth the additional stress and heat on my GPU as opposed to one somewhere in a datacenter which will cost me a few dollars per month, or a few cents per hour if I spin up the infra myself on demand.
Super useful for confidential / secret work though
[-]
- InsideOutSanta a day ago ago
  In my experience, a lot of companies still have rules against using tools like Copilot due to security and copyright concerns, even though many software engineers just to ignore them.
  This could be a way to satisfy both sides, although it only solves the issue of sending internal data to companies like OpenAI, it doesn't solve the "we might accidentally end up with somebody else's copyrighted code in our code base" issue.
- dizhn a day ago ago
  What provider/service do you use for this?
a day ago ago
[deleted]
kundi a day ago ago
It seems fine-tuned for benchmarking more than the actual tasks
tucnak a day ago ago
The issue with some recent models is that they're basically overfitting on public evals, and it's not clear who's the biggest offender—OpenAI, or the Chinese? And regardless, "Mandelbrot in plaintext" is a bad choice for evaluation's sake. The public datasets are full of stuff like that. You really want to be testing stuff that isn't overfit to death, beginning with tasks that notoriously don't generalise all too well, all the while being most indicative of capability: like, translating a program that is unlikely to have been included in the training set verbatim, from a lesser-known language—to a well-known language, and back.
I'd be shocked if this model held up in the comprehensive private evals.
[-]
- simonw a day ago ago
  That's why I threw in "same size as your terminal window" for the Mandelbrot demo - I thought that was just enough of a tweak to avoid exact regurgitation of some previously trained program.
  I have not performed comprehensive evals of my own here - clearly - but I did just enough to confirm that the buzz I was seeing around this model appeared to hold up. That's enough for me to want to write about it.
  [-]
  - tucnak a day ago ago
    Hey, Simon! Have you ever considered to host private evals? I think, with the weight of the community behind you, you could easily accumulate a bunch of really high-quality, "curated" data, if you will. That is to say, people would happily send it to you. More people should self-host stuff like https://github.com/lm-sys/FastChat without revealing their dataset, I think, and we would probably trust it more than the public stuff, considering they already trust _you_ to some extent! So far the private eval scene is just a handful of guys on twitter reporting their findings in unsystematic manner, but a real grassroots approach backed up by a respectable influencer would go a long way to change that.
    Food for thought.
    [-]
    - simonw a day ago ago
      Honestly I don't think I have the right temperament for being a reliable source for evals. I've played around with a few ideas - like "Pelicans on a bicycle" https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/ - but running evals well on an ongoing basis requires a focus and attention to detail that isn't a great fit for how I work.
      Here's Qwen 2.5 Coder 32B for "Generate an SVG of a pelican riding a bicycle" https://gist.github.com/simonw/56217af454695a90be2c8e09c7031...
      [-]
      - tucnak a day ago ago
        "Pelican on a bicycle" is really fun, honestly. I hope it doesn't go the way of the unicorn! :-D
- JimDabell a day ago ago
  > The issue with some recent models is that they're basically overfitting on public evals… You really want to be testing stuff that isn't overfit to death… I'd be shocked if this model held up in the comprehensive private evals.
  From the announcement:
  > we selected the latest 4 months of LiveCodeBench (2024.07 - 2024.11) questions as the evaluation, which are the latest published questions that could not have leaked into the training set, reflecting the model’s OOD capabilities.
  — https://qwenlm.github.io/blog/qwen2.5-coder-family/
  [-]
  - tucnak a day ago ago
    They say a lot of things, like that their base models weren't instruction-tuned, however people have confirmed that it's impossible to find instruction that it wouldn't follow, and the output would indicate that exactly. The labs absolutely love incorporating public evals in their training; of course, they're not going to admit that.
- isoprophlex a day ago ago
  All the big guys are hiring domain experts - serious brains, phd level, in some cases - to build bespoke train and test data for their models.
  As long as Jensen Huang keeps shitting out nvidia cards, progress is just a function of cash to burn on paying humans to dump their knowledge into train data... and hoping this silly transformer architecture keeps holding up
  [-]
  - Wheatman a day ago ago
    Interestingly enough, the new "Orion" model by OpenAI doesnt outperform, and even sometimes underperforms in programing tasks, when compared to GPT-4.
    There is an interesing discussion about it here:https://news.ycombinator.com/item?id=42104964.
  - tucnak a day ago ago
    > All the big guys are hiring domain experts - serious brains, phd level, in some cases
    I don't know where this myth had originated, and perhaps it was true at least at some point, but you just have to consider that all the recent major advances in datasets had to do with _unsupervised_ reward models, synthetic, generational datasets, and new advanced alignment methods. The big labs _are_ hiring serious PhD level researchers, and most of these are physicists, Bayesians of many kind and breed, not "domain experts." However, perception matters a lot these days; some labs, I won't point, but OpenAI is probably the biggest offender, simply cannot control themselves. The fact of the matter is they LOVE including the public evals in their finetuning, as it makes them appear stronger in the "benchmarks."
  - kwlranb a day ago ago
    The PhD folks will steal from Stackoverflow and leetcode solutions. Just another laundering buffer.
    Hardly any PhD has the patience or skill for that matter to code robust solutions from scratch. Just look at PhD code in the wild.
- f1shy a day ago ago
  > like, translating a program that is unlikely to have been included in the training set verbatim, from a lesser-known language—to a well-known language, and back.
  I would exactly want to see that, or "make a little interpreter for a basic subset of C, or Scheme or <X>".
  [-]
  - tucnak a day ago ago
    So far, non-English inputs have been most telling: I deal with Ukrainian datasets mostly, and what we see is OpenAI models, the Chinese models, of course, and Llama's, to admittedly, lesser extent—all degrading disproportionately compared to the other models. You know what model degrades the least comparatively? Gemma 27b. The arena numbers would suggest it's not so strong, but they'd actually managed to make something useful for the whole world (I can only judge re: Ukrainian, of course, but I suspect it's probably equally good in the other languages, too.) However, nothing can compete currently with Sonnet 3.5 in reasoning. I predict a surge in the private eval scene when people inevitably grow wary of leaderboard-propaganda.
    More people should host https://github.com/lm-sys/FastChat
joseferben a day ago ago
@simonw what is the token/s like on your 64gb m2 mbp?
[-]
- simonw a day ago ago
  With MLX:
```
    Prompt: 49 tokens, 95.691 tokens-per-sec
    Generation: 723 tokens, 10.016 tokens-per-sec
    Peak memory: 32.685 GB
```
  [-]
  - joseferben a day ago ago
    so quite usable, thanks!
nenadst a day ago ago
try it with something more "obscure" like e.g. creating an OPC/Ua Server from scratch.
It fails spectacularly - the code wont even compile and there are many things missing to get a working solution.
[-]
- simonw a day ago ago
  Can you share a transcript?
a day ago ago
[deleted]
a day ago ago
[deleted]
adricg23 a day ago ago
[dead]
doomlaser a day ago ago
[flagged]
kwlranb a day ago ago
Xerox is a photocopier that can write a Shakespeare drama if I put the correct book on it.
What is the end goal here? Reduction of the workforce by 90% while the remaining 10% click buttons to produce buggy, insecure and bloated code?
[-]
- smusamashah a day ago ago
  We (engineers specially) have been automating things ever since. Why/when should it ever stop?
  [-]
  - simonw a day ago ago
    Right. Have your heard of open source? Biggest concerted effort to save each other time by avoiding repeating work I've ever seen.