8x H100 can also do fine-tuning, right? Does Cerebras offer fine-tuning support?
I'm assuming hedge funds are using LLMs to dissect information from company news and SEC reports as soon as possible and then make trading decisions. Having faster inference would be a huge advantage.
This gets tons of press and discussion here on HN, but frankly AMD has a better overall product with the upcoming MI325x [0].
I love to see the development and activity, but companies like Cerebras are trying to compete on a single use case and doing a poor job of it because they can only offer a tightly controlled API.
Ask yourself how much capex + power/space/cooling (opex) it requires to run that model (and how many people it can really serve) and then compare that against what AMD is offering.
Genuinely curious and willing to learn: what are the different inference approaches, broadly? Is there any difference in approach between Cerebras and simplismart.ai, which claims to be the fastest?
Cerebras features in the internal OpenAI emails that recently came out. One example:
Ilya Sutskever to Elon Musk, Sam Altman, (cc: Greg Brockman, Sam Teller, Shivon Zilis) - Sep 20, 2017 2:08 PM
> In the event we decide to buy Cerebras, my strong sense is that it'll be done through Tesla. But why do it this way if we could also do it from within OpenAI?
They expect that some of the cores on the wafer will fail, so they have redundant links all throughout the chip, so they can seal off/turn off any cores that fail and still have enough cores to do useful work.
The fabric can effectively route signals diagonally to work around an individual defective core, with a displacement of one position for cores in the same row from that defect over to the nearest spare core. That's how they get away with a claimed "1–1.5%" of spare cores.
Nvidia’s target is performance across concurrent users and they are likely already outperforming Cerebras there as far as costs are concerned. They have no reason to try to beat the single user performance of this.
Is it just me, or is the most important contender on speed, Groq, missing from the comparison?
Not sure why it matters to put Azure there; no one uses it for speed.
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.
I’m curious how they’re doing it. Obviously the standard bag of tricks (eg, speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?
Cerebras makes CPUs with ~1 million cores, and they're running inference on those, not on GPUs. It's an entirely different architecture, which means no network is involved. It's possible they're also serving this largely from on-chip caches rather than HBM.
I recommend the TechTechPotato YouTube videos on Cerebras to understand more of their chip design.
> TechTechPotato YouTube videos on Cerebras
https://www.youtube.com/@TechTechPotato/search?query=cerebra... for anyone also looking. there are quite a lot of them.
I hope we can buy Cerebras cards one day. Imagine buying a ~$500 AI card for your desktop and having easy access to 70B+ models (the price is speculative/made up).
I believe pricing was mid six figures per machine. They're also something like 8U and water cooled, I believe. I doubt it would be possible to deploy one outside of a fairly top-tier colo facility where they have the ability to support water cooling. Also, imagine having to learn a new CUDA, but one designed for a completely different compute model.
Based on their S1 filing and public statements, the average cost per WSE system for their (~90% of their total revenue) largest customer is ~$1.36M, and I’ve heard “retail” pricing of $2.5M per system. They are also 15U and due to power and additional support equipment take up an entire rack.
The other thing people don't seem to be getting in this thread is that just holding the weights for 405B at FP16 requires 19 of their systems, since it is SRAM only… rounding up to 20 to account for program code + KV cache for the user context would mean 20 systems/racks, so well over $20M. The full rack (including support equipment) also consumes 23 kW, so we are talking nearly half a megawatt and ~$30M for them to be getting this performance on Llama 405B.
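Back-of-envelope, that capacity and power math looks like this (a sketch; the 44 GB of SRAM per system and 23 kW per rack figures come from elsewhere in this thread):

```python
# Rough check: how many SRAM-only Cerebras systems does 405B at FP16 need?
params = 405e9
bytes_per_param = 2                      # FP16
sram_per_system_gb = 44                  # per-wafer SRAM figure quoted in this thread
rack_power_kw = 23                       # full rack incl. support equipment

weights_gb = params * bytes_per_param / 1e9               # ~810 GB of weights
systems_for_weights = weights_gb / sram_per_system_gb     # ~18.4 -> 19 systems
systems_total = 20                                        # + program code and KV cache
print(f"weights: {weights_gb:.0f} GB -> {systems_for_weights:.1f} systems for weights alone")
print(f"{systems_total} racks x {rack_power_kw} kW = {systems_total * rack_power_kw} kW")
```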
Thank you, far better answer than mine! Those are indeed wild numbers, although interestingly "only" 23 kW; I'd expect the same level of compute in GPUs to draw quite a lot more than that, or at least at higher power density.
You get ~400 TFLOP/s from an H100 for 350 W. You need (2 * tokens/s * param count) FLOP/s. For 405B at 969 tok/s you just need 784 TFLOP/s, which is just 2 H100s.
The limiting factor with GPU for inference is memory bandwidth. For 969 tok/s in int8, you need 392 TB/s memory bandwidth or 200 H100s.
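The arithmetic behind both of those comments, as a sketch (the per-H100 numbers are the commonly quoted ballpark specs and are assumptions here):

```python
# Compute-bound vs memory-bound estimate for 405B at 969 tok/s, batch size 1
params = 405e9
tok_per_s = 969

# Compute side: ~2 FLOPs per parameter per generated token
flops_needed = 2 * params * tok_per_s            # ~7.8e14 FLOP/s
h100_flops = 400e12                              # rough dense TFLOP/s figure assumed above
print(f"{flops_needed/1e12:.0f} TFLOP/s ~= {flops_needed/h100_flops:.1f} H100s of compute")

# Memory side: at batch size 1, every weight byte is read once per token
bytes_per_param = 1                              # int8
bw_needed = params * bytes_per_param * tok_per_s # ~3.9e14 B/s
h100_bw = 2e12                                   # ~2 TB/s HBM per H100 assumed
print(f"{bw_needed/1e12:.0f} TB/s ~= {bw_needed/h100_bw:.0f} H100s of bandwidth")
```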
Memory bandwidth and memory size. Along with power/cooling density.
Hence why you see AMD's MI325X coming out with 256 GB of HBM3e, but with the same FLOPs as the MI300X. 6 TB/s of bandwidth too, which outperforms the H200's by a lot.
You can see the direction AMD is going with this...
https://www.amd.com/en/products/accelerators/instinct/mi300/...
Memory bandwidth for inferencing does not scale with the number of GPUs. Scaling instead requires more concurrent users. Also, I am told that 8 H100 cards can achieve 600 to 1000 tokens per second with concurrent users.
8 H100s could achieve a lot more than 1000 tokens/sec.
> Memory bandwidth for inferencing does not scale with the number of GPU
It does
> For 969 tok/s in int8, you need 392 TB/s memory bandwidth
I think that math is only valid for batch size = 1. When these 969 tokens/second come from multiple sessions of the same batch, loaded model tensor elements are reused to compute many tokens for the entire batch. With large enough batches, you can even saturate compute throughput of the GPU instead of bottlenecking on memory bandwidth.
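Here's a toy roofline model of that effect (a sketch; the aggregate bandwidth and FLOP numbers for an 8x H100 node are illustrative assumptions, and KV-cache reads and communication are ignored):

```python
# Toy roofline: aggregate tok/s vs batch size when weights are re-read every step
params = 405e9
bytes_per_param = 1                  # int8 weights
agg_bw = 8 * 3.35e12                 # B/s: 8x H100 SXM HBM, assumed
agg_flops = 8 * 400e12               # FLOP/s: dense throughput assumption from above

def tokens_per_s(batch):
    t_mem = params * bytes_per_param / agg_bw     # read all weights once per decode step
    t_compute = 2 * params * batch / agg_flops    # ~2 FLOPs/param/token, grows with batch
    return batch / max(t_mem, t_compute)

for b in (1, 8, 64, 512):
    print(f"batch {b:>3}: ~{tokens_per_s(b):,.0f} tok/s aggregate")
# batch 1 is bandwidth-bound (~66 tok/s here); large batches saturate compute instead
```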
They claim to obtain that number with 8 to 20 concurrent users:
https://x.com/draecomino/status/1858998347090325846
From what I have read, it is a maximum of 23 kW per chip and each chip goes into a 16U. That said, you would need at least 460 kW power to run the setup you described.
As for retail pricing being $2.5 million, I read $2 million in a news article earlier this year. $2.5 million makes it sound even worse.
Thank you for the breakdown. Bit of an emotional journey.
"$500 in the future...? Oh, $30 million now, so that might be a while..."
It took 30 years for computers to go from entire rooms to desktops, and another 30 years to go from desktops to our pockets.
I don't know if we can extrapolate, but I can imagine AI inference on our desktops for $500 in a few years...
Well, we can do AI inference on our desktops for $500 today, just with smaller models and far slower.
> Based on their S1 filing and public statements
Is it a good stock to buy :)
Given those details, they seem not much better on cost per token than Nvidia-based systems.
That means it'll be close to affordable in 3 to 5 years if we follow the curve we've been on for the past decades.
How have power and cooling been doing with respect to chip improvements? Have power requirements per operation been coming down rapidly, as other features have improved?
My recollection from PC CPUs is that we've gotten many more operations per second, and many more operations per second per dollar, but that the power and corresponding cooling requirements for the CPUs have tended to go up as well. I don't really know what power per operation has looked like there. (I guess it's clearly improved, though, because it seems like the power consumption of a desktop PC has only increased by a single order of magnitude, while the computational capacity has increased by more than that.)
A reason that I wonder about this in this context is that people are saying that the power and cooling requirements for these devices are currently enormous (by individual or hobbyist standards, not by data center standards!). If we imagine a Moore's Law-style improvement where the hardware itself becomes 1/10 or 1/100 of its current price, would we expect the overall power consumption to be similarly reduced, or to remain closer to its current levels?
Moore's law in the consumer space seems to be pretty much asymptoting now, as indicated by Apple's amazing MacBooks with an astounding 8GB of RAM. Data center compute is arguable, as it tends to be catered to some niche, making it confusing (Cerebras as an example vs GPU datacenters vs more standard HPC). Also, clusters and even GPUs don't really fit into Moore's law as originally framed.
Apple doesn’t sell those anymore.
Aw man, are they selling only 4GB ones now?
More seriously, even 16GB was essentially the 'norm' in consumer PCs about 15 years ago.
Not really. These are wafer-scale chips, which (as far as I'm aware) were first introduced by Cerebras.
Cost reduction for cutting-edge products in the semiconductor industry has historically been driven by 1) reducing transistor size (by following the Dennard scaling laws), and 2) a variety of techniques (e.g. high-k dielectrics and strained silicon, or FinFETs and now GAAFETs) to improve transistor performance further. These techniques added more steps during manufacturing, but they were inexpensive enough that $/transistor still kept falling. In the last few years, we've had to pull off ever more expensive tricks, which stopped the $/transistor progress. This is why the phrase "Moore's law is dead" has been circulating for a while.
In any case, higher-performance transistors mean that you can get the same functionality for less power and a smaller area, meaning that iso-functionality chips are cheaper to build in bulk. This is especially true for older nodes, e.g. look at the absurdly low price of most microcontrollers.
On the other hand, $/wafer is mostly a volume-related metric based on less scalable technology and more conventional manufacturing (relatively speaking). Cerebras's innovation was in making a wafer-scale chip possible, which is conventionally hard due to unavoidable manufacturing defects. But crucially, such a product (by definition) cannot scale like any other circuit produced so far.
It may for sure drop in price in the future, especially once it gets obsolete. But I don't expect it to ever reach consumer level prices.
Wafer-scale chips have been attempted for many decades, but none of the attempts before Cerebras resulted in a successful commercial product.
The main reason why Cerebras has succeeded and the previous attempts have failed is not technical, but the existence of market demand.
Before ML/AI training and inference, there has been no application where wafer-scale chips could provide enough additional performance to make their high cost worthwhile.
Cerebras has a patent on the technique used to etch across scribe lines. Is there any prior work that would invalidate that patent?
By the way, I am a software developer, so you will not see me challenging their patent. I am just curious.
It will also mean 405B models will be uninteresting in 3 to 5 years if we follow the curve we've been on for the past decades.
I don't think they'll be uninteresting. They won't be cutting-edge anymore, sure, but much of the more practical applications of AI that we see today don't run on today's cutting-edge models, either. We're always going to have a certain compute budget, and if a smaller model does the job fine, why wouldn't you use it, and use the rest for something else (or use all of it to run the smaller model faster).
Yeah you can see the cooling requirements by looking at their product images. https://cerebras.ai/wp-content/uploads/2021/04/Cerebras_Prod...
The thing is nearly all cooling. And look at the diameter of the water cooling pipes. The airflow guides on the fans are solid steel. Apparently the chip itself measures about 21.5 cm on a side. Insane.
Parent wishes 70b not 405b though
Yeah but what is in a 4090 is also comparable to a whole rack of servers a decade ago. The tech will get smaller.
One day is doing some heavy heavy lifting here, we’re currently off by ~3-4 orders of magnitude…
Thank you, for the reality check! :)
We have moved 2 orders of magnitude in the last year. Not that unreasonable
So 1000-10000 days? ;)
In a few thousand days (c) St. Altman
lol I almost said that too
You still have to pay for the memory. The Cerebras chip is fast because they use 700x more SRAM than, say, A100 GPUs. Loading the whole model in SRAM every time you compute one token is the expensive bit.
Maybe not $500, but $500,000
Ah, makes a lot more sense now.
Also, the WSE-3 pulls 15 kW. https://www.eetimes.com/cerebras-third-gen-wafer-scale-chip-...
But 8x H100 are ~2.6-5.2 kW (I get conflicting info, I think based on PCIe vs SXM), so anywhere between roughly even and up to 2x as efficient.
They are doing it with custom silicon with several times more area than 8x H100s. I’m sure they are doing some sort of optimization at execution/runtime, but the primary difference is the sheer transistor count.
https://cerebras.ai/product-chip/
To be specific, a single WSE-3 has the same die area as about 57 H100s. It's a big chip.
It is worth splitting out the stacked memory silicon layers on both too (if Cerebras is set up with external DRAM memory). HBM is over 10 layers now so the die area is a good bit more than the chip area, but different process nodes are involved.
Amazing!
They have a chip the size of a dinner plate. Take a look at the pictures: https://cerebras.ai/product-chip/
21 petabytes per second. Can push the whole internet over that chip xD
The number for that is, I believe, 1 terabit/s, or 125 GB/s -- the 21 PB/s is the speed from the SRAM (~registers) to the cores (~ALUs) across the whole chip. It's not especially impressive as SRAM speeds go. The impressive thing is that they have an enormous amount of SRAM.
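To see why the aggregate number is less exotic than it sounds, divide it by the core count (a sketch; ~900k is the ballpark core count quoted for the WSE-3):

```python
# 21 PB/s of aggregate SRAM bandwidth spread over ~900k cores is modest per core
aggregate_sram_bw = 21e15            # B/s, the figure discussed above
cores = 900_000                      # approximate WSE-3 core count (assumption)
print(f"~{aggregate_sram_bw / cores / 1e9:.0f} GB/s per core")  # ~23 GB/s, ordinary for local SRAM
```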
That's their on chip cache bandwidth. Usually that stuff isn't even measured in bandwidth but latency.
I'd love to see the heatsink for this lol
They call it the “engine block”!
https://www.servethehome.com/a-cerebras-cs-2-engine-block-ba...
what kind of yield do they get on that size?
Part of their technology is managing/bypassing defects.
It's near 100%. Discussed here:
https://youtu.be/f4Dly8I8lMY?t=95
Cerebras is a chip company. They are not using GPUs. Their chip uses wafer scale integration which means it's the physical size of a whole wafer, dozens of GPUs in one.
They have limited memory on chip (all SRAM) and it's not clear how much HBM bandwidth they have per wafer. It's a completely different optimization problem than running on GPU clusters.
They do not use HBM. Off-chip memory is accessible at 150 GB/s.
they have about 125GB/s of off-chip bandwidth
Do they just not do HBM at all or
I'm not too up to date, but as I recall there are a lot of weirdnesses because of how big their chip is (e.g. thermal expansion being a problem). I believe they have a single giant line in the middle of the chip for this reason. Maybe this makes HBM etc. hard? Certainly their chip would be more appealing if they cut the number of cores by 10x, added matrix units, and added HBM, but it looks like they're not going to go that way.
There are two big tricks: their chips are enormous, and they use SRAM as their memory, which is vastly faster than the HBM used by GPUs. In fact, this is the main reason it's so fast. Groq gets its speed for the same reason.
How much memory do you need to run FP8 Llama 3 70B - can it potentially fit on one H100 GPU with 96GB of RAM?
In other words, if you wanted to run 8 separate 70B models on your cluster, each of which would fit into one GPU, how much larger could your overall token output be than parallelizing one model across 8 GPUs and having things slowed down a bit due to NVLink?
It’s been a minute so my memory might be off but I think when I ran 70b at fp16 it just barely fit on a 2x A100 80GB cluster but quickly OOMed as the context/kv cache grew.
So if I had to guess a 96GB H100 could probably run it at fp8 as long as you didn’t need a big context window. If you’re doing speculative decoding it probably won’t fit because you also need weights and kv cache for the draft model.
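A rough memory estimate behind that guess (a sketch; the layer/head counts are Llama 3 70B's published architecture, and activation overhead is ignored):

```python
# Rough fit check: Llama 3 70B in FP8 weights on a single 96 GB GPU
params = 70e9
weight_bytes = params * 1                         # FP8 weights ~= 70 GB

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim = 80, 8, 128           # Llama 3 70B (GQA)
kv_per_token = 2 * layers * kv_heads * head_dim * 2   # FP16 KV cache, ~0.33 MB/token

leftover = 96e9 - weight_bytes
print(f"weights: {weight_bytes/1e9:.0f} GB, leftover: {leftover/1e9:.0f} GB")
print(f"~{leftover/kv_per_token:,.0f} tokens of KV cache before activations/overhead")
```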
It should work, I believe. And anything that doesn't fit you can leave on your system RAM.
Looks like an H100 runs about $30K online for one. Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?
> Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?
Cooling might be a challenge. The H100 has a heatsink designed to make use of the case fans. So you need a fairly high airflow through a part which is itself passive.
On a server this isn't too big a problem: you have fans in one end and GPUs blocking the exit on the other end, but in a desktop you probably need to get creative with cardboard/3D-printed shrouds to force enough air through it.
Imagine if you could take Llama 3.1 405B and break it down to a tree of logic gates, optimizing out all the things like multiplies by 0 in one of the bits, etc... then load it into a massive FPGA-like chip that had no von Neumann bottleneck, was just pure compute without memory access latency, with a conservative 1 GHz clock rate.
Such a system would be limited by the latency across the reported 126 layers' worth of math involved before it could generate the next token, which might be as much as 100 µs. So it would be 10x faster, but you could have thousands of other independent streams pipelined through in parallel, because you'd get a token per clock cycle out the end.
In summary, 1 Gigatoken/second, divided into 100,000 separate users each getting 10k tokens/second.
This is the future I want to build.
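The arithmetic behind that throughput claim, as a sketch (the 126-layer figure, ~100 µs pipeline latency, and 1 GHz clock are the assumptions stated above):

```python
# Pipelined "token per clock" thought experiment
clock_hz = 1e9                   # conservative 1 GHz clock
pipeline_latency_s = 100e-6      # ~100 µs to traverse the 126 layers of logic
streams = 100_000                # independent sequences kept in flight in the pipeline

single_stream_tok_s = 1 / pipeline_latency_s   # autoregression: ~10,000 tok/s per stream
aggregate_tok_s = clock_hz                     # one token exits the pipeline every cycle
print(f"single stream: {single_stream_tok_s:,.0f} tok/s")
print(f"aggregate: {aggregate_tok_s/1e9:.0f} Gtoken/s, or "
      f"{aggregate_tok_s/streams:,.0f} tok/s each for {streams:,} users")
```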
See Convolutional Differentiable Logic Gate Networks https://arxiv.org/abs/2411.04732 , which is a small step in that direction.
I'm actively trying to learn how to do exactly this, though I'm just getting started with FPGAs now so probably a very long range goal.
There is not enough memory attached to FPGAs to do this. Some FPGAs come with 16GB of HBM attached, but that is not enough, and the bandwidth provided is not as high as it is on GPUs. You would need to work out how to connect enough memory chips simultaneously to get high bandwidth and enough capacity in order for an FPGA solution to be performance-competitive with a GPU solution.
Instead of separate memory/compute, I propose to fuse them.
Nah. Try vLLM and 405B FP8 on that hardware. And make sure you’re benchmarking with some concurrency for max TPS.
Related recent discussion on twitter: https://x.com/Teknium1/status/1858987850739728635
Looks like other folks get 80 tok/s with max batch size, that's surprising to me but vLLM is definitely more optimized than my implementation.
Check out BaseTen for performant use of GPUs
I'm not sure if they're comparing apples to apples on the latency here. There are roughly three parts to the latency: the throughput of the context/prompt, the time spent queueing for hardware access, and the other standard API overheads (network, etc).
From what I understand, several, maybe all, of the comparison services are not based on provisioned capacity, which means that the measurements include the queue time. For LLMs this can be significant. The Cerebras number on the other hand almost certainly doesn't have some unbounded amount of queue time included, as I expect they had guaranteed hardware access.
The throughput here is amazing, but to get that throughput at a good latency for end-users means over-provisioning, and it's unclear what queueing will do to this. Additionally, does that latency depend on the machine being ready with the model, or does that include loading the model if necessary? If using a fine-tuned model does this change the latency?
I'm sure it's a clear win for batch workloads where you can keep Cerebras machines running at 100% utilisation and get 1k tokens/s constantly.
Everyone presumes this is under ideal conditions...and it's incredible.
It's bs=1. At 1,000 t/s. Of a 405B parameter model. Wild.
They claim it is with between 8 and 20 users:
https://x.com/draecomino/status/1858998347090325846
That said, they appear to be giving the per user performance.
Cerebras' benchmark is most likely under ideal conditions, but I'm not sure it's possible to test public cloud APIs under ideal conditions as it's shared infrastructure so you just don't know if a request is "ideal". I think you can only test these things across significant numbers of requests, and that still assumes that shared resource usage doesn't change much.
I'm not talking about that. I and many others here have spun up 8x or more H100 clusters and run this exact model. Zero other traffic. You won't come anywhere close to this.
In that case I'm misunderstanding you. Are you saying that it's "BS" that they are reaching ~1k tokens/s? If so, you may be misunderstanding what a Cerebras machine is. Also, 8x H100 is still ~half the price of a single Cerebras machine, and that's even accounting for H100s being massively overpriced. You've got easily twice the value in a Cerebras machine; they have nearly 1M cores on a single die.
Ha ha. He probably means "at a batch size of 1", i.e. not even using some amortization tricks to get better numbers.
Ah! That does make more sense!
Right, I'd assume most LLM benchmarks are run on dedicated hardware.
For what you can do with current-gen models, along with RAG, multi-agent setups & code interpreters, the wall is very much model latency, not accuracy, any more.
There are so many interactive experiences that could be made possible at this level of token throughput from 405B class models.
How can a rule book help fix incidents? I mean, I hope every incident is novel, since you solve the root issue. So every time you need to dig into the code, or the recently deployed code, and correlate it with your production metrics.
Or is the rulebook a simple rollback?
Like what..
You can create massive variants of OpenAI's o1 model. The "Chain of Thought" tools become way more useful when you can iterate 100x faster. Right now, flagship LLMs stream responses back and barely beat the speed a human can read, so adding CoT makes it really slow for human-in-the-loop experiences. You can really get a lot more interesting "thoughts" (or workflow steps, or whatever) when it can do more without slowing down the human experience of using the tool.
You can also get a lot fancier with tool-usage when you can start getting an LLM to use and reply to tools at a speed closer to the speed of a normal network service.
I've never timed it, but I'm guessing current LLMs don't handle "live video" type applications well. Imagine an LLM you could actually video chat with - it'd be useful for walking someone through a procedure, or advanced automation of GUI applications, etc.
AND the holy grail of AI applications that would combine all of this: robotics. Today, Cerebras chips are probably too power hungry for battery-powered robotic assistants, but one could imagine a Star Wars-style robot assistant many years from now. You could have a robot that can navigate some space (a home or work setting), see its environment, and act on it, processing the video in real time. Then it can reason about the world and its given task by explicitly thinking through steps and critically self-challenging those steps.
> barely beat the speed a human can read
4o is way faster than a human can read.
Imagine increasing the quality and FPS of those AI-generated minecraft clones and experiencing even more high-quality, realtime AI-generated gameplay
(yeah, I know they are doing textual tokens. but just sayin..)
edit: context is https://oasisaiminecraft.com/
To be clear, a Cerebras chip consumes a whole wafer and has only 44 GB of SRAM on it. To fit a 405B model in bf16 precision (excluding KV cache and activation memory usage) you need 19 of these "chips" (and the requirement will grow as the sequence length increases, because of the KV cache). Looking online, it seems one wafer can fit between 60 and 80 H100 dies, so it's equivalent to using >1,500 H100s, using wafer manufacturing cost as a metric.
The budget these companies spend on this tech is seriously mind boggling to me.
Is wafer cost a major factor in the actual chip price?
This is seriously impressive performance. I think there's a high probability Nvidia attempts to acquire Cerebras.
They're considering an IPO. I'd say an acquisition is unlikely. Even then, they'd be worth more to Facebook or MS.
No, they would make a capital infusion on paper and then make Cerebras buy more hw from that money on paper, thus showing huge revenues on Nvidia books.
Makes sense… or whatever
They're making custom chips right? Why would Cerebras buy hardware from Nvidia?
They have a waitlist for trying their API. You have to be a bit skeptical when a company makes claims but does not offer their services for purchase.
No mention of their direct competitor Groq?
I'm a happily-paying customer of Groq but they aren't competitive against Cerebras in the 405b space (literally at all).
Groq has paying customers below the enterprise level and actually serves all their models to everyone broadly, unlike Cerebras, who are very selective, so they have that going for them. But in terms of sheer speed and in the largest models, Groq doesn't really compare.
Is this because 405B doesn't fit on Groq? If they perform better, I would have liked to see that too.
When 405B first launched, Groq ran it; it's not currently running due to capacity issues though.
Sambanova is not often mentioned either [0]. One of its co-founders is known as the "father of the multi-core processor" [1].
[0]: https://sambanova.ai/
[1]: https://en.wikipedia.org/wiki/Kunle_Olukotun
The fact that such a boost is possible with new hardware, I wonder what the ceiling is for improving performance for training via hardware as well.
Not enormous without significant changes to the ML. There are two pieces to this: improving efficiency and improving flops.
Improving flops is the most obvious way to improve speed, but I think we're pretty close to physical limits for a given process node and datatype precision. It's hard to give proof positive of this, but there are a few lines of evidence. One is that the fundamental operation of LLMs, matrix multiplication, is really simple (unlike e.g. CPU work), so all the control flow logic and the like is already pretty minimized. We're largely spending electricity on doing the matrix multiplications themselves, and the matrix multiplications are in fact electricity-bound[1]. There are gains to be made by changing precision, but this is difficult and we're close to tapped out on it in my opinion (we're already at very low precisions, e.g. fp8 can't represent 17, and new research is showing limitations).
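To make the "fp8 can't represent 17" aside concrete, here's a small sketch that enumerates which values a standard E4M3 float can take near 17 (illustrative only, not a library-accurate emulation):

```python
# Normal E4M3 (fp8) values: (1 + m/8) * 2^(e - 7); only 3 mantissa bits
values = sorted(
    (1 + m / 8) * 2 ** (e - 7)
    for e in range(1, 15)        # normal exponent fields; enough to cover values near 17
    for m in range(8)
)
print([v for v in values if 14 <= v <= 20])
# [14.0, 15.0, 16.0, 18.0, 20.0] -- above 16 the step size is 2, so 17 falls in a gap
```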
Efficiency in LLM training is measured with a very punishing standard, "Model Flops Utilization" (MFU), where we divide the theoretical number of flops necessary to implement the mathematical operation by the theoretical number of flops the hardware could have provided in that time. We're able to get 30% without thinking (just FSDP), and 50-60% is not implausible/unheard of. The inefficiency is largely because 1) the hardware can't provide the number of flops it says on the tin for various reasons and 2) we have to synchronize terabytes of data across tens of thousands of machines. The theoretical limit here is 2x, but in practice there's not a ton to eke out.
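As a concrete sketch of how MFU is computed (the throughput and peak numbers below are illustrative assumptions, not measurements from any particular run):

```python
# Model FLOPs Utilization: model FLOPs actually required / peak FLOPs available
def training_mfu(params, tokens_per_s_per_gpu, peak_flops_per_gpu):
    model_flops = 6 * params * tokens_per_s_per_gpu   # ~6 FLOPs/param/token (fwd + bwd)
    return model_flops / peak_flops_per_gpu

# Illustrative: 405B params, 160 training tokens/s/GPU, ~989 TFLOP/s dense BF16 peak
print(f"MFU ~= {training_mfu(405e9, 160, 989e12):.0%}")   # ~39%
```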
There will be gains but they will be mostly focused on reducing NVIDIA's margin (TPU), on improving process node, on reducing datatype (B100), or on enlarging the size of a chip to reduce costly cross-chip communication (B100). There's not room for a 10x (again at constant precision and process node).
[1]: https://www.thonking.ai/p/strangely-matrix-multiplications
The ultimate solution would be to convert an LLM to a pure ASIC.
My guess is that would 10X the performance. But then it's a very very expensive solution.
There's some interesting research on using stacked flat lenses to build analog, physical neural-network inference that operates directly on light (each lens is a hidden layer). If we managed to make this work for non-trivial cases, it could be absurdly fast.
Why would converting a specific LLM to an ASIC help you? LLMs are like 99% matrix multiplications by work and we already have things that amount to ASICs for matrix multiplications (e.g. TPU) that aren't cheaper than e.g. H100
An ASIC could have all of the weights baked into the design, completely eliminating the von Neumann bottleneck that plagues computation.
They are inherently parallel, so you might be able to get a token per clock cycle. A billion tokens per second opens quite a few possibilities.
It could also eliminate all of the multiplications or additions of bits that are 0 from the design, shrinking each multiplier by about 50 percent in silicon area, on average.
However, an ASIC is a speculation that all the design tools work. It may require multiple rounds to get it right.
I doubt you could have a token per clock cycle unless it is very low clocked. In practice, even dedicated hardware for matrix-matrix multiplication does not perform the multiplication in a single clock cycle. Presumably, the circuit paths would be so large that you would need to have a very slow clock to make that work, and there are many matrix multiplications done per token. Furthermore, things are layered and must run through each layer. Presumably if you implement this you would aim for 1 layer per clock cycle, but even that seems like it would be quite long as far as circuit paths go.
I have some local code running Llama 3 8B, and the matrix multiplications in it are done on 2D matrices with dimensions ranging from 1024 to 4096. Let's just go with a nice 1024x1024 matrix and do matrix-vector multiplication, which is the minimum needed to implement Llama 3. That is 1,048,576 elements. If you try to do the matrix-vector multiplication in 1 cycle, you will need 1,048,576 fmadd units.
I am by no means a chip designer, so I asked ChatGPT to estimate how many transistors are needed for a bf16 fmadd unit. It said 100,000 to 200,000. Let's go with 100,000 transistors per unit. Thus, to implement a single matrix multiplication according to your idea, we would need over 100 billion transistors, and this is only a small part of the Llama 3 8B model's calculations. You would probably be well into the trillions of transistors if you implemented all of it in an ASIC and did 1 layer per cycle (don't even think of 1 token per cycle). For reference, Nvidia's H100 has 80 billion transistors. The WSE-3 has 4 trillion transistors, and I am not sure if even that would be enough.
It is a nice idea, but I do not think it is feasible with current technology. That said, I do like your out-of-the-box thinking. This might be a bit too far out of the box, but there is probably a middle ground somewhere.
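For reference, the arithmetic in that estimate, as a sketch (the ~100k transistors per bf16 fmadd unit is the rough figure quoted above and is an assumption):

```python
# Transistor budget for a fully unrolled 1024x1024 matrix-vector multiply
fmadd_units = 1024 * 1024                 # one fused multiply-add unit per weight
transistors_per_fmadd = 100_000           # rough estimate quoted above (assumption)
total = fmadd_units * transistors_per_fmadd
print(f"{total/1e9:.0f}B transistors for one mat-vec")   # ~105B
print("vs ~80B in an H100 and ~4,000B in a WSE-3")
```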
You're right in the numbers, I wasn't thinking of trying to push all of that into one chip, but if you can distribute the problem such that an array of chips can break the problem apart cleanly, the numbers fall within the range of what's feasible with modern technology.
The key to this, in my view, is to give up on the idea of trying to get the latency as low as possible for a given piece of computation, as is typically done, and instead try to make reliable small cells that are clocked, so that you don't have to worry about getting data far or fast. Taking this idea to its limit yields a completely homogeneous systolic array that operates on 4 bits at a time, using look-up tables to do everything. No dedicated switching fabric, multipliers, or anything else.
It's the same tradeoff von Neumann made with the ENIAC, which slowed it down by a factor of 6 (according to Wikipedia) but eliminated multiple weeks of human setup labor, since stored programs load effectively instantly.
To multiply numbers, you don't have to do all of the work at once; you just have to pipeline the steps so that each of them is operating on part of the data at any given moment, and it all stays synced (which the clocking again helps with).
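Here is a toy software model of that pipelining idea, under my own arbitrary choices (8-bit second operands, one bit position handled per stage): no stage does much work per clock, yet once the pipeline fills, one finished product comes out every cycle.

```python
# Toy model of a pipelined shift-and-add multiplier: each stage folds in one
# bit position's partial product, so per-clock work stays tiny, but a finished
# product emerges every cycle once the pipeline is full. Illustration only.

BITS = 8  # 8 stages for 8-bit second operands

def stage_work(item, i):
    """Work done by stage i: add the partial product for bit i of y, if set."""
    if item is None:
        return None
    x, y, acc = item
    if (y >> i) & 1:
        acc += x << i
    return (x, y, acc)

def run_pipeline(pairs):
    """Feed one (x, y) pair per cycle; collect x*y products as they emerge."""
    pipe = [None] * BITS                 # pipe[i] has had stages 0..i-1 applied
    feed = [(x, y, 0) for x, y in pairs]
    results = []
    for _cycle in range(len(pairs) + BITS):
        finished = stage_work(pipe[-1], BITS - 1)       # last stage completes
        if finished is not None:
            results.append(finished[2])
        for i in range(BITS - 1, 0, -1):                # clock edge: shift stages
            pipe[i] = stage_work(pipe[i - 1], i - 1)
        pipe[0] = feed.pop(0) if feed else None         # accept a new operand pair
    return results

pairs = [(3, 5), (7, 9), (12, 12), (255, 255)]
assert run_pipeline(pairs) == [x * y for x, y in pairs]
```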
Since I'm working alone, right now I'm just trying to get something that other people can grok, and play with.
Ideally, I'd have chips with multiple channels of LVDS interfaces running at 10 Gbps or more each to allow meshing the chips. Mostly, they'd be vast strings of D flip flops and 16:1 multiplexers.
I'm well aware that I've made arbitrary choices, and they might not be optimal for real-world hardware. I do remain steadfast in my opinion that providing a better impedance match between the computing substrate and the code that runs on it could allow multiple orders of magnitude improvement in efficiency. Not to mention the ability to run the exact same code on everything from an emulator to every successive version/size of the chip, without recompilation.
Not to mention being able to route around bad cells, actually build "walls" around code with sensitive info, etc.
I am wondering what it costs to serve at such latency. For customers, of course, the price depends on the pricing strategy, but the underlying cost really determines how widely this can be adopted. Is it only for businesses that really need the latency, or can it be deployed more generally?
Maybe it could become standard for everyone to make giant chips and use SRAM?
How many SRAM manufacturers are there? Or does it somehow need to be fully integrated into the chip?
SRAM is usually produced on the same wafer as the rest of the logic. SRAM on an external chip would lose many of the advantages without being significantly cheaper.
Yes, the limiting factor for bandwidth is generally the number of pins, which are not cheap, and you can only have a few thousand on a chip. The absolute state of the art is 36 Gb/s/pin [1], and your $30 RAM might do 6 Gb/s/pin [2].
[1]: https://en.wikipedia.org/wiki/GDDR7_SDRAM
[2]: https://en.wikipedia.org/wiki/DDR5_SDRAM
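As a rough illustration of why the pins dominate: aggregate bandwidth is just pin count times per-pin rate. The per-pin rates below come from the links above; the bus widths are illustrative assumptions.

```python
# Aggregate off-chip bandwidth = data pins x per-pin rate. Per-pin rates are
# from the links above; the bus widths here are illustrative assumptions.

def bandwidth_gbps(data_pins: int, gbit_per_s_per_pin: float) -> float:
    """Return aggregate bandwidth in GB/s."""
    return data_pins * gbit_per_s_per_pin / 8

print(bandwidth_gbps(384, 36))   # ~1728 GB/s: a wide 384-bit GDDR7 bus
print(bandwidth_gbps(64, 6))     # ~48 GB/s: a single 64-bit DDR5 channel
```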
The cost is not the memory technology per se but primarily the wires. SRAM is fast because it's directly inside the chip, so the connections to the logic that does the work are cheap because they're short.
I'd like to see a tokens / second / watt comparison.
Their hardware is cool and bizarre. It has to be seen in person to be believed. It reminds me of the old days when supercomputers were weird.
Don't leave us hanging, show us a weird computer!
https://web.archive.org/web/20230812020202/https://www.youtu...
Pretty amazing speed, especially considering this is bf16. But how many racks is this using? They used 4 racks for 70B, so this is, what, at least 24? A whole data center for one model?!
Each Cerebras wafer-scale chip has 44 GB of SRAM. You need 972 GB of memory to run Llama 405B at fp16, so you need at least 23 of these.
I assume they're using SRAM only to achieve this speed and not HBM.
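A quick sanity check of that arithmetic, taking the 44 GB-per-wafer and 972 GB figures above at face value (the raw fp16 weights alone, 405B parameters at 2 bytes each, come to about 810 GB):

```python
# Sanity check of the wafer-count arithmetic, using the figures quoted above.
import math

sram_per_wafer_gb = 44
memory_needed_gb = 972     # figure from the comment; raw fp16 weights alone are ~810 GB
wafers = math.ceil(memory_needed_gb / sram_per_wafer_gb)
print(wafers)              # 23 wafers just to hold that much in SRAM
```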
Normally, I don't think 1000 tokens/s is that much more useful than 50 tokens/s.
However, given that CoT makes models a lot smarter, I think Cerebras chips will be in huge demand from now on. You can have a lot more CoT runs when the inference is 20x faster.
Also, I assume financial applications such as hedge funds would be buying these things in bulk now.
> Also, I assume financial applications such as hedge funds would be buying these things in bulk now.
Please elaborate.. why?
I'm assuming hedge funds are using LLMs to extract information from company news and SEC reports as soon as possible and then make trading decisions. Having faster inference would be a huge advantage.
I'm so curious to see some multi-agent systems running with inference this fast.
There's no good open source agent models at the moment unfortunately.
So out of all the AI chip startups, Cerebras is probably the real deal.
Groq is legitimate. Cerebras so far doesn't scale (wide) nearly as well as Groq. We'll see how it goes.
How exactly does Groq scale wide well? Last I heard, it took 9 racks (!!) to run Llama 2 70B.
Which is why they throttle your requests
Well, Cerebras pretty much needs a data center to simply fit the 405B model for inference.
I guess this just shows the insanity of venture-led AI hardware hype and shady startup messaging practices.
Google (TPUs), Amazon, a YC-funded ASIC/FPGA company, and a Chinese company all have custom hardware too that might scale well.
Just in time for their IPO.
It got cancelled/postponed.
This gets tons of press and discussion here on HN, but frankly AMD has a better overall product with the upcoming MI325x [0].
I love to see the development and activity, but companies like Cerebras are trying to compete on a single usecase and doing a poor job of it because they can only offer a tightly controlled API.
Ask yourself how much capex + power/space/cooling (opex) it requires to run that model (and how many people it can really serve) and then compare that against what AMD is offering.
[0] https://www.amd.com/en/products/accelerators/instinct/mi300/...
The same could be said for Nvidia’s H100 and other GPUs.
Genuinely curious and willing to learn: what are the different inference approaches, broadly? Is there any difference in approach between Cerebras and simplismart.ai, which claims to be the fastest?
Cerebras features in the internal OpenAI emails that recently came out. One example:
Ilya Sutskever to Elon Musk, Sam Altman, (cc: Greg Brockman, Sam Teller, Shivon Zilis) - Sep 20, 2017 2:08 PM
> In the event we decide to buy Cerebras, my strong sense is that it'll be done through Tesla. But why do it this way if we could also do it from within OpenAI?
Not open beta until Q1 2025
Holy bananas, the title alone is almost its own language.
How does binning work when your chip is the entire wafer?
They expect that some of the cores on the wafer will fail, so they have redundant links throughout the chip; they can seal off/turn off any cores that fail and still have enough cores to do useful work.
My understanding is that they mask off or otherwise disable a whole row+column of cores when one dies
That's way too wasteful.
Take a look at https://fuse.wikichip.org/news/3010/a-look-at-cerebras-wafer... and specifically the diagram https://fuse.wikichip.org/wp-content/uploads/2019/11/hc31-ce...
The fabric can effectively route signals diagonally to work around an individual defective core: cores in the same row, from the defect onward, are displaced by one position toward the nearest spare core. That's how they get away with a claimed "1–1.5%" of spare cores.
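A toy model of that row-repair scheme (the real routing details are Cerebras-internal, so treat this as a guess at the logical-to-physical remap): each row ends in a spare core, and every logical core at or past a defect simply maps one position further along.

```python
# Toy model of the row-repair idea: one spare core per row, and logical cores
# at or past a defect shift one physical position toward that spare. This is a
# guess at the remapping, not Cerebras's actual scheme.

def map_row(logical_cores, defect_col=None):
    """Physical column for each logical column in one row (spare at the end)."""
    return [col + 1 if defect_col is not None and col >= defect_col else col
            for col in range(logical_cores)]

print(map_row(8, defect_col=3))   # [0, 1, 2, 4, 5, 6, 7, 8] -> physical column 3 is bypassed
```

With one spare per row, the overhead is 1/(row length + 1), which for long rows lands in the low single-digit percent range, consistent with the claimed figure above.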
nvidia hates this one little trick
I laughed and upvoted, but if anything I bet they put their best people on it to replicate this offering.
What I take away from this is: we are just getting started. I remember in 2023 begging OpenAI to give us more than 7 tokens/second on GPT-4.
Nvidia’s target is performance across concurrent users, and they likely already outperform Cerebras there as far as cost is concerned. They have no reason to try to beat the single-user performance of this.
Damn that's a big model and that's really fast inference.
I wonder if Cerebras could generate decent-quality video in real time.
Transistor (GPU) -> Integrated Circuit (WSE-3)
Is it just me, or is the most important contender on speed, Groq, missing from the comparison? Not sure why it matters to put Azure there; no one uses it for speed.