8x H100 can also do fine-tuning, right? Does Cerebras offer fine-tuning support?
I'm assuming hedge funds are using LLMs to dissect information from company news and SEC reports as soon as possible and then make trading decisions. Having faster inference would be a huge advantage.
This gets tons of press and discussion here on HN, but frankly AMD has a better overall product with the upcoming MI325x [0].
I love to see the development and activity, but companies like Cerebras are trying to compete on a single use case and doing a poor job of it because they can only offer a tightly controlled API.
Ask yourself how much capex + power/space/cooling (opex) it requires to run that model (and how many people it can really serve) and then compare that against what AMD is offering.
Genuinely curious and willing to learn: what are the different inference approaches, broadly? Is there any difference in approach between Cerebras and simplismart.ai, which claims to be the fastest?
Cerebras features in the internal OpenAI emails that recently came out. One example:
Ilya Sutskever to Elon Musk, Sam Altman, (cc: Greg Brockman, Sam Teller, Shivon Zilis) - Sep 20, 2017 2:08 PM
> In the event we decide to buy Cerebras, my strong sense is that it'll be done through Tesla. But why do it this way if we could also do it from within OpenAI?
They expect that some of the cores on the wafer will fail, so they have redundant links all throughout the chip, so they can seal off/turn off any cores that fail and still have enough cores to do useful work.
The fabric can effectively route signals diagonally to work around an individual defective core, with a displacement of one position for cores in the same row from that defect over to the nearest spare core. That's how they get away with a claimed "1–1.5%" of spare cores.
Nvidia’s target is performance across concurrent users and they are likely already outperforming Cerebras there as far as costs are concerned. They have no reason to try to beat the single user performance of this.
Is it just me, or is the most important contender on speed, Groq, missing from the comparison?
Not sure why it matters to put Azure there; no one uses it for speed.
This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70b implementation on an 8x H100 cluster.
I’m curious how they’re doing it. Obviously the standard bag of tricks (eg, speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?
Cerebras makes CPUs with ~1 million cores, and they're running inference on those, not on GPUs. It's an entirely different architecture, which means no network is involved. It's possible they're also serving this largely from on-chip caches rather than HBM.
I recommend the TechTechPotato YouTube videos on Cerebras to understand more of their chip design.
> TechTechPotato YouTube videos on Cerebras
https://www.youtube.com/@TechTechPotato/search?query=cerebra... for anyone also looking. there are quite a lot of them.
I hope we can buy Cerebras cards one day. Imagine buying a ~$500 AI card for your desktop and having easy access to 70B+ models (the price is speculative/made up).
I believe pricing was mid six figures per machine. They're also something like 8U and water cooled, I believe. I doubt it would be possible to deploy one outside of a fairly top-tier colo facility where they have the ability to support water cooling. Also, imagine having to learn a new CUDA, but one designed for a completely different compute model.
Based on their S1 filing and public statements, the average cost per WSE system for their (~90% of their total revenue) largest customer is ~$1.36M, and I’ve heard “retail” pricing of $2.5M per system. They are also 15U and due to power and additional support equipment take up an entire rack.
The other thing people don't seem to be getting in this thread is that just holding the weights for 405B at FP16 requires 19 of their systems, since it is SRAM only… rounding up to 20 to account for program code + KV cache for the user context would mean 20 systems/racks, so well over $20M. The full rack (including support equipment) also consumes 23 kW, so we are talking nearly half a megawatt and ~$30M for them to be getting this performance on Llama 405B.
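Back-of-envelope, that capacity and power math looks like this (a sketch; the 44 GB of SRAM per system and 23 kW per rack figures come from elsewhere in this thread):

```python
# Rough check: how many SRAM-only Cerebras systems does 405B at FP16 need?
params = 405e9
bytes_per_param = 2                      # FP16
sram_per_system_gb = 44                  # per-wafer SRAM figure quoted in this thread
rack_power_kw = 23                       # full rack incl. support equipment

weights_gb = params * bytes_per_param / 1e9               # ~810 GB of weights
systems_for_weights = weights_gb / sram_per_system_gb     # ~18.4 -> 19 systems
systems_total = 20                                        # + program code and KV cache
print(f"weights: {weights_gb:.0f} GB -> {systems_for_weights:.1f} systems for weights alone")
print(f"{systems_total} racks x {rack_power_kw} kW = {systems_total * rack_power_kw} kW")
```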
Thank you, far better answer than mine! Those are indeed wild numbers, although interestingly "only" 23 kW; I'd expect the same level of compute in GPUs to draw quite a lot more than that, or at least at higher power density.
You get ~400 TFLOP/s from an H100 for 350 W. You need (2 * tokens/s * param count) FLOP/s. For 405B at 969 tok/s you just need 784 TFLOP/s, which is just 2 H100s.
The limiting factor with GPU for inference is memory bandwidth. For 969 tok/s in int8, you need 392 TB/s memory bandwidth or 200 H100s.
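The arithmetic behind both of those comments, as a sketch (the per-H100 numbers are the commonly quoted ballpark specs and are assumptions here):

```python
# Compute-bound vs memory-bound estimate for 405B at 969 tok/s, batch size 1
params = 405e9
tok_per_s = 969

# Compute side: ~2 FLOPs per parameter per generated token
flops_needed = 2 * params * tok_per_s            # ~7.8e14 FLOP/s
h100_flops = 400e12                              # rough dense TFLOP/s figure assumed above
print(f"{flops_needed/1e12:.0f} TFLOP/s ~= {flops_needed/h100_flops:.1f} H100s of compute")

# Memory side: at batch size 1, every weight byte is read once per token
bytes_per_param = 1                              # int8
bw_needed = params * bytes_per_param * tok_per_s # ~3.9e14 B/s
h100_bw = 2e12                                   # ~2 TB/s HBM per H100 assumed
print(f"{bw_needed/1e12:.0f} TB/s ~= {bw_needed/h100_bw:.0f} H100s of bandwidth")
```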
Memory bandwidth and memory size. Along with power/cooling density.
Hence why you see AMD's MI325X coming out with 256 GB of HBM3e, but with the same FLOPs as the MI300X. 6 TB/s of bandwidth too, which outperforms the H200's by a lot.
You can see the direction AMD is going with this...
https://www.amd.com/en/products/accelerators/instinct/mi300/...
Memory bandwidth for inferencing does not scale with the number of GPUs. Scaling instead requires more concurrent users. Also, I am told that 8 H100 cards can achieve 600 to 1000 tokens per second with concurrent users.
8 H100s could achieve a lot more than 1000 tokens/sec.
> Memory bandwidth for inferencing does not scale with the number of GPU
It does
> For 969 tok/s in int8, you need 392 TB/s memory bandwidth
I think that math is only valid for batch size = 1. When these 969 tokens/second come from multiple sessions of the same batch, loaded model tensor elements are reused to compute many tokens for the entire batch. With large enough batches, you can even saturate compute throughput of the GPU instead of bottlenecking on memory bandwidth.
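Here's a toy roofline model of that effect (a sketch; the aggregate bandwidth and FLOP numbers for an 8x H100 node are illustrative assumptions, and KV-cache reads and communication are ignored):

```python
# Toy roofline: aggregate tok/s vs batch size when weights are re-read every step
params = 405e9
bytes_per_param = 1                  # int8 weights
agg_bw = 8 * 3.35e12                 # B/s: 8x H100 SXM HBM, assumed
agg_flops = 8 * 400e12               # FLOP/s: dense throughput assumption from above

def tokens_per_s(batch):
    t_mem = params * bytes_per_param / agg_bw     # read all weights once per decode step
    t_compute = 2 * params * batch / agg_flops    # ~2 FLOPs/param/token, grows with batch
    return batch / max(t_mem, t_compute)

for b in (1, 8, 64, 512):
    print(f"batch {b:>3}: ~{tokens_per_s(b):,.0f} tok/s aggregate")
# batch 1 is bandwidth-bound (~66 tok/s here); large batches saturate compute instead
```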
They claim to obtain that number with 8 to 20 concurrent users:
https://x.com/draecomino/status/1858998347090325846
From what I have read, it is a maximum of 23 kW per chip and each chip goes into a 16U. That said, you would need at least 460 kW power to run the setup you described.
As for retail pricing being $2.5 million, I read $2 million in a news article earlier this year. $2.5 million makes it sound even worse.
Thank you for the breakdown. Bit of an emotional journey.
"$500 in the future...? Oh, $30 million now, so that might be a while..."
It took 30 years for computers to go from entire rooms to desktops, and another 30 years to go from desktops to our pockets.
I don't know if we can extrapolate, but I can imagine AI inference on our desktops for $500 in a few years...
Well, we can do AI inference on our desktops for $500 today, just with smaller models and far slower.
> Based on their S1 filing and public statements
Is it a good stock to buy :)
Given those details, they seem not much better on cost per token than Nvidia-based systems.
That means it'll be close to affordable in 3 to 5 years if we follow the curve we've been on for the past decades.
How have power and cooling been doing with respect to chip improvements? Have power requirements per operation been coming down rapidly, as other features have improved?
My recollection from PC CPUs is that we've gotten many more operations per second, and many more operations per second per dollar, but that the power and corresponding cooling requirements for the CPUs have tended to go up as well. I don't really know what power per operation has looked like there. (I guess it's clearly improved, though, because it seems like the power consumption of a desktop PC has only increased by a single order of magnitude, while the computational capacity has increased by more than that.)
A reason that I wonder about this in this context is that people are saying that the power and cooling requirements for these devices are currently enormous (by individual or hobbyist standards, not by data center standards!). If we imagine a Moore's Law-style improvement where the hardware itself becomes 1/10 or 1/100 of its current price, would we expect the overall power consumption to be similarly reduced, or to remain closer to its current levels?
Moore's law in the consumer space seems to be pretty much asymptoting now, as indicated by Apple's amazing MacBooks with an astounding 8GB of RAM. Data center compute is arguable, as it tends to be catered to some niche, making it confusing (Cerebras as an example vs GPU datacenters vs more standard HPC). Also, clusters and even GPUs don't really fit into Moore's law as originally framed.
Apple doesn’t sell those anymore.
Aw man, are they selling only 4GB ones now?
More seriously, even 16GB was essentially the 'norm' in consumer PCs about 15 years ago.
Not really. These are wafer-scale chips, which (as far as I'm aware) were first introduced by Cerebras.
Cost reduction for cutting-edge products in the semiconductor industry has historically been driven by 1) reducing transistor size (by following the Dennard scaling laws), and 2) a variety of techniques (e.g. high-k dielectrics and strained silicon, or FinFETs and now GAAFETs) to improve transistor performance further. These techniques added more steps during manufacturing, but they were inexpensive enough that $/transistor still kept falling. In the last few years, we've had to pull off ever more expensive tricks, which stopped the $/transistor progress. This is why the phrase "Moore's law is dead" has been circulating for a while.
In any case, higher-performance transistors mean that you can get the same functionality for less power and a smaller area, meaning that iso-functionality chips are cheaper to build in bulk. This is especially true for older nodes, e.g. look at the absurdly low price of most microcontrollers.
On the other hand, $/wafer is mostly a volume-related metric based on less scalable technology and more conventional manufacturing (relatively speaking). Cerebras's innovation was in making a wafer-scale chip possible, which is conventionally hard due to unavoidable manufacturing defects. But crucially, such a product (by definition) cannot scale like any other circuit produced so far.
It may for sure drop in price in the future, especially once it gets obsolete. But I don't expect it to ever reach consumer level prices.
Wafer-scale chips have been attempted for many decades, but none of the attempts before Cerebras resulted in a successful commercial product.
The main reason why Cerebras has succeeded and the previous attempts have failed is not technical, but the existence of market demand.
Before ML/AI training and inference, there has been no application where wafer-scale chips could provide enough additional performance to make their high cost worthwhile.
Cerebras has a patent on the technique used to etch across scribe lines. Is there any prior work that would invalidate that patent?
By the way, I am a software developer, so you will not see me challenging their patent. I am just curious.
It will also mean 405B models will be uninteresting in 3 to 5 years if we follow the curve we've been on for the past decades.
I don't think they'll be uninteresting. They won't be cutting-edge anymore, sure, but much of the more practical applications of AI that we see today don't run on today's cutting-edge models, either. We're always going to have a certain compute budget, and if a smaller model does the job fine, why wouldn't you use it, and use the rest for something else (or use all of it to run the smaller model faster).
Yeah you can see the cooling requirements by looking at their product images. https://cerebras.ai/wp-content/uploads/2021/04/Cerebras_Prod...
The thing is nearly all cooling. And look at the diameter of the water cooling pipes. The airflow guides on the fans are solid steel. Apparently the chip itself measures about 21.5 cm on a side. Insane.
Parent wishes 70b not 405b though
Yeah but what is in a 4090 is also comparable to a whole rack of servers a decade ago. The tech will get smaller.
One day is doing some heavy heavy lifting here, we’re currently off by ~3-4 orders of magnitude…
Thank you, for the reality check! :)
We have moved 2 orders of magnitude in the last year. Not that unreasonable
So 1000-10000 days? ;)
In a few thousand days (c) St. Altman
lol I almost said that too
You still have to pay for the memory. The Cerebras chip is fast because they use 700x more SRAM than, say, A100 GPUs. Loading the whole model in SRAM every time you compute one token is the expensive bit.
Maybe not $500, but $500,000
Ah, makes a lot more sense now.
Also, the WSE-3 pulls 15 kW. https://www.eetimes.com/cerebras-third-gen-wafer-scale-chip-...
But 8x H100 are ~2.6-5.2 kW (I get conflicting info, I think based on PCIe vs SXM), so anywhere between roughly even and up to 2x as efficient.
They are doing it with custom silicon with several times more area than 8x H100s. I’m sure they are doing some sort of optimization at execution/runtime, but the primary difference is the sheer transistor count.
https://cerebras.ai/product-chip/
To be specific, a single WSE-3 has the same die area as about 57 H100s. It's a big chip.
It is worth splitting out the stacked memory silicon layers on both too (if Cerebras is set up with external DRAM memory). HBM is over 10 layers now so the die area is a good bit more than the chip area, but different process nodes are involved.
Amazing!
They have a chip the size of a dinner plate. Take a look at the pictures: https://cerebras.ai/product-chip/
21 petabytes per second. Can push the whole internet over that chip xD
The number for that is, I believe, 1 terabit/s, or 125 GB/s -- the 21 PB/s is the speed from the SRAM (~registers) to the cores (~ALUs) across the whole chip. It's not especially impressive as SRAM speeds go. The impressive thing is that they have an enormous amount of SRAM.
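To see why the aggregate number is less exotic than it sounds, divide it by the core count (a sketch; ~900k is the ballpark core count quoted for the WSE-3):

```python
# 21 PB/s of aggregate SRAM bandwidth spread over ~900k cores is modest per core
aggregate_sram_bw = 21e15            # B/s, the figure discussed above
cores = 900_000                      # approximate WSE-3 core count (assumption)
print(f"~{aggregate_sram_bw / cores / 1e9:.0f} GB/s per core")  # ~23 GB/s, ordinary for local SRAM
```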
That's their on chip cache bandwidth. Usually that stuff isn't even measured in bandwidth but latency.
I'd love to see the heatsink for this lol
They call it the “engine block”!
https://www.servethehome.com/a-cerebras-cs-2-engine-block-ba...
what kind of yield do they get on that size?
Part of their technology is managing/bypassing defects.
It's near 100%. Discussed here:
https://youtu.be/f4Dly8I8lMY?t=95
Cerebras is a chip company. They are not using GPUs. Their chip uses wafer scale integration which means it's the physical size of a whole wafer, dozens of GPUs in one.
They have limited memory on chip (all SRAM) and it's not clear how much HBM bandwidth they have per wafer. It's a completely different optimization problem than running on GPU clusters.
They do not use HBM. Off-chip memory is accessible at 150 GB/s.
they have about 125GB/s of off-chip bandwidth
Do they just not do HBM at all or
I'm not too up to date, but as I recall there are a lot of weirdnesses because of how big their chip is (e.g. thermal expansion being a problem). I believe they have a single giant line in the middle of the chip for this reason. Maybe this makes HBM etc. hard? Certainly their chip would be more appealing if they cut the number of cores by 10x, added matrix units, and added HBM, but it looks like they're not going to go that way.
There are two big tricks: their chips are enormous, and they use SRAM as their memory, which is vastly faster than the HBM used by GPUs. In fact, this is the main reason it's so fast. Groq gets its speed for the same reason.
How much memory do you need to run FP8 Llama 3 70B - can it potentially fit on one H100 GPU with 96GB of RAM?
In other words, if you wanted to run 8 separate 70B models on your cluster, each of which would fit into one GPU, how much larger could your overall token output be than parallelizing one model across 8 GPUs and having things slowed down a bit due to NVLink?
It’s been a minute so my memory might be off but I think when I ran 70b at fp16 it just barely fit on a 2x A100 80GB cluster but quickly OOMed as the context/kv cache grew.
So if I had to guess a 96GB H100 could probably run it at fp8 as long as you didn’t need a big context window. If you’re doing speculative decoding it probably won’t fit because you also need weights and kv cache for the draft model.
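A rough memory estimate behind that guess (a sketch; the layer/head counts are Llama 3 70B's published architecture, and activation overhead is ignored):

```python
# Rough fit check: Llama 3 70B in FP8 weights on a single 96 GB GPU
params = 70e9
weight_bytes = params * 1                         # FP8 weights ~= 70 GB

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim = 80, 8, 128           # Llama 3 70B (GQA)
kv_per_token = 2 * layers * kv_heads * head_dim * 2   # FP16 KV cache, ~0.33 MB/token

leftover = 96e9 - weight_bytes
print(f"weights: {weight_bytes/1e9:.0f} GB, leftover: {leftover/1e9:.0f} GB")
print(f"~{leftover/kv_per_token:,.0f} tokens of KV cache before activations/overhead")
```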
It should work, I believe. And anything that doesn't fit you can leave on your system RAM.
Looks like an H100 runs about $30K online for one. Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?
> Are there any issues with just sticking one of these in a stock desktop PC and running llama.cpp?
Cooling might be a challenge. The H100 has a heatsink designed to make use of the case fans. So you need a fairly high airflow through a part which is itself passive.
On a server this isn't too big a problem: you have fans in one end and GPUs blocking the exit on the other end, but in a desktop you probably need to get creative with cardboard/3D-printed shrouds to force enough air through it.
Imagine if you could take Llama 3.1 405B and break it down to a tree of logic gates, optimizing out all the things like multiplies by 0 in one of the bits, etc... then load it into a massive FPGA-like chip that had no von Neumann bottleneck, was just pure compute without memory access latency, with a conservative 1 GHz clock rate.
Such a system would be limited by the latency across the reported 126 layers' worth of math involved before it could generate the next token, which might be as much as 100 µs. So it would be 10x faster, but you could have thousands of other independent streams pipelined through in parallel, because you'd get a token per clock cycle out the end.
In summary, 1 Gigatoken/second, divided into 100,000 separate users each getting 10k tokens/second.
This is the future I want to build.
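The arithmetic behind that throughput claim, as a sketch (the 126-layer figure, ~100 µs pipeline latency, and 1 GHz clock are the assumptions stated above):

```python
# Pipelined "token per clock" thought experiment
clock_hz = 1e9                   # conservative 1 GHz clock
pipeline_latency_s = 100e-6      # ~100 µs to traverse the 126 layers of logic
streams = 100_000                # independent sequences kept in flight in the pipeline

single_stream_tok_s = 1 / pipeline_latency_s   # autoregression: ~10,000 tok/s per stream
aggregate_tok_s = clock_hz                     # one token exits the pipeline every cycle
print(f"single stream: {single_stream_tok_s:,.0f} tok/s")
print(f"aggregate: {aggregate_tok_s/1e9:.0f} Gtoken/s, or "
      f"{aggregate_tok_s/streams:,.0f} tok/s each for {streams:,} users")
```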
See Convolutional Differentiable Logic Gate Networks https://arxiv.org/abs/2411.04732 , which is a small step in that direction.
I'm actively trying to learn how to do exactly this, though I'm just getting started with FPGAs now so probably a very long range goal.
There is not enough memory attached to FPGAs to do this. Some FPGAs come with 16GB of HBM attached, but that is not enough, and the bandwidth provided is not as high as it is on GPUs. You would need to work out how to connect enough memory chips simultaneously to get high bandwidth and enough capacity in order for an FPGA solution to be performance-competitive with a GPU solution.
Instead of separate memory/compute, I propose to fuse them.
Nah. Try vLLM and 405B FP8 on that hardware. And make sure you’re benchmarking with some concurrency for max TPS.
Related recent discussion on twitter: https://x.com/Teknium1/status/1858987850739728635
Looks like other folks get 80 tok/s with max batch size, that's surprising to me but vLLM is definitely more optimized than my implementation.
Check out BaseTen for performant use of GPUs
I'm not sure if they're comparing apples to apples on the latency here. There are roughly three parts to the latency: the throughput of the context/prompt, the time spent queueing for hardware access, and the other standard API overheads (network, etc).
From what I understand, several, maybe all, of the comparison services are not based on provisioned capacity, which means that the measurements include the queue time. For LLMs this can be significant. The Cerebras number on the other hand almost certainly doesn't have some unbounded amount of queue time included, as I expect they had guaranteed hardware access.
The throughput here is amazing, but to get that throughput at a good latency for end-users means over-provisioning, and it's unclear what queueing will do to this. Additionally, does that latency depend on the machine being ready with the model, or does that include loading the model if necessary? If using a fine-tuned model does this change the latency?
I'm sure it's a clear win for batch workloads where you can keep Cerebras machines running at 100% utilisation and get 1k tokens/s constantly.
Everyone presumes this is under ideal conditions...and it's incredible.
It's bs=1. At 1,000 t/s. Of a 405B parameter model. Wild.
They claim it is with between 8 and 20 users:
https://x.com/draecomino/status/1858998347090325846
That said, they appear to be giving the per user performance.
Cerebras' benchmark is most likely under ideal conditions, but I'm not sure it's possible to test public cloud APIs under ideal conditions as it's shared infrastructure so you just don't know if a request is "ideal". I think you can only test these things across significant numbers of requests, and that still assumes that shared resource usage doesn't change much.
I'm not talking about that. I and many others here have spun up 8x or more H100 clusters and run this exact model. Zero other traffic. You won't come anywhere close to this.
In that case I'm misunderstanding you. Are you saying that it's "BS" that they are reaching ~1k tokens/s? If so, you may be misunderstanding what a Cerebras machine is. Also, 8x H100 is still ~half the price of a single Cerebras machine, and that's even accounting for H100s being massively overpriced. You've got easily twice the value in a Cerebras machine; they have nearly 1M cores on a single die.
Ha ha. He probably means "at a batch size of 1", i.e. not even using some amortization tricks to get better numbers.
Ah! That does make more sense!
Right, I'd assume most LLM benchmarks are run on dedicated hardware.
For what you can do with current-gen models, along with RAG, multi-agent setups & code interpreters, the wall is very much model latency, not accuracy, any more.
There are so many interactive experiences that could be made possible at this level of token throughput from 405B class models.
How can a rule book help fix incidents? I mean, I hope every incident is novel, since you solve the root issue. So every time you need to dig into the code, or the recently deployed code, and correlate it with your production metrics.
Or is the rulebook a simple rollback?
Like what..
You can create massive variants of OpenAI's o1 model. The "Chain of Thought" tools become way more useful when you can iterate 100x faster. Right now, flagship LLMs stream responses back and barely beat the speed a human can read, so adding CoT makes it really slow for human-in-the-loop experiences. You can really get a lot more interesting "thoughts" (or workflow steps, or whatever) when it can do more without slowing down the human experience of using the tool.
You can also get a lot fancier with tool-usage when you can start getting an LLM to use and reply to tools at a speed closer to the speed of a normal network service.
I've never timed it, but I'm guessing current LLMs don't handle "live video" type applications well. Imagine an LLM you could actually video chat with - it'd be useful for walking someone through a procedure, or advanced automation of GUI applications, etc.
AND the holy grail of AI applications that would combine all of this: robotics. Today, Cerebras chips are probably too power hungry for battery-powered robotic assistants, but one could imagine a Star Wars-style robot assistant many years from now. You could have a robot that can navigate some space (a home or work setting), see its environment, and act on it, processing the video in real time. Then it can reason about the world and its given task by explicitly thinking through steps and critically self-challenging those steps.
> barely beat the speed a human can read
4o is way faster than a human can read.
Imagine increasing the quality and FPS of those AI-generated minecraft clones and experiencing even more high-quality, realtime AI-generated gameplay
(yeah, I know they are doing textual tokens. but just sayin..)
edit: context is https://oasisaiminecraft.com/
To be clear, a Cerebras chip consumes a whole wafer and has only 44 GB of SRAM on it. To fit a 405B model in bf16 precision (excluding KV cache and activation memory usage) you need 19 of these "chips" (and the requirement will grow as the sequence length increases, because of the KV cache). Looking online, it seems one wafer can fit between 60 and 80 H100 dies, so it's equivalent to using >1,500 H100s, using wafer manufacturing cost as a metric.
The budget these companies spend on this tech is seriously mind boggling to me.
Is wafer cost a major factor in the actual chip price?
This is seriously impressive performance. I think there's a high probability Nvidia attempts to acquire Cerebras.
They're considering an IPO. I'd say an acquisition is unlikely. Even then, they'd be worth more to Facebook or MS.
No, they would make a capital infusion on paper and then make Cerebras buy more hw from that money on paper, thus showing huge revenues on Nvidia books.
Makes sense… or whatever
They're making custom chips right? Why would Cerebras buy hardware from Nvidia?
They have a waitlist for trying their API. You have to be a bit skeptical when a company makes claims but does not offer their services for purchase.
No mention of their direct competitor Groq?
I'm a happily-paying customer of Groq but they aren't competitive against Cerebras in the 405b space (literally at all).
Groq has paying customers below the enterprise level and actually serves all their models to everyone broadly, unlike Cerebras, who are very selective, so they have that going for them. But in terms of sheer speed and in the largest models, Groq doesn't really compare.
Is this because 405B doesn't fit on Groq? If they perform better, I would have liked to see that too.
When 405B first launched, Groq ran it; it's not currently running due to capacity issues though.
Sambanova is not often mentioned either [0]. One of its co-founders is known as the "father of the multi-core processor" [1].
[0]: https://sambanova.ai/
[1]: https://en.wikipedia.org/wiki/Kunle_Olukotun
The fact that such a boost is possible with new hardware, I wonder what the ceiling is for improving performance for training via hardware as well.
Not enormous without significant changes to the ML. There are two pieces to this: improving efficiency and improving flops.
Improving flops is the most obvious way to improve speed, but I think we're pretty close to physical limits for a given process node and datatype precision. It's hard to give proof positive of this, but there are a few lines of evidence. One is that the fundamental operation of LLMs, matrix multiplication, is really simple (unlike e.g. CPU work), so all the control flow logic and the like is already pretty minimized. We're largely spending electricity on doing the matrix multiplications themselves, and the matrix multiplications are in fact electricity-bound[1]. There are gains to be made by changing precision, but this is difficult and we're close to tapped out on it in my opinion (we're already at very low precisions, e.g. fp8 can't represent 17, and new research is showing limitations).
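To make the "fp8 can't represent 17" aside concrete, here's a small sketch that enumerates which values a standard E4M3 float can take near 17 (illustrative only, not a library-accurate emulation):

```python
# Normal E4M3 (fp8) values: (1 + m/8) * 2^(e - 7); only 3 mantissa bits
values = sorted(
    (1 + m / 8) * 2 ** (e - 7)
    for e in range(1, 15)        # normal exponent fields; enough to cover values near 17
    for m in range(8)
)
print([v for v in values if 14 <= v <= 20])
# [14.0, 15.0, 16.0, 18.0, 20.0] -- above 16 the step size is 2, so 17 falls in a gap
```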
Efficiency in LLM training is measured with a very punishing standard, "Model Flops Utilization" (MFU), where we divide the theoretical number of flops necessary to implement the mathematical operation by the theoretical number of flops the hardware could have provided in that time. We're able to get 30% without thinking (just FSDP), and 50-60% is not implausible/unheard of. The inefficiency is largely because 1) the hardware can't provide the number of flops it says on the tin for various reasons and 2) we have to synchronize terabytes of data across tens of thousands of machines. The theoretical limit here is 2x, but in practice there's not a ton to eke out.
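As a concrete sketch of how MFU is computed (the throughput and peak numbers below are illustrative assumptions, not measurements from any particular run):

```python
# Model FLOPs Utilization: model FLOPs actually required / peak FLOPs available
def training_mfu(params, tokens_per_s_per_gpu, peak_flops_per_gpu):
    model_flops = 6 * params * tokens_per_s_per_gpu   # ~6 FLOPs/param/token (fwd + bwd)
    return model_flops / peak_flops_per_gpu

# Illustrative: 405B params, 160 training tokens/s/GPU, ~989 TFLOP/s dense BF16 peak
print(f"MFU ~= {training_mfu(405e9, 160, 989e12):.0%}")   # ~39%
```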
There will be gains but they will be mostly focused on reducing NVIDIA's margin (TPU), on improving process node, on reducing datatype (B100), or on enlarging the size of a chip to reduce costly cross-chip communication (B100). There's not room for a 10x (again at constant precision and process node).
[1]: https://www.thonking.ai/p/strangely-matrix-multiplications
The ultimate solution would be to convert an LLM to a pure ASIC.
My guess is that would 10X the performance. But then it's a very very expensive solution.
There's some interesting research on using stacked flat lenses to build analog, physical neural-network inference that operates directly on light (each lens is a hidden layer). If we managed to make this work for non-trivial cases, it could be absurdly fast.
Why would converting a specific LLM to an ASIC help you? LLMs are like 99% matrix multiplications by work and we already have things that amount to ASICs for matrix multiplications (e.g. TPU) that aren't cheaper than e.g. H100
An ASIC could have all of the weights baked into the design, completely eliminating the von Neumann bottleneck that plagues computation.
They are inherently parallel, so you might be able to get a token per clock cycle. A billion tokens per second opens quite a few possibilities.
It could also eliminate all of the multiplications or additions of bits that are 0 from the design, shrinking each multiplier by about 50 percent in silicon area, on average.
However, an ASIC is a speculation that all the design tools work. It may require multiple rounds to get it right.
I doubt you could have a token per clock cycle unless it is very low clocked. In practice, even dedicated hardware for matrix-matrix multiplication does not perform the multiplication in a single clock cycle. Presumably, the circuit paths would be so large that you would need to have a very slow clock to make that work, and there are many matrix multiplications done per token. Furthermore, things are layered and must run through each layer. Presumably if you implement this you would aim for 1 layer per clock cycle, but even that seems like it would be quite long as far as circuit paths go.
I have some local code running Llama 3 8B, and the matrix multiplications in it are done on 2D matrices with dimensions ranging from 1024 to 4096. Let's just go with a nice 1024x1024 matrix and do matrix-vector multiplication, which is the minimum needed to implement Llama 3. That is 1,048,576 elements. If you try to do the matrix-vector multiplication in 1 cycle, you will need 1,048,576 fmadd units.
I am by no means a chip designer, so I asked ChatGPT to estimate how many transistors are needed for a bf16 fmadd unit. It said 100,000 to 200,000. Let's go with 100,000 transistors per unit. Thus, to implement a single matrix multiplication according to your idea, we would need over 100 billion transistors, and this is only a small part of the Llama 3 8B model's calculations. You would probably be well into the trillions of transistors if you implemented all of it in an ASIC and did 1 layer per cycle (don't even think of 1 token per cycle). For reference, Nvidia's H100 has 80 billion transistors. The WSE-3 has 4 trillion transistors, and I am not sure if even that would be enough.
It is a nice idea, but I do not think it is feasible with current technology. That said, I do like your out-of-the-box thinking. This might be a bit too far out of the box, but there is probably a middle ground somewhere.
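For reference, the arithmetic in that estimate, as a sketch (the ~100k transistors per bf16 fmadd unit is the rough figure quoted above and is an assumption):

```python
# Transistor budget for a fully unrolled 1024x1024 matrix-vector multiply
fmadd_units = 1024 * 1024                 # one fused multiply-add unit per weight
transistors_per_fmadd = 100_000           # rough estimate quoted above (assumption)
total = fmadd_units * transistors_per_fmadd
print(f"{total/1e9:.0f}B transistors for one mat-vec")   # ~105B
print("vs ~80B in an H100 and ~4,000B in a WSE-3")
```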
You're right in the numbers, I wasn't thinking of trying to push all of that into one chip, but if you can distribute the problem such that an array of chips can break the problem apart cleanly, the numbers fall within the range of what's feasible with modern technology.
The key to this, in my view, is to give up on the idea of trying to get the latency as low as possible for a given piece of computation, as is typically done, and instead try to make reliable small cells that are clocked, so that you don't have to worry about getting data far or fast. Taking this idea to its limit yields a completely homogeneous systolic array that operates on 4 bits at a time, using look-up tables to do everything. No dedicated switching fabric, multipliers, or anything else.
It's the same tradeoff von Neumann made with the ENIAC, which slowed it down by a factor of 6 (according to Wikipedia) but eliminated multiple weeks of human setup labor, since stored programs load effectively instantly.
To multiply numbers, you don't have to do all of the work at once; you just have to pipeline the steps so that each of them is operating on part of the data at any given moment, and it all stays synced (which the clocking again helps with).
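Here is a toy software model of that pipelining idea, under my own arbitrary choices (8-bit second operands, one bit position handled per stage): no stage does much work per clock, yet once the pipeline fills, one finished product comes out every cycle.

```python
# Toy model of a pipelined shift-and-add multiplier: each stage folds in one
# bit position's partial product, so per-clock work stays tiny, but a finished
# product emerges every cycle once the pipeline is full. Illustration only.

BITS = 8  # 8 stages for 8-bit second operands

def stage_work(item, i):
    """Work done by stage i: add the partial product for bit i of y, if set."""
    if item is None:
        return None
    x, y, acc = item
    if (y >> i) & 1:
        acc += x << i
    return (x, y, acc)

def run_pipeline(pairs):
    """Feed one (x, y) pair per cycle; collect x*y products as they emerge."""
    pipe = [None] * BITS                 # pipe[i] has had stages 0..i-1 applied
    feed = [(x, y, 0) for x, y in pairs]
    results = []
    for _cycle in range(len(pairs) + BITS):
        finished = stage_work(pipe[-1], BITS - 1)       # last stage completes
        if finished is not None:
            results.append(finished[2])
        for i in range(BITS - 1, 0, -1):                # clock edge: shift stages
            pipe[i] = stage_work(pipe[i - 1], i - 1)
        pipe[0] = feed.pop(0) if feed else None         # accept a new operand pair
    return results

pairs = [(3, 5), (7, 9), (12, 12), (255, 255)]
assert run_pipeline(pairs) == [x * y for x, y in pairs]
```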
Since I'm working alone, right now I'm just trying to get something that other people can grok, and play with.
Ideally, I'd have chips with multiple channels of LVDS interfaces running at 10 Gbps or more each to allow meshing the chips. Mostly, they'd be vast strings of D flip flops and 16:1 multiplexers.
I'm well aware that I've made arbitrary choices, and they might not be optimal for real-world hardware. I do remain steadfast in my opinion that providing a better impedance match between the computing substrate and the code that runs on it could allow multiple orders of magnitude improvement in efficiency. Not to mention the ability to run the exact same code on everything from an emulator to every successive version/size of the chip, without recompilation.
Not to mention being able to route around bad cells, actually build "walls" around code with sensitive info, etc.
I am wondering what it costs to serve at such latency. For customers, of course, the price depends on the pricing strategy, but the underlying cost really determines how widely this can be adopted. Is it only for businesses that really need the latency, or can it be deployed more generally?
Maybe it could become standard for everyone to make giant chips and use SRAM?
How many SRAM manufacturers are there? Or does it somehow need to be fully integrated into the chip?
SRAM is usually produced on the same wafer as the rest of the logic. SRAM on an external chip would lose many of the advantages without being significantly cheaper.
Yes, the limiting factor for bandwidth is generally the number of pins, which are not cheap, and you can only have a few thousand on a chip. The absolute state of the art is 36 Gb/s/pin [1], and your $30 RAM might do 6 Gb/s/pin [2].
[1]: https://en.wikipedia.org/wiki/GDDR7_SDRAM
[2]: https://en.wikipedia.org/wiki/DDR5_SDRAM
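As a rough illustration of why the pins dominate: aggregate bandwidth is just pin count times per-pin rate. The per-pin rates below come from the links above; the bus widths are illustrative assumptions.

```python
# Aggregate off-chip bandwidth = data pins x per-pin rate. Per-pin rates are
# from the links above; the bus widths here are illustrative assumptions.

def bandwidth_gbps(data_pins: int, gbit_per_s_per_pin: float) -> float:
    """Return aggregate bandwidth in GB/s."""
    return data_pins * gbit_per_s_per_pin / 8

print(bandwidth_gbps(384, 36))   # ~1728 GB/s: a wide 384-bit GDDR7 bus
print(bandwidth_gbps(64, 6))     # ~48 GB/s: a single 64-bit DDR5 channel
```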
The cost is not the memory technology per se but primarily the wires. SRAM is fast because it's directly inside the chip, so the connections to the logic that does the work are cheap because they're short.
I'd like to see a tokens / second / watt comparison.
Their hardware is cool and bizarre. It has to be seen in person to be believed. It reminds me of the old days when supercomputers were weird.
Don't leave us hanging, show us a weird computer!
https://web.archive.org/web/20230812020202/https://www.youtu...
Pretty amazing speed, especially considering this is bf16. But how many racks is this using? They used 4 racks for 70B, so this is, what, at least 24? A whole data center for one model?!
Each Cerebras wafer-scale chip has 44 GB of SRAM. You need 972 GB of memory to run Llama 405B at fp16, so you need at least 23 of these.
I assume they're using SRAM only to achieve this speed and not HBM.
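A quick sanity check of that arithmetic, taking the 44 GB-per-wafer and 972 GB figures above at face value (the raw fp16 weights alone, 405B parameters at 2 bytes each, come to about 810 GB):

```python
# Sanity check of the wafer-count arithmetic, using the figures quoted above.
import math

sram_per_wafer_gb = 44
memory_needed_gb = 972     # figure from the comment; raw fp16 weights alone are ~810 GB
wafers = math.ceil(memory_needed_gb / sram_per_wafer_gb)
print(wafers)              # 23 wafers just to hold that much in SRAM
```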
Normally, I don't think 1000 tokens/s is that much more useful than 50 tokens/s.
However, given that CoT makes models a lot smarter, I think Cerebras chips will be in huge demand from now on. You can have a lot more CoT runs when the inference is 20x faster.
Also, I assume financial applications such as hedge funds would be buying these things in bulk now.
> Also, I assume financial applications such as hedge funds would be buying these things in bulk now.
Please elaborate.. why?
I'm assuming hedge funds are using LLMs to extract information from company news and SEC reports as soon as possible and then make trading decisions. Having faster inference would be a huge advantage.
I'm so curious to see some multi-agent systems running with inference this fast.
There's no good open source agent models at the moment unfortunately.
So out of all the AI chip startups, Cerebras is probably the real deal.
Groq is legitimate. Cerebras so far doesn't scale (wide) nearly as well as Groq. We'll see how it goes.
How exactly does Groq scale wide well? Last I heard, it took 9 racks (!!) to run Llama 2 70B.
Which is why they throttle your requests
Well, Cerebras pretty much needs a data center to simply fit the 405B model for inference.
I guess this just shows the insanity of venture-led AI hardware hype and shady startup messaging practices.
Google (TPUs), Amazon, a YC-funded ASIC/FPGA company, and a Chinese company all have custom hardware too that might scale well.
Just in time for their IPO.
It got cancelled/postponed.
This gets tons of press and discussion here on HN, but frankly AMD has a better overall product with the upcoming MI325x [0].
I love to see the development and activity, but companies like Cerebras are trying to compete on a single usecase and doing a poor job of it because they can only offer a tightly controlled API.
Ask yourself how much capex + power/space/cooling (opex) it requires to run that model (and how many people it can really serve) and then compare that against what AMD is offering.
[0] https://www.amd.com/en/products/accelerators/instinct/mi300/...
The same could be said for Nvidia’s H100 and other GPUs.
Genuinely curious and willing to learn: what are the different inference approaches, broadly? Is there any difference in approach between Cerebras and simplismart.ai, which claims to be the fastest?
Cerebras features in the internal OpenAI emails that recently came out. One example:
Ilya Sutskever to Elon Musk, Sam Altman, (cc: Greg Brockman, Sam Teller, Shivon Zilis) - Sep 20, 2017 2:08 PM
> In the event we decide to buy Cerebras, my strong sense is that it'll be done through Tesla. But why do it this way if we could also do it from within OpenAI?
Not open beta until Q1 2025
Holy bananas, the title alone is almost its own language.
How does binning work when your chip is the entire wafer?
They expect that some of the cores on the wafer will fail, so they have redundant links throughout the chip; they can seal off/turn off any cores that fail and still have enough cores to do useful work.
My understanding is that they mask off or otherwise disable a whole row+column of cores when one dies
That's way too wasteful.
Take a look at https://fuse.wikichip.org/news/3010/a-look-at-cerebras-wafer... and specifically the diagram https://fuse.wikichip.org/wp-content/uploads/2019/11/hc31-ce...
The fabric can effectively route signals diagonally to work around an individual defective core: cores in the same row, from the defect onward, are displaced by one position toward the nearest spare core. That's how they get away with a claimed "1–1.5%" of spare cores.
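A toy model of that row-repair scheme (the real routing details are Cerebras-internal, so treat this as a guess at the logical-to-physical remap): each row ends in a spare core, and every logical core at or past a defect simply maps one position further along.

```python
# Toy model of the row-repair idea: one spare core per row, and logical cores
# at or past a defect shift one physical position toward that spare. This is a
# guess at the remapping, not Cerebras's actual scheme.

def map_row(logical_cores, defect_col=None):
    """Physical column for each logical column in one row (spare at the end)."""
    return [col + 1 if defect_col is not None and col >= defect_col else col
            for col in range(logical_cores)]

print(map_row(8, defect_col=3))   # [0, 1, 2, 4, 5, 6, 7, 8] -> physical column 3 is bypassed
```

With one spare per row, the overhead is 1/(row length + 1), which for long rows lands in the low single-digit percent range, consistent with the claimed figure above.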
nvidia hates this one little trick
I laughed and upvoted, but if anything I bet they put their best people on it to replicate this offering.
What I take away from this is: we are just getting started. I remember in 2023 begging OpenAI to give us more than 7 tokens/second on GPT-4.
Nvidia’s target is performance across concurrent users, and they likely already outperform Cerebras there as far as cost is concerned. They have no reason to try to beat the single-user performance of this.
Damn that's a big model and that's really fast inference.
I wonder if Cerebras could generate decent-quality video in real time.
Transistor (GPU) -> Integrated Circuit (WSE-3)
Is it just me, or is the most important contender on speed, Groq, missing from the comparison? Not sure why it matters to put Azure there; no one uses it for speed.