Llama.cpp can do 40 tok/s on M2 Max, 0% CPU usage, using all 38 GPU cores
by samwillis
In only a few years we are going to have close to this level of hardware on all laptops/desktops and maybe even mobiles. Manufacturers such as Apple will be driven by consumer desire for local LLMs, which are private and fine-tuned with personal data, to move in that direction.
Apple are particularly well placed to do this with the unified memory architecture of the M series processors.
Local LLMs are the future, not the cloud. I really wouldn't be surprised if Apple drop something along these lines on Monday. I will be looking out for an "oh, just one more thing" after the headset demos.
Their hardware development cycles are anything between 2-7 years, and even if the M3 was literally twice the performance of the M2, they would be unlikely to boost baseline RAM and storage on _any_ device to do that kind of thing this year.
Over time, sure. Maybe even on the top-tier Macs as a demo. But I wouldn’t expect anything better than LLaMa.
I would, however, expect them to come to their senses regarding how inadequate Siri is.
I fully expect the Apple Silicon team to throw out the roadmap 2 years into the future and start over because of AI breakthroughs this year. Or concurrently, build a new team to develop the SoC of the future for AI.
I fully expect the Neural Engine inside Apple SoCs to take up 80% of the transistors in the future, up from ~10% today.
Apple is kinda hamstrung with the M1. Their base model has 8 GB, only ~5 of which is addressable by the system if you have nothing else open. They need to bump up the minimum spec if they want local LLM support to feel seamless, especially on iPhone, where the memory constraints are even more dire.
Honestly, Apple doesn't strike me as any better or worse off than their competitors. Maybe AMD, given how late to the party they are. Intel has AVX fallback acceleration going back to Haswell, and libraries like OpenVINO offer extra acceleration paths for modern Intel CPUs. ARM has ARMnn for any multicore ARMv8 system, and of course Nvidia has CUDA support across multiple architectures and OSes. Projects like the ONNX Runtime are unifying all of these inferencing interfaces, giving developers multiple deployment options without resorting to walled-garden implementations. WebGPU may obsolete them altogether.
Local LLMs are awesome, but Apple has some work to do if they want to lead the pack.
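To illustrate the kind of unification ONNX Runtime offers, here's a minimal sketch; the model file "model.onnx" and the input shape are placeholders. It asks for CUDA or OpenVINO if those providers are present and falls back to plain CPU otherwise.

    # Minimal ONNX Runtime sketch: same model, whichever accelerator is available.
    # "model.onnx" and the (1, 3, 224, 224) input shape are placeholders.
    import numpy as np
    import onnxruntime as ort

    preferred = ["CUDAExecutionProvider", "OpenVINOExecutionProvider", "CPUExecutionProvider"]
    providers = [p for p in preferred if p in ort.get_available_providers()]

    session = ort.InferenceSession("model.onnx", providers=providers)

    x = np.zeros((1, 3, 224, 224), dtype=np.float32)  # dummy input
    outputs = session.run(None, {session.get_inputs()[0].name: x})
    print(outputs[0].shape)

The same script runs unchanged on an Nvidia box, a modern Intel laptop, or a bare CPU, which is exactly the "no walled garden" point above.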
The Mac Studio can support 128 GB of VRAM and the MacBook Pro supports 96 GB. (This may change tomorrow.) Technically Intel and AMD also have unified memory but I wonder if it actually works.
AMD also has AVX support on all recent CPUs.
Unless you have a CPU with 3D V-Cache, AMD doesn't have 800 GB/s of memory bandwidth.
Dunno, the vast majority of phones, tablets, laptops, and desktops have had 64- or 128-bit-wide memory interfaces for decades. Apple has a high-profit, vertically integrated market, unlike the competition. Apple skipped discrete GPUs and built their own, and the GPU is the primary justification for the 100, 200, 400, or 800 GB/sec available with today's Apple silicon.
Don't think Intel, AMD, or phone/laptop/desktop ARMs (from anyone but Apple) have any near-term plans for great memory bandwidth.
Sad. Obviously AMD can do it, they ship a much nicer memory interface on the PS5 and Xbox Series X, but don't ship anything over 128 bit until you move up to $$$$ workstations with Threadripper (256 bit), Threadripper Pro (512 bit) or Epyc/Genoa (768 bit) wide memory. Even the Epyc/Genoa socket @ 460 GB/sec doesn't match the M2 Ultra (800 GB/sec), despite the bare CPU being more expensive than most of the Mac Studios sold.
"Unified memory" itself is not special. Many products with AMD/Intel/Qualcomm chip implements LPDDR5 and use it for both CPU and GPU. What's truly different is that only Apple can sell higher performance iGPU (with higher bandwidth memory bits). I wish it changes by this AI boom
I'm sure the comment was in regards to the amount available for training/inference. The fact you can buy an M2 Max w/ 96 GB available to the GPU is remarkable, when the far more powerful 4090 desktop card for instance only has 24 GB.
Ryzen 7940 supports up to 256 GB in a laptop form factor, with AMD's AI cores adopted from Xilinx (which seem to have some quite interesting features), as well as CPU and GPU performance that beats the M2. I'm sure the race will continue, with wins for both sides and all of us. Unified memory is certainly a much less expensive way to enable large memories for GPUs and accelerators. It seems that all parties are introducing AI acceleration of some kind into CPU silicon within the next generation, which is exciting for the prospect of needing nothing more than a recent CPU to run inference on a large model.
>Ryzen 7940 supports up to 256 GB in a laptop form factor
This is CPU-only RAM. It's way too low bandwidth to be used in high performance AI inference. The point with unified memory is that it's high bandwidth and available to the CPU, NPU, and GPU at the same time.
It should beat the base M2. It's the highest-end AMD laptop chip compared to Apple's lowest end. However, I have doubts that the Ryzen 7940 has a faster GPU than the base M2.
And of course, the M2 is significantly better in perf/watt.
I'm not sure what you mean about CPU-only RAM, the Ryzen 7940 is an APU which integrates GPU and CPU and both draw from the same pool of ram, just like Apple silicon.
Benchmarks seem to put the 7940 ahead of even the M2 Pro: https://www.cpubenchmark.net/compare/5454vs5189/AMD-Ryzen-9-...
Performance per watt seems remarkably similar.
>I'm not sure what you mean about CPU-only RAM, the Ryzen 7940 is an APU which integrates GPU and CPU and both draw from the same pool of ram, just like Apple silicon.
Unified memory that is also high bandwidth.
>Benchmarks seem to put the 7940 ahead of even the M2 Pro:
Use Geekbench 6. It's closest to SPEC and optimizes well for both x86 and ARM.
Performance per watt is significantly in favor of the M2.
> Unified memory that is also high bandwidth.
I do think Apple is smart in using memory the way they are. It's the future. But AMD routinely posts higher performance numbers for CPU and GPU because their large caches boast similarly high memory bandwidth figures and AMD's cores have significantly higher IPC. I also think Apple's trick of extending the ARM ISA to allow for x86 style memory alignment is slick for fast emulation. Credit where it's due. Apple's caches are also very fast, just not as large as AMD's. In short, memory bandwidth is just one measurement of the bandwidth along the entire disk bandwidth <-> disk cache bandwidth <-> PCIe bandwidth <-> main memory bandwidth <-> L4 / L3 / L2 / L1 cache bandwidth hierarchy, and details about cache coherency, associativity, eviction strategies, branch prediction, and other sundries matter along the way.
My 8 core Ryzen 5800x3D certainly compiles code a LOT faster than my 8 core M1 Mac despite the difference in main memory bandwidth.
More memory is almost always going to mean that memory is slower, and Ryzen can handle more memory. Ryzen's architected with large caches and prefetching and all kinds of other goodies with that in mind.
> Use Geekbench 6. It's closest to SPEC and optimizes well for both x86 and ARM.
Geekbench is a synthetic benchmark. I'd prefer real workloads like compiling code or tokens / s inference.
>and AMD's cores have significantly higher IPC.
This is most certainly wrong. Apple Silicon has way higher IPC. AMD chips clock much higher, which uses much more energy. Hence, Apple Silicon is much more efficient.
>My 8 core Ryzen 5800x3D certainly compiles code a LOT faster than my 8 core M1 Mac despite the difference in main memory bandwidth.
Code compilation is not bound by memory bandwidth. But AI training and inference is. Hence, we're talking about it.
>Geekbench is a synthetic benchmark. I'd prefer real workloads like compiling code or tokens / s inference.
You sent me a link to a synthetic benchmark. All I did was link you to a better and more accurate one.
Also, Geekbench performs real-world workloads. You can easily look at the individual scores for each test, besides the main score.
I looked around for AI specific benchmarks. Seems Openbenchmarking.org has quite a few. The 7940 is just making its way onto the market, so I chose the closest desktop equivalent with a similar number of cores and memory bandwidth figures. Notably, the 7700 _does not_ have the AI acceleration cores of the 7940, meaning the 7940 will post much higher performance in that area.
https://openbenchmarking.org/vs/Processor/Apple+M2,AMD+Ryzen...
Sure seems like the M2 is very far behind the 7700 in performance in all areas. Farther than can be explained by the difference in MHz or package power.
> In only a few years we are going to have close to this level of hardware on all laptops/desktops and maybe even mobiles
I'm surprised they have been holding back on improving their GPUs if this is true. Was there any technical achievement in the recent years to enable this leap?
> I really wouldn't be surprised if Apple drop something along these lines on Monday
Apple is really slow at catching up on things, while they are good at introducing "new" things.
I don't see a ton of demand for LLMs (especially offline/local ones) unless they fact check themselves. I don't see how there's a need for nonsense generators if you can't trust the output. I don't think Siri would've gotten popular if 50% of the time you asked her a question, she made up a fake/wrong response.
> I don't see how there's a need for nonsense generators if you can't trust the output.
There's lots of use cases where you know the answer and can verify it yourself. Knowledge lookup is not the only use. For example "rephrase this text", or "give me ideas for X" just save you time and if the answer is wrong you can try again.
Fact checking is important, but a lot of use cases don't need fact checking; imagine an LLM as an assistant to write fiction or poetry, or even some kind of rubber-duck debugging partner or brainstorming partner where the user supplies all pertinent facts.
I don't see LLMs replacing traditional Search, but they still have plenty of other use cases.
> or brainstorming partner where the user supplies all pertinent facts.
and then it recommends a 3rd party library to solve your problem that doesn't exist
Even writing fiction needs some grip on reality. Supplying all pertinent facts is usually too tedious for users.
> driven by consumer desire for local LLMs
Citation needed. I don't think the general public has any desire/care for that. Like they don't care about a locally run Siri.
I would love a local Google Home. The response time is abysmal and the results are often poor. I should at least be able to control my home lights without internet.
Individuals may not, but businesses absolutely care that their data isn't accessible to third parties
Really? Most businesses don’t seem to have any trouble uploading data to AWS.
There's a massive difference between companies providing you services where they interact with your data only and AWS which provides you infrastructure and doesn't care much what you do with it. There's not a single case I'm aware of where someone complained about Amazon accessing their private data in AWS and using it in any way.
Walmart refuses to use AWS, for what it is worth. And no, I don't think there is a functional difference between what AWS does and what is described above.
That makes sense for Walmart. They don't want to sponsor the competition and the risk of leaking secrets may be unacceptable, even if you trust AWS operations to behave properly. I think Walmart is a complete outlier in this topic and they wouldn't use AWS even if homomorphic encryption was realistically usable for all their software.
On the other hand, Netflix did (does?) use AWS.
I think local LLMs, if available and equivalent to GPT-4, will most certainly be preferred over ChatGPT.
> by consumer desire for local LLMs
Consumers don't care where the LLMs are run, but if companies can offload the computing costs onto the consumers, they will save millions per month. The other big limiter for local LLMs is storage and network costs.
> Local LLMs are the future
I think the issue with local LLMs will be how businesses can talk to them. How do I use my local LLM with HN, Twitter, Shopify? Can my local LLM help me request a refund from a Shopify order or would Shopify have to fine-tune their own model? Then what? I download a 300GB file just so I can refund an order?
Consumers care about where the LLM is run because of the UX: People hate how current assistants randomly break because you have a spotty signal, or take forever to respond because it's making a round trip across the internet to actually reply, etc.
They don't need to know how you solved that, but local LLMs are the way to solve it.
> Can my local LLM help me request a refund from a Shopify order or would Shopify have to fine-tune their own model? Then what? I download a 300GB file just so I can refund an order?
LLMs are like an implementation detail: Shopify wouldn't have a single "Shopify" model, but they might have models for specific things they need to do to generate value for themselves. Likewise you might have some local model that acts specifically like a personal assistant and is able to initiate a refund for you.
OpenAI's plugin spec already works for advertising capabilities to LLMs you don't own: https://server.shop.app/.well-known/ai-plugin.json
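As a rough sketch of what that discovery step looks like (the field names follow OpenAI's published ai-plugin.json manifest format; the final hand-off to a local model is left out):

    # Fetch a merchant's plugin manifest and pull out what an LLM needs to know.
    # Field names follow OpenAI's ai-plugin.json spec; error handling omitted.
    import json
    import urllib.request

    MANIFEST_URL = "https://server.shop.app/.well-known/ai-plugin.json"

    with urllib.request.urlopen(MANIFEST_URL) as resp:
        manifest = json.load(resp)

    print(manifest["name_for_model"])         # short name the model refers to
    print(manifest["description_for_model"])  # capabilities described in prose for the model
    print(manifest["api"]["url"])             # OpenAPI spec the model can plan calls against

An LLM, local or remote, only needs the description and the referenced OpenAPI spec to start planning requests; nothing about this requires the model itself to live on Shopify's servers.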
> Consumers don't care where the LLMs are run, but if companies can off-board the computing costs onto the consumers, then they will save millions per month
The incentive may actually be the other way around. If it has to be run in the cloud, it has to be a service. And selling a service is more attractive to companies (even Apple nowadays) than selling a piece of hardware.
> Can my local LLM help me request a refund from a Shopify order or would Shopify have to fine-tune their own model? Then what? I download a 300GB file just so I can refund an order?
Clearly this isn't an XOR proposition. Use a web query to get relevant data (i.e. to generate a dynamic prompt), then use the local model (with the private data) to do the calculation. Repeat ad nauseam.
Example: please get me the same vegetables I ordered last week from Whole Foods.
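A loose sketch of that loop, where fetch_order_history and local_llm are hypothetical helpers and the store API is imaginary; the point is only the shape of the flow:

    # Hypothetical hybrid flow: remote query for context, local model for the
    # private reasoning. fetch_order_history and local_llm are placeholders.
    def fetch_order_history(store: str, since_days: int) -> list[str]:
        ...  # remote call: item names from recent orders

    def local_llm(prompt: str) -> str:
        ...  # runs entirely on-device and can see private data

    def reorder_vegetables() -> str:
        items = fetch_order_history("whole-foods", since_days=7)
        prompt = ("From this list of items I bought last week, pick out the "
                  f"vegetables and draft a reorder request: {items}")
        return local_llm(prompt)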
"My god! It's full of API Keys."
Anyone have comparisons to how many tokens/s a laptop with a mobile Nvidia gpu gets? Considering the cheapest M2 Max laptop in the US costs almost $3k, you can easily get a laptop with a 3090/4090 Nvidia mobile GPU.
Yes, I would like another laptop to compare this against, as by itself it is a bit meaningless to me. Total power draw would be interesting to know as well.
However, I suspect the unified memory architecture of the M2 is a big benefit here, and this M2 Max system is presumably either a 64 or 96 GB model?
Even the very highest-end Nvidia dGPU-powered laptops won't have anywhere near that amount of GPU memory available, and I don't know what the performance impact of using system memory vs GPU memory is on such systems.
I like to use the OpenCL benchmarks as a rough point of comparison: https://browser.geekbench.com/opencl-benchmarks
M2 Max lands just around the 3060 mobile performance profile in this instance, it would be curious to see how the tokens/s reflect that.
It's a little bit more nuanced than that. First, OpenCL is basically deprecated at this point and only Metal is up to date.
Second, M2 Max might have a lot more RAM available than a 4090 because of its unified memory architecture. It can have up to 96GB vs 24GB for the 4090. So we need to look at different RAM configurations.
> OpenCL is basically deprecated at this point
By Apple.
> unified memory architecture
Not really a bottleneck when splitting layers across PCIe is an option. You're only constrained by PCIe bandwidth and memory speeds, neither of which is really slow enough to meaningfully impact AI inferencing performance. Honestly, disk speeds are the #1 AI bottleneck I've seen on older systems.
> It can have up to 96GB vs 24GB for the 4090
Good, at the M2 Max's price point I could almost afford 4x 3090s anyways. I'd only need one to beat it in inferencing performance though.
I find Apple's performance with LLMs pretty exciting. At the very least, it promises that we won't be stuck with the VRAM pricing and capacities offered by AMD and Nvidia to run local LLMs. We should see some big improvements in the near future as LPDDR6 is around the corner and Apple has a big focus on GPU.
For those of us with Twitter.com blocked in local DNS, what size model is it getting 40 tok/s on?
That's a better link than the twitter posts.
The tweet is literally just this.
From the thread:
> Watching llama.cpp do 40 tok/s inference of the 7B model on my M2 Max, with 0% CPU usage, and using all 38 GPU cores.
> Getting 24 tok/s with the 13B model
> And 5 tok/s with 65B
> And 5 tok/s with 65B
Which is still incredible. I recently tested llama.cpp on a dual Xeon server with 24 cores and 70 GB of RAM, and on the 65B model it took something like 15 seconds for one token.
GPU seems the way to go here
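For anyone who wants to reproduce the GPU-offload numbers, a minimal sketch using the llama-cpp-python bindings; the model path is a placeholder, the package needs to be built with Metal support on Apple Silicon, and n_gpu_layers=-1 asks for every layer to be offloaded:

    # Minimal llama-cpp-python sketch; model path is a placeholder.
    # n_gpu_layers=-1 offloads all layers to the GPU (Metal on Apple Silicon).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/7B/ggml-model-q4_0.bin",
        n_gpu_layers=-1,
        n_ctx=2048,
    )

    out = llm("Q: Name three uses for a local LLM. A:", max_tokens=128)
    print(out["choices"][0]["text"])

With verbose output enabled, llama.cpp prints its timing summary at the end of the run, which is where tokens-per-second figures like the ones above come from.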
Exactly, 5 tok/s is just fast enough for real-time speech (conversational speech runs around 150 words per minute, which works out to roughly 2-3 tokens per second).
Gerganov is probably going to find himself hired away by Apple Inc. at this point.
Hopefully he can go somewhere where he can keep doing good public work, and not have to disappear into the void.
Either that, or he will hire Apple at some point :)
It doesn't say how much RAM the M2 Max has. How much does RAM matter? As we all know, the M2 Max can have up to 96 GB of unified memory, which is 4x more than an Nvidia 4090.
Imo, the future is millions of micro-GPUs on a chip with a simplified instruction set. Such mGPU clusters will be called neuromorphic chips.
Btw, it's time to measure the performance of LLMs in watts.
Sounds like what George Hotz is trying to do with tiny corp
No longer trying, I suppose.