"Instead of building a custom command processor, Intel uses Movidius’s 32-bit LEON microcontrollers. LEON use the SPARC instruction set and run a real time operating system."
Your Intel CPU now runs two ISAs and three OSes. The Management Engine runs Minix (and used to use the ARC ISA).
More than two. It has x86_64 on the main CPU cores, x86 on the ME, SPARC on this NPU and whatever the iGPU ISA is called. And I wouldn't be surprised if there's an ARM microcontroller or three in there as well.
Didn't some keyboards and mice use embedded Z80 or 6502 CPUs?
Clones of the 8051 are pretty popular in that scenario. I know for sure at least one mouse manufacturer uses them, and probably a ton more do.
One part of me is secretly glad SPARC lives on
It's always weird to see Intel using "foreign" ISAs, but not too surprising for something they acquired. The ME's use of SPARC and ARC before that is more unusual; they could've at least used a 386/486-class core, or even the i860/960 if they wanted something RISC.
Pretty sure the ME is Intel Quark based now. Quark is Intel's weird 32-bit x86-as-a-microcontroller chip.
A 32-bit Linux-capable microcontroller at that. I still keep my Galileo board. Will probably assemble it into a picture frame and run a heterogeneous cluster with it and some of the other boards I accumulated over the years.
This is Gibson, the 4 RPi Zero W's running Docker Swarm: https://twitter.com/0xDEADBEEFCAFE/status/172986300775544462...
> This is Gibson, the 4 RPi Zero W's running Docker Swarm: https://twitter.com/0xDEADBEEFCAFE/status/172986300775544462...
You did it, you hacked the Gibson!
Please tell me it has a slick 90s neon low poly 3D interface.
No. It's just an unassuming IKEA picture frame, but I can ssh into them.
Row row row your boat
Quark is basically a 486, with some Pentium instructions added.
It must be a bit depressing for Intel engineers that they are forced to keep themselves tied to the ever-extending x86 ISA and can't explore all the other ISAs they use (or invented). The i860/i960 generated a lot of buzz back then, and there was some pretty capable hardware using them.
Fortunately, with open source becoming dominant in the data center and the relative ease of bootstrapping a new ISA on top of it, we will see some extra diversity in this space, but I don't think Intel will abandon x86 anytime soon (and "soon" is on almost geological timescales here).
Intel has been thinking in detail about a streamlined x86 architecture that would ditch all of the cruft: 64-bit only, no memory segmentation, no real mode.
https://www.intel.com/content/www/us/en/developer/articles/t...
Intel could just as well become a pure fab and give up on designing processors. Switching to another ISA might tie them up for years before they could deliver state-of-the-art processors again. Intel can't afford that after recent years' delays in switching to new nodes.
This is very common. There are lots of companies around (e.g. Cadence) that sell SIPs as synthesizable RTL for easy integration into chip designs, and lots of hardware companies buy these SIPs.
Indeed, silicon design has long been in its NPM left-pad phase. Which raises the question of why it is still so open-source-averse and patent-pilled.
The article talks about AI development workloads and how there's little need for a low-power, low-performance accelerator compared to a high-end GPU, but I think the point is end-user workloads.
Software is coming with AI integrations: IDEs, spreadsheets, word processors, graphics programs. These absolutely will benefit from low-power AI acceleration. Sure, if I want best-in-class acceleration I'll buy the damn 300 W GPU. But really I just want some autocompletion without burning my lap or requiring cloud connectivity.
I always feel that accelerators are in an awkward spot, because at some point with a sufficiently powerful CPU most workloads finish quickly enough.
And let's be real, even lowly Core i3s these days are overkill for most people, let alone mid-tier offerings like Core i5s, which are the most popular with commercial users.
> And let's be real, even lowly Core i3s these days are overkill for most people
They would be, if it weren't for the egregious insanity of software bloat, particularly the web-based stuff.
It makes systems "feel" way more responsive. It's a huge quality of life increase. Apple Silicon's user/product story has basically changed how people view this stuff forever.
What does the Apple Silicon (NPU?) accelerator do? I mean, what is it used for?
- Face and object detection in photos and videos
- Computational photography features like Night Mode and Smart HDR
- Real-time video processing for effects and stabilization
- Object and action recognition in videos
- Faster on-device speech recognition and natural language understanding for Siri
- Real-time translation, object detection, and scene understanding in AR
Isn't a big advantage of an accelerator that it's more optimized to do this sort of work than a CPU? This NPU is a tiny dot on the Meteor Lake die, while the CPUs take up a whole tile and account for much of the power usage of the SoC at any given time.
That’s the idea but it looks like this might not pull it off unless Intel aggressively gives away optimized libraries. If it’s too hard to get to the point where it’s actually competitive, most developers are going to use the faster iGPU.
You hit the nail on the head. Software, libraries, dev tools, profiling resources, mindshare... And for Intel, a good deal of sustained interest. For those of us who used and programmed SHAVE cores, watching everything slowly close down behind OpenVINO, with no apparent interest from Intel in years, the writing was on the wall. Repeatedly, they taught us: use nothing from Intel except mainline x86_64.
What’s sad is that I remember the same lesson from Itanium: to preserve a small revenue stream from their compiler and optimized library business, they made almost every benchmark even worse at a time when they most needed businesses to be comfortable switching.
I don't think the NPU is tiny, just much more power efficient than the GPU or CPU for the right kind of tasks.
4 TFLOPS is significantly faster than your CPU. Even 1.35 TFLOPS represents a reasonable edge. The question should be why Intel added another processor instead of extending AVX.
I wouldn't call FP16 "FLOPS". Is anyone using them for non-AI workloads (e.g. compute)?
It's better than the TOPS nonsense where accuracy is completely thrown out the window. The quantization schemes used by llama.cpp do not operate directly on integers. They dequantize back to floats at runtime.
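For a rough sense of what that means, here is a hedged Python sketch of a Q8_0-style block format (one fp16 scale plus 32 int8 values per block, loosely modeled on llama.cpp; the names and block size are illustrative). The multiply-accumulate still happens in floating point after the weights are expanded:

```python
import numpy as np

BLOCK = 32  # weights are grouped into small blocks; 32 is typical for Q8_0

def quantize_q8_0(x):
    """One fp16 scale plus 32 int8 values per block."""
    x = x.reshape(-1, BLOCK)
    scales = (np.abs(x).max(axis=1) / 127.0).astype(np.float16)
    q = np.round(x / scales[:, None].astype(np.float32)).astype(np.int8)
    return scales, q

def dequantize_q8_0(scales, q):
    """At compute time the int8 weights go back to float before the matmul."""
    return scales[:, None].astype(np.float32) * q.astype(np.float32)

w = np.random.randn(4096).astype(np.float32)
scales, q = quantize_q8_0(w)
w_hat = dequantize_q8_0(scales, q).reshape(-1)
print(np.abs(w - w_hat).max())  # small reconstruction error, not bit-exact
```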
You have to forgive me, but these NPUs were specifically designed for inference. You can indeed use them as general purpose accelerators, but I think you will struggle to find an application for them.
Yeah, I wish we'd say what kind of FLOPS we're talking about. And don't get me started on 'AI FLOPS' or TOPS...
>The question should be why Intel added another processor instead of extending AVX.
It seems that the efficiency cores don't support many of the AVX extensions. You have to turn those cores off if you want to use AVX-512, for example.
That worked only on the first batches/microcode versions of the hybrid CPUs; since then, Intel has completely disabled AVX-512 on them.
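This is also why AVX-512 code paths have to be probed at runtime rather than assumed. A minimal sketch, using the third-party py-cpuinfo package (not anything Intel ships):

```python
# Probe CPU feature flags at runtime before dispatching wide-vector kernels.
# On current hybrid parts with updated microcode, avx512f is not reported
# even though the P-cores physically have the units.
import cpuinfo

flags = set(cpuinfo.get_cpu_info().get("flags", []))
if "avx512f" in flags:
    print("AVX-512 available: dispatch AVX-512 kernels")
elif "avx2" in flags:
    print("falling back to AVX2 kernels")
else:
    print("scalar/SSE fallback")
```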
Oh that's easy.
The registers in this CPU are all quantumly entwined, with Intel having updates signed.
Thus, Intel can one-way update one or all CPUs remotely. No Faraday cage will protect you.
Further, this CPU has access to caches on all other cores, so Intel can dynamically inhibit, or modify code as they please.
We'll find this out in 2027, I dare not say more, they'll track me through time and afg#.<NO CARRIER>
So how do you program these things? I tried looking into the AMD equivalent and the development story was very primitive. Is the future going to be programming against a different opaque binary blob for each NPU one wants to support?
It is probably meant to be used through OpenVINO, which is open source, so I don't expect an obscure blob.
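For what it's worth, a minimal sketch of what targeting the NPU through OpenVINO's Python API looks like. Device naming and the level of NPU support depend on the OpenVINO release and driver, and "model.xml" is a placeholder for an IR file you already have:

```python
import numpy as np
import openvino as ov

core = ov.Core()
print(core.available_devices)        # e.g. ['CPU', 'GPU', 'NPU'] if the driver is installed

model = core.read_model("model.xml")
compiled = core.compile_model(model, "NPU")  # swap in "CPU" or "GPU" if no NPU is present

infer = compiled.create_infer_request()
dummy = np.zeros(list(compiled.input(0).shape), dtype=np.float32)
result = infer.infer({0: dummy})
```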
Looks like they repurposed Movidius SHAVE architecture and added a systolic MXU. Makes sense, if they can sort out their software story
That is the big "if", to be sure. I spent three years working for Intel after they acquired Vertex.AI, trying to help build that software story, and it was rough going; we didn't accomplish much, and it doesn't sound like they're using any of it.
I was impressed by their _CPU_ kernels in OpenVINO which I studied very closely after discovering I cannot beat their performance even on kernels specialized for my specific tensor shapes. Turns out the reason I can't beat them is because they jitted their kernels with xbyak, and specialized them for those shapes, too. They also obviously knew exactly how to maximize the ALU utilization, and counted every cycle, something that's very hard to do unless you specialize _just_ on CPU performance and have access to the internal documentation. Top notch work. But that team used to be in Russia back then, in Nizhny Novgorod, IDK if it's still a part of Intel.
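xbyak is a C++ JIT assembler, so the real kernels are emitted as specialized machine code; purely as a conceptual analogue of the shape-specialization trick (the names below are mine, not OpenVINO's), a cache of kernels with the tensor shapes baked in might look like:

```python
import numpy as np

_kernel_cache = {}

def get_matmul_kernel(m, k, n):
    """Return a matmul kernel with the (m, k, n) shapes baked in as constants."""
    key = (m, k, n)
    if key not in _kernel_cache:
        # Generate source with the loop bounds as literals, then compile it once.
        # A real JIT (xbyak) would emit specialized machine code instead.
        src = f"""
def kernel(a, b, out):
    for i in range({m}):
        for j in range({n}):
            acc = 0.0
            for p in range({k}):
                acc += a[i, p] * b[p, j]
            out[i, j] = acc
"""
        ns = {}
        exec(compile(src, f"<kernel_{m}x{k}x{n}>", "exec"), ns)
        _kernel_cache[key] = ns["kernel"]
    return _kernel_cache[key]

a, b = np.ones((4, 8), np.float32), np.ones((8, 3), np.float32)
out = np.empty((4, 3), np.float32)
get_matmul_kernel(4, 8, 3)(a, b, out)  # repeated shapes reuse the cached kernel
```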
Yes, the _CPU_ perf of OpenVINO is a thing of beauty. Debuggability is a nightmare story, but if it works you're GOLD. The GPU/Myriad story was very clunky the last few times I tried.
I hope some Intel PHB reads this and purely stochastically makes the right decision for the first time in his/her life. It seems pretty existential to be competitive at selling shovels during a gold rush.
Looking at OpenVINO repo activity, seems that most of them got relocated and still work at Intel.
There's a Matt Pharr-level story to tell here if you feel like it one day. I'm also curious what became of the very cool maxas authors. Intel seems like a dark hole of blackness and holeness for acquired tech these days.
1024 FP16 MACs is pretty good, but the 128-bit vector datapath is weak sauce. On the other hand, 2 MB of SRAM is legit. I wonder how many tiles there are (I don't think it's in the post).
I think the main advantage of Movidius was its memory architecture. They had to disclose the HW arch for some sort of a tender in the EU, which is how I stumbled upon their HW documentation years ago. Basically, IIRC, even though you're right that 128-bit is weak sauce, the strength of that arch was that the memory was much "closer" to the cores and it was partitioned and accessed by multiple cores at the same time, boosting overall available bandwidth. The weakness was that it was (IMO) overly complicated for no good reason, and it imposed a rather inflexible programmability model: if their software layer didn't do certain things, you couldn't do them at all. That was a problem before transformers, because models tended to use a greater variety of ops.
> Even economy class plane seats have power outlets these days. Hopefully Intel will iterate on both hardware and software to expand NPU use cases going forward.
I can totally see myself doing inference on my phone, but arguably this is not Intel's territory.
I am sure Intel will improve the shortcomings in the next generation, so it is a good start. But I wonder how this will compare with Qualcomm's new PC CPU (Snapdragon X Elite) which also has its own NPU.
They should all be roughly equivalent by the end of the year. Qualcomm is launching this fall, and Intel has plans, but we know how well they deliver.
Not the best time for testing OpenVINO, at least; they plan to release NPU support in 2024.1 as I understand it, and the current version, 2024.0, has very limited NPU support, if any.
Looks like a great competitor to the Apple Neural Engine!
I just bought a new computer for the love of lord...
It's okay, his takeaway at the bottom is that any PC with a midrange GPU is already an "AI PC".
> Even economy class plane seats have power outlets these days.
“Do you guys not have power outlets? Yeah, everybody’s got a power outlet.”
What a time to be alive.