I've been working with people at GCP for months to get the right provisioning for TPUs for my company that spends many millions per year on GPU compute.
They take weeks to respond to anything, they change their minds constantly, you can never trust anything anyone says, their internal communication is a complete disaster, and someone recently told me they outsource a lot of their GCP personnel.
We went back to AWS and had a whole fleet of GPUs up and running within the week.
This is my 3rd extremely bad experience on Google Cloud. My last unicorn startup had several GCP-caused P0 production issues. They would update something internally with no announcement to customers and our production workloads would completely break out of the blue. It would usually take them days to weeks to fix it even with us spending tens of millions with them and calling support constantly. Everyone at our company was baffled at how bad the experience was compared to every other cloud provider.
I would not put anything serious there and I would never partner with GCP again.
You must not have been spending millions per year. A friend's company spends 10 million/year with GCP, which isn't huge, and can have an engineer from any group in a meeting the next day after a high-priority issue.
How frequently are you engaging your account reps? You should be able to get the ear of a PM within 48 hours in most cases.
I used to work on cloud infra at an org that spent many millions on GCP (I'm certain ~everyone reading this comment would recognize it by name) and was working on adding more zeros to that number. We spent years working on this and still ended up switching to AWS in the end, largely because of support but also GCP's anti-customer business practices in general. We had weekly meetings with our account reps in person at our office but even they weren't able to get things done internally. No fault of their own though!
We were trying to move to their platform because we needed high-scale TPU or GPU compute. We were not existing customers. We currently spend all those millions elsewhere (AWS, LambdaLabs, Coreweave, and many more) and will continue to do so.
At my previous company, we even had an announced partnership with Google and did lots of co-marketing. That doesn't mean much when a GCP engineer changes something in GCP and breaks your production.
I wish it worked out. I liked working in GCP outside of these problems. I really like GKE.
FWIW, my experience in AWS, as far as "engineer changes something and breaks your production," is not without its own set of issues. Perhaps the AWS version of this story hasn't impacted you, but AWS also constantly tweaks its services, sometimes to the detriment of its users.
I totally believe you. It could absolutely be bad luck / good luck.
I think if you search across the internet you'll find a lot more GCP horror stories than AWS ones. Considering the relative market share of the two platforms, I'd expect the opposite to be true.
References please.
You want me to provide references for my personal experience?
Not necessarily. I would like examples of where "...AWS also constantly tweaks its services, sometimes to the detriment of its users..." so I can be better informed, and make better decisions within my projects.
And you thought posting, “references please” would be the best way to get that info?
Also those are examples, or details, even. Not references.
Are you going to mention, describe, or post some examples... or keep focusing on the way I asked for it?
No, because you've been a jerk during this conversation.
I've seen issues with RDS after an update. We were seeing CPU max out; adding more readers didn't help, and in the end, we had to rebuild the entire cluster. This somehow fixed the issue, and CPU use returned to normal levels; we didn't make any code changes to the app.
What company is this? Curious if you'd be open to talking about how you manage compute across providers.
I'm not GP but have been in this boat. We tried a number of different approaches, and kubernetes was by far the most successful. Terraform to provision the k8s clusters, workloads deployed to k8s. With OpenShift it gets even better, though I left the project before we finished implementation so I can't say how it went in prod. Early tests were very good though. If you're feeling bold you can do a "stretch" cluster which has nodes in different data centers (some on-prem for example, some in the cloud, or all in the cloud but different zones). The latency between the masters and nodes can cause problems though so I wouldn't separate them geographically very far.
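To give a flavor of the "same workload, many clusters" part (the cluster provisioning itself was plain Terraform, which I'll skip), here's a rough Python sketch using the official kubernetes client. The kubeconfig context names, image, and replica count are made up for illustration:

    # Push the same Deployment to several clusters by kubeconfig context.
    # The context names below are hypothetical; the clusters themselves
    # would already exist (e.g. provisioned by Terraform).
    from kubernetes import client, config

    CONTEXTS = ["aws-prod", "gcp-prod"]  # hypothetical kubeconfig contexts

    def make_deployment(name="inference", image="registry.example.com/asr:latest"):
        container = client.V1Container(name=name, image=image)
        return client.V1Deployment(
            api_version="apps/v1",
            kind="Deployment",
            metadata=client.V1ObjectMeta(name=name),
            spec=client.V1DeploymentSpec(
                replicas=2,
                selector=client.V1LabelSelector(match_labels={"app": name}),
                template=client.V1PodTemplateSpec(
                    metadata=client.V1ObjectMeta(labels={"app": name}),
                    spec=client.V1PodSpec(containers=[container]),
                ),
            ),
        )

    for ctx in CONTEXTS:
        api = client.AppsV1Api(config.new_client_from_config(context=ctx))
        api.create_namespaced_deployment(namespace="default", body=make_deployment())
        print(f"deployed to {ctx}")

The nice part is that the workload definition is identical everywhere; only the context changes.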
Yeah, I spend a small fraction of that and have found GCP support through our account rep to be extremely good, at least on par with what AWS provides. Maybe the particular rep makes a big difference?
The parent said "my company that spends many millions per year".
They said that they spend millions per year on GPU compute, not that they spend millions per year with GCP. They later said that they did not spend that money on GCP.
That makes sense because GCP doesn't have all that much capacity for scale-out GPU compute (they offer TPUs for that).
We were trying to get TPUs. We work in audio AI (ASR, TTS, translation, etc). When we saw whisper-jax, we wanted to test the viability of the TPU platform. From our napkin math, the cost/performance ratio seemed as great as these TPU blog posts always claim.
I posted this in another comment, but the default provisioning for V4 and V5 TPUs is ZERO. They don't tell you this anywhere. So when we'd try to allocate V5 TPUs on our GCP account, it would just fail with a generic error and a huge error number that led to nothing in a search.
So I reach out to our GCP rep we had been working with. After about 10 days and 3 follow-up emails/calls that went unanswered, she replies, "you have to fill out this form." I click on it and it's a Google Form. The same type of Google Form you and I can make.
I submit it. To this day I have heard nothing back. I reach out to various executives at GCP we had been talking with. They said, "you have to fill out a form." I tell them we did, and they say, "oh, it's usually pretty fast." I heard this so many times from so many people. It seems that not a single person who actually works in sales or account management knows how this process works.
When I finally got a response, they told me that "my account has no billing account associated with it." I showed them that yes, we do; it turned out they had been trying to provision the wrong account. It took another couple of weeks to get a follow-up response.
Luckily they eventually connected us with one of their partnered consultants who was finally able to help us, but by then we just decided to go back to GPUs on other platforms because it was such a miserable experience and in that time, all of our providers came through with the volume we needed.
I understand that and am calling that claim into question.
And the call-out turned out to be right: they clarified that they don't spend those millions with Google. So GP was being honest, but the skepticism led to an important clarification, because I read it the same way as the parent.
I see, my bad!
If you’re spending millions with GCP you obviously have premium support and quite likely a dedicated technical account manager.
Whilst free GCP support is utter rubbish, you will get several Googlers, including a direct line of comms to product management, if you're spending that much.
I don’t believe your claims
I should also add that the default provisioning for newer generation TPUs is ZERO, despite it not saying this anywhere on the sales pages. We had to dig deep to find this out.
How do you get access? Not by talking to anyone at Google, but by filling out a Google Form. Yes, a Google Form.
I spent many hours talking to everyone from support to executives at GCP. This is not an exaggeration. They all told me to fill out "the form." They said it "usually goes pretty fast." They don't even know the people or the department who approves it.
It took us months to get provisioned. By that time my company completely lost interest as we cannot move that slow in the most competitive space in technological history.
Also one last thing, we never even got notified once we were provisioned! There's no response to your form, you just have to keep checking.
Just absolutely wild amateur stuff.
Lots of things start out as a form with a human on each end. Before long, one or other side of the form automates stuff using scripts.
Before long, the form is the 'API', but both sides are using automation and the Google Forms backend is starting to rate limit responses and someone is forced to develop a proper API, try to track down all the form users, and try to persuade them all to migrate to the new API.
Busywork.
My current company is not spending millions with GCP, but we were trying to. We spend a lot on GPUs across AWS, Lambda Labs, and colo data centers.
At my previous company we had dedicated account managers of course. You are completely correct that we had line of communication to engineers working on the problem. That just doesn't mean very much when they don't know how to fix the problem quickly.
At that company we were multi-cloud (AWS, GCP, Azure, Rackspace, Hetzner) and had even more workloads on AWS. Never had a problem as bad as the problems we had on GCP.
We've spent much less than this and our support has been excellent.
I am surprised that they could have you up and running so fast.
Given the worldwide shortage of AI compute and the extreme demand, how is Amazon able to onboard new clients and have fleets of GPUs ready to go?
Does Azure/AWS/Google have a huge collection of AI compute that is just waiting?
Are these machines just the ones running the custom Amazon or Google accelerator chips?
There's a surprising amount of A100s available right now, everyone wants H100s very badly. Those are still really hard to find.
We also need GPUs for two things: training and inference. Our training needs a bunch of A100/H100 or large TPU nodes. But our final models are all pretty small (we're in ASR, TTS, translation, etc) relative to most commercial LLMs.
So we run a large amount of inference on GPUs like T4s, L4s, etc that are much easier to obtain.
Our intent was to find a unified, scalable platform we could both train and run inference on that didn't have the cold-boot problem AWS GPU instances have when our inference needs go beyond our dedicated hardware. With AWS, the P99 cold boot times were over 5 minutes, which meant we'd have to queue inference and we'd end up paying for 5 extra minutes of GPU time. That isn't a big deal on its own, but when you're spinning thousands of GPUs up and down it adds up to a ton of money every month.
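To put rough numbers on that (all of these are illustrative assumptions, not our actual rates or volumes):

    # Back-of-envelope cost of the cold-boot tax; every number here is an
    # assumption for illustration, not our actual figures.
    cold_boot_minutes = 5        # roughly the P99 cold boot time mentioned above
    boots_per_month = 10_000     # assumed: "thousands of GPUs up and down"
    gpu_hourly_rate = 1.20       # assumed $/hour for an inference-class GPU

    wasted_hours = boots_per_month * cold_boot_minutes / 60
    wasted_dollars = wasted_hours * gpu_hourly_rate
    print(f"~{wasted_hours:.0f} idle GPU-hours, ~${wasted_dollars:,.0f} per month")
    # at these assumptions: ~833 idle GPU-hours, ~$1,000 per month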
Since our models require far fewer kernels than most modern LLMs, TPUs seemed like a really good fit. But after dealing with GCP for the last few months, we abandoned our plans to port our model to JAX and decided to stick with GPUs for now.
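For anyone curious, the JAX-on-TPU starting point itself is simple; a generic smoke test (standard JAX, nothing specific to our models) looks roughly like this:

    # Generic JAX-on-TPU smoke test: confirm the runtime sees the TPU chips
    # and run a jitted matmul. Not our actual workload.
    import jax
    import jax.numpy as jnp

    print(jax.devices())  # on a TPU VM this should list TpuDevice entries

    @jax.jit
    def matmul(a, b):
        return a @ b

    key = jax.random.PRNGKey(0)
    a = jax.random.normal(key, (4096, 4096), dtype=jnp.bfloat16)
    b = jax.random.normal(key, (4096, 4096), dtype=jnp.bfloat16)
    out = matmul(a, b).block_until_ready()
    print(out.shape, out.dtype)

The hard part is porting the actual training and inference code, not getting JAX to see the chips.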
Same poor experience in two independent contracts that were as large or larger. Google culture is just flatly against customer service. When it comes to Google Cloud, the attitude is: sorry, you're just too dumb to understand the beauty of our offerings. And then they admit it will take 6 months to get their internal k8s cluster upgraded.
I was much more impressed with Google before I started working with them at my last company. It was a large software/service agreement with a team "dedicated" to us, and... yeah, we ended up writing it ourselves and bringing it in-house, the response time was so bad.
GCP's support in a nutshell
Fuck yooooouuuuuuuu. Fuck you, fuck you, Fuck You. Drop whatever you are doing because it’s not important. What is important is OUR time. It’s costing us time and money to support our shit, and we’re tired of it, so we’re not going to support it anymore. So drop your fucking plans and go start digging through our shitty documentation, begging for scraps on forums, and oh by the way, our new shit is COMPLETELY different from the old shit, because well, we fucked that design up pretty bad, heh, but hey, that’s YOUR problem, not our problem.
We remain committed as always to ensuring everything you write will be unusable within 1 year.
Please go fuck yourself,
Google Cloud Platform
Source: https://steve-yegge.medium.com/dear-google-cloud-your-deprec...
Recently I've been using GCP to train a model, some notes:
* Like @leetharris, our credits were pulled / not distributed; what was promised had to be cajoled out, and most of it ended up going to some weird SaaS product that we'll never use
* The GCP rep literally ghosted us halfway through the month, right when we had some expiring credits and were in the middle of training
* Not that the credits mattered: our quota requests for lifting GPU or TPU limits were rejected twice. It was impossible to get any GPUs within our credits; even running a script that looked for machines for weeks didn't work.
* Right after the credits expired, our last quota request, which had been hanging around for weeks, was suddenly approved. I assume they have an internal system set up to do that, but, like, we literally couldn't pay for GCP if we wanted to.
* Also, GCP rates are like 2-4X the market rate. Like you can get an H100-80 from Runpod (and actually get one) for what GCP charges for an A100-40.
Basically, the lesson learned was that no one should ever depend on GCP unless your time is worthless and you're not serious about getting any work done. They can go suck eggs.
This is the bigger headline than their Gemini release. AI is all about how much compute dollars it can generate for the cloud providers. Google is trying to make sure Microsoft doesn't monopolize AI compute.
Given that the TPUv5 improves perf/$$$, it would seem to be at odds with your comment. I can now get more done with the same spend.
Kelsey Hightower told me at a GopherCon (many years ago) that Google doesn't run any internal workloads on third-party GPUs mainly because it costs significantly more (b/c cooling iirc), though they are happy to help you run your workloads on such GPUs.
Google runs internal workloads on third-party GPUs (especially as TPUs weren't very good at sparse for a long time). Hightower was simply wrong.
Hightower’s biggest skill is sounding very knowledgeable and as if he has the keys to a gate of useful secrets.
Kelsey is kind, thoughtful, and never puts another person down, I would say empathy is his best skill.
He also shares what he has learned for free, rather than putting paywalls in front of his content, which is quite rare these days.
Which (potentially) draws compute spend from Microsoft Azure to Google Cloud.
I should have quoted the generalized middle statement I was responding to
> AI is all about how much compute dollars it can generate for the cloud providers.
If the providers wanted to extract more money they would not create custom hardware which reduces overall costs and prices to users.
I would argue that this is actually more about ensuring Nvidia doesn't have a monopoly on hardware, and it spares us from having to pay for Nvidia's profits through our cloud providers.
Azure is going down the same path here: https://www.theverge.com/2023/11/15/23960345/microsoft-cpu-g...
> If the providers wanted to extract more money they would not create custom hardware which reduces overall costs and prices to users.
Extracting money is about margins, not revenues. If they reduce your costs (and their revenue) by 20% with a TPU, but they can produce TPUs for 50% less than buying gear from Nvidia, it's still a profitable move.
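With made-up numbers:

    # Made-up numbers: customer price drops 20%, cloud's cost drops 50%.
    gpu_price_to_customer = 100   # per GPU-hour, hypothetical
    gpu_cost_to_cloud = 50        # what Nvidia gear costs the cloud
    tpu_price_to_customer = 80    # 20% cheaper for the customer
    tpu_cost_to_cloud = 25        # 50% cheaper for the cloud to provide

    print(gpu_price_to_customer - gpu_cost_to_cloud)  # margin with Nvidia: 50
    print(tpu_price_to_customer - tpu_cost_to_cloud)  # margin with TPUs:  55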
Exactly, if I end up paying less and the cloud also makes more money, seems like a win for everyone
The "extracting" word typically comes with abusive connotations when used in the context of money, which doesn't feel like the right word for the win-win outcomes imho
Jevons paradox.
Higher efficiency results in greater utilization.
Serious question, are there open source designs available for RISC today that do “good enough” matrix multiplication?
I have no doubt that Nvidia has extensive optimizations to get SOTA performance, but I am curious what is attainable off the shelf. If you could design a 5nm chip, would it be possible to hit 15% of an Nvidia chip? Significantly more?
Of course, there is more to a GPU than just the matrix multiplication, but I am wondering how much effort it would take to get something off the ground for a well-financed organization. Presumably China is actively funding such efforts.
Probably. There are several companies going with this approach. The real trick is the software and ecosystem around it. It does not matter if you can outperform an H100 if you spend weeks trying to get software to work, or debugging whether a NaN is a hardware error or your bad code :)
There were a bunch of presentations about matrix multiplication at the RISC-V summit last month. I'm not sure if any of the presented hardware is open source but maybe those videos are a good lead for tracking some down? https://www.youtube.com/playlist?list=PL85jopFZCnbMfMRR25ENc...
Rain is the one that most people (including Altman) are backing.
> large LLM models
I'm not usually one to point out redundancies like this but this one seems egregious.
This is in contrast to small large language models like GPT3
I think the redundant part is "models":
Large Large Language Model Models
Small Large Language Model Models
They're both redundant. There's no such thing as a small large model. It's a small model or a medium model or a large model.
Once upon a time there was a family of Bigfoot. Papa Bigfoot, Mommy Bigfoot and Little Bigfoot.
Papa Bigfoot was the biggest Bigfoot of them all. Mommy Bigfoot wasn't as big as her husband, but she was still a bigger Bigfoot than her daughter Little Bigfoot, who was the smallest Bigfoot of the family.
One day Little Bigfoot slipped in a stream and hurt her foot.
The little Little Bigfoot foot hurt so much and she cried a lot
How does it compare to Nvidia A100?
I'm sure they've done benchmarking comparisons with the A100 and H100, but since they're not sharing them, it seems unlikely they show the TPUs in a favorable light. Plus, they pick up more spend every time someone runs these benchmarks for themselves.
The most important reason all the Big Tech companies spin their own AI chips is that NVIDIA's margins are so insane. A purchase from NVIDIA at hyperscaler volume is on the same order of magnitude as spinning your own chip.
Generally TPUs don't beat Nvidia per chip. The real reason Google likes them is they are much cheaper in TCO for Google. Of course they may not decide to pass on all the savings to you.
I wonder why they don't try to compete as much as they can.
They do - it's just a different design point.
Indirectly, using the recent 50k-chip TPU v5e run info [1], I'd guess it will give 50-70% of an H100 in large-scale LLM jobs: MLCommons results show the H100 delivering a bit more than 500 TFLOPS in FP8, while the v5e gives about 100 TOPS in INT8. The v5p has 2.3 times more theoretical OPs, 3 times more interchip bandwidth, and 3.3 times more memory bandwidth, so assuming you can extrapolate, and given that bandwidth is usually the bottleneck, ~60% seems plausible (rough arithmetic below).
[1] https://cloud.google.com/blog/products/compute/the-worlds-la...
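Rough arithmetic, reusing the approximate numbers above (so treat the result as hand-wavy):

    # All inputs are the approximate figures above; this is an extrapolation.
    h100_effective_fp8_tflops = 500   # "a bit more than 500" in MLCommons runs
    v5e_effective_int8_tops = 100     # from the large v5e run
    v5p_ops_scale = 2.3               # v5p theoretical OPs vs v5e
    v5p_membw_scale = 3.3             # v5p memory bandwidth vs v5e

    if_compute_bound = v5e_effective_int8_tops * v5p_ops_scale      # 230
    if_bandwidth_bound = v5e_effective_int8_tops * v5p_membw_scale  # 330

    print(if_compute_bound / h100_effective_fp8_tflops)    # 0.46
    print(if_bandwidth_bound / h100_effective_fp8_tflops)  # 0.66
    # i.e. roughly 50-70% of an H100 under this extrapolation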
v4 chips are basically the same in bf16 performance as the A100 (slower in int8).
Does Google rely on TSMC to build the TPU chips?
Maybe indirectly. I was surprised to learn (just Googled it!) that Google TPU chips are mostly designed by Broadcom, who (being mostly fabless themselves) in turn use a variety of companies such as TSMC, GlobalFoundries, etc. to make them. Not sure if TPUs use the latest cutting-edge nodes; if so, then presumably it is specifically TSMC.
https://www.theinformation.com/articles/to-reduce-ai-costs-g...
Still no FP8 from Google. Surprising, given how effective it seems to be for both training and inference. Although it's not that surprising given that the primary customer of TPUs is Google itself, and they tend to stick themselves on weird little tech islands.
Even below FP8 works: https://arxiv.org/abs/2310.10537 (6 bit and lower)
I have seen FP4 as a proposed format, as well as batch floating point with FP8 numbers (batch floating point means n mantissas for every exponent - an old DSP trick), resulting in ~4-5 bits per number. I'm just disappointed that Google isn't taking quantization very seriously.
Edit: As the commenter below points out, "block floating point" is the common name, not "batch floating point."
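A tiny NumPy sketch of the idea, with an arbitrary block size of 32 and 8-bit mantissas (illustration only, not any particular hardware format):

    # Block floating point: one shared exponent per block of 8-bit mantissas.
    # Block size and bit widths here are arbitrary choices for illustration.
    import numpy as np

    def bfp_quantize(x, block=32, mant_bits=8):
        x = x.reshape(-1, block)
        max_abs = np.abs(x).max(axis=1, keepdims=True)
        exp = np.ceil(np.log2(np.maximum(max_abs, 1e-30)))   # shared exponent per block
        scale = 2.0 ** (exp - (mant_bits - 1))               # value of one mantissa LSB
        lo, hi = -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1
        mant = np.clip(np.round(x / scale), lo, hi).astype(np.int8)
        return mant, scale

    def bfp_dequantize(mant, scale):
        return mant.astype(np.float32) * scale

    x = np.random.randn(4, 32).astype(np.float32)
    mant, scale = bfp_quantize(x)
    print("max abs error:", np.abs(x - bfp_dequantize(mant, scale)).max())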
The Microscaling paper and the following MX OCP spec is along the lines of what you call batch floating point, though I believe the original 1963(?) work on it called it "blocked" floating point.
Datatypes are really tricky. Hardware designers tend to be conservative in my experience, and don't want to waste die space on things that might not be useful.
Edit - The original: https://www.abebooks.com/first-edition/Rounding-Errors-Algeb...
I like how they launch hard right before end-of-year performance reviews.
> To request access
Google is so fucking lame these days
They’ve used this scheme since the launch of gmail
Lol. Without benchmarks against H100 you can’t take this seriously.