> Amazon’s ecommerce business has summoned a large group of engineers to a meeting on Tuesday for a “deep dive” into a spate of outages, including incidents tied to the use of AI coding tools.
> The online retail giant said there had been a “trend of incidents” in recent months, characterised by a “high blast radius” and “Gen-AI assisted changes” among other factors, according to a briefing note for the meeting seen by the FT.
> Under “contributing factors” the note included “novel GenAI usage for which best practices and safeguards are not yet fully established”.
> “Folks, as you likely know, the availability of the site and related infrastructure has not been good recently,” Dave Treadwell, a senior vice-president at the group, told employees in an email, also seen by the FT.
Also some SVP over there: "'Folks', we'll measure your performance and bonus based on how much you use GenAI :)"
Yeah, “you must use LLMs, but also pls don’t use them for important stuff” is a difficult circle to square.
Who said you can’t use it for important stuff? Just because SOME people are screwing up doesn’t mean everyone is.
GenAI at fault, and nothing to do with amazon laying off 30k people and having an overall shitty culture where people mostly don’t want to stay?
> GenAI at fault, and nothing to do with amazon laying off 30k people
GenAI is literally the stated reason they gave for laying off 30k people.
> “As we roll out more Generative AI and agents, it should change the way our work is done. We will need fewer people doing some of the jobs that are being done today, and more people doing other types of jobs,” [Amazon CEO Andy Jassy] bluntly admitted.
Absolutely correct. Now let's drop another few billion to make AI better and avoid such mistakes in the future. And we might lay off a few more folks to make room in the budget for more AI.
Also, managers are incentivised to force AI onto the remaining staff to “boost productivity” but of course they won’t accept any of the responsibility or blame for that decision.
Just tell the employees to fully adopt AI in the SDLC and make it secure and reliable. Don't make mistakes.
If it works for models, why not humans? /s
Maybe both, and possibly other causes too, but allow us a moment to revel in the schadenfreude of AI code slop at hyperscale, will you?
Comment was deleted :(
Summary: AWS has volunteered to serve as a crash test dummy for vibe coding.
But don't tell anyone --- and if you do, don't blame AI, because it's all the humans' fault for not phrasing their questions in the "right way".
For this particular experiment, regardless of phrasing, I think the guys with the most appetite for risk have to be Cloudflare. They're shipping at an astonishing pace, but I think there have been far more outages than there were in the jgc era. Perhaps Anthropic's application-side teams are faster and more cowboy[0], but they are super AI-native, so that makes sense.
0: I think this is an era where cowboys win, so they're (unsurprisingly) smart about doing this
I am surprised we haven't had an actual Y2K-scale crash from all this AI code yet. Like, how do you review a 1,000-line Claude-generated PR?
You don't. I can guarantee that 90% of the generated code will never receive a detailed review, simply because there's too much cognitive overhead and too little time; everything moves too fast.
I remember having to do such a code review, before AI, on a highly complex component, and it would take a full day of work. These days, most of the people I know take about half an hour and are mostly scanning for obvious mistakes, while the bigger problem is the sneaky, non-obvious ones.
Exactly. It's the same as reviewing somebody else's code. How many companies did this perfectly before LLMs came along? I know mine didn't. But these days people who aren't senior enough review LLM output, do a quick mental pass through the code, see that it works, and approve it.
What could work: the LLM creating a very good test suite for its own code changes and for the overall app (as much as feasible), and those tests getting a hardcore review. Then the actual code review doesn't have to be that deep. But if everybody is shipping like there is no tomorrow, edge cases will start biting hard and often.
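A minimal sketch of what that idea might look like in practice (the function and its rules are hypothetical, not from any real codebase): the generated code is whatever it is, but the edge-case assertions are short enough to review line by line.

```python
# Hypothetical example: if the model writes the code, have it also write
# edge-case tests like these, and spend the review budget on the tests.
def apply_discount(price_cents: int, pct: int) -> int:
    """Return the discounted price in cents, rounding down."""
    if not 0 <= pct <= 100:
        raise ValueError("discount must be 0-100")
    return price_cents * (100 - pct) // 100

# Each assertion encodes one business rule or edge case the generated
# code must satisfy; this is what the reviewer reads closely.
assert apply_discount(1000, 0) == 1000    # no discount is a no-op
assert apply_discount(1000, 100) == 0     # full discount hits zero
assert apply_discount(999, 50) == 499     # rounds down, never up
try:
    apply_discount(1000, 101)
except ValueError:
    pass
else:
    raise AssertionError("out-of-range discount must be rejected")
```

Reviewing a dozen assertions like these is much cheaper than tracing a 1,000-line diff, though it only works if the tests themselves are held to a high bar.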
This wouldn't happen if they used my CLAUDE.md of course!
They were holding it wrong.
> Junior and mid-level engineers will now require more senior engineers to sign off any AI-assisted changes, Treadwell added.
Beatings will continue until senior engineers leave?
I wonder what senior means here. Like, unless it’s fairly junior seniors, the ratios are going to make that impossible.
Comment was deleted :(
When you hear "left behind", remember: is 'it' going to places you want?
And if it's going to get easier and easier for my work to be performed by AI, then what does it mean for me to "keep up"? Do I just need to create more slop than anyone else?
Excellent consideration, probably. Sounds like a lot to do for very little in return. I'll leave with this, a sort of sick joke given context:
Quit when the work is done
Fortunately my job is not based on generating plausible-sounding bs, so I should be safe.
Hear, hear. There's a whole list of other silly games to avoid, unfortunately. Namely, "up or out".
For instance, I want to engineer [more], but scope creep keeps pushing me closer to management or sales. At this rate, by career-end I'll be operating a small country by myself.
> The group has disputed the claim that headcount cuts were responsible for an increase in recent outages.
It's a bit hard to believe this.
Some engineers will point to this and say, hey, AI is not gonna work. It doesn’t reason very well and it leads to these problems.
But what they’re missing is that all code quality is going to tank, and we are just going to accept that. Just as artisanal goods were replaced in the Industrial Revolution with mass-produced inferior ones.
People will accept bad code if it is cheap enough.
We’ve gotten used to aiming for great, even if we often only hit functional. The new bar is going to be so much lower. Welcome to the era of cheap bad code. Lots more software, lots more value overall, but much worse reliability. Every day the apps I use get buggier.
I thought this too, but it's still weird.
Machines that make e.g. paper are great. They are immensely more efficient, but extremely consistent and superhuman (try making that perfectly smooth letter paper by hand).
Human written software is the same. Where you had N people copying data from spreadsheets for M suppliers into an internal database or whatever, you now have one program doing it. It can be scaled infinitely for a fraction of the cost. It _never_ messes up. The cost of the software developer is trivial in comparison. Software was a space where the marginal cost for quality was extremely cheap.
I don't get how AI fits in here. Software already had massive scale. You aren't replacing a massive data entry team with AI; you're replacing a reliable piece of software written by a human with a (possibly) reliable piece of software written by AI under a human's direction. There's no increase in scale, and until the reliability issues are fixed, there's a very noticeable decrease in reliability (sure, some software was bad already, but now the good developers are also writing bad code).
This doesn't seem like a natural step to me at all. The best explanation I can come up with is AI is just being used as an excuse for destructive penny pinching.
I don’t totally buy this. If you’re Amazon, there’s only so buggy you can get before you start losing huge amounts of money.
You are comparing code to a t-shirt, but it is more like infrastructure: roads, bridges, buildings. It is a platform that you build other stuff on top of.
The economics of software are very different from physical goods. Margins on software (products) are orders of magnitude higher. Any cost shaving done at coding time is economically irrelevant in the long run, detrimental to quality/reputation and could almost be seen as a risk. Furthermore, assuming the bottleneck in this process has so far been coding is pure BS.
> assuming the bottleneck in this process has so far been coding is pure BS.
This is the core insight for most businesses.
When evaluating the impact of AI on velocity, the first thing to consider is how long it takes for a one-line code change to get into production, including initial analysis and specs.
You can't get faster than this.
The cope island of objections will continue to shrink.
Being able to easily create apps means huge supply, which means the commodification of software, just like the commodification of physical goods. Mass supply means low prices. It won’t be economic to have artisan coders any more than it is to have artisan goods makers.
And yet people still want artisan goods, artwork, high end food, things that aren’t “economic”.
You are almost right. As I've said since the beginning of this AI circus, this is the equivalent of flipping McDonald's burgers (no insult intended for those workers). It is a thing, and people buy and eat them. But high-quality burgers made by talented chefs will always be out there. That's my analogy, and I don't intend to be on the side flipping McDonald's burgers.
It’s really not. McDonald’s whole thing is consistency. It’s never going to be good, but nor is it going to be that terrible.
That is, ah, very much not the case for AI slop.
There are a lot of McDonalds and very few Michelin starred restaurants.
Safety critical engineering and infrastructure layers will (eventually again) be rigorous. Everything else is headed to slop.
My craft died. I’m sad. Time to move on.
Where I live (Geneva, Switzerland), gourmet high-quality burger joints definitely, and massively, outnumber McDonald's, even if I count in Burger King. That shows people sometimes pay for quality even when they don't desperately need it. And it's trivial to make better burgers than McDonald's; heck, I can surpass them at home with every ingredient. They are really the lowest level in quality, taste, looks, and (lack of) healthy components. You don't need a Michelin star for that, far from it. Plus the food is often cold outside of peak hours, something that has never happened to me in a proper restaurant.
Also, McDonald's isn't much cheaper in the end, just marginally, and the choice of drinks is pathetic, usually no beer. The main reason folks go there is that it's easier and faster than getting a table in a real restaurant. But the environment in a McDonald's is also absolutely soulless, cheap, and ugly. (There are kids' corners, to be fair, but they are often disgustingly dirty.)
It's a very good analogy in the end, IMHO, maybe just not tilting the way you intended, at least not here.
> high quality burgers
There is also, you know, actual food. Done by real chefs.
[dead]
Is it only 45 dollars for the subscription? Does that cover the AI-related outages too, or just the engineering meeting?
Comment was deleted :(
> Junior and mid-level engineers will now require more senior engineers to sign off any AI-assisted changes, Treadwell added.
Lol. Lmao. You have got to be joking. Seniors leaving in droves is how that plays out.
I read that line and thought "so, the solution is code review?". What has to happen to your processes that code review is not only missing, but unironically claimed to be the solution?
I know there are some companies that never did code review, but this is Amazon. They should know better.
It's _more_ code review. They already had senior code review.
This is going to end either with seniors rubber-stamping absolutely everything without even reading it, or with seniors blocking most of the slop for no overall productivity gain.
Or, if review is actually done, I think there will be a productivity loss. Juniors with the help of AI can generate more code than seniors have time to review in a full working day. So they won't have time left for any other work...
Nope. NO ONE is quitting in the current market because they got asked to review extra PRs.
If you’re a senior at Amazon and your whole job becomes reviewing slop, well, you can likely get another job which does not revolve around reviewing slop. The current market is not great, but it’s disproportionately painful for juniors.
Top people definitely do quit if they feel like it; why the heck shouldn't they? There is no shortage of work for them. But it's fine if a company, through its actions, signals that it doesn't even want to retain its top talent. Just market forces and all that.
Paywalled
paste headline into google, click first link
Huh, it has to be Google, specifically, too! There used to be a shortcut for this action on HN (a link under the submission saying "web" or something?), but it seems that has been removed.
[dead]
[dead]
[dead]
Full Article
Amazon’s ecommerce business has summoned a large group of engineers to a meeting on Tuesday for a “deep dive” into a spate of outages, including incidents tied to the use of AI coding tools.
The online retail giant said there had been a “trend of incidents” in recent months, characterised by a “high blast radius” and “Gen-AI assisted changes” among other factors, according to a briefing note for the meeting seen by the FT.
Under “contributing factors” the note included “novel GenAI usage for which best practices and safeguards are not yet fully established”.
“Folks, as you likely know, the availability of the site and related infrastructure has not been good recently,” Dave Treadwell, a senior vice-president at the group, told employees in an email, also seen by the FT.
The note ahead of Tuesday’s meeting did not specify which particular incidents the group planned to discuss.
Amazon’s website and shopping app went down for nearly six hours this month in an incident the company said involved an erroneous “software code deployment”. The outage left customers unable to complete transactions or access functions such as checking account details and product prices.
Treadwell, a former Microsoft engineering executive, told employees that Amazon would focus its weekly “This Week in Stores Tech” (TWiST) meeting on a “deep dive into some of the issues that got us here as well as some short immediate term initiatives” the group hopes will limit future outages.
He asked staff to attend the meeting, which is normally optional.
Junior and mid-level engineers will now require more senior engineers to sign off any AI-assisted changes, Treadwell added.
Amazon said the review of website availability was “part of normal business” and it aims for continual improvement.
“TWiST is our regular weekly operations meeting with a specific group of retail technology leaders and teams where we review operational performance across our store,” the company said.
Separately, the company’s cloud computing arm — Amazon Web Services — has suffered at least two incidents linked to the use of AI coding assistants, which the company has been actively rolling out to its staff.
AWS suffered a 13-hour interruption to a cost calculator used by customers in mid-December after engineers allowed the group’s Kiro AI coding tool to make certain changes, and the AI tool opted to “delete and recreate the environment”, the FT previously reported.
Amazon previously said the incident in December was an “extremely limited event” affecting only a single service in parts of mainland China. Amazon added that the second incident did not have an impact on a “customer facing AWS service”.
The FT previously reported multiple Amazon engineers said their business units had to deal with a higher number of “Sev2s” — incidents requiring a rapid response to avoid product outages — each day as a result of job cuts.
Amazon has undertaken multiple rounds of lay-offs in recent years, most recently eliminating 16,000 corporate roles in January. The group has disputed the claim that headcount cuts were responsible for an increase in recent outages.
Gonna see a lot more of this in the coming years. The real cost of LLM tools comes with a delay. Devs don't tend to notice it until they're neck deep in code they don't understand, swearing the next prompt will get them out. CEOs won't notice until it starts costing them money, and that of course assumes anyone will be willing to admit it. A lot of people have their careers on the line after spending a metric shit ton of money on untested tools.
nice domain
Hold a meeting?! No way! That’s newsworthy material!
Seriously, who even cares? It’s probably going to be “guys, be careful, but also continue to push slop, kthx”.