I'm becoming concerned with the rate at which major software systems seem to be failing as of late. For context, last year I only logged four outages that actually disrupted my work; this quarter alone I'm already on my fourth, all within the past few weeks. This is, of course, just an anecdote and not evidence of any wider trend (not to mention that I might not have even logged everything last year), but it was enough to nudge me into writing this today (helped by the fact that I suddenly had some downtime). Keep in mind, this isn't necessarily specific to this outage, just something that's been on my mind enough to warrant writing about it.
It feels like resiliency is becoming a bit of a lost art in networked software. I've spent a good chunk of this year chasing down intermittent failures at work, and I really underestimated how much work goes into shrinking the "blast radius", so to speak, of any bug or outage. Even though we mostly run a monolith, we still depend on a bunch of external pieces like daemons, databases, Redis, S3, monitoring, and third-party integrations, and we generally assume that these things are present and working in most places, which wasn't always the case. My response was to better document the failure conditions, and once I did, I realized that there were many more than we initially thought. Since then we've done things like: move some things to a VPS instead of cloud services, automate deployment more than we already had, greatly improve the test suite and docs to include these newly considered failure conditions, and generally cut down on moving parts. It was a ton of effort, but the payoff has finally shown up: our records show fewer surprises, which means fewer distractions and a much calmer system overall. Without that unglamorous work, things would've only grown more fragile as complexity crept in. And I worry that, more broadly, we're slowly un-learning how to build systems that stay up even when the inevitable bug or failure shows up.
For completeness, here are the outages that prompted this: the AWS us-east-1 outage in October (took down the Lightspeed R series API), the Azure Front Door outage (prevented Playwright from downloading browsers for tests), today’s Cloudflare outage (took down Lightspeed’s website, which some of our clients rely on), and the Github outage affecting basically everyone who uses it as their git host.
It's money, of course. No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.
> It's money, of course.
100%
> No one wants to pay for resilience/redundancy. I've launched over a dozen projects going back to 2008, clients simply refuse to pay for it, and you can't force them. They'd rather pinch their pennies, roll the dice and pray.
Well, fly by night outfits will do that. Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.
Look at a big bank or a big corporation's accounting systems, they'll pay millions just for the hot standby mainframes or minicomputers that, for most of them, would never be required.
> Bigger operations like GitHub will try to do the math on what an outage costs vs what better reliability costs, and optimize accordingly.
Used to, but it feels like there is no corporate responsibility in this country anymore. These monopolies have gotten so large that they don't feel any impact from these issues. Microsoft is huge and doesn't really have large competitors. Google and Apple aren't really competing in the source code hosting space in the same way GitHub is.
> Take the number of vehicles in the field, A, multiply it by the probable rate of failure, B, then multiply it by the result of the average out of court settlement, C. A times B times C equals X. If X is less than the cost of a recall, we don't do one.
I've worked at many big banks and corporations. They are all held together with the proverbial sticky tape, bubblegum, and hope.
They do have multiple layers of redundancy, and thus the big budgets, but the redundant systems won't be kept hot, or there will be some critical flaws that all of the engineers know about but haven't been given permission/funding to fix; and those engineers are so badly managed by the firm that they dgaf either and secretly want the thing to burn.
There will be sustained periods of downtime if their primary system blips.
They will all still be dependent on some hyper-critical system that nobody really knows how it works, the last change was introduced in 1988 and it (probably) requires a terminal emulator to operate.
I've worked on software used by these and have been called in to help support from time to time. One customer which is a top single digit public company by market cap (they may have been #1 at the time, a few years ago) had their SAP systems go down once every few days. This wasn't causing a real monetary problem for them because their hot standby took over.
They weren't using mainframes, just "big iron" servers, but each one would have been north of $5 million for the box alone, I guess on a 5ish year replacement schedule. Then there's all the networking, storage, licensing, support, and internal administration costs for it which would easily cost that much again.
Now people will say SAP systems are made entirely of duct tape and bubblegum. But it all worked. This system ran all their sales/purchasing sites and portals and was doing a million dollars every couple of minutes, so that all paid for itself many times over during the course of that bug. Cold standby would not have cut it. Especially since these big systems take many minutes to boot and HANA takes even longer to load from storage.
I agree that it's all money.
That's why it's always DNS right?
> No one wants to pay for resilience/redundancy
These companies do take it seriously, on the software side, but when it comes to configurations, what are you going to do:
Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk. It looks like even the most critical and prestigious companies in the world are doing the former.
> Either play it by ear, or literally double your cloud costs for a true, real prod-parallel to mitigate that risk.
There's also the problem that doubling your cloud footprint to reduce the risk of a single point of failure introduces new risks: more configuration to break, new modes of failure when both infrastructures are accidentally live and processing traffic, etc.
Back when companies typically ran their own datacenters (or otherwise heavily relied on physical devices), I was very skeptical about redundant switches, fearing the redundant hardware would cause more problems than it solved.
Complexity breeds bugs.
Which is why the “art” of engineering is reducing complexity while retaining functionality.
Why should they? Honestly most of what we do simply does not matter that much. 99.9% uptime is fine in 99.999% of cases.
This is true. But unfortunately the exact same process is used even for critical stuff (the CrowdStrike thing, for example). Maybe there needs to be a separate SWE process for those things as well, just like there is for aviation. This means not using the same dev tooling, which is a lot of effort.
To agree with the other comments, it seems likely that it's money, which has begun to result in a slow "un-learning how to build systems that stay up even when the inevitable bug or failure shows up."
To be deliberately provocative, LLMs are being more and more widely used.
Word on the street is github was already a giant mess before the rise of LLMs, and it has not improved with the move to MS.
They are also in the process of moving most of the infra from on-prem to Azure. I'm sure we'll see more issues over the next couple of months.
https://thenewstack.io/github-will-prioritize-migrating-to-a...
To be deliberately provocative, so is offshoring work.
imagine what it'll be like in 10 years time
Microsoft: the film Idiocracy was not supposed to be a manual
Good thing git was designed as a decentralized revision control system, so you don’t really need GitHub. It’s just a nice convenience
As long as you didn't go all in on GitHub Actions. Like my company has.
Then your CI host is your weak point. How many companies have multi-cloud or multi-region CI?
Do you think you'd get better uptime with your own solution? I doubt it. It would just be at a different time.
Uptime is much, much easier at low scale than at high scale.
The reason for buying centralized cloud solutions is not uptime, it's to save yourself the headache of developing and maintaining the thing.
It is easier until things go down.
Meaning the cloud may go down more frequently than small-scale self-hosted deployments; however, downtimes are on average much shorter on the cloud. A lot of money is at stake for cloud providers, so GitHub et al. have the resources to throw at fixing a problem, compared to you or me when self-hosting.
On the other hand, when things go down self-hosted, it is far more difficult or expensive to have on-call engineers who can actually restore services quickly.
The skill needed to understand and fix a problem is scarce, so it takes semi-skilled talent longer to do so, even though the failure modes are simpler (but not simple).
The skill difference between setting up something locally that works and something that works reliably is vast. Talent for the latter is scarce to find or retain.
My reason for centralized cloud solutions is also uptime.
Multi-AZ RDS is 100% higher availability than me managing something.
Well, just a few weeks ago we weren't able to connect to RDS for several hours. That's way more downtime than we ever had at the company I worked for 10 years ago, where the DB was just running on a computer in the basement.
Anecdotal, but ¯\_(ツ)_/¯
An anecdote that repeats.
Most software doesn’t need to be distributed. But it’s the growth paradigm where we build everything on principles that can scale to world-wide low-latency accessibility.
A UNIX pipe gets replaced with a $1200/mo. maximum IOPS RDS channel, bandwidth not included in price. Vendor lock-in guaranteed.
“Your own solution” should be that CI isn’t doing anything you can’t do on developer machines. CI is a convenience that runs your Make or Bazel or Just or whatever you prefer builds, that your production systems work fine without.
I've seen that work first hand to keep critical stuff deployable through several CI outages, and it also has the upside of making it trivial to debug "CI issues", since it's trivial to run the same target locally.
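To make that concrete, here's a minimal sketch of the pattern (the script name and make targets are placeholders, not anything standard): the CI configuration does nothing except invoke a script that anyone can also run on a laptop.

    #!/usr/bin/env sh
    # ci.sh -- the only thing the CI job runs; developers run the exact same script locally
    set -eu
    make lint      # or `bazel test //...`, `just build`, whatever the repo already uses
    make test
    make package

If the CI host is down, running the script locally still produces the same artifacts; the only CI-specific bits left are credentials and where the results get uploaded.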
Yes, this, but it’s a little more nuanced because of secrets. Giving every employee access to the production deploy key isn’t exactly great OpSec.
Compared to 2025 github yeah I do think most self-hosted CI systems would be more available. Github goes down weekly lately.
Aren't they halting all work to migrate to Azure? That does not sound like an easy thing to do, and it seems quite likely to cause unexpected problems.
I recall the Hotmail acquisition and the failed attempts to migrate the service to Windows servers.
Yes, this is not the first time GitHub has tried to migrate to Azure. It's like the fourth time or something.
Yes. I've quite literally run a self-hosted CI/CD solution, and yes, in terms of total availability, I believe we outperformed GHA when we did so.
We moved to GHA b/c nobody ever got fired ^W^W^W^W leadership thought eng running CI was not a good use of eng time. (Without much question into how much time was actually spent on it… which was pretty close to none. Self-hosted stuff has high initial cost for the setup … and then just kinda runs.)
Ironically, one of our self-hosted CI outages was caused by Azure — we have to get VMs from somewhere, and Azure … simply ran out. We had to swap to a different AZ to merely get compute.
The big upside to a self-hosted solution is that when stuff breaks, you can hold someone over the fire. (Above, that would be me, unfortunately.) With Github? Nobody really cares unless it is so big, and so severe, that they're more or less forced to, and even then, the response is usually lackluster.
It's fairly straightforward to build resilient, affordable and scalable pipelines with DAG orchestrators like tekton running in kubernetes. Tekton in particular has the benefit of being low level enough that it can just be plugged into the CI tool above it (jenkins, argo, github actions, whatever) and is relatively portable.
Doesn't have to be an in-house system, just basic redundancy is fine, e.g. a simple hook that pushes to both GitHub and GitLab (rough sketch below).
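A rough sketch of one way to do that with plain git config rather than a hook (the URLs are placeholders): give the existing remote two push URLs, so a single push updates both hosts.

    # add both hosts as push URLs for the existing "origin" remote
    git remote set-url --add --push origin git@github.com:example/app.git
    git remote set-url --add --push origin git@gitlab.com:example/app.git
    # from now on, one push updates both mirrors
    git push origin main

Fetches still come from whichever URL origin originally pointed at, so if that host is down you'd pull from the other one explicitly.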
I mean yes. We've hosted internal apps that have four nines reliability for over a decade without much trouble. It depends on your scale of course, but for a small team it's pretty easy. I'd argue it is easier than it has ever been because now you have open source software that is containerized and trivial to spin up/maintain.
The downtime we do have each year is typically also on our terms, not in the middle of a work day or at a critical moment.
This escalator is temporarily stairs, sorry for the convenience.
Tbh, I personally don't trust a stopped escalator. Some of the videos of brake failures on them scared me off of ever going on them.
You've ruined something for me. My adult side is grateful but the rest of me is throwing a tantrum right now. I hope you're happy with what you've done.
I read a book about elevator accidents; don't.
elevator accidents or escalator accidents?
elevators. for escalators, make sure not to watch videos of people falling in "the hole".
I am genuinely sorry about that. And no, I am not happy about what I've done.
Not really comparable at any compliance or security oriented business. You can't just zip the thing up and sftp it over to the server. All the zany supply chain security stuff needs to happen in CI and not be done by a human or we fail our dozens of audits
Why is it that we trust those zany processes more than each other again? Seems like a good place to inject vulnerabilities to me...
Hi! My name is Jia Tan. Here's a nice binary that I compiled for you!
The issue is that GitHub is down, not that git is down.
Aren’t they the same thing? /sarc
You just lose the "hub" of connecting others and providing a way to collaborate with others with rich discussions.
All of those sound achievable by email, which, coincidentally, is also decentralized.
Some of my open source work is done on mailing lists through e-mail
It's more work and slower. I'm convinced half of the reason they keep it that way is because the barrier to entry is higher and it scares contributors away.
Well it does prevent brigading.
Email at a company is very not decentralized. Most use Microsoft 365, also hosted in azure, i.e. the same cloud as github is trying to host its stuff in.
Wait, email is decentralised?
You mean, assuming everyone in the conversation is using different email providers. (ie. Not the company wide one, and not gmail... I think that covers 90% of all email accounts in the company...)
For sure.
You can commit, branch, tag, merge, etc and be just fine.
Now, if you want to share that work, you have to push.
You can push to any other Git server during a GitHub outage to still share work, trigger a CI job, deploy etc, and later when GitHub is reachable again you push there too.
Yes, you lose some convenience (like GitHub's pull request UI can't be used), but you can temporarily use the other Git server's UI for that.
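For example (remote name, host and branch are placeholders), assuming some other Git host is already reachable to you and your collaborators:

    # point a temporary remote at the fallback host and push your work there
    git remote add fallback git@gitlab.com:example/app.git
    git push fallback my-branch
    # when github.com is reachable again, push the same branch to the usual remote
    git push origin my-branch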
I think their point was that you're not fully locked in to GitHub. You have the repo locally and can mirror it on any Git remote.
I'm on HackerNews because I can't do my job right now.
I'm on HN because I don't want to do my job right now.
I work in the wrong time zone. Good night.
I don't use GitHub that much. I think the "oh no, you have centralized on GitHub" point is a bit exaggerated.[1] But generally, thinking beyond just pushing blobs to the Internet, "decentralization" as in software that lets you do everything that is Not Internet Related locally is just a great thing. So I can never understand people who scoff at Git being decentralized just because "um, actually you end up pushing to the same repository".
It would be great to also have the continuous build and test and whatever else you “need” to keep the project going as local alternatives as well. Of course.
[1] Or maybe there is just that much downtime on GitHub now that it can’t be shrugged off
SSH also down
My pushing was failing for reasons I hadn't seen before. I then tried my sanity check of `ssh git@github.com` (I think I'm supposed to throw a -t flag there, but never care to), and that worked.
But yes ssh pushing was down, was my first clue.
My work laptop had just been rebooted (it froze...) and the CPU was pegged by security software doing a scan (insert :clown: emoji), so I just wandered over to HN and learned of the outage at that point :)
SSH works fine for me. I'm using it right now. Just not to GitHub!
SSH is as decentralized as git - just push to your own server? No problem.
Well sure, but you can't get any collaborators' commits that were only pushed to GitHub before it went down.
Well you can with some effort. But there's certainly some inconvenience.
Curious whether you actually think this, or was it sarcasm?
It was sarcasm, but git itself is a decentralized VCS. Technically speaking, every git checkout is a repo in itself. GitHub doesn't stop me from having the entire repo history up to the last pull, and I can still push either to the company backup server or to my coworker directly.
However, since we use github.com for more than just git hosting, it is a SPOF in most cases, and we treat it as a snow day.
Yep, agreed - Issues being down would be a bit of a killer.
So was this caused by the Cloudflare duplicate features file, causing 5xx errors internally? They said they shipped a fix, but no details yet
I have a serious question, not trying to start a flame war.
A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.
B. If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.
Operations budget cuts/layoffs? Replacing critical components/workflows with AI? Just overall growing pains, where a service has outgrown what it was engineered for?
Thanks
> A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.
FWIW Microsoft is convinced moving Github to Azure will fix these outages
Everything old is new again.
https://www.zdnet.com/article/ms-moving-hotmail-to-win2000-s...
From the second link:
> In 2002, the amusement continued when a network security outfit discovered an internal document server wide open to the public internet in Microsoft's supposedly "private" network, and found, among other things, a whitepaper[0] written by the hotmail migration team explaining why unix is superior to windows.
Hahaha, that whitepaper is pure gold!
[0]: https://web.archive.org/web/20040401182755/http://www.securi...
And 25 years later, a significant portion of the issues in that whitepaper remain unresolved. They were still shitting on people like Jeffrey Snover who were making attempts to provide more scalable management technologies. Such a clown show.
The same Azure that just had a major outage this month?
Microsoft is a company that hasn't even figured out how to get system updating working consistently on their premier operating system in three decades. It seems unlikely to me that somehow moving to Azure is going to make anything more stable.
Microsoft is also convinced that its works are a net benefit for humanity, so I would take that with a grain of salt.
I think it would be pretty hard to argue against that point of view, at least thus far. If DOS/Windows hadn't become the dominant OS someone would have, and a whole generation of engineers cut their teeth on their parents' windows PCs.
There are some pretty zany alternative realities in the Multiverses I’ve visited. Xerox Parc never went under and developed computing as a much more accessible commodity. Another, Bell labs invented a whole category of analog computers that’s supplanted our universe’s digital computing era. There’s one where IBM goes directly to super computers in the 80s. While undoubtedly Microsoft did deliver for many of us, I am a hesitant to say that that was the only path. Hell, Steve Jobs existed in the background for a long while there!
I wish things had gone differently too, but a couple of nitpicks:
1.) It's already a miracle Xerox PARC escaped their parent company's management for as long as they did.
2.) IBM was playing catch-up on the supercomputer front since the CDC 6400 in 1964. Arguably, they did finally catch up in the mid-late 80's with the 3090.
AT&T sold Unix machines (actually a rebadged Olivetti for the hardware) and Microsoft had Xenix when Windows wasn't a thing.
So many weird paths we could have gone down it's almost strange Microsoft won.
Yeah, I'm absolutely not saying it was the only path. It's just the path that happened. If not MS maybe it would have been Unix and something else. Either way most everyone today uses UX based on Xerox Parc's which was generously borrowed by, at this point, pretty much everyone.
If Microsoft hadn't tried to actively kill all its competition then there's a good chance that we'd have a much better internet. Microsoft is bigger than just an operating system, they're a whole corporation.
Instead they actively tried to murder open standards [1] that they viewed as competitive and normalized the antitrust nightmare that we have now.
I think by nearly any measure, Microsoft is not a net good. They didn't invent the operating system, there were lots of operating systems that came out in the 80's and 90's, many of which were better than Windows, that didn't have the horrible anticompetitive baggage attached to them.
[1] https://en.wikipedia.org/wiki/Embrace,_extend,_and_extinguis...
I'm not sure I understand this logic. You're saying that the gap would have been filled even if their product didn't exist, which means that the net benefit isn't that the product exists. How are you concluding that whatever we might have gotten instead would have been worse?
DOS and Windows kept computing behind for a VERY long time, not sure what you're trying to argue here?
What’s funny is that we were some bad timing away from IBM giving the DOS money to Gary Kildall and we’d all be working with CP/M derivatives!
Gary was on a flight when IBM called up Digital Research looking for an OS for the IBM PC. Gary's wife, Dorothy, wouldn't sign an NDA without it going through Gary, and supposedly they never got negotiations back on track.
What if that alternate someone had been better than DOS/Windows and then engineers cut their teeth on that instead?
Then my comment may have been about a different OS. Or I might never have been born. Who knows?
I'm not convinced of your first point. Just because something seems difficult to avoid given the current context does not mean it was the only path available.
Your second point is a little disingenuous. Yes, Microsoft and Windows have been wildly successful from a cultural adoption standpoint. But that's not the point I was trying to argue.
My first comment is simply pointing out that there's always a #1 in anything you can rank. Windows happened to be what won. And I learned how to use a computer on Windows. Do I use it now? No. But I learned on it as did most people whose parents wanted a computer.
The comment you were replying to was about Microsoft.
Even if Windows weren't a dogshit product, which it is, Microsoft is a lot more than just an operating system. In the 90's they actively tried to sabotage any competition in the web space, and held web standards back by refusing to make Internet Explorer actually work.
And how does it follow that microsoft is the good guy in a future where we did it with some other operating system? You could argue that their system was so terrible that its displacement of other options harmed us all with the same level of evidence.
Been on GitHub for a long time. It feels like outages are more frequent now. It used to be yearly, if at all, that GitHub was noticeably impacted. Now it's monthly, and recently, seemingly weekly.
Definitely not how I remember it. First, I remember seeing the unicorn page multiple times a day some weeks. There were also times when webhook delivery didn't work, so CircleCI users couldn't kick off any builds.
What has changed is how many GitHub services can be having issues.
I suspect that the Azure migration is influencing this one. Just a bunch of legacy stuff being moved around along with Azure not really being the most reliable on top... I can't imagine it's easy.
There have been 5, between Actions and push/pull issues, just this month. It is more often.
In the early days of GitHub (like before 2010) outages were extremely common.
I agree, for what that's worth.
However, this is an unexpected bell curve. I wonder if GitHub is seeing more frequent adversarial action lately. Alternatively, perhaps there is a premature reliance on new technology at play.
I pulled my project off GitHub and onto Codeberg a couple months ago, but this outage still screws me over because I have a Cargo.toml with a git dependency pointing at GitHub.
I was trying to do a 1.0 release today. Codeberg went down for "10 minutes maintenance" multiple times while I was running my CI actions.
And then github went down.
Cursed.
I think back then it was generally news when the site was up. Similar with Twitter, for that matter.
Not from my recollection. Not like this. BitBucket on the other hand had a several day outage at one point. That one I do recall.
I remember periods of time when GitHub was down every few weeks, my impression is that it's become more stable over the years.
> If it's becoming more common, what are the reasons?
Someone answered this morning, during the Cloudflare outage, that it's AI vibe coding, and I tend to think there is something true in this. At some point there might be some tiny grain of AI-generated code involved which starts the avalanche ending like this.
Well, layoffs across tech probably haven't helped.
https://techrights.org/n/2025/08/12/Microsoft_Can_Now_Stop_R...
ever since Musk greenlighted firing people again.. CEOs can't wait to pull the trigger
It certainly feels that way, though it may be an instance of availability bias. Not sure what's causing it - maybe extra load from AI bots (certainly a lot of smaller sites complain about it, maybe major providers feel the pain too), maybe some kind of general quality erosion... It's certainly something that is waiting for serious research.
GitHub isn't in the same reliability class as the hyperscalers or Cloudflare; it's comically bad now, to the point that at a previous job we invested in building a read-only cache layer specifically to prevent GitHub outages from bringing our system down.
Years ago on hackernews I saw a link about probability describing a statistical technique that one could use to answer a question about if a specific type of event was becoming more common or not. Maybe related to the birthday paradox? The gist that I remember is that sometimes a rare event will seem to be happening more often, when in reality there is some cognitive bias that makes it non-intuitive to make that decision without running the numbers. I think it was a blog post that went through a few different examples, and maybe only one of them was actually happening more often.
If the events are independent, you could use a binomial distribution. Not sure if you can consider these kinds of events to be independent, though.
End of year, pre-holiday break, code/project completion for perf review rush.
Be good to your stability/reliability engineers for the next few months... it's downtime season!
I’m more interested in how this and the Cloudflare outage occurred on the same day. Is it really just a coincidence?
I suspect there is more tech out there. 20 years ago we didn't have smartphones. 10 years ago, 20mbit on mobile was a good connection. Gigabit is common now, infrastructure no longer has the hurdles it used to, AI makes coding and design much easier, phones are ubiquitous, and using them at all times (in the movies, out at dinner, driving) has become super normalised.
I suspect (although have not researched) that global traffic is up, by throughput but also by session count.
This contributes to a lot more awareness. Slack being down wasn't impactful when most tech companies didn't use Slack. An AWS outage was less relevant when the 10 apps (used to be websites) you use most didn't rely on a single AZ in AWS or you were on your phone less.
I think as a society it just has more impact than it used to.
> Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now?
I think that "more coverage" is part of it, but also "more centralization." More and more of the web is centralized around a tiny number of cloud providers, because it's just extremely time-intensive and cost-prohibitive for all but the largest and most specialized companies to run their own datacenters and servers.
Three specific examples: Netflix and Dropbox do run their own datacenters and servers; Strava runs on AWS.
> If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.
I worked at AWS from 2020-2024, and saw several of these outages so I guess I'm "in the know."
My somewhat-cynical take is that a lot of these services have grown enormously in complexity, far outstripping the ability of their staff to understand them or maintain them:
- The OG developers of most of these cloud services have moved on. Knowledge transfer within AWS is generally very poor, because it's not incentivized, and has gotten worse due to remote work and geographic dispersion of service teams.
- Managers at AWS are heavily incentivized to develop "new features" and not to improve the reliability, or even security, of their existing offerings. (I discovered numerous security vulnerabilities in the very-well-known service that I worked for, and was regularly punished-rather-than-rewarded for trying to get attention and resources on this. It was a big part of what drove me to leave Amazon. I'm still sitting on a big pile of zero-day vulnerabilities in ______ and ______.)
- Cloud services in most of the world are basically a 3-way oligopoly between AWS, Microsoft/Azure, and Google. The costs of switching from one provider to another are often ENORMOUS due to a zillion fiddly little differences and behavior quirks ("bugs"). It's not apparent to laypeople — or even to me — that any of these providers are much more or less reliable than the others.
Looking around, I noticed that many senior, experienced individuals were laid off, sometimes replaced by juniors/contractors without institutional knowledge or experience. That's especially evident in ops/support, where the management believes those departments should have a smaller budget.
1/ Most of the big corporations moved to big cloud providers in the last 5 years. Most of them started 10 years ago but it really accelerated in the last 5 years. So there is for sure more weight and complexity on cloud providers, and more impact when something goes wrong.
2/ Then we cannot expect big tech to stay as sharp as in the 2000s and 2010s.
There was a time banks had all the smart people, then the telcos had them, etc. But people get older and too comfortable, layers of bad incentives and politics accumulate, and you just become a dysfunctional big mess.
> B. If it's becoming more common, what are the reasons?
Among other mentioned factors like AI and layoffs: mass brain damage caused by never-ending COVID re-infections.
Since vaccines don't prevent transmission, and each re-infection increases the chances of long COVID complications, the only real protection right now is wearing a proper respirator everywhere you go, and basically nobody is doing that anymore.
There are tons of studies to back this line of reasoning.
I think it's cancer, and it's getting worse.
One possibility is increased monitoring. In the past, issues that happened weren't reported because they went under the radar. Whereas now, those same issues which only impact a small percentage of users would still result in a status update and postmortem. But take this with a grain of salt because it's just a theory and doesn't reflect any actual data.
A lot of people are pointing to AI vibe coding as the cause, but I think more often than not, incidents happen due to poor maintenance of legacy code. But I guess this may be changing soon as AI written code starts to become "legacy" faster than regular code.
At least with GitHub it's hard to hide when you get "no healthy upstream" on a git push.
I thought I was going crazy when I couldn't push changes but now it seems it's time to just call it for the day. Back at it tomorrow.
Seeing auth succeed but push fail was an exercise in hair pulling.
Same, even started adding new ssh keys to no avail... (I was getting some nondescript user error first, then unhealthy upstream)
Would love to see a global counter for the number of times ‘ssh -T git@github.com’ was invoked.
same, i've started pulling my hair out, was about to nuke my setup and set it up all from scratch
lol same. Hilarious when this shit goes down that we all rely on like running water. I'm assuming GitHub was hacked by the NSA because someone uploaded "the UFO files" or sth.
GitHub is pretty easily the most unreliable service I've used in the past five years. Is GitLab better in this regard? At this point my trust in GitHub is essentially zero - they don't deserve my money any longer.
My company self-hosts GitLab. Gitaly (the git server) is a weekly source of incidents, it doesn't scale well (CPU/memory spikes which end up taking down the web interface and API). However we have pretty big monorepos with hundreds of daily committers, probably not very representative.
We self-host GitLab, so it's very stable. But GitLab also kind of is enterprise software. It hits every feature checkbox, but the features aren't well integrated, and they are kind of halfway done. I don't think it's as smooth of an experience as GitHub personally, or as feature rich. But GitLab can self-host your project repos, CI/CD, issues, wikis, etc., and it does it at least okay.
I would argue GitLab CI/CD is miles ahead of the dumpster fire that is GitHub Actions. Also the homepage is actually useful, unlike GitHub's.
Frequently use both `github.com` and self-hosted Gitlab. IMHO, it's just... different.
Self-hosted Gitlab periodically blocks access for auto-upgrades. Github.com upgrades are usually invisible.
Github.com is periodically hit with the broad/systemic cloud-outage. Self-hosted Gitlab is more decentralized infra, so you don't have the systemic outages.
With self-hosted GitLab, you likely have to deal with rude bots on your own. Github.com has an ops team that deals with the rude bots.
I'm sure the list goes on. (shrug)
You can make it as reliable as you want by hosting it on prem.
> as reliable as you want
We self-host GitLab but the team owning it is having a hard time scaling it. From my understanding talking to them, the design of Gitaly makes it very hard to scale beyond a certain repo size and number of pushes per day (for reference: our repos are GBs in size, ~1M commits, hundreds of merges per day).
Flashbacks to me pushing hard for GitLab self hosting a few months ago. The rest of the team did not feel the lift was worth it.
I utterly hate being at the mercy of a third party with an afterthought of a "status page" to stare at.
Another GitLab self-hosting user here, we've run it on Kubernetes for 6 years. It's never gone down for us, maybe an hour of downtime yearly as we upgrade Postgres to a new version.
Same. Also running on-prem for 6+ years with no issues. GitLab CI is WAY better than GitHub too.
We've been self hosting GitLab for 5 years and it's the most reliable service in our organization. We haven't had a single outage. We use Gitlab CI and security scanning extensively.
Ditto, self-hosted for over eight years at my last job. SCM server and 2-4 runners depending on what we needed. Very impressive stability and when we had to upgrade their "upgrade path" tooling was a huge help.
Forgejo, my dudes.
Do we know its uptime statistics?
What do you mean. Forgejo is self-hosted, uptime is up to you.
Mea culpa. I forgot, perhaps thinking of codeberg.
GitLab has regular issues (we use the SaaS) and the support isn't great. They acknowledge problems, but the same ones happen again and again. It's very hard to get anything on their roadmap, etc.
Couldn't log into it this morning when cloudflare was down so there's that.
There’s this Gitlab incident https://www.youtube.com/watch?v=tLdRBsuvVKc
I didn't really want to work today anyways. First cloudflare, now this... Seems like a sign to get some fresh air
We depend too much on US-centralized tech.
We need more sovereignty and decentralization.
How is this related to them being located in the USA?
Please check out radicle.dev, helping hands always welcome!
> Repositories are replicated across peers in a decentralized manner
You lost me there
"Replicated across peers in a decentralized manner" could just as easily be written about regular Git. Radicle just seems to add a peer-to-peer protocol on top that makes it less annoying to distribute a repository.
So I don't get why the project has "lost you", but I also suspect you're the kind of person any project could readily afford to lose as a user.
What this is trying to say:
- "peers": participants in the network are peers, i.e. both ends of a connection run the same code, in contrast to a client-and-server architecture, where the two sides often run pretty different code. To exemplify: the code GitHub's servers run is very different from the code that your IDE with Git integration runs.
- "replicated across peers": the Git objects in the repository, and "social artifacts" like discussions in issues and revisions in patches, are copied to other peers. This copy is kept up to date by doing Git fetches for you in the background.
- "in a decentralized manner": every peer/node in the network gets to locally decide which repositories they intend to replicate, i.e. you can talk to your friends and replicate their cool projects. And when you first initialize a repository, you can decide to make it public (which allows everyone to replicate it), or private (which allows a select list of nodes identified by their public key to replicate). There's no centralized authority which may tell you which repositories to replicate or not.
I do realize that we're trying to pack quite a bit of information in this sentence/tagline. I think it's reasonably well phrased, but for the uninitiated might require some "unpacking" on their end.
If we "lost you" on that tagline, and my explanation or that of hungariantoast (which is correct as well) helped you understand, I would appreciate if you could criticize more constructively and suggest a better way to introduce these features in a similarly dense tagline, or say what else you would think is a meaningful but short explanation of the project. If you don't care to do that, that's okay, but Radicle won't be able to improve just based on "you lost me there".
In case you actually understood the sentence just fine and we "lost you" for some other reason, I would appreciate if you could elaborate on the reason.
The sad part is both the web and git were developed as decentralized technologies, both of which we foolishly centralized later.
The underlying tech is still decentralized, but what good does that do when we've made everything that uses it dependent on a few centralized services?
FYI in an emergency you can edit files directly on Github without the need to use git.
Edit: ugh... if you rely on GH Actions for workflows, though, actions/checkout@v4 is also currently experiencing the git issues, so no dice if you depend on that.
FYI in an emergency you can `git push` to and `git pull` from any SSH-capable host without the need to use GitHub.
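For instance (host and path made up, and assuming git is installed on that box), any machine you can SSH into can hold a throwaway bare repo:

    # create an empty bare repository on the remote machine
    ssh me@backup.example.com 'git init --bare /srv/tmp/app.git'
    # push straight to it by URL, no remote configuration needed
    git push me@backup.example.com:/srv/tmp/app.git main
    # anyone else with SSH access to that box can pull the same branch
    git pull me@backup.example.com:/srv/tmp/app.git main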
FYI in an emergency you can SSH to your server and edit files and the DB directly.
Where is your god now, proponents of immutable filesystems?!
I love when people do that because they always say "I will push the fix to git later". They never do and when we deploy a version from git things break. Good times.
I started packing things into docker containers because of that. Makes it a bit more of a hassle to change things in production.
Depends on the org; at the big ones I've worked for, regular devs, even seniors, don't have anything like the level of access needed to pull a stunt like that.
At the largest place I did have prod creds for everything because sometimes they are necessary and I had the seniority (sometimes you do need them in a "oh crap" scenario).
They were all set up on a second account on my work Mac, which had a "danger, Will Robinson" wallpaper, because I know myself: it's far, far too easy to mentally fat-finger when you have two sets of creds.
FYI in an emergency, you can buy a plane ticket and send someone to access the server directly.
I actually had the privilege of being sent to the server.
Had a coworker have to drive across the country once to hit a power button (many years ago).
Because my suggestion that they have a spare ADSL connection for out-of-channel stuff was an unnecessary expense... until he broke the firewall, knocked a bunch of folks offline across a huge physical site, and locked himself out of everything.
The spare line got fitted the next month.
I'm actually getting "ERROR: no healthy upstream" on `git pull`.
They done borked it good.
If your remote is set to a git@github.com remote, it won't work. They're just pointing out that you could use git to set origin/your remote to a different ssh capable server, and push/pull through that.
Yup, we were just trying to hotfix prod and ran into this. What is happening to the internet lately.
We're not using Github Actions, but CircleCI is also failing git operations on Github (it doesn't recognise our SSH keys).
True that, and this time GitHub's AI actually has a useful answer: check githubstatus.com.
Can you create a branch through GitHub UI?
Yes. Just start editing a file and when you hit the "commit changes" button it will ask you what name to use for the branch.
Reflecting on the last decade, with my career spanning big tech and startups, I've seen a common arc:
Small and scrappy startup -> taking on bigger customers for greater profits / ARR -> re-architecting for "enterprise" customers and resiliency / scale -> more idealism in engineering -> profit chasing -> product bloat -> good engineers leave -> replaced by other engineers -> failures expand.
This may be an acceptable lifecycle for individual companies as they each follow the destiny of chasing profits ultimately. Now picture it though for all the companies we've architected on top of (AWS, CloudFlare, GCP, etc.) Even within these larger organizations, they are comprised of multiple little businesses (eg: EC2 is its own business effectively - people wise, money wise)
Having worked at a $big_cloud_provider for 7 yrs, I saw this internally at a service level. What started as a foundational service grew in scale and complexity, was architected for resiliency, and then slowly eroded its engineering culture to chase profits. Fundamental services became skeletons of their former selves, all while holding up the internet.
There isn't a singular cause here, and I can't say I know what's best, but it's concerning as the internet becomes more centralized into a handful of players.
tldr: how much of one's architecture and resiliency is built on the trust of "well (AWS|GCP|CloudFlare) is too big to fail" or "they must be doing things really well"? The various providers are not all that different from other tech companies on the inside. Politics, pressure, profit seeking.
Well said. I definitely agree (you’re absolutely right!) that the product will get worse through that re-architecting for enterprise transition.
But the small product also would not be able to handle any real amount of growth as it was, because it was a mess of tech debt and security issues and manual one-off processes and fragile spaghetti code that only Jeff knows because he wrote it in a weekend, and now he’s gone.
So by definition, if a service is large enough to serve a zillion people, it is probably big and bloated and complex.
I’m not disagreeing with you, I liked your comment and I’m just rambling. I have worked with several startups and was surprised at how poorly their tech scaled (and how riddled with security issues they were) as we got into it.
Nothing will shine a flashlight on all the stress cracks of a system like large-scale growth on the web.
> So by definition, if a service is large enough to serve a zillion people, it is probably big and bloated and complex.
Totally agree with your take as well.
I think the unfortunate thing is that there can exist a "Goldilocks zone" for this, where the service is capable of serving a zillion people AND is well architected. Unfortunately it can't seem to last forever.
I saw this in my career. More product SKUs were developed, new features/services defined by non-technical PMs, MBAs entered the chat, sales became the new focus over availability, and the engineering culture that made this possible eroded day by day.
The years I worked in this "Goldilocks zone" I'd attribute to:
- strong technical leadership at the SVP+ level that strongly advocated for security, availability, then features (in that order).
- a strong operational culture. Incidents were exciting internally, and post mortems, no matter how small the incident, were shared at a company-wide level.
- recognition for the engineers who chased ambulances and kept things running, beyond their normal job, this inspired others to follow in their footsteps.
There was a comment on another GitHub thread that I replied to. I got a response saying it’s absurd how unreliable Gh is when people depend on it for CI/CD. And I think this is the problem. At GitHub the developers think it’s only a problem because their ci/cd is failing. Oh no, we broke GitHub actions, the actions runners team is going to be mad at us! Instead of, oh no, we broke GitHub actions, half the world is down!
That larger view held only by a small sliver of employees is likely why reliability is not a concern. That leads to the every team for themselves mentality. “It’s not our problem, and we won’t make it our problem so we don’t get dinged at review time” (ok that is Microsoft attitude leaking)
Then there's their entrenched status. Real talk, no one is leaving GitHub. So customers will suck it up and live with it while angry employees grumble on an online forum. I saw this same attitude at major companies like Verio and Verisign in the early 2000s. "Yeah, we're down, but who else are you going to go to? Have a 20% discount since you complained. We will only be 1% less profitable this quarter due to it." The Kang and Kodos argument personified.
These views are my own and not related to my employer or anyone associated with me.
> We are seeing failures for some git http operations and are investigating
It's not just HTTPS, I can't push via SSH either.
I'm not convinced it's just "some" operations either; every single one I've tried fails.
I'm convinced the people who write status pages are incapable of escaping the phrasing "Some users may be experiencing problems". Too much attempting to save face by PR types, instead of just being transparent with information (… which is what would actually save face…)
And that's if you get a status page update at all.
A friend of mine was able to get through a few minutes ago, apparently. Everyone else I know is still fatal'ing.
https://www.githubstatus.com/incidents/5q7nmlxz30sk
it's up now (the incident, not the outage)
A lot of failures lately during the aI ReVoLuTiOn.
GitHub has a long history of garbage reliability that long predates AI
What's the local workaround for this?
Git is distributed, it should be possible to put something between our servers and github which pulls from github when it's running and otherwise serves whatever it used to have. A cache of some sort. I've found the five year old https://github.com/jonasmalacofilho/git-cache-http-server which is the same sort of idea.
I've run a git instance on a local machine which I pull from, where a cron job fetches from upstream into it, which solved the problem of cloning llvm over a slow connection, so it's doable on a per-repo basis.
I'd like to replace it globally though because CI looks like "pull from loads of different git repos" and setting it up once per-repo seems dreadful. Once per github/gitlab would be a big step forward.
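The per-repo version of that cache is pretty small (paths, schedule and repo are only examples; llvm-project because it came up above):

    # one-time: create a bare, read-only mirror of the upstream repo
    git clone --mirror https://github.com/llvm/llvm-project.git /srv/mirrors/llvm-project.git
    # cron entry: refresh every 10 minutes; if GitHub is down, the last good copy still serves clones
    # */10 * * * * git -C /srv/mirrors/llvm-project.git fetch --prune
    # CI and developers clone/fetch from the mirror instead of from github.com
    git clone /srv/mirrors/llvm-project.git

For the once-per-github version, I believe git's `url.<base>.insteadOf` config can rewrite github.com URLs to the cache host transparently, so individual jobs wouldn't need per-repo changes; the cache host itself still has to know which repos to mirror, or populate them lazily like the git-cache-http-server linked above.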
Looks like Gemini 3 figured out the best way to save costs on its compute time was to shut down github!
I'm also getting this. Cannot pull or push but can authenticate with SSH
    myrepo git:(fix/context-types-settings) gp
    ERROR: user:1234567:user
    fatal: Could not read from remote repository.
    myrepo git:(fix/context-types-settings) ssh -o ProxyCommand=none git@github.com
    PTY allocation request failed on channel 0
    Hi user! You've successfully authenticated, but GitHub does not provide shell access.
    Connection to github.com closed.
same
I cannot push/pull to any repos. Scared me for a second, but of course I then checked here.
It is insane how many failures we've been getting lately, especially related to actions.
* jobs not being picked up
* jobs not being able to be cancelled
* jobs running but showing up as failed
* jobs showing up as failed but not running
* jobs showing containers as pushed successfully to GitHub's registry, but then we get errors while pulling them
* ID token failures (E_FAIL) and timeouts.
I don't know if this is related to GitHub moving to Azure, or because they're allowing more AI generated code to pass through without proper reviews, or something else, but as a paying customer I am not happy.
Same! The current self-hosted runner gets hung every so often.
Probably because AI-generated reviews have made QA way worse.
github has had a few of these as of late, starting to get old
Remember talking about the exact same thing with very similar wording sometime pre-COVID
MSFT intentionally degrading operations to get everyone to move onto Azure… oh, wait, they just moved GitHub there, carry on my wayward son!
GitHub hasn't been moved onto azure yet, they just announced it's their goal to move over in 2026
The last outage was a whole 5 days ago https://news.ycombinator.com/item?id=45915731
Didn't I hear GitHub is moving to Microsoft Azure? I wonder if these outages are related to the move.
Remember hotmail :)
Huh? What were they on before? The acquisition by MSFT was 7 years ago; they maintained their own infrastructure for that long?
The Github CEO did step down a few months ago, they never named a successor. Could have something to do with the recent issues. https://news.ycombinator.com/item?id=44865560
Yes they are/were on their own hardware. The outages will only get worse with this move.
After restarting my computer, reinstalling git, almost ready to reinstall my os, I find out it's not even my fault
We live in a house of cards. I hope that eventually people in power realize this. However, their incentive structures do not seem to be a forcing function for that eventuality.
I have been thinking about this a lot lately. What would be a tweak that might improve this situation?
Not exactly for this situation, but I've been thinking about distributed caching of web content.
Even if a website is down, someone somewhere most likely has it cached. Why can't I read it from their cache? If I'm trying to reach a static image file, why do I have to get it from the source?
I guess I want torrent DHT for the web.
That is genuinely interesting. But, let's put all "this nerd talk" into terms that someone in the average C-suite could understand.
How can C-suite stock RSU/comp/etc be tweaked to make them give a crap about this, or security?
---
Decades ago, I was a teenager and I realized that going to fancy hotel bars was really interesting. I looked old enough, and I was dressed well. This was in Seattle. I once overheard a low-level cellular company exec/engineer complain about how he had to climb a tower, and check the radiation levels (yes non-ionizing). But this was a low level exec, who had to take responsibility.
He joked about how while checking a building on cap hill, he waved his wand above his head, and when he heard the beeps... he noped tf out. He said that it sucked that he had to do that, and sign-off.
That is actually cool, and real engineering/responsibility at the executive level.
Can we please get more of that type of thing?
Using p2p or self hosted, and accepting the temporary tradeoffs of no network effects.
It's weird to think all of our data lives on physical servers (not "in the cloud") that are fallible and made and maintained by fallible humans, and could fail at any moment. So long to all the data! Good ol' byzantine backups.
It's not only http, also ssh.
Same for me, fatal: unable to access 'https://github.com/repository_example.git/': The requested URL returned error: 500
I remember a colleague setting up a CI/CD system (on an aaS obviously) depending on Docker, npm, and who knows what else... I thought "I wonder what % of time all those systems are actually up at the same time"
side effect that isn't immediately obvious: all raw.githubusercontent.com content responds with a "404: Not Found" response.
this has broken a few pipeline jobs for me, seems like they're underplaying this incident
Yeah, something major is borked and they're unwilling to admit it. The status page initially claimed "https git operations are affected" when it was clear that SSH was too (it's updated to reflect that now).
Haha, I don't know if it's a good test or not, but I could not figure out why git pull was failing and Claude just went crazy trying so many random things.
Gemini 3 Pro after 3 random things announced Github was the issue.
This is incredibly annoying. I've been trying to fix a deployment action on GitHub for the past bit, so my entire workflow for today has been push, wait, check... push, wait, check... et cetera.
You should really check out (pun intended) `act` https://github.com/nektos/act
(epistemic status: frustrated, irrational)
This is what happens when they decide that all the budget should be spent on AI stuff rather than solid infra and devops
I wonder how much of this stuff has been caused by AI agents running on the infra? Claude Code is amazing for devops, until it kubectl deletes your ArgoCD root app
With Microsoft (behind GitHub) going full AI mode, I expected things to get worse.
I worked for one of the largest companies in my country; they had a "catch-up" with GitHub, and it is no longer about GitHub as you folks are used to, but about AI, aka Copilot.
We are seeing major techs, such as but not limited to Google, AWS and Azure, going down after making public that their code is 30% AI-generated (Google).
Even Xbox (Microsoft) and its gaming studios got destroyed (COD BO7) for heavy dependency on AI.
Don't you find it a coincidence that all of these system outages worldwide are happening right after they proudly shared their heavy dependency on AI??
Companies aren't using AI/ML to improve processes but to replace people, full stop. The AI stock market is having a massive meltdown as we speak, with indications that the AI bubble went live.
If you as a company wanna keep your productivity at 99.99% from now on:
* GitLab: self-hosted GitLab/runners
* Datacenter: AWS/GCP/Azure is no longer a safe option, or cheaper. We have data center companies such as Equinix which have a massive backup plan in place. I have visited one; they are prepared for a nuclear war and I am not even being dramatic. If I were starting a new company in 2025, I would go back to a datacenter over AWS/GCP/Azure.
* Self-host everything you can, and no, it does not require 5 days in the office to manage all of that.
I didn't see a case made for self-hosting as the better option, instead I see that proposition being assumed true. Why would it be better for my company to roll its own CI/CD?
I worked at a bank that self-hosted GitLab/runners.
As the AI bubble goes sideways, you don't know how your company data is being held; Copilot uses GitHub to train its AI, for instance. Yes, the big company I work for had a clause forbidding GitHub from using the company's repos for AI training.
How many companies can afford a dedicated GitHub team to speak to?? How many companies read the contracts or have any say??
Not many really.
Yeah sure, cloud is easier, you just pay the bills, but at what cost??
The internet is having one heck of a day! We focus on ecommerce technology, and I can't help but think our customers will be getting nervous pre-BFCM.
Seeing "404: Not Found" for all raw files
Mandatory break time has officially been declared. Please step away from your keyboard, hydrate, and pretend you were productive today.
Cloudflare this morning, and now this. A bunch of work isn't getting done today.
Maybe this will push more places towards self-hosting?
I really can't believe this. I had issues with CircleCI too earlier, soon after the incident with Cloudflare resolved.
this is actually the 5th or 6th time this month. Actions have been degraded constantly, and now push and pull are breaking back to back.
can’t go down is better than won’t go down.
the problem isn’t with centralized internet services, the problem is a fundamental flaw with http and our centralized client server model. the solution doesn’t exist. i’ll build it in a few years if nobody else does.
Same issue here for me. Downdetector [1] agrees, and github status page was just updated now.
Just as I was wondering why my `git push` wasn't working all of a sudden :D
The centralized internet continues to show its wonderful benefits.
At least microsoft decided we all deserve a couple hour break from work.
Why are there outages everywhere all the time now? AWS, Azure, GitHub, Cloudflare, etc. Is this the result of "vibe coding"? Because before "vibe coding", I don't remember having this many outages around the clock. Just saying.
I think it has more to do with layoffs.
"Why do we need so many people to keep things running!?! We never have downtime!!"
Which is the true reason for AI: reducing payroll costs.
That's the reason I detest those who push AI as a technological solution. I think AI as a field is interesting but highly immature; it's been overhyped to the point of absurdity, and now it is putting real downward pressure on wages. That pressure has carry-over effects, and I agree that we're starting to observe them.
Has to be a mix of both.
They fired a ton of employees with no rhyme or reason to cut costs, this was the predictable outcome. It will get worse if it ever gets better.
The funny thing is that the over hiring during the pandemic also had the predictable result of mass lay-offs.
Whoever manages HR should be the ones fired after two back to back disasters like this.
And yet we keep paying the company
Could also be that the hack-and-slash layoffs are starting to show their results. Remove crucial personnel, spread teams thin, combine that with low morale industry-wide, and you've got the perfect recipe for disaster.
AI use being pushed, team sizes being reduced, continued lack of care towards quality… enshittification marches on, gaining speed every day
Seeing "ERROR: no healthy upstream" in push/pull operations
Good thing I already moved away from gh to a selfhosted Forgejo instance.
What is today and who do I blame for it
Computers are great at solving problems that wouldn’t have existed without computers
Computers and alcohol.
Having a self hosted gitea server is a godsend in times like this!
Mercury is in retrograde
Github is down a lot...
Pain
My guess is that it has to do with the Cloudflare outage this morning.
I wish I could say something smart such as “People/Organisations should host their own git servers”, but as someone who had the misfortune of doing that in the past, I'd rather have a non-functional GitHub.
I've found Gitea to be pretty rock solid, at least for a small team.
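For a small team it really is close to a single container; a minimal sketch (ports, volume name, and tag are just illustrative defaults):

    # run Gitea with the web UI on 3000 and SSH on 222 (values are illustrative)
    docker run -d --name gitea \
      -p 3000:3000 -p 222:22 \
      -v gitea-data:/data \
      gitea/gitea:latest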
Would even recommend Forgejo (the same project Codeberg also uses as the base for their service)
I'm curious to learn from your mistakes, can you please elaborate what went wrong?
Almost one hour down now. What sets this apart from the recent AWS and Cloudflare issues is that this one appears to be a global outage.
Seems images on the GitHub web UI are also not showing.
What a day...
It's working again now.
I have said this before, and I will say this again: GitHub stars[1] are the real lock-in for GitHub. That's why all open-core startups are always requesting you to "star them on GitHub".
The VCs look at stars before deciding which open-core startup to invest in.
The 4 or 5 9s of reliability simply do not matter as much.
I'm going to awkwardly bring up that we have avoided all github downtime and bugs and issues by simply not using github.
Our git server is hosted by Atlassian. I think we've had one outage in several years?
Our self-hosted Jenkins setup is similarly robust; we've had a handful of hours of "can't build" in, again, several years.
We are not a company made up of rockstars. We are not especially competent at infrastructure. None of the dev teams have ever had to care about our infrastructure (occasionally we read a wiki or ask someone a question).
You don't have to live in this broken world. It's pretty easy not to. We had self hosted Mercurial and jenkins before we were bought by the megacorp, and the megacorp's version was even better and more reliable.
Self host. Stop pretending that ignoring complexity is somehow better.
It used to be having GitHub in the critical path for deployment wasn't so bad, but these days you'd have to be utterly irresponsible to work that way.
They need to get a grip on this.
Eh, the lesson from the us-east-1 outage is that you should cling to the big ones instead. You get the convenience + nobody gets mad at you over the failure.
Everything will have periods of unreliability. The only solution is to be multi-everything (multi-provider for most things), but the costs for that are quite high, and it's hard to see the value in it.
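For the git-hosting piece at least, a cheap partial version of "multi-provider" is just mirroring every push to a second host. A sketch, with placeholder remote URLs:

    # keep origin's fetch URL, but push to two hosts at once (URLs are placeholders)
    git remote set-url --add --push origin git@github.com:example/repo.git
    git remote set-url --add --push origin git@gitlab.com:example/repo.git

    # a single push now updates both
    git push origin main

It doesn't cover pull requests, issues, or CI, but at least the code itself gets a second home.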
yes, but if you are going to provide assurances like SLAs, you need to be aware of your own dependencies and allow for them. If your customers require working with known problem areas, you should add a clause exempting those areas when they are the cause.
Ton of people in the comments here wanting to blame AI for these outages. Either you are very new to the industry or have forgotten how frequently they happen. Github in particular was a repeat offender before the MS acquisition. us-east-1 went down many times before LLMs came about. Why act like this is a new thing?
Same.
ERROR: no healthy upstream fatal: Could not read from remote repository.
Cloudflare, GitHub...
Git push and pull not working. Getting a 500 response.
just realized my world stops when github does.
Seeing "ERROR: no healthy upstream" in push/pull.
can the internet work for 5 minutes, please?
Same issue, and I need to complete my work :(
hell yea brother
We gonna need xkcd "compiling" but with "cloudflare||github||chatgpt||spotify down".
Git pull and push not working
Man, I sound like a broken record, but... Love that for them.
How many more outages until people start to see that farming out every aspect of their operations maybe, might, could have a big effect on their overall business? What's the breaking point?
Then again, the skills to run this stuff properly are getting rarer, so we'll probably see big incidents like this popping up more and more frequently as time goes on.
It's back
It would be nice if this was actually broken down bit-by-bit after it happened, if only for paying customers of these cloud services.
These companies are supposed to have the top people on site reliability. That these things keep happening and no one really knows why makes me doubt them.
Alternatively,
The takeaway for today: clearly, Man was not meant to have networked, distributed computing resources.
We thought we could gather our knowledge and become omniscient, to be as the Almighty in our faculties.
The folly.
The hubris.
The arrogance.
So that’s how the Azure migration is going.
Spooky day today on the internet. Huge CF outage, Gemini 3 launches, and now I can't push anything to my repos.
can't do git pull or git push; 503 and 500 errors
Cherry on top will be another aws outage
Funny you should say that, I'm here looking because our monitoring server is seeing 80-90% packet loss on our wireguard from our data center to EC2 Oregon...
FYI: Not AWS. Been doing some more investigation, it looks like it's either at our data center, or something on the path to AWS, because if I fail over to our secondary firewall it takes a slightly different path both internally and externally, but the packet loss goes away.
Is it just me, or does it seem like there's an increased frequency of these types of incidents as of late?
ICE keeps finding immigrants in the server cabinets.
Gemini 3 = Skynet ?
what else is out there like github?
Gitlab, Forgejo, Gitea, Gogs, ... or you can just push to your own VPS over SSH, with or without an HTTP server. We had a good discussion of this last option here three weeks ago: https://news.ycombinator.com/item?id=45710721
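The bare-repo-over-SSH option really is as small as it sounds; a rough sketch with placeholder host and paths:

    # one-time setup on the VPS
    ssh user@vps.example.com 'git init --bare ~/repos/myproject.git'

    # locally: add it as an extra remote and push
    git remote add vps user@vps.example.com:repos/myproject.git
    git push vps main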
GitLab is probably the next-largest competitor. Unlike GitHub it's actually open source, so you can use their managed offering or self-host.
Give https://codeberg.org/ a go
can't do git pull or push; 503 and 500 errors
Obviously just speculation, but maybe don't let AI write your code...
Microsoft CEO says up to 30% of the company’s code was written by AI https://techcrunch.com/2025/04/29/microsoft-ceo-says-up-to-3...
It's degraded availability of Git operations.
The enterprise cloud in EU, US, and Australia has no issues.
If you look at the incident history, disruptions have been happening often in the public cloud for years already, before AI wrote code for them.
The enterprise cloud runs on older stable versions of GitHub's backend/frontend code.
That sounds very bad, but I guess it depends also on which code it is. And whether Nadella actually knows what he's talking about, too.
Maybe AI is the tech support too
Sweet. 30% of Microsoft's code isn't protected by copyright.
Time to leak that.
What a ridiculous comment, as if these outages didn't happen before LLMs became more commonplace.
I admit it was a bit ridiculous. However, if Microsoft is going to brag about how much AI code they are using but not also brag about how good the code is, then we are left to speculate. The two outages in two weeks are _possible_ data points and all we have to go on unless they start providing data.
What a ridiculous comment, as if these outages haven't been increasing in quantity since LLMs became more commonplace