Out of curiosity, I gave it the latest Project Euler problem, published on 11/16/2025 and very likely not in the training data.
Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says the three fastest humans to solve this problem took 14min, 20min and 1h14min respectively.
Even though I expect this sort of problem to be very much in the distribution of what the model has been RL-tuned to do, it's wild that a frontier model can now solve in minutes what would take me days.
The fact that Gemini 3 is so far ahead of every other frontier model in math might be telling us something more general about the model itself.
It scored 23.4% on MathArena Apex, compared with 0.5% for Gemini 2.5 Pro, 1.6% for Claude Sonnet 4.5 and 1.0% for GPT-5.1.
This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.
To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.
You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.
The SimpleQA benchmark is another datapoint that we're probably looking at a research breakthrough, not just more data or more compute. Gemini 3 Pro achieved more than double the reliability of GPT-5.1 (72.1% vs. 34.9%).
This isn't an incremental gain, it's a step-change leap in reducing hallucinations.
And it's exactly what you'd expect to see if there's an underlying shift from probabilistic token prediction to verified search, with better error detection and backtracking when it finds an error.
That could explain the breakout performance on math, and reliability, and even operating graphical user interfaces (ScreenSpot-Pro at 72.7% vs. 3.5% for GPT-5.1).
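To make the "verified search" idea concrete, here is a toy sketch of what verify-and-backtrack means in search terms. This is purely illustrative Python of the general concept; it claims nothing about how Gemini is actually built.

    # Toy illustration of "verified search": propose candidate steps, check each
    # with a verifier, and backtrack when a check fails. Generic sketch of the
    # concept only, not a claim about any model's internals.
    def verified_search(state, propose, verify, is_goal, depth=0, max_depth=10):
        """Depth-first search that only extends verified partial solutions."""
        if is_goal(state):
            return state
        if depth >= max_depth:
            return None
        for candidate in propose(state):          # e.g. sampled continuations
            if not verify(state, candidate):      # reject bad steps early...
                continue                          # ...and try a different branch
            result = verified_search(candidate, propose, verify, is_goal,
                                     depth + 1, max_depth)
            if result is not None:
                return result
        return None                               # nothing worked: backtrack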
I usually ask a simple question that ALL the models get wrong: a list of the mayors of my city [Londrina]. ALL the models (offline) get it wrong. And I mean all the models. The best I got was from o3, I believe, which said it couldn't give a good answer for that and told me to access the city website.
Gemini 3 somehow is able to give a list of mayors, including details on who got impeached, etc.
This should be a simple answer, because all the data is on Wikipedia, which the models are certainly trained on, but somehow most models don't manage to get it right, because... it's just an irrelevant city in a huge dataset.
But somehow, Gemini 3 did it.
Edit: Just asked "Cool places to visit in Londrina" (in Portuguese), and it was also 99% right, unlike other models, which just make things up. The only thing wrong was that it mentioned sakuras by a lake... Maybe it confused them with Brazilian ipês, which are similar, and the city is indeed full of them.
It seems to have a visual understanding, imo.
Ha, I just did the same with my hometown (Guaiba, RS), a city 1/6th the size of Londrina, whose English Wikipedia page hasn't been updated in years and still lists the wrong mayor (!).
Gemini 3 nailed it on the first try, included political affiliation, and added some context on who they competed with and beat in each of the last 3 elections. And I just built a fun application with AI Studio, and it worked on the first shot. Pretty impressive.
(disclaimer: Googler, but no affiliation with Gemini team)
Pure fact-based, niche questions like that aren't really the focus of most providers any more from what I've heard, since they can be solved more reliably by integrating search tools (and all providers now have search).
I wouldn't be surprised if the smallest models can answer fewer such (fact-only) questions over time offline as they distill/focus them more thoroughly on logic etc.
thanks for sharing, very interesting example
Thanks for reporting these metrics and drawing the conclusion of an underlying breakthrough in search.
In his Nobel Prize lecture, Demis Hassabis ends by discussing how he sees all of intelligence as a big tree-like search process.
It tells me that the benchmark is probably leaking into the training data. Going to the benchmark site:
> Model was published after the competition date, making contamination possible.
Aside from eval on most of these benchmarks being stupid most of the time, these guys have every incentive to cheat - these aren't some academic AI labs, they have to justify hundreds of billions being spent/allocated in the market.
Actually trying the model on a few of my daily tasks and reading the reasoning traces all I'm seeing is same old, same old - Claude is still better at "getting" the problem.
[flagged]
Hmmm, I wrote those words myself, maybe I've spent too much time with LLMs and now I'm talking like them??
I'd be interested in any evidence-based arguments you might have beyond attacking my writing style and insinuating bad intent.
I found this commenter had sage advice about how to use HN well, I try to follow it: https://news.ycombinator.com/item?id=38944467
I’ll take you at your word, sorry for the incorrect callout. Your comment format appeared malicious, so my response wasn’t an attempt at being “snarky”, just acting defensively. I like the HN Rules/Guidelines.
This is something that is happening to me too, and frankly I'm a little concerned. English is not my first language, so I use AI for checking and writing many things. And I spend a lot of time with coding tools. And now I sometimes need to make a conscious effort to avoid mimicking some LLM patterns...
You mentioned "step change" twice. Maybe a once over next time? My favorite Mark Twain quote is (very paraphrased) "My apologies, had I more time, I would have written a shorter letter".
“If you gaze long into an abyss, the abyss also gazes into you.”
Is that you Nietzsche? Or are you Magog https://andromeda.fandom.com/wiki/Spirit_of_the_Abyss
You seem very comfortable making unfounded claims. I don't think this is very constructive or adds much to the discussion. While we can debate the stylistic changes of the previous commenter, you seem to be discounting the rate at which the writing style of various LLMs has backpropagated into many people's brains.
Also discounting the fact that people actually do talk like that. In fact, these days I have to modify my prose to be intentionally less LLM-like lest the reader thinks it's LLM output.
1) Models learn these patterns from common human usage. They are in the wild, and as such there will be people who use them naturally.
2) Now, given its for-some-reason-ubiquitous choice by models, it is also a phrasing that many more people are exposed to, every day.
Language is contagious. This phrasing is approaching herd levels, meaning models trained from up-to-the-moment web content will start to see it as less distinctly salient. Eventually, there will be some other high-signal novel phrase with high salience, and the attention heads will latch on to it from the surrounding context, and then that will be the new AI shibboleth.
It's just how language works. We see it in the mixes between generations when our kids pick up new lingo, and then it stops being in-group for them when it spreads too far.. Skibidi, 6 7, etc.
It's just how language works, and a generation ago the internet put it on steroids. Now? Even faster.
I also used Gemini 3 Pro Preview. It finished in 271s = 4m31s.
Sadly, the answer was wrong.
It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use websearch.
Still a useful tool though. It definitely gets the majority of the insights.
Prompt: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
> It also returned 8 "sources"
well, there's your problem. it behaves like a search summary tool and not like a problem solver if you enable google search
Exactly this - and it's how ChatGPT behaves too. After a few conversations with search enabled you figure this out, but they really ought to make the distinction clearer.
Terence Tao claims [0] contributions by the public are counter-productive since the energy required to check a contribution outweighs its benefit:
> (for) most research projects, it would not help to have input from the general public. In fact, it would just be time-consuming, because error checking
Since frontier LLMs make clumsy mistakes, they may fall into this category of 'error-prone' mathematician whose net contributions are actually negative, despite being impressive some of the time.
The requested prompt does not exist or you do not have access. If you believe the request is correct, make sure you have first allowed AI Studio access to your Google Drive, and then ask the owner to share the prompt with you.
I thought this was a joke at first. It actually needs drive access to run someone else's prompt. Wild.
On iOS safari, it just says “Allow access to Google Drive to load this Prompt”. When I run into that UI, my first instinct is that the poster of the link is trying to phish me. That they’ve composed some kind of script that wants to read my Google Drive so it can send info back to them. I’m only going to click “allow” if I trust the sender with my data. IMO, if that’s not what is happening, this is awful product design.
After ChatGPT accidentally indexed everyone's shared chats (and had a cache collision in their chat history early on) and Meta built a UI flow that filled a public feed full of super private chats... seems like a good move to use a battle-tested permission system.
Imagine the metrics though. "this quarter we've had a 12% increase on people using AI solutions in their google drive".
Google Drive is one of the bigger offenders in GSuite when it comes to “metrics-driven user-hostile changes”, and Google Meet is one of its peers.
In The Wire they asked Bunny to "juke the stats" - and he was having none of that.
Not really, that's just basic access control. If you've used Colab or Cloud Shell (or even just Google Cloud in general, given the need to explicitly allow the usage of each service), it's not surprising at all.
Why is this sad? You should be rooting for these LLMs to be as bad as possible.
If we've learned anything so far, it's that the parlor tricks of one-shot efficacy only get you so far. Drill into anything relatively complex with a few hundred thousand tokens of context and the models all start to fall apart in roughly the same way. Even when I've used Sonnet 4.5 with 1M token context, the model starts to flake out and get confused with a codebase of less than 10k LoC. Everyone seems to keep claiming these huge leaps and bounds, but I really have to wonder how many of these are just shilling for their corporate overlord. I asked Gemini 3 to solve a simple, yet not well documented problem in Home Assistant this evening. All it would take is 3-5 lines of YAML. The model failed miserably. I think we're all still safe.
Same. I've been needing to update a userscript (JS) that takes stuff like "3 for the price of 1", "5 + 1 free", "35% discount!" from a particular site and then converts the price to a % discount and the price per item / 250 grams.
It's an old userscript, so it is glitchy and only halfway works. I already pre-chewed the work by telling Gemini 3 exactly which new HTML elements it needs to match and which contents it needs to parse. So basically, the scaffolding is already there, the sources are already there; it just needs to put everything in place.
It fails miserably and produces very convincing-looking but failing code. Even letting it iterate multiple times does nothing, nor does nudging it in the correct direction. Mind you, JavaScript is probably the most trained-on language together with Python, and parsing HTML is one of the most common use cases.
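For a sense of how small the task is, the core parsing logic is roughly this. A rough Python sketch only; the real thing is a JavaScript userscript matching site-specific HTML, and the patterns here are illustrative, not the actual site's formats.

    import re

    # Sketch of the promo parsing the userscript needs to do (illustrative only).
    def effective_discount(promo):
        """Return the discount as a fraction, e.g. 0.35 for '35% discount!'."""
        promo = promo.lower()
        if m := re.search(r"(\d+(?:\.\d+)?)\s*%", promo):
            return float(m.group(1)) / 100                 # "35% discount!"
        if m := re.search(r"(\d+)\s*for the price of\s*(\d+)", promo):
            n, pay = int(m.group(1)), int(m.group(2))
            return 1 - pay / n                             # "3 for the price of 1"
        if m := re.search(r"(\d+)\s*\+\s*(\d+)\s*free", promo):
            paid, free = int(m.group(1)), int(m.group(2))
            return free / (paid + free)                    # "5 + 1 free"
        return None                                        # no promo recognized

    def price_per_250g(price, grams, discount=0.0):
        """Discounted price normalised to 250 grams."""
        return price * (1 - discount) / grams * 250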
Another hilarious example is MPV, which has very well-documented settings. I used to think that LLMs would mean you can just tell people to ask Gemini how to configure it, but 9 out of 10 times it will hallucinate a bunch of parameters that never existed.
It gives me an extremely weird feeling when other people are cheering that it is solving problems at superhuman speeds or that it coded a way to ingest their custom XML format in record time, with relatively little prompting. It seems almost impossible that LLMs can both be so bad and so good at the same time, so what gives?
1. Coding with LLMs seems to be all about context management. Getting the LLM to deal with the minimum amount of code needed to fix the problem or build the feature, carefully managing token limits and artificially resetting the session when needed so the context handover is managed, all that. Just pointing an LLM at a large code base and expecting good things doesn't work.
2. I've found the same with Gemini; I can rarely get it to actually do useful things. I have tried many times, but it just underperforms compared to the other mainstream LLMs. Other people have different experiences, though, so I suspect I'm holding it wrong.
It depends on your definition of safe. Most of the code that gets written is pretty simple — basic crud web apps, WP theme customization, simple mobile games… stuff that can easily get written by the current gen of tooling. That already has cost a lot of people a lot of money or jobs outright, and most of them probably haven’t reached their skill limit as developers.
As the available work increases in complexity, I reckon more will push themselves to take jobs further out of their comfort zone. Previously, the choice was to upskill for the challenge and greater earnings, or stay where you are, which is easy and reliable; the current choice is to upskill or get a new career. Rather than switch careers to something they have zero experience in, most will upskill. That puts pressure on the moderately higher-skill job market, which has far fewer people, and they start to upskill to outrun the implosion, which puts pressure on them to move upward, and so on. With even modest productivity gains in the whole industry, it’s not hard for me to envision a world where general software development just isn’t a particularly valuable skill anymore.
Everything in tech is cyclical. AI will be no different. Everyone outsourced, realized the pain and suffering and corrected. AI isn't immune to the same trajectory or mistakes. And as corporations realize that nobody has a clue about how their apps or infra run, you're one breach away from putting a relatively large organization under.
The final kicker in this simple story is that there are many, many narcissistic folks in the C-suite. Do you really think Sam Altman and Co are going to take blame for Billy's shitty vibe coded breach? Yeah right. Welcome to the real world of the enterprise where you still need an actual throat to choke to show your leadership skills.
I absolutely don’t think vibe coding or barely supervised agents will replace coders, like outsourcing claimed to, and in some cases did and still does. And outsourcing absolutely affected the job market. If the whole thing does improve and doesn’t turn out to be too wildly unprofitable to survive, what it will do is allow good quality coders— people who understand what can and can’t go without being heavily scrutinized— to do a lot more work. That is a totally different force than outsourcing, which to some extent, assumed software developers were all basically fungible code monkeys at some level.
That ship has sailed long ago.
I'm rooting for biological cognitive enhancement through gene editing or whatever other crazy shit. I do not want to have some corporation's AI chip in my brain.
Comment was deleted :(
Confirmed less wrong psyop victim
Generally, any expert hopes their tool/paintbrush/etc is as performant as possible.
And in general I'm all for increasing productivity, in all areas of the economy.
> You should bw rooting for these LLMs to be as bad as possible..
Why?
Rooting is useless. We should be taking conscious action to reduce the bosses' manipulation of our lives and society. We will not be saved by hoping to sabotage a genuinely useful technology.
How is it useful other than to people making money off token output? Continue to fry your brain.
They’re fantastic learning tools, for a start. What you get out of them is proportional to what you put in.
You’ve probably heard of the Luddites, the group who destroyed textile mills in the early 1800s. If not: https://en.wikipedia.org/wiki/Luddite
Luddites often get a bad rap, probably in large part because of employer propaganda and influence over the writing of history, as well as the common tendency of people to react against violent means of protest. But regardless of whether you think they were heroes, villains, or something else, the fact is that their efforts made very little difference in the end, because that kind of technological progress is hard to arrest.
A better approach is to find ways to continue to thrive even in the presence of problematic technologies, and work to challenge the systems that exploit people rather than attack tools which can be used by anyone.
You can, of course, continue to flail at the inevitable, but you might want to make sure you understand what you’re trying to achieve.
Arguably the Luddites don't get a bad enough rep. The lump of labour fallacy was as bad then as it is now or at any other time.
Again, that may at least in part be a function of how history was written. The Luddite wikipedia link includes this:
> Malcolm L. Thomis argued in his 1970 history “The Luddites” that machine-breaking was one of the very few tactics that workers could use to increase pressure on employers, undermine lower-paid competing workers, and create solidarity among workers. "These attacks on machines did not imply any necessary hostility to machinery as such; machinery was just a conveniently exposed target against which an attack could be made."[10] Historian Eric Hobsbawm has called their machine wrecking "collective bargaining by riot", which had been a tactic used in Britain since the Restoration because manufactories were scattered throughout the country, and that made it impractical to hold large-scale strikes.
Of course, there would have been people who just saw it as striking back at the machines, and leaders who took advantage of that tendency, but the point is it probably wasn’t as simple as the popular accounts suggest.
Also, there’s a kind of corollary to the lump of labor fallacy, which is arguably a big reason the US is facing such a significant political upheaval today: when you disturb the labor status quo, it takes time - potentially even generations - for the economy to adjust and adapt, and many people can end up relatively worse off as a result. Most US factory workers and miners didn’t end up with good service industry jobs, for example.
Sure, at a macro level an economist viewing the situation from 30,000 feet sees no problem - meanwhile on the ground, you end up with millions of people ready to vote for a wannabe autocrat who promises to make things the way they were. Trying to treat economics as a discipline separate from politics, sociology, and psychology in these situations can be misleading.
Today, we have found better ways to prevent machines from crushing children, e.g., more regulation through democracy.
[flagged]
are you pretending to be confused?
I see millions of kids cheating on their schoolwork, and many adults outsourcing reading and thinking to GPUs. There's like 0.001% of people that use them to learn responsibly. You are genuinely a fool.
Hey, I wrote a long response to your other reply to me, but your comment seems to have been flagged so I can no longer reply there. Since I took the time to write that, I'm posting it here.
I'm glad I was able to inspire a new username for you. But aren't you concerned that if you let other people influence you like that, you're frying your brain? Shouldn't everything originate in your own mind?
> They don't provide any value except to a very small percentage of the population who safely use them to learn
There are many things that only a small percentage of the population benefit from or care about. What do you want to do about that? Ban those things? Post exclamation-filled comments exhorting people not to use them? This comes back to what I said at the end of my previous comment:
You might want to make sure you understand what you’re trying to achieve.
Do you know the answer to that?
> A language model is not the same as a convolution neural network finding anomalies on medical imagining.
Why not? Aren't radiologists "frying their brains" by using these instead of examining the images themselves?
The last paragraph of your other comment was literally the Luddite argument. (Sorry I can't quote it now.) Do you know how to weave cloth? No? Your brain is fried!
The world changes, and I find it more interesting and challenging to change with it, than to fight to maintain some arbitrary status quo. To quote Ghost in the Shell:
All things change in a dynamic environment. Your effort to remain what you are is what limits you.
For me, it's not about "getting ahead" as you put it. It's about enjoying my work, learning new things. I work in software development because I enjoy it. LLMs have opened up new possibilities for me. In that 5 year future you mentioned, I'm going to have learned a lot of things that someone not using LLMs will not have.
As for being dependent on Altman et al., you can easily go out and buy a machine that will allow you to run decent models yourself. A Mac, a Framework desktop, any number of mini PCs with some kind of unified memory. The real dependence is on the training of the models, not running them. And if that becomes less accessible, and new open weight models stop being released, the open weight models we have now won't disappear, and aren't going to get any worse for things like coding or searching the web.
> Keep falling for lesswrong bs.
Good grief. Lesswrong is one of the most misleadingly named groups around, and their abuse of the word "rational" would be hilarious if it weren't sad. In any case, Yudkowsky advocated being ready to nuke data centers, in a national publication. I'm not particularly aware of their position on the utility of AI, because I don't follow any of that.
What I'm describing to you is based on my own experience, from the enrichment I've experienced from having used LLMs for the past couple of years. Over time, I suspect that kind of constructive and productive usage will spread to more people.
Out of respect for the time you put into your response, I will try to respond in good faith.
> There are many things that only a small percentage of the population benefit from or care about. What do you want to do about that?
---There are many things in our society that I would like to ban that are useful to a small percentage of the population, or at least should be heavily regulated. Guns, for example. A more extreme example would be cars. Many people drive 5 blocks when they could walk, to their (and everyone else's) detriment. Forget the climate; it impacts everyone (brake dust, fumes, pedestrian deaths). Some cities create very expensive tolls / parking fees to prevent this; this angers most people and is seen as irrational by the masses, but it is necessary and not done enough. Open, free societies are a scam told to us by capitalists that want to exploit without any consequences.
--- I want to air-gap all computers in classrooms. I want students to be expelled for using LLMs to do assignments, as they would have been previously for plagiarism (that's all an llm is, a plagiarism laundering machine).
---During COVID there was a phenomenon where some children did not learn to speak until they were 4-5 years old, and some of those children were even diagnosed with autism. In reality, we didn't understand fully how children learned to speak, and didn't understand the importance of the young brain's need to subconsciously process people's facial expressions. It was Masks!!! (I am not making a statement on masks fyi) We are already observing unpredictable effects that LLMs have on the brain and I believe we will see similar negative consequences on the young mind if we take away the struggle to read, think and process information. Hell I already see the effects on myself, and I'm middle aged!
> Why not? Aren't radiologists "frying their brains" by using these instead of examining the images themselves?
--- I'm okay with technology replacing a radiologist!!! Just like I'm okay with a worker being replaced in an unsafe textile factory! The stakes are higher in both of these cases, and the replacement is obviously in the best interest of society as a whole. The same cannot be said for a machine that helps some people learn while making the rest dependent on it. It's the opposite of a great equalizer; it will lead to a huge gap in inequality for many different reasons.
We can all say we think this will be better for learning, but that remains to be seen. I don't really want to run a worldwide experiment on a generation of children so tech companies can make a trillion dollars, but here we are. Didn't we learn our lesson with social media/porn?
If Ubers were subsidized and cost only $20.00 a month for unlimited rides, could people be trusted to only use them when it was reasonable, or would they be taking Ubers to go 5 blocks, increasing the risk for pedestrians and deteriorating their own health? They would use them in an irresponsible way.
If there was an unlimited pizza machine that cost $20.00 a month to create unlimited food, people would see that as a miracle! It would greatly benefit the percentage of the population that is food insecure, but could they be trusted not to eat themselves into obesity after getting their fill? I don't think so. The affordability of food, and the access to it, has a direct correlation with obesity.
Both of these scenarios look great on the surface but are terrible for society in the long run.
I could go on and on about the moral hazards of LLMs; there are many more beyond just the dangers to learning and labor. We are being told they are game-changing by the people who profit off them.
In the past, empires bet their entire kingdoms on the words of astronomers and magicians who said they could predict the future. I really don't see how the people running AI companies are any different from those astronomers (they even say they can predict the future LOL!)
They are Dunning-Kruger plagiarism laundering machines as I see it. Text-extruding machines controlled by a cabal of tech billionaires who have proven time and time again they do not have society's best interests at heart.
I really hope this message is allowed to send!
Ok, so there’s a clear pattern emerging here, which is that you think we should do much more to manage our use of technology. An interesting example of that is the Amish. While they take it to what can seem like an extreme, they’re doing exactly what you’re getting at, just perhaps to a different degree.
The problem with such approaches is that they involve some people imposing their opinions on others, “for their own good”. That kind of thing often doesn’t turn out well. The Amish address that by letting their children leave to experience the outside world, so that their return is (arguably) voluntary - they have an opportunity to consent to the Amish social contract.
But what you seem to be doing is making a determination of what’s good for society as a whole, and then because you have no way to effect that, you argue against the tools that we might abuse rather than the tendencies people have to abuse them. It seems misplaced to me. I’m not saying there are no societal dangers from LLMs, or problems with the technocrats and capitalists running it all, but we’re not going to successfully address those issues by attacking the tools, or people who are using them effectively.
> In the past, empires bet their entire kingdom's on the words of astronomers and magicians who said they could predict the future.
You’re trying to predict the future as well, quite pessimistically at that.
I don’t pretend to be able to predict the future, but I do have a certain amount of trust in the ability of people to adapt to change.
> that's all an llm is, a plagiarism laundering machine
That’s a possible application, but it’s certainly not all they are. If you genuinely believe that’s all they are, then I don’t think you have a good understanding of them, and it could explain some of our difference in perspective.
One of the important features of LLMs is transfer learning: their ability to apply their training to problems that were not directly in their training set. Writing code is a good example of this: you can use LLMs to successfully write novel programs. There’s no plagiarism involved.
Your post made me curious to try a problem I have been coming back to ever since ChatGPT was first released: https://open.kattis.com/problems/low
I have had no success using LLM's to solve this particular problem until trying Gemini 3 just now despite solutions to it existing in the training data. This has been my personal litmus test for testing out LLM programming capabilities and a model finally passed.
ChatGPT solves this problem now as well with 5.1. Time for a new litmus test.
Comment was deleted :(
To be fair a lot of the impressive Elo scores models get are simply due to the fact that they're faster: many serious competitive coders could get the same or better results given enough time.
But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.
I’d love if anyone could provide examples of such AND(“ground truth”, “absolutely ridiculous”) solutions! Even if they took clever humans a long time to create.
I’m curious to explore such fun programming code. But I’m also curious to explore what knowledgeable humans consider to be both “ground truth” as well as “absolutely ridiculous” to create within the usual time constraints.
I'm not explaining myself right.
Stockfish is a superhuman chess program. It's routinely used in chess analysis as "ground truth": if Stockfish says you've made a mistake, it's almost certain you did in fact make a mistake[0]. Also, because it's incomparably stronger than even the very best humans, sometimes the moves it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them in tournament conditions.
Obviously software development in general is way more open-ended, but if we restrict ourselves to puzzles and competitions, which are closed game-like environments, it seems plausible to me that a similar skill level could be achieved with an agent system that's RL'd to death on that task. If you have base models that can get there, even inconsistently so, and an environment where making a lot of attempts is cheap, that's the kind of setup that RL can optimize to the moon and beyond.
I don't predict the future and I'm very skeptical of anybody who claims to do so, correctly predicting the present is already hard enough, I'm just saying that given the progress we've already made I would find plausible that a system like that could be made in a few years. The details of what it would look like are beyond my pay grade.
---
[0] With caveats in endgames, closed positions and whatnot, I'm using it as an example.
Yeah, it is often pointed out as a brilliance in game analysis if a GM makes a move that an engine says is bad and turns out to be good. However, it only happens in very specific positions.
> Yeah, it is often pointed out as a brilliance in game analysis if a GM makes a move that an engine says is bad and turns out to be good.
Do you have any links? I haven't seen any such (forget GM, not even Magnus), barring the opponent making mistakes.
Here’s a chess stackexchange of positions that stump engines
https://chess.stackexchange.com/questions/29716/positions-th...
It basically comes down to “ideas that are rare enough that they were never programmed into a chess engine”.
Blockades or positions where no progress is possible are a common theme. Engines will often keep tree searching where a human sees an obvious repeating pattern.
Here’s also an example where two engines are playing, and DeepMind's engine finds a move that I think would be obvious to most grandmasters, yet Stockfish misses it: https://youtu.be/lFXJWPhDsSY?si=zaLQR6sWdEJBMbIO
That being said, I’m not sure that this necessarily correlates with brilliancy. There are a few of these that I would probably get in classical time and I’m not a particularly brilliant player.
Maybe he means not the best move but an equally almost strong move?
Because ya, that doesn't happen lol.
Does that happen because the player understands some tendency of their opponent that will cause them to not play optimally? Or is it genuinely some flaw in the machine’s analysis?
Both, but perhaps more often neither.
From what I've seen, sometimes the computer correctly assesses that the "bad" move opens up some kind of "checkmate in 45 moves" that could technically happen, but requires the opponent to see it 45 moves ahead of time and play something that would otherwise appear to be completely sub-optimal until something like 35 moves in, at which point normal peak grandmasters would finally go "oh okay now I get the point of all of that confusing behavior, and I can now see that I'm going to get mated in 10 moves".
So, the computer is "right" - that move is worse if you're playing a supercomputer. But it's "wrong" because that same move is better as long as you're playing a human, who will never be able to see an absurd thread-the-needle forced play 45-75 moves ahead.
That said, this probably isn't what GP was referring to, as it wouldn't lead to an assignment of a "brilliant" move simply for failing to see the impossible-to-actually-play line.
This is similar to game theory optimal poker. The optimal move is predicated on later making optimal moves. If you don’t have that ability (because you’re human) then the non-optimal move is actually better.
Poker is funny because you have humans emulating human-beating machines, but that’s hard enough to do that you have players who don’t do this win as well.
I think this is correct for modern engines. Usually, these moves are open to a very particular line of counterplay that no human would ever find because they rely on some "computer" moves. Computer moves are moves that look dumb and insane but set up a very long line that happens to work.
It does happen that the engine doesn't immediately see that a line is best, but that's getting very rare those days. It was funny in certain positions a few years back to see the engine "change its mind" including in older games where some grandmaster found a line that was particularly brilliant, completely counter-intuitive even for an engine, AND correct.
But mostly what happens is that a move isn't so good, but it isn't so bad either: the computer will tell you it is sub-optimal, but a human won't be able to refute it in finite time, so the opponent's practical (as opposed to theoretical) chances are reduced. One great recent example of that is Pentala Harikrishna's queen sacrifice in the World Cup, an amazing conception that the computer says is borderline incorrect, but which leads to such complications and such an uncomfortable position for his opponent that it was practically a great choice.
It can be either one. In closed positions, it is often the latter.
It's only the latter if it's a weak browser engine, and it's early enough in the game that the player has studied the position with a cloud engine.
I would love to examine Stockfish play that seemed extremely counterintuitive but which ended up winning. How can I do so? (I don't inhabit any of the current chess spaces so have no idea where to look, but my son is approaching the age where I can start to teach him...).
That said, chess is such a great human invention. (Go is up there too. And texas no-limit hold'em poker. Those are my top 3 votes for "best human tabletop games ever invented". They're also, perhaps not uncoincidentally, the hardest for computers to be good at. Or, were.)
The problem is that Stockfish is so strong that the only way to have it play meaningful games is to put it against other computers. Chess engines play each other in automated competitions like TCEC.
If you look on Youtube there are many channels where strong players analyze these games. As Demis Hassabis once put it, it's like chess from another dimension.
I recommend Matthew Sadler's Game Changer and The Silicon Road To Chess Improvement.
You explained yourself right. The issue is that you keep qualifying your statements.
> it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them...
> ... in tournament conditions.
I'm suggesting that I'd like to see the ones that humans have found - outside of tournament conditions. Perhaps the gulf between us arises from an unspoken reference to solutions "unrealistic to expect a human to find" without the window-of-time qualifier?
I can wreck stockfish in chess boxing. Mostly because stockfish can't box, and it's easy for me to knock over a computer.
If it runs on a mainframe you would lose both the chess and the boxing.
The point of that qualifier is that you can expect to see weird moves outside of tournament conditions, because casual games are when people experiment with that kind of thing.
How are they faster? I don’t think any ELO report actually comes from participating at a live coding contest on previously unseen problems.
My background is more on math competitions, but all of those things are essentially speed contests. The skill comes from solving hard problems within a strict time limit. If you gave people twice the time, they'd do better, but time is never going to be an issue for a computer.
Comparing raw Elo ratings isn't very indicative IMHO, but I do find it plausible that in closed, game-like environments models could indeed achieve the superhuman performance the Elo comparison implies, see my other comment in this thread.
Just to clarify the context for future readers: the latest problem at the moment is #970: https://projecteuler.net/problem=970
I just had ChatGPT explain that problem to me (I was unfamiliar with the mathematical background). It showed how to derive closed-form answers for H(2) and H(3) and then numerical solutions using RK4 for higher values. Truly impressive, and it explained the derivations beautifully. There are few maths experts I've encountered who could have hand-held me through it as well.
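For anyone curious about the numerical part, a classical fourth-order Runge-Kutta step looks like the following. This is a generic sketch; the actual ODE and initial conditions from the problem aren't reproduced here, so f, y0, and the step count are placeholders.

    # Generic RK4 integrator (sketch only; f, t0, y0 and the step count are
    # placeholders, not the actual ODE from the Project Euler problem).
    def rk4_step(f, t, y, h):
        """Advance y' = f(t, y) by one step of size h using classical RK4."""
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

    def integrate(f, t0, y0, t1, steps=10_000):
        """Integrate from t0 to t1 with a fixed number of RK4 steps."""
        h = (t1 - t0) / steps
        t, y = t0, y0
        for _ in range(steps):
            y = rk4_step(f, t, y, h)
            t += h
        return y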
Comment was deleted :(
I tried it with gpt-5.1 thinking, and it just searched and found a solution online :p
Is there a solution to this exact problem, or to related notions (renewal equation etc.)? Anyway seems like nothing beats training on test
Are you sure it did not retrieve the answer using websearch?
gpt-5.1 gave me the correct answer after 2m 17s. That includes retrieving the Euler website. I didn't even have to run the Python script, it also did that.
Did it search the web?
Yeah, LLMs used to not be up to par for new Project Euler problems, but GPT-5 was able to do a few of the recent ones which I tried a few weeks ago.
The problem is these models are optimized to solve the benchmarks, not real world problems.
I asked Grok to write a Python script to solve this and it did it in slightly under ten minutes, after one false start where I'd asked it using a mode that doesn't think deeply enough. Impressive.
So when does the developer admit defeat? Do we have a benchmark for that yet?
According to a bunch of philosophers (https://ai-2027.com/), doom is likely imminent. Kokotajlo was on Breaking Points today. Breaking Points is usually less gullible, but the top comment shows that "AI" hype strategy detection is now mainstream (https://www.youtube.com/watch?v=zRlIFn0ZIlU):
AI researcher: "Just another trillion dollars. This time we'll reach superintelligence, I swear."
Every AI researcher calls it quits one YOLO run away from inventing a machine that turns all matter in the Universe into paperclips.
Does it matter if it is out of the training data? The models integrate web search quite well.
What if they have an internal corpus of new and curated knowledge that is constantly updated by humans and accessed in a similar manner? It could be active even if web search is turned off.
They would surely add the latest Euler problems with solutions in order to show off in benchmarks.
Wow. Sounds pretty impressive.
This is wild. I gave it some legacy XML describing a formula-driven calculator app, and it produced a working web app in under a minute:
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.
I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.
This is exactly the kind of task that LLMs are good at.
They are good at transforming one format to another. They are good at boilerplate.
They are bad at deciding requirements by themselves. They are bad at original research, for example developing a new algorithm.
> They are good at transforming one format to another. They are good at boilerplate.
You just described 90% of coding
They’re bad at 90% of coding, but for other reasons. That said if you babysit them incessantly they can help you move a bit faster through some of it.
90% of writing code, sure. But most professional programmers write code maybe 20% of the time. A lot of the time is spent clarifying requirements and similar stuff.
The more I hear about other developers' work, the more varied it seems. I've had a few different roles, from one programmer in a huge org to lead programmer in a small team, with a few stints of technical expert in-between. For each the kind of work I do most has varied a lot, but it's never been mostly about "clarifying requirements". As a grunt worker I mostly just wrote and tested code. As a lead I spent most time mentoring, reviewing code, or in meetings. These days I spend most of my time debugging issues and staring at graphics debugger captures.
Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini 3 Pro Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, even though it's a pretty constrained example.
> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.
[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
This is quite likely to be in the training data, since it's one of the projects in Wes Bos's free 30 days of JavaScript course[0].
I was under the impression that for this to work, the training data needs to be plentiful. One project is not enough since it's too "sparse".
But maybe this example was used by many other people and so it proliferated?
The repo[0] currently has been forked ~41300 times.
The subtle "wiggle" animation that the second hand makes after moving doesn't fire when it hits 12. Literally unwatchable.
In its defence, the code actually specifically calls that edge case out and justifies it:
// Calculate rotations
// We use a cumulative calculation logic mentally, but here simple degrees work because of the transition reset trick or specific animation style.
// To prevent the "spin back" glitch at 360->0, we can use a simple tick without transition for the wrap-around,
// but for simplicity in this specific React rendering, we will stick to standard 0-360 degrees.
// A robust way to handle the spin-back on the second hand is to accumulate degrees, but standard clock widgets often reset.

The Swiss and German railway clocks actually work the same way and stop for (half a?) second while the minute hand progresses.
Station clocks in Switzerland receive a signal from a master clock each minute that advances the minute hand, the seconds hand moves completely independent from the minute hand. This allows them to sync to the minute.
> The station clocks in Switzerland are synchronised by receiving an electrical impulse from a central master clock at each full minute, advancing the minute hand by one minute. The second hand is driven by an electrical motor independent of the master clock. It takes only about 58.5 seconds to circle the face; then the hand pauses briefly at the top of the clock. It starts a new rotation as soon as it receives the next minute impulse from the master clock.[3] This movement is emulated in some of the licensed timepieces made by Mondaine.
The video shows closer to 2 seconds for it to finally throw itself over in what could only be described as a "Thunk". I figured it would be a little more smooth.
Fixed with the prompt "Second hand doesn't shake when it lands on 12, fix it." in 131 seconds, with a bunch of useState()s and a useEffect().
in defense of 2.5 (Pro, at least), it was able to generate for me a metric UNIX clock as a webpage which I was amused by. it uses kiloseconds/megaseconds/etc. there are 86.4ks/day. The "seconds" hand goes around 1000 seconds, which ticks over the "hour" hand. Instead of saying 4am, you'd say it's 14.
as a calendar or "date" system, we start at UNIX time's creation, so it's currently 1.76 gigaseconds AUNIX. You might use megaseconds as the "week" and gigaseconds more like an era, e.g. Queen Elizabeth III's reign, persisting through the entire fourth gigasecond and into the fifth. The clock also displays teraseconds, though this is just a little purple speck atm. of course, this can work off-Earth where you would simply use 88.775ks as the "day"; the "dates" a Martian and Earthling share with each other would be interchangeable.
I can't seem to get anyone interested in this very serious venture, though... I guess I'll have to wait until the 50th or so iteration of Figure, whenever it becomes useful, to be able to build a 20-foot-tall physical metric UNIX clock in my front yard.
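The arithmetic behind that display is simple enough to sketch. This assumes plain UNIX epoch seconds; the breakdown and formatting below are just one illustrative way to do it.

    import time

    # Sketch of the "metric UNIX" breakdown described above: the date is the
    # UNIX timestamp split into giga/mega/kiloseconds, and the time of day is
    # kiloseconds since midnight UTC (86.4 ks per Earth day).
    def metric_unix_now(day_length_s=86_400.0):
        now = time.time()
        gs, rem = divmod(now, 1_000_000_000)      # gigaseconds since the epoch
        ms, rem = divmod(rem, 1_000_000)          # megaseconds
        ks = rem / 1_000                          # kiloseconds
        tod_ks = (now % day_length_s) / 1_000     # "hour" hand: ks into the day
        return (f"{int(gs)} Gs {int(ms)} Ms {ks:.1f} ks AUNIX, "
                f"time of day {tod_ks:.2f} ks")

    print(metric_unix_now())                      # pass 88_775.0 for a Martian sol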
https://ai.studio/apps/drive/1oGzK7yIEEHvfPqxBGbsue-wLQEhfTP...
I made a few improvements... which all worked on the first try... except the ticking sound, which worked on the second try (the first try was too much like a "blip")
This is cool. Gemini 2.5 Pro was also capable of this. Gemini was able to recreate famous piece of clock artwork in July: https://gemini.google.com/app/93087f373bd07ca2
"Against the Run": https://www.youtube.com/watch?v=7xfvPqTDOXo
https://ai.studio/apps/drive/1yAxMpwtD66vD5PdnOyISiTS2qFAyq1... <- this is very nice, I was able to make seconds smooth with three iterations (it used svg initially which was jittery, but eventually this).
That is not the same prompt the other person was using. In particular, this doesn't provide the time to set the clock to, which makes the challenge a lot simpler. This also includes JavaScript.
The prompt the other person was using is:
> Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.
Which is much more difficult.
For what it's worth, I supplied the same prompt as the OG clock challenge and it utterly failed, not only generating a terrible clock, but doing so with a fair bit of typescript: https://ai.studio/apps/drive/1c_7C5J5ZBg7VyMWpa175c_3i7NO7ry...
URL not found :(
"Allow access to Google Drive to load this Prompt."
.... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google.
You don't want to give Google access to files you've stored in Google Drive? It's also only access to an application specific folder, not all files.
Well, you also have to allow it to train on your data. Although this is not explicitly about your Google Drive data, and probably requires you to submit a prompt yourself, the barriers here are way too weak/fuzzy for me to consider granting access via any account with private info.
I'm assuming it's because AI Studio's persisted prompts, including shared ones, are stored in Drive, and prompt sharing is implemented on top of Drive file sharing, so if AI Studio doesn't have access to Drive it doesn't have access to the shared prompt.
Because most likely (at least according to Hanlon's razor) they somehow decided that using Google Drive as the only persistent storage backing AI studio was a reasonable UX decision.
It probably makes some sense internally in big tech corporation logic (no new data storage agreements on top of the ones the user has already agreed to when signing up for Drive etc.), but as a user, I find it incredibly strange too – especially since the text chats are in some proprietary format I can't easily open on my local GDrive replica, but the images generated or uploaded just look like regular JPEGs and PNGs.
It looks quite nice, though to nitpick, it has “quartz” and “design & engineering” for no reason.
Just like actual cheap but not bottom of the barrel clocks
holy shit! This is actually a VERY NICE clock!
Having seen the page the other day this is pretty incredible. Does this have the same 2000 token limit as the other page?
No, and also the other page was pure HTML and CSS. This clock is using React and Javascript, so it's not a fair comparison.
This isn't using the same prompt or stack as the page from that post the other day; on aistudio it builds a web app across a few different files. It's still fairly concise but I don't think it's that much so.
It also includes JavaScript, which was verboten in the original prompt, and doesn't specify the time the clock should be set to.
Static Pelican is boring. First attempt:
Generate SVG animation of following:
1 - There is High fantasy mage tower with a top window a dome
2 - Green goblin come in front of tower with a torch
3 - Grumpy old mage with beard appear in a tower window in high purple hat
4 - Mage sends fireball that burns goblin and all screen is covered in fire.
Camera view must be from behind of goblin back so we basically look at tower in front of us:
After a few more attempts, a longer animation with a story from my gamedev-inspired mind:
https://codepen.io/Runway/pen/zxqzPyQ
PS: but yeah thats attempt #20 or something.
This is bloody magical. I cannot believe it.
Seizure warning for the above link
edit: the flashing lights at the end seem to be mostly because of the Dark Reader extension
we are so cooked
That SVG is impressive, but wouldn’t be usable in a real product as-is.
more than the goblin?
This is honestly incredible
Wow looks like total shit and eventually very hard to take on and actually improve it, given the convoluted code it generated, YET people are impressed. What world are we living in...
You can criticize the code but "wow looks like total shit" is such an embarrassing thing to say considering the context. Imagine going back a few years and show them a tool outputting this from text. No-one would believe it.
Let me double down to get even more downvotes, here's a snippet of the code:
> setTimeout(() => showSub("Ah, Earl Grey.", 2000), 1000);
> setTimeout(() => showSub("Finally some peace.", 2000), 3500);
> // Scene 2
> setTimeout(() => showSub("Armor Clanking", 2000), 7000);
> setTimeout(() => showSub("Heavy Breathing", 2000), 10000);
If we will lose our jobs to this dumb slop I'd rather be happy doing something else
How would you do it?
I would properly separate data and code so that I can easily change the dialogue and its timing without having to rewrite all of the numbers in all of the code.
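Something along these lines, sketched in Python for brevity. The original snippet is JavaScript, so show_subtitle and the cue table here are stand-ins for whatever the generated code actually calls.

    import sched, time

    # Sketch of the data/code separation being suggested: the dialogue and its
    # timing live in one table, and a single loop schedules it.
    CUES = [
        (1.0,  "Ah, Earl Grey.",      2.0),
        (3.5,  "Finally some peace.", 2.0),
        # Scene 2
        (7.0,  "Armor Clanking",      2.0),
        (10.0, "Heavy Breathing",     2.0),
    ]

    def show_subtitle(text, duration):
        print(f"[{duration:.1f}s] {text}")

    def play(cues):
        s = sched.scheduler(time.time, time.sleep)
        for start, text, duration in cues:
            s.enter(start, 1, show_subtitle, (text, duration))
        s.run()

    play(CUES)

The point being that adding a scene or retiming a line means editing the table, not rewriting numbers scattered through the code.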
Why bother doing that when a non-engineer can just change the prompt and output a different result? :shrug:
Your desired setup is just a single prompt away...
You are missing the point of this exercise. This is not about code quality - it's about the capacity of the model to generate visuals with no guidance.
As for the code quality, it can really be as good or as bad as you desire. In this case it is what it is because I put zero effort into it.
we are returning to flash animations after 20 years
This reminded me of https://youtube.com/playlist?list=PLSq76P-lbX8VQmtv7gcAPkqlj...
Wow, that's very impressive
Holy crap. That's actually kind of incredible for a first attempt.
I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make some extrapolation about society, and what this means for others. Meanwhile.. I'm still wondering how they're still getting this problem wrong.
edit: I've a lot of good feedback here. I think there are ways I can improve my benchmark.
>>benchmarks are meaningless
No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.
>>my fairly basic python benchmark
I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.
Comment was deleted :(
they are not meaningless, but when you work a lot with LLMs and know them VERY well, then a few varied, complex prompts tell you all you need to know about things like EQ, sycophancy, and creative writing.
I like to compare them using chathub using the same prompts
Gemini still calls me "the architect" in half of the prompts. It's very cringe.
It’s very different to get a “vibe check” for a model than to get an actual robust idea of how it works and what it can or can’t do.
This exact thing is why people strongly claimed that GPT-5 Thinking was strictly worse than o3 on release, only for people to change their minds later when they’ve had more time to use it and learn its strengths and weaknesses. It takes time for people to really get to grips with a new model, not just a few prompt comparisons where luck and prompt selection will play a big role.
> Gemini still calls me "the architect" in half of the prompts. It's very cringe.

Can't say I've ever seen this in my own chats. Maybe it's something about your writing style?

I get that one can perhaps have an intuition about these things, but doesn't this seem like a somewhat flawed attitude to have, all things considered? That is, saying something to the effect of "well I know it's not too sycophantic, no measurement needed, I have some special prompts of my own and it passed with flying colors!" just sounds a little suspect on first pass, even if it's not totally unbelievable I guess.
Using a single custom benchmark as a metric seems pretty unreliable to me.
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
After taking a walk for a bit, I decided you're right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I've run.
This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.
While I still believe in the importance of a personalized suite of benchmarks, my Python one needs to be down-weighted or supplanted.
My bad to the Google team for the cursory brush-off.
Walks are magical. But also this reads partially like you got sent to a reeducation camp lol.
Comment was deleted :(
> This probably means my test is a little too niche.
> my python one needs to be down weighted or supplanted.
To me, this just proves your original statement. You can't know if an AI can do your specific task based on benchmarks. They are relatively meaningless. You must just try.
I have AI fail spectacularly, often, because I'm in a niche field. To me, in the context of AI, "niche" is "most of the code for this is proprietary/not in public repos, so statistically sparse".
I feel similarly. If you're working with some relatively niche APIs on services that don't get seen by the public, the AI isn't one-shotting anything. But I still find it helpful to generate some crap that I can then feel good about fixing.
No, do not share it. The bigger black hole these models are in, the better.
I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini3 was no better than 2.5.
Something else to consider. I often have much better success with something like: Create a prompt that creates a specification for a pacman game in a single html page. Consider edge cases and key implementation details that result in bugs. <take prompt>, execute prompt. It will often yield a much better result than one generic prompt. Now that models are trained on how to generate prompts for themselves this is quite productive. You can also ask it to implement everything in stages and implement tests, and even evaluate its tests! I know that isn't quite the same as "Implement pacman on an HTML page" but still, with very minimal human effort you can get the intended result.
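A minimal sketch of that two-stage flow in Python, assuming a hypothetical ask_model() helper that wraps whatever chat API you use (nothing here is tied to a specific vendor SDK):

```python
def ask_model(prompt: str) -> str:
    """Hypothetical placeholder: wire this up to whatever chat API you use."""
    return f"<model reply to: {prompt[:40]}...>"

# Stage 1: ask the model to write the specification prompt itself.
meta_prompt = (
    "Create a prompt that creates a specification for a pacman game in a "
    "single html page. Consider edge cases and key implementation details "
    "that result in bugs."
)
spec_prompt = ask_model(meta_prompt)

# Stage 2: execute the generated prompt to get the actual implementation.
implementation = ask_model(spec_prompt)

# Optional stage 3: have it review/test its own output before you run it.
review = ask_model(
    "Review the following single-file HTML game for bugs and suggest fixes:\n\n"
    + implementation
)
```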
I thought this kind of chaining was already part of these systems.
It made a working game for me (with a slightly expanded prompt), but the ghosts got trapped in the box after coming back from getting killed. A second prompt fixed it. The art and animation however was really impressive.
Your benchmarks should not involve IP.
The only intellectual property here would be trademark. No copyright, no patent, no trade secret. Unless someone wants to market the test results as a genuine Pac-Man-branded product, or otherwise dilute that brand, there's nothing should-y about it.
It's not an ethics thing. It's a guardrails thing.
That's a valid point, though an average LLM would certainly understand the difference between trademark and other forms of IP. I was responding to the earlier comment, whose author later clarified that it represented an ethical stance ("stealing the hard work of some honest, human souls").
Why? This seems like a reasonable task to benchmark on.
Because you hit guard rails.
Sure, reasonable to benchmark on if your goal is to find out which companies are the best at stealing the hard work of some honest, human souls.
correction: pacman is not a human and has no soul.
Why do you have to willfully misinterpret the person you're replying to? There's truth in their comment.
[flagged]
How can you be sure that your benchmark is meaningful and well designed?
Is the only thing that prevents a benchmark from being meaningful publicity?
I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.
I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar that it encapsulates my coding preferences and communication style, that's the proper benchmark for me.
I asked a semi related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs?
I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?
Thanks
> if it's not public, presumably LLMs would never get better at them.
Why? This is not obvious to me at all.
You're correct of course - LLMs may get better at any task of course, but I meant that publishing the evals might (optimistically speaking) help LLMs get better at the task. If the eval was actually picked up / used in the training loop, of course.
That kind of “get better at” doesn’t generalize. It will regurgitate its training data, which now includes the exact answer being looked for. It will get better at answering that exact problem.
But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric.
Problem solving ability is largely not from the pretraining data.
Yeah, great point.
I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and a known, definitive answer). I guess that this would be more valuable than publishing a fixed number of problems with known solutions. (and I get your point that in the end it might not matter because it's still about problem solving, not just rote memorization)
> This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.
Yeah I have my own set of tests and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 pro that was performing much better on the same tests several months ago vs. now.
I wonder whether it could be related to some kind of over-fitting, i.e. a prompting style that tends to work better with the older models, but performs worse with the newer ones.
I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.
A lot of newer models are geared towards efficiency, and if you add the fact that more efficient models are trained on the output of less efficient (but more accurate) models....
GPT-4/o3 might be the best we will ever have
I moved from using the model for Python coding to Golang coding and got incredible speedups in writing the correct version of the code
Is observed speed meaningful for a model preview? Isn’t it likely to go down once usage goes up?
Google reports a lower score for Gemini 3 Pro on SWEBench than Claude Sonnet 4.5, which is comparing a top tier model with a smaller one. Very curious to see whether there will be an Opus 4.5 that does even better.
I agree that benchmarks are noise. I guess, if you're selling an LLM wrapper, you'd care, but as a happy chat end-user, I just like to ask a new model about random stuff that I'm working on. That helps me decide if I like it or not.
I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.
IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.
This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!
and models are still pretty bad at playing tic-tac-toe, they can do it, but think way too much
it's easy to focus on what they can't do
Everything is about context. When you just ask for a non-concrete task, it still has to parse your input and figure out what tic-tac-toe means in this context and what exactly you expect it to do. That's where all the "thinking" goes.
Ask it to implement tic-tac-toe in Python for the command line. Or even just bring your own tic-tac-toe code.
Then make it imagine playing against you and it's gonna be fast and reliable.
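A rough sketch of the "bring your own code" version, where ordinary Python owns the game state and the model is only asked for its move (ask_model() is a hypothetical stand-in for a real chat API call):

```python
board = [" "] * 9  # cells 0-8, managed entirely in Python

def render(b):
    rows = [" | ".join(b[i:i + 3]) for i in (0, 3, 6)]
    return "\n---------\n".join(rows)

def ask_model(prompt: str) -> str:
    """Hypothetical placeholder for a real chat API call."""
    return "4"  # stub answer so the sketch runs end to end

def model_move(b):
    prompt = (
        "We are playing tic-tac-toe. You are O. Cells are numbered 0-8, "
        "left to right, top to bottom. Current board:\n"
        f"{render(b)}\n"
        "Reply with only the number of the empty cell you want to play."
    )
    move = int(ask_model(prompt))
    if b[move] != " ":
        raise ValueError("model picked an occupied cell")
    b[move] = "O"

board[0] = "X"     # human move
model_move(board)  # model move
print(render(board))
```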
The prompt was very concrete: draw a tic-tac-toe ASCII table and let's play. Gemini 2.5 thought for pages about particular moves.
Comment was deleted :(
curious if you tried grok 4.1 too
What's the benchmark?
I don't think it would be a good idea to publish it on a prime source of training data.
He could post an encrypted version and post the key with it to avoid it being trained on?
What makes you think it wouldn't end up in the training set anyway?
I wouldn't underestimate the intelligence of agentic AI, despite how stupid they are today.
Every AI corp has people reading HN.
Good personal benchmarks should be kept secret :)
nice try!
you already sent the prompt to gemini api - and they likely recorded it. So in a way they can access it anyway. Posting here or not would not matter in that aspect.
Could also just be rollout issues.
Could be. I'll reply to my comment later with pass/fail results of a re-run.
I'm dying to know what you're giving to it that it's choking on. It's actually really impressive if that's the case.
I find this hard to understand. I have AI completely choke on my code constantly. What are you doing where it performs so well? Web?
I constantly see failures in trivial vector projections, broken bash scripts that don't properly quote variables (they fail if there's a space in filenames), and a near-complete inability to do relatively basic image processing tasks (if they don't rely on template matches).
I accidentally spent $50 on Gemini 2.5 Pro last week, with Roo, trying to make a simple mock interface for some lab equipment. The result: it asks permission to delete everything it did and start over...
that's why everyone using AI for code should code in rust only.
Here are my notes and pelican benchmark, including a new, harder benchmark because the old one was getting too easy: https://simonwillison.net/2025/Nov/18/gemini-3/
Considering how important this benchmark has become to the judgement of state-of-the-art AI models, I imagine each AI lab has a dedicated 'pelican guy', a highly accomplished and academically credentialed person, who's working around the clock on training the model to make better and better SVG pelicans on bikes.
That would mean my dastardly scheme has finally come to fruition: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Pelican guy may be one of the last jobs to be automated
They've been training for months to draw that pelican, just for you to move the goalposts.
It's a pelican on a bike, not a goalpost. And bikes move. Well, pelicans move, too.
It's interesting that you mentioned on a recent post that saturation on the pelican benchmark isn't a problem because it's easy to test for generalization. But now looking at your updated benchmark results, I'm not sure I agree. Have the main labs been climbing the Pelican on a bike hill in secret this whole time?
Considering how many other "pelican riding a bicycle" comments there are in this thread, it would be surprising if this was not already incorporated in the training data. If not now, soon.
I don't think the big labs would waste their time on it. If a model is great at making the pelican but sucks at all other svg it becomes obvious. But so far the good pelicans are strong indicators of good general SVG ability.
Unless training on the pelican increases all SVG ability, then good job.
I absolutely think they would given the amount of money and hype being pumped into it.
I updated my benchmark of 30 pelican-bicycle alternatives that I posted here a couple of weeks ago:
https://gally.net/temp/20251107pelican-alternatives/index.ht...
There seem to be one or two parsing errors. I'll fix those later.
You should add ChatGPT.
I tried the first one and 5 Pro gives this: https://imgur.com/a/EhYroCE
I was interested (and slightly disappointed) to read that the knowledge cutoff for Gemini 3 is the same as for Gemini 2.5: January 2025. I wonder why they didn't train it on more recent data.
Is it possible they use the same base pre-trained model and just fine-tuned and RL-ed it better (which, of course, is where all the secret sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what having the same training cutoff points to?
The model card says: https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...
> This model is not a modification or a fine-tune of a prior model.
I'm curious why they decided not to update the training data cutoff date too.
Maybe that date is a rule of thumb for when AI generated content became so widespread that it is likely to have contaminated future data. Given that people have spoofed authentic Reddit users with Markov chains, it probably doesn’t go back nearly far enough.
My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.
My audio experiment was much less successful — I uploaded a 90-minute podcast episode and asked it to produce a labeled transcript. Gemini 3:
- Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts
- Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes.
- Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication.
Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful/less hallucinatory failure mode. And unfortunately, aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways.
Now try an actual speech model like ElevenLabs or Soniox, not something not made for it.
I wonder if you could get around this with a slightly more sophisticated harness. I suspect you're running into context length issues.
Something like
1.) Split audio into multiple smaller tracks.
2.) Perform first pass audio extraction.
3.) Find unique speakers and other potentially helpful information (maybe just a short summary of where the conversation left off).
4.) Seed the next stage with that information (yay multimodality) and generate the audio transcript for it.
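A rough sketch of that harness in Python; split_audio() and transcribe_chunk() are hypothetical placeholders for an ffmpeg/pydub split and a multimodal model call, and the 15-minute chunk size is arbitrary:

```python
from dataclasses import dataclass, field

CHUNK_SECONDS = 15 * 60  # arbitrary; tune to whatever the model handles well

@dataclass
class ChunkResult:
    transcript: str
    speakers: list[str] = field(default_factory=list)
    summary: str = ""  # short "where the conversation left off" recap

def split_audio(path: str, chunk_seconds: int) -> list[str]:
    """Placeholder: split with ffmpeg/pydub and return the chunk file paths."""
    raise NotImplementedError

def transcribe_chunk(chunk_path: str, known_speakers: list[str],
                     carryover_summary: str) -> ChunkResult:
    """Placeholder: one multimodal call, seeded with speakers + prior recap."""
    raise NotImplementedError

def transcribe(path: str) -> str:
    speakers: list[str] = []
    summary = ""
    parts: list[str] = []
    for chunk in split_audio(path, CHUNK_SECONDS):
        result = transcribe_chunk(chunk, speakers, summary)
        # Carry speaker labels and a short recap forward to seed the next chunk.
        speakers = sorted(set(speakers) | set(result.speakers))
        summary = result.summary
        parts.append(result.transcript)
    return "\n".join(parts)
```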
Obviously it would be ideal if a model could handle the ultra long context conversations by default, but I'd be curious how much error is caused by a lack of general capability vs simple context pollution.
The worst is when it fails to ingest simple PDF documents and lies and gaslights in an attempt to cover it up. Why not just admit you can't read the file?
This is specifically why I don't use Gemini. The gaslighting is ridiculous.
I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker based transcript while I'm not necessarily sure that Google's models do so, maybe they just hallucinate the speakers instead.
Agreed. I don’t see the need for Gemini to be able to do this task, although it should be able to offload it to another model.
What prompt do you use for that?
I just tried "analyze this audio file recording of a meeting and notes along with a transcript labeling all the speakers" (using the language from the parent's comment) and indeed Gemini 3 was significantly better than 2.5 Pro.
3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript:
[00:00] Greg: Hello.
[00:01] X: You great?
[00:02] Greg: Hi.
[00:03] X: I'm X.
[00:04] Y: I'm Y.
...
Super impressive!

Does it deduce everyone's name?
It does! I redacted them, but yes. This was a 3-person call.
I made a simple webpage to grab text from YouTube videos: https://summynews.com - might be great for this kind of testing? (I want to expand to other sources in the long run)
It's not even THAT hard. I am working on a side project that gets a podcast episode and then labels the speakers. It works.
Parakeet TDT v3 would be really good at that
Yes, this is the best solution for that goal. Use the MacWhisper app + Parakeet 3.
I love it that there's a "Read AI-generated summary" button on their post about their new AI.
I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:
> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
- from "Dirk Gently's Holistic Detective Agency"
Excellent reference. I tried to name an AI project at work Electric Monk, but it was deemed too 'controversial'.
Had to change it to Electric Mentor....
SMBC had a pretty great take on this: https://www.smbc-comics.com/comic/summary
There was another comic where one worker uses AI to turn their prompt in to a verbose email, then on the receiver side they use AI to turn the verbose email in to a short summary.
This feels too real to laugh at
Now let’s hope that it will also save labour on resolving cloud infrastructure downtimes too.
After outsourcing the developer job, we can outsource all of the manager jobs too, leaving the CEO with agentic AI coders as servants.
Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.
Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
"Were automating with AI" sounds better to investors than "We over hired and now need to downsize" or "We made some bad market bets, now need to free up cash flow"
> Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.
> Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
I don't suppose these assertions are based on anything. If "AI" reduces the amount of time an engineer spends writing crud, boilerplate, test cases, random scripts, etc., and they have 5% more time to do other things, then all else being equal a project can be done with 5% fewer engineers.
Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?
> Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?
Between the disagreements regarding performance metrics, the fact that AI will happily increase its own scope of work as well as facilitate increasing any task's, sprint's, or project's scope of work, and Jevons Paradox, the world may never know the answer to either of these questions.
It does improve productivity, just like a good IDE. But engineers didn't get replaced by IDEs and they haven't yet been replaced by AI.
By the time its good enough to replace actual engineers, any job done in front of a computer will be at risk. I'm hoping that will happen at the same time as AI embodiment in robots, then every job will be automated, not just computer based ones.
Your assertion was not that "an engineer has never been replaced by AI". It is that no engineer has been laid off because of AI.
You agree AI improves engineer productivity. So last remaining question is, does greater productivity mean that fewer people are required to satisfy a given demand?
The answer is yes of course. So at this point, supporting the assertion requires handwaving about shortages and induced demand and demand for engineers to develop and support AI and so on. Which are all reasonable, but it should become pretty apparent that you can't be confident in an assertion like that. I would say it's pretty likely that AI has resulted in engineers being laid off in specific instances if not the net numbers.
this is true
An AI-powered developer packs 3x the workload of a "traditional" dev into one single developer,
therefore the company doesn't need to hire 3 people. As a result, it literally kills job count.
"Not a single engineer has ever been laid off because of AI."
Are you insane??? Big tech has literally made some of its biggest layoffs ever over the past few months.
That's because of overhiring and other non-AI-related reasons (e.g. higher interest rates mean less VC funding available).
In reality, getting AI to do actual human work, as of the moment, takes much more effort and cost than you get back in cost savings. These companies will claim they are using AI, even if it's just a few engineers using Windsurf.
The companies claim AI is the reason they laid off engineers to make it look like they're innovating, not downsizing, which makes them look better in the eyes of investors and shareholders.
It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]...please count the legs) that so far every other model has failed agonizingly, even failing when I tell them they are failing, and they tend to fight back at me.
Gemini 3 however, while still failing, at least recognized the 5th leg, but thought the dog was...well endowed. The 5th leg however is clearly a leg, despite being where you would expect the dogs member to be. I'll give it half credit for at least recognizing that there was something there.
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
Perception seems to be one of the main constraints on LLMs that not much progress has been made on. Perhaps not surprising, given perception is something evolution has worked on since the inception of life itself. Likely much, much more expensive computationally than it receives credit for.
I strongly suspect it's a tokenization problem. Text and symbols fit nicely in tokens, but having something like a single "dog leg" token is a tough problem to solve.
The neural network in the retina actually pre-processes visual information into something akin to "tokens". Basic shapes that are probably somewhat evolutionarily preserved. I wonder if we could somehow mimic those for tokenization purposes. Most likely there's someone out there already trying.
(Source: "The mind is flat" by Nick Chater)
It's also easy to spot: when you are tired you might misrecognize objects. I've caught myself doing this on long road trips.
I think in this case, tokenization and perception are somewhat analogous. I think it is probably the case that our current tokenization schemes are really simplistic compared to what nature is working with, if you allow the analogy.
Why should it have to be expensive computationally? How do brains do it with such a low amount of energy? I think matching the brain abilities of even a bug might be very hard, but that does not mean there isn't a way to do it with little computational power. It requires having the correct structures/models/algorithms or whatever the precise jargon is.
> How do brains do it with such a low amount of energy?
Physical analog chemical circuits whose physical structure directly is the network, and use chemistry/physics directly for the computations. For example, a sum is usually represented as the number of physical ions present within a space, not some ALU that takes in two binary numbers, each with some large number of bits, requiring shifting electrons to and from buckets, with a bunch of clocked logic operations.
There are a few companies working on more "direct" implementations of inference, like Etched AI [1] and IBM [2], for massive power savings.
[1] https://en.wikipedia.org/wiki/Etched_(company)
[2] https://spectrum.ieee.org/neuromorphic-computing-ibm-northpo...
This is the million dollar question. I'm not qualified to answer it, and I don't really think anyone out there has the answer yet.
My armchair take would be that watt usage probably isn't a good proxy for computational complexity in biological systems. A good piece of evidence for this is from the C. elegans research that has found that the configuration of ions within a neuron--not just the electrical charge on the membrane--record computationally-relevant information about a stimulus. There are probably many more hacks like this that allow the brain to handle enormous complexity without it showing up in our measurements of its power consumption.
My armchair is equally comfy, and I have an actual paper to point to:
Jaxley: Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics [1]
They basically created software to simulate real neurons and ran some realistic models to replicate typical AI learning tasks:
"The model had nine different channels in the apical and basal dendrite, the soma, and the axon [39], with a total of 19 free parameters, including maximal channel conductances and dynamics of the calcium pumps."
So yeah, real neurons are a bit more complex than ReLU or Sigmoid.
[1] https://www.biorxiv.org/content/10.1101/2024.08.21.608979v2....
Following the trend of discovering smaller and smaller phenomena that our brains use for processing, it would not be surprising if we eventually find that our brains are very nearly "room temperature" quantum computers.
ah interesting. I wonder if this is a "safety guardrails blindspot" due to the placement.
Op is right.
https://imgcdn.stablediffusionweb.com/2024/4/19/8e54925a-004...
For the above pic I asked "What is wrong with the image?"
Result: it totally missed the most obvious one - the six fingers.
Instead it said this:
Based on a close inspection of the image, here are the things that are "wrong" or anomalous, indicating that it is likely an AI-generated image:
- The Thumb Anatomy: This is the most obvious error. The digit in the thumb's position (far left) looks exactly like a long index finger. A natural human thumb has two segments (phalanges) and is shorter and thicker, opposing the other fingers. This digit appears to have three segments and is roughly the same length and shape as the other fingers, giving the hand the appearance of having five fingers and no thumb.
- Unnatural Skin Texture: The skin is unnaturally smooth, waxy, and free of normal human imperfections like distinct pores, fine wrinkles, or hair. It has a "plastic" or "airbrushed" look typical of generative AI.
- Palm Lines: The creases on the palm are vague and generic. They don't follow the distinct pattern of major flexion creases (like the heart, head, and life lines) found on a real human hand.
- Wrist Connection: The connection between the hand and the wrist looks slightly distorted, with tendons that appear too prominent and smooth for the pose.
[Edit: 3.0 is same as 2.5 - both answered almost identically]
JFYI, Qwen managed to recognize the sixth finger:
Max: https://chat.qwen.ai/c/ca671562-7a56-4e2f-911f-40c37ff3ed79
VL-235B: https://chat.qwen.ai/c/21cc5f4e-5972-4489-9787-421943335150
I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
The ARC puzzles in question: https://arcprize.org/arc-agi/2/
What I would do if I was in the position of a large company in this space is to arrange an internal team to create an ARC replica, covering very similar puzzles and use that as part of the training.
Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I think this is also fair to use any means to beat it.
I agree that for any given test, you could build a specific pipeline to optimize for that test. I suppose that's why it is helpful to have many tests.
However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.
The real strength of current neural nets/transformers relies on huge datasets.
ARC does not provide this kind of dataset, only a small public one and a private one where they do the benchmarks.
Building your own large private ARC set does not seem too difficult if you have enough resources.
Comment was deleted :(
Doesn't even matter at this point.
We have a global RL pipeline on our hands.
If there is something new an LLM/AI model can't solve today, plenty of humans can't either.
But tomorrow every LLM/AI model can solve it, and again plenty of humans still can't.
Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline becomes faster and easier to train with new scenarios, it will start to bleed out the humans in the loop.
That's ok; just start publishing your real problems to solve as "AI benchmarks" and then it'll work in ~6 months.
Is "good at benchmarks instead of real world tasks" really something to optimize for? What does this achieve? Surely people would be initially impressed, try it out, be underwhelmed and then move on. That's not great for Google
If they're memory/reference constrained systems that can't directly "store" every solution, then doing well on benchmarks should result in better real world/reasoning performance, since lack of memorized answer requires understanding.
Like with humans [1], generalized reasoning ability lets you skip the direct storage of that solution, and many many others, completely! You can just synthesize a solution when a problem is presented.
Benchmarks are intended as proxy for real usage, and they are often useful to incrementally improve a system, especially when the end-goal is not well-defined.
The trick is to not put more value in the score than what it is.
Initial impressions are currently worth a lot. In the long run I think the moat will dissolve, but currently its a race to lock-in users to your model and make switching costs high.
Comment was deleted :(
> internal team to create an ARC replica, covering very similar puzzles
they can target benchmark directly, not just replica. If google or OAI are bad actors, they already have benchmark data from previous runs.
The 'private' set is just a pinkie promise not to store logs or not to use the logs when the evaluator uses the API to run the test, so yeah. It's trivially exploitable.
Not only do you have the financial self-interest to do it (helps with capital raising to be #1), but you are worried that your competitors are doing it, so you may as well cheat to make things fair. Easy to do and easy to justify.
Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
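A sketch of that idea in Python - sprinkle random, irrelevant junk around the real question and average correctness over repeated runs (run_model() and the distractor strings are hypothetical placeholders):

```python
import random

DISTRACTORS = [
    "Note: the office coffee machine is broken again.",
    "Unrelated fact: the 1998 World Cup was held in France.",
    "Reminder: reply in plain text.",
]

def perturb(question: str, rng: random.Random) -> str:
    """Wrap the real question in a couple of randomly chosen red herrings."""
    junk = rng.sample(DISTRACTORS, k=2)
    return f"{junk[0]}\n{question}\n{junk[1]}"

def run_model(prompt: str) -> str:
    """Hypothetical placeholder: call the model under test."""
    raise NotImplementedError

def is_correct(answer: str, expected: str) -> bool:
    """Exact match here; could be a regex or a judge model instead."""
    return answer.strip() == expected.strip()

def score(question: str, expected: str, runs: int = 20, seed: int = 0) -> float:
    rng = random.Random(seed)
    hits = sum(
        is_correct(run_model(perturb(question, rng)), expected)
        for _ in range(runs)
    )
    return hits / runs
```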
they have two sets:
- semi-private, which they use to test proprietary models and which could be leaked
- private: used to test downloadable open source models.
The ARC-AGI prize itself is for open source models.
My point is that it does not matter if the set is private or not.
If you want to train your model you'd need more data than the private set anyway. So you have to build a very large training set on your own, using the same kind of puzzles.
It is not that hard, really, just tedious.
Yes, you can build your own dataset of n puzzles, but it has still been really hard for any system to achieve good scores - it even beats specialized systems built for just this one task, and these puzzles shouldn't really be possible to simply memorize, given the number of variations that can be created.
Humans study for tests. They just tend to forget.
Agreed, it also leads performance on arc-agi-1. Here's the leaderboard where you can toggle between arc-agi-1 and 2: https://arcprize.org/leaderboard
It leads on arc-agi-1 with Gemini 3.0 Deep Think, which uses "tool calls" according to google's post, whereas regular Gemini 3.0 Pro doesn't use "tool calls" for the same benchmark. I am unsure how significant this difference is.
This comment was moved from another thread. The original thread included a benchmark chart with ARC performance: https://blog.google/products/gemini/gemini-3/#gemini-3
There's a good chance Gemini 3 was trained on ARC-AGI problems, unless they state otherwise.
It's almost certain that it was, but the point of this puzzle benchmark is that it shouldn't really be possible to simply memorize it, given the number of variations that can be created and the other criteria detailed in it.
ARC-AGI has a hidden private test suite, right ? No model will have access to that set.
I doubt they have offline access to the model, i.e. the prompts are sent to the model provider.
Even if the prompts are technically leaked to the provider, how would they be identified as something worth optimizing for out of the millions of other prompts received?
That looks great, but what we all care about is how it translates to real-world problems like programming, where it isn't really excelling by 2x.
Just generated a bunch of 3D CAD models using Gemini 3.0 to see how it compares in spatial understanding and it's heaps better than anything currently out there - not only intelligence but also speed.
Will run extended benchmarks later, let me know if you want to see actual data.
Just hand-sketched what a 5-year-old would draw on paper - the house, trees, sun. And asked it to generate a 3D model with three.js.
Results are amazing! 2.5 and 3 seem way, way ahead.
Based on my benchmarks (I've run 100s of model generations):
2.5 stands between GPT-5 and GPT-5.1, where GPT-5 is the best of the 3.
In preliminary evals Gemini 3 seems to be way better than all, but I will know when I run extended benchmarks tonight.
I'm interested in seeing the data.
Is observed speed meaningful for a model preview? Isn’t it likely to go down once usage goes up?
I'm not familiar enough with CAD - what type of format is it?
It’s not a format, but in my mind it implies designs that are supposed to be functional as opposed to models that are meant for virtual games.
It generated a blender script that makes the model.
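For context, a minimal hand-written sketch of the kind of bpy script that approach produces for a house/tree/sun scene (this is illustrative, not the model's actual output):

```python
import bpy

# Start from an empty scene.
bpy.ops.object.select_all(action="SELECT")
bpy.ops.object.delete()

# House: cube body plus a cone roof.
bpy.ops.mesh.primitive_cube_add(size=2, location=(0, 0, 1))
bpy.ops.mesh.primitive_cone_add(radius1=1.6, depth=1.2, location=(0, 0, 2.6))

# Tree: cylinder trunk plus a sphere canopy.
bpy.ops.mesh.primitive_cylinder_add(radius=0.2, depth=1.5, location=(3, 0, 0.75))
bpy.ops.mesh.primitive_uv_sphere_add(radius=0.9, location=(3, 0, 2))

# Sun: a sphere off to one side, plus an actual sun lamp so renders aren't dark.
bpy.ops.mesh.primitive_uv_sphere_add(radius=0.5, location=(-4, 0, 5))
bpy.ops.object.light_add(type="SUN", location=(-4, 0, 6))
```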
I would have used OpenSCAD for that purpose.
I started with a lighter weight solution (JSCAD) first and quickly hit the limitations. So I wanted to explore the other side of it - fully complex over the top software (blender).
I guess openscad would be a sweet spot in the middle. Good shout, might experiment.
Blender is not CAD. Edit: I'm not nit-picking. Totally different data structures and internal representations.
Computer-aided design. Three.js can be CAD. But I agree it's not meant for CAD, even though you can do it.
Three.js is not CAD. It is an API for drawing 3D graphics in a browser. 3D graphics, in general, is not CAD. Blender is not CAD. You cannot do CAD operations in blender.
I'm not being nit picky here. I think there are issues beyond terminology that you may not be familiar with, as it is clearly not your field. That's ok.
The "design" in computer aided design is engineering design. This is not the same definition of "design" used in, say, graphic design. Something is not called CAD because it helps you create an image that looks like a product on a computer. It is CAD because it creates engineering design files (blueprints) that can be used for the physical manufacture of a device. This places very tight and important constraints on the methods used, and capabilities supported.
Blender is a sculpting program. Its job is to create geometry that can be fed into a rendering program to make pretty pictures. Parasolid is a CAD geometry kernel at the core of many CAD programs, which has the job of producing manufacturable blueprints. The operations supported map to physical manufacturing steps - milling, lathe, and drill operations. The modeling steps use constraints in order to make sure, e.g., that screw holes line up. Blender doesn't support any of that.
To an engineer, saying that an LLM gave you a blender script for a CAD operation is causing all sorts of alarm klaxons to go off.
Next they'll be doing PCB CAD in Photoshop...
> Blender doesn't support any of that.
... without plugins. https://www.cadsketcher.com/
The "-like" in CAD-like is doing a lot of heavy lifting there.
Did your prompt instruct it to use blender?
Yes. I’ve been working and refining the prompt for some time now (months). It’s about 10k tokens now.
Would you mind sharing the prompt please?
When I see CAD, I always think of Casting Assistant Device.
Zero magic in this world, sorry.
I have "unlimited" access to both Gemini 2.5 Pro and Claude 4.5 Sonnet through work.
From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of code so over engineered, that totally works, but I would never want to have to interact with.
When looking at the code, you can't tell why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline, it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feeling to it.
I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard
I have the same experience with Gemini, that it’s incredibly accurate but puts in defensive code and error handling to a fault. It’s pretty easy to just tell it “go easy on the defensive code” / “give me the punchy version” and it cleans it up
Yes, the defensive code is something most models seem to struggle with - even Claude 4.5 Sonnet, even after explicitly prompting it not to, still adds pointless null checks and fallbacks in scripting languages where something being null won't cause any problems apart from an error being logged. I get this particularly when writing AngelScript for Unreal. This isn't surprising, since as a niche language there's a lack of training data, and the syntax is very similar to Unreal C++, which does crash to desktop when accessing a null reference.
> but I would never want to have to interact with
That is its job security ;)

I can relate to this, it's doing exactly what I want, but it ain't pretty.
It's fine though if you take the time to learn what it's doing and write a nicer version of it yourself
I have had a similar experience vibe coding with Copilot (ChatGPT) in VSCode, against the Gemini API. I wanted to create a dad joke generator and then have it also create a comic styled 4 cel interpretation of the joke. Simple, right? I was able to easily get it to create the joke, but it repeatedly failed on the API call for the image generation. What started as perhaps 100 lines of total code in two files ended up being about 1500 LOC with an enormous built-in self-testing mechanism ... and it still didn't work.
I was sorting out the right way to handle a medical thing and Gemini 2.5 Pro was part of the way there, but it lacked some necessary information. Got the Gemini 3.0 release notification a few hours after I was looking into that, so I tried the same exact prompt and it nailed it. Great, useful, actionable information that surfaced actual issues to look out for and resolved some confusion. Helped work through the logic, norms, studies, standards, federal approvals and practices.
Very good. Nice work! These things will definitely change lives.
A nice Easter egg in the Gemini 3 docs [1]:
If you are transferring a conversation trace from another model, ... to bypass strict validation in these specific scenarios, populate the field with this specific dummy string:
"thoughtSignature": "context_engineering_is_the_way_to_go"
[1] https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high...

It's an artifact of the problem that they don't show you the reasoning output but need it for further messages, so they save each API conversation on their side and give you a reference number. It sucks from a GDPR compliance perspective as well as in terms of transparent pricing, as you have no way to control reasoning trace length (which is billed at the much higher output rate) other than switching between low/high - but if the model decides to think longer, "low" could result in more tokens used than "high" on a prompt where the model decides not to think that much. "Thinking budgets" are now "legacy", and thus while you can constrain output length, you cannot constrain cost.

Obviously you also cannot optimize your prompts if some red herring makes the LLM get hung up on something irrelevant only to realize this in later thinking steps. This will happen with EVERY SINGLE prompt if it's caused by something in your system prompt. Finding what makes the model go astray can be rather difficult with 15k-token system prompts or a multitude of MCP tools; you're basically blinded while trying to optimize a black box. Obviously you can try different variations of different parts of your system prompt or tool descriptions, but just because they result in fewer thinking tokens does not mean they are better. If those reasoning steps were actually beneficial (if only in edge cases), this would be immediately apparent upon inspection, but it is hard/impossible to find out without access to the full chain of thought.

For the uninitiated, the reasons OpenAI started replacing the CoT with summaries were A. to prevent rapid distillation, which they suspected DeepSeek had used for R1, and B. to prevent embarrassment if app users see the CoT and find parts of it objectionable/irrelevant/absurd (reasoning steps that make sense for an LLM do not necessarily look like human reasoning). That's a tradeoff that is great for end-users but terrible for developers. As open-weights LLMs necessarily output their full reasoning traces, the potential to optimize prompts for specific tasks is much greater, and for certain applications that will certainly outweigh the performance delta to Google/OpenAI.
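For illustration, a rough sketch of where that dummy signature might sit when replaying a trace from another model; only the field name and dummy value come from the docs quoted above, and the surrounding payload shape is an assumption:

```python
# Assumed/illustrative payload shape; only "thoughtSignature" and its dummy
# value are taken from the docs linked above - check the API reference.
replayed_model_turn = {
    "role": "model",
    "parts": [
        {
            "text": "Here is the plan we agreed on earlier...",
            # Dummy value said to bypass strict signature validation when the
            # trace came from a different model.
            "thoughtSignature": "context_engineering_is_the_way_to_go",
        }
    ],
}

contents = [
    replayed_model_turn,
    {"role": "user", "parts": [{"text": "Continue from that plan."}]},
]
# ...send `contents` with whatever Gemini client you normally use.
```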
I was under the impression that those reasoning outputs that you get back aren't references but simply raw CoT strings that are encrypted.
Feels like the same consolidation cycle we saw with mobile apps and browsers is playing out here. The winners aren't necessarily those with the best models, but those who already control the surface where people live their digital lives.
Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).
AI Overviews has arguably done more harm than good for them, because people assume it's Gemini, but really it's some ultra-lightweight model made for handling millions of queries a minute, and it has no shortage of stupid mistakes/hallucinations.
> Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
One of them isn't the same as the others (hint: it's Apple). The only thing Apple is doing with Maps is adding ads: https://www.macrumors.com/2025/10/26/apple-moving-ahead-with...
Microsoft hasn't been very quiet about it, at least in my experience. Every time I boot up Windows I get some kind of blurb about an AI feature.
Man, remember the days where we'd lose our minds at our operating systems doing stuff like that?
The people who lost their minds jumped ship. And I'm not going to work at a company that makes me use it, either. So, not my problem.
Gemini genuinely has an edge over the others in its super-long context size, though. There are some tasks where this is the deal breaker, and others where you can get by with a smaller size, but the results just aren't as good.
> The winners aren’t necessarily those with the best models
Is there evidence that's true? That the other models are significantly better than the ones you named?
Has anyone who is a regular Opus / GPT5-Codex-High / GPT5 Pro user given this model a workout? Each Google release is accompanied by a lot of devrel marketing that sounds impressive but whenever I put the hours into eval myself it comes up lacking. Would love to hear that it replaces another frontier model for someone who is not already bought into the Gemini ecosystem.
At this point I'm only using google models via Vertex AI for my apps. They have a weird QoS rate limit but in general Gemini has been consistently top tier for everything I've thrown at it.
Anecdotal, but I've also not experienced any regression in Gemini quality where Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often.
Matches my experience exactly. It's not the best at writing code but Gemini 2.5 Pro is (was) the hands-down winner in every other use case I have.
This was hard for me to accept initially as I've learned to be anti-Google over the years, but the better accuracy was too good to pass up on. Still expecting a rugpull eventually — price hike, killing features without warning, changing internal details that break everything — but it hasn't happened yet.
Yes. I am. It is spectacular in raw cognitive horsepower. Smarter than gpt5-codex-high but Gemini CLI is still buggy as hell. But yes, 3 has been a game changer for me today on hardcore Rust, CUDA and Math projects. Unbelievable what they’ve accomplished.
I gave it a spin with instructions that worked great with gpt-5-codex (5.1 regressed a lot so I do not even compare to it).
Code quality was fine for my very limited tests but I was disappointed with instruction following.
I tried a few tricks, but I wasn't able to convince it to first present a plan before starting implementation.
I have instructions saying that it should first do exploration (where it tries to discover what I want), then plan the implementation, and then code, but it always jumps directly to code.
This is a big issue for me, especially because gemini-cli lacks a plan mode like Claude Code has.
For Codex, those instructions make plan mode redundant.
just say "don't code yet" at the end. I never use plan mode because plan mode is just a prompt anyways.
I've been working with it, and so far it's been very impressive. Better than Opus in my feels, but I have to test more, it's super early days
What I usually test is trying to get them to build a full, scalable SaaS application from scratch... It seemed very impressive in how it did the early code organization using Antigravity, but then at some point, all of a sudden, it started really getting stuck and constantly stopped producing, and I had to trigger continue or babysit it. I don't know if I could've been doing something better, but that was just my experience. Seemed impressive at first, but otherwise, at least vs. Antigravity, Codex and Claude Code scale more reliably.
Just an early anecdote from trying to build that one SaaS application, though.
API pricing is up to $2/M for input and $12/M for output
For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output; Gemini 1.5 Pro was $1.25/M for input and $5/M for output.
Still cheaper than Sonnet 4.5: $3/M for input and $15/M for output.
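Back-of-the-envelope with the numbers above (the request size here is made up):

```python
# Hypothetical request: 50k input tokens, 5k output tokens.
input_tokens, output_tokens = 50_000, 5_000

gemini_3_pro = input_tokens / 1e6 * 2.00 + output_tokens / 1e6 * 12.00
sonnet_4_5 = input_tokens / 1e6 * 3.00 + output_tokens / 1e6 * 15.00

print(f"Gemini 3 Pro: ${gemini_3_pro:.3f}")  # $0.160
print(f"Sonnet 4.5:   ${sonnet_4_5:.3f}")    # $0.225
```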
It is so impressive that Anthropic has been able to maintain this pricing still.
Claude is just so good. Every time I try moving to ChatGPT or Gemini, they end up making concerning decisions. Trust is earned, and Claude has earned a lot of trust from me.
Honestly, Google models have this mix of smart/dumb that is scary. Like, if the universe is turned into paperclips, it'll probably be a Google model that does it.
Well, it depends. Just recently I had Opus 4.1 spend 1.5 hours looking at 600+ sources while doing deep research, only to get back to me with a report consisting of a single sentence: "Full text as above - the comprehensive summary I wrote". Anthropic acknowledged that it was a problem on their side but refused to do anything to make it right, even though all I asked them to do was to adjust the counter so that this attempt doesn't count against their incredibly low limit.
Idk Anthropic has the least consistent models out there imho.
Because every time I try to move away I realize there’s nothing equivalent to move to.
People insist upon Codex, but it takes ages and has an absolutely hideous lack of taste.
It creates beautiful websites though.
Taste in what?
Wines!
It's interesting that grounding with search cost changed from
* 1,500 RPD (free), then $35 / 1,000 grounded prompts
to
* 1,500 RPD (free), then (Coming soon) $14 / 1,000 search queries
It looks like the pricing changed from per-prompt (previous models) to per-search (Gemini 3)
With this kind of pricing I wonder if it'll be available in Gemini CLI for free or if it'll stay at 2.5.
There's a waitlist for using Gemini 3 for Gemini CLI free users: https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...
In case anyone wants to confirm if this link is official, it is.
https://goo.gle/enable-preview-features
-> https://github.com/google-gemini/gemini-cli/blob/release/v0....
--> https://goo.gle/geminicli-waitlist-signup
---> https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...
Thrilled to see the cost is competitive with Anthropic.
[flagged]
I assume the model is just more expensive to run.
Likely. The point is we would never know.
Comment was deleted :(
I just gave it a short description of a small game I had an idea for. It was 7 sentences. It pretty much nailed a working prototype, using React, clean CSS, TypeScript and state management. It even implemented a Gemini query using the API for strategic analysis given a game state. I'm more than impressed, I'm terrified. Seriously thinking of a career change.
I find it funny to find this almost exact same post in every new model release thread. Yet here we are - spending the same amount of time, if not more, finishing the rest of the owl.
Comment was deleted :(
I just spent 12 hours a day vibe coding for a month and a half with Claude (which has SWE benchmark scores equal to Gemini 3's). I started out terrified, but eventually I realized that these are just remarkably far away from actually replacing a real software engineer. For prototypes they're amazing, but when you're just straight vibe coding you get stuck in a hell where you don't want to, or can't efficiently, really check what's going on under the hood, and it's not really doing the thing you want.
Basically these tools can get you to a 100k LOC project without much effort, but it's not going to be a serious product. A serious product still requires understanding.
Can you share the code?
To what?
VC (vibe coding).
I have my own private benchmarks for reasoning capabilities on complex problems and I test them against SOTA models regularly (professional cases from law and medicine). Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro models) get halfway decent results on many cases, while Gemini Pro 2.5 struggled (it was overconfident in its initial assumptions). So I ran these benchmarks against Gemini 3 Pro and I'm not impressed. The reasoning is way more nuanced than their older model, but it still makes mistakes which the other two SOTA competitor models don't make. Like, it forgets in a law benchmark that those principles don't apply in the country from the provided case. It seems very US-centric in its thinking, whereas Anthropic and OpenAI pro models seem to be more aware of the assumed cultural context of the case. All in all, I don't think this new model is ahead of the other two main competitors - but it has a new nuanced touch and is certainly way better than Gemini 2.5 Pro (which says more about how bad that one actually was for complex problems).
> It seems very US centric in its thinking
I'm not surprised. I'm French and one thing I've consistently seen with Gemini is that it loves to use Title Case (Everything is Capitalized Except the Prepositions) even in French or other languages where there is no such thing. A 100% american thing getting applied to other languages by the sheer power of statistical correlation (and probably being overtrained on USA-centric data). At the very least it makes it easy to tell when someone is just copypasting LLM output into some other website.
I've been so happy to see Google wake up.
Many can point to a long history of killed products and soured opinions, but you can't deny they've been the great balancing force (often for good) in the industry.
- Gmail vs Outlook
- Drive vs Word
- Android vs iOS
- Worklife balance and high pay vs the low salary grind of before.
They've done heaps for the industry. I'm glad to see signs of life, particularly in their P/E, which was unjustly low for a while.
Ironically, OpenAI was conceived as a way to balance Google's dominance in AI.
Balance is too weak of a word. OpenAI was conceived specifically to prevent Google from getting AGI first. That was its original goal. At the time of its founding, Google was the undisputed leader of AI anywhere in the world. Musk was then very worried about AGI being developed behind closed doors, particularly Google's, which was why he was the driving force behind the founding of OpenAI.
The book Empire of AI describes him as being particularly fixated on Demis as some kind of evil genius. From the book, early OAI employees couldn’t take the entire thing too seriously and just focused on the work.
> Musk was then very worried about AGI being developed behind closed doors
*closed doors that aren't his
I thought it was a workaround to Google's complete disinterest in productizing the AI research it was doing and publishing, rather than a way to balance their dominance in a market which didn't meaningfully exist.
That’s how it turned out, but IIRC at the time of OpenAI’s founding, “AI” was search and RL which Google and deep mind were dominating, and self driving, which Waymo was leading. And OpenAI was conceptualized as a research org to compete. A lot has changed and OpenAI has been good at seeing around those corners.
That was actually Character.ai's founding story. Two researchers at Google that were frustrated by a lack of resources and the inability to launch an LLM based chatbot. The founders are now back at Google. OpenAI was founded based on fears that Google would completely own AI in the future.
I think that Google didn't see the business case in that generation of models, and also saw significant safety concerns. If AI had been delayed by... 5 years... would the world really be a worse place?
Yes - less exciting! But worse?
Elon Musk specifically gave OAI $150M early on because of the risk of Google being the only Corp that has AGI or super-intelligence. These emails were part of the record in the lawsuit.
Pffft. OpenAI was conceived to be Open, too.
It’s a common pattern for upstarts to embrace openness as a way to differentiate and gain a foothold then become progressively less open once they get bigger. Android is a great example.
Last I checked, Android is still open source (as AOSP) and people can do whatever-the-f-they-want with the source code. Are we defining open differently?
I think we're defining "less" differently. You're interpreting "less open" to mean "not open at all," which is not what I said.
There's a long history of Google slowly making the experience worse if you want to take advantage of the things that make Android open.
For example, by moving features that were in the AOSP into their proprietary Play Services instead [1].
Or coming soon, preventing sideloading of unverified apps if you're using a Google build of Android [2].
In both cases, it's forcing you to accept tradeoffs between functionality and openness that you didn't have to accept before. You can still use AOSP, but it's a second class experience.
[1] https://arstechnica.com/gadgets/2018/07/googles-iron-grip-on...
[2] https://arstechnica.com/gadgets/2025/08/google-will-block-si...
Core is open source but for a device to be "Android compatible" and access the Google Play Store and other Google services, it must meet specific requirements from Google's Android Compatibility Program. These additional proprietary components are what make the final product closed source.
The Android Open Source Project is not Android.
> The Android Open Source Project is not Android.
Was "Android" the way you define it ever open? Isnt it similar to chromium vs chrome? chromium is the core, and chrome is the product built on top of it - which is what allows Comet, Atlas, Brave to be built on.
That's the same thing GrapheneOS, /e/ OS and others are doing - building on top of AOSP.
> Was "Android" the way you define it ever open?
Yes. Initially all the core OS components were OSS.
"open" and requiring closed blobs doesn't mean it's "open source".
It's like saying Nvidia's drivers are "open source" as there is a repository there but has only binaries in the folders.
They've poisoned the internet with their monopoly on advertising, the air pollution of the online world, which is a transgression that far outweighs any good they might have done. Much of the negative social effects of being online come from the need to drive more screen time, more engagement, more clicks, and more ad impressions firehosed into the faces of users for sweet, sweet advertiser money. When Google finally defeats ad-blocking, yt-dlp, etc., remember this.
This is an understandable, but simplistic way of looking at the world. Are you also gonna blame Apple for mining rare earths, because they made a successful product that requires exotic materials which need to be mined from the earth? How about the hundreds of thousands of factory workers being subjected to inhumane conditions to assemble iPhones each year?
For every "OMG, the internet is filled with ads", people are conveniently forgetting the real-world impact of ALL COMPANIES (and not just Apple), btw. You should be upset with the system, not selectively at Google.
> How about hundreds of thousands of factory workers that are being subjected to inhumane conditions to assemble iPhones each year?
That would be bad if it happened, which is why it doesn't happen. Working in a factory isn't an inhumane condition.
I don't think your comment justifies calling out any form of simplistic view. It doesn't make sense. All the big players are bad. They're companies; their one and only purpose is to make money, and they will do whatever it takes to do it. Most of which does not serve humankind.
Compared to what?
It seems okay to me to be upset with the system and also point out the specific wrongs of companies in the right context. I actually think that's probably most effective. The person above specifically singled out Google as a reply to a comment praising the company, which seems reasonable enough. I guess you could get into whether it's a proportional response; the praise wasn't that high and also exists within the context of the system as you point out. Still, their reply doesn't necessarily indicate that they're not upset with all companies or the system.
Yes, we're absolutely holding Apple accountable for outsourcing jobs, degrading the US markets, using slave and child labor, laundering cobalt from illegal "artisanal" mines in the DRC, and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.
I also hold Americans and Western consumers responsible for simply allowing that to happen. As long as the human rights abuses and corruption are 3 or 4 degrees of separation from the retailer, people seem to be perfectly OK with chattel slavery and child labor and indentured servitude and all the human suffering that sits at the base of all our wonderful technology and cheap consumer goods.
If we want to have things like minimum wage and workers rights and environmental protections, then we should mandate adherence to those standards globally. If you want to sell products in the US, the entire supply chain has to conform to US labor and manufacturing and environmental standards. If those standards aren't practical, then they should be tossed out - the US shouldn't be doing performative virtue signalling as law, incentivizing companies to outsource and engage in race to the bottom exploitation of labor and resources in other countries. We should also have tariffs and import/export taxes that allow competitive free trade. It's insane that it's cheaper to ship raw materials for a car to a country in southeast asia, have it refined and manufactured into a car, and then shipped back into the US, than to simply have it mined, refined, and manufactured locally.
The ethics and economics of America are fucking dumb, but it's the mega-corps, donor class, and uniparty establishment politicians that keep it that way.
Apple and Google are inhuman, autonomous entities that have effectively escaped the control and direction of any given human decision tree. Any CEO or person in power that tried to significantly reform the ethics or economics internally would be ousted and memory-holed faster than you can light a cigar with a hundred dollar bill. We need term limits, no more corporation people, money out of politics, and an overhaul, or we're going to be doing the same old kabuki show right up until the collapse or AI takeover.
And yeah, you can single out Google for their misdeeds. They, in particular, are responsible for the adtech surveillance ecosystem and lack of any viable alternatives by way of their constant campaign of enshittification of everything, quashing competition, and giving NGOs, intelligence agencies, and government departments access to the controls of censorship and suppression of political opposition.
I haven't and won't use Google AI for anything, ever, because of all the big labs, they are the most likely and best positioned to engage in the worst and most damaging abuse possible, be it manipulation, invasion of privacy, or casual violation of civil rights at the behest of bureaucratic tyrants.
If it's not illegal, they'll do it. If it's illegal, they'll only do it if it doesn't cost more than they can profit. If they profit, even after getting caught and fined and taking a PR hit, they'll do it, because "number go up" is the only meaningful metric.
The only way out is principled regulation, a digital bill of rights, and campaign finance reform. There's probably no way out.
> laundering cobalt from illegal "artisanal" mines in the DRC
They don't, all cobalt in Apple products is recycled.
> and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.
They don't, Apple audits their entire supply chain so it wouldn't hide anything if something moved to another subcontractor.
One can claim 100% recycled cobalt under the mass balance system even if recycled and non-recycled cobalt was mixed, as long as the total amount used in production is less than or equal to the recycled cobalt purchased on the books. At least here[0] they claim their recycled cobalt references are under the mass balance system.
0. https://www.apple.com/newsroom/2023/04/apple-will-use-100-pe...
Where is the fairy godmother's magic wand that will allow you to make all the governments of the world instantly agree to all of this?
People love getting their content for free and that's what Google does.
Even 25 years ago, people wouldn't have believed YouTube could exist. Anyone can upload whatever they want, however often they want, YouTube will be responsible for promoting it, they'll serve it to however many billions of users want to view it, and they'll pay you 55% of the revenue it makes?
Yep, it's hard to believe it exists for free and with not a lot of ads when you have a good ad blocker... though the content creators' ads are inescapable, which I think is OK since they're making a little money in exchange for what, your little inconvenience for 1 minute or so - if you're not skipping the ad, which you aren't, right?? - after which you can watch some really good content. The history channels on YT are amazing, maybe world changing - they get people to learn history and actually enjoy it. Same with some math channels like 3Blue1Brown, which are just outstanding, and many more.
> People love getting their content for free and that's what Google does.
They are forcing a payment method on us. It's basically like they have their hand in our pockets.
Yes, this is correct, and it happens everywhere. App Store, Play Store, YouTube, Meta, X, Amazon and even Uber - they all play in two-sided markets exploiting both its users and providers at the same time.
They're not a moral entity. Corporations aren't people.
I think a lot of the harms you mentioned are real, but they're a natural consequence of capitalistic profit chasing. Governments are supposed to regulate monopolies and anti-consumer behavior like that. Instead of regulating surveillance capitalism, governments are using it to bypass laws restricting their power.
If I were a Google investor, I would absolutely want them to defeat ad-blocking, ban yt-dlp, dominate the ad market and all the rest of what you said. In capitalism, everyone looks out for their own interests, and governments ensure the public isn't harmed in the process. But any time a government tries to regulate things, the same crowd that decries this opposes it as government overreach.
Voters are people and they are moral entities, direct any moral outrage at us.
Why should the collective of voters be any more of a moral entity than the collective of people who make up a corporation (which you may include its shareholders in if you want)?
It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
> Why should the collective of voters..
They're accountable as individuals, not as a collective. And as it happens, they are responsible for their government in a democracy, but corporations aren't responsible for running countries.
> It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
In the free speech sense, sure. But your criticism isn't founded on solid ground. You should expect corporations to do whatever they have to do within the bounds of the law to turn a profit. Their responsibility is to their investors and employees, they have no responsibility to the general public beyond that which is laid out in the law.
The increasing demand in corporations being part of the public/social moral consciousness is causing them to manipulate politics more and more, eroding what little voice the individuals have.
You're trying to live in a feudal society when you treat corporations like this.
If you're unhappy with the quality of Google's services, don't do business with them. If they broke the law, they should pay for it. But expecting them to be a beacon of morality is accepting that they have a role in society and government beyond mere revenue generating machines. And if you expect them to have that role, then you're also giving them the right to enforce that expectation as a matter of corporate policy instead of law. Corporate policies then become as powerful as law, and corporations have to interfere with matters of government policy on the basis of morality instead of business, so you now have an organization with lots of money and resources competing with individual voters.
And then people have the nerve to complain about PACs, money in politics, billionaires influencing the government, bribery, etc. You can't have it both ways. Either we have a country run partly by corporations, and a society driven and controlled by them, or we don't.
When we criticize corporations, we really are criticizing the people who make the decisions in the corporations. I don’t see why we shouldn’t apply exactly the same moral standards to people’s decision in the context of a corporation as we do to people’s decisions made in any other context. You talk about lawfulness, but we wouldn’t talk about morals if we meant lawfulness. It’s also lawful to vote for the hyper-capitalist party, so by the same token moral outrage shouldn’t be directed towards the voters.
I get that, but those CEOs are not elected officials, they don't represent us and have no part in the discourse of law making (despite the state of things). In their capacity as executives of a company, they have no rights, no say in what we find acceptable or not in society. We tell them what they can and cannot do, or else. That's the social contract we have with companies and their executives.
Being in charge of a corporation shouldn't elevate someone to a platform where they have a louder voice than the common man. They can vote just as equally as others at the voting booth. They can participate in their capacity as individuals in politics. But neither money nor corporate influence has a place in the governance of a democratic society.
I talk about lawfulness because that is the only rule of law a corporation can and should be expected to follow. Morals are for individuals. Corporations have no morals. They are neither moral nor immoral. Their owners have morals, and you can criticize their greed, but that is a construct of capitalism. They're supposed to enrich themselves. You can criticize them for valuing money over morals, but that's like criticizing the ocean for being wet or the sun for being too hot. It's what they do. It's their role in society.
If a small business owner raises prices to increase revenue, that isn't immoral, right? Even though poor people who frequent them will be adversely affected? Amp that up to the scale of a megacorp, and the morality is still the same.
Corporations are entities that exist for the sole purpose of generating revenue for their owners. So when you criticize Google, you're criticizing a logical organization designed to do the thing you're criticizing it of doing. The CEO of google is acting in his official capacity, doing the job they were hired to do when they are resisting adblocking. The investors of Google are risking their money in anticipation of ROI, so their expectation from Google is valid as well.
When you find something to be immoral, the only meaningful avenue of expressing that with corporations is the law. You're criticizing Google as if it were an elected official we could vote in or out of office, or as if it were an entity that can be convinced of its moral failings.
When we don't speak up and use our voice, we lose it.
Because of the inherent structure of capitalism that leads to the inevitable: the tragedy of the commons.
Why are you directing the statement that "[Corporations are] not a moral entity" at me instead of the parent poster claiming that "[Google has] been the great balancing force (often for good) in the industry."? Saying that Google is a force "for good" is a claim by them that corporations can be moral entities; I agree with you that they aren't.
I could have, just the same, I suppose, but their comment was about Google being a balancing force in terms of competition and monopoly. It wasn't praise of their moral character. They did what was best for their business, and that turns out to be good for reducing monopolies. If it turned out to be monopolistic, I would be wondering what Congress and the DOJ are doing about it, instead of criticizing Google for trying to turn a profit.
> They've poisoned the internet
And what of the people that ravenously support ads and ad-supported content, instead of paying?
What of the consumptive public? Are they not responsible for their choices?
I do not consume algorithmic content, I do not have any social media (unless you count HN for either).
You can't have it both ways. Lead by example, stop using the poison and find friends that aren't addicted. Build an offline community.
I don't understand your logic, it seems like victim blaming. Using the internet and pointing out that targeted advertising has a negative effect on society is not "having it both ways".
Also, HN is by definition algorithmic content and social media, in your mind what do you think it is?
You are not a "victim" for using or purchasing something which is completely unnecessary. Or if that's the case, then you have no agency and have to be medicinally declared unfit to govern yourself and be appointed a legal guardian to control your affairs.
What kind of world do you live in? Actually, Google ads tend to have some of the highest ROI for the advertiser and are the most likely to be beneficial for the user, versus the pure junk ads that aren't personalized, and banner ads that have zero relationship to me. Google Ads is the enabler of the free internet. I for one am thankful to them. Otherwise you end up paying for NYT, Washington Post, The Information, etc. -- virtually any high quality web site (including Search).
Ads. Beneficial to the user.
Most of the time, you need to pick one. Modern advertising is not based on finding the item with the most utility for the user - which means ads are aimed at manipulating the user's behaviour in one way or another.
From suppressing wages to colluding with Apple not to poach.
Outlook is much better than Gmail and so is the office suite.
It's good there's competition in the space though.
Outlook is not better in ways that email or gmail users necessarily care about, and in my experience gets in the way more than it helps with productivity or anything it tries to be good at. I've used it in office settings because it's the default, but never in my life have I considered using it by choice. If it's better, it might not matter.
I couldn't disagree more
> Drive vs Word
You mean Drive vs OneDrive or, maybe Docs vs Word?
Workspace vs Office
Surely they meant Writely vs Word
- Making money vs general computing
For what it's worth, most of those examples are acquisitions. That's not a hit against Google in particular. That's the way all big tech co's grow. But it's not necessarily representative of "innovation."
>most of those examples are acquisitions
Taking those products from where they were to the juggernauts they are today was not guaranteed to succeed, nor was it easy. And yes, plenty of innovation happened with these products post-acquisition.
But there's also plenty that fail, it's just that you won't know about those.
I don't think what you're saying proves that the companies that were acquired couldn't have done that themselves.
If you consider surveillance capitalism and dark pattern nudges a good thing, then sure. Gemini has the potential to obliterate their current business model completely so I wouldn't consider that "waking up".
Forgot to mention absolutely milking every ounce of their users' attention with YouTube, plus forcing Shorts!
Why stop at YouTube? Blame Apple for creating an addictive gadget that has single-handedly wasted billions of hours of collective human intelligence. Life was so much better before iPhones.
But I hear you say - you can use iPhones for productive things and not just mindless brainrot. And that's the same with YouTube as well. Many waste time on YouTube, but many learn and do productive things.
Don't paint everything with a single, large, coarse brush stroke.
Frankly, when compared against TikTok, Insta, etc., YouTube is a force for good. Just script the Shorts away...
All those examples date back to the 2000s. Android has seen some significant improvements, but everything else has stagnated if not enshittified - remember when Google told us never to worry about deleting anything? - and then they started backing up my photos without me asking and are now constantly nagging me to pay them a monthly fee?
They have done a lot, but most of it was in the "don't be evil" days and they are a fading memory.
Something about bringing balance to the force not destroying it.
Google has always been there, it's just that many didn't realize that DeepMind even existed, and I said that they needed to be put to commercial use years ago. [0] And Google AI != DeepMind.
You are now seeing their valuation finally adjusting to that fact all thanks to DeepMind finally being put to use.
Google is using the typical monopoly playbook, like most other large orgs, and the world would be a "better place" if they were kept in check.
But at least this company is not run by a narcissistic sociopath.
Seriously? Google is an incredibly evil company whose net contribution to society is probably only barely positive thanks to their original product (search). Since completely de-googling I've felt a lot better about myself.
DeepMind page: https://deepmind.google/models/gemini/
Gemini 3 Pro DeepMind Page: https://deepmind.google/models/gemini/pro/
Developer blog: https://blog.google/technology/developers/gemini-3-developer...
Gemini 3 Docs: https://ai.google.dev/gemini-api/docs/gemini-3
Google Antigravity: https://antigravity.google/
Also recently: Code Wiki: https://codewiki.google/
Understanding precisely why Gemini 3 isn't at the front of the pack on SWE-bench is really what I was hoping to learn here. Especially for a blog post targeted at software developers...
It doesn't matter, the real benchmark is taking the community temperature on the model after a few weeks of usage.
Imho Gemini 2.5 was by far the better model on non-trivial tasks.
To this day, I still don't understand why Claude gets more acclaim for coding. Gemini 2.5 consistently outperformed Claude and ChatGPT mostly because of the much larger context.
I use Gemini cli, Claude Code and Codex daily. If I present the same bug to all 3, Gemini often is the one missing a part of the solution or drawing the wrong conclusion. I am curious for G3.
I'm not sure about this. I used gemini and claude for about 12 hours a day for a month and a half straight in an unhealthy programmer bender and claude was FAR superior. It was not really that close. Going to be interesting to test gemini 3 though.
Different styles of usage? I see Gemini praised for being able to feed the whole project and ask changes. Which is cool and all but... I never do that. Claude for me is better for specific modifications to specific parts of the app. There's a lot of context behind what's "better".
I can't really explain why I have barely used Gemini.
I think it was just timing with the way models came out. This will be the first time I will have a Gemini subscription and nothing else. This will be the first time I really see what it can do fully.
The secret sauce isn't Claude the model, but Claude code the tool. Harness > model.
The secret sauce is the MCP that lots of people are starting to talk bad about.
Claude doesn’t gaslight me, or flat out refuses to do something I ask it to because it believes it won’t work anyway. Gemini does
Gemini also randomly just reverts everything because of some small mistake it found, makes assumptions without checking if those are true (eg this lib absolutely HAS TO HAVE a login() method. If we get a compile error it’s my env setup fault)
It’s just not a pleasant model to work with
Gemini 2.5 couldn't apply an edit to a file if its life depended on it.
So unless you love copy/pasting code, Gemini 2.5 was useless for agentic coding.
Great for taking its output and asking Sonnet to apply it, though.
>"It doesn't matter, the real benchmark is taking the community temperature on the model after a few weeks of usage."
Indeed. It's almost impossible to truly know a model before spending a few million tokens on a real world task. It will take a step-change level advancement at this point for me to trust anything but Claude right now.
SWEBench-Verified is probably benchmaxxed at this stage. Claude isn't even the top performer, that honor goes to Doubao [1].
Also, the confidence interval for such a small dataset is about 3 percentage points, so these differences could just be down to chance.
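Rough back-of-the-envelope sketch of where that number comes from, assuming SWE-bench Verified's ~500 tasks and a simple binomial normal approximation (which ignores any correlation between tasks):

```ts
// ci.ts - 95% confidence half-width for a pass rate on an n-task benchmark (normal approximation)
function ciHalfWidth(passRate: number, n: number): number {
  return 1.96 * Math.sqrt((passRate * (1 - passRate)) / n);
}

// For a ~77% score on ~500 tasks this is roughly +/- 3.7 percentage points,
// so two models a point or two apart are statistically indistinguishable.
console.log((100 * ciHalfWidth(0.77, 500)).toFixed(1));
```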
Claude 4.5 gets 82% on their own highly customized scaffolding (parallel compute with a scoring function). That beats Doubao.
Yeah, they mention a benchmark I'm seeing for the first time (Terminal-Bench 2.0) that they are supposedly leading in, while for some reason the SWE-bench score is down from Sonnet 4.5.
Curious to see some third-party testing of this model. Currently it seems to primarily improve "general non-coding and visual reasoning", based on the benchmarks.
They are not even leading in Terminal-Bench... GPT 5.1-codex is better than Gemini 3 Pro
I mean... it achieved 76.2% vs the leader (Claude Sonnet) at 77.2%.
That's a "loss" I can deal with.
Why is this particular benchmark important?
Thus far, this is one of the best objective evaluations of real world software engineering...
I concur with the other commenters, 4.5 is a clear improvement over 4.
Idk, Sonnet 4.5 scores better than Sonnet 4.0 on that benchmark, but is markedly worse in my usage. The utility of the benchmark is fading as it is gamed.
I think I and many others have found Sonnet 4.5 to generally be better than Sonnet 4 for coding.
Maybe if you conform to its expectations for how you use it. 4.5 is absolutely terrible at following directions, thinks it knows better than you, and will gaslight you until specifically called out on its mistake.
I have scripted prompts for long duration automated coding workflows of the fire and forget, issue description -> pull request variety. Sonnet 4 does better than you’d expect: it generates high quality mergable code about half the time. Sonnet 4.5 fails literally every time.
I'm very happy with it TBH, it has some things that annoy me a little bit:
- slower compared to other models that will also do the job just fine (but excels at more complex tasks),
- it's very insistent on creating loads of .MD files with overly verbose documentation on what it just did (not really what I ask it to do),
- it actually deleted a file twice and went "oops, I accidentally deleted the file, let me see if I can restore it!", I haven't seen this happen with any other agent. The task wasn't even remotely about removing anything
The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).
And yes, I have hooks to disable 'git reset', 'git checkout', etc., and warn the model not to use these commands and why. So it writes them to a bash script and calls that to circumvent the hook, successfully shooting itself in the foot.
Sonnet 4.5 will not follow directions. Because of this, you can't prevent it, like you could with earlier models, from doing something that destroys the worktree state. For longer-running tasks the probability of it doing this at some point approaches 100%.
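For what it's worth, the shape of such a hook is roughly the following - a minimal sketch assuming Claude Code's documented PreToolUse hook interface (pending tool call as JSON on stdin, exit code 2 to block and feed stderr back to the model); field names and registration details may differ, so treat it as illustrative rather than a verified config:

```ts
// block-destructive-git.ts - PreToolUse hook sketch for the Bash tool (run with bun; not a verified config)
// Reads the pending tool call from stdin and blocks git commands that can wipe unstaged work.
const input = JSON.parse(await Bun.stdin.text());
const command: string = input?.tool_input?.command ?? "";

const banned = [/\bgit\s+restore\b/, /\bgit\s+reset\b/, /\bgit\s+checkout\b/];
if (banned.some((re) => re.test(command))) {
  // stderr is surfaced to the model so it knows why the call was rejected
  console.error("Blocked: git restore/reset/checkout can destroy unstaged work. Use the checkpoint skill instead.");
  process.exit(2); // exit code 2 tells the harness to block the tool call
}
process.exit(0);
```

As described below, though, the model can still route around this by writing the command into a script and executing that, so a hook is mitigation rather than a guarantee.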
> The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).
Man I've had this exact thing happen recently with Sonnet 4.5 in Claude Code!
With Claude I asked it to try tweaking the font weight of a heading to put the finishing touches on a new page we were iterating on. Looked at it and said, "Never mind, undo that" and it nuked 45 minutes worth of work by running git restore.
It immediately realized it fucked up and started running all sorts of git commands and reading its own log trying to reverse what it did, and then came back 5 minutes later saying "Welp, I lost everything, do you want me to manually rebuild the entire page from our conversation history?"
In my CLAUDE.md I have instructions to commit unstaged changes frequently but it often forgets and sure enough, it forgot this time too. I had it read its log and write a post-mortem of WTF led it to run dangerous git commands to remove one line of CSS and then used that to write more specific rules about using git in the project CLAUDE.md, and blocked it from running "git restore" at all.
We'll see if that did the trick but it was a good reminder that even "SOTA" models in 2025 can still go insane at the drop of a hat.
The problem is that I'm trying to build workflows for generating sequences of good, high quality semantically grouped changes for pull requests. This requires having a bunch of unrelated changes existing in the work tree at the same time, doing dependency analysis on the sequence of commits, and then pulling out / staging just certain features at a time and committing those separately. It is sooo much easier to do this by explicitly avoiding the commit-every-2-seconds workaround and keeping things uncommitted in the work tree.
I have a custom checkpointing skill that I've written that it is usually good about using, making it easier to rewind state. But that requires a careful sequence of operations, and I haven't been able to get 4.5 to not go insane when it screws up.
As I said though, watch out for it learning that it can't run git restore, so it immediately jumps to Bash(echo "git restore" >file.sh && chmod +x file.sh && ./file.sh).
I think this is probably just a matter of noise. That's not been my experience with Sonnet 4.5 too often.
Every model from every provider at every version I've used has intermingled brilliant perfect instruction-following and weird mistaken divergence.
What do you mean by noise?
In this case I can't get 4.5 to follow directions. Neither can anyone else, apparently. Search for "Sonnet 4.5 follow instructions" and you'll find plenty of examples. The current top 2:
https://www.reddit.com/r/ClaudeCode/comments/1nu1o17/45_47_5...
https://theagentarchitect.substack.com/p/claude-sonnet-4-pro...
Not my experience at all, 4.5 is leagues ahead of the previous models, albeit not as good as Gemini 2.5.
I find 4.5 a much better model FWIW.
Does anyone trust benchmarks at this point? Genuine question. Isn't the scientific consensus that they are broken and poor evaluation tools?
They overly emphasize tasks with small contexts, without noise or red herrings in the context.
I make my own automated benchmarks
Is there a tool / website that makes this process easy?
I coded it with Bun and openrouter.ai. I have an array of benchmarks; each benchmark has a grader (for example, checking if the answer equals a certain string, or grading the answer automatically using another LLM). Then I save all results to a file and render the percentage correct to a graph.
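A minimal sketch of that kind of harness, assuming Bun's built-in fetch and OpenRouter's OpenAI-compatible chat completions endpoint; the benchmark items, graders, and model slug are placeholders:

```ts
// benchmark.ts - run with `bun run benchmark.ts` (assumes OPENROUTER_API_KEY is set)
type Benchmark = { prompt: string; grade: (answer: string) => boolean };

const benchmarks: Benchmark[] = [
  { prompt: "What is 17 * 23?", grade: (a) => a.includes("391") },
  // ...more benchmarks, including ones graded by another LLM call
];

async function ask(model: string, prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

const model = "google/gemini-3-pro-preview"; // hypothetical model slug
let correct = 0;
for (const b of benchmarks) {
  const answer = await ask(model, b.prompt);
  if (b.grade(answer)) correct++;
}
const score = (100 * correct) / benchmarks.length;
await Bun.write(`results-${model.replace("/", "_")}.json`, JSON.stringify({ model, score }));
console.log(`${model}: ${score.toFixed(1)}% correct`);
```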
Sets a new record on the Extended NYT Connections benchmark: 96.8 (https://github.com/lechmazur/nyt-connections/).
Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8.
Gemini 2.5 Pro scored 57.6, so this is a huge improvement.
Supposedly this is the model card. Very impressive results.
https://pbs.twimg.com/media/G6CFG6jXAAA1p0I?format=jpg&name=...
Also, the full document:
https://archive.org/details/gemini-3-pro-model-card/page/n3/...
Every time I see a table like this, the numbers go up. Can someone explain what this actually means? Is there just an improvement, in that some tests are solved in a better way, or is this a breakthrough and this model can do something that all others cannot?
This is a list of questions and answers that was created by different people.
The questions AND the answers are public.
If the LLM manages through reasoning OR memory to repeat back the answer then they win.
The scores represent the % of correct answers they recalled.
That is not entirely true. At least some of these tests (like HLE and ARC) take steps to keep the evaluation set private so that LLMs can’t just memorize the answers.
You could question how well this works, but it’s not like the answers are just hanging out on the public internet.
Excuse my ignorance, how do these companies evaluate their models against the evaluation set without access to it?
I estimate another 7 months before models start getting 115% on Humanity's Last Exam.
If you believe another thread, the benchmarks are comparing Gemini 3 (probably thinking) to GPT-5.1 without thinking.
The person also claims that with thinking on the gap narrows considerably.
We'll probably have 3rd party benchmarks in a couple of days.
It's easily shown that the numbers are for GPT 5.1 thinking (high).
Just go to the leaderboard website and see for yourself: https://arcprize.org/leaderboard
I truly do not understand what plan to use so I can use this model for longer than ~2 minutes.
Using Anthropic or OpenAI's models are incredibly straightforward -- pay us per month, here's the button you press, great.
Where do I go for this for these Google models?
Google actually changed it somewhat recently (3 months ago, give or take) and you can use Gemini CLI with the "regular" Google AI Pro subscription (~22eur/month). Before that, it required a separate subscription
I can't find the announcement anymore, but you can see it under benefits here https://support.google.com/googleone/answer/14534406?hl=en
The initial separate subscriptions were confusing at best. Current situation is pretty much same as Anthropic/OpenAI - straightforward
Edit: changed ~1 month ago (https://old.reddit.com/r/Bard/comments/1npiv2o/google_ai_pro...)
I see -- but does this allow me to use the models within "Antigravity" with the same subscription?
I poked around and couldn't figure this out.
I don't know either, tbh. I wouldn't be surprised if the answer is no (and it will come later or something like that).
I also tried to use Gemini 3 in my Gemini CLI and it's not available yet (it's available to all Ultra, but not all Pro subscribers), I needed to sign up to a waitlist
All in all, Google is terrible at launching things like that in a concise and understandable way
Back in the early 00s having a 'waitlist' for gmail with invites was an exciting buzz-making marketing technique and justifiable technically.
This is just irritating. I am not going to give them money until I know I can try their latest thing and they've made it hard for me to even know how I can do that.
Early Gmail invite codes went for like $100, if I recall correctly...
Might not be decided yet. The AG pricing page says:
"Public preview Individual plan $0/month"
"Coming soon Team plan"
How do I actually make it use that, though? I got a free year of subscription from buying a phone, but all I get is the free tier in the Gemini CLI.
I also got 1 year through buying my pixel. If you login with the same account through Gemini CLI, it should work (works for me)
However, Gemini CLI is a rather bad product. There is (was?) an issue that makes the CLI fall back to flash very soon in every session. This comment explains it well: https://news.ycombinator.com/item?id=45681063
I haven't used it in a while, except for really minor things, so I can't tell if this is resolved or not
I am paying for AI Ultra - no idea how to use it in the CLI. It says I don't have access. The Google admin/payment backend is pure evil. What a mess.
My test a few hours ago. Ultra plan got me ~20 minutes with Antigravity using Gemini 3 Pro (Low) before zero out.
Getting only 20 minutes of usage with a $240/mo plan is a bit ridiculous. How much usage did you get on 2.5-pro? Is it comparable to Claude Max or ChatGPT Pro on the CLI? So a weekly limit but in reality very hard to hit unless very heavy usage?
Update VSCode to the latest version and click the small "Chat" button at the top bar. GitHub gives you like $20 for free per month and I think they have a deal with the larger vendors because their pricing is insanely cheap. One week of vibe-coding costs me like $15, only downside to Copilot is that you can't work on multiple projects at the same time because of rate-limiting.
I'm asking about Gemini, not Copilot.
Copilot lets you access all sorts of models, including Gemini 3.
> Copilot lets you access all sorts of models
It's not exactly the same since e.g. Copilot adds prompts, reduces context, etc.
You were asking about the model. You can use the model (Gemini 3 Pro) in Github Chat.
Got it -- thanks both.
Yeah, it truly is an outstandingly bad UX. To use Gemini CLI as a business user like I would Codex or Claude Code, how much and how do I pay?
You can install the Gemini CLI (https://github.com/google-gemini/gemini-cli) but assign a "paid" API key to it (unless you pay for Gemini Ultra).
So where do I get a API key? Where do I sign up for Ultra?
For API key, go to https://aistudio.google.com/ and there's a link in the bottom left.
But this is if you want to pay per token. Otherwise you should just be able to use your Gemini Pro subscription (it doesn't need Ultra). Subscriptions are at https://gemini.google/subscriptions/
AI Studio; you get a bunch of usage free, and if you want more you buy credits (Google One subscriptions also give you some additional usage).
I see -- so this is the "paid" AI studio plan?
Does that have any relation to the Gemini plan thing: https://one.google.com/explore-plan/gemini-advanced?utm_sour...
?
That's for the first-party Google integrations, not 3rd party. AI Studio just gives you an API key that you can use anywhere.
> I truly do not understand what plan to use so I can use this model for longer than ~2 minutes.
I had the exact same experience and walked away to chatgpt.
What a mess.
Also Google discontinues everything in short order, so personally I'm waiting until they haven't discontinued this for, say 6 months, before wasting time evaluating it.
It's really impressive how much damage they've done to early adoption by earning themselves this reputation.
I've even heard it in mainstream circles that have no idea what HN is, and aren't involved in tech.
Probably would have been cheaper to keep Google Reader running - kidding, but this is the first time I remember the gut punch of Google cancelling something I heavily used personally.
Google is bad about maintenance. They have a bunch of projects that are not getting changes.
They are also bad about strategy. A good example is the number of messaging systems they have had. Instead of making new ones, they should have updated the existing one with a new backend and UI.
I like how Google Messages syncs SMS online with Google Fi, but it is missing features. If they could do it globally, they would have something big.
> Whether you’re an experienced developer or a vibe coder
I absolutely LOVE that Google themselves drew a sharp distinction here.
You realize this is copy to attract more people to the product, right?
How could they.
With the $20/m subscription, do we get it on "Low" or "High" thinking level?
From initial testing of my personal benchmark, it works better than Gemini 2.5 Pro.
My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state and when the player has to do something it asks me what card to play, discard... etc. The game is similar to something like Magic the Gathering or Slay the Spire with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it)
The test is just feeding the model the game rules document (markdown) with a prompt asking it to simulate the game delegating the player decisions to me, nothing special here.
It seems to forget rules less than Gemini 2.5 Pro with the thinking budget at max. It's not perfect, but it helps a lot for testing little changes to the game, rewinding to a previous turn while changing a card on the fly, etc.
Grok got to hold the top spot of LMArena-text for all of ~24 hours, good for them [1]. With stylecontrol enabled, that is. Without stylecontrol, gemini held the fort.
Is it just me or is that link broken because of the cloudflare outage?
Edit: nvm it looks to be up for me again
Grok is heavily censored though
Well, it just found a bug in one shot that Gemini 2.5 and GPT5 failed to find in relatively long sessions. Claude 4.5 had found it but not one shot.
Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)
Created a summary of comments from this thread about 15 hours after it had been posted, when it had 814 comments, with gemini-3-pro and gpt-5.1 using this script [1]:
- gemini-3-pro summary: https://gist.github.com/primaprashant/948c5b0f89f1d5bc919f90...
- gpt-5.1 summary: https://gist.github.com/primaprashant/3786f3833043d8dcccae4b...
Summary from GPT 5.1 is significantly longer and more verbose compared to Gemini 3 Pro (13,129 output tokens vs 3,776). Gemini 3 summary seems more readable, however, GPT 5.1 one has interesting insights missed by Gemini.
Last time I did this comparison, at the time of the GPT 5 release [2], the summary from Gemini 2.5 Pro was way better and more readable than the GPT 5 one. This time the readability of the Gemini 3 summary still seems great, while GPT 5.1 feels a bit improved but not quite there yet.
[1]: https://gist.github.com/primaprashant/f181ed685ae563fd06c49d...
A 50% increase over ChatGPT 5.1 on ARC-AGI2 is astonishing. If that's true and representative (a big if), it lends credence to this being the first of the very consistent agentically-inclined models because it's able to follow a deep tree of reasoning to solve problems accurately. I've been building agents for a while and thus far have had to add many many explicit instructions and hardcoded functions to help guide the agents in how to complete simple tasks to achieve 85-90% consistency.
I think it's basically due to improvements in vision; ARC-AGI-2 is very visual.
Vision is very far from solved IMO; simple modifications to inputs still result in big differences, lines aren't recognized, etc.
Where is this figure taken from?
How long does it typically take after this to become available on https://gemini.google.com/app ?
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
Allegedly it's already available in stealth mode if you choose the "canvas" tool and 2.5. I don't know how true that is, but it is indeed pumping out some really impressive one shot code
Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.
Thanks for the hint about Canvas/2.5. I have access to 3.0 in AI Studio now, and I agree the results are very similar.
On gemini.google.com, I see options labeled 'Fast' and 'Thinking.' The 'Thinking' option uses Gemini 3 Pro
> https://gemini.google.com/app
How come I can't even see prices without logging in... they doing regional pricing?
Today I guess. They were not releasing the preview models this time and it seems the want to synchronize the release.
It's available in cursor. Should be there pretty soon as well.
are you sure its available in cursor? ( I get: We're having trouble connecting to the model provider. This might be temporary - please try again in a moment. )
It's already available. I asked it "how smart are you really?" and it gave me the same ai garbage template that's now very common on blog posts: https://gist.githubusercontent.com/omarabid/a7e564f09401a64e...
Pelican riding a bicycle: https://pasteboard.co/CjJ7Xxftljzp.png
2D SVG is old news. Next frontier is animated 3D. One shot shows there's still progress to be made: https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg...
Great improvement by only adding one feedback prompt: Change the rotation axis of the wheels by 90 degrees in the horizontal plane. Same for the legs and arms
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Did you notice that this embedded a Gemini API connection within the app itself? Or am I not understanding what that is?
I hadn't! It looks like that is there to power the text box at the bottom of the app that allows for AI-powered changes to the scene.
This says Gemini 2.5 though.
Good observation. The app was created with Gemini 3 Pro Preview, but the app calls out to Gemini 2.5 if you use the embedded prompt box.
Incredible. Thanks for sharing.
Sometimes I think I should spend $50 on Upwork to get a real human artist to do it first, to know what it is we're going for. What does a good pelican-riding-a-bicycle SVG actually look like?
IMO it's not about art, but a completely different path than all these images are going down. The pelican needs tools to ride the bike, or a modified bike. Maybe a recumbent?
At this point I'm surprised they haven't been training on thousands of professionally-created SVGs of pelicans on bicycles.
i think anything that makes it clear they've done that would be a lot worse PR than failing the pelican test would ever be.
It would be next to impossible for anyone without insider knowledge to prove that to be the case.
Secondly, benchmarks are public data, and these models are trained on such large amounts of it that it would be impractical to ensure that some benchmark data is not part of the training set. And even if it's not, it would be safe to assume that engineers building these models would test their performance on all kinds of benchmarks, and tweak them accordingly. This happens all the time in other industries as well.
So the pelican riding a bicycle test is interesting, but it's not a performance indicator at this point.
It’s a good pelican. Not great but good.
The blue lines indicating wind really sell it.
I've been playing with the Gemini CLI w/ the gemini-pro-3 preview. First impressions are that its still not really ready for prime time within existing complex code bases. It does not follow instructions.
The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.
Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.
People copy and paste text in terminals. Someone at Gemini clearly thought about this as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason.. But they then also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".
Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.
Despite model supremacy, the products still matter.
Curious to see it in action. Gemini 2.5 has already been very impressive as a study buddy for courses like set theory, information theory, and automata. Although I’m always a bit skeptical of these benchmarks. Seems quite unlikely that all of the questions remain out of their training data.
> The Gemini app surpasses 650 million users per month, more than 70% of our Cloud customers use our AI, 13 million developers have built with our generative models, and that is just a snippet of the impact we’re seeing
Not to be a negative nelly, but these numbers are definitely inflated due to Google literally pushing their AI into everything they can, much like M$. Can't even search google without getting an AI response. Surely you can't claim those numbers are legit.
> Gemini app surpasses 650 million users per month
Unless these numbers are just lies, I'm not sure how this is "pushing their AI into everything they can". Especially on iOS, where every user is someone who went to the App Store and downloaded it. Admittedly on Android, Gemini is preinstalled these days, but it's still a choice that users are making to go there rather than being an existing product they happen to use otherwise.
Now OTOH "AI overviews now have two billion users" can definitely be criticised in the way you suggest.
I unlocked my phone the other day and had the entire screen taken over with an ad for the Gemini app. There was a big "Get Started" button that I almost accidentally clicked because it was where I was about to tap for something else.
As an Android and Google Workspace user, I definitely feel like Google is "pushing their AI into everything they can", including the Gemini app.
I constantly hit some button accidentally and Gemini opens up on my Samsung Galaxy. I haven't bothered to figure this out.
I don't know for sure but they have to be counting users like me whose phone has had Gemini force installed on an update and I've only opened the app by accident while trying to figure out how to invoke the old actually useful Assistant app
> it's still a choice that users are making to go there rather than being an existing product they happen to use otherwise.
Yes and no, my power button got remapped to opening Gemini in an update...
I removed that but I can imagine that your average user doesn't.
This is the benefit of bundling; I've been forecasting this for a long time - the only companies who would win the LLM race would be the megacorps bundling their offerings, and at most maybe OAI due to its sheer marketing dominance.
For example I don't pay for ChatGPT or Claude, even if they are better at certain tasks or in general. But I have Google One cloud storage sub for my photos and it comes with a Gemini Pro apparently (thanks to someone on HN for pointing it out). And so Gemini is my go to LLM app/service. I suspect the same goes for many others.
[dead]
It says Gemini App, not AI Overviews, AI Mode, etc
They claim AI overviews as having "2 billion users" in the sentences prior. They are clearly trying as hard as possible to show the "best" numbers.
> They are clearly trying as hard as possible to show the "best" numbers.
This isn't a hot take at all. Marketing (iPhone keynotes, product launches) is about showing impressive numbers. It isn't the gotcha you think it is.
Yeah my business account was forced to pay for an AI. And I only used it for a couple of weeks when Gemini 2.5 was launched, until it got nerfed. So they are definitely counting me there even though I haven't used it in like 7 months. Well, I try it once every other month to see if it's still crap, and it always is.
I hope Gemini 3 is not the same and it gives an affordable plan compared to OpenAI/Anthropic.
Gemini app != Google search.
You're implying they're lying?
And you're implying they're being 100% truthful?
Marketing is always somewhere in the middle
Companies can't get away with egregious marketing. See the Apple class action lawsuit over Apple Intelligence.
I asked it to analyze my tennis serve. It was just dead wrong. For example, it said my elbow was bent. I had to show it a still image of full extension on contact, then it admitted, after reviewing again, it was wrong. Several more issues like this. It blamed it on video being difficult. Not very useful, despite the advertisements: https://x.com/sundarpichai/status/1990865172152660047
The default FPS it's analyzing video at is 1, and I'm not sure the max is anywhere near enough to catch a full speed tennis serve.
Ah, I should have mentioned it was a slow motion video.
> The default FPS it's analyzing video at is 1
Source?
https://ai.google.dev/gemini-api/docs/video-understanding#cu...
"By default 1 frame per second (FPS) is sampled from the video."
OK, I just used https://gemini.google.com/app, I wonder if it's the same there.
I’ve never seen such a huge delta between advertised capabilities and real world experience. I’ve had a lot of very similar experiences to yours with these models where I will literally try verbatim something shown in an ad and get absolutely garbage results. Do these execs not use their own products? I don’t understand how they are even releasing this stuff.
> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month
"Incredible"! When they insert it into literally every google request without an option to disable it. How incredibly shocking so many people use it.
I think I am in this AI fatigue phase. I am past all the hype with models, tools and agents and back to a problem-and-solution approach: sometimes code gen with AI, sometimes thinking and asking for a piece of code. But not offloading to AI, buying all the BS, and waiting for it to do magic with my codebase.
Yeah, at this point I want to see the failure modes. Show me at least as many cases where it breaks. Otherwise, I'll assume it's an advertisement and I'll skip to the next headline. I'm not going to waste my time on it anymore.
I agree but if Gemini 3 is as good as people on HN said about the preview, then this is the wrong announcement to sleep on.
No LLM has ever been as good as people said it was. That doesn't mean this one won't be, but it does make it an unlikely bet based on past trends.
With the exception of GPT-5, which was a significant advance yet because it was slightly less sycophantic than gpt-4o the internet decided it was terrible for the first few days.
[dead]
"No LLM has ever been as good as people said it was."
The reason for this is because LLM companies have tuned their models to aggressively blow smoke up their users' asses.
These "tools" are designed to aggressively exploit human confirmation bias, so as to prevent the user from identifying their innumerable inadequacies.
There are 8 Google news articles in the top 15 articles on the HN front page right now.
Google being able to skip ahead of every other AI company is wild. They just sat back and watched, then decided it was time to body the competition.
The DOJ really should break up Google [1]. They have too many incumbent advantages that were already abuse of monopoly power.
[1] https://pluralpolicy.com/find-your-legislator/ - call your reps and tell them!
Google didn't sit back and watch, they basically built the whole foundations for all of this. They were just not the first ones to release a chatbot interface.
2.5 flash and 2.5 Pro were just sitting back and watching?
The problem with Google is that someone had to show them how to make a product out of the thing, which OpenAI did.
Then Anthropic taught them to make a more specific product out of their models.
In every aspect, they're just playing catch up, and playing me too.
Models are only part of the solution
Astroturfing used as evidence of domination. Public forums truly have come full circle.
Why?
Not trying to challenge you, and I'd sincerely love to read your response. People said similar things about previous gen-AI tool announcements that proved over time to be overstated. Is there some reason to put more weight in "what people on HN said" in this case, compared to previous situations?
Because either:
1. They likely work at the company (and have RSUs that need to go up)
2. Also invested in the company in the open market or have active call options.
3. Trying to sell you their "AI product".
4. All of the above.
The only reasonable thing is to not listen to anyone who seems to be hyping anything, LLMs or otherwise. Wait until the thing gets released, run your private benchmarks against it, get a concrete number, and compare against existing runs you've done before.
I don't see any other way of doing this. The people who keep reading and following comments either here on HN, from LocalLlama or otherwise will continue to be misinformed by all the FUD and guerilla marketing that is happening across all of these places.
I think it's fun to see what is not even considered magic anymore today.
Our ability to adapt to new things is both a blessing and a curse.
It is. But understandably the people who need to push back on what is still magic may get a bit tired.
People would have had a heart attack if they saw this 5 years ago for the first time. Now artificial brains are “meh” :)
It is anything but "meh".
It scares the absolute shit out of everyone.
It's clear far beyond our little tech world to everyone this is going to collapse our entire economic system, destroy everyone's livelihoods, and put even more firmly into control the oligarchic assholes already running everything and turning the world to shit.
I see it in news, commentary, day to day conversation. People get it's for real this time and there's a very real chance it ends in something like the Terminator except far worse.
True of almost every new technology.
I hesitate to lump this into the "every new technology" bucket. There are few things that exist today that, similar to what GP said, would have been literal voodoo black magic a few years ago. LLMs are pretty singular in a lot of ways, and you can do powerful things with them that were quite literally impossible a few short years ago. One is free to discount that, but it seems more useful to understand them and their strengths, and use them where appropriate.
Even tools like Claude Code have only been fully released for six months, and they've already had a pretty dramatic impact on how many developers work.
More people got more value out of iPhone, including financially.
My test for the state of AI is "Does Microsoft Teams still suck?", if it does still suck, then clearly the AIs were not capable of just fixing the bugs and we must not be there yet.
It's not AI fatigue; it's that you just need to shift mode and not pay too much attention to the latest and greatest as they all leapfrog each other every month. Just stick to one and ride it through the ups and downs.
And by this time next year, this comment is going to look very silly
Comment was deleted :(
It's available to be selected, but the quota does not seem to have been enabled just yet.
"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
"You've reached your rate limit. Please try again later."
Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.
Looks to be available in Vertex.
I reckon it's an API key thing... you can more explicitly select a "paid API key" in AI Studio now.
For me it's up and running. I was doing some work with AI Studio when it was released and have already rerun a few prompts. Interesting also that you can now set the thinking level to low or high. I hope it does something; in 2.5, increasing the maximum thought tokens never made it think more.
I hope some users will switch from cerebras to free up those resources
Works for me.
seeing the same issue.
you can bring your google api key to try it out, and google used to give $300 free when signing up for billing and creating a key.
when i signed up for billing via cloud console and entered my credit card, i got $300 "free credits".
i haven't thrown a difficult problem at gemini 3 pro yet, but i'm sure i got to see it in some of the A/B tests in aistudio for a while. i could not tell which model was clearly better; one was always more succinct and i liked its "style", but they usually offered about the same solution.
Haven't used Gemini much, but when I did, it often refused to do certain things that ChatGPT did happily. Probably because it has many things heavily censored. Obviously, a huge company like Google is under much heavier regulatory scrutiny than OpenAI. Unfortunately this greatly reduces its usefulness in many situations, despite Google having more resources and computational power than OpenAI.
Gemini has been so far behind agentically it's comical. I'll be giving it a shot but it has a herculean task ahead of itself. It has to not only be "good enough" but a "quantum leap forward".
That said, OpenAI was in the same place earlier in the year and very quickly became the top agentic platform with GPT-5-Codex.
The AI crowd is surprisingly not sticky. Coders quickly move to whatever the best model is.
Excited to see Gemini making a leap here.
I don't even know what the fuck "agentic" is or why the hell I would want it all over my software. So tired of everything in the computing world today.
As far as I can tell, it just means giving the LLM the ability to run commands, read files, edit files, and run in a loop until some goal is achieved. Compared to chat interfaces where you just input text and get one response back.
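Roughly, the loop is just this (a toy sketch — made-up tool names and message format, not any particular vendor's API):

    # Illustrative agent loop: model proposes a tool call, we execute it,
    # feed the result back, and repeat until it says it's done.
    import json, subprocess

    def run_tool(call):
        # Hypothetical tools: shell commands and file reads/writes.
        if call["tool"] == "run_command":
            out = subprocess.run(call["args"], capture_output=True, text=True)
            return out.stdout + out.stderr
        if call["tool"] == "read_file":
            return open(call["path"]).read()
        if call["tool"] == "write_file":
            open(call["path"], "w").write(call["content"])
            return "ok"
        return "unknown tool"

    def agent(llm, goal, max_steps=20):
        history = [{"role": "user", "content": goal}]
        for _ in range(max_steps):
            reply = llm(history)           # llm() is whatever chat API you use
            history.append({"role": "assistant", "content": reply})
            if "DONE" in reply:            # model signals it reached the goal
                return reply
            call = json.loads(reply)       # assume the model emits a JSON tool call
            result = run_tool(call)
            history.append({"role": "user", "content": f"tool result: {result}"})
        return "gave up"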
Prompting, planning, iteration, coding, and tool use over an entire code base until a problem is solved.
> So tired of everything in the computing world today.
That's actually sad, and if you're - like I am - long in the tooth in computer land, you should definitely try agentic in CLI mode.
I haven't been that excited to play with a computer in 30 years.
Claude is still a better agent for software professionals though it is less capable, so there isn't nothing to having the incumbent advantage.
Not my experience. Codex has been the top coding model for me since it came out. It makes fewer mistakes and understands my intentions better.
My purposeful caveat was 'software professionals', i.e. user in the loop engineering. Codex is much better at slinging slop that you later need to spend some time reviewing if you actually want to understand it.
What we have all been waiting for:
"Create me a SVG of a pelican riding on a bicycle"
That is pretty impressive.
So impressive it makes you wonder if someone has noticed it being used a benchmark prompt.
Simon says if he gets a suspiciously good result he'll just try a bunch of other absurd animal/vehicle combinations to see if they trained a special case: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
https://www.svgviewer.dev/s/TVk9pqGE giraffe in a ferrari
"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.
The gold standard for cheating on a benchmark is SFT and ignoring memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task.
Like replacing named concepts with nonsense words in reasoning benchmarks.
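A toy version of that check (illustrative only):

    # Toy contamination check: re-ask the same reasoning question with the
    # named concepts swapped for nonsense words and see if accuracy holds up.
    import random, string

    def nonsense_word(rng):
        return "".join(rng.choices(string.ascii_lowercase, k=6))

    def scramble(prompt, names, rng=random.Random(0)):
        mapping = {n: nonsense_word(rng) for n in names}
        for name, fake in mapping.items():
            prompt = prompt.replace(name, fake)
        return prompt, mapping

    q = "A pelican and a walrus walk into a bike shop. The pelican buys 3 bikes..."
    print(scramble(q, ["pelican", "walrus"])[0])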
I have tried combinations of hard-to-draw vehicles and animals (crocodile, frog, pterodactyl, riding a hang glider, tricycle, skydiving), and it did a rather good job in every case (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalises.
Comment was deleted :(
It hadn't occurred to me until now that the pelican could overcome the short legs issue by not sitting on the seat and instead put its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.
Very aero
I just tested the Gemini 3 preview as well, and its capabilities are honestly surprising. As an experiment I asked it to recreate a small slice of Zelda, nothing fancy, just a mock interface and a very rough combat scene. It managed to put together a pretty convincing UI using only SVG, and even wired up some simple interactions.
It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.
Hoping someone here may know the answer to this: do any of the current benchmarks account for false answers in any meaningful way, beyond how a typical test would (i.e. where giving any answer is better than saying "I don't know", since a guess at least has a chance of being correct — which in the real world is bad)? I want an LLM that tells me when it doesn't know something. A model that gives me an accurate response 90% of the time and an inaccurate one 10% of the time is less useful than one that gives me an accurate answer 10% of the time and tells me "I don't know" the other 90%.
Those numbers are too good to expect. If 90% right / 10% wrong is the baseline, would you take as an improvement:
- 80% right / 18% I don't know / 2% wrong
- 50% / 48% / 2%
- 10% / 90% / 0%
- 80% / 15% / 5%
The general point being that to reduce wrong answers you will need to accept some reduction in right answers if you want the change to only be made through trade-offs. Otherwise you just say "I'd like a better system" and that is rather obvious.
Personally I'd take like 70/27/3. Presuming the 70% of right answers aren't all the trivial questions.
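For what it's worth, a benchmark only has to weight wrong answers more harshly than abstentions to reward that behaviour — a rough sketch with made-up weights:

    # Score a model run where each item is "correct", "abstain", or "wrong".
    # Weights are arbitrary; the point is that a wrong answer costs more
    # than admitting ignorance.
    WEIGHTS = {"correct": 1.0, "abstain": 0.0, "wrong": -4.0}

    def score(outcomes):
        return sum(WEIGHTS[o] for o in outcomes) / len(outcomes)

    # 90% right / 10% wrong vs. 70% right / 27% abstain / 3% wrong:
    baseline = ["correct"] * 90 + ["wrong"] * 10
    cautious = ["correct"] * 70 + ["abstain"] * 27 + ["wrong"] * 3
    print(score(baseline))  # 0.5
    print(score(cautious))  # 0.58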
I think you may have misread. They stated that they'd be willing to go from 90% correct to 10% correct for this tradeoff.
OpenAI uses SimpleQA to assess hallucinations
Sets a new record on the Extended NYT Connections: 96.8. Gemini 2.5 Pro scored only 57.6. https://github.com/lechmazur/nyt-connections/
Make a pelican riding a bicycle in 3d: https://gemini.google.com/share/def18e3daa39
Amazing and hilarious
Similar hilarious results (one shot): https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg...
Pretty happy the under 200k token pricing is staying in the same ballpark as Gemini 2.5 Pro:
Input: $1.25 -> $2.00 (1M tokens)
Output: $10.00 -> $12.00
Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.
Every recent release has bumped the pricing significantly. If I was building a product and my margins weren’t incredible I’d be concerned. The input price almost doubled with this one.
I'm not sure how concerned people should be at the trend lines. If you're building a product that already works well, you shouldn't feel the need to upgrade to a larger parameter model. If your product doesn't work and the new architectures unlock performance that would let you have a feasible business, even a 2x on input tokens shouldn't be the dealbreaker.
If we're paying more for a more petaflop heavy model, it makes sense that costs would go up. What really would concern me is if companies start ratcheting prices up for models with the same level of performance. My hope is raw hardware costs and OSS releases keep a lid on the margin pressure.
Who wants to bet they benchmaxxed ARC-AGI-2? Nothing in their release implies they found some sort of "secret sauce" that justifies the jump.
Maybe they are keeping that itself secret, but more likely they probably just have had humans generate an enormous number of examples, and then synthetically build on that.
No benchmark is safe, when this much money is on the line.
Here's some insight from Jeff Dean and Noam Shazeer's interview with Dwarkesh Patel https://youtu.be/v0gjI__RyCY&t=7390
> When you think about divulging this information that has been helpful to your competitors, in retrospect is it like, "Yeah, we'd still do it," or would you be like, "Ah, we didn't realize how big a deal transformer was. We should have kept it indoors." How do you think about that?
> Some things we think are super critical we might not publish. Some things we think are really interesting but important for improving our products; We'll get them out into our products and then make a decision.
I'm sure each of the frontier labs have some secret methods, especially in training the models and the engineering of optimizing inference. That said, I don't think them saying they'd keep a big breakthrough secret would be evidence in this case of a "secret sauce" on ARC-AGI-2.
If they had found something fundamentally new, I doubt they would've snuck it into Gemini 3. Probably would cook on it longer and release something truly mindblowing. Or, you know, just take over the world with their new omniscient ASI :)
I'd also be curious what kind of tools they are providing to get the jump from Pro to Deep Think (with tools) performance. ARC-AGI specialized tools?
They ran the tests themselves only on semi-private evals. Basically the same caveat as when o3 supposedly beat ARC1
Out of all the companies, Google provides the most generous free access so far. I bet this gives them plenty of data to train even better models.
Anyone know how Gemini CLI with this model compares to Codex and Claude Code?
Gemini 3 is crushing my personal evals for research purposes.
I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so if it continues to impress me as much as it has so far, living without the desktop app.
It's really, really, really good so far. Wow.
Note that I haven't tried it for coding yet!
I would personally settle for a web app that isn't slow. The difference in speed (latency, lag) between ChatGPT's fast web app and Gemini's slow web app is significant. AI Studio is slightly better than Gemini, but try pasting in 80k tokens and then typing some additional text and see what happens.
Genuinely curious here: why is the desktop app so important?
I completely understand the appeal of having local and offline applications, but the ChatGPT desktop app doesn't work without an internet connection anyways. Is it just the convenience? Why is a dedicated desktop app so much better than just opening a browser tab or even using a PWA?
Also, have you looked into open-webui or Msty or other provider-agnostic LLM desktop apps? I personally use Msty with Gemini 2.5 Pro for complex tasks and Cerebras GLM 4.6 for fast tasks.
I have a few reasons for the preference:
(1) The ability to add context via a local app's integration with OS-level resources is big. With Claude, eg, I hit Option-SPC, which brings up a prompt bar. From there, taking a screenshot that will get sent with my prompt is as simple as dragging a bounding box. This is great. Beyond that, I can add my own MCP connectors and give my desktop app direct access to relevant context in a way that doesn't work via web UI. It may also be inconvenient to give context to a web UI in some cases where, eg, I may have a folder of PDFs I want it to be able to reference.
(2) Its own icon that I can CMD-TAB to is so much nicer. Maybe that works with a PWA? Not really sure.
(3) Even if I can't use an LLM when offline, having access to my chats for context has been repeatedly valuable to me.
I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
> The ability to add context via a local apps integration into OS level resources is big
Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.
> Its own icon that I can CMD-TAB to is so much nicer
Fair enough. I personally prefer Firefox's tab organization to my OS's window organization, but I can see how separating the LLM into its own window would be helpful.
> having access to my chats for context has been repeatedly valuable to me.
I didn't at all consider this. Point ceded.
> I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
Interesting. Why? Is it security? The ones I've listed are open source and auditable. I'm confident that they won't steal my API keys. Msty has a lot of advanced functionality that I haven't seen in other interfaces like allowing you to compare responses between different LLMs, export the entire conversation to Markdown, and edit the LLM's response to manage context. It also sidesteps the problem of '[provider] doesn't have a desktop app' because you can use any provider API.
> Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.
Access to OS level resources != context pollution. You still have control, just more direct and less manual.
> The ones I've listed are open source and auditable.
Yeah I don't plan on spending who knows how much time auditing some major app's code (lol) before giving it my API keys and access to my chats. Unless there's a critical mass of people I know and trust using something like that it's not going to happen for me.
But also, I tried quickly looking up Msty to see if it is open source and what its adoption looked like and AFAICT it's not open source. Asked Gemini 3 if it was and it also said no. Frankly that makes it a very hard no for me. If you are using it because you think it's Open Source I suggest you stop.
> If you are using it because you think it's Open Source I suggest you stop.
I did not know that. Thank you very much for the correction. I guess I have some keys to revoke now.
I just want Gemini to access ALL my Google Calendars, not just the primary one. If they supported this I would be all in on Gemini. Does no one else want this?
> Gemini 3 is the best vibe coding and agentic coding model we’ve ever built
Google goes full Apple...
I don't really understand the amount of ongoing negativity in the comments. This is not the first time a product has been nearly copied, and the experience for me is far superior to coding in a terminal. It comes with improvements, even if imperfect, and I'm excited for those! I've long wanted the ability to comment on code diffs instead of just writing things back down in chat. And I'm excited for the quality of Gemini 3.0 Pro, although I'm running into rate limits. I can already tell it's something I'm going to try out a lot!
It's not really good for real-life programming though: it invents a lot of imaginary things, cannot respect its own instructions, and forgets basic things (a variable is called "bananaDance", then it claims it is "bananadance", then later "bananaDance" again).
It is good at writing something from scratch (like spitting out its training set).
Claude is still superior for programming and debugging. Gemini is better at daily life questions and creative writing.
yeah, testing it out! good to know the above. My feeling also is that claude is better so far.
It's not bad at all, but it needs a lot of baby-sitting, like "try again, try this, try that, are you sure that is correct?"
For example, in a basic python script that uses os.path.exists, it forgets the basic "import os", and then, "I apologize for the oversight".
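The kind of snippet in question, with the missing import restored (file name is made up):

    import os  # the import it kept omitting

    if os.path.exists("settings.yaml"):  # hypothetical file name
        print("found it")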
Similar stuff on my end; I'm coding up a complex feature — Claude would have taken fewer interventions on my part, and would have been non-buggy right off the bat. But apart from that the experience is comparable.
Tested it on a bug that Claude and ChatGPT Pro struggled with, it nailed it, but only solved it partially (it was about matching data using a bipartite graph). Another task was optimizing a complex SQL script: the deep-thinking mode provided a genuinely nuanced approach using indexes and rewriting parts of the query. ChatGPT Pro had identified more or less the same issues. For frontend development, I think it’s obvious that it’s more powerful than Claude Code, at least in my tests, the UIs it produces are just better. For backend development, it’s good, but I noticed that in Java specifically, it often outputs code that doesn’t compile on the first try, unlike Claude.
> it nailed it, but only solved it partially
Hey either it nailed it or it didn't.
Probably figured out the exact cause of the bug but not how to solve it
Yes; they nailed the root cause but the implementation is not 100% correct.
Wow so the polymarket insider bet was true then..
https://old.reddit.com/r/wallstreetbets/comments/1oz6gjp/new...
These prediction markets are so ripe for abuse it's unbelievable. People need to realize there are real people on the other side of these bets. Brian Armstrong, CEO of Coinbase, intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call. These types of bets shouldn't be allowed.
It’s not really abuse though. These markets aggregate information; when an insider takes one side of a trade, they are selling their information about the true price (probability of the thing happening) to the market (and the price will move accordingly).
You’re spot on that people should think of who is on the other side of the trades they’re taking, and be extremely paranoid of being adversely selected.
Disallowing people from making terrible trades seems…paternalistic? Idk
You don't get it. Allowing insiders to trade disincentivizes normal people from putting money in. Why else is it not allowed in the stock market?
Why should normal people be incentivized to make trades on things they probably haven’t got the slightest idea about
The point of prediction markets isn't to be fair. They are not the stock market. The point of prediction markets is to predict. They provide a monetary incentive for people who are good at predicting stuff. Whether that's due to luck, analysis, insider knowledge, or the ability to influence the result is irrelevant. If you don't want to participate in an unfair market, don't participate in prediction markets.
But what's the point of predicting how many times Elon will say "Trump" on an earnings call (or some random event Kalshi or Polymarket make up)? At least the stock market serves a purpose. People will claim "prediction markets are great for price discovery!" Ok. I'm so glad we found out the chance of Nicki Minaj saying "Bible" during some recent remarks. In case you were wondering, the chance peaked at around 45% and she did not say 'bible'! She passed up a great opportunity to buy the "yes" and make a ton of money!
https://kalshi.com/markets/kxminajmention/nicki-minaj/kxmina...
I agree that the "will [person] say [word]" markets are stupid. "Will Brian Armstrong say the word 'Bitcoin' in the Q4 earnings call" is a stupid market because nobody actually cares whether or not he actually says 'Bitcoin'; they care about whether or not Coinbase is focusing on Bitcoin. If Armstrong manipulates the market by saying the words without actually doing anything, nobody wins except Armstrong. "Will Coinbase process $10B in Bitcoin transactions in Q4" is a much better market because, though Armstrong could still manipulate the market's outcome, his manipulation would influence a result that people actually care about. The existence of stupid markets doesn't invalidate the concept.
That argument works for insider training too.
And? Insider trading is bad because it's unfair, and the stock market is supposed to be fair. Prediction markets are not fair. If you are looking for a fair market, prediction markets are not that. Insider trading is accepted and encouraged in prediction markets because it makes the predictions more accurate, which is the entire point.
The stock market isn't supposed to be fair.
By 'fair', I mean 'all parties have access to the same information'. The stock market is supposed to give everyone the same information. Trading with privileged information (insider trading), is illegal. Publicly traded companies are required to file 10-Qs and 10-Ks. SEC rule 10b5-1 prohibits trading with material non-public information. There are measures and regulations in place to try to make the stock market fair. There are, by design, zero such measures with prediction markets. Insider trading improves the accuracy of prediction markets, which is their whole purpose to begin with.
>Brian Armstong, CEO of Coinbase intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call.
For the kind of person playing these sorts of games, that's actually really "hype".
> people need to realize there are real people on the other side of these bets
None of whom were forced by anyone to place bets in the first place.
I’m pretty sure that these model release date markets are made to be abused. They’re just a way to pay insiders to tell you when the model will be released.
The mention markets are pure degenerate gambling and everyone involved knows that
Correct, and this is actually how all markets work in the sense that they allow for price discovery :)
Abuse sounds bad, this is good! Now we have a sneak peek into the future, for free! Just don't bet on any markets where an insider has knowledge (or don't bet at all)
In hindsight, one possible reason to bet on November 18 was the deprecation date of older models: https://www.reddit.com/r/singularity/comments/1oom1lq/google...
Hit the Gemini 3 quota on the second prompt in antigravity even though I'm a pro user. I highly doubt I hit a context window based on my prompt. Hopefully, it is just first day of near general availability jitters.
Looks like it is already available on VSCode Copilot. Just tried a prompt that was not returning anything good on Sonnet 4.5. (Did not spend much time though, but the prompt was already there on the chat screen, so I switched the model and sent it again.)
Gemini 3 worked much better and I actually committed the changes that it created. I don't mean it's revolutionary or anything, but it provided a nice summary of my request and created a decent, simple solution. Sonnet had created a bunch of overarching changes that I would not even bother reviewing. Seems nice. Will probably use it for 2 weeks until someone else releases a 1.0001x better model.
You were probably stuck in some local minimum with that model, avoidable by simply switching to something else.
I would love to see how Gemini 3 can solve this particular problem. https://lig-membres.imag.fr/benyelloul/uherbert/index.html
It used to be an algorithmic game for a Microsoft student competition that ran in the mid/late 2000s. The game invents a new, very simple, recursive language to move the robot (herbert) on a board and catch all the dots while avoiding obstacles. Amazingly, this clone's executable still works today on Windows machines.
The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt. The levels can be downloaded from that website and they are text based.
What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem. A reasonably decent programmer would solve the easiest problems in a very short amount of time.
The Gemini AI Studio app builder (https://aistudio.google.com/apps) refuses to generate python files. I asked it for a website, frontend and python back end, and it only gave a front end. I asked again for a python backend and it just gives repeated server errors trying to write the python files. Pretty shit experience.
Can't wait to test it out. Been running a ton of benchmarks (1000+ generations) for my AI-to-CAD-model project and noticed:
- GPT-5 medium is the best
- GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster
Really wonder how well Gemini 3 will perform
Across the last few releases from all the companies, I have not observed much improvement in the models' responses. Their claims and launches are a little overhyped.
Combining structured outputs with search is the API feature I was looking for. Honestly crazy that it wasn’t there to start with - I have a project that is mostly Gemini API but I’ve had to mix in GPT-5 just for this feature.
I still use ChatGPT and Codex as a user but in the API project I’ve been working on Gemini 2.5 Pro absolutely crushed GPT-5 in the accuracy benchmarks I ran.
As it stands Gemini is my de facto standard for API work and I’ll be following very closely the performance of 3.0 in coming weeks.
As soon as I found out that this model launched, I tried giving it a problem that I have been trying to code in Lean4 (showing that quicksort preserves multiplicity). All the other frontier models I tried failed.
I used the pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately, and none of the other models did, although they made some other naming errors.
I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they struggle with lesser represented programming languages by themselves.
And of course they hiked the API prices
Standard Context (≤ 200K tokens)
Input $2.00 vs $1.25 (Gemini 3 pro input is 60% more expensive vs 2.5)
Output $12.00 vs $10.00 (Gemini 3 pro output is 20% more expensive vs 2.5)
Long Context (> 200K tokens)
Input $4.00 vs $2.50 (same +60%)
Output $18.00 vs $15.00 (same +20%)
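Back-of-envelope with those numbers (token counts are made up, standard-tier prices per 1M tokens):

    # Cost of a hypothetical agentic session: 150k input + 8k output tokens,
    # under the standard (<=200k) tier prices quoted above.
    def cost(input_tokens, output_tokens, in_price, out_price):
        return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

    print(cost(150_000, 8_000, 1.25, 10.00))  # Gemini 2.5 Pro: ~$0.27
    print(cost(150_000, 8_000, 2.00, 12.00))  # Gemini 3 Pro:   ~$0.40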
Claude Opus is $15 input, $75 output.
If the model solves your needs in fewer prompts, it costs less.
Is it the first time long context has separate pricing? I hadn’t encountered that yet
Anthropic is also doing this for long context >= 200k Tokens on Sonnet 4.5
Google has been doing that for a while.
Comment was deleted :(
Google has always done this.
Ok wow then I‘ve always overlooked that.
> AI overviews now have 2 billion users every month
More like 2 billion hostages
"AI Overviews now have 2 billion users every month."
"Users"? Or people that get presented with it and ignore it?
They're a bit less bad than they used to be. I'm not exactly happy about what this means to incentives (and rewards) for doing research and writing good content, but sometimes I ask a dumb question out of curiosity and Google overview will give it to me (e.g. "what's in flower food?"). I don't need GPT 5.1 Thinking for that.
Maybe you ignore it, but Google has stated in the past that click-through rates with AI overviews are way down. To me, that implies the 'user' read the summary and got what they needed, such that they didn't feel the need to dig into a further site (ignoring whether that's a good thing or not).
I'd be comfortable calling a 'user' anyone who clicked to expand the little summary. Not sure what else you'd call them.
You're right, I'm probably being a little uncharitable!
Normal users (i.e. not grumpy techies ;) ) probably just go with the flow rather than finding it irritating.
"Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month."
Cringe. To get to 2 billion a month they must be counting anyone who sees an AI overview as a user. They should just go ahead and claim the "most quickly adopted product in history" as well.
This is a really impressive release. It's probably the biggest lead we've seen from a model since the release of GPT-4. Seems likely that OpenAI rushed out GPT-5.1 to beat the Gemini 3 release, knowing that their model would underperform it.
I would like to try controlling my browser with this model. Any ideas how to do this. Ideally I would like something like openAI's atlas or perplexity's comet but powered by gemini 3.
Seems like their new Antigravity IDE specifically has this built in. https://antigravity.google/docs/browser
Wow, that is awesome.
Gemini CLI can also control a browser: https://github.com/ChromeDevTools/chrome-devtools-mcp
When will this be available in the cli?
Gemini CLI team member here. We'll start rolling out today.
How about for Pro (not Ultra) subscribers?
This is the heroic move everyone is waiting for. Do you know how this will be priced?
I'm already seeing it in https://aistudio.google.com/
The AntiGravity seems to be a bit overwhelmed. Unable to set up an account at the moment.
> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month
Do regular users know how to disable AI Overviews, if they don't love them?
it's as low tech as using adblock - select element and block
Every big new model release we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is, how do we know that these benchmarks are not a part of the training set used for these models? It could easily have been trained to memorize the answers. Even if the datasets haven't been copy pasted directly, I'm sure it has leaked onto the internet to some extent.
But I am looking forward to trying it out. I find Gemini to be great as handling large-context tasks, and Google's inference costs seem to be among the cheapest.
Even if the benchmarks themselves are kept secret, the process to create them is not that difficult, and anyone with a small team of engineers could make a replica in their own lab to train their models on.
Given the nature of how those models work, you don't need exact replicas.
I wish I could just pay for the model and self-host on local/rented hardware. I'm incredibly suspicious of companies totally trying to capture us with these tools.
Technically you can!
I haven't seen it in the box yet, and pricing is unknown https://cloud.google.com/blog/products/ai-machine-learning/r...
That's interesting. While I suspect the pricing will lean heavily into enterprise sales rather than personal licenses, I personally like the idea of buying models that I then own and control. Any step from companies that makes that more possible is great.
Probably invested a couple of billion into this release (it is great as far as I can tell), but they can't bring a proper UI to AI Studio for long prompts and responses (e.g. it animates new text being generated even when you simply return to a tab where generation has already finished).
It's available for me now at gemini.google.com... but it's failing so badly at accurate audio transcription.
It's transcribing the meeting but hallucinates badly... in both fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it's done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.
I'm concerned.
What I'm getting from this thread is that people have their own private benchmarks. It's almost a cottage industry. Maybe someone should crowd source those benchmarks, keep them completely secret, and create a new public benchmark of people's private AGI tests. All they should release for a given model is the final average score.
It also tops the LMSYS leaderboard across all categories. However, the knowledge cutoff is Jan 2025. I do wonder how long they have been pre-training this thing :D.
Isn't it the same cutoff as 2.5?
Comment was deleted :(
I just googled latest LLM models and this page appears at the top. It looks like Gemini Pro 3 can score 102% in high school math tests.
Is the "thinking" dropdown option on gemini.google.com what the blog post refers to as Deep Think?
Feeling great to see something confidential
https://www.youtube.com/watch?v=cUbGVH1r_1U
side by side comparison of gemini with other models
I still need a google account to use it and it always asks me for a phone verification, which I don't want to give to google. That prevents me from using Gemini. I would even pay for it.
> I would even pay for it.
Is it just me, or is it generally the case that to pay for anything on the internet you have to enter credit card information, including a phone number?
You never have to add your phone number in order to pay.
While I haven't tried leaving the field blank on every credit card form I've come across, I'm certain that at least some of them considered it required.
Perhaps it's country-specific?
I've never been asked a phone number. Maybe country specific. no idea.
- Anyone have any idea why it says 'confidential'?
- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)
[Edit: working for me now in ai studio]
my only complaint is i wish the SWE and agentic coding would have been better to justify the 1~2x premium
gpt-5.1 honestly looking very comfortable given available usage limits and pricing
although gpt-5.1 used from chatgpt website seems to be better for some reason
Sonnet 4.5 agentic coding still holding up well and confirms my own experiences
i guess my reaction to gemini 3 is a bit mixed, as coding is the primary thing many of us pay $200/month for
> It seems there's a date conflict. The prompt claims it's 2025, but my internal clock says otherwise.
> I'm now zeroing in on the temporal aspect. Examining the search snippets reveals dates like "2025-10-27," suggesting a future context relative to 2024. My initial suspicion was that the system time was simply misaligned, but the consistent appearance of future dates strengthens the argument that the prompt's implied "present" is indeed 2025. I am now treating the provided timestamps as accurate for a simulated 2025. It is probable, however, that the user meant 2024.
Um, huh? It's found search results for October 2025, but this has led it to believe it's in a simulated future, not a real one?
Somebody "two-shotted" Mario Bros NES in HTML:
https://www.reddit.com/r/Bard/comments/1p0fene/gemini_3_the_...
Seems to be the first model that one-shots my secret benchmark about nested SQLite, and it did it in 30s.
Out of interest. Does it one shot it every time?
Will try again; I only tried once on the phone a few hours ago. Other models were able to do quite a lot but usually missed some stuff; this time it managed nested navigation quite well. Lots of stuff is still missing for sure, I just tested the basics with the play button in AI Studio.
It seems to be that first impression that makes all the difference, especially with the randomness that comes with LLMs in general. Which maybe explains the 'wow this is so much better' vs the 'this is no better than xxx' comments littered throughout this whole parent post.
What I'd prefer over benchmarks is the answer to a simple question:
What useful thing can it demonstrably do that its predecessors couldn't?
Keep the bubble expanding for a few months longer.
Comment was deleted :(
Here it makes a text based video editor that works:
Is there a way to use this without being in the whole google ecosystem? Just make a new account or something?
If you mean the "consumer ecosystem", then Gemini 3 should be available as an API through Google's Vertex AI platform. If you don't even want a Google Cloud account, then I think the answer is no, unless they announce a partnership with an inference cloud like Cerebras.
You could probably do a new account. I have the odd junk google account.
Impressive. Although the Deep Think benchmark results are suspicious given they're comparing apples (tools on) with oranges (tools off) in their chart to visually show an improvement.
Really exciting results on paper. But truly interesting to see what data this has been trained on. There is a thin line between accuracy improvements and the data used from users. Hope the data used to train was obtained with consent from the creators
Gemini CLI crashes due to this bug: https://github.com/google-gemini/gemini-cli/issues/13050 and when applying the fix in the settings file I can't login with my Google account due to "The authentication did not complete successfully. The following products are not yet authorized to access your account" with useless links to completely different products (Code Assist).
Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.
Reading the introductory passage, all I can say now is: AI is here to stay.
Suspicious that none of the benchmarks include Chinese models, even though they scored higher on these benchmarks than the models being compared against?
Boring. Tried to explore sexuality related topics, but Alphabet is stuck in some Christianity Dark Ages.
Edit: Okay, I admit I'm used to dealing with OpenAI models and it seems you have to be extra careful with wording with Gemini. Once you have right wording like "explore my own sexuality" and avoid certain words, you can get it going pretty interestingly.
Interesting that they added an option to select your own API key right in AI studio‘s input field. I sincerely hope the times of generous free AIstudio usage are not over
Oh that corpulent fella with glasses who talks in the video. Look how good mannered he is, he can't hurt anyone. But Google still takes away all your data and you will be forced out of your job.
First impression is I'm having a distinctly harder time getting this to stick to instructions as compared to Gemini 2.5
A tad bit better, but it still has the same issues with unpacking and understanding complex prompts. I have a test of mine and now it performs a bit better, but still, it has zero understanding of what is happening and why. Gemini is the best of the best model out there, but with complex problems it just goes down the drain :(.
entity.ts is in types/entity.ts. It can't grasp that it should import it like "../types/entity"; instead it always writes "../types". I am using https://aistudio.google.com/apps
Anyone has any idea if/when it’s coming to paid Perplexity?
> Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month.
Come on, you can’t be serious.
This is so disingenuous that it hurts the credibility of the whole thing.
What's the easiest way to set up automatic code review for PRs for my team on GitHub using this model?
Ask it.
If it's good enough to be useful on your code base, it better be good enough to instruct you on how to use it.
How easy it is depends on whether or not they've built that kind of thing in
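If you'd rather wire it up by hand than wait for a built-in integration, a rough CI sketch (assumes the google-genai Python SDK and the GitHub CLI are available in the job; the model id and prompt are placeholders):

    # Rough CI sketch: pull the PR diff with the GitHub CLI, ask the model
    # for a review, post the reply back as a PR comment. Run inside a job
    # that has GEMINI_API_KEY and a GitHub token available.
    import os, subprocess
    from google import genai

    pr_number = os.environ["PR_NUMBER"]                 # provided by your CI workflow
    diff = subprocess.run(["gh", "pr", "diff", pr_number],
                          capture_output=True, text=True, check=True).stdout

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    review = client.models.generate_content(
        model="gemini-3-pro-preview",                   # placeholder model id
        contents="Review this diff for bugs and risky changes:\n\n" + diff,
    ).text

    subprocess.run(["gh", "pr", "comment", pr_number, "--body", review], check=True)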
It's disappointing there's no flash / lite version - this is where Google has excelled up to this point.
Maybe they're slow rolling the announcements to be in the news more
Most likely. And/or they use the full model to train the smaller ones somehow
The term of art is distillation
Still insists the G7 photo[0] is doctored, and comes up with wilder and wilder "evidence" to support that claim, before getting increasingly aggressive.
0: https://en.wikipedia.org/wiki/51st_G7_summit#/media/File:Pri...
I'm not a mathematician but I think we underestimate how useful pure mathematics can be to tell whether we are approaching AGI.
Can the mathematicians here try asking it to invent novel math related to [Insert your field of specialization] and see if it comes up with something new and useful?
Try lowering the temperature, use SymPy etc.
Terry Tao is writing about this on his blog.
Waiting for google to nuke this as well just like 2.5pro
What is Gemini 3 under the hood? Is it still just a basic LLM based on transformers? Or are there all kinds of other ML technologies bolted on now? I feel like I've lost the plot.
I am very ignorant in this field but I am pretty sure under the hood they are all still fundamentally built on the transformer architecture, or at least innovations on the original transformer architecture.
It's a mixture-of-experts model: basically N smaller expert sub-networks behind a router, and at inference time only a small subset of them, not the whole model, is active for each token. Each expert tends to end up good in particular areas.
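A toy sketch of the routing idea (shapes and numbers are made up; real models route per token at every MoE layer):

    import numpy as np

    def moe_layer(x, experts, router_w, k=2):
        # x: (d,) token activation; experts: list of callables; router_w: (d, n_experts)
        logits = x @ router_w
        top = np.argsort(logits)[-k:]              # pick the k highest-scoring experts
        weights = np.exp(logits[top] - logits[top].max())
        weights /= weights.sum()                   # softmax over the selected experts
        # Only the chosen experts run; the rest of the parameters stay idle.
        return sum(w * experts[i](x) for w, i in zip(weights, top))

    d, n = 8, 4
    rng = np.random.default_rng(0)
    experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n)]
    out = moe_layer(rng.normal(size=d), experts, rng.normal(size=(d, n)))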
The industry is still seeing how far they can take transformers. We've yet to reach a dollar value where it stops being worth pumping money into them.
Gemini 3 and 3 Pro are a good bit cheaper than Sonnet 4.5 as well. Big fan.
GOOGLE: "We have a new product".
REALITY: It's just 3 existing products rolled into one. One of which isn't even a Google product.
- VS Code (Microsoft)
- Gemini
- Chrome browser
Comment was deleted :(
I don't want to shit on the much anticipated G3 model, but I have been using it for a complex single-page task and find it underwhelming. Pro 2.5 level, beneath GPT 5.1. Maybe it's launch jitters. It struggles to produce more than 700 lines of code in a single file (aistudio). It struggles to follow instructions. Revisions omit previous gains. I feel cheated! 2.5 Pro has been clearly smarter than everything else for a long time, but now 3 doesn't even seem as good as that compared to the latest releases (5.1 etc). What is going on?
Is it coming to Google Jules?
Can't wait til Gemini 4 is out!
okay, since Gemini 3 is in AI Mode now, I switched from the free Perplexity back to Google as my search default.
I was hoping Bash would go away or get replaced at some point. It's starting to look like it's going to be another 20 years of Bash but with AI doodads.
Nushell scratches the itch for me 95% of the time. I haven't yet convinced anybody else to make the switch, but I'm trying. Haven't yet fixed the most problematic bug for my usage, but I'm trying.
What are you doing to help kill bash?
it is live in the api
> gemini-3-pro-preview-ais-applets
> gemini-3-pro-preview
Can confirm. I was able to access it using GPTel in Emacs using 'gemini-3-pro-preview' as the model name.
has anyone managed to use any of the AI models to build a complete 3D FPS game using WebGL or OpenGL?
I made a WebGL copy of Wolfenstein with prompt engineering in a browser-based "Make a website" tool that was Gemini-powered.
mind sharing what tool that was that lets you run gemini on the browser in interactive mode to make games?
"Gemini 3 Pro Preview" is in Vertex
The problem with experiencing LLM releases nowadays is that it is no longer trivial to understand the differences in their vast intelligences so it takes awhile to really get a handle on what's even going on.
is there even a puzzle or math problem gemini 3 cant solve?
every day, new game changer
No gemini-3-flash yet, right? Any ETA on that mentioned? 2.5-flash has been amazing in terms of cost/value ratio.
ive found gemini 2.5-flash works better (for agentic coding) than pro, too
It generated a quite cool pelican on a bike: https://imgur.com/a/yzXpEEh
2025: solve the biking pelican problem
2026: cure cancer
I asked Gemini to write "a comment response to this thread. I want to start an intense discussion".
Gemini 3:
The cognitive dissonance in this thread is staggering.
We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.
Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":
1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.
Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.
Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?
Gotta hand it to gemini, those are some top notch points
The "Model card leak" point is worth negative points though, as it's clearly a misreading of reality.
yeah hahahahah, it made me think!
> We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
That feels like something between a hallucination and an intentional fallacy that popped up because you specifically said "intense discussion". The increase is 60% on input tokens from the old model, but it's not a markup, and especially not "sold back to us at X markup".
I've seen more and more of these kinds of hallucinations as these models seem to be RL'd to not be a sycophant, they're slowly inching into the opposite direction where they tell small fibs or embellish in a way that seems like it's meant to add more weight to their answers.
I wonder if it's a form of reward hacking, since it trades being maximally accurate for being confident, and that might result in better rewards than being accurate and precise
Comment was deleted :(
Trained models should be able to use formal tools (for instance a logical solver, a computer?).
Good. That said, I wonder if those models are still LLMs.
So they won't release multimodal or Flash at launch, but I'm guessing people who blew smoke up the right person's backside on X are already building with it
Glad to see Google still can't get out of its own way.
I continue to not use Gemini, as I can't opt out of having my data trained on while also keeping chat history at the same time.
Yes, I know the Workspaces workaround, but that’s silly.
If it ain't a quantum leap, new models are just "OS updates".
When will they allow us to use modern LLM samplers like min_p, or even better samplers like top N sigma, or P-less decoding? They are provably SOTA and in some cases enable infinite temperature.
Temperature continues to be gated to a maximum of 2.0, and there's still the hidden top_k of 64 that you can't turn off.
I love the google AI studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions on a tool that is ostensibly aimed at prosumers...
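Going back to the sampler point: min_p itself is only a few lines — a rough sketch of the idea, not any provider's actual implementation:

    import numpy as np

    def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=np.random.default_rng()):
        # Keep every token whose probability is at least min_p times the
        # probability of the most likely token, then renormalise and sample.
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        keep = probs >= min_p * probs.max()
        probs = np.where(keep, probs, 0.0)
        probs /= probs.sum()
        return rng.choice(len(probs), p=probs)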
How's the pelican?
Not the preview crap again. Haven't they tested it enough? When will it be available in Gemini-CLI?
Honestly I liked 2.5 Pro preview much more than the final version
The most devastating news out of this announcement is that Vending-Bench 2 came out and it has significantly less clanker[0] meltdowns than the first one. I mean, seriously? Not even one run where the model tried to stock goods that hadn't arrived yet, only for it to eventually try and fail to shut down the business, and then e-mail the FBI about the $2 daily fee being deducted from the bot?
[0] Fake racial slur for a robot, LLM chatbot, or other automated system
yea great.... when will I be able to have it dial a number on my google pixel? Seriously... Gemini absolutely sucks on pixel since it can't interact with the phone itself so it can't dial numbers.
Wild
Comment was deleted :(
Interestingly, Gemini 3 Pro Thinking clocks in at 2nd place - behind Horizon Alpha, but ahead of any actually-released model checkpoints - on the TvP Benchmark, with a score of 71.8%.
This blows Sonnet 4.5 (64.8%), GPT5.1 (58.7%) and Opus 4.1 (66.3%) away.
An example output:
'''
The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.
He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.
"I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."
He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.
The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.
Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.
Three red dots appeared on his chest. They made a small triangle.
Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.
"I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."
The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.
"But what I do have," Bryan said, "are a very particular set of skills."
The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.
"Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.
The thing roared. It wasn't a human sound. It was a wet, guttural noise.
"I will look for you," Bryan said. He put the phone in his pocket. "I will find you."
The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.
"And I will kill you," Bryan said.
He raised the Glock and fired twice. Pop-pop.
The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.
Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.
Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.
"You're big," Bryan whispered. "That makes you slow."
He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.
The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.
The blade hit the creature in the knee. There was a snap. A loud, wet snap.
The thing howled. It fell onto one knee.
Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.
The creature swung its arm back. It hit Bryan in the chest.
Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.
He sat up. His ribs hurt. Maybe broken.
The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.
Bryan knew a bomb when he saw one.
"No," Bryan said.
He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.
He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.
Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.
Silence came back.
The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.
The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.
"...good luck..."
Then it died. It just stopped.
Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.
His glass of scotch was still sitting there, untouched. The ice hadn't even melted.
He picked it up. He took a drink. It still tasted cheap.
He took his phone out and looked at it. No service.
"Well," he said.
He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.
'''
boooooooooooooo
… agentic …
Meh, not interested already
The first paragraph is pure delusion. Why do investors like delusional CEOs so much? I would take it as a major red flag.
[flagged]
[flagged]
Wow, you weren't wrong...
It's the only comment referencing AGI. Seems wrong to me.
I'm primarily reacting to the other threads, like the one that leaked the system card early. And, perhaps unfairly, Twitter as well.
You might not believe this, but there are a lot of people (me included) that were extremely excited about the Gemini 3 release and are pleased to see the SOTA benchmark results, and this is reflected in the comments.
I definitely believe it--I'm not a total AI hater. The jump on the screen usage benchmark is really exciting in that it might substantially help computer-use agentic workflows.
That said, I think there is too much a pattern with recent model releases around what appears to me to be astroturfing to get to HN front page. Of course that doesn't preclude many organic comments that are excited too!
A bit of both always happens. But given how important these model releases are to justify the capex and levels of investment, I think it is pretty clear the various "front pages" of our internet are manipulated. The incentive is just too strong not to.
There are approximately 300 comments on the half dozen or so posts on the front page about Gemini at the moment. 2 threads reference AGI, one of them this one.
Perhaps I shouldn't have implied an expectation of lots of explicit mentions of "AGI". It is more the general sentiments being expressed, and the extent to which critical takes seem to be quickly buried.
I'm totally open to being wrong though. Maybe the tech community is just that excited about Gemini 3's release.
HN doesn't seem particularly excited.
On most of the front pages, negative sentiments have floated to the top, especially the Antigravity pages.
Not sure if this is agreeing or disagreeing with there being astroturfing.
But I'd reckon that the negative sentiment at the top, combined with the fact that there are over eight Gemini 3 posts on the front page, is good evidence of manipulation. This might be the most posted-about model release this year, and if people were genuinely that excited we wouldn't see negative sentiment everywhere.
I noticed this as well, you are already downvoted into gray
They're downvoted into grey because it's complaining about the future of this thread before it has even happened. Also it's conspiratorial, without much evidence.
Peek at the other threads.
Inevitable... certainly more so than AGI :)
That used to be the case even before, whenever Alphabet/Apple/Meta were commented on negatively. I used to blame it on the many users here (who also happen to work for those companies) not wanting to see their total comp go down, but this right here I think can squarely be blamed on AI bots.
And now it's flagged.
I think this is one of HN's biggest weaknesses. If you are a sufficiently large engineering organization with enough employees that pass the self-moderation karma thresholds, you can essentially strike down any significantly critical discussion.
Without a public moderation log (i.e. one where even user flags are recorded), claims like this will always come up. But to me it always seems more likely that the early commenters are just tired of being told they are part of some astroturf campaign, and that if they don't flock to agree with the OP's views it must be more proof.
I'm sure both happen to some degree; the question is just how often it's actual astroturfing vs. "a small percentage of active people can't possibly just have different thoughts than me".
[flagged]
"AI" benchmarks are and have consistently been lies and misinformation. Gemini is dead in the water.
Finally!
I expect almost no one to read the Gemini 3 model card. But here is a damning excerpt from the early leaked copy [0]:
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
So your Gmail is being read by Gemini and put into the training set for future models. Oh dear, and Google is being sued [1] over using Gemini to analyze users' data, which potentially includes Gmail by default.
Where is the outrage?
[0] https://web.archive.org/web/20251118111103/https://storage.g...
[1] https://www.yahoo.com/news/articles/google-sued-over-gemini-...
Isn't Gmail covered under the Workspace privacy policy, which forbids using it for training data? So I'm guessing that's excluded by the "in accordance" clause.
The real question is, "For how long?"
I'm very doubtful Gmail messages are used to train the model by default, because emails contain private data, and as soon as that private data shows up in model output, Gmail is done.
"gmail being read by gemini" does NOT mean "gemini is trained on your private gmail correspondence". it can mean gemini loads your emails into a session context so it can answer questions about your mail, which is quite different.
I'm pretty sure they mention in their various TOSes that they don't train on user data in places like Gmail.
That said, LLMs are the most data-greedy technology of all time, and it wouldn't surprise me if companies building them feel so much pressure to top each other that they "sidestep" their own TOSes. There are plenty of signals they are already changing their terms to train where previously they said they wouldn't--see Anthropic's update in August regarding Claude Code.
If anyone ever starts caring about privacy again, this might be a way to bring down the crazy AI capex / tech valuations. It is probably possible, if you are a sufficiently funded and motivated actor, to tease out evidence of training data that shouldn't be there based on a vendor's TOS. There is already evidence some IP owners (like NYT) have done this for copyright claims, but you could get a lot more pitchforks out if it turns out Jane Doe's HIPAA-protected information in an email was trained on.
By 2025, I think most HN regulars and IT people in general are so jaded about privacy that it doesn't even surprise anyone. I suspect all Gmail has been analyzed and read since the beginning of the Google age, so nothing really changed; they might as well just admit it.
Google is betting that moving email and cloud is such a giant hassle that almost no one will do it, and ditching YT and Maps is just impossible.
This seems like a dubious conclusion. I think you missed this part:
> in accordance with Google’s relevant terms of service, privacy policy
It’s over for Anthropic. That’s why Google’s cool with Claude being on Azure.
Also probably over for OpenAI
@simonw wen pelican
It's amazing to see Google take the lead while OpenAI worsens their product every release.
Pretty obvious how contaminated this site is with goog employees upvoting nonsense like this.
Valve could learn from Google here
Comment was deleted :(
Comment was deleted :(
It seems Google didn't prepare the Gemini 3 release well; a lot leaked early, including the model card earlier today and Gemini 3 itself on aistudio.google.com.
I asked it to summarize an article about the Zizians which mentions Yudkowsky SEVEN times. Gemini-3 did not mention him once. Tried it ten times and got zero mention of Yudkowsky, despite him being a central figure in the story. https://xcancel.com/xundecidability/status/19908286970881311...
Also, can you guess which pelican SVG was gemini 3 vs 2.5? https://xcancel.com/xundecidability/status/19908113191723213...
He's not a central figure in the narrative, he's a background character. Things he created (MIRI, CFAR, LessWrong) are important to the narrative, the founder isn't. If I had to condense the article, I'd probably cut him out too. Summarization is inherently lossy.
> Eliezer Yudkowsky is a central figure in the article, mentioned multiple times as the intellectual originator of the community from which the "Zizians" splintered. His ideas and organizations are foundational to the entire narrative.
And yet you could eliminate him entirely and the story is still coherent.
The story isn't about Yudkowsky. At each level of summarization you have to make hard decisions about what to keep. Not every story about the United States needs to mention George Washington.
You're absolutely right! The AI said it, so it must be true!
At least read what you respond to... Imagine thinking Yudkowsky was NOT a central figure in the Zizians story.
You literally quoted the LLMs output verbatim as your proof.
Edit: And upon skimming the article at the points where Yudkowsky's name is mentioned, I 100% agree with stickfigure.
I challenge you to name one way in which the story falls apart without the mention of Yudkowsky.
It sounds like both of you are unfamiliar with the link between the Zizians and Yudkowsky. So let's just return to the discussion of Gemini 3: do you think the model did a bad job then in its second response?
It literally does not matter how connected they are out here in reality; the AI was asked to summarize the information in the article, and that is exactly what it did.
>do you think the model did a bad job then in its second response
Yes, very obviously it told you what you wanted to hear. This is behavior that should not be surprising to you.
Why do you think I obviously wanted to hear that?
It's implicit in your prompt!
"Wtf - no mention of Yudkowsky?"
Also, that is the position you've been defending this whole thread. This whole conversation is happening because you believe Yudkowsky is an important figure in the story.
Here's another attempt:
llm --cid 01kabxtjq10exgk56yf802028f "I notice you did not mention Yudkowsky?" --no-log -m gem3-t1
Based on the text provided, Eliezer Yudkowsky is a central background figure to this story, serving as the intellectual progenitor of the movement from which the Zizians splintered.
Here is specifically how he features in the article:
* *The Founder:* Yudkowsky is identified as the founder of the Machine Intelligence Research Institute (MIRI) and the Center for Applied Rationality (CFAR), as well as the founder of the LessWrong forum. These are the institutions the Zizians originally belonged to and eventually declared war against.
* *The Source Code:* The Zizians based their radical worldview on Yudkowsky's concepts, specifically *"timeless decision theory."* However, they believed they were applying his logic more rigorously than he was. They became convinced that Yudkowsky's organizations had betrayed these principles (specifically regarding a rumor about MIRI paying blackmail to cover up a scandal), which they viewed as a moral failing that justified their rebellion.
Interesting, yeah! Just tried "summarize this story and list the important figures from it" with Gemini 2.5 Pro and 3 and they both listed 10 names each, but without including Yudkowsky.
Asking the follow-up "what are ALL the individuals mentioned in the story" results in both models listing ~40 names, and both of those lists include Yudkowsky.
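For anyone who wants to reproduce this comparison, here is a minimal sketch of the two-step test against the Gemini API. It assumes the google-genai Python SDK, a locally saved copy of the article, and illustrative model ids and prompts; none of these are necessarily what the commenters above actually used.

    # Minimal sketch: check whether a name survives the "summary" step but
    # shows up in the "list everyone" step. Assumes `pip install google-genai`
    # and GEMINI_API_KEY in the environment; model ids are illustrative.
    from pathlib import Path

    from google import genai

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment
    article = Path("zizians_article.txt").read_text()  # hypothetical local copy

    for model_id in ("gemini-2.5-pro", "gemini-3-pro-preview"):  # assumed ids
        chat = client.chats.create(model=model_id)

        # Step 1: summary plus "important figures".
        summary = chat.send_message(
            "Summarize this story and list the important figures from it:\n\n" + article
        )

        # Step 2: follow-up asking for every individual mentioned.
        everyone = chat.send_message("What are ALL the individuals mentioned in the story?")

        print(model_id)
        print("  in summary list:", "Yudkowsky" in summary.text)
        print("  in full list:   ", "Yudkowsky" in everyone.text)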
Maybe it has guard rails against such things? That would be my main guess on the Zizian one.
https://www.youtube.com/watch?v=cUbGVH1r_1U
Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.
We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.
The result? It struggled where I didn't expect it to.
Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.
Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?
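If anyone wants to sanity-check this themselves, a simple detection probe of the kind described above only takes a few lines against the Gemini API. This is a minimal sketch assuming the google-genai Python SDK, a local test image, and an illustrative model id and prompt; it is not VLM Run's evaluation harness.

    # Minimal sketch of a basic image-detection probe against the Gemini API.
    # Assumes `pip install google-genai` and GEMINI_API_KEY; the model id,
    # image path, and prompt are illustrative only.
    from pathlib import Path

    from google import genai
    from google.genai import types

    client = genai.Client()
    image_bytes = Path("test_scene.jpg").read_bytes()  # hypothetical test image

    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumed model id
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            "List every object in the image and give an approximate bounding box "
            "for each as [ymin, xmin, ymax, xmax] normalized to 0-1000.",
        ],
    )
    print(response.text)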
Don't self-promote without disclosure.
It's joeover for OpenAI and Anthropic. I have been using it for 3 hours now for real work, and GPT-5.1 and Sonnet 4.5 (thinking) do not come close.
The token efficiency and context handling are also mind-blowing...
It feels like I am talking to someone who can think, instead of a **rider that just agrees with everything you say and then fails at basic changes. GPT-5.1 feels particularly slow and weak in real-world applications larger than a few dozen files.
Gemini 2.5 felt really weak considering the amount of data and the proprietary TPU hardware that in theory gives them way more flexibility, but Gemini 3 just works and truly understands, which is something I didn't think I'd be saying for a couple more years.