by xnx on 10/7/2025, 8:28:41 PM
I've had good success with the Chrome devtools MCP (https://github.com/ChromeDevTools/chrome-devtools-mcp) for browser automation with Gemini CLI, so I'm guessing this model will work even better.
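For anyone wanting to reproduce the setup: registering the server with Gemini CLI is just an mcpServers entry in its settings file. A sketch of mine below; check the repo README for the current invocation, since the package moves fast.

    {
      "mcpServers": {
        "chrome-devtools": {
          "command": "npx",
          "args": ["chrome-devtools-mcp@latest"]
        }
      }
    }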
by phamilton on 10/7/2025, 8:29:36 PM
It successfully got through the captcha at https://www.google.com/recaptcha/api2/demo
by mohsen1 on 10/7/2025, 11:07:22 PM
> Solve today's Wordle
It gets stuck with:
> ...the task is just to "solve today's Wordle", and as a web browsing robot, I cannot actually see the colors of the letters after a guess to make subsequent guesses. I can enter a word, but I cannot interpret the feedback (green, yellow, gray letters) to solve the puzzle.
by krawcu on 10/8/2025, 5:52:40 AM
I wonder how it would behave in a scenario where it has to download a file from a shady website covered in advertisements with fake "download" buttons.
by jcims on 10/7/2025, 9:31:33 PM
(Just using the browserbase demo)
Knowing it's technically possible is one thing, but giving it a short command and seeing it go log in to a site, scroll around, reply to posts, etc. is eerie.
Also, it tied me at Wordle today, making the same mistake I did on the second-to-last guess. Too bad you can't talk to it while it's working.
by albert_e on 10/8/2025, 2:50:03 AM
I believe it will need very capable but small VLMs that understand common user interfaces very well -- small enough to run locally -- paired with higher-level models in the cloud, to achieve human-speed interactions and beyond with reliability.
by derekcheng08 on 10/8/2025, 2:40:09 AM
Really feels like computer use models may be vertical agent killers once they get good enough. Many knowledge work domains boil down to: use a web app, send an email. (e.g. recruiting, sales outreach)
by dekhn on 10/7/2025, 9:49:12 PM
Many years ago I was sitting at a red light on a secondary road, where the primary cross road was idle. It seemed like you could solve this using a computer vision camera system that watched the primary road and when it was idle, would expedite the secondary road's green light.
This was long before computer vision was mature enough to do anything like that and I found out that instead, there are magnetic systems that can detect cars passing over - trivial hardware and software - and I concluded that my approach was just far too complicated and expensive.
Similarly, when I look at computers, I typically want the ML/AI system to operate on structured data that is codified for computer use. But I guess the world is complicated enough, and computers got fast enough, that having an AI look at a computer screen and move/click a mouse makes sense.
by dekhn on 10/7/2025, 10:08:27 PM
I just have to say that I consider this an absolutely hilarious outcome. For many years, I focused on tech solutions that eliminated the need for a human to be in front of a computer doing tedious manual operations. For a wide range of activities, I proposed we focus on "turning everything in the world into database objects" so that computers could operate on them with minimal human effort. I spent significant effort in machine learning to achieve this.
It didn't really occur to me that you could just train a computer to work directly on the semi-structured human world data (display screen buffer) through a human interface (mouse + keyboard).
However, I fully support it (like all the other crazy ideas on the web that beat out the "theoretically better" approaches). I do not think it is unrealistic to expect that within a decade, we could have computer systems that can open Chrome, start a video chat with somebody, go back and forth for a while to achieve a task, then hang up... without the person on the other end ever knowing they were dealing with a computer instead of a human.
by skc on 10/8/2025, 7:12:18 AM
How likely is it that the endgame is we stop writing apps for actual human users, and sites instead become massive walls of minified text against a black screen?
by realty_geek on 10/7/2025, 10:09:29 PM
Absolutely hilarious how it gets stuck trying to solve captcha each time. I had to explicitly tell it not to go to google first.
In the end I did manage to get it to play the housepriceguess game:
https://www.youtube.com/watch?v=nqYLhGyBOnM
I think I'll make that my equivalent of Simon Willison's "pelican riding a bicycle" test. It is fairly simple to explain but seems to trip up different LLMs in different ways.
by ramoz on 10/7/2025, 8:45:38 PM
This will never hit a production enterprise system without some form of hooks/callbacks in place to instill governance.
Obviously much harder with a UI than with agent events like the below.
by CuriouslyC on 10/7/2025, 8:49:23 PM
I feel like screenshots should be the last thing you reach for. There's a whole universe of data from accessibility subsystems.
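As a concrete example, Playwright's Python API can hand you a snapshot of the accessibility tree as structured data instead of pixels. A minimal sketch (the snapshot call is deprecated in newer releases, so treat it as illustrative):

    # Dump the accessibility tree instead of screenshotting.
    # The snapshot is a nested dict of roles/names, which is far
    # cheaper for a model to consume than raw pixels.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://news.ycombinator.com")
        tree = page.accessibility.snapshot()  # {'role': ..., 'name': ..., 'children': [...]}
        print(tree["role"], tree["name"])
        browser.close()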
by whinvik on 10/7/2025, 8:49:27 PM
My general experience has been that Gemini is pretty bad at tool calling. The recent Gemini 2.5 Flash release actually fixed some of those issues but this one is Gemini 2.5 Pro with no indication about tool calling improvements.
by AaronAPU on 10/7/2025, 9:49:19 PM
I’m looking forward to a desktop OS optimized version so it can do the QA that I have no time for!
by enjoylife on 10/7/2025, 10:38:53 PM
> It is not yet optimized for desktop OS-level control
Alas, AGI is not yet here. But I feel like if this OS-level control were good enough, and the cost of the LLM in the loop weren't bad, maybe that would be enough to kick-start something akin to AGI.
by xrd on 10/8/2025, 4:02:12 AM
I really want this model to try userinyerface.com
by omkar_savant on 10/7/2025, 9:34:06 PM
Hey - I'm on the team that launched this. Please let me know if you have any questions!
by martinald on 10/7/2025, 9:40:54 PM
Interesting, seems to use 'pure' vision and x/y coords for clicking stuff. Most other browser automation with LLMs I've seen uses the dom/accessibility tree which absolutely churns through context, but is much more 'accurate' at clicking stuff because it can use the exact text/elements in a selector.
Unfortunately it really struggled in the demos for me. It took nearly 18 attempts to click the comment link on the HN demo, each a few pixels off.
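To make the trade-off concrete, the two click styles look roughly like this in Playwright (Python); the coordinates are made up for illustration:

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto("https://news.ycombinator.com")

        # Vision-style action: the model emits raw coordinates.
        # A few pixels of error and the click lands on nothing.
        page.mouse.click(512, 304)

        # DOM-style action: an exact selector always hits, but shipping
        # the DOM/accessibility tree to the model burns far more context.
        page.click("a:text('comments')")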
by jsrozner on 10/8/2025, 1:19:29 AM
The irony is that most tech companies make their money by forcing users to wade through garbage. For example, if you could browse the internet and avoid ads, why wouldn't you? If you could choose what Twitter content to see outside of their useless algorithms, why wouldn't you?
by SilverSlash on 10/8/2025, 8:41:23 AM
Is this different from ChatGPT agent mode that I can use from the web app? I found that extremely useful for my task which required running some python and javascript code with open source libraries to generate an animated video effect.
I greatly appreciated ChatGPT writing the code and then running it on OpenAI's VMs instead of me pasting that code on my machine.
I wish Google released something like that in AI Studio.
by password54321 on 10/7/2025, 8:42:06 PM
doesn't seem like it makes sense to train AI around human user interfaces which aren't really efficient. It is like building a mechanical horse.
by ChaoPrayaWave on 10/8/2025, 8:51:45 AM
I've always been interested in running LLM locally to automate browser tasks, but every time I've tried, I've found the browser API to be too complex. In contrast, writing scripts directly with Playwright or Puppeteer tends to be much more stable.
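For comparison, the scripted approach is a handful of deterministic steps. A sketch against a made-up login flow (URL and selectors are hypothetical):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto("https://example.com/login")   # hypothetical target
        page.fill("#username", "demo")           # hypothetical selectors
        page.fill("#password", "secret")
        page.click("button[type=submit]")
        page.wait_for_url("**/dashboard")        # fails loudly if the flow changes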
by zomgbbq on 10/7/2025, 11:58:45 PM
I would love to use this for E2E testing. It would be great to write all my assertions as high-level descriptions so tests are resilient to UI changes.
Seems similar to the Amazon Nova Act API which is still in research preview.
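The shape I'm imagining is something like the sketch below, where llm_judge is a hypothetical wrapper around whatever vision model you'd use:

    # llm_judge() is hypothetical: a stand-in for any vision-model call
    # that answers yes/no about a screenshot.
    def assert_visually(page, claim: str) -> None:
        shot = page.screenshot()
        assert llm_judge(image=shot, question=f"Is the following true? {claim}"), claim

    def test_checkout(page):
        page.goto("https://shop.example/cart")  # hypothetical app under test
        assert_visually(page, "the cart lists 3 items and a visible checkout button")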
by mosura on 10/7/2025, 9:07:58 PM
One of the slightly buried stories here is BrowserBase themselves. Great stuff.
by numpad0 on 10/7/2025, 9:15:40 PM
How big are Gemini 2.5(Pro/Flash/Lite) models in parameter counts, in experts' guesstimation? Is it towards 50B, 500B, or bigger still? Even Flash feels smart enough for vibe coding tasks.
by Havoc on 10/8/2025, 12:29:02 AM
At some point, just having APIs for the web would make sense. Rendering it and then throwing LLMs at interpreting it seems… suboptimal.
Impressive tech nonetheless
by jwpapi on 10/8/2025, 2:17:19 AM
Can somebody give me use cases where this is faster than using the UI myself?
How am I supposed to use this? I really can't think of one, but I don't want to be blindsided, as obviously a lot of money is going into this.
I appreciate the tech behind it and the functionality, but I still wonder about the use cases.
by strangescript on 10/7/2025, 8:24:53 PM
I assume its tool calling and structured output are way better, but this model isn't in Studio unless it's being silently subbed in.
by sbinnee on 10/7/2025, 11:38:14 PM
I think it's related that I got an email from Google titled "Simplifying your Gemini Apps experience". It reads like: less privacy, more AI. They are going to automatically collect data from all Google apps, and users will no longer have the option to control access for individual apps.
by btbuildem on 10/8/2025, 12:52:53 AM
Does it work with ~legacy~ software? Eg, early 2000's Windows WhateverSoft's Widget Designer? Does it interface over COM?
There's a goldmine to be had in automating ancient workflows that keep large corps alive.
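For reference, classic COM automation is how those workflows get scripted today. A sketch with pywin32, using Excel as a stand-in (WhateverSoft's app would need to expose its own COM interface):

    import win32com.client

    # Drive a COM-exposed app the old-fashioned way.
    excel = win32com.client.Dispatch("Excel.Application")
    excel.Visible = True
    wb = excel.Workbooks.Add()
    wb.Worksheets(1).Cells(1, 1).Value = "automated via COM"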
by Oras on 10/7/2025, 9:05:57 PM
It is actually quite good at following instructions, but I tried clicking on job application links, and since they open in a new window, it couldn't find the new window. I suppose it might be an issue with BrowserBase, or just the way this demo was set up.
by cryptoz on 10/7/2025, 8:29:36 PM
Computer Use models are going to ruin simple honeypot form fields meant to detect bots :(
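For anyone unfamiliar: the whole trick is a CSS-hidden field that only DOM-reading bots fill in, roughly like this Flask sketch (route and field names made up). A model that acts on screenshots sails right past it, just like a human:

    from flask import Flask, request, abort

    app = Flask(__name__)

    @app.route("/signup", methods=["POST"])
    def signup():
        # "website" is hidden from humans via CSS. Old-school bots fill
        # every input in the DOM; a vision-driven agent never sees it.
        if request.form.get("website"):
            abort(400)
        return "ok"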
by amelius on 10/7/2025, 11:09:35 PM
Only use in environments where you can roll back everything.
by asadm on 10/7/2025, 9:32:47 PM
This is great. Now I want it to run faster than I can do it.
by orliesaurus on 10/7/2025, 10:54:39 PM
Does it know what's behind the "menu" of different apps? Or does it have to click on all menus and submenus to find out?
by CryptoBanker on 10/7/2025, 11:07:45 PM
Unless you give it specific instructions, it will google something and give you the AI-generated summary as the answer.
by t43562 on 10/8/2025, 6:24:05 AM
How ironic that words become more powerful than images...
by iAMkenough on 10/7/2025, 9:31:58 PM
Not great at Google Sheets. Repeatedly overwrites all previous columns while trying to populate new columns.
> I am back in the Google Sheet. I previously typed "Zip Code" in F1, but it looks like I selected cell A1 and typed "A". I need to correct that first. I'll re-type "Zip Code" in F1 and clear A1. It seems I clicked A1 (y=219, x=72) then F1 (y=219, x=469) and typed "Zip Code", but then maybe clicked A1 again.
by bonoboTP on 10/7/2025, 9:09:14 PM
There are some absolutely atrocious UIs out there for many office workers, who spend hours clicking buttons, opening popup after popup, ticking checkbox after checkbox, e.g. when entering travel costs in academia and elsewhere. You have no idea how annoying that type of work is; you pull your hair out. Why don't they make better UIs, you ask? If you have to ask, you have no idea how bad things are. They don't care, there is no communication, it seems fine from afar, the software creators are hard to reach, and the software is approved by people who never used it and decide based on gut feel, PowerPoints, and feature tickmarks. Even big-name brands like SAP are horrible at this.
If such AI tools can automate this soul-crushing drudgery, it will be great. I know you can technically script these things with Selenium, AutoHotkey, and the like, but you can imagine that's a nonstarter in a regular office. A tool like this could make that kind of work much more efficient. And it's not as if it will obviate the jobs entirely (at least not right away); these offices often have immense backlogs and are understaffed as is.
by keepamovin on 10/8/2025, 12:47:49 AM
It's basically an OODA loop: observe, orient, decide, act. This is a good thing.
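Spelled out, the loop is tiny. A sketch where model.next_action is hypothetical, standing in for the actual computer-use API:

    def agent_loop(page, goal: str, max_steps: int = 20):
        for _ in range(max_steps):
            screenshot = page.screenshot()                # observe
            action = model.next_action(goal, screenshot)  # orient + decide
            if action.kind == "done":                     # act (or stop)
                return action.result
            if action.kind == "click":
                page.mouse.click(action.x, action.y)
            elif action.kind == "type":
                page.keyboard.type(action.text)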
by tgsovlerkhgsel on 10/8/2025, 1:58:56 AM
I'm so looking forward to it. Many of the problems that should be trivially solved with either AI or a script are hard to impossible to solve because the data is locked away in some form.
Having an AI handle this may be inefficient, but as it uses the existing user interfaces, it might allow bypassing years of bureaucracy, and when the bureaucracy tries to fight it to justify its existence, it can fight it out with the EVERYONE MUST USE AI OR ELSE layers of management, while I can finally automate that idiotic task (using tens of kilowatts rather than a one-liner, but still better than having to do it by hand).
by nsonha on 10/8/2025, 11:40:18 AM
Is there a Claude Code for computer-use models? I mean something that's actually useful and not just a claude.ai kinda thing.
by TIPSIO on 10/7/2025, 8:58:42 PM
Painfully slow
by informal007 on 10/8/2025, 2:12:06 PM
The future will be more challenging for the fraud-detection field; good luck to them.
by mmaunder on 10/7/2025, 10:44:54 PM
I prepare to be disappointed every time I click on a Google AI announcement. Which is so very unfortunate, given that they're the source of LLMs. Come on big G!! Get it together!
by alexnewman on 10/7/2025, 9:57:57 PM
A year ago I built something that used RAG and accessibility mode to navigate UIs.
by dude250711 on 10/7/2025, 8:39:42 PM
Have average Google developers been told/hinted that their bonuses/promotions will be tied to their proactivity in using Gemini for project work?
by mianos on 10/7/2025, 9:06:31 PM
I sure hope this is better than pathetically useless. I assume it is meant to replace the extremely frustrating Gemini for Android. If I have a Bluetooth headset and I try "play music on Spotify", it fails about half the time, even with YouTube Music. I could not believe it was so bad, so I just sat at my desk with the helmet on and tried it over and over. It seems to recognise the speech but simply fails to do anything. Brand new Pixel 10. The old speech recognition system was way dumber, but it actually worked.
by GeminiFan2025 on 10/7/2025, 10:35:03 PM
The new Gemini 2.5 model's ability to understand and interact with computer interfaces looks very impressive. It could be a game-changer for accessibility and automation. I wonder how robust it is with non-standard UI elements.