• by jwyang on 2/20/2025, 6:29:52 AM

    Thanks for your great interest in our Magma work, everyone!

    We will gradually roll out the inference/training/evaluation/data preprocessing code in our codebase: https://github.com/microsoft/Magma, and this will be finished by next Tuesday. Stay tuned!

  • by ygouzerh on 2/20/2025, 7:19:03 AM

    The rate of progress on multimodal agents is impressive. OpenVLA was released in June 2024 and was state of the art at that time... 8 months later, on tasks like "Pick Place Hotdog Sausage", the success rate has gone from 2/10 to 6/10.

  • by erikig on 2/20/2025, 5:19:54 AM

    The multimodal capabilities, especially on next-action prediction, are quite impressive; watching the GitHub repo to see if & when they'll open source this: https://github.com/microsoft/Magma

    Also, I wonder why they named it Magma?

  • by Oras on 2/20/2025, 8:53:26 AM

    Looking at industrial robots, they don't mimic how humans do things, and that is exactly why they are efficient. So I don't understand how these proposals to teach robots to do things the way humans do make any sense.

    To have robots in our homes, their tools will need to be efficient. They won't use the same washing machine, oven, or dishwasher that we use now; there will be new ones made for robots.

  • by sorz on 2/20/2025, 8:48:15 AM

    In the mug-scrubbing video, the person is clearly pretending to wash the cup and doesn't seem to want to get their hands wet. I'm curious when models will be able to pick up on subtle things like that.

  • by lelag on 2/20/2025, 12:29:40 PM

    Really interesting model; I'm looking forward to playing with it.

    But what I want is a multimodal agent model capable of generating embeddings for a humanoid control model like Meta motivo[0] rather than directly outputting coordinates.

    Meta motivo is still a toy model, trained on the SMPL skeleton, which lacks fingers and therefore limits its usefulness beyond having some fun with it. They could have used a more advanced base model, SMPL-X, which includes fingers, but there isn't enough open motion data with precise finger motion to train a robust manipulation model anyway.

    Most existing motion datasets come from academic motion capture setups, which are complex, not focused on manipulation tasks, and also pretty old. I believe this gap will be filled by improvements in 3D human pose estimation (HPE) from 2D video. With access to thousands of hours of video, we can build large-scale motion datasets covering a wide range of real-world interactions.

    This will enable training the two components needed for dexterous humanoid robots: the agentic model that decides what actions to take and generates embeddings, and a control model that reads those embeddings and accurately drives hand and finger joint movement (a rough sketch of that interface follows the footnote below).

    Given the rapid progress in the capabilities of SoTA 3D HPE from 2D video, and the vast amount of video online (YouTube), I expect we will see humanoid robots with good manipulation capabilities in the not-so-distant future.

    [0]: https://github.com/facebookresearch/metamotivo
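
    To make that split concrete, here is a minimal hypothetical sketch of the interface. None of the class or method names come from Magma or Meta motivo; they are made up purely to illustrate a slow agentic planner emitting a latent "intent" embedding that a fast control policy consumes to produce joint targets.

        # Hypothetical sketch; names and dimensions are illustrative only.
        import numpy as np


        class AgentModel:
            """Hypothetical agentic model: looks at a frame and an instruction
            and emits a fixed-size intent embedding instead of raw coordinates."""

            def __init__(self, embedding_dim: int = 256):
                self.embedding_dim = embedding_dim

            def plan(self, rgb_frame: np.ndarray, instruction: str) -> np.ndarray:
                # Stand-in for a real VLM forward pass over (image, text).
                return np.zeros(self.embedding_dim, dtype=np.float32)


        class ControlPolicy:
            """Hypothetical control model: maps (intent embedding, joint state)
            to joint position targets, including finger joints."""

            def __init__(self, num_joints: int = 54):
                self.num_joints = num_joints

            def act(self, intent: np.ndarray, joint_state: np.ndarray) -> np.ndarray:
                # Stand-in for a learned policy conditioned on the latent intent;
                # here it simply holds the current joint positions.
                return joint_state.copy()


        # The agent re-plans at a low rate; the policy runs in a fast inner loop.
        agent, policy = AgentModel(), ControlPolicy()
        joint_state = np.zeros(policy.num_joints, dtype=np.float32)
        frame = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in camera image

        intent = agent.plan(frame, "pick up the mug")
        for _ in range(100):  # e.g. a 100 Hz control loop between re-plans
            joint_state = policy.act(intent, joint_state)

    The point of the split is that the control policy only ever sees the embedding, so it can be trained separately on motion data and swapped out without retraining the agentic model.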

  • by bilsbie on 2/20/2025, 1:12:30 PM

    Why do no multimodal models fluidly create images? It seems like they pass off to another model to generate images. They're not really aware of what's in the images they make, and they can't edit images in place.

  • by yurimo on 2/20/2025, 7:00:56 AM

    Multimodal agents notoriously fail at long-horizon tasks; how does Magma perform on those?

  • by kittikitti on 2/20/2025, 1:15:25 PM

    These benchmarks are not really representative of what agents are capable of. The slow process of checking the weather through UI elements, which is what this non-peer-reviewed paper showcases, is not a good use case.

  • by Mizza on 2/20/2025, 9:34:25 AM

    Have any multimodal models been reasoning-trained yet?

  • by funnyAI on 2/20/2025, 8:48:36 AM

    Just wondering if there is any research on incremental training? That could be used in robots as an alternative to RAG.

  • by bob_theslob646 on 2/20/2025, 10:56:56 PM

    Am I the only one that read that title in Dr. Evil's voice?

    All kidding aside. This looks promising

  • by digitaltrees on 2/20/2025, 6:16:22 AM

    They need to build an epistemology and theory-of-mind engine into these models. We take it for granted when dealing with other humans that they can infer deep meaning, motivations, and expectations of truth vs. fiction. But these agents don't do that, and so they will be awful collaborators until those behaviors are present.

  • by bosky101 on 2/20/2025, 12:05:45 PM

    Spent 10 mins on the website; all the examples are single-agent examples. There is zero value add for yet another wrapper on an OpenAI call parading as an agent.

    The whole point of agents is knowing what to do among potentially hundreds of intents and actions.

    Disappointing.