Thanks for your great interest in our Magma work, everyone!
We will gradually roll out the inference/training/evaluation/data preprocessing code in our codebase: https://github.com/microsoft/Magma, and this will be finished by next Tuesday. Stay tuned!
How far are we from making peanut butter sandwiches? Is that a valid benchmark to look towards, in this space?
The rate of progress on multimodal agents is impressive. OpenVLA was released in June 2024 and was state of the art at that time... 8 months later, on tasks like "Pick Place Hotdog Sausage", the success rate has gone from 2/10 to 6/10.
"Pick Place Hotdog Sausage" is such a bizarre name, though. Is it meant to be human readable? AI-readable? Just a label for the researchers? Same with "Put Mushroom Place Pot". As far as I can see both labels are only used in this Magma paper, nowhere else that Google can find.
"Pick & place" is a term for a kind of robot that can pick up scattered items from a conveyor belt and arrange them in a regular fashion.
The really fast multi-arm versions can be hypnotic to watch. You can see an example at 1:00 in this video: https://youtu.be/aPTd8XDZOEk
The limitation of industrial pick & place robots is that they're configured for a single task, and reconfiguring them for a new product is notoriously expensive.
Magma's "pick & place" demo is much slower and shakier than a specialized industrial robot. But Magma can apparently be adapted to a new task by providing plain English instructions.
Looking at industrial robots, they don't mimic how humans do things, and hence they are efficient. That's why I don't understand how these proposals to teach robots to do things the way humans do will make any sense.
To have robots at home, they will need their tools to be efficient. It will not be the same washing machine, oven, or dishwasher that we use now; there will be new ones made for robots.
Because of adoption and flexibility.
How long do you think it would take to align all washing machine manufacturers to make a 'robot washing machine'? Also, if it's optimized for robots, you need a robot to use it.
The progress in ML/AI is so fast that we can easily skip the tedious assumption that we have to shape our environment for robots instead of teaching robots how to interact in our environment.
And no, a robot doesn't need to be efficient at all. A robot can do things 24/7. If I'm out of the house and a robot needs 4 hours to clean my kitchen, and this doesn't cost me much besides a little bit of energy, I care very little how long it takes, as long as it doesn't take too long.
I will happily replace my dishes with steel if a robot can load and unload the dishwasher
Why is it important for a robot at home to be efficient? If you have an industrial robot churning out parts 24/7 in a factory, shaving 3 sec off a 30 sec process is very valuable. But does it really matter if the robot folds your laundry in 3 min or 5 min? Probably not, unless you are running a dry cleaning business.
Well, the difference between 3 and 4 hours might still be important if the robot is noisy.
For me doing the dishes is a 10-15 minute chore in the sink with some hot water but most people don't seem to object that their dishwasher takes 3 hours to do the same job with superheated steam or whatever. It still saves them the 15 minutes.
Don't forget that the dishwasher also uses less water and the dishes get sterilized by the steam. These features may or may not be important for a particular user.
You are considering efficiency gains or losses on the order of 0.25x to 4x while ignoring that the real issue is possible vs. impossible. If I have a household robot that can vacuum the floor taking 4x as much time as me, I don't care. It's still extremely useful. This is why human-imitating approaches are used everywhere: they won't be the most efficient solution, but the data contains information that helps make the task possible at all.
No, I'm not considering speed. I'm considering efficiency. A simple example: I tried to do plastering for a room in my house a few months ago, and as a first-timer, I did a terrible job. There are many things I learned later by hiring a professional plasterer:
- Mix type.
- Mix consistency.
- Mix timing and water amount (which he adjusted as he worked).
- Uneven walls.
These things are based on experience, room type, wall type, wall alignment, etc. There is no way that a robot will do the same job the same way a person does; it has to be done differently. Using the same space as humans will not make generic robots useful. Your vacuum example is perfect: I have one, and I have to manually add/remove the water container because the robot is not able to do that. Even if it were, it would have to return to the dock, unload, and start again. A human would remove it at any point and place it somewhere else.
I don't understand why you think a human-inspired general purpose robot won't be able to learn to deal with the issues in your bullet point list.
We're already seeing in AI that, given human examples, models can "learn" to act the same way, and a robot wouldn't have to go from never having done any plastering to being an expert plasterer in a single shot. They can be trained once using the expertise of professionals (both explanations and videos of them doing the work) so that they don't make the same mistakes you made on a first attempt - and once trained, that knowledge can be rolled out to all robots running that software rather than needing to teach each one individually. Hell, even without expert advice in the training, they could learn how to do it by trying it in a demo room (or a virtual environment) thousands of times until they figure out what does and doesn't lead to the desired end result...
These are orthogonal to form.
Ok but these are going to be humanoid robots, so it makes sense for them to use the appliances that have been designed for use by humans. I don’t really care if tasks like doing the dishes or doing the laundry take 1 hour or 10 hours, as long as they get done.
Because many spaces are built for humans. We already have specialized robots that vacuum our floors, but cleaning the surfaces, removing cables, and tidying up is still constrained to human-shaped entities.
Yes, it actually solved the fun parts of vacuuming and left the tedious parts to humans.
But humans generalize very well across tasks. You can have an employee driving a forklift who then stops, picks up a pallet that blocks his way, and continues.
And robots will not do that either. What if the employee used hearing to determine whether there is a hazard (another moving vehicle around) before jumping out to pick up the pallet? How would the robot know by just "looking"? How should it prioritize visuals, audio, other senses, etc.?
There's no reason to expect models won't be able to handle this even better than humans.
When it comes to integration, efficiency isn’t the most important quality attribute (“ility”). Having the interfaces be easily used by humans manually is more important. It’s why http1 has been dominant for so long… it’s easy for humans to understand and manually use without needing complex tooling. Yeah, eventually these things get replaced once machine-to-machine efficiency becomes a real pain point, but it’s far down the road. Not the first thing you try to tackle.
Yes, but it wouldn't be ideal to have hundreds of complex, specialized robots running around at home; if they behave like humans, they can complete tasks _optimally_.
You're not looking for hyper-efficiency on most of these tasks anyway; take the robot vacuum, for example: it's slower than a human, but absolutely useful.
Yet we shouldn't forget: we create things to make our lives easier, we create tools that assist us - not the other way around!
I think a robot that can work with existing machines will be much better than one that requires new, complementary ones - at least initially.
This. The paucity of imagination in the AI space is mind-numbing.
Fashions in AI.
Everyone is piling on Transformers and Diffusion (and in robotics, humanoids) today; but for most of the history of AI, we've been making things so simple they can only mono-task, and the only way to make commercial sense of that is to be much more efficient (on one of the many axes) than humans.
Now we have models that seem (at least at first glance) to cover the full breadth of what humans can do, so the question has become: can we make them perform at a decent skill level, rather than like someone who is book-smart enough to pass the tests but has almost no real experience of anything?
Airplanes don't fly by flapping their wings.
But they could... It's just not been made yet.
Could does not mean should, though.
How would an AI robot oven be different? I, as a human, want to use the AI robot oven.
The multimodal capabilities, especially on next-action prediction, are quite impressive; watching the GitHub repo to see if & when they'll open-source this: https://github.com/microsoft/Magma
Also, I wonder why they named it Magma?
`M(ultimodal) Ag(ent) [ma]` maybe
Good catch! A minor correction: Magma = M(ultimodal) Ag(entic) M(odel) at M(icrosoft) (Rese)A(rch); the last part is similar to how the name Llama came about. :)
How many 'M's in "Magma"? ;)
Oops, a typo - no M from Microsoft.
It's ok GPT
There are two r's.
A bit sad that they reused the name of https://icl.utk.edu/magma/ (Matrix Algebra on GPU and Multi-core Architectures). That library is already heavily used in machine learning; for example, it is included in every PyTorch-based project.
I know that AWS has an AI product for foundation models called Bedrock, so MS might've decided to go even deeper.
Wow
From the news section of that github README:
> [2025.02.19] We will be releasing our code, model and UI navigation demo at MSR Forum on 02.25 next Tuesday!
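Once the weights do land, I'd guess loading it will look roughly like the usual `transformers` pattern below. To be clear, this is a speculative sketch: the Hugging Face repo ID, the processor call, and the `trust_remote_code` requirement are all my assumptions, not anything the authors have published yet.

```python
# Purely speculative sketch of what loading Magma might look like once weights are up.
# The repo ID and processing calls below are guesses, not the published API.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Magma-8B"  # hypothetical Hugging Face ID

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda")

image = Image.open("desk.jpg")  # any screenshot or robot camera frame
inputs = processor(
    images=image,
    text="What should the arm do next to put the mushroom in the pot?",
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```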
Looking at the paper, some other agentic models they compared to were named LLaVA... maybe it's just a play on words.
Really interesting model; I'm looking forward to playing with it.
But what I want is a multimodal agent model capable of generating embeddings for a humanoid control model like Meta motivo[0] rather than directly outputting coordinates.
Meta Motivo is still a toy model, trained on the SMPL skeleton, which lacks fingers; that limits its capabilities beyond having some fun with it. They could have used a more advanced base model, SMPL-X, which includes fingers, but there isn't enough open motion data with precise finger motion to train a robust manipulation model anyway.
Most existing motion datasets come from academic motion capture setups, which are complex, not focused on manipulation tasks, and also pretty old. I believe this gap will be filled by improvements in 3D human pose estimation (HPE) from 2D video. With access to thousands of hours of video, we can build large-scale motion datasets covering a wide range of real-world interactions (rough sketch of that harvesting loop at the end of this comment).
This will enable training the two components needed for dexterous humanoid robots: the agentic model that decides what actions to take and generates embeddings, and a control model that reads those embeddings and accurately models hand and finger joint movement.
Given the rapid progress in the capabilities of SoTA 3D HPE from 2D video, and the vast amount of video online (YouTube), I expect we will see humanoid robots with good manipulation capabilities in the not-so-distant future.
[0]: https://github.com/facebookresearch/metamotivo
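To make the "mine motion data from ordinary video" idea concrete, here is a minimal sketch. The `estimate_pose` callable stands in for whatever 3D HPE model you plug in; the interface and array shapes are assumptions for illustration, not any standard API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

import numpy as np


@dataclass
class PoseFrame:
    body_pose: np.ndarray  # e.g. SMPL-X body joint rotations
    hand_pose: np.ndarray  # finger joint rotations -- the part most datasets lack


def build_motion_dataset(
    frames: Iterable[np.ndarray],
    estimate_pose: Callable[[np.ndarray], PoseFrame],
) -> List[PoseFrame]:
    """Turn ordinary 2D video frames into a mocap-style pose sequence.

    `estimate_pose` is whatever 3D HPE model you trust; the point is that
    thousands of hours of video become motion data without a capture studio.
    """
    return [estimate_pose(frame) for frame in frames]


# Toy usage with a dummy estimator, just to show the shape of the pipeline.
dummy = lambda frame: PoseFrame(np.zeros((21, 3)), np.zeros((30, 3)))
clips = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(4)]
print(len(build_motion_dataset(clips, dummy)))  # -> 4
```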
Some more thoughts about training a manipulation model: I would add that synthetic data might be key to making it happen.
One issue is that most video is not shot in first person, so it might make for a poor dataset for the agentic part, assuming the robot has human-like vision.
Still, if you have a large dataset of motion capture data with reasonably accurate finger movement, you could use a video diffusion model with a ControlNet to get a realistic-looking video of a specific motion in first person. Another way would be to use a model like dust3r to generate a geometric 3D scene from the initial video, allowing you to change the camera angle to match a first-person view.
This could be used as the dataset for the agentic model.
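A toy numpy sketch of the "change the camera angle" idea: once you have a metric 3D reconstruction of the clip, you can render it from wherever the subject's head is. The head pose and intrinsics here are placeholders you'd have to estimate; nothing below is taken from any specific library.

```python
import numpy as np


def reproject_to_first_person(
    points_world: np.ndarray,   # (N, 3) scene points from a dust3r-style reconstruction
    world_to_head: np.ndarray,  # (4, 4) world -> egocentric-camera transform (estimated)
    K: np.ndarray,              # (3, 3) pinhole intrinsics of the virtual head camera
) -> np.ndarray:
    """Project reconstructed 3D points into a virtual first-person view."""
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])  # homogeneous coords
    pts_cam = (world_to_head @ pts_h.T).T[:, :3]                        # into the head camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]                                # keep points in front of the camera
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]                                       # pixel coordinates


# Tiny example: three points seen by a camera at the origin looking down +z.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts = np.array([[0.0, 0.0, 2.0], [0.1, -0.05, 1.5], [0.0, 0.0, -1.0]])
print(reproject_to_first_person(pts, np.eye(4), K))
```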
Now, maybe human-like vision is not even necessary; unlike a human, there is nothing preventing your robot from seeing through external cameras placed around the house. Hell, there's even a good chance your robot's brain will live in a datacenter hundreds of miles away.
Trying to wrap my head around this - are you saying that those models are trained around the concept of fingers (some kind of physical manipulators with set dimensions)?
The SMPL-X body model, a standard in this academic field, does model fingers: https://smpl-x.is.tue.mpg.de/
The issue is that there are far fewer datasets available for it than for the simpler SMPL model.
Regarding fingers, you already have "dumb" models like https://github.com/google-deepmind/mujoco_mpc which can control finger movement to achieve specific tasks.
Look at this video to see it in action: https://www.youtube.com/watch?v=2xVN-qY78P4&t=387s
Pretty cool stuff.
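If anyone wants to poke at the low-level side, here's a minimal sketch using the plain `mujoco` Python bindings (not mujoco_mpc itself); the toy two-joint finger and the target angles are made up for illustration.

```python
import mujoco
import numpy as np

# Toy two-joint "finger" defined inline; a real setup would load a full hand model.
XML = """
<mujoco>
  <worldbody>
    <body name="finger">
      <joint name="knuckle" type="hinge" axis="0 1 0"/>
      <geom type="capsule" fromto="0 0 0 0.04 0 0" size="0.008"/>
      <body pos="0.04 0 0">
        <joint name="middle" type="hinge" axis="0 1 0"/>
        <geom type="capsule" fromto="0 0 0 0.03 0 0" size="0.007"/>
      </body>
    </body>
  </worldbody>
  <actuator>
    <position joint="knuckle" kp="5"/>
    <position joint="middle" kp="5"/>
  </actuator>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

# Drive the two joints toward target angles; in practice a planner like
# mujoco_mpc would compute these online, here they are just a fixed curl.
data.ctrl[:] = np.array([0.8, 1.2])
for _ in range(500):
    mujoco.mj_step(model, data)

print("final joint angles:", data.qpos)
```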
In the mug-scrubbing video, the person clearly pretends to wash the cup but does not seem to want to get their hands wet. I'm curious as to when models will be able to figure out that subtle thing.
It's all probabilistic, I'd guess: the model produces probabilities for a set of actions from the same video, and even a pretended action may look more like washing than anything else, thus getting the highest probability.
You want that to still work, so that a human can demonstrate an action without exposing squishy human bits to a danger that the robot is safe from.
Why do no multimodal models fluidly create images? It seems like they pass off to another model to generate images. They're not really aware of what's in the images they make, and they can't edit images in place.
What do you mean by fluidly?
Multimodal agents notoriously fail at long horizon tasks, how does Magma perform on it?
Very good question! Right now we are mainly focusing on building the foundation for multimodal perception and atomic action-taking. Of course, integrating trace-of-mark prediction on robotics and human video data enhances the model's medium-horizon reasoning, but this is certainly not sufficient. The current Magma model will serve as the basis for our next step, i.e., longer-horizon reasoning and planning! We are looking at exactly this part for our next version of Magma!
Have any multimodal models been reasoning-trained yet?
https://platform.openai.com/docs/models/#o1
> The latest o1 model supports both text and image inputs
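For example, with the Python SDK an image goes in as an ordinary content part; the model name and image URL below are placeholders for whatever your account exposes.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="o1",  # placeholder; use whichever reasoning-model snapshot you have access to
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the person in this photo about to do next?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/kitchen.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```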
But not multimodal reasoning: the intermediate and output tokens are text-only, at least in the released version. They probably have actual multimodal reasoning that hasn't been shown yet, as they already showed GPT-4o can output image tokens, but that hasn't been released yet either.
That wasn’t the question… they asked if any multimodal models had been reasoning trained. o1 fits that criteria precisely, and it can reason about the image input.
They didn’t ask about a model that can create images while thinking. That’s an entirely unrelated topic.
Just wondering, has any research been done on incremental training? That could be used in robots as an alternative to RAG.
They need to build an epistemology and theory-of-mind engine into models. We take it for granted when dealing with other humans that they can infer deep meaning, motivations, and expectations of truth vs. fiction. But these agents don't do that, and so they will be awful collaborators until those behaviors are present.
We're in the 56k modem era of generative AI, so I wouldn't be surprised if we had that in the next few years, or weeks.
Theory of mind should naturally emerge when the models are partly trained in an adversarial simulation environment, like the Cicero model, although that's a narrow AI example.
And it causes a ton of chaos that we do take that for granted between humans. The annoying collaborator is the person who takes information for granted.
Did you read any research on theory of mind and models? Since GPT-4, models have been tested using metrics similar to those used for humans, and it seems the bigger models "have" it.
Am I the only one that read that title in Dr. Evil's voice?
All kidding aside, this looks promising.
Spent 10 minutes on the website; all the examples are single-agent examples. There is zero value added by yet another wrapper around an OpenAI call parading as an agent.
The whole point of agents is knowing what to do among potentially hundreds of intents and actions.
Disappointing.
These benchmarks are not really representative of what agents are capable of. The slow process of checking the weather through UI elements, which this non-peer-reviewed paper showcases, is not a good use case.