Google’s Genie game maker is what happens when AI watches 30K hrs of video games

At this point, anyone who follows generative AI is used to tools that can generate passive, consumable content in the form of text, images, video, and audio. Google DeepMind’s recently unveiled Genie model (for “GENerative Interactive Environment”) does something altogether different, converting images into “interactive, playable environments that can be easily created, stepped into, and explored.”

DeepMind’s Genie announcement page shows plenty of sample GIFs of simple platform-style games generated from static starting images (children’s sketches, real-world photographs, etc.) or even text prompts passed through Imagen 2. While those slick-looking GIFs gloss over some major current limitations that are discussed in the full research paper, AI researchers are still excited about how Genie’s generalizable “foundational world modeling” could help supercharge machine learning going forward.

Under the hood
While Genie’s output looks similar at a glance to what might come from a basic 2D game engine, the model doesn’t actually draw sprites and code a playable platformer in the same way a human game developer might. Instead, the system treats its starting image (or images) as frames of a video and generates a best guess at what the entire next frame (or frames) should look like when given a specific input.

To establish that model, Genie started with 200,000 hours of public Internet gaming videos, which were filtered down to 30,000 hours of standardized video from “hundreds of 2D games.” The individual frames from those videos were then converted into discrete tokens by a 200-million-parameter video tokenizer, producing a compressed representation that a machine-learning algorithm could more easily work with.
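For the curious, here’s a rough, hypothetical Python sketch of what that kind of discrete tokenization looks like: pixel patches get snapped to their nearest entry in a learned codebook, turning a frame into a small grid of integer IDs. The patch size, codebook size, and frame dimensions below are illustrative stand-ins, not Genie’s actual values, and the random “codebook” stands in for one learned during training.

```python
import numpy as np

# Toy sketch of VQ-style frame tokenization (illustrative values, not Genie's).
# Each 16x16 pixel patch is flattened and snapped to its nearest entry in a
# codebook; the frame becomes a small grid of integer token IDs.

rng = np.random.default_rng(0)

PATCH = 16                      # patch side length in pixels (assumed)
CODEBOOK_SIZE = 1024            # number of discrete codes (assumed)
frame = rng.random((128, 128, 3))                          # stand-in for one video frame
codebook = rng.random((CODEBOOK_SIZE, PATCH * PATCH * 3))  # stand-in for learned codes

def tokenize_frame(frame, codebook, patch=PATCH):
    h, w, _ = frame.shape
    tokens = np.empty((h // patch, w // patch), dtype=np.int64)
    for i in range(h // patch):
        for j in range(w // patch):
            vec = frame[i*patch:(i+1)*patch, j*patch:(j+1)*patch].reshape(-1)
            # Nearest-neighbor lookup: the index of the closest codebook vector
            tokens[i, j] = np.argmin(np.linalg.norm(codebook - vec, axis=1))
    return tokens

tokens = tokenize_frame(frame, codebook)
print(tokens.shape)   # (8, 8) grid of token IDs standing in for one frame
```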

[Image: An image like this, generated via text prompt to an image generator, can serve as the starting point for Genie’s world-building. Credit: Google DeepMind]

[GIF: A sample of interactive movement enabled by Genie from the above starting image. Credit: Google DeepMind]
From here, the system generated a “latent action model” to predict what kind of interactive “actions” (i.e., button presses) could feasibly and consistently generate the kind of frame-by-frame changes seen across all of those tokens. The system limits the potential inputs to a “latent action space” of eight possible inputs (e.g., four d-pad directions plus diagonals) in an effort “to permit human playability” (which makes sense, as the videos it was trained on were all human-playable).
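To make that idea concrete, here’s a toy Python sketch of what quantizing a frame-to-frame change into one of eight discrete latent actions might look like. The feature sizes and the simple “difference of features” encoder are illustrative stand-ins for the encoder Genie actually learns from video.

```python
import numpy as np

# Toy sketch of the latent-action idea: compress whatever changed between two
# consecutive frames down to one of 8 discrete codes. Genie learns this end to
# end from video; the "difference of frame features" used here is only a
# stand-in for that learned encoder.

rng = np.random.default_rng(1)

NUM_ACTIONS = 8                                    # size of the latent action space
action_codebook = rng.random((NUM_ACTIONS, 64))    # stand-in for learned action codes

def infer_latent_action(frame_feat_t, frame_feat_t1, codebook=action_codebook):
    # Encode the frame-to-frame change (stand-in for a learned encoder) ...
    change = frame_feat_t1 - frame_feat_t
    # ... then snap it to the nearest of the 8 discrete latent actions.
    return int(np.argmin(np.linalg.norm(codebook - change, axis=1)))

feat_t = rng.random(64)    # stand-in features for frame t
feat_t1 = rng.random(64)   # stand-in features for frame t+1
print(infer_latent_action(feat_t, feat_t1))   # an integer in [0, 8)
```
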
With the latent action model established, Genie then generates a “dynamics model” that can take any number of arbitrary frames and latent actions and generate an educated guess about what the next frame should look like given any potential input. This final model ends up with 10.7 billion parameters trained on 942 billion tokens, though the team’s scaling results suggest that even larger models would produce better results.
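Put together, those pieces imply an interactive loop: tokenize the prompt image, pick a latent action each time the player presses a button, and ask the dynamics model for the next frame’s tokens. Here’s a rough Python sketch of that loop with the learned model replaced by a random stub; the token counts are illustrative, the 16-frame window echoes the context limit mentioned in the paper, and none of this is a working implementation of Genie.

```python
import numpy as np

# Toy sketch of Genie's play loop as described above: keep a short window of
# tokenized frames plus the latent actions chosen so far, and ask the dynamics
# model for the tokens of the next frame. The "model" below is a random stub;
# only the loop structure is the point.

rng = np.random.default_rng(2)

MEMORY = 16            # frames of context, per the paper's stated limit
TOKENS_PER_FRAME = 64  # illustrative, not Genie's actual token count
CODEBOOK_SIZE = 1024   # illustrative codebook size

def dynamics_model(frame_tokens, actions):
    """Stand-in for the learned model: the real one predicts the next frame's
    tokens from prior frame tokens and latent actions."""
    return rng.integers(0, CODEBOOK_SIZE, size=TOKENS_PER_FRAME)

frame_tokens = [rng.integers(0, CODEBOOK_SIZE, size=TOKENS_PER_FRAME)]  # prompt image
actions = []

for step in range(5):                            # five "button presses" by the player
    actions.append(int(rng.integers(0, 8)))      # one of 8 latent actions
    next_frame = dynamics_model(frame_tokens[-MEMORY:], actions[-MEMORY:])
    frame_tokens.append(next_frame)              # these tokens get decoded back to pixels

print(len(frame_tokens), "frames generated")
```
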

Previous work on generating similar interactive models with generative AI has relied on “ground truth action labels” or text descriptions of the training data to help guide the machine-learning algorithms. Genie differentiates itself from that work in its ability to “train without action or text annotations,” inferring the latent actions behind a video using nothing but those hours of tokenized video frames.

“The ability to generalize to such significantly [out-of-distribution] inputs underscores the robustness of our approach and the value of training on large-scale data, which would not have been feasible with real actions as input,” the Genie team wrote in its research paper.

Significant limitations
Before you get too excited about being able to generate endless platformers from nothing but rough sketches, there are some important limitations to keep in mind. Most significantly, the system currently runs at only one frame per second, which is at least 20 to 30 times slower than what would be needed for something that could be considered playable in real time. Sample GIFs that show much smoother animation are simply splicing together a series of frames that, at the system’s actual speed, would take a significant chunk of a minute to generate.

The Genie team also admits that the system “can hallucinate unrealistic futures,” much like other AI models. You can see this clearly in some of the sample GIFs the Genie team has shared, including one where two flying birds collapse into a single entity or another where a character seems to start floating rather than falling to the ground after a simple jump.

You might also notice that the samples shown publicly by the Genie team so far show only a handful of (sometimes very blurry) frames of action before looping back to the start. That’s likely because the system is currently limited to analyzing only up to “16 frames of memory,” which the team says “makes it challenging to get consistent environments over long horizons.”

A step toward a “world model”?
Despite those challenges, we don’t want to undersell what Genie is currently able to do. With nothing but a single static image to start from, Genie seems able to distinguish a player character from a game background, generate a rough estimate of how that character should move and animate in response to player inputs, and even scroll the background appropriately as that character moves (with impressive parallax scrolling in some examples). That’s a significant achievement for a system that doesn’t have any human guidance or action-labeling assistance to interpret its video training data.

And Genie’s generalizable “learning from video frames” approach has potential applications outside of creating 2D platformers, too. As a proof of concept, the Genie team trained a smaller, 2.5-billion-parameter model that tried to map latent actions to a video of a robotic arm working in three dimensions. That system was similarly able to map arm movements to consistent action inputs and even predict how objects picked up by the robot arm might “deform” in response to actions.

[Video: Some researchers are particularly excited about how the Genie system learned how to realistically “deform” the potato chip bag in this video simulation. Credit: Google DeepMind]
That robotic result has researchers hopeful that this kind of technique could be used “to create a foundational world model for robotics, with low-level controllable simulation that could be used for a variety of applications,” as the Genie team puts it. And that idea could go beyond robotics, too: “Given its generality, the model could be trained from an even larger proportion of Internet videos to simulate diverse, realistic, and imagined environments,” the team wrote in its paper.

Despite Genie’s limitations, DeepMind researchers are already looking ahead to what this kind of robust world modeling could mean for AI as a whole. DeepMind’s Jack Parker-Holder said on social media that Genie represents nothing less than “a viable path to generating the rich diversity of environments we need for [artificial general intelligence].” The current state of Genie “is the worst video models are ever going to be,” he continued. “Super exciting to see the impact these models will have when used as world simulators with open-ended learning.”

DeepMind’s Richard Song added that Genie could lead to the “infinite generator” researchers need to generate “tons of diverse video game environments necessary for training general-purpose [reinforcement learning] agents.”

The project is starting to make waves outside of Google, too. Nvidia AI researcher Jim Fan noted that Genie improves on OpenAI’s Sora video model, in a way, because it’s “actually a proper action-driven world model with inferred actions.” Fan went on to declare that “2024 will also be the Year of Foundation World Models!”

Whether or not those kinds of predictions pan out, the sense of excitement the project is generating among those who have seen it up close is hard to ignore. “When I was shown this project, my reaction was ‘Oh, this is the coolest project I’ve seen in recent time, super exciting!’” DeepMind’s Lucas Beyer wrote. “I’m not even paraphrasing; these were my words.”
