
How Diffusion Models Are Reimagining Game Environments: DIAMOND

By Zian (Andy) Wang
Published Feb 6, 2025 · Updated Feb 7, 2025

Diffusion-based generative models have gained significant traction in recent years with the prevalence of video and image synthesis tools like Stable Diffusion, Flux, Kling, and many more. More recently, however, researchers have started exploring diffusion models’ ability to generate interactive virtual worlds in real time.

In February 2024, Google DeepMind’s Genie, or Generative Interactive Environments, marked an early piece of research that gained significant traction: a model that can generate interactive, playable environments from a single image prompt. Prior to Genie, models like Sora and Runway ML could produce videos from arbitrary text prompts, but those videos could not be interacted with.

Impressively, Genie learns how to respond to human actions exclusively from internet videos. These videos don’t show players pressing the up arrow or the space bar; they’re simply gameplay footage.

In their publication, the authors emphasized Genie’s potential implications for training AI agents. One of the biggest challenges in developing physical AI agents like autonomous robots is gathering enough training data and performing the training, since the physical world cannot be accurately simulated and real-world data collection is, after all, finite. However, the authors of the paper stated that “latent actions learned by Genie can transfer to real human-designed environments”.

Fittingly, merely four months later, “Diffusion for World Modeling: Visual Details Matter in Atari” was published, detailing a paradigm in which reinforcement learning agents are trained in world environments “imagined” by diffusion models. The proposed method, DIffusion As a Model Of eNvironment Dreams, or DIAMOND for short, is able to generate twenty-five different Atari games. Each environment comes equipped with a reinforcement learning agent, fully trained within the diffusion world model, that excels at playing these games.

Reinforcement learning agent playing in a world generated by DIAMOND.

The authors of DIAMOND also showcased the diffusion model being trained on 3D environments such as Counter-Strike: Global Offensive (CS:GO).

Amazingly, these diffusion models and their respective reinforcement learning (RL) agents are remarkably small and efficient. They run smoothly on most modern computers without specialized hardware, unlike many current open-source diffusion models. While DIAMOND does run at a lower resolution, it remains completely playable.

How Does DIAMOND Work?

DIAMOND is based largely on two core machine learning concepts: reinforcement learning (RL) and diffusion models. The paper does employ some technical jargon and equations, but as we will see, they are straightforward to understand.

The authors model their training environments—the games that both diffusion models and reinforcement learning agents will learn in—as a Partially Observable Markov Decision Process (POMDP).

A POMDP essentially describes an environment, in this case various Atari games, where the agents interacting with it need to make decisions but can’t see everything. This is just like how humans play video games: we make decisions based only on the information visible on our screens, not the entire game world and the state of every NPC and player.

The mathematical framework of a POMDP is represented as the tuple (𝒮, 𝒜, 𝒪, T, R, O, γ); a toy sketch of how these pieces fit together in code follows the list below.

  • 𝒮: The set of all possible states. In Pac-Man, this would include everything about the game - Pac-Man’s position, ghost positions, which pellets are eaten, etc.

  • 𝒜: The set of actions an agent can take. In Pac-Man, these would be moving up, down, left, or right.

  • 𝒪: The set of image observations: what the agent can actually see. This is typically the game screen.

  • T: The transition function that describes how likely we are to end up in a new state given the current state and after taking an action (i.e., p(sₜ₊₁|sₜ, aₜ)).

  • R: The reward function that tells us what reward we get for our actions.

  • O: The observation function that determines what we can observe about the true state.

  • γ: A discount factor that makes future rewards worth less than immediate ones.
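To make the tuple concrete, here is that toy sketch in Python (the class and its internals are invented for illustration and are not part of the DIAMOND codebase): the environment keeps a full hidden state from 𝒮, but the agent only ever receives an observation from 𝒪 and a reward from R.

```python
import numpy as np

class ToyPacManPOMDP:
    """Illustrative POMDP: the true state stays hidden behind the observation function."""

    def __init__(self, gamma: float = 0.99):
        self.gamma = gamma                                  # discount factor γ
        self.actions = ["up", "down", "left", "right"]      # action set 𝒜
        self.state = {"pacman": (1, 1), "ghost": (5, 5), "pellets": 240}  # hidden state s ∈ 𝒮

    def _observe(self, state: dict) -> np.ndarray:
        # Observation function O: render the hidden state to screen pixels.
        return np.zeros((64, 64, 3), dtype=np.uint8)        # placeholder frame

    def step(self, action: str):
        # Transition function T: move to the next hidden state given (s, a).
        x, y = self.state["pacman"]
        dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}[action]
        self.state["pacman"] = (x + dx, y + dy)             # placeholder dynamics
        reward = 1.0                                        # reward function R (placeholder)
        return self._observe(self.state), reward            # the agent never sees self.state directly
```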

The two components in DIAMOND attempt to learn different aspects within the POMDP.

The reinforcement learning (RL) agent’s goal is to learn a policy π that maps observations to actions in order to maximize the expected discounted return. Essentially, the RL agent takes an observation from 𝒪 as input (in this case, a screenshot of the current game screen) and outputs an action from the allowed set 𝒜 that maximizes its expected reward under R. The discount factor γ makes immediate rewards count for more than future ones (such as eating a nearby ghost in Pac-Man rather than going for a much more distant fruit).
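In symbols, the expected discounted return the agent maximizes is E[Σₜ γᵗ rₜ], where rₜ is the reward received at step t. Because γ < 1, a reward earned now contributes more to this sum than the same reward earned many steps later, which is exactly why the agent prefers the nearby ghost over the distant fruit.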

The diffusion world model, on the other hand, takes the current game observation (what’s on the screen) and the action of the player or the RL agent, and outputs the next game observation (what the screen should look like next) along with the reward for that action. In other words, the diffusion model is attempting to learn p(sₜ₊₁, rₜ|sₜ, aₜ).

Training is broken down into three phases: data is collected by running the RL agent in the real environment, the diffusion model is trained on the collected data, and the RL agent is then trained in the world “imagined” by the diffusion model.
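A minimal sketch of that loop, with stand-in classes and a Gym-style real_env assumed (none of these names come from the actual DIAMOND code):

```python
import random

class WorldModel:
    """Stand-in for the diffusion world model: predicts (next observation, reward)."""
    def train_on(self, buffer):
        pass                                   # fit the denoiser on real transitions
    def imagine_step(self, obs, action):
        return obs, random.random()            # placeholder prediction

class Agent:
    """Stand-in for the actor-critic RL agent."""
    def act(self, obs):
        return random.randint(0, 3)            # sample an action from the policy
    def update(self, imagined_trajectory):
        pass                                   # policy-gradient update (see later section)

def train_diamond(real_env, epochs=10, collect_steps=100, horizon=15):
    world_model, agent, buffer = WorldModel(), Agent(), []
    for _ in range(epochs):
        # 1. Collect experience by running the agent in the *real* environment.
        obs = real_env.reset()
        for _ in range(collect_steps):
            action = agent.act(obs)
            next_obs, reward, done = real_env.step(action)
            buffer.append((obs, action, reward, next_obs))
            obs = real_env.reset() if done else next_obs

        # 2. Train the diffusion world model on the collected transitions.
        world_model.train_on(buffer)

        # 3. Train the RL agent entirely inside the world "imagined" by the model.
        trajectory, obs = [], buffer[-1][0]
        for _ in range(horizon):
            action = agent.act(obs)
            obs, reward = world_model.imagine_step(obs, action)
            trajectory.append((obs, action, reward))
        agent.update(trajectory)
```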

The Objective Function

At their core, diffusion models aim to restore a noisy image to its original form. In the case of DIAMOND, this works by learning how to add noise to game screenshots and then reverse the process in a controlled manner with a neural network. The authors employ a U-Net architecture for this network.

DIAMOND adapts the standard score-based diffusion process by conditioning the model on previous game states and actions and by making sure the model outputs valid future game states.

They define the model’s objective function, in simplified form (dropping the noise-dependent input and output scaling constants), as:

ℒ(θ) = E‖ 𝐅θ(x^τ_{t+1}, y^τ_t) − (x^0_{t+1} − c^τ_skip · x^τ_{t+1}) ‖²

Without getting too technical, here’s what each term represents and how it works within the DIAMOND framework.

There are two major components to the training objective: the prediction from the U-Net/diffusion model and the “ground truth” that the model will attempt to align its predictions to.

The Network Prediction

The U-Net (𝐅θ) takes in two key pieces of information. First, it receives x^τ_{t+1}, which is the noisy version of the next game state. The superscript τ marks our current position in the diffusion process: the further along we are in denoising (τ closer to 0), the cleaner this state becomes. The model also receives conditioning information y^τ_t, which includes the current noise level, the previous clean game states (x^0_{≤t}), and the actions taken (a_{≤t}). This helps the model understand the game’s context and dynamics.
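One plausible way to package that conditioning (an illustrative sketch only, not the authors’ exact implementation) is to stack the previous clean frames along the channel dimension and encode the noise level and past actions separately:

```python
import torch
import torch.nn.functional as F

def build_conditioning(past_frames, past_actions, noise_level, num_actions=18):
    """Sketch: bundle y^τ_t = (noise level, past frames x^0_{≤t}, actions a_{≤t}).

    past_frames:  (B, T, C, H, W) previous clean observations
    past_actions: (B, T) integer action indices
    noise_level:  (B,) noise level σ(τ) for each sample in the batch
    """
    B, T, C, H, W = past_frames.shape
    # Stack the T previous frames along channels so the U-Net can concatenate
    # them with the noisy next frame x^τ_{t+1} at its input.
    frame_cond = past_frames.reshape(B, T * C, H, W)
    # One-hot encode past actions; in practice these would go through an embedding
    # and be injected into the U-Net blocks (e.g. via adaptive normalization).
    action_cond = F.one_hot(past_actions.long(), num_actions).float()
    # The log of the noise level is a common conditioning signal in score-based models.
    noise_cond = noise_level.clamp_min(1e-8).log()
    return frame_cond, action_cond, noise_cond
```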

The Training Target

Here, x^0_{t+1} is our true, clean next game state. It’s where we ultimately want to end up. The term x^τ_{t+1} is the noisy version of the next state at the current noise level; it represents where we currently are in the denoising process. We multiply this by c^τ_skip to scale how much we trust this noisy version based on our current position in the diffusion process.

The subtraction (x^0_{t+1} − c^τ_skip · x^τ_{t+1}) essentially calculates the difference between where we are (our noisy state, scaled by how much we trust it) and where we want to be (the clean state). This difference is what we’re asking our model to predict. It needs to learn how to bridge this gap at each step of the denoising process.

The Training Process

During training, the model learns to predict these state differences at various noise levels. At each training step:

  1. We start with a real game sequence (the clean states and actions)

  2. We add noise to the next state according to our current τ

  3. The model sees this noisy state (x^τ_{t+1}) along with past states and actions

  4. It learns to predict how to transform this noisy state toward the clean state

The authors sample τ from a log-normal distribution. This means the model learns to denoise at any noise level, focusing especially on “medium-noise” regions where the learning is most effective.
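Putting those steps together, one training step might look roughly like the following. This is a simplified sketch: the unet argument and its call signature are assumptions, the constants are illustrative, and the c_in, c_out, c_skip scalars follow the EDM-style preconditioning that DIAMOND builds on.

```python
import torch
import torch.nn.functional as F

SIGMA_DATA = 0.5            # assumed data standard deviation for the preconditioning scalars
P_MEAN, P_STD = -0.4, 1.2   # log-normal noise distribution parameters (illustrative values)

def training_step(unet, x_clean_next, conditioning):
    """One denoising training step for the world model (simplified sketch).

    x_clean_next: (B, C, H, W) the true next frame x^0_{t+1}
    conditioning: encodes previous frames, actions, and the noise level (y^τ_t)
    unet(x, noise_embedding, conditioning) -> network prediction (assumed signature)
    """
    B = x_clean_next.shape[0]
    # Sample a noise level from a log-normal distribution: "medium" noise is most likely.
    sigma = torch.exp(P_MEAN + P_STD * torch.randn(B, 1, 1, 1, device=x_clean_next.device))

    # Corrupt the clean next frame: x^τ_{t+1} = x^0_{t+1} + σ·ε.
    x_noisy_next = x_clean_next + sigma * torch.randn_like(x_clean_next)

    # EDM-style preconditioning scalars, which depend on the noise level.
    c_skip = SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)
    c_out = sigma * SIGMA_DATA / torch.sqrt(sigma**2 + SIGMA_DATA**2)
    c_in = 1.0 / torch.sqrt(sigma**2 + SIGMA_DATA**2)

    # The network predicts the (rescaled) gap between the clean state and the
    # trusted part of the noisy state: (x^0_{t+1} - c_skip·x^τ_{t+1}) / c_out.
    target = (x_clean_next - c_skip * x_noisy_next) / c_out
    prediction = unet(c_in * x_noisy_next, sigma.log().flatten(), conditioning)

    return F.mse_loss(prediction, target)
```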

During inference, the model generates the next frame by stepping through the reverse diffusion process sequentially. We start with pure noise along with our conditioning, which includes information about previous game states and actions taken. At each step:

  1. The model sees this noisy state and predicts how to improve it

  2. We use Euler’s method to take a small step in the right direction

  3. This slightly cleaner state becomes the input for the next step

  4. Repeat until we arrive at a clean prediction of the next game state
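A bare-bones version of that sampling loop might look like this (a sketch only: denoise stands in for the fully preconditioned U-Net call that returns an estimate of the clean frame, and the noise schedule is assumed to be a decreasing list ending at zero):

```python
import torch

@torch.no_grad()
def sample_next_frame(denoise, conditioning, sigmas, shape):
    """Predict the next game frame with a simple Euler sampler (sketch).

    denoise(x, sigma, conditioning) -> estimate of the clean frame x^0_{t+1}  (assumed signature)
    sigmas: decreasing noise levels, e.g. [80.0, ..., 0.0]
    """
    x = torch.randn(shape) * sigmas[0]                # 1. start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoise(x, sigma, conditioning)      # model's current guess of the clean frame
        d = (x - x0_hat) / sigma                      # derivative dx/dσ: points from clean toward noisy
        x = x + (sigma_next - sigma) * d              # 2./3. Euler step: σ decreases, so x gets cleaner
    return x                                          # 4. ≈ clean prediction of the next game state
```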

Recall that the diffusion world model is also responsible for predicting the reward for each action taken. The authors employ a separate model, Rφ, consisting of standard CNN and LSTM layers, to make those predictions.

The RL agent is trained using the actor-critic framework, with a backbone consisting of a CNN-LSTM, and its policy is updated with the REINFORCE method.
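For readers less familiar with REINFORCE, here is a generic policy-gradient update with a value baseline (a sketch of the general technique, not the authors’ exact implementation; policy, value_fn, and optimizer are assumed to be ordinary PyTorch modules and an optimizer over both of them):

```python
import torch

def reinforce_update(policy, value_fn, optimizer, trajectory, gamma=0.99):
    """Generic REINFORCE-with-baseline update on one imagined trajectory (sketch).

    trajectory: list of (observation_tensor, action_index, reward) tuples produced
                by rolling the policy out inside the diffusion world model.
    """
    observations, actions, rewards = zip(*trajectory)

    # Discounted returns G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    obs = torch.stack(observations)
    acts = torch.tensor(actions)

    log_probs = torch.log_softmax(policy(obs), dim=-1)       # actor's log π(a|o)
    chosen = log_probs.gather(1, acts.unsqueeze(1)).squeeze(1)

    values = value_fn(obs).squeeze(-1)                       # critic's baseline V(o)
    advantage = returns - values.detach()

    policy_loss = -(chosen * advantage).mean()               # reinforce actions with positive advantage
    value_loss = (returns - values).pow(2).mean()            # regress the critic toward the returns

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```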

How to Run DIAMOND Locally

It is remarkably easy to both run and train DIAMOND locally on consumer hardware. The authors trained DIAMOND on a consumer RTX 4090 GPU for 2.9 days, and the process only used a maximum of 12 gigabytes of memory.

There are other paradigms for diffusion world modeling very similar to DIAMOND, which we will cover in the next section. However, DIAMOND is by far the easiest and cheapest model to run, with very respectable results. Additionally, DIAMOND comes with a trained reinforcement learning agent capable of playing the game generated by the diffusion model, and there’s an option for humans to take over and play in the game “imagined” by the diffusion model.

To set up the Python environment for running DIAMOND, clone the official repository, create a fresh virtual environment (conda or Python’s venv both work), and install the project’s requirements.

To run the Atari models, run python src/play.py --pretrained and the script will prompt you to select one of the twenty-five available Atari environments.

To run the CS:GO environment, check out the csgo branch with git checkout csgo and then run python src/play.py. Note that on Apple Silicon Macs, we need to set the CPU fallback flag for the MPS backend by running export PYTORCH_ENABLE_MPS_FALLBACK=1 before running the play.py script.

We can launch a training run on a CUDA-capable device with the following command: python src/main.py env.train.id=BreakoutNoFrameskip-v4 common.devices=0.

What is the Future of Diffusion World Models?

Obviously, DIAMOND’s generated worlds are far from perfect, especially for complex 3D scenes like CS:GO.

But with the increased attention from the research community in the area of diffusion for world modeling, we can expect improvements to be rapid and significant.

A couple of months after the release of DIAMOND, Decart and Etched developed OASIS, a diffusion world model that simulates a full 3D version of Minecraft. A demo of the full model has been released, while the weights of a downscaled, 500-million-parameter model are open-sourced.

OASIS suffers from issues similar to those of DIAMOND’s CS:GO model, but the overall quality is much better. Additionally, simulating Minecraft, with actions ranging from mining, placing, and attacking to navigating its complex and diverse 3D terrain, is much more difficult than simulating CS:GO.

Demo gameplay of OASIS from the authors’ website

Towards the end of 2024, Google DeepMind updated their Genie model with Genie 2, showcasing a model capable of generating accurate, stunning, and playable 3D environments that are not restricted to a single game or even a single genre.

Demo generation of Google’s Genie 2 from the blog post

Genie 2 is nothing less than a giant leap in the field of diffusion world modeling. It solves some of the critical flaws of previous diffusion world models: its long-horizon memory allows it to accurately re-render parts of the world that are no longer in view. With models like DIAMOND or OASIS, one quick look up at the sky or into a wall and the entire world will morph into something else. Genie 2, by contrast, is able to maintain a consistent world for up to a minute.

In the second half of 2024, Google also released GameNGen, a diffusion model that simulates the complete game of DOOM. The results are impressive, with gameplay staying consistent and accurate for over a minute.

The space of diffusion world modeling is exciting, and its potential applications are vast. We may still be years away from neural networks generating complete AAA games or serving as training environments for all reinforcement learning agents, but we cannot deny the speed of innovation in the area and the impressive showcases that keep flooding in.
