In the videos, green-lit keys indicate key presses.
Mouse movements represent changes in view angles:
horizontal movement controls yaw and vertical movement controls pitch.
We present GameFactory, a generalizable world model that learns action control from a small-scale dataset of Minecraft gameplay videos.
By leveraging the prior knowledge of a pretrained video diffusion model, it can generate new, action-controllable games in open-domain scenes.
Our work consists of several key components and innovations:
Overview: As shown in Figure 1, GameFactory builds on a pretrained video generation model and extends it with a pluggable action control module. This design leverages both large-scale unlabeled open-domain data and a small amount of high-quality Minecraft action data.
Action Control Module: Illustrated in Figure 2, our module integrates with Diffusion Transformer blocks through distinct control mechanisms for mouse and keyboard inputs. To address the granularity mismatch between per-frame action signals and frame latents, we apply group operations, and we adopt a sliding-window mechanism so that actions with delayed effects (e.g., jump) can influence subsequent frames; a sketch follows below.
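The sketch below illustrates the grouping and sliding-window ideas under stated assumptions: it presumes the video VAE compresses `compression` raw frames into one latent frame, and the class name `ActionGrouper`, the mean-pooling choice, and all hyperparameters are hypothetical placeholders, not the released implementation.

```python
import torch
import torch.nn as nn

class ActionGrouper(nn.Module):
    """Hypothetical sketch: align per-frame actions with compressed latents.

    Assumes `compression` raw frames map to one latent frame, and that each
    latent step pools over `window` action tokens (its own plus preceding
    ones) to capture delayed action effects such as a jump.
    """

    def __init__(self, action_dim: int, hidden_dim: int,
                 compression: int = 4, window: int = 3):
        super().__init__()
        self.compression = compression
        self.window = window
        # Group operation: fuse `compression` consecutive actions into one token.
        self.group_proj = nn.Linear(action_dim * compression, hidden_dim)

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        # actions: (B, T_frames, action_dim), T_frames = T_latent * compression
        b, t, d = actions.shape
        grouped = actions.reshape(b, t // self.compression, d * self.compression)
        tokens = self.group_proj(grouped)  # (B, T_latent, hidden_dim)
        # Sliding window: zero-pad the past, then let each latent step pool
        # over its own and the `window - 1` preceding action tokens, so that
        # earlier inputs can still affect later latent frames.
        pad = tokens.new_zeros(b, self.window - 1, tokens.size(-1))
        padded = torch.cat([pad, tokens], dim=1)
        windows = padded.unfold(1, self.window, 1)  # (B, T_latent, hidden, window)
        return windows.mean(dim=-1)  # simple pooling; the real fusion may differ
```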
Multi-Phase Training Strategy: Figure 3 outlines our four-phase training approach for scene generalization: open-domain pretraining, game-specific style learning, action control training, and finally open-domain action-controlled generation. This strategy equips the model with action control while preserving its open-domain scene generation ability.
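A minimal sketch of how such a phase schedule could decouple style from control is shown below; the attribute names `backbone`, `style_adapter`, and `action_module` are hypothetical placeholders for illustration, not the project's actual module names.

```python
def configure_phase(model, phase: str) -> None:
    """Hypothetical four-phase schedule: freeze everything, then unfreeze
    only the component trained in the current phase."""
    for p in model.parameters():
        p.requires_grad_(False)
    if phase == "pretrain":        # Phase 1: open-domain video pretraining
        for p in model.backbone.parameters():
            p.requires_grad_(True)
    elif phase == "style":         # Phase 2: game-specific style learning
        for p in model.style_adapter.parameters():
            p.requires_grad_(True)
    elif phase == "action":        # Phase 3: action control training; the
        # style adapter stays frozen so control decouples from Minecraft style
        for p in model.action_module.parameters():
            p.requires_grad_(True)
    elif phase == "open_domain":   # Phase 4: inference; drop the style
        # adapter but keep the action module for open-domain control
        model.style_adapter = None
```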
Autoregressive Generation: As demonstrated in Figure 4, our autoregressive generation mechanism produces continuous gameplay by conditioning each newly generated segment on previously generated frames.
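The following sketch shows one way such a chunked autoregressive rollout could look; `model.sample_chunk` is a hypothetical interface standing in for one conditional diffusion sampling pass, and the context and chunk sizes are illustrative.

```python
import torch

@torch.no_grad()
def autoregressive_rollout(model, seed_frames, actions, chunk=8, ctx=4):
    """Hypothetical rollout: each new chunk of frames is sampled conditioned
    on the last `ctx` generated frames plus the matching action slice."""
    frames = [seed_frames]            # seed_frames: (B, T0, C, H, W)
    t, total = seed_frames.size(1), actions.size(1)
    while t < total:
        context = torch.cat(frames, dim=1)[:, -ctx:]   # sliding context window
        act = actions[:, t : t + chunk]                # actions for this chunk
        new = model.sample_chunk(context=context, actions=act)
        frames.append(new)
        t += new.size(1)
    return torch.cat(frames, dim=1)                    # full generated video
```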