In the videos, green-lit keys indicate key presses.
Mouse movements represent changes in view angles:
horizontal movement controls yaw and vertical movement controls pitch.
We present GameFactory, a generalizable world model that learns action control from a small-scale dataset of Minecraft gameplay videos.
By leveraging the prior knowledge of a pretrained video diffusion model, it can generate new, action-controllable games in open-domain scenes.
Our work consists of several key components and innovations:
Overview: As shown in Figure 1, GameFactory builds on a pretrained video generation model and extends it with a pluggable action control module. This design leverages both large-scale unlabeled open-domain data and a small amount of high-quality Minecraft action data.
Action Control Module: Illustrated in Figure 2, our module integrates with Diffusion Transformer blocks through distinct control mechanisms for mouse and keyboard inputs. To address the granularity mismatch between per-frame action signals and frame latents, we apply group operations, and we adopt a sliding-window mechanism so that actions with delayed effects (e.g., jump) can influence subsequent frames; a sketch follows below.
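The sketch below illustrates the grouping and sliding-window ideas under stated assumptions: it presumes the video VAE compresses `compression` raw frames into one latent frame, and the class name `ActionGrouper`, the mean-pooling choice, and all hyperparameters are hypothetical placeholders, not the released implementation.

```python
import torch
import torch.nn as nn

class ActionGrouper(nn.Module):
    """Hypothetical sketch: align per-frame actions with compressed latents.

    Assumes `compression` raw frames map to one latent frame, and that each
    latent step pools over `window` action tokens (its own plus preceding
    ones) to capture delayed action effects such as a jump.
    """

    def __init__(self, action_dim: int, hidden_dim: int,
                 compression: int = 4, window: int = 3):
        super().__init__()
        self.compression = compression
        self.window = window
        # Group operation: fuse `compression` consecutive actions into one token.
        self.group_proj = nn.Linear(action_dim * compression, hidden_dim)

    def forward(self, actions: torch.Tensor) -> torch.Tensor:
        # actions: (B, T_frames, action_dim), T_frames = T_latent * compression
        b, t, d = actions.shape
        grouped = actions.reshape(b, t // self.compression, d * self.compression)
        tokens = self.group_proj(grouped)  # (B, T_latent, hidden_dim)
        # Sliding window: zero-pad the past, then let each latent step pool
        # over its own and the `window - 1` preceding action tokens, so that
        # earlier inputs can still affect later latent frames.
        pad = tokens.new_zeros(b, self.window - 1, tokens.size(-1))
        padded = torch.cat([pad, tokens], dim=1)
        windows = padded.unfold(1, self.window, 1)  # (B, T_latent, hidden, window)
        return windows.mean(dim=-1)  # simple pooling; the real fusion may differ
```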
Multi-Phase Training Strategy: Figure 3 outlines our four-phase training approach for scene generalization: open-domain pretraining, game-specific style learning, action control training, and finally open-domain action-controlled generation. This strategy equips the model with action control while preserving its open-domain scene generation ability.
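A minimal sketch of how such a phase schedule could decouple style from control is shown below; the attribute names `backbone`, `style_adapter`, and `action_module` are hypothetical placeholders for illustration, not the project's actual module names.

```python
def configure_phase(model, phase: str) -> None:
    """Hypothetical four-phase schedule: freeze everything, then unfreeze
    only the component trained in the current phase."""
    for p in model.parameters():
        p.requires_grad_(False)
    if phase == "pretrain":        # Phase 1: open-domain video pretraining
        for p in model.backbone.parameters():
            p.requires_grad_(True)
    elif phase == "style":         # Phase 2: game-specific style learning
        for p in model.style_adapter.parameters():
            p.requires_grad_(True)
    elif phase == "action":        # Phase 3: action control training; the
        # style adapter stays frozen so control decouples from Minecraft style
        for p in model.action_module.parameters():
            p.requires_grad_(True)
    elif phase == "open_domain":   # Phase 4: inference; drop the style
        # adapter but keep the action module for open-domain control
        model.style_adapter = None
```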
Autoregressive Generation: As demonstrated in Figure 4, our autoregressive generation mechanism produces continuous gameplay by conditioning each newly generated segment on previously generated frames.
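The following sketch shows one way such a chunked autoregressive rollout could look; `model.sample_chunk` is a hypothetical interface standing in for one conditional diffusion sampling pass, and the context and chunk sizes are illustrative.

```python
import torch

@torch.no_grad()
def autoregressive_rollout(model, seed_frames, actions, chunk=8, ctx=4):
    """Hypothetical rollout: each new chunk of frames is sampled conditioned
    on the last `ctx` generated frames plus the matching action slice."""
    frames = [seed_frames]            # seed_frames: (B, T0, C, H, W)
    t, total = seed_frames.size(1), actions.size(1)
    while t < total:
        context = torch.cat(frames, dim=1)[:, -ctx:]   # sliding context window
        act = actions[:, t : t + chunk]                # actions for this chunk
        new = model.sample_chunk(context=context, actions=act)
        frames.append(new)
        t += new.size(1)
    return torch.cat(frames, dim=1)                    # full generated video
```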