GameFactory

Creating New Games with Generative Interactive Videos

Jiwen Yu1*†    Yiran Qin1*   
Xintao Wang2✉    Pengfei Wan2    Di Zhang2    Xihui Liu1✉   
1The University of Hong Kong    2Kuaishou Technology
†Intern at KwaiVGI, Kuaishou Technology    *Equal Contribution    ✉Corresponding Authors
TL;DR: We present GameFactory, a generalizable world model that learns from a small-scale dataset of Minecraft game videos. By leveraging the prior knowledge of a pretrained video diffusion model, it can create new games in an open domain.
Gallery

In the videos, green-lit keys indicate key presses. Mouse movements represent changes in view angle: horizontal movement controls yaw and vertical movement controls pitch.
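
As a rough illustration of the mouse-to-view-angle mapping described above, the sketch below updates yaw and pitch from a single mouse delta. The function name `apply_mouse`, the `sensitivity` parameter, the sign conventions, and the clamping and wrapping ranges are all assumptions made for this example; the page itself only states that horizontal movement maps to yaw and vertical movement to pitch.

```python
def apply_mouse(yaw, pitch, dx, dy, sensitivity=0.5):
    """Update view angles (in degrees) from one mouse delta.

    Assumed conventions (not taken from the source): moving right
    increases yaw, moving up increases pitch, pitch is clamped to
    [-90, 90], and yaw wraps into [-180, 180).
    """
    # Horizontal movement changes yaw; wrap so the angle stays bounded.
    yaw = (yaw + dx * sensitivity + 180.0) % 360.0 - 180.0
    # Vertical movement changes pitch; clamp so the view cannot flip over.
    pitch = max(-90.0, min(90.0, pitch + dy * sensitivity))
    return yaw, pitch


# Under these conventions, an "Up Right" caption corresponds to dx > 0, dy > 0.
yaw, pitch = apply_mouse(0.0, 0.0, dx=100, dy=40)
print(yaw, pitch)
```

With these conventions, the four caption labels (Up Right, Down Left, Up Left, Down Right) are simply the four sign combinations of `dx` and `dy`.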

Open-Domain Generalized Results

Prompt: "Walking through a scorching lava field in first person perspective, feeling the intense heat radiating from the glowing molten rivers and watching the heat waves distort the air as the ground beneath cracks and smolders with fiery energy."

Prompt: "Walking through a narrow canyon in first person perspective, with steep rock walls towering on either side, sunlight barely filtering through the gap above, and the sound of rushing water echoing through the confined space as you carefully navigate the rocky terrain."

Prompt: "A large Saint Bernard walks steadily across the vast, snow-covered expanse of a mountain range. The rugged, towering peaks in the distance create a dramatic backdrop, while the cold mountain air seems to amplify the serene stillness of the scene."

Prompt: "A tan and white dog lounges comfortably on a large, dark-colored dog bed, positioned in the middle of a sprawling grassy field. Its tongue lolls out in a relaxed expression, and the open sky stretches endlessly above, bordered by distant trees that frame the serene, pastoral setting."

Prompt: "Walking through a serene bamboo forest in first person perspective, with towering green stalks gently swaying in the breeze, sunlight filtering through the leaves to create intricate patterns on the forest floor, and the soft rustling sound of bamboo leaves adding to the tranquil atmosphere."

Prompt: "Running along a cliffside path in a tropical island in first person perspective, with turquoise waters crashing against the rocks far below, the salty scent of the ocean carried by the breeze, and the sound of distant waves blending with the calls of seagulls as the path twists and turns along the jagged cliffs."

Prompt: "Standing in the middle of a colorful canyon in first person perspective, surrounded by towering layered rock formations glowing in shades of red, orange, and gold under the midday sun, with the faint sound of wind whistling through narrow crevices and the dry heat of the desert pressing against your skin."

Prompt: "A young bear stands next to a large tree in a grassy meadow, its dark fur catching the soft daylight. The bear seems poised, observing its surroundings in a tranquil landscape, with rolling hills and sparse trees dotting the background under a pale blue sky."

Prompt: "A giant panda rests peacefully under a blooming cherry blossom tree, its black and white fur contrasting beautifully with the delicate pink petals. The ground is lightly sprinkled with fallen blossoms, and the tranquil setting is framed by the soft hues of the blossoms and the grassy field surrounding the tree."

Prompt: "A lion strides confidently along a dusty, rugged path, its golden coat gleaming under the sunlight. The muscles along its powerful frame ripple with each step, embodying strength and grace. Sparse vegetation and dry terrain stretch around, emphasizing the raw beauty of the wilderness."

Prompt: "A tall tree stands proudly in a sprawling green meadow, its wide branches casting gentle shadows on the grassy terrain. Surrounding hills and distant patches of vegetation create a peaceful, open landscape under a clear blue sky."

Prompt: "A vibrant lionfish hovers near the entrance of a shadowy underwater cave, its long, spiky fins spreading out like a delicate fan. The surrounding seabed is dimly lit, emphasizing the lionfish's striking colors against the dark, rugged backdrop of the reef."

Prompt: "An emerald green velvet accent chair sits prominently in the center of a minimalist room, its rich texture contrasting with the neutral tones of the floor and walls. Behind it, a tall bookshelf with wooden panels adds depth to the elegant, understated setting."

Prompt: "A sleek black horse moves gracefully across an open field, its mane flowing in the gentle breeze. The golden glow of the evening sun bathes the landscape, casting long shadows over the swaying grass and highlighting the horse's powerful frame against the vast, serene backdrop."

Prompt: "A striking white horse with a long, flowing mane stands calmly beside a rustic wooden fence. Its coat gleams under the clear blue sky, while the surrounding open field stretches into the distance, adding to the serene and picturesque scene."

Prompt: "A baby squirrel peeks out from a small hollow in the rock, its tiny eyes bright with curiosity. The warm tones of the rugged stone frame the delicate creature, as sunlight gently highlights its soft fur, creating a moment of quiet charm in nature."

Action-Controlled Long Video Results

Example #1

Example #2

Example #3

Example #4

Our Method

We present GameFactory, a generalizable world model that learns from a small-scale dataset of Minecraft game videos.
By leveraging the prior knowledge of a pretrained video diffusion model, it can create new games in an open domain.

Our work consists of several key components and innovations:

Overview: As shown in Figure 1, GameFactory builds upon pre-trained video generation models, extending them with a pluggable action control module. This design effectively leverages both large-scale unlabeled data and small-scale high-quality Minecraft action data.

Action Control Module: As illustrated in Figure 2, the module plugs into the Diffusion Transformer blocks and uses separate control mechanisms for mouse and keyboard inputs. Because temporal compression leaves the frame latents at a coarser granularity than the per-frame action signals, we group the action signals to align them with the latents. A sliding window mechanism handles actions with delayed effects (e.g., a jump).
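
The grouping and sliding-window idea can be sketched in a few lines. This is only an illustration under stated assumptions: a uniform temporal compression `ratio` (video frames per latent frame) and a `window` measured in latent steps are hypothetical parameters, and the page does not specify their values or how the grouped actions are fused into the transformer blocks.

```python
def group_actions(actions, ratio, window):
    """Collect the per-frame actions relevant to each latent frame.

    actions: list of per-frame action records (any type).
    ratio:   assumed number of video frames per latent frame.
    window:  assumed number of latent steps each group spans, so that a
             delayed effect (e.g. a jump) still sees the action that
             triggered it.
    """
    if ratio <= 0 or window <= 0:
        raise ValueError("ratio and window must be positive")
    num_latents = len(actions) // ratio
    groups = []
    for i in range(num_latents):
        # Reach back (window - 1) latent steps, clipped at the clip start.
        start = max(0, (i - window + 1) * ratio)
        end = (i + 1) * ratio
        groups.append(actions[start:end])
    return groups


# Frame indices stand in for action records in this toy example.
groups = group_actions(list(range(12)), ratio=4, window=2)
for i, group in enumerate(groups):
    print(i, group)
```

In this toy setup each latent frame sees its own four frames of actions plus the four before them, which is one simple way for an action issued earlier to influence later frames.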

Multi-Phase Training Strategy: Figure 3 outlines our four-phase approach to scene generalization: open-domain pretraining, game-specific style learning, action control training, and finally open-domain action-controlled generation. This strategy adds action control while preserving the model's open-domain scene generation ability.
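
One way to picture the four phases is as a schedule that says which parts of the model are updated and which are active at each step. The sketch below is a hypothetical reading, not the authors' configuration: it assumes the game style is held in a separate set of adapter weights that can be switched off at inference, which this page does not state, and all field and phase names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    data: str
    train_base: bool      # update the pretrained video model?
    train_style: bool     # update the (assumed) style adapter?
    train_action: bool    # update the action control module?
    style_active: bool    # is the (assumed) style adapter applied?


# Phase names and data sources follow Figure 3; the flags are assumptions.
PHASES = [
    Phase("phase0_pretrain", "open-domain video", True, False, False, False),
    Phase("phase1_game_style", "Minecraft video", False, True, False, True),
    Phase("phase2_action_control", "Minecraft video with actions", False, False, True, True),
    Phase("phase3_open_domain_generation", "none (inference)", False, False, False, False),
]


def trainable_parts(phase):
    """Return the names of the components updated in this phase."""
    flags = {
        "base": phase.train_base,
        "style": phase.train_style,
        "action": phase.train_action,
    }
    return [name for name, on in flags.items() if on]


for phase in PHASES:
    print(phase.name, trainable_parts(phase), "style on:", phase.style_active)
```

Under this reading, only one component is updated per training phase, which is one plausible way to keep action control separate from the game's visual style.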

Autoregressive Generation: As demonstrated in Figure 4, our autoregressive generation mechanism creates continuous gameplay by using previous frames as conditions for generating new ones.
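
The inference loop behind this can be sketched as follows, using the (k+1)-frame conditioning described in the Figure 4 caption. The `predict_chunk` callable is a hypothetical stand-in for the video model, the chunk size is arbitrary, and action conditioning is omitted for brevity; none of these names come from the source.

```python
def generate_long_video(initial_frames, total_frames, k, predict_chunk):
    """Extend a clip autoregressively.

    predict_chunk(cond_frames) is a hypothetical model call: it receives
    the latest k + 1 frames and returns a non-empty list of new frames.
    """
    if len(initial_frames) < k + 1:
        raise ValueError("need at least k + 1 initial frames")
    video = list(initial_frames)
    while len(video) < total_frames:
        # Condition on the most recent k + 1 frames only.
        cond = video[-(k + 1):]
        new_frames = predict_chunk(cond)
        if not new_frames:
            raise RuntimeError("model returned no frames")
        video.extend(new_frames)
    # The last chunk may overshoot, so trim to the requested length.
    return video[:total_frames]


def toy_model(cond, n_new=4):
    # Toy stand-in: "frames" are integers and each new frame continues the count.
    last = cond[-1]
    return [last + i for i in range(1, n_new + 1)]


video = generate_long_video([0, 1, 2], total_frames=10, k=2, predict_chunk=toy_model)
print(video)
```

Because each step conditions only on the latest frames, the loop can in principle run for as long as desired, though the page makes no claim here about how quality holds up over long rollouts.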

Figure 1: GameFactory overview. The blue section shows the pre-trained model's generation ability, while the green section shows the pluggable action control module.
Figure 2: Action Control Module architecture. It integrates into transformer blocks with separate mechanisms for mouse and keyboard inputs.
Action sequences are grouped to handle temporal compression and delayed effects.
Figure 3: Multi-phase training pipeline. Phase 0: pretrain on open-domain data; Phase 1: finetune for game style; Phase 2: train action control; Phase 3: generate action-controlled open-domain content.
Figure 4: Autoregressive video generation process. Training uses the first (k+1) frames as conditions to predict the remaining frames, while inference iteratively conditions on the latest frames to generate new ones.
More Results


Open-Domain Generalized Results
Prompt: "Climbing through a dense, overgrown swamp in first person perspective, with thick vines draping down from towering trees, the humid air filled with the buzzing of insects, and murky water pooling at your feet as you push forward through the tangled vegetation."

W: Forward

S: Backward

Shift: Sneak

Ctrl: Sprint

A: Move Left

D: Move Right

Space: Jump at 25th, 50th frame

Space: Jump at 0th, 10th, 30th, 50th frame

Mouse movements: Up Right

Mouse movements: Down Left

Mouse movements: Up Left

Mouse movements: Down Right

Prompt: "Walking along the edge of an enormous glacier in first person perspective, with towering ice cliffs shimmering in the bright sunlight, frozen winds whipping past, and the sound of cracking ice reverberating through the stillness of the arctic expanse."

W: Forward

S: Backward

Shift: Sneak

Ctrl: Sprint

A: Move Left

D: Move Right

Space: Jump at 15th, 45th frame

Space: Jump at 0th, 10th, 30th, 50th frame

Mouse movements: Up Right

Mouse movements: Down Left

Mouse movements: Up Left

Mouse movements: Down Right

Prompt: "Walking through a wheat field at sunrise in first person perspective, with golden stalks swaying gently in the morning breeze, the warmth of the first rays of sunlight brushing your face, and the soft rustle of the crop accompanied by the distant calls of birds waking in the horizon."

W: Forward

S: Backward

Shift: Sneak

Ctrl: Sprint

A: Move Left

D: Move Right

Space: Jump at 0th, 25th, 50th frame

Space: Jump at 0th, 10th, 30th, 50th frame

Mouse movements: Up Right

Mouse movements: Down Left

Mouse movements: Up Left

Mouse movements: Down Right

Prompt: "In a maple forest in first person perspective, surrounded by towering trees with fiery red and orange foliage, the light filtering through the leaves casting warm patterns on the ground, and the distant sound of birds chirping blending with the rustling of leaves in the breeze."

W: Forward

S: Backward

Shift: Sneak

Ctrl: Sprint

A: Move Left

D: Move Right

Space: Jump at 15th, 30th, 45th, 60th frame

Space: Jump at 0th, 15th, 30th, 50th frame

Mouse movements: Up Right

Mouse movements: Down Left

Mouse movements: Up Left

Mouse movements: Down Right

In-Domain Minecraft Results

W: Forward

S: Backward

Shift: Sneak

Ctrl: Sprint

A: Move Left

D: Move Right

Space: Jump at 0th, 15th, 45th frame

Space: Jump at 0th, 15th, 30th, 45th, 60th frame

Mouse movements: Up Right

Mouse movements: Down Left

Mouse movements: Up Left

Mouse movements: Down Right
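
The control labels used throughout these results (a held key such as W, or Space pressed at specific frames) can be represented as a per-frame key sequence. The sketch below is a minimal illustration, assuming a binary vector per frame, a fixed key order, and that each jump is a single-frame press; the page does not describe the actual action encoding, and `build_key_sequence` and its parameters are hypothetical.

```python
# Key order is an assumption made for this example.
KEYS = ("W", "S", "A", "D", "Shift", "Ctrl", "Space")


def build_key_sequence(num_frames, held=(), taps=None):
    """Return one binary vector per frame marking which keys are pressed.

    held: keys pressed on every frame (e.g. "W" for walking forward).
    taps: mapping from key to the frame indices where it is pressed once
          (e.g. {"Space": [25, 50]} for "Jump at 25th, 50th frame").
    """
    taps = taps or {}
    index = {key: i for i, key in enumerate(KEYS)}
    seq = [[0] * len(KEYS) for _ in range(num_frames)]
    for key in held:
        for row in seq:
            row[index[key]] = 1
    for key, frames in taps.items():
        for f in frames:
            # Ignore frame indices that fall outside the clip.
            if 0 <= f < num_frames:
                seq[f][index[key]] = 1
    return seq


seq = build_key_sequence(60, held=("W",), taps={"Space": [25, 50]})
print(seq[0], seq[25])
```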