FramePack – Packing Input Frame Context in Next-Frame Prediction Models for Offline Video Generation With Low Resource Requirements

https://lllyasviel.github.io/frame_pack_gitpage/

  • Diffuse thousands of frames at a full 30 fps with 13B models using only 6 GB of laptop GPU memory (see the packing sketch after this list).
  • Finetune a 13B video model at batch size 64 on a single 8×A100/H100 node for personal or lab experiments.
  • A personal RTX 4090 generates at 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (with TeaCache).
  • No timestep distillation.
  • Video diffusion, but feels like image diffusion.
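
The flat-memory claim comes from how the input frame context is packed: each next-frame prediction sees recent frames at high resolution and compresses older frames progressively harder, so the total context length converges to a fixed bound no matter how long the video grows. The sketch below illustrates that scheduling idea with a hypothetical geometric token budget; the function name, parameters, and the 0.5 decay schedule are illustrative assumptions, not the project's actual API.

```python
# Hypothetical sketch of the frame-packing idea behind the constant-memory
# claim: older frames receive geometrically fewer context tokens, so the
# total context length is bounded regardless of video length. All names
# and the 0.5 decay schedule are illustrative assumptions.

def packed_context_lengths(num_past_frames: int,
                           full_tokens: int = 1536,
                           decay: float = 0.5) -> list[int]:
    """Token budget per past frame, most recent first."""
    budgets = []
    for i in range(num_past_frames):
        # Each step back in time shrinks the token budget; once the
        # budget truncates to zero, that frame contributes no context.
        budgets.append(int(full_tokens * decay ** i))
    return budgets

if __name__ == "__main__":
    for n in (10, 100, 1000):
        total = sum(packed_context_lengths(n))
        print(f"{n:5d} past frames -> {total} context tokens")
    # Prints roughly the same total in every case: the geometric series
    # is bounded by full_tokens / (1 - decay), which is why per-step
    # memory stays flat as the video gets longer.
```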

Image-to-5-Seconds (30 fps, 150 frames)