Large-scale Training of Generative Models on Video Data: Leveraging Transformer Architecture for Realistic Simulations
Training Text-Conditional Diffusion Models
- Researchers have developed a method for training generative models on video data.
- The models themselves are text-conditional diffusion models (a minimal training-step sketch follows this list).
- These models are trained jointly on videos and images with varying durations, resolutions, and aspect ratios.
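To make the idea of text-conditional diffusion on visual latents concrete, below is a minimal, hypothetical sketch of a single training step. The `denoiser` module, the latent tensor shapes, and the linear noise schedule are assumptions for illustration, not the report's actual implementation.

```python
# Illustrative sketch only: a DDPM-style, text-conditional training step on latent codes.
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, latents, text_embeddings, num_timesteps=1000):
    """One text-conditional diffusion training step.

    denoiser: a network predicting the added noise given (noisy latents, t, text).
    latents: visual latent codes, e.g. shape (batch, channels, frames, height, width).
    text_embeddings: conditioning vectors produced by a text encoder (assumed).
    """
    batch = latents.shape[0]

    # Sample a random diffusion timestep for each example in the batch.
    t = torch.randint(0, num_timesteps, (batch,), device=latents.device)

    # Simple linear-beta noise schedule (an assumption for illustration).
    betas = torch.linspace(1e-4, 0.02, num_timesteps, device=latents.device)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    a = alphas_cumprod[t].view(batch, *([1] * (latents.dim() - 1)))

    # Corrupt the latents with Gaussian noise at the sampled timestep.
    noise = torch.randn_like(latents)
    noisy_latents = a.sqrt() * latents + (1.0 - a).sqrt() * noise

    # Ask the denoiser to recover the noise, conditioned on the text embeddings.
    predicted_noise = denoiser(noisy_latents, t, text_embeddings)
    return F.mse_loss(predicted_noise, noise)
```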
Leveraging Transformer Architecture
- The researchers use a transformer architecture that operates on spacetime patches of video and image latent codes (see the patchification sketch after this list).
- This patch-based approach enables the generation of high-quality video.
- The largest model developed, named Sora, is capable of generating a minute of high-fidelity video.
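The patchification step can be illustrated with a short sketch: a video latent tensor is cut into spacetime patches and flattened into a token sequence that a transformer can attend over. The patch sizes and tensor shapes below are assumptions for illustration, not Sora's actual configuration.

```python
# Hypothetical sketch: turning a video latent into a sequence of spacetime patch tokens.
import torch

def to_spacetime_patches(latents, pt=2, ph=4, pw=4):
    """Reshape latents of shape (batch, channels, frames, height, width)
    into a sequence of flattened spacetime patches: (batch, num_patches, patch_dim)."""
    b, c, f, h, w = latents.shape
    assert f % pt == 0 and h % ph == 0 and w % pw == 0, "dims must divide patch sizes"
    x = latents.reshape(b, c, f // pt, pt, h // ph, ph, w // pw, pw)
    # Group the patch-index axes together and the within-patch axes together.
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)
    return x.reshape(b, (f // pt) * (h // ph) * (w // pw), c * pt * ph * pw)

# Example: a 16-frame, 32x32, 4-channel latent video becomes 8*8*8 = 512 patch tokens.
tokens = to_spacetime_patches(torch.randn(1, 4, 16, 32, 32))
print(tokens.shape)  # torch.Size([1, 512, 128])
```

Because varying durations, resolutions, and aspect ratios simply change the number of patch tokens, the same transformer can in principle consume them all.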
Promising Path for Building Simulators of the Physical World
- The results of this research suggest that scaling video generation models is a promising path toward building general-purpose simulators of the physical world.