
Introduction
This article explores the evolution of AI beyond language models toward more comprehensive “World Foundation Models” that can understand and interpret actions in visual data. Recent research from Harvard University, Google DeepMind, MIT, IBM, and Hong Kong University of Science and Technology focuses on developing AI systems that can understand context-invariant actions across different video scenarios.
Adaptable World Foundation Models: A New Paradigm in AI
What Are AI World Foundation Models?
World Foundation Models represent the next evolution in AI, moving beyond text-only large language models (LLMs) toward comprehensive systems that understand and interact with the world through multiple modalities. These models aim to create a unified understanding of actions, visuals, and context across different scenarios.
Unlike LLMs that operate primarily on text, World Foundation Models process and understand:
- Visual information and video content
- Actions and physical movements
- Context transfers between different scenarios
- Temporal relationships and predictions
Key Components of Adaptable World Foundation Models
- Latent Action Representation
At the core of these models is the ability to extract context-invariant actions from visual data. For example, the action of “picking up an object” can be understood as the same fundamental action whether someone is picking up a pencil, a glass of water, or a coffee cup.
The models encode these actions in a latent space that represents the abstract concept of the action, separate from the specific objects or context.
- Continuous vs. Discrete Action Spaces
Recent research explores two approaches (a minimal sketch contrasting them follows this list):
- Discrete action spaces (as in Nvidia’s N1 model): actions are quantized into a fixed set of categories
- Continuous action spaces (as in the “AdaWorld” research): actions exist on a continuous spectrum, allowing for more nuanced representation
- Autoregressive Prediction
These models can predict future states based on current observations and extracted actions. For example:
- Given a video frame showing a hand approaching a pencil on a table
- And an action vector representing “lift up from table”
- The model can generate the next frame showing the hand holding the pencil slightly above the table
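Returning to the discrete-versus-continuous distinction above, the following is a minimal PyTorch sketch of the two options. The dimensions, codebook size, and variable names are illustrative assumptions and do not come from the N1 or AdaWorld code.

```python
# Illustrative contrast between a discrete (vector-quantized) action code and a
# continuous latent action vector. All shapes and names are assumptions.
import torch

action_dim = 32          # assumed latent action dimensionality
codebook_size = 256      # assumed number of discrete action codes

# Hypothetical encoder output describing the change between frame t and frame t+1.
pre_latent = torch.randn(1, action_dim)

# Discrete action space: snap the encoding to the nearest codebook entry.
codebook = torch.randn(codebook_size, action_dim)
distances = torch.cdist(pre_latent, codebook)           # (1, codebook_size)
code_index = distances.argmin(dim=-1)                   # which discrete action was taken
discrete_action = codebook[code_index]                  # quantized action vector

# Continuous action space: treat the encoding as a Gaussian and sample from it.
mu = pre_latent
logvar = torch.zeros_like(pre_latent)                   # assumed second encoder head
continuous_action = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

print(code_index.item(), discrete_action.shape, continuous_action.shape)
```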
Technical Implementation
Two-Step Process
- Latent Action Auto-encoder:
- Input: Video frames at time T and T+1
- Process: Encode the action occurring between frames into a vector representation
- Output: A latent action vector (e.g., “jump vector”)
- Diffusion-Based World Model:
- Uses the extracted latent actions as conditions
- Pre-trains an autoregressive world model
- Enables prediction of future states based on current observations and actions (a minimal training sketch follows below)
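As a concrete illustration of this two-step process, here is a heavily simplified PyTorch sketch: a latent action auto-encoder that maps a pair of frames to an action vector, and an action-conditioned denoiser trained with a toy diffusion-style noise-prediction objective. The architecture, shapes, and noise schedule are assumptions for readability, not the paper's implementation.

```python
# Simplified sketch of the two-step idea with toy 3x64x64 frames.
# Step 1: encode the change between frame t and t+1 into a latent action.
# Step 2: train a denoiser to predict frame t+1 conditioned on frame t and that action.
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    def __init__(self, action_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),   # two RGB frames stacked
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, action_dim),
        )

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=1))     # (B, action_dim)

class ActionConditionedDenoiser(nn.Module):
    def __init__(self, action_dim=32, cond_dim=16):
        super().__init__()
        self.cond = nn.Linear(action_dim + 1, cond_dim)            # latent action + timestep
        self.net = nn.Sequential(
            nn.Conv2d(6 + cond_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, noisy_next, frame_t, action, t):
        cond = self.cond(torch.cat([action, t[:, None]], dim=1))   # (B, cond_dim)
        cond = cond[:, :, None, None].expand(-1, -1, *noisy_next.shape[2:])
        return self.net(torch.cat([noisy_next, frame_t, cond], dim=1))

# One toy training step on random frames standing in for video data.
frame_t, frame_t1 = torch.randn(2, 4, 3, 64, 64)
encoder, denoiser = LatentActionEncoder(), ActionConditionedDenoiser()
action = encoder(frame_t, frame_t1)                                # step 1: latent action
t = torch.rand(4)                                                  # diffusion time in [0, 1]
alpha = (1 - t)[:, None, None, None]                               # toy noise schedule
noise = torch.randn_like(frame_t1)
noisy_next = alpha.sqrt() * frame_t1 + (1 - alpha).sqrt() * noise
loss = ((denoiser(noisy_next, frame_t, action, t) - noise) ** 2).mean()  # step 2: denoising loss
loss.backward()
```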
Example in Practice
Imagine a video game character jumping (a toy inference sketch follows this list):
- The model observes a frame of the character standing and a frame of the character in mid-air
- It encodes this as a “jump” action in latent space
- Given a new frame of the character standing, plus the “jump” action vector
- The model can generate a realistic prediction of what the next frame should look like
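A toy inference loop for this jump example might look like the snippet below. The two functions are hypothetical stand-ins for the trained latent action encoder and the diffusion world model's sampler; they only illustrate the data flow, not the actual models.

```python
# Toy inference sketch: extract a latent "jump" action from a reference clip and
# reuse it in a new scene. Both functions are hypothetical stand-ins.
import torch

def encode_action(frame_t, frame_t1):
    # Stand-in for the trained latent action auto-encoder.
    return torch.randn(frame_t.shape[0], 32)

def predict_next_frame(frame_t, action):
    # Stand-in for sampling the diffusion-based world model.
    return frame_t + 0.01 * torch.randn_like(frame_t)

# 1. Encode the "jump" action from a reference pair (standing frame -> mid-air frame).
standing_ref, midair_ref = torch.randn(2, 1, 3, 64, 64)
jump_action = encode_action(standing_ref, midair_ref)

# 2. Condition on a new frame of the character standing plus the same action vector.
new_standing_frame = torch.randn(1, 3, 64, 64)
predicted_next = predict_next_frame(new_standing_frame, jump_action)
print(predicted_next.shape)   # same resolution as the input frame
```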
Real-World Applications
These adaptable world models enable numerous applications:
- Robotics: Robots can learn transferable skills by understanding actions independent of specific objects
- Video Understanding: Systems can anticipate what will happen next in videos, useful for security monitoring or content analysis
- Game Development: AI can generate realistic responses to player actions across different game environments
- AR/VR: Enhancing mixed reality experiences with predictive physics and interactions
- Healthcare: Analyzing movement patterns for physical therapy or monitoring patient activities
Example: Transferable Action Learning
Consider a robot trained to pick up objects:
Traditional approach: The robot needs separate training for picking up cups, pens, books, etc.
With adaptable world models: The robot learns the abstract “pick up” action once, then applies it to any object without additional training. This works because:
- The latent action space captures the essence of “picking up”
- The diffusion model can generate appropriate predictions for new objects
- The representation is context-invariant
Current Research Landscape
Major players in this field include:
- Nvidia (N1 model with LAPA – Latent Action Representation Space)
- Harvard University, Google DeepMind, MIT, IBM, and Hong Kong University of Science and Technology (the “AdaWorld” research)
- Stanford University and University of Toronto (reasoning from latent sources)
Most models utilize some combination of:
- Transformer architectures (like Genie by Google DeepMind)
- Variational auto-encoders
- Diffusion models
- Flow matching techniques (a minimal flow matching sketch follows below)
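Since flow matching appears in this list (and again in the video summary below), here is a minimal conditional flow matching training step: a small network is regressed onto the constant velocity of straight-line paths from noise to data. The toy MLP and 16-dimensional samples are assumptions for illustration only.

```python
# Minimal conditional flow matching sketch: learn a velocity field that moves
# noise samples toward data samples along straight-line paths.
import torch
import torch.nn as nn

dim = 16
velocity_net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

x1 = torch.randn(64, dim)          # batch of "data" points (e.g., latent frames or actions)
x0 = torch.randn(64, dim)          # matching batch of noise samples
t = torch.rand(64, 1)              # random times in [0, 1]

x_t = (1 - t) * x0 + t * x1        # point on the straight path from noise to data
target_velocity = x1 - x0          # constant velocity of that path

pred = velocity_net(torch.cat([x_t, t], dim=1))
loss = ((pred - target_velocity) ** 2).mean()
loss.backward()
```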
Video about World Foundation Models (WFM)
Summary of the above video
Foundation Model Evolution
The video begins by explaining how AI is “breaking free” from text-only large language models (LLMs) to develop a more comprehensive understanding of the world. The presenter references Nvidia’s N1 model as a recent example of a world foundation model that uses a Latent Action Representation Space (LAPA) with vector-quantized auto-encoding, and explains how flow matching techniques are used to train diffusion transformers to generate the necessary actions.
Technical Approach
The presenter outlines a two-step approach used in the research paper “AdaWorld: Learning Adaptable World Models with Latent Actions” (March 24, 2025):
- Latent Action Auto-encoder: Using a variational auto-encoder with a β-weighted KL term (β-VAE, 2017), this component extracts context-invariant actions from unlabeled videos, creating a continuous latent action representation space. Unlike Nvidia’s N1, which uses a discrete action approach, this research focuses on a continuous latent space. (A minimal β-VAE objective sketch follows this list.)
- Autoregressive World Model: The model conditions predictions on latent actions, enabling fine-grained frame-level control. It employs a diffusion-based framework that can generate high-quality predictions by transferring actions between different video contexts.
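For reference, the β-weighted VAE objective mentioned above is simply a reconstruction term plus β times the KL divergence to the prior. Below is a minimal sketch; the tensor shapes and the β value are arbitrary placeholders, not values from the paper.

```python
# Minimal beta-VAE objective: reconstruction loss plus a beta-weighted KL term
# that pushes the continuous latent action space toward a well-behaved prior.
import torch

def beta_vae_loss(recon, target, mu, logvar, beta=4.0):
    recon_loss = ((recon - target) ** 2).mean()
    # KL divergence between N(mu, sigma^2) and the standard normal prior N(0, I).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl

# Toy usage: random tensors stand in for decoder output and encoder heads.
recon, target = torch.randn(2, 8, 3, 64, 64)   # reconstructed vs. ground-truth frames
mu, logvar = torch.randn(2, 8, 32)             # assumed 32-dimensional latent action
print(beta_vae_loss(recon, target, mu, logvar).item())
```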
Architecture and Implementation
The research utilizes the Genie architecture from Google DeepMind (February 2024), featuring temporal attention, spatial attention, and feed-forward layers in a spatial-temporal transformer block configuration. The presenter explains that the goal is to extract transferable actions (like “picking up an object”) that can be applied across different contexts regardless of the specific object.
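A simplified version of such a spatial-temporal transformer block is sketched below: temporal self-attention across frames, spatial self-attention across patches within a frame, then a feed-forward layer. The dimensions, pre-norm layout, and absence of causal masking are assumptions, not DeepMind's implementation.

```python
# Simplified spatial-temporal transformer block: temporal attention over the time
# axis, spatial attention over patch tokens, then a feed-forward layer.
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        b, t, s, d = x.shape                     # (batch, frames, patches, embedding dim)
        # Temporal attention: each spatial patch attends across time steps.
        h = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        h = h + self.temporal_attn(self.norm1(h), self.norm1(h), self.norm1(h))[0]
        x = h.reshape(b, s, t, d).permute(0, 2, 1, 3)
        # Spatial attention: each frame's patches attend to one another.
        h = x.reshape(b * t, s, d)
        h = h + self.spatial_attn(self.norm2(h), self.norm2(h), self.norm2(h))[0]
        x = h.reshape(b, t, s, d)
        # Position-wise feed-forward layer with a residual connection.
        return x + self.ff(self.norm3(x))

tokens = torch.randn(2, 4, 16, 256)              # 2 clips, 4 frames, 16 patches per frame
print(SpatialTemporalBlock()(tokens).shape)      # unchanged shape: (2, 4, 16, 256)
```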
Reflections on AI Development
The presenter concludes with a philosophical reflection on AI development approaches, questioning whether the industry needs to train world foundation models on all available social media and video content, or if more targeted, intelligent approaches might be more effective. They express concern about potentially wasting years on inefficient development paths as happened with language models.
Future Directions and Considerations
As these world foundation models evolve, several questions emerge:
- Data Requirements: Do we need to train on all available video content, or can we be more selective and efficient?
- Computational Efficiency: How can we make these complex models more accessible?
- Transfer Learning: How far can we push the boundaries of context transfer?
- Multimodal Integration: How do we combine understanding across vision, language, touch, and other senses?
These adaptable world models represent a significant step toward AI systems that can truly understand and interact with the world in ways that more closely resemble human understanding, moving well beyond the capabilities of language-only models.
Conclusion
World Foundation Models represent a significant step forward in AI’s ability to understand and interact with the physical world. By extracting context-invariant actions and enabling transfer learning across different scenarios, these models are bringing us closer to AI systems that can understand the world more like humans do—recognizing patterns, predicting outcomes, and adapting to new situations without extensive retraining. As research continues to advance, we can expect these models to become increasingly sophisticated, potentially serving as the foundation for more capable and flexible AI systems across numerous domains.
Key Takeaways
- AI is evolving beyond language-only models to understand actions and visual data
- The latest research uses a continuous latent action representation to model actions in videos
- Context-invariant actions allow AI to transfer learning across different scenarios
- The approach combines variational auto-encoders with diffusion models
- Code for this research is publicly available on GitHub under the Apache 2.0 license
- The presenter encourages a more thoughtful approach to AI development instead of simply processing more data
Related References
- MIT computer science class 6.S184 on generative AI with stochastic differential equations
- Stanford University and University of Toronto’s work on reasoning from latent sources
- Nvidia’s N1 model and LAPA (Latent Action Representation Space)
- Google DeepMind’s Genie architecture (2024)
- Research paper: “AdaWorld: Learning Adaptable World Models with Latent Actions” (March 24, 2025)