
{"id":8504,"date":"2026-06-12T09:09:00","date_gmt":"2026-06-12T01:09:00","guid":{"rendered":"https:\/\/meta-quantum.today\/?p=8504"},"modified":"2026-06-11T17:08:38","modified_gmt":"2026-06-11T09:08:38","slug":"yann-lecun-world-models-enabling-the-next-ai-revolution","status":"publish","type":"post","link":"https:\/\/meta-quantum.today\/?p=8504","title":{"rendered":"Yann LeCun \u2014 &#8220;World Models: Enabling the Next AI Revolution&#8221;"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Yann LeCun \u2014 Turing Award laureate, NYU professor, and former Meta Chief AI Scientist \u2014 lays out his case for why world models, not large language models, are the path to genuinely intelligent machines. He opens provocatively (&#8220;machine learning sucks&#8221;) by contrasting machine learning with human and animal learning: a teenager learns to drive in roughly 20 hours, while self-driving companies with millions of hours of training data still cannot reach Level 5 autonomy. This is the Moravec paradox in action \u2014 language and chess are easy for computers, while the messy, continuous, high-dimensional real world remains hard. 
The talk also marks a personal turning point: LeCun has left Meta to found a new company, AMI Labs, focused on &#8220;AI for the real world.&#8221;<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"720\" style=\"aspect-ratio: 1280 \/ 720;\" width=\"1280\" autoplay controls loop muted poster=\"https:\/\/meta-quantum.today\/wp-content\/uploads\/2026\/06\/WorldModels-01-scaled.jpg\" src=\"https:\/\/meta-quantum.today\/wp-content\/uploads\/2026\/06\/WorldModels-1.mp4\" playsinline><\/video><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">New Features and Concepts<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Intelligence as adaptivity, not accumulation.<\/strong> Channeling Jean Piaget (&#8220;intelligence is what you do when you don&#8217;t know&#8221; \u2014 a quote LeCun notes is apocryphal), he argues intelligence is neither declarative knowledge (what LLMs accumulate) nor a collection of skills, but the ability to learn new tasks quickly with little training. On this basis he calls the notion of AGI &#8220;complete nonsense&#8221; \u2014 human intelligence itself is specialized and adaptive.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The data argument against LLM scaling.<\/strong> A striking calculation: a modern LLM trains on ~20 trillion words (~10\u00b9\u2074 bytes), equivalent to 400,000 years of human reading. A four-year-old child has absorbed roughly the same 10\u00b9\u2074 bytes through vision alone in 16,000 waking hours. Conclusion: human-like intelligence will not emerge from text alone. 
Video&#8217;s redundancy, often dismissed as a bug, is actually a feature for self-supervised learning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Inference by optimization, not propagation.<\/strong> LeCun contrasts two inference modes: reactive forward propagation through fixed layers (the LLM approach, where &#8220;reasoning&#8221; is coerced by generating more tokens) versus searching for an action sequence that minimizes an energy function at inference time. The latter requires a world model \u2014 perceive the state, imagine actions, predict outcomes, optimize against task objectives. This is essentially Model Predictive Control (MPC), classical optimal control dating to the 1960s.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Intrinsic safety via guardrail objectives.<\/strong> Because such a system can only act by optimizing its guardrail and task objectives, it can be made intrinsically safe \u2014 unlike LLMs, whose safety fine-tuning can always be jailbroken.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>JEPA (Joint Embedding Predictive Architecture).<\/strong> His core technical proposal: instead of generative models that reconstruct pixels (which fail because video futures are unpredictable in detail, producing blurry averages), JEPA encodes both input and target and predicts in abstract representation space, discarding unpredictable detail. By his account, the best self-supervised vision systems all use joint embedding; none use reconstruction.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Preventing collapse: SIGReg.<\/strong> The central challenge of joint embedding is representation collapse, where the encoder maps every input to the same output and the prediction task becomes trivial. 
LeCun&#8217;s preferred remedy is information maximization, and his newest technique is SIGReg (Sketched Isotropic Gaussian Regularization): project batch representations along many random directions and use gradient signals to match each projected empirical cumulative distribution to a Gaussian; a theorem then guarantees that the joint distribution converges to an isotropic Gaussian (maximally independent variables). Training fits on a single GPU, and the code is open source.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">All About World Models: Installation, Setup, and Configuration with Sample Cases<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">The Landscape at a Glance<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The open-source JEPA\/world-model stack today has four main entry points, each serving a different purpose:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>V-JEPA 2 \/ 2.1<\/strong> (<code>facebookresearch\/vjepa2<\/code>) \u2014 pretrained video encoders. Use this when you want state-of-the-art video\/image representations off the shelf, or the action-conditioned variant (V-JEPA 2-AC) for robot planning. V-JEPA 2.1 adds a dense predictive loss where all tokens contribute to training, deep self-supervision at intermediate layers, and multi-modal tokenizers for images and videos.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>EB-JEPA<\/strong> (<code>facebookresearch\/eb_jepa<\/code>) \u2014 the educational\/research library. Modular, self-contained implementations going from image-level SSL to video to action-conditioned world models, each designed for single-GPU training within a few hours \u2014 reaching 91% probe accuracy on CIFAR-10 and a 97% planning success rate on the Two Rooms navigation task. This is the best starting point for learning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>LeJEPA<\/strong> (<code>rbalestr-lab\/lejepa<\/code>) \u2014 the SIGReg implementation. 
This is the technique LeCun highlighted in the talk: it identifies the isotropic Gaussian as the optimal embedding distribution and introduces Sketched Isotropic Gaussian Regularization to constrain embeddings toward it, with a single trade-off hyperparameter and linear time\/memory complexity.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>JEPA-WMs<\/strong> (<code>facebookresearch\/jepa-wms<\/code>) \u2014 the planning benchmark suite. Provides pretrained JEPA world models plus DINO-WM and V-JEPA-2-AC baselines across simulated environments (PointMaze, RoboSuite) and real-robot setups, with decoder heads for visualizing rollouts.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Also worth knowing: <strong>I-JEPA<\/strong> (images only, the original 2023 codebase) and <strong>DINO-WM<\/strong> (frozen DINOv2 features + learned dynamics for planning). On the generative side of the world-model family \u2014 which LeCun explicitly distinguishes from his approach \u2014 there are DreamerV3 (RL with latent imagination) and Genie-style video world models.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Environment Setup (Common Base)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">All of these are PyTorch projects. 
A clean base that works for everything:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Recommended: Python 3.10\u20133.12, CUDA-enabled PyTorch\nconda create -n worldmodels python=3.12 -y\nconda activate worldmodels\n\npip install torch torchvision --index-url https:\/\/download.pytorch.org\/whl\/cu124\npip install timm einops numpy pillow opencv-python decord\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">GPU guidance: feature extraction with V-JEPA 2 ViT-L runs on a single 16\u201324 GB GPU; EB-JEPA and LeJEPA examples are deliberately single-GPU friendly; full V-JEPA 2 pretraining is cluster-scale and not something to attempt locally.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">V-JEPA 2 \u2014 Installation and Sample Case<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Install<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The fastest path requires no clone at all \u2014 install PyTorch, timm, and einops, then load models directly via torch.hub:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch\n\n# Preprocessor + encoder via torch.hub\nprocessor      = torch.hub.load('facebookresearch\/vjepa2', 'vjepa2_preprocessor')\nvjepa2_vit_large = torch.hub.load('facebookresearch\/vjepa2', 'vjepa2_vit_large')\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Or via Hugging Face Transformers (V-JEPA 2 models are hosted under the <code>facebook<\/code> org):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from transformers import AutoVideoProcessor, AutoModel\n\nprocessor = AutoVideoProcessor.from_pretrained(\"facebook\/vjepa2-vitl-fpc64-256\")\nmodel     = AutoModel.from_pretrained(\"facebook\/vjepa2-vitl-fpc64-256\").cuda().eval()\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">For the full repo (training, evals, V-JEPA 2-AC planning):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>git clone https:\/\/github.com\/facebookresearch\/vjepa2.git\ncd vjepa2\npip install -e .\n<\/code><\/pre>\n\n\n\n<p 
class=\"wp-block-paragraph\">Model sizes range from ViT-L\/16 at 300M parameters up to ViT-g\/16 at 1B for V-JEPA 2, and ViT-B (80M) through ViT-G (2B) at 384 resolution for V-JEPA 2.1, each with downloadable checkpoints and pretraining configs.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Sample Case A \u2014 Video feature extraction + action recognition probe<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch, numpy as np\nfrom decord import VideoReader\n\n# 1. Load 64 frames from a video\nvr = VideoReader(\"sample.mp4\")\nidx = np.linspace(0, len(vr) - 1, 64).astype(int)\nvideo = vr.get_batch(idx).asnumpy()           # (64, H, W, 3)\n\n# 2. Preprocess and encode\ninputs = processor(list(video), return_tensors=\"pt\").to(\"cuda\")\nwith torch.no_grad():\n    feats = model.get_vision_features(**inputs)   # (1, N_tokens, D)\n\n# 3. Frozen-backbone classification: pool + linear probe\nclip_embedding = feats.mean(dim=1)            # (1, D)\n# train a small nn.Linear(D, num_classes) on top \u2014 backbone stays frozen\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This is the canonical JEPA usage pattern: the encoder is never fine-tuned; you train only a lightweight &#8220;attentive probe&#8221; or linear head per task \u2014 exactly the &#8220;tiny projection head, very few samples&#8221; answer LeCun gave in the Q&amp;A.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Sample Case B \u2014 Physical plausibility detection (the &#8220;surprise&#8221; test)<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Using the predictor&#8217;s internal prediction error as an anomaly score, sliding a 16-frame window across a video \u2014 error spikes indicate physically implausible events. The repo&#8217;s evaluation suite includes the IntPhys-style benchmarks for this. 
Pseudocode of the pattern:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Pseudocode: `encoder`, `predictor`, and `video` are placeholders\nerrors = &#91;]\nT = len(video)                                     # total number of frames\nfor t in range(0, T - 16 + 1):                     # every full 16-frame window\n    window = video&#91;t : t + 16]\n    pred   = predictor(encoder(window&#91;:8]))        # predict future reps from the first 8 frames\n    target = encoder(window&#91;8:])                   # reps of the last 8 frames\n    errors.append((pred - target).pow(2).mean().item())\n# A spike in `errors` \u2248 violation of expectation\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">EB-JEPA \u2014 The Best Learning Path (Single GPU)<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Install<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">EB-JEPA uses uv for package management: run <code>uv sync<\/code>, then either activate the venv or use <code>uv run<\/code>. A conda + uv hybrid is also supported: create a Python 3.12 conda env, then <code>uv pip install -e . --group dev<\/code>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>git clone https:\/\/github.com\/facebookresearch\/eb_jepa.git\ncd eb_jepa\nuv sync\nsource .venv\/bin\/activate\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Config<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Two environment variables control paths \u2014 add to ~\/.bashrc: <code>export EBJEPA_DSETS=\/path\/to\/eb_jepa\/datasets<\/code> (required) and optionally <code>EBJEPA_CKPTS=\/path\/to\/checkpoints<\/code> for logs and checkpoints. Each example ships with a YAML config (encoder dims, mask ratios, EMA momentum, learning rate) you can edit directly.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Sample Cases \u2014 The Three-Step Ladder<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Training is launched per example: <code>python -m examples.image_jepa.main<\/code>, <code>examples.video_jepa.main<\/code>, or <code>examples.ac_video_jepa.main<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Step 1, <strong>Image JEPA on CIFAR-10<\/strong>: masked-patch prediction in latent space; probe the frozen features afterward (~91% accuracy expected). 
Step 2, <strong>Video JEPA on Moving MNIST<\/strong>: extends the same objective to multi-step temporal prediction. Step 3, <strong>Action-conditioned JEPA + planning on Two Rooms<\/strong>: trains a world model conditioned on actions, then runs MPC-style planning over latent rollouts \u2014 this is the talk&#8217;s full architecture (perception \u2192 world model \u2192 energy minimization over action sequences) in miniature, achieving ~97% navigation success. Each stage trains in a few hours on one GPU, making this the most accessible end-to-end demonstration of &#8220;plan by optimizing through a learned world model.&#8221;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">LeJEPA \/ SIGReg \u2014 Collapse Prevention as a Library<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Install<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>git clone https:\/\/github.com\/rbalestr-lab\/lejepa.git\ncd lejepa\npip install -e .\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Sample Case \u2014 Drop SIGReg into your own training loop<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The minimal usage: choose a univariate statistical test (e.g., Epps-Pulley), wrap it in the multivariate slicing test with a number of random projection slices, and apply it to your embedding batch:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import lejepa\n\nunivariate_test = lejepa.univariate.EppsPulley(num_points=17)\nsigreg = lejepa.multivariate.SlicingUnivariateTest(\n    univariate_test=univariate_test,\n    num_slices=1024,          # random projection directions\n)\n\n# Inside your training loop (encoder, predictor, view_a, view_b are your own objects):\nz_ctx = encoder(view_a)                   # (batch, dim)\nz_tgt = encoder(view_b)\npred_loss = (predictor(z_ctx) - z_tgt).pow(2).mean()   # no stop-gradient on the target\nreg_loss  = sigreg(z_ctx)                 # pushes embeddings toward isotropic Gaussian\nloss = pred_loss + lam * reg_loss         # lam: the single trade-off hyperparameter (a float you choose)\nloss.backward()\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This 
is the random-projection trick LeCun described in the talk: project embeddings along many random directions, test each one-dimensional marginal against a Gaussian (the Epps-Pulley test does this by comparing characteristic functions rather than raw CDFs), and backpropagate the correction. No EMA teacher, no stop-gradient heuristics, no negative samples. The augmentation recipe in the repo follows a DINO-style multi-crop approach \u2014 2 global views and 6 local views per image at different scales. A nice independent reproduction note: SIGReg prevents collapse without contrastive samples even on CIFAR-100 with ViT-Tiny, though nano-scale models can exhibit a head-collapse phenomenon \u2014 worth knowing if you experiment at very small scale.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">JEPA-WMs \u2014 Planning Benchmarks and Robot Environments<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Install and Config<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>git clone https:\/\/github.com\/facebookresearch\/jepa-wms.git\ncd jepa-wms\nuv pip install -e .\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Configuration is checkpoint-directory driven: set <code>$JEPAWM_OSSCKPT<\/code> to point to a tree containing V-JEPA v1\/v2 checkpoints and DINOv3 weights; pretrained models are downloadable from Hugging Face Hub (recommended) or fbaipublicfiles. 
Environment setup has one legacy quirk: PointMaze requires MuJoCo 2.1.0 via mujoco-py (download to ~\/.mujoco, export LD_LIBRARY_PATH), while other environments use the modern mujoco package; RoboSuite is installed from a forked repo via <code>uv pip install -e .<\/code>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Sample Case \u2014 Latent-space MPC planning<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The workflow mirrors the talk&#8217;s architecture directly: encode the current observation with a frozen encoder (DINOv2\/DINOv3 or V-JEPA), roll out candidate action sequences through the learned latent dynamics model, score each rollout against the encoded goal image (the energy function), and optimize with a sampling-based planner such as CEM or MPPI. Decoder heads are available to visualize rollouts but aren&#8217;t required for training or planning evaluation \u2014 a nice illustration of &#8220;prediction in representation space, pixels only for human inspection.&#8221;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Practical Configuration Cheat Sheet<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Three configuration knobs matter most across all these repos. First, <strong>masking ratio and strategy<\/strong> (in the YAML configs): video JEPA uses large spatiotemporal block masks (~75\u201390%) \u2014 too little masking makes prediction trivial and degrades features. Second, <strong>collapse prevention choice<\/strong>: EMA-distillation (V-JEPA, DINO style \u2014 proven at scale) versus SIGReg (LeJEPA \u2014 simpler, theoretically grounded, newer). 
Third, <strong>probe versus fine-tune<\/strong>: the entire paradigm assumes frozen backbones; if your downstream results are poor, fix the probe (try attentive pooling instead of mean pooling) before touching the encoder.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Video about World Models<\/h2>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Yann LeCun: World Models: Enabling the next AI revolution\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/72Xj8k5WQX4?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<div class=\"wp-block-group has-pale-cyan-blue-background-color has-background\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\">\n<h3 class=\"wp-block-heading\">Related Sections of Video<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Developmental psychology as a roadmap.<\/strong> Babies learn the world is 3D within months purely from observation (parallax explains visual change under motion); object permanence comes quickly; intuitive physics like gravity takes ~9 months. Psychologists detect concept acquisition via &#8220;violation of expectation&#8221; \u2014 and remarkably, the same test now works on machines: V-JEPA&#8217;s internal prediction error spikes when shown physically impossible videos (a ball vanishing mid-flight), which LeCun calls the first self-supervised system to acquire a level of physical common sense. 
This aligns with his long-standing argument that LLM hallucinations stem from autoregressive token prediction, where error probability compounds exponentially and the system never naturally gravitates toward truth (<a href=\"https:\/\/glasp.co\/youtube\/p\/why-llms-hallucinate-yann-lecun-and-lex-fridman\">Glasp summary<\/a>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Abstraction as the essence of prediction.<\/strong> Science itself works by inventing abstraction hierarchies \u2014 quantum fields \u2192 atoms \u2192 molecules \u2192 cells \u2192 organisms \u2192 societies \u2014 each ignoring lower-level detail to enable longer-range prediction. Aerodynamics simulates airflow with Navier-Stokes equations, not molecular collisions. Hence world models should <em>not<\/em> be pixel-level simulators, digital twins, or video generators: &#8220;if you want to produce cute videos, work on video generation; if you want to control robots, do not.&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Practical results.<\/strong> Distillation-based JEPA variants (I-JEPA, V-JEPA, DINO\/DINOv3 from Meta) currently produce the best generic image representations, train faster than masked autoencoders, and support planning: a DINO-encoded world model plans action sequences in complex simulated dynamics within 25 steps. 
V-JEPA 2.1 representations, with a small trained head, predict depth from a single image better than DINOv3 \u2014 evidence the system &#8220;understands&#8221; 3D structure from passive video alone.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Hierarchical planning \u2014 the open problem.<\/strong> Using his NYU-office-to-Paris example (high-level plan: get to airport, catch plane; recursively decompose to sub-goals), LeCun stresses that hierarchical planning remains unsolved and is &#8220;a great PhD topic.&#8221; This echoes his consistent message that training world models by observation, planning with learned world models, and hierarchical planning are the key open research frontiers for robotics and physical AI (<a href=\"https:\/\/glasp.co\/youtube\/p\/yann-lecun-on-tesla-optimus-and-humanoid-robots-yann-lecun-and-lex-fridman\">Glasp summary<\/a>).<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">LeCun closes with his now-famous list of heresies: abandon generative models in favor of joint embedding architectures; abandon probabilistic models in favor of energy-based models; prefer regularized\/information-maximization methods over contrastive ones; and minimize reinforcement learning (&#8220;what you do when you&#8217;re desperate&#8221;) \u2014 use it only on top of good pre-learned representations. Academics, he insists, should not work on LLMs at all, since they can bring nothing to that table. His new venture AMI Labs targets exactly the domains where LLMs are helpless: high-dimensional, continuous, noisy systems \u2014 robotics, industrial process control, physical AI. 
In the Q&amp;A, he clarifies that task objectives and constraints in representation space require only tiny trained projection heads, learnable from very few samples.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Key Takeaways<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Intelligence = rapid adaptation<\/strong>, not accumulated knowledge or skills; &#8220;AGI&#8221; is a misnomer because even human intelligence is specialized.<\/li>\n\n\n\n<li><strong>Text is a dead end for grounded intelligence<\/strong> \u2014 a 4-year-old&#8217;s visual input matches the entire internet&#8217;s text corpus in bytes.<\/li>\n\n\n\n<li><strong>Predict in representation space, not pixel space<\/strong> \u2014 JEPA discards unpredictable detail, which is why it outperforms generative\/reconstruction approaches for representation learning.<\/li>\n\n\n\n<li><strong>Inference by energy minimization<\/strong> (planning via world models + MPC) is computationally more powerful than fixed-depth forward propagation, and enables intrinsically safe systems via guardrail objectives.<\/li>\n\n\n\n<li><strong>SIGReg<\/strong> is the newest collapse-prevention technique \u2014 provably recovers isotropic Gaussian latents, simple, single-GPU trainable, open source.<\/li>\n\n\n\n<li><strong>V-JEPA demonstrates emergent physical common sense<\/strong> \u2014 prediction error spikes on impossible events, mirroring infant violation-of-expectation experiments.<\/li>\n\n\n\n<li><strong>Hierarchical planning is wide open<\/strong> \u2014 arguably the most important unsolved problem for agentic and physical AI.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Related References<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/openreview.net\/pdf?id=BZ5a1r-kVsf\" target=\"_blank\" rel=\"noopener\" title=\"\">LeCun (2022), <em>A Path Towards Autonomous Machine Intelligence<\/em> \u2014 the position paper underpinning this talk<\/a><\/li>\n\n\n\n<li>I-JEPA 
(Assran et al., 2023) and V-JEPA \/ V-JEPA 2 (Bardes et al., 2024\u20132025) \u2014 image and video instantiations<\/li>\n\n\n\n<li><a href=\"https:\/\/ai.meta.com\/blog\/dinov3-self-supervised-vision-model\/\" target=\"_blank\" rel=\"noopener\" title=\"\">DINO \/ DINOv3 (Meta FAIR Paris) \u2014 state-of-the-art self-supervised image encoders<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/arxiv.org\/pdf\/2105.04906\" target=\"_blank\" rel=\"noopener\" title=\"\">VICReg, Barlow Twins, MMCR \u2014 information-maximization collapse-prevention methods<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/arxiv.org\/pdf\/2101.08482\" target=\"_blank\" rel=\"noopener\" title=\"\">BYOL (Google DeepMind) \u2014 origin of the EMA-teacher distillation trick<\/a><\/li>\n\n\n\n<li>Glasp insights: <a href=\"https:\/\/glasp.co\/youtube\/p\/yann-lecun-on-tesla-optimus-and-humanoid-robots-yann-lecun-and-lex-fridman\">Yann LeCun on world models and robotics<\/a> and <a href=\"https:\/\/glasp.co\/youtube\/p\/why-llms-hallucinate-yann-lecun-and-lex-fridman\">Why LLMs hallucinate<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Yann LeCun argues that LLMs cannot achieve human-like intelligence: a four-year-old absorbs as much data through vision as all internet text. His solution\u2014world models built on JEPA, predicting in abstract representation space rather than pixels\u2014enables planning by energy minimization, intrinsic safety, and emergent physical common sense. 
Now open-sourced via V-JEPA 2, EB-JEPA, and LeJEPA for hands-on experimentation.<\/p>\n","protected":false},"author":1,"featured_media":8509,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-8504","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"aioseo_notices":[],"featured_image_src":"https:\/\/meta-quantum.today\/wp-content\/uploads\/2026\/06\/WorldModels-00-scaled.jpg","featured_image_src_square":"https:\/\/meta-quantum.today\/wp-content\/uploads\/2026\/06\/WorldModels-00-scaled.jpg","author_info":{"display_name":"coffee","author_link":"https:\/\/meta-quantum.today\/?author=1"},"rbea_author_info":{"display_name":"coffee","author_link":"https:\/\/meta-quantum.today\/?author=1"},"rbea_excerpt_info":"Yann LeCun argues that LLMs cannot achieve human-like intelligence: a four-year-old absorbs as much data through vision as all internet text. His solution\u2014world models built on JEPA, predicting in abstract representation space rather than pixels\u2014enables planning by energy minimization, intrinsic safety, and emergent physical common sense. 
Now open-sourced via V-JEPA 2, EB-JEPA, and LeJEPA for hands-on experimentation.","category_list":"<a href=\"https:\/\/meta-quantum.today\/?cat=1\" rel=\"category\">Uncategorized<\/a>","comments_num":"0 comments","_links":{"self":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts\/8504","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=8504"}],"version-history":[{"count":3,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts\/8504\/revisions"}],"predecessor-version":[{"id":8513,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts\/8504\/revisions\/8513"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/media\/8509"}],"wp:attachment":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=8504"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=8504"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=8504"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}