
{"id":7778,"date":"2025-05-18T08:00:00","date_gmt":"2025-05-18T00:00:00","guid":{"rendered":"https:\/\/meta-quantum.today\/?p=7778"},"modified":"2025-05-18T09:04:43","modified_gmt":"2025-05-18T01:04:43","slug":"how-to-build-qwen3s-dual-mode-ai-0-6b-to-235b","status":"publish","type":"post","link":"https:\/\/meta-quantum.today\/?p=7778","title":{"rendered":"HOW TO Build Qwen3&#8217;s Dual Mode AI (0.6B to 235B)"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p>This analysis examines the technical report from the Qwen3 team, focusing on their dual-mode AI design. The system dynamically switches between &#8220;thinking&#8221; and &#8220;non-thinking&#8221; modes. This capability enables Qwen3 to perform complex multi-step reasoning when necessary while delivering rapid, context-driven responses when an immediate answer suffices. Reported results show significant accuracy improvements on complex tasks when thinking mode is used, demonstrating the value of this dual capability in modern AI systems.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Qwen3&#8217;s Dual Mode AI: Thinking and Non-Thinking Modes<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What Are the Two Modes?<\/h3>\n\n\n\n<p>Qwen3&#8217;s architecture introduces dynamic switching between two distinct operational modes within a single model:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Thinking Mode<\/strong>: Enables explicit step-by-step reasoning for complex problems requiring multi-stage analysis<\/li>\n\n\n\n<li><strong>Non-Thinking Mode<\/strong>: Provides rapid, context-driven responses without visible reasoning steps<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Key Benefits of Dual Mode Architecture<\/h3>\n\n\n\n<p>The dual-mode approach resolves a fundamental tradeoff in AI systems. 
Reported results on the AIME24 and AIME25 math benchmarks show significant accuracy gains when thinking mode is used for complex reasoning tasks, while non-thinking mode keeps simpler queries efficient. This eliminates the need to deploy separate specialized models.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How the Dual Mode System Works<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Implementation Method<\/strong><\/h4>\n\n\n\n<p>The implementation is elegantly simple yet powerful. During the &#8220;thinking mode fusion&#8221; stage of post-training (the third of four fine-tuning stages), the Qwen3 developers:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Started with a model already trained for reasoning capabilities<\/li>\n\n\n\n<li>Performed continual supervised fine-tuning using a clever chat-template system<\/li>\n\n\n\n<li>Used nearly identical input formats with one critical difference:\n<ol class=\"wp-block-list\">\n<li>Thinking mode: includes a dedicated section for the thinking content<\/li>\n\n\n\n<li>Non-thinking mode: carries a &#8220;no thinking&#8221; flag, with the thinking-content section removed<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n\n\n\n<p>This simple formatting difference was sufficient for the model to learn when to engage in explicit reasoning and when to respond directly.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Data Preparation<\/strong><\/h4>\n\n\n\n<p>The training process required carefully curated datasets:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Non-thinking data covering diverse tasks: coding, mathematics, instruction following, multilingual capabilities, creative writing, question answering, and role playing<\/li>\n\n\n\n<li>Thinking data focused on problems requiring explicit reasoning chains<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Controllable Thinking Budget<\/h3>\n\n\n\n<p>An interesting capability that emerged naturally from thinking mode fusion is the ability to control the &#8220;thinking 
budget&#8221;:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Users can specify a maximum token length (up to 38,912 tokens) for the thinking process<\/li>\n\n\n\n<li>When the budget is reached, a stop-thinking instruction is inserted, prompting the model to conclude its reasoning<\/li>\n\n\n\n<li>The model naturally attempts to complete its reasoning within the specified budget<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical Implementation in the Training Pipeline<\/h3>\n\n\n\n<p>The dual-mode capability was developed through a specific sequence:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Pre-training<\/strong>: Three-stage curriculum learning (general knowledge \u2192 domain expertise \u2192 long context)<\/li>\n\n\n\n<li><strong>Supervised Fine-tuning<\/strong>: Chain-of-thought cold start<\/li>\n\n\n\n<li><strong>Reasoning RL<\/strong>: GRPO (Group Relative Policy Optimization) applied to reasoning capabilities<\/li>\n\n\n\n<li><strong>Thinking Mode Fusion<\/strong>: Integration of non-thinking capabilities into the reasoning model<\/li>\n\n\n\n<li><strong>General RL<\/strong>: Final reinforcement learning to enhance instruction following, format adherence, and agent abilities<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Video: How to Build Qwen3&#8217;s Dual-Mode AI<\/h2>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"0.6B to 235B: HOW TO Build Qwen3\u2019s Dual Mode AI\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/_8Rv2p4RVmU?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Key Sections of the Video<\/h2>\n\n\n\n<p>The video breaks down Qwen3&#8217;s training process into 
several key stages:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pre-training (Curriculum Learning)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Stage 1<\/strong>: General pre-training on approximately 30 trillion tokens from diverse domains across 119 languages<\/li>\n\n\n\n<li><strong>Stage 2<\/strong>: Knowledge-intensive pre-training focused on reasoning and domain expertise in science, mathematics, and coding<\/li>\n\n\n\n<li><strong>Stage 3<\/strong>: Long-context pre-training extending the context window to 32K tokens, using techniques like YaRN (Yet another RoPE extensioN) to scale the attention mechanism<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Fine-tuning Process<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Cold Start with Chain-of-Thought<\/strong>: Creating a comprehensive dataset spanning various categories, paired with verified reference answers or code-based test cases<\/li>\n\n\n\n<li><strong>Reasoning Reinforcement Learning<\/strong>: Applying GRPO (Group Relative Policy Optimization) to challenging query-verifier pairs, lifting benchmark performance from roughly 70 to 85<\/li>\n\n\n\n<li><strong>Thinking Mode Fusion<\/strong>: The critical step where non-thinking capabilities are integrated into the thinking model through chat templates that distinguish between thinking and non-thinking modes<\/li>\n\n\n\n<li><strong>General Reinforcement Learning<\/strong>: A final GRPO stage targeting instruction following, format adherence, reference alignment, and agent abilities<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Distillation and Model Variants<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Created two flagship models: a 235B mixture-of-experts model (with 22B active parameters) and a 32B dense model<\/li>\n\n\n\n<li>Distilled smaller dense models ranging from 0.6B to 14B parameters, plus a 30B MoE model with 3B active parameters<\/li>\n\n\n\n<li>Used both off-policy and on-policy distillation 
techniques<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Applications of Dual Mode AI<\/h2>\n\n\n\n<p>The dual-mode approach is particularly valuable for:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Complex reasoning tasks (mathematics, science, coding) where step-by-step thinking improves accuracy<\/li>\n\n\n\n<li>Conversational AI where immediate responses feel more natural<\/li>\n\n\n\n<li>Agent systems that need to switch between deep analysis and quick actions<\/li>\n\n\n\n<li>Retrieval-Augmented Generation (RAG) systems<\/li>\n\n\n\n<li>Resource optimization (engaging thinking mode only when necessary)<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Future Developments<\/h2>\n\n\n\n<p>According to the technical report, the Qwen3 developers plan to extend this capability toward agent-based reinforcement learning systems that can:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Learn from environmental feedback<\/li>\n\n\n\n<li>Tackle increasingly complex tasks<\/li>\n\n\n\n<li>Scale inference capabilities at runtime based on task requirements<\/li>\n<\/ul>\n\n\n\n<p>The dual-mode architecture is a notable step toward more flexible, resource-efficient AI systems that adapt their reasoning approach to the nature of the task.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>The Qwen3 models represent a significant advancement in AI model architecture by successfully implementing dual-mode capabilities within a single model. The team&#8217;s methodical approach to training, from curriculum-based pre-training to specialized fine-tuning and distillation, has produced models capable of both deep reasoning and rapid responses. 
The models support 119 languages and are released under the Apache 2.0 license, making them accessible for a wide range of applications.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5 Key Takeaways:<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Dual-mode innovation<\/strong>: Qwen3 models can dynamically switch between thinking mode for complex reasoning and non-thinking mode for immediate responses within the same model architecture.<\/li>\n\n\n\n<li><strong>Three-stage curriculum learning<\/strong>: Pre-training followed a deliberate progression from general knowledge to domain-specific expertise to long-context understanding.<\/li>\n\n\n\n<li><strong>Thinking mode fusion simplicity<\/strong>: The mechanism enabling dual-mode capability is elegantly simple: chat templates that include or omit a &#8220;thinking content&#8221; section.<\/li>\n\n\n\n<li><strong>Controllable thinking budget<\/strong>: Users can set a maximum token length for the thinking process, a capability that emerged naturally from thinking mode fusion.<\/li>\n\n\n\n<li><strong>Future focus on agent learning<\/strong>: The Qwen3 team plans to expand into agent-based reinforcement learning systems that learn from environmental feedback, targeting complex tasks that require inference-time scaling.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">References<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/arxiv.org\/pdf\/2505.09388\" target=\"_blank\" rel=\"noopener\" title=\"Qwen3 Technical Report (published May 14, 2025)\">Qwen3 Technical Report (published May 14, 2025)<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/QwenLM\/Qwen3\" target=\"_blank\" rel=\"noopener\" title=\"Qwen3 Model Repository (all models available under Apache 2.0 license)\">Qwen3 Model Repository (all models available under Apache 2.0 license)<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/iclr.cc\/media\/iclr-2024\/Slides\/17499.pdf\" target=\"_blank\" rel=\"noopener\" title=\"YaRN 
(Yet another RoPE extensioN) for long-context attention\">YaRN (Yet another RoPE extensioN) for long-context attention<\/a><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Qwen3 introduces a dual-mode AI architecture enabling dynamic switching between &#8220;thinking&#8221; and &#8220;non-thinking&#8221; modes within a single model. Thinking mode provides explicit step-by-step reasoning for complex problems, while non-thinking mode delivers rapid, immediate responses. This elegant solution uses simple template differences during training, effectively eliminating the need for separate specialized models while maintaining both reasoning depth and response efficiency.<\/p>\n","protected":false},"author":1,"featured_media":7779,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[15,18],"tags":[],"class_list":["post-7778","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-education"],"aioseo_notices":[],"featured_image_src":"https:\/\/meta-quantum.today\/wp-content\/uploads\/2025\/05\/HOW-TO-Build-Qwen3s-Dual-Mode-AI-0.6B-to-235B.jpg","featured_image_src_square":"https:\/\/meta-quantum.today\/wp-content\/uploads\/2025\/05\/HOW-TO-Build-Qwen3s-Dual-Mode-AI-0.6B-to-235B.jpg","author_info":{"display_name":"coffee","author_link":"https:\/\/meta-quantum.today\/?author=1"},"rbea_author_info":{"display_name":"coffee","author_link":"https:\/\/meta-quantum.today\/?author=1"},"rbea_excerpt_info":"Qwen3 introduces a dual-mode AI architecture enabling dynamic switching between \"thinking\" and \"non-thinking\" modes within a single model. Thinking mode provides explicit step-by-step reasoning for complex problems, while non-thinking mode delivers rapid, immediate responses. 
This elegant solution uses simple template differences during training, effectively eliminating the need for separate specialized models while maintaining both reasoning depth and response efficiency.","category_list":"<a href=\"https:\/\/meta-quantum.today\/?cat=15\" rel=\"category\">AI<\/a>, <a href=\"https:\/\/meta-quantum.today\/?cat=18\" rel=\"category\">Education<\/a>","comments_num":"0 comments","_links":{"self":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts\/7778","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7778"}],"version-history":[{"count":3,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts\/7778\/revisions"}],"predecessor-version":[{"id":7783,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts\/7778\/revisions\/7783"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/media\/7779"}],"wp:attachment":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7778"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7778"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7778"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}