{"id":8403,"date":"2026-05-19T08:18:00","date_gmt":"2026-05-19T00:18:00","guid":{"rendered":"https:\/\/meta-quantum.today\/?p=8403"},"modified":"2026-05-18T20:26:01","modified_gmt":"2026-05-18T12:26:01","slug":"4-llms-tested-in-codex-claude-code-hermes-openclaw","status":"publish","type":"post","link":"https:\/\/meta-quantum.today\/?p=8403","title":{"rendered":"4 LLMs Tested in Codex, Claude Code, Hermes &amp; OpenClaw"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">A large-scale benchmark study was published on <strong>May 13, 2026<\/strong> by a consortium including Yale, Columbia, NVIDIA, NYU, Stevens Institute, Universit\u00e9 de Montr\u00e9al, Mila \u2013 Quebec AI Institute, Vrije Universiteit Amsterdam, NUS Singapore, and several other institutions. Powered by an <strong>NVIDIA academic grant of 32,000 A100 GPU hours<\/strong>, the study asks a deceptively simple question: when you pair a frontier LLM with an agent framework, <em>which combination actually performs best on real financial intelligence tasks?<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Four models (Claude Sonnet 4.6, GPT-5.4, Qwen 3.5 ~400B open-source, and Qwen 3.5 27B) were combined with five agent frameworks (<strong>ReAct, Claude Code, Codex, Hermes, OpenClaw<\/strong>) \u2014 yielding 20 configurations per workflow. 
The video&#8217;s narrator frames this as a controlled experiment to isolate the <em>agentic framework variable<\/em> from the LLM backbone variable, something the field has long needed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Features and Concept<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The study evaluates each configuration across four financial workflows that map onto increasingly difficult agentic skills:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Trading<\/strong> \u2014 daily market timing for a single asset (buy\/sell\/hold), 3-month horizon with daily increments.<\/li>\n\n\n\n<li><strong>Hedging<\/strong> \u2014 market-neutral pairs trading (long\/short, short\/long, hold, close), 3-month horizon. This tests cross-asset relational reasoning \u2014 analogous to exploiting local divergences in coupled oscillators.<\/li>\n\n\n\n<li><strong>Market Insights<\/strong> \u2014 weekly investment reports with structured 8-section ratings that follow industry standards.<\/li>\n\n\n\n<li><strong>Auditing<\/strong> \u2014 exact arithmetic and graph traversal logic across knowledge graphs and SEC filings.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Evaluation metrics include cumulative return, Sharpe ratio, maximum drawdown, and structural error rates. 
The conceptual framing borrows from theoretical physics: professional finance is treated as a <strong>non-equilibrium dynamical system driven by stochastic processes<\/strong>, requiring agents to integrate trajectories over long horizons, make sequential irreversible commitments under uncertainty, and preserve &#8220;topological invariants&#8221; (i.e., accounting identities).<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"720\" style=\"aspect-ratio: 1280 \/ 720;\" width=\"1280\" autoplay controls loop muted src=\"https:\/\/meta-quantum.today\/wp-content\/uploads\/2026\/05\/The_Financial_AI_Stress_Test.mp4\" playsinline><\/video><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Key Findings by Framework \u00d7 Model<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Auditing: Where the Framework Matters Most<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The most striking result. With the <em>same<\/em> Claude Sonnet 4.6 backbone:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Claude Code + Sonnet 4.6<\/strong> \u2192 <strong>66.15% accuracy<\/strong><\/li>\n\n\n\n<li><strong>OpenClaw + Sonnet 4.6<\/strong> \u2192 <strong>66.15% accuracy<\/strong><\/li>\n\n\n\n<li><strong>ReAct + Sonnet 4.6<\/strong> \u2192 only <strong>20% accuracy<\/strong> (80% structural error rate from hallucinated tool calls and schema violations)<\/li>\n\n\n\n<li><strong>Hermes + Sonnet 4.6<\/strong> \u2192 also ~20%<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This is a clean demonstration that <em>the integrator matters as much as the Hamiltonian<\/em> \u2014 the agentic framework can swing accuracy by <strong>3\u00d7<\/strong> with an identical LLM underneath.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Trading: Surprising Open-Source Wins<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Claude Code + Qwen 3.5 400B<\/strong> generated remarkable returns (profitable shorting \/ volatility-harvesting policy on MSFT).<\/li>\n\n\n\n<li><strong>OpenClaw + Sonnet 
4.6<\/strong> also delivered strong alpha.<\/li>\n\n\n\n<li><strong>Codex + Qwen 3.5 27B<\/strong> failed to even complete runs \u2014 the planner and tool-orchestration policy of Codex acts as an unstable integrator for that model.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The narrator interprets this as <strong>Codex vs. Claude Code being primarily a difference in planner and tool-orchestration policy<\/strong>, not surface capability \u2014 making this a control test of <em>planning style<\/em>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Hedging: Open-Source Standouts in Hermes<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The hedging chart yielded the most unexpected result: <strong>Hermes + Qwen 3.5 400B outperformed every proprietary configuration<\/strong>, with Claude Sonnet 4.6 placed second. GPT-5.4 underperformed in this framework. This raises the question of whether open-source backbones may have an unappreciated edge in certain agent architectures for relational reasoning tasks.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Live Out-of-Distribution Test (April\u2013May 2026): The Collapse<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The truly important section. After in-sample evaluation, the team ran a <strong>live forward test<\/strong> on unseen April\u2013May 2026 data. Microsoft flipped from a bearish to a strongly bullish regime, and the previously winning <strong>Claude Code + Qwen 400B<\/strong> configuration captured only <strong>~4% of the buy-and-hold baseline<\/strong> \u2014 a catastrophic failure to adapt.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The interpretation: current agents exhibit <strong>temporal overfitting<\/strong>. They learn local linear approximations of recent training data rather than the underlying invariant laws of market dynamics. 
The moment the stochastic drift enters a new regime, policy functions break down.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Related Sections and Broader Themes<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Generation \u2260 Verification<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Market Insights (a generative task on latent statistical mapping) approached a performance ceiling. But the same agents failed on hedging (persistent state memory across assets) and auditing (exact arithmetic and graph traversal). Strong performance in one financial task does <strong>not<\/strong> predict strong performance in another.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Brittle Generalization Over Long Horizons<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">&#8220;Long horizon&#8221; here means 6+ reasoning steps. Even Sonnet 4.6 and GPT-5.4 struggle to generate stable alpha across different assets \u2014 an agent might beat the buy-and-hold baseline on Microsoft but catastrophically fail on Tesla. The learned representations are <strong>not invariant across volatility regimes<\/strong>, meaning what looks like reasoning is closer to pattern matching. This echoes a broader concern in the literature that scaling backbone parameters alone is insufficient for genuine reasoning generalization \u2014 a theme also raised in work showing that emergent abilities in LLMs may depend heavily on chosen evaluation metrics rather than reflecting deep capability shifts (<a href=\"https:\/\/glasp.co\/hatch\/JCDlaCsoYeMogZD8SijXUAV0bMF3\/p\/aK89t521jpC24j05jNTo\" target=\"_blank\" rel=\"noopener\" title=\"\">Glasp: Evolution of Emergent Abilities in LLMs<\/a>).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Why Frameworks Differ<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Claude Code operates as an autonomous loop with native file editing and batch execution, which successfully bounds Qwen 400B&#8217;s multi-trajectory exploration and preserves alpha across long horizons. 
ReAct, by contrast, suffers rapid trajectory collapse on the same backbone. This connects to a long-standing observation that LLMs alone struggle with long-horizon planning, and that pairing them with proper planning\/orchestration scaffolds (e.g., LLM+P-style architectures combining LLMs with classical planners) is what unlocks reliable execution (<a href=\"https:\/\/glasp.co\/hatch\/rH1SCUH4mKZpNDMkE0mpfgMNehJ2\/p\/x4piH9yMFK332aWNWTNH\" target=\"_blank\" rel=\"noopener\" title=\"\">Glasp: LLM+P \u2013 Empowering LLMs with Optimal Planning Proficiency<\/a>).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Reasoning Depth Differentiator<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">On reasoning quality specifically, the authors note <strong>Claude Sonnet 4.6 demonstrates superior analytical depth<\/strong>, explicitly leveraging domain knowledge and applying financial theorems, whereas <strong>GPT-5.4&#8217;s analysis remains relatively shallow<\/strong> \u2014 a distinct capability gap in complex analytical synthesis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Path Forward: External Harness Over Tensor Retraining<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A key insight teased at the end: the control loop can be <strong>externalized to an AI harness<\/strong> rather than baked into the LLM tensor weights. This is consistent with the broader move toward multi-agent and modular architectures where specialized components collaborate rather than relying on one monolithic backbone (<a href=\"https:\/\/glasp.co\/hatch\/7IB7Hs9ZYES3Hyuy5hBMpIM8qFd2\/p\/ob6D8ANtvb5lRyyra9MR\" target=\"_blank\" rel=\"noopener\" title=\"\">Glasp: Harnessing the Power of Multi-Stage Language Model Programs<\/a>). 
The narrator points toward an upcoming Princeton\/Google solution involving a continual harness with dynamic memory and skill integration.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">YouTube: <strong>Unlocking Agentic Alpha: 4 LLMS in Codex, Claude Code, Hermes, OpenClaw on Financial Markets<\/strong><\/h2>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"4 LLMs Tested in Codex, Claude Code, Hermes &amp; OpenClaw (FinAI)\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/tlB06vIARwQ?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion and Key Takeaways<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The video delivers a sober but clarifying picture. AI has <strong>solved the kinematics<\/strong> of financial reasoning \u2014 parsing filings, summarizing, and producing fluent analytical output. It has <strong>not solved the dynamics<\/strong> \u2014 stable multi-step execution policies under temporal flux.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Key takeaways:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Framework choice can swing accuracy by 3\u00d7<\/strong> on the same LLM backbone. 
ReAct is now structurally inferior to CLI-oriented frameworks (Claude Code, OpenClaw) for tasks requiring tool fidelity.<\/li>\n\n\n\n<li><strong>Claude Code and OpenClaw consistently top the charts<\/strong> for Sonnet 4.6 across auditing and trading.<\/li>\n\n\n\n<li><strong>Hermes paired with Qwen 3.5 400B<\/strong> is a sleeper combination worth watching \u2014 an open-source stack beating proprietary alternatives in hedging.<\/li>\n\n\n\n<li><strong>Codex + Qwen 3.5 27B is an unstable pairing<\/strong> \u2014 the planner mismatch produces divergent trajectories, and runs failed to complete.<\/li>\n\n\n\n<li><strong>Temporal overfitting is real and severe<\/strong> \u2014 in live April\u2013May 2026 data, the previously winning Claude Code + Qwen 400B configuration captured only ~4% of buy-and-hold returns when the regime shifted.<\/li>\n\n\n\n<li><strong>Frontier scaling is no longer enough<\/strong> \u2014 the next frontier is the geometry of the adjugate control loop, likely solved via external harnesses rather than bigger weights.<\/li>\n\n\n\n<li><strong>Reasoning depth still favors Claude Sonnet 4.6<\/strong> over GPT-5.4 for complex financial analytical synthesis.<\/li>\n\n\n\n<li><strong>Apply with caution<\/strong>: no current LLM+agent configuration is safe for real-money financial prediction.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Related References<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/thevalue.engineering\/news\/herculean-benchmark-financial-ai-agents.html\" target=\"_blank\" rel=\"noopener\" title=\"\">Study (May 13, 2026): Yale, Columbia, NVIDIA, NYU, Stevens, Universit\u00e9 de Montr\u00e9al, Mila Quebec, VU Amsterdam, NUS Singapore, et al. 
\u2014 agentic benchmark for financial intelligence<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.polyu.edu.hk\/ubda\/services\/ai-academic-grant-program\/\" target=\"_blank\" rel=\"noopener\" title=\"\">NVIDIA Academic Grant Program (32,000 A100 GPU hours)<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.researchgate.net\/publication\/396373592_Profit_Mirage_Revisiting_Information_Leakage_in_LLM-based_Financial_Agents\" target=\"_blank\" rel=\"noopener\" title=\"\">Related work on information leakage in LLM-based financial agents: <em>Profit Mirage: Revisiting Information Leakage in LLM-based Financial Agents<\/em> (arXiv 2510.07920)<\/a><\/li>\n\n\n\n<li><em>TimeSeek: Temporal Reliability of Agentic Forecasters<\/em> (arXiv 2604.04220) \u2014 complementary work on time-aware evaluation<\/li>\n\n\n\n<li><em>TrustTrade: Human-Inspired Selective Consensus Reduces Decision Uncertainty in LLM Trading Agents<\/em> \u2014 on stabilizing LLM trading behavior<\/li>\n\n\n\n<li>Glasp insight: <a href=\"https:\/\/glasp.co\/hatch\/rH1SCUH4mKZpNDMkE0mpfgMNehJ2\/p\/x4piH9yMFK332aWNWTNH\">LLM+P \u2013 Empowering LLMs with Optimal Planning Proficiency<\/a><\/li>\n\n\n\n<li>Glasp insight: <a href=\"https:\/\/glasp.co\/hatch\/7IB7Hs9ZYES3Hyuy5hBMpIM8qFd2\/p\/ob6D8ANtvb5lRyyra9MR\">Harnessing the Power of Multi-Stage Language Model Programs<\/a><\/li>\n\n\n\n<li>Glasp insight: <a href=\"https:\/\/glasp.co\/hatch\/JCDlaCsoYeMogZD8SijXUAV0bMF3\/p\/aK89t521jpC24j05jNTo\">The Evolution of Emergent Abilities in Large Language Models<\/a><\/li>\n<\/ul>\n\n\n\n<p 
class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A landmark NVIDIA-funded study (32,000 GPU hours) benchmarks 4 LLMs across 5 agent frameworks on real financial tasks. Claude Code and OpenClaw dominate auditing at 66% accuracy, while ReAct collapses to 20% with the same Sonnet 4.6 backbone. Hermes + Qwen 400B surprises in hedging. But all agents catastrophically fail under temporal regime shifts \u2014 exposing surface-level pattern matching, not true reasoning.<\/p>\n","protected":false},"author":1,"featured_media":8405,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[15,18,28,13,7],"tags":[],"class_list":["post-8403","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-education","category-nvidia","category-quantum-and-u","category-quantum-mindset-programme"],"aioseo_notices":[],"featured_image_src":"https:\/\/meta-quantum.today\/wp-content\/uploads\/2026\/05\/LLM-Frameworks-in-FIN-AI-scaled.png","featured_image_src_square":"https:\/\/meta-quantum.today\/wp-content\/uploads\/2026\/05\/LLM-Frameworks-in-FIN-AI-scaled.png","author_info":{"display_name":"coffee","author_link":"https:\/\/meta-quantum.today\/?author=1"},"rbea_author_info":{"display_name":"coffee","author_link":"https:\/\/meta-quantum.today\/?author=1"},"rbea_excerpt_info":"A landmark NVIDIA-funded study (32,000 GPU hours) benchmarks 4 LLMs across 5 agent frameworks on real financial tasks. Claude Code and OpenClaw dominate auditing at 66% accuracy, while ReAct collapses to 20% with the same Sonnet 4.6 backbone. Hermes + Qwen 400B surprises in hedging. 
But all agents catastrophically fail under temporal regime shifts \u2014 exposing surface-level pattern matching, not true reasoning.","category_list":"<a href=\"https:\/\/meta-quantum.today\/?cat=15\" rel=\"category\">AI<\/a>, <a href=\"https:\/\/meta-quantum.today\/?cat=18\" rel=\"category\">Education<\/a>, <a href=\"https:\/\/meta-quantum.today\/?cat=28\" rel=\"category\">NVIDIA<\/a>, <a href=\"https:\/\/meta-quantum.today\/?cat=13\" rel=\"category\">Quantum and U<\/a>, <a href=\"https:\/\/meta-quantum.today\/?cat=7\" rel=\"category\">Quantum Mindset Programme<\/a>","comments_num":"0 comments","_links":{"self":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts\/8403","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=8403"}],"version-history":[{"count":2,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts\/8403\/revisions"}],"predecessor-version":[{"id":8407,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/posts\/8403\/revisions\/8407"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=\/wp\/v2\/media\/8405"}],"wp:attachment":[{"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=8403"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=8403"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/meta-quantum.today\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=8403"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","template
d":true}]}}