4 LLMs Tested in Codex, Claude Code, Hermes & OpenClaw
A landmark NVIDIA-funded study (32,000 GPU hours) benchmarks 4 LLMs across 5 agent frameworks on real financial tasks. Claude Code and OpenClaw dominate auditing at 66% accuracy, while ReAct collapses to 20% with the same Sonnet 4.6 backbone. Hermes + Qwen 400B is the surprise performer in hedging. But all agents fail catastrophically under temporal regime shifts, which points to surface-level pattern matching rather than true reasoning.











