On Open Source Agentic Coding
TL;DR: Open source is making giant leaps, but is still quite behind (no surprise, duh!).
I have wanted to test a few new open-source models for agentic coding tasks on consumer-grade hardware for a while now. Why? Because, at some point in the future, you won't want to be locked in with the major players - they'll "own" all your accumulated work and knowledge. 'Own' in the sense that it won't be easy to escape their ecosystem (think Google) if you want to switch to a better inference provider, change employers or careers, etc.
So, this Sunday afternoon, I ran the recently released Qwen 3.6 35B A3B on a DGX Spark - not exactly consumer-grade hardware, but close enough if you're an enthusiast. It works like a charm. Combined with OpenCode, it feels like using Claude Code with Sonnet, or the "last summer edition" of Gemini CLI, except the tokens are effectively unlimited - and you can never have enough tokens!
From real-world testing, Qwen 3.6 is nowhere near Opus or Codex, but it's a great local model. The MoE is fast, and it's better than the 3.5 version. I'm waiting for the 3.6 dense model to see whether the dense variant holds up against the MoE the way it did in the 3.5 generation.
I gave Codex (5.4, medium), Claude (Opus 4.6, medium), and Qwen (3.6, via OpenCode) the same real-world, user-created ticket with a poor, half-hearted description of the desired feature. I know the software well and what the outcome should have been, so I let all three do their thing in plan mode. Encouragingly, they all asked similar questions, but their implementations could not have been more different.

Qwen's implementation, funnily enough, looked a lot like mine on a bad day 10 years ago: it hacked business logic into GUI class instances, leaving GUI and logic entangled, whereas the 'big brothers' did the refactoring all at once without asking. What none of them got right: they all called a method on an object directly from a thread other than the main thread. While this is technically 'okay-ish' in Python/Qt most of the time, it should be avoided. So, which implementation am I keeping? I kind of identify with Qwen's a lot ;)
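To make the threading point concrete, here is a minimal stdlib sketch of the pattern all three models should have used: instead of calling a method on a GUI-owned object from a worker thread, the worker posts the call to a queue that the main thread drains (this is essentially what Qt's queued signal/slot connections do for you). The `Label` class and names are purely illustrative, not from any real GUI toolkit.

```python
import queue
import threading

# Hypothetical stand-in for a GUI object that must only be
# touched from the main thread (think QWidget).
class Label:
    def __init__(self):
        self.text = ""
        self.updated_from = None

    def set_text(self, text):
        self.text = text
        self.updated_from = threading.current_thread().name

# The "event queue": workers post callables here instead of
# calling GUI methods directly (Qt's queued connections do
# the same marshaling under the hood).
calls = queue.Queue()

def worker(label):
    # Wrong: label.set_text("done")  -- direct cross-thread call.
    # Right: hand the call back to the main thread.
    calls.put(lambda: label.set_text("done"))

label = Label()
t = threading.Thread(target=worker, args=(label,), name="worker")
t.start()
t.join()

# Main thread drains the queue, so set_text actually runs here.
while not calls.empty():
    calls.get()()

print(label.text)          # done
print(label.updated_from)  # MainThread
```

In real PyQt/PySide code you would emit a signal connected to the GUI object's slot rather than manage a queue by hand, but the invariant is the same: the worker never touches the GUI object directly.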
I also used the model for questions and answers in OpenWebUI, and I'd say the quality and speed (on the DGX) are really OK for batch inference, local RAG, or evaluation pipelines. However, the remaining gap is usually where agentic tasks break: frontier models still justify their price in reliability, but open source is catching up fast. Simon Willison's Flamingo SVG test shows this too: Qwen 3.6 on his laptop outperformed Opus 4.7 visually. Combined with the long-context regressions in 4.7 (256K, 90–60), open-source models are handling workloads that frontier models are not particularly good at. Then again, you never know when even the Flamingo SVG test will be benchmaxxed to the limit...