Qwen3.6 35B A3B is currently punching far above its weight class. Its highly efficient Mixture-of-Experts (MoE) architecture dramatically lowers the hardware barrier to entry for flagship-class local AI.
Just how good is it?
The folks at AIEstatech recently switched to Qwen3.6 35B and found its reasoning capabilities highly comparable to massive proprietary models like Opus 4.1 and GPT-5.1:

| Benchmark / Metric | 💻 Qwen3.6 35B-A3B | 🧠 Claude Opus 4.1 | 🤖 GPT-5.1 |
| --- | --- | --- | --- |
| SWE-Bench Verified (Coding) | 73.4% | 74.5% | 76.3% |
| Terminal-Bench 2.0 (Agentic) | 51.5% | 43.3% | 52.8% |
| GPQA Diamond (Reasoning) | 86.0% | 80.9% | 88.1% |

Reference Links:

- Qwen3.6 Benchmarks: Qwen AI Official Blog Release
- Opus 4.1 Benchmarks: Anthropic Claude Opus 4.1 Details
- GPT-5.1 Benchmarks: OpenAI GPT-5.1 Evals

More importantly, this power is accessible on ordinary hardware. You don't need a multi-GPU enterprise server: by leveraging smart CPU/GPU memory splitting, a standard consumer 16GB VRAM GPU (like the RTX 5060 Ti) can hit base inference speeds of 60 to 90 tokens per second (tok/s). Even under maximum stress, with full context and multimodal vision processing, AIEstatech confirmed the 5060 Ti maintains a stable 40+ tok/s.

But if you want to achieve that full-context multimodal performance without crashing your server, there is a critical VRAM trap you need to avoid.

The Multimodal OOM Trap

To hit these high tok/s speeds on a 16GB card, dynamic VRAM allocators aggressively pack the hottest text layers into your VRAM and offload the rest to system RAM. If you then load Qwen3.6's vision capabilities via an image projection file (--mmproj), that aggressive VRAM saturation becomes a liability: the moment you pass an image into the prompt, the vision encoder has no VRAM left to load its tensors, causing an immediate Out-of-Memory (OOM) crash.
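The failure is simple arithmetic. As a quick sketch (every size below is an illustrative round number, not a measurement of any specific allocator or of the real mmproj footprint):

```python
# Illustrative VRAM budget for the multimodal OOM trap on a 16 GiB card.
# All figures are assumed round numbers for demonstration only.
total_vram       = 16 * 1024        # 16 GiB card, in MiB
allocator_packed = 15 * 1024 + 800  # aggressive fit leaves almost nothing free
vision_projector = 1200             # assumed size of the mmproj tensors, MiB

headroom = total_vram - allocator_packed
print(f"free VRAM: {headroom} MiB, projector needs: {vision_projector} MiB")

if headroom < vision_projector:
    print("OOM: vision encoder cannot load its tensors")
```

With only a couple hundred MiB free, the first image in the prompt forces a tensor load the card cannot satisfy, which is exactly the crash described above.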
The Fix: --fit-target 1536

You must explicitly reserve memory by adding --fit-target 1536 to your configuration. This forces the allocator to keep exactly 1536 MiB (~1.5GB) of VRAM completely free. This acts as a dedicated buffer, allowing the multimodal image projection tensors to load, process visual data cleanly, and unload without starving your KV cache or crashing your 16GB card.

The Optimal 16GB Launch Config

For maximum throughput and stable 40+ tok/s multimodal inference, your launch command should look like this:
```shell
llama-server -m Qwen3.6-35B-A3B.gguf \
  --mmproj mmproj-model-f16.gguf \
  --fit on \
  --fit-target 1536 \
  -ctk q8_0 -ctv q8_0 \
  -np 1
```
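Once the server is up, you can exercise the vision path through the OpenAI-compatible chat endpoint that llama-server exposes. A minimal payload sketch (the host, port, image bytes, and prompt are placeholders; this only builds the JSON body and does not contact a server):

```python
import base64
import json

def build_vision_request(image_bytes: bytes, prompt: str) -> dict:
    """Build a multimodal chat-completion body with a base64 data-URI image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "max_tokens": 256,
    }

# Placeholder bytes stand in for a real PNG file read from disk.
payload = build_vision_request(b"\x89PNG placeholder", "Describe this image.")
print(json.dumps(payload)[:72])
# POST this body to http://127.0.0.1:8080/v1/chat/completions (default port)
```

If --fit-target is set correctly, this request loads the projector into the reserved buffer instead of OOM-ing the card.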
Helpful Parameter Breakdown:
--fit on: Manages the dynamic CPU/GPU layer split to keep you in the max speed sweet spot.
--fit-target 1536: Secures your 1.5GB multimodal safety net so images don't OOM your server.
-ctk q8_0 -ctv q8_0: Quantizes your KV cache to 8-bit. This is mandatory for hitting that full-context benchmark on a 5060 Ti, saving gigabytes of VRAM with virtually zero output degradation.
-np 1: Limits the server to a single parallel sequence slot, preventing VRAM waste on extra slots when you're the sole local user.

For deeper community hardware benchmarks and discussion of this specific setup, check out the active thread on r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/s/l5wdUmrHNL
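To see why the -ctk/-ctv quantization above buys so much headroom, here is back-of-the-envelope KV cache math. The layer and head counts are illustrative assumptions (not Qwen3.6's published config), and q8_0 is modeled at its nominal 34 bytes per 32-element block:

```python
# Rough KV cache size: 2 (K and V) * layers * context * kv_heads * head_dim * bytes.
# Architecture numbers below are assumed for illustration only.
def kv_cache_gib(n_ctx, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per_elt=2.0):
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt / 2**30

ctx = 131072  # a long-context window
fp16 = kv_cache_gib(ctx)                       # default f16 cache
q8   = kv_cache_gib(ctx, bytes_per_elt=34/32)  # q8_0: 34 bytes per 32 elements

print(f"f16 KV cache: {fp16:.1f} GiB, q8_0: {q8:.3f} GiB")
```

Under these assumed dimensions the f16 cache alone would swamp a 16GB card; q8_0 cuts it by roughly 47%, a saving inherent to the format even if the real model's dimensions differ.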
Uncensored model:
https://huggingface.co/HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive
#LocalLLaMA #MachineLearning #OpenSource #AIHardware #Qwen #AIEstatech
19 April 2026 by aiestatech in ENGINEERING