Benchmarking Qwen3.6 Variants for OpenCode on NVIDIA RTX PRO 6000 Blackwell

Yesterday, I conducted a series of local benchmarks using BenchLocal.app and a custom voicebot script. The goal was to evaluate different Qwen3.6 variants to determine the best candidate for the OpenCode project.

The tests were performed on an NVIDIA RTX PRO 6000 Blackwell Max-Q GPU. Below are the results for the four models evaluated.

Benchmark Results

1. Qwen/Qwen3.6-35B-A3B-FP8

BenchLocal.app Results	Performance & Voicebot Metrics
Score: 86.6	PP: 958 t/s
ToolCall: 100	TTFT (warm): 64ms
InstrucFollow: 90	TG: 194.1 t/s
DataExtract: 86	Voicebot: 80% (279ms)
BugFind: 84
HermesAgent: 73	Metrics: 5/6, 7/7, 5/9, 3/4, 4/4, 4/4, 1/3, 4/4

2. nvidia/Qwen3.6-35B-A3B-NVFP4

BenchLocal.app Results	Performance & Voicebot Metrics
Score: 84.4	PP: 1354 t/s
ToolCall: 97	TTFT (warm): 67ms
InstrucFollow: 98	TG: 230.6 t/s
DataExtract: 85	Voicebot: 82% (211ms)
BugFind: 84
HermesAgent: 58	Metrics: 5/6, 7/7, 6/9, 2/4, 4/4, 4/4, 2/3, 4/4

3. Qwen/Qwen3.6-27B-FP8

BenchLocal.app Results	Performance & Voicebot Metrics
Score: 86.2	PP: 3847 t/s
ToolCall: 97	TTFT (warm): 222ms
InstrucFollow: 97	TG: 82.0 t/s
DataExtract: 85	Voicebot: 85% (729ms)
BugFind: 90
HermesAgent: 62	Metrics: 6/6, 7/7, 8/9, 3/4, 4/4, 2/4, 2/3, 3/4

4. sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP

BenchLocal.app Results	Performance & Voicebot Metrics
Score: 83.4	PP: 5465 t/s
ToolCall: 90	TTFT (warm): 153ms
InstrucFollow: 97	TG: 105 t/s
DataExtract: 82	Voicebot: 78% (679ms)
BugFind: 95
HermesAgent: 53	Metrics: 5/6, 7/7, 7/9, 3/4, 3/4, 2/4, 1/3, 4/4

vLLM Deployment Recipe

To launch the Qwen/Qwen3.6-35B-A3B-FP8 variant on Blackwell hardware, use the following Docker Compose configuration. Note the flashinfer attention backend required for FP8 KV cache support.

Other models require different recipies.

services:
  vllm:
    image: "vllm/vllm-openai:nightly"
    runtime: nvidia
    restart: unless-stopped
    ipc: host
    shm_size: "32G"
    environment:
      - VLLM_LOGGING_LEVEL=info
      - VLLM_HOST_IP=0.0.0.0
      - HF_HOME=/root/.cache/huggingface
      - HF_TOKEN=hf_Il....KbaA
      - FLASHINFER_DISABLE_VERSION_CHECK=1
      - FLASHINFER_CUDA_ARCH_LIST=12.0f
      - VLLM_MOE_FORCE_MARLIN=1
      - CUTE_DSL_ARCH=sm_120a
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TRANSFORMERS_VERBOSITY=warning
    volumes:
      - ./llama-server/huggingface:/root/.cache/huggingface
      - ./llama-server/vllm_cache:/root/.cache/vllm
    ports:
      - "8000:8000"
    command: >
      Qwen/Qwen3.6-35B-A3B-FP8
      --served-model-name Qwen3.6-35B-A3B-FP8
      --trust-remote-code
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 1
      --dtype auto
      --kv-cache-dtype fp8_e4m3
      --attention-backend flashinfer
      --moe-backend marlin
      --gpu-memory-utilization 0.96
      --max-model-len 262144
      --max-num-seqs 32
      --max-num-batched-tokens 65536
      --enable-chunked-prefill
      --enable-prefix-caching
      --async-scheduling
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --structured-outputs-config.backend xgrammar
      --default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
      --speculative-config '{"method":"mtp","num_speculative_tokens":2,"moe_backend":"triton"}'
      --override-generation-config '{"temperature":0.6,"top_p":0.95,"presence_penalty":0.05,"repetition_penalty":1.05}'
      --language-model-only
      -O3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    logging:
      driver: "json-file"
      options:
        max-size: "20m"
        max-file: "5"