Skip to main content

Command Palette

Search for a command to run...

HumanEval on a MacBook — 81.7% pass@1, Wi-Fi off

Qwen 3 Coder 30B (8-bit MLX) scored 81.7% pass@1 on HumanEval running on a single M5 Max MacBook with Wi-Fi off. Real run, all 164 problems, 14 minutes wall-clock. The first measured number for this variant on this hardware.

Published
3 min read
M
I build cannabis commerce tech at Divine Tribe — 4 WordPress/WooCommerce stores (ineedhemp.com, nicedreamzwholesale.com, tribeseedbank.com, marijuanaunion.com), an A-Frame WebXR marketplace called The Farmstand, Python/Flask HQ dashboards, and trading/automation agents. 13 years of vaporizer hardware design. Writing here about indie commerce, WebXR, and right-to-repair hardware.

The M5 Max MacBook Pro with 128 GB of unified memory is the first laptop that can hold a frontier-class coding agent entirely in RAM. No GPU rack. No cloud. No subscription.

I just ran HumanEval on it. Wi-Fi off the entire run.

  • 81.7% pass@1 on the full 164-problem benchmark
  • Qwen 3 Coder 30B-A3B-Instruct (8-bit MLX)
  • 14 minutes wall-clock, $0/month after the model download

YouTube walkthrough (three real problems, code streaming live, tests going green): https://www.youtube.com/watch?v=muq7VdgxqRk

Why this number matters

The Qwen team didn't publish HumanEval scores for any Qwen3-Coder variant — they consider the benchmark saturated and went straight to agentic ones (SWE-bench Verified, BFCL, Aider-Polyglot). For the 30B variant — the one that actually fits on a laptop — there were no published HumanEval/MBPP numbers. Until this run.

I also ran MBPP (sanitized): 83.3% pass@1 on a 168-problem sample. Pass rate stable since n=120; full 427-run was impractical because a few outlier tasks induce very long model responses (10+ minutes each).

Methodology

SettingValue
BenchmarkHumanEval — 164 Python tasks (full)
Metricpass@1 (first attempt only)
Temperature0.0 — deterministic
Samplingsingle sample per problem, no best-of-N
ExecutionPython subprocess, 10s timeout
HardwareM5 Max MacBook Pro · 128 GB unified memory
ModelQwen3-Coder-30B-A3B-Instruct-MLX-8bit
NetworkWi-Fi OFF the entire run
Wall clock14 minutes

For context — Qwen3-Coder 480B's official agentic benchmarks

The Qwen team's published numbers for the 480B flagship sibling (the bigger sibling of the 30B running on this MacBook):

BenchmarkQwen3-Coder 480BClaude Sonnet 4GPT-4.1
SWE-bench Verified (500-turn)69.670.4
Terminal-Bench37.535.525.3
BFCL-v368.773.362.9
Aider-Polyglot61.856.452.4

Source: Qwen team's official blog.

Why the offline part matters

If a tool needs the internet, three things are true:

  1. Someone else can read what you sent.
  2. Someone else can charge you for it.
  3. Someone else can take it away.

If the same tool runs locally, none of those are true. That's a different category of software — and for law firms, medical practices, and accountants handling client material, it's the only legal one.

Reproduce it yourself

For law firms, medical practices, and accountants who want help getting this stack running on their own hardware — that's what AirGap is. 14-day pilot, fixed scope, the data never leaves your machines.

— matt


Originally published at Marijuana Union. For premium vaporizers visit iNeedHemp, wholesale at Nice Dreamz, and seeds at Tribe Seed Bank. Explore the 3D cannabis marketplace at The Farmstand.