HumanEval on a MacBook — 81.7% pass@1, Wi-Fi off

The M5 Max MacBook Pro with 128 GB of unified memory is the first laptop that can hold a frontier-class coding agent entirely in RAM. No GPU rack. No cloud. No subscription.

I just ran HumanEval on it. Wi-Fi off the entire run.

81.7% pass@1 on the full 164-problem benchmark
Qwen 3 Coder 30B-A3B-Instruct (8-bit MLX)
14 minutes wall-clock, $0/month after the model download

YouTube walkthrough (three real problems, code streaming live, tests going green): https://www.youtube.com/watch?v=muq7VdgxqRk

Why this number matters

The Qwen team didn't publish HumanEval scores for any Qwen3-Coder variant. They consider the benchmark saturated and went straight to agentic ones (SWE-bench Verified, BFCL, Aider-Polyglot). For the 30B variant, the one that actually fits on a laptop, I could find no published HumanEval or MBPP numbers before this run.

I also ran MBPP (sanitized): 83.3% pass@1 on a 168-problem sample. The pass rate was stable from n=120 onward. A full run over all 427 problems was impractical because a few outlier tasks trigger very long model responses (10+ minutes each).
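
For anyone who wants to sanity-check the headline figures: with a single sample per problem, pass@1 is simply problems solved divided by problems attempted. The sketch below assumes counts of 134/164 and 140/168. The raw counts are not given in this post; these are the only whole-number counts that round to the reported 81.7% and 83.3%. It also adds a rough standard error for the MBPP subset, under the further assumption that the 168 problems behave like a simple random sample of the 427.

```python
import math


def pass_at_1(passed: int, attempted: int) -> float:
    """Fraction of problems solved on the first (and only) sample."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return passed / attempted


def subset_standard_error(passed: int, sampled: int, population: int) -> float:
    """Normal-approximation standard error of a pass rate measured on a subset.

    Applies a finite population correction. Assumes the subset is a simple
    random sample of the population, which may not hold for this run.
    """
    p = pass_at_1(passed, sampled)
    se = math.sqrt(p * (1 - p) / sampled)
    fpc = math.sqrt((population - sampled) / (population - 1))
    return se * fpc


if __name__ == "__main__":
    # Counts inferred from the reported percentages (an assumption, see above).
    print(f"HumanEval pass@1:   {pass_at_1(134, 164):.1%}")
    print(f"MBPP subset pass@1: {pass_at_1(140, 168):.1%}")
    print(f"MBPP subset std. error: {subset_standard_error(140, 168, 427):.3f}")
```

The standard error is only a rough guide to how far the 168-problem figure could sit from a full 427-problem run. It says nothing about the outlier tasks that were skipped for time.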

Methodology

Setting	Value
Benchmark	HumanEval — 164 Python tasks (full)
Metric	pass@1 (first attempt only)
Temperature	0.0 — deterministic
Sampling	single sample per problem, no best-of-N
Execution	Python subprocess, 10s timeout
Hardware	M5 Max MacBook Pro · 128 GB unified memory
Model	Qwen3-Coder-30B-A3B-Instruct-MLX-8bit
Network	Wi-Fi OFF the entire run
Wall clock	14 minutes
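
The execution row above can be sketched as follows. This is a minimal illustration, not the launcher code from the repo: `run_candidate` and `build_program` are hypothetical names, and the toy `add` problem is made up for the example. It only mirrors the HumanEval record layout, where each task has a `prompt`, a `test` that defines `check(candidate)`, and an `entry_point`.

```python
import subprocess
import sys


def build_program(prompt: str, completion: str, test: str, entry_point: str) -> str:
    """Assemble prompt + model completion + tests into one runnable script."""
    return f"{prompt}{completion}\n\n{test}\n\ncheck({entry_point})\n"


def run_candidate(program: str, timeout_s: float = 10.0) -> bool:
    """Run the script in a fresh Python subprocess.

    Returns True only if the process exits with code 0 before the timeout.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising; count it as a failure.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Toy task in the HumanEval layout (hypothetical, not from the dataset).
    prompt = 'def add(a, b):\n    """Return a + b."""\n'
    completion = "    return a + b\n"
    test = "def check(candidate):\n    assert candidate(2, 3) == 5\n"
    print(run_candidate(build_program(prompt, completion, test, "add")))
```

A bare subprocess is not a sandbox. Model-generated code runs with your user's permissions, so a stricter setup (a container or throwaway account) is worth considering before running a benchmark's worth of untrusted completions.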

For context — Qwen3-Coder 480B's official agentic benchmarks

The Qwen team's published numbers for the 480B flagship, the bigger sibling of the 30B model running on this MacBook:

Benchmark	Qwen3-Coder 480B	Claude Sonnet 4	GPT-4.1
SWE-bench Verified (500-turn)	69.6	70.4	—
Terminal-Bench	37.5	35.5	25.3
BFCL-v3	68.7	73.3	62.9
Aider-Polyglot	61.8	56.4	52.4

Source: Qwen team's official blog.

Why the offline part matters

If a tool needs the internet, three things are true:

Someone else can read what you sent.
Someone else can charge you for it.
Someone else can take it away.

If the same tool runs locally, none of those are true. That's a different category of software, and for law firms, medical practices, and accountants handling client material, it's often the simplest way to keep confidential data within their legal and professional obligations.

Reproduce it yourself

Open-source launchers: github.com/nicedreamzapp/claude-code-local
HumanEval dataset: github.com/openai/human-eval
Hardware: any M-series MacBook with ≥32 GB RAM (128 GB Max preferred for full 8-bit weights)
Total monthly cost: $0 after the model download
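
To see why the RAM figure matters, here is back-of-envelope arithmetic for the weights alone. It assumes roughly 30 billion parameters (taken from the "30B" in the model name) and ignores quantization scales, the KV cache, and runtime overhead, so real usage will be higher. `weight_memory_gb` is an illustrative helper, not part of any library.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9.
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    for bits in (4, 8, 16):
        print(f"{bits}-bit: ~{weight_memory_gb(30, bits):.0f} GB of weights")
```

Under these assumptions the 8-bit weights alone come to about 30 GB, so a 32 GB machine would likely need a lower-bit quantization, while 128 GB leaves plenty of room for context.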

For law firms, medical practices, and accountants who want help getting this stack running on their own hardware — that's what AirGap is. 14-day pilot, fixed scope, the data never leaves your machines.

— matt

Originally published at Marijuana Union.
