HumanEval on a MacBook — 81.7% pass@1, Wi-Fi off
Qwen 3 Coder 30B (8-bit MLX) scored 81.7% pass@1 on HumanEval running on a single M5 Max MacBook with Wi-Fi off. Real run, all 164 problems, 14 minutes wall-clock. The first measured number for this variant on this hardware.
The M5 Max MacBook Pro with 128 GB of unified memory is the first laptop that can hold a frontier-class coding agent entirely in RAM. No GPU rack. No cloud. No subscription.
I just ran HumanEval on it. Wi-Fi off the entire run.
- 81.7% pass@1 on the full 164-problem benchmark
- Qwen 3 Coder 30B-A3B-Instruct (8-bit MLX)
- 14 minutes wall-clock, $0/month after the model download
YouTube walkthrough (three real problems, code streaming live, tests going green): https://www.youtube.com/watch?v=muq7VdgxqRk
Why this number matters
The Qwen team didn't publish HumanEval scores for any Qwen3-Coder variant. They appear to treat the benchmark as saturated and went straight to agentic ones (SWE-bench Verified, BFCL, Aider-Polyglot). For the 30B variant, the one that actually fits on a laptop, I could find no published HumanEval or MBPP numbers. Until this run.
I also ran MBPP (sanitized): 83.3% pass@1 on a 168-problem sample. The pass rate was stable from n=120 onward. A full run over all 427 problems was impractical because a few outlier tasks induce very long model responses (10+ minutes each).
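The post reports pass rates, not raw counts. As a quick sanity check, here is a minimal sketch that recovers the pass counts implied by the rounded percentages. The helper name `passing_counts` is hypothetical, and the counts it prints are inferred from the reported figures rather than taken from the run logs.

```python
def passing_counts(total: int, reported_pct: float) -> list[int]:
    """Return every integer pass count whose rate rounds to reported_pct (one decimal)."""
    return [
        k for k in range(total + 1)
        if round(100 * k / total, 1) == reported_pct
    ]


# HumanEval: 164 problems, reported 81.7%
print(passing_counts(164, 81.7))  # [134]

# MBPP (sanitized) sample: 168 problems, reported 83.3%
print(passing_counts(168, 83.3))  # [140]
```

Each percentage matches exactly one integer count, so the reported numbers are consistent with 134 of 164 and 140 of 168.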
Methodology
| Setting | Value |
| Benchmark | HumanEval — 164 Python tasks (full) |
| Metric | pass@1 (first attempt only) |
| Temperature | 0.0 — deterministic |
| Sampling | single sample per problem, no best-of-N |
| Execution | Python subprocess, 10s timeout |
| Hardware | M5 Max MacBook Pro · 128 GB unified memory |
| Model | Qwen3-Coder-30B-A3B-Instruct-MLX-8bit |
| Network | Wi-Fi OFF the entire run |
| Wall clock | 14 minutes |
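The execution and metric rows above can be sketched roughly as follows. This is not the harness used for the run. It is a minimal illustration that assumes the standard HumanEval task fields (`prompt`, `test`, `entry_point`) and the 10-second subprocess timeout from the table. The function names `run_candidate` and `pass_at_1` are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(prompt: str, completion: str, test: str, entry_point: str,
                  timeout_s: float = 10.0) -> bool:
    """Return True if prompt + completion passes the task's tests in a subprocess."""
    # HumanEval tests define check(candidate), which raises AssertionError on failure.
    program = f"{prompt}{completion}\n\n{test}\n\ncheck({entry_point})\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang or very slow solution counts as a failure
    return result.returncode == 0


def pass_at_1(results: list[bool]) -> float:
    """With a single sample per problem, pass@1 is the fraction of problems passed."""
    return sum(results) / len(results)
```

A subprocess gives each task a fresh interpreter and a hard timeout, but it is not a sandbox. Model-generated code still runs with your user's permissions, so treat this as a sketch and isolate it further before running untrusted output.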
For context — Qwen3-Coder 480B's official agentic benchmarks
The Qwen team's published numbers for the 480B flagship (the larger sibling of the 30B model running on this MacBook):
| Benchmark | Qwen3-Coder 480B | Claude Sonnet 4 | GPT-4.1 |
| SWE-bench Verified (500-turn) | 69.6 | 70.4 | — |
| Terminal-Bench | 37.5 | 35.5 | 25.3 |
| BFCL-v3 | 68.7 | 73.3 | 62.9 |
| Aider-Polyglot | 61.8 | 56.4 | 52.4 |
Source: Qwen team's official blog.
Why the offline part matters
If a tool needs the internet, three things are true:
- Someone else can read what you sent.
- Someone else can charge you for it.
- Someone else can take it away.
If the same tool runs locally, none of those are true. That's a different category of software — and for law firms, medical practices, and accountants handling client material, it's the only legal one.
Reproduce it yourself
- Open-source launchers: github.com/nicedreamzapp/claude-code-local
- HumanEval dataset: github.com/openai/human-eval
- Hardware: any M-series MacBook with ≥32 GB RAM (128 GB Max preferred for full 8-bit weights)
- Total monthly cost: $0 after the model download
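As a starting point for reproduction, here is a minimal sketch for reading the HumanEval tasks. It assumes you have cloned the human-eval repository, which ships the tasks as gzipped JSONL with one JSON object per line and fields including `task_id`, `prompt`, `entry_point`, and `test`. The function name `load_tasks` is hypothetical.

```python
import gzip
import json


def load_tasks(path: str) -> list[dict]:
    """Read a gzipped JSONL file of tasks, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Assuming the clone sits in the working directory, `load_tasks("human-eval/data/HumanEval.jsonl.gz")` should return all 164 tasks.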
For law firms, medical practices, and accountants who want help getting this stack running on their own hardware — that's what AirGap is. 14-day pilot, fixed scope, the data never leaves your machines.
— matt
Originally published at Marijuana Union.

