Anthropic Performance Take‑Home: Open‑Source Benchmark
What Is the Anthropic Performance Take‑Home?
Anthropic recently published a Performance Take‑Home repository on GitHub that invites the community to tackle a real‑world optimisation challenge. The goal is simple: write code that completes a specific task in fewer clock cycles than the benchmark set by Claude Opus 4.5 in a 2‑hour test harness. The repo is deliberately stripped of advanced features so that everyone starts from the same slow baseline.
> "The original take‑home was a 4‑hour one that starts close to the contents of this repo, after Claude Opus 4 beat most humans at that, it was updated to a 2‑hour one which started with code which achieved 18532 cycles (7.97x faster than this repo starts you)." – README excerpt
Why Do You Care?
- Hands‑on AI benchmarking – If you’re an AI engineer, this gives you a concrete way to measure a system’s performance against a leading model.
- Job‑ready experience – Anthropic explicitly invites submissions that beat the record: “If you optimize below 1487 cycles, beating Claude Opus 4.5’s best performance at launch, email us at performance‑[email protected].”
- Learning opportunity – The repo demonstrates real‑world Python optimisation, how to structure tests, and the pitfalls of cheating by modifying the test harness.
Repo Anatomy
The repo is intentionally straightforward. Below is a high‑level snapshot of its structure:
```
├─ .gitignore           # Standard ignores
├─ Readme.md            # Challenge description & benchmarks
├─ perf_takehome.py     # Reference implementation (slow baseline)
├─ problem.py           # Problem logic (core algorithm)
├─ watch_trace.py       # Simple profiling helper
├─ watch_trace.html     # HTML visualisation of trace data
└─ tests/
   ├─ __init__.py
   ├─ submission_tests.py   # Runner that prints cycle count
   └─ ... (fixtures & helper scripts)
```
Key Files
- `problem.py` – The algorithmic core; you'll modify this if you want to change how the problem is solved.
- `perf_takehome.py` – A convenience wrapper that drives tests and prints the cycle count.
- `tests/submission_tests.py` – The only thing you should run to validate your solution.
Running the Tests
The repo comes with a minimal test harness. Run the following commands in the repo root:
```
# Ensure your test directory is untouched
git diff origin/main tests/

# Execute the benchmark and print the cycle count
python tests/submission_tests.py
```
If the `git diff` command prints nothing, you haven't modified any test files – that's a safety check against "cheating" tricks. The second command outputs something like:
```
Total cycles: 18532
```
Your mission is to reduce that number.
How to Beat the Benchmark
- Start With the Baseline – Clone the repo and run the tests. Note your cycle count.
- Profile First – Use `watch_trace.py` or a standard profiler (e.g., `cProfile`) to spot hotspots; see the profiling command just after this list.
- Micro‑optimise – Typical gains (illustrated in the first Python sketch below) come from:
  - Eliminating unnecessary loops
  - Using built‑in functions over pure Python code
  - Avoiding global lookups inside tight loops
- Algorithmic Tweaks – In some cases a better data structure or algorithm can shave off large chunks of cycles (see the second Python sketch below).
- Leave Multi‑core Alone – The repo explicitly disables multicore support; any attempt to hack the core count will be flagged as cheating.
- Validate – After each tweak, rerun `tests/submission_tests.py` to verify the new cycle count.
- Submit – Once you're below 1487 cycles, email Anthropic as the README instructs, noting that you've verified your solution against an unmodified test harness.
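If you reach for `cProfile`, you can point it straight at the test runner. Assuming you invoke the harness from the repo root as shown earlier, a minimal profiling run looks like:

```
# Sort by cumulative time so the hottest call paths float to the top
python -m cProfile -s cumulative tests/submission_tests.py
```

Read the top entries first; optimising anything that doesn't appear near the top of that output is usually wasted effort.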
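To make the micro‑optimisation bullets concrete, here is an illustrative sketch – generic Python, not code from the repo – showing how hoisting a global lookup and leaning on built‑ins removes per‑iteration overhead:

```python
import math

# Slow: the loop runs at Python speed, and math.sqrt is
# re-resolved (global + attribute lookup) on every iteration.
def slow_sum_sqrt(vals):
    total = 0.0
    for v in vals:
        total += math.sqrt(v)
    return total

# Faster: one lookup up front, then sum() and map() drive the loop in C.
def fast_sum_sqrt(vals):
    sqrt = math.sqrt
    return sum(map(sqrt, vals))
```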
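And a sketch of the data‑structure point: a one‑time conversion from list to set turns a quadratic membership scan into a linear pass. Again, this is a generic illustration, not the repo's actual hotspot:

```python
needles = list(range(0, 100_000, 7))
haystack = list(range(100_000))

# O(len(haystack)) per lookup -- quadratic overall.
hits_slow = [n for n in needles if n in haystack]

# O(1) average per lookup after a single O(n) set build.
haystack_set = set(haystack)
hits_fast = [n for n in needles if n in haystack_set]

assert hits_slow == hits_fast
```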
Common Pitfalls
- Altering test files – Even a tiny change can drastically lower cycle counts, but it’s disallowed.
- Re‑implementing the test harness – Replicating the harness inside your own code to shortcut checks is considered cheating.
- Skipping edge cases – The tests cover a broad set of scenarios; ignoring any of them leads to a failed submission.
Benchmark Numbers
For context, here are the recorded cycle counts that Anthropic used as reference:
| Model | Cycles | Notes |
|---|---|---|
| Claude Opus 4 | 2164 | After many hours in the test harness |
| Claude Opus 4.5 | 1790 | Casual Claude Code session |
| Claude Opus 4.5 | 1579 | 2‑hour harness |
| Claude Sonnet 4.5 | 1548 | After many hours |
| Claude Opus 4.5 | 1487 | 11.5‑hour harness |
| Claude Opus 4.5 | 1363 | Improved harness |
Your goal is to come in below the 1487 milestone.
Beyond the Challenge
Even if you can’t beat the record, the process teaches:
- Profiling skills – How to isolate bottlenecks in Python.
- Algorithmic thinking – Balancing time vs. space.
- Reproducibility – The importance of a clean test harness.
You can also experiment with the repo by adding your own tests, or by porting the logic to a different language. Feel free to fork and share your findings with the community.
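For instance, a small regression guard makes experimentation safer. The script below is a hypothetical helper, not part of the repo; it only assumes the runner prints a `Total cycles: N` line like the output shown earlier:

```python
import re
import subprocess

def measure_cycles() -> int:
    """Run the untouched harness and parse its cycle count."""
    out = subprocess.run(
        ["python", "tests/submission_tests.py"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"Total cycles:\s*(\d+)", out)
    assert match, f"unexpected harness output:\n{out}"
    return int(match.group(1))

if __name__ == "__main__":
    cycles = measure_cycles()
    print(f"{cycles} cycles")
    assert cycles < 18532, "regressed past the starting baseline"
```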
Final Thoughts
Anthropic’s Performance Take‑Home is more than a coding exercise; it’s a window into the real‑world engineering that runs behind state‑of‑the‑art language models. Whether you’re aiming to land a role at a leading AI lab or simply enjoy squeezing performance out of Python, this repo offers a concrete, measurable challenge.
Now that you've got the map, it's time to roll up your sleeves. Clone the repo, profile, tweak, and see if you can beat Claude's 2‑hour benchmark. Good luck!