Anthropic Performance Take‑Home: Open‑Source Benchmark

What Is the Anthropic Performance Take‑Home?

Anthropic recently published a Performance Take‑Home repository on GitHub that invites the community to tackle a real‑world optimisation challenge. The goal is simple: write code that completes a specific task in fewer cycles (as counted by the repo's own test harness) than the benchmark set by Claude Opus 4.5 in a 2‑hour session. The repo is deliberately stripped of advanced features so that everyone starts from the same slow baseline.

“The original take‑home was a 4‑hour one that starts close to the contents of this repo, after Claude Opus 4 beat most humans at that, it was updated to a 2‑hour one which started with code which achieved 18532 cycles (7.97x faster than this repo starts you).” – README excerpt

Why Should You Care?

  • Hands‑on AI benchmarking – If you’re an AI engineer, this gives you a concrete way to measure a system’s performance against a leading model.
  • Job‑ready experience – Anthropic explicitly invites submissions that beat the record: “If you optimize below 1487 cycles, beating Claude Opus 4.5’s best performance at launch, email us at performance‑[email protected].”
  • Learning opportunity – The repo demonstrates real‑world Python optimisation, how to structure tests, and the pitfalls of cheating by modifying the test harness.

Repo Anatomy

The repo is intentionally straightforward. Below is a high‑level snapshot of its structure:

├─ .gitignore              # Standard ignores
├─ Readme.md               # Challenge description & benchmarks
├─ perf_takehome.py        # Reference implementation (slow baseline)
├─ problem.py              # Problem logic (core algorithm)
├─ watch_trace.py          # Simple profiling helper
├─ watch_trace.html        # HTML visualisation of trace data
└─ tests/
   ├─ __init__.py
   ├─ submission_tests.py  # Runner that prints the cycle count
   └─ ... (fixtures & helper scripts)

Key Files

  • problem.py – The algorithmic core; you’ll modify this if you want to change how the problem is solved.
  • perf_takehome.py – A convenience wrapper that drives tests and prints cycle count.
  • tests/submission_tests.py – The only thing you should run to validate your solution.

Running the Tests

The repo comes with a minimal test harness. Run the following commands in the repo root:

# Ensure your test directory is untouched
git diff origin/main tests/
# Execute the benchmark and print cycle count
python tests/submission_tests.py

If the output is empty after the git diff command, you haven’t modified any test files – that’s a safety check against “cheating” tricks. The second command outputs something like:

Total cycles: 18532

Your mission is to reduce that number.

How to Beat the Benchmark

  1. Start With the Baseline – Clone the repo and run the tests. Note your cycle count.
  2. Profile First – Use watch_trace.py or a standard profiler (e.g., cProfile) to spot hotspots.
  3. Micro‑optimise – Typical gains come from:
     • Eliminating unnecessary loops
     • Using built‑in functions over pure Python code
     • Avoiding global lookups inside tight loops
  4. Algorithmic Tweaks – In some cases a better data structure or algorithm can shave off large chunks of cycles.
  5. Leave Multi‑core Alone – The repo explicitly disables multicore support; any attempt to hack the core count will be flagged as cheating.
  6. Validate – After each tweak, rerun tests/submission_tests.py to verify the new cycle count.
  7. Submit – Once you’re below 1487 cycles, email Anthropic at the address in the README, noting that you’ve verified your solution against the unmodified tests.
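The repo's internals aren't reproduced here, so the sketch below applies steps 2 and 3 to a stand‑in workload. The functions slow_total and fast_total are illustrative, not from the repo: profile first with cProfile to confirm the hotspot, then compare a pure‑Python loop against a built‑in bound to a local.

```python
import cProfile
import timeit

data = list(range(100_000))

def slow_total(values):
    # Pure-Python loop: per-iteration bytecode dispatch and name lookups.
    total = 0
    for v in values:
        total += v
    return total

def fast_total(values, _sum=sum):
    # The built-in sum() runs the loop in C; binding it as a default
    # argument also avoids a global lookup on every call.
    return _sum(values)

# Step 2: profile first, to confirm where the time actually goes.
profiler = cProfile.Profile()
profiler.runcall(slow_total, data)
profiler.print_stats("cumulative")

# Step 3: micro-optimise, then measure -- never assume a change helped.
assert slow_total(data) == fast_total(data)
print("loop:   ", timeit.timeit(lambda: slow_total(data), number=50))
print("builtin:", timeit.timeit(lambda: fast_total(data), number=50))
```

On CPython the built‑in version is typically several times faster. The same measure‑before‑and‑after discipline applies to the cycle counts reported by the repo's harness.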

Common Pitfalls

  • Altering test files – Even a tiny change can drastically lower cycle counts, but it’s disallowed.
  • Re‑implementation of the test harness – Replicating the harness inside your code to shortcut checks is considered cheating.
  • Skipping edge cases – The tests cover a comprehensive scenario set; ignoring them leads to failed submissions.

Benchmark Numbers

For context, here are the recorded cycle counts that Anthropic used as reference:

  Model               Cycles   Notes
  Claude Opus 4       2164     After many hours in the test harness
  Claude Opus 4.5     1790     Casual Claude Code session
  Claude Opus 4.5     1579     2‑hour harness
  Claude Sonnet 4.5   1548     After many hours
  Claude Opus 4.5     1487     11.5‑hour harness
  Claude Opus 4.5     1363     Improved harness

Your goal is to come in below the 1487 milestone.
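For a sense of scale, the README's own figures quoted earlier imply how far the baseline has to be pushed. The arithmetic below uses only numbers that appear in this article:

```python
# Back-of-the-envelope math from the README's published numbers.
baseline_2h = 18532   # starting cycle count of the 2-hour take-home
factor = 7.97         # README: 18532 is 7.97x faster than this repo's start
target = 1487         # Claude Opus 4.5's best performance at launch

repo_start = baseline_2h * factor   # this repo's approximate starting cycles
print(f"approx. starting cycles: {repo_start:.0f}")
print(f"speedup needed to beat the record: {repo_start / target:.0f}x")
```

In other words, this repo starts you at roughly 147,700 cycles, so getting under 1487 amounts to about a two‑orders‑of‑magnitude speedup.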

Beyond the Challenge

Even if you can’t beat the record, the process teaches:

  • Profiling skills – How to isolate bottlenecks in Python.
  • Algorithmic thinking – Balancing time vs. space.
  • Reproducibility – The importance of a clean test harness.

You can also experiment with the repo by adding your own tests, or by porting the logic to a different language. Feel free to fork and share your findings with the community.

Final Thoughts

Anthropic’s Performance Take‑Home is more than a coding exercise; it’s a window into the real‑world engineering that runs behind state‑of‑the‑art language models. Whether you’re aiming to land a role at a leading AI lab or simply enjoy squeezing performance out of Python, this repo offers a concrete, measurable challenge.

Now that you’ve got the map, it’s time to roll up your sleeves. Clone the repo, profile, tweak, and see if you can get below the 1487‑cycle record. Good luck!
