K2 Vendor Verifier: A Practical Tool for Evaluating Kimi K2 APIs
Kimi K2 is a newly released large-language-model platform that promises high-quality "agentic" dialogue via powerful tool-calling capabilities. However, like any commercial AI product, K2's practical ability to trigger and format tool calls varies widely across providers. The K2 Vendor Verifier addresses this exact pain point: a robust, open-source benchmark that measures both tool-call precision and schema accuracy for any third-party deployment.
Why an Evaluation Tool Is Needed
- Tool‑call reliability matters – In agentic workflows, a single missed or malformed call can derail an entire task.
- Vendor drift – Different hosting solutions (e.g. Fireworks, vLLM, SGLang) can diverge in latency, cost, and internal engine versions.
- Open‑source transparency – Developers can verify results instead of relying on vendor‑provided numbers.
The verifier fills that gap with a command‑line utility that:
- Loads a curated dataset of 4,000+ tool‑call prompts.
- Sends concurrent requests to any provider.
- Captures the model's `finish_reason` and JSON payload.
- Calculates `tool_call_f1` and `schema_accuracy`.
- Generates a clean CSV or JSON summary.
Core Features
| Feature | Description |
|---|---|
| Batch Evaluation | Run 4k+ prompts automatically, configurable concurrency. |
| Metric Suite | Tool‑call‑trigger similarity, schema validity, overall scores. |
| Cross‑Vendor Comparison | Side‑by‑side tables for dozens of APIs (Moonshot, Fireworks, vLLM, etc.). |
| Guided Encoding | Enforce correct JSON schema via model prompts – useful for vendors. |
| Extensible | Import custom datasets, change base URL, add custom payloads. |
| Open‑source | All code on GitHub under MIT license. |
Getting Started
1. Clone the repo:

   ```shell
   git clone https://github.com/MoonshotAI/K2-Vendor-Verifier.git
   cd K2-Vendor-Verifier
   ```

2. Install dependencies (requires Python 3.9+ and `uv`):

   ```shell
   uv sync
   ```

3. Run the benchmark – replace `YOUR_API_KEY` and the provider endpoint:

   ```shell
   python tool_calls_eval.py samples.jsonl \
     --model kimi-k2-0905-preview \
     --base-url https://api.moonshot.cn/v1 \
     --api-key YOUR_API_KEY \
     --concurrency 5 \
     --output results.jsonl \
     --summary summary.json
   ```

4. View the results – `summary.json` contains the overall metrics; `results.jsonl` breaks down each request.

Tip: For OpenRouter-based vendors, use the `--extra-body` flag to filter the provider list.
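Since `results.jsonl` is JSON Lines (one JSON object per line), it is easy to post-process. A small sketch, with illustrative field names that may not match the repo's actual output:

```python
import json
import os
import tempfile

# Illustrative records; the real results.jsonl fields may differ.
records = [{"id": 1, "finish_reason": "tool_calls"},
           {"id": 2, "finish_reason": "stop"}]

path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# JSON Lines: parse each line independently.
with open(path) as f:
    loaded = [json.loads(line) for line in f]

tool_call_rate = sum(r["finish_reason"] == "tool_calls" for r in loaded) / len(loaded)
print(tool_call_rate)  # 0.5
```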
Evaluation Metrics Explained
| Metric | Formula | What It Captures |
|---|---|---|
| tool_call_precision | TP / (TP + FP) | How often a called tool was actually needed |
| tool_call_recall | TP / (TP + FN) | How many needed calls the model triggered |
| tool_call_f1 | 2 × precision × recall / (precision + recall) | Balance between precision and recall |
| schema_accuracy | successful_calls / total_tool_calls | Valid JSON payloads only |
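The formulas above translate directly into code. A minimal sketch (the function name and counts are illustrative, not the repo's API):

```python
def tool_call_metrics(tp, fp, fn, successful_calls, total_tool_calls):
    """Compute the verifier's metric suite from raw counts,
    guarding each ratio against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    schema_accuracy = (successful_calls / total_tool_calls
                       if total_tool_calls else 0.0)
    return {"tool_call_precision": precision,
            "tool_call_recall": recall,
            "tool_call_f1": f1,
            "schema_accuracy": schema_accuracy}

m = tool_call_metrics(tp=80, fp=10, fn=20,
                      successful_calls=85, total_tool_calls=90)
print(round(m["tool_call_f1"], 3))  # 0.842
```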
The project publishes reference thresholds: `tool_call_f1` above 73% for the kimi-k2-thinking model and above 80% for kimi-k2-0905-preview. If your provider falls below these, the verifier's per-request output helps pinpoint precision or schema issues.
Vendor‑Specific Guidance
- Version Check – Use the minimum recommended API version (e.g., `vllm v0.11.0` for the 0905 benchmark); older implementations often mis-format JSON.
- Tool-ID Normalization – Rename legacy IDs to `functions.func_name:idx` to match Kimi K2's expectations.
- Guided Encoding – Add explicit prompts that force the model to adhere to your schema. The repo includes a helper JSON schema file.
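Guided encoding pairs naturally with a post-hoc validity check of the kind `schema_accuracy` counts. Here is a stdlib-only sketch that validates required keys and types; a real deployment would use a full JSON Schema validator, and the schema shown is an invented example:

```python
import json

# Invented example: a tiny subset of JSON Schema,
# required keys plus per-key type checks.
SCHEMA = {"required": ["city", "unit"],
          "types": {"city": str, "unit": str}}

def payload_is_valid(raw_arguments, schema=SCHEMA):
    """Return True iff the tool-call arguments parse as JSON and
    contain the required, correctly typed fields."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return False
    if not all(k in args for k in schema["required"]):
        return False
    return all(isinstance(args[k], t) for k, t in schema["types"].items())

print(payload_is_valid('{"city": "Berlin", "unit": "celsius"}'))  # True
print(payload_is_valid('{"city": "Berlin"'))                      # False (truncated JSON)
```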
Contributing & Community
Contributions are welcome:
- Add new vendor benchmarks.
- Improve metric calculations.
- Create better visualizations for the summary.
Open issues and pull requests are tracked on GitHub. For rapid feedback, community members can join the project’s Discord channel (link in the repo description).
Bottom Line
K2 Vendor Verifier is more than a curiosity: it is a practical audit tool for anyone deploying or using Kimi K2 in production. By quantifying both the trigger quality and the schema quality of tool calls, it gives developers a clear, actionable path to improving reliability and user experience.
Try it today, compare your results to the published tables, and help push the Kimi K2 ecosystem towards standardized, trustworthy tool‑calling performance.