K2 Vendor Verifier: A Practical Tool for Evaluating Kimi K2 APIs

Kimi K2 is Moonshot AI's large language model, built for agentic dialogue with strong tool‑calling capabilities. However, because K2 is served by many third‑party providers, its practical ability to trigger and format tool calls varies widely across deployments. K2 Vendor Verifier addresses this pain point: an open‑source benchmark that measures both tool‑call precision and schema accuracy for any third‑party deployment.

Why an Evaluation Tool Is Needed

  • Tool‑call reliability matters – In agentic workflows, a single missed or malformed call can break the entire run.
  • Vendor drift – Different hosting stacks (e.g., Fireworks, vLLM, SGLang) diverge in latency, cost, and internal engine versions.
  • Open‑source transparency – Developers can verify results instead of relying on vendor‑provided numbers.

The verifier fills that gap with a command‑line utility that:

  • Loads a curated dataset of 4,000+ tool‑call prompts.
  • Sends concurrent requests to any provider.
  • Captures the model’s finish_reason and JSON payload.
  • Calculates tool_call_f1 and schema_accuracy.
  • Generates a clean CSV or JSON summary.
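
Under the hood, each probe follows the standard OpenAI‑compatible chat‑completions flow. The following is a minimal sketch of what a single request looks like, not the verifier's actual implementation; the get_weather tool is a hypothetical example:

    # Minimal sketch of one tool-call probe (illustrative; not the verifier's code).
    # The get_weather tool below is hypothetical.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="https://api.moonshot.cn/v1", api_key="YOUR_API_KEY")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="kimi-k2-0905-preview",
        messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
        tools=tools,
    )

    choice = resp.choices[0]
    print(choice.finish_reason)  # "tool_calls" when the model fires a tool
    for call in choice.message.tool_calls or []:
        try:
            args = json.loads(call.function.arguments)  # schema check: must parse as JSON
            print(call.function.name, args)
        except json.JSONDecodeError:
            print("malformed payload:", call.function.arguments)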

Core Features

  • Batch Evaluation – Run the 4k+ prompt set automatically with configurable concurrency.
  • Metric Suite – Tool‑call trigger similarity, schema validity, and overall scores.
  • Cross‑Vendor Comparison – Side‑by‑side tables for dozens of APIs (Moonshot, Fireworks, vLLM, etc.).
  • Guided Encoding – Enforce a correct JSON schema via model prompts; useful for vendors.
  • Extensible – Import custom datasets, change the base URL, add custom payloads (see the dataset sketch after this list).
  • Open source – All code is on GitHub under the MIT license.
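
On the Extensible point: the authoritative record layout is whatever the repo's samples.jsonl uses, so inspect that file first. Purely as an illustration, a custom record in the common OpenAI chat format might be assembled like this (the field names here are assumptions):

    # Hypothetical custom-dataset record; field names are assumptions, so match
    # them against the repo's samples.jsonl before running an evaluation.
    import json

    record = {
        "messages": [{"role": "user", "content": "Book a table for two at 7pm."}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "book_table",  # hypothetical tool
                "parameters": {
                    "type": "object",
                    "properties": {
                        "party_size": {"type": "integer"},
                        "time": {"type": "string"},
                    },
                    "required": ["party_size", "time"],
                },
            },
        }],
    }

    with open("custom_samples.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")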

Getting Started

  1. Clone the Repo
    git clone https://github.com/MoonshotAI/K2-Vendor-Verifier.git
    cd K2-Vendor-Verifier
    
  2. Install Dependencies (requires Python 3.9+ and uv)
    uv sync
    
  3. Run the Benchmark – Replace YOUR_API_KEY and provider endpoint.
    python tool_calls_eval.py samples.jsonl \
      --model kimi-k2-0905-preview \
      --base-url https://api.moonshot.cn/v1 \
      --api-key YOUR_API_KEY \
      --concurrency 5 \
      --output results.jsonl \
      --summary summary.json
    
  4. View Results – summary.json contains overall metrics; results.jsonl breaks down each request (a loading sketch follows below).
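
Both output files are plain JSON/JSONL, so they are easy to post‑process. A minimal sketch; the per‑record fields inside results.jsonl are not documented here, so inspect them before relying on any particular key:

    # Load the benchmark outputs for inspection (illustrative sketch).
    import json

    with open("summary.json", encoding="utf-8") as f:
        summary = json.load(f)
    print(json.dumps(summary, indent=2))  # overall metrics at a glance

    with open("results.jsonl", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    print(f"loaded {len(rows)} per-request records")  # field names vary; inspect rows[0]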

Tip: For OpenRouter‑based vendors, use the --extra-body flag to filter the provider list, as in the example below.
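
For instance, an OpenRouter run pinned to a single provider might look like the following; the model slug and the provider‑routing JSON are assumptions to check against OpenRouter's model list and provider‑routing docs:

    python tool_calls_eval.py samples.jsonl \
      --model moonshotai/kimi-k2-0905 \
      --base-url https://openrouter.ai/api/v1 \
      --api-key YOUR_OPENROUTER_KEY \
      --extra-body '{"provider": {"only": ["fireworks"]}}' \
      --output results.jsonl \
      --summary summary.json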

Evaluation Metrics Explained

  • tool_call_precision = TP / (TP + FP) – How often a triggered tool call was actually needed.
  • tool_call_recall = TP / (TP + FN) – How many needed calls the model actually triggered.
  • tool_call_f1 = 2 × precision × recall / (precision + recall) – The balance between precision and recall.
  • schema_accuracy = successful_calls / total_tool_calls – The fraction of tool calls with valid JSON payloads.
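
To make the formulas concrete, here is a tiny worked example with made‑up counts (not real benchmark numbers):

    # Worked example of the metric formulas; all counts are made up.
    tp, fp, fn = 82, 10, 12  # tool-call trigger outcomes
    valid, total = 88, 92    # schema-valid calls vs. all calls

    precision = tp / (tp + fp)                          # 82/92 ≈ 0.891
    recall = tp / (tp + fn)                             # 82/94 ≈ 0.872
    f1 = 2 * precision * recall / (precision + recall)  #       ≈ 0.882
    schema_accuracy = valid / total                     # 88/92 ≈ 0.957

    print(f"tool_call_f1={f1:.3f}, schema_accuracy={schema_accuracy:.3f}")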

The project publishes reference thresholds: tool_call_f1 above 73% for the kimi‑k2‑thinking model and above 80% for kimi‑k2‑0905‑preview. If your provider falls below these, the verifier's output highlights potential triggering or schema issues.

Vendor‑Specific Guidance

  • Version Check – Use at least the recommended inference‑engine version (e.g., vLLM v0.11.0 for the 0905 benchmark); older builds often mis‑format JSON.
  • Tool‑ID Normalization – Rename legacy IDs to functions.func_name:idx to match Kimi K2 expectations (see the sketch after this list).
  • Guided Encoding – Add explicit prompts that force the model to adhere to your schema. The repo includes a helper JSON schema file.
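
A minimal sketch of the Tool‑ID Normalization above, assuming OpenAI‑style tool_calls records as input; the exact convention the verifier expects may differ:

    # Rename legacy tool-call IDs to functions.<name>:<index> (sketch; the
    # OpenAI-style input shape is an assumption).
    def normalize_tool_call_ids(tool_calls: list[dict]) -> list[dict]:
        normalized = []
        for idx, call in enumerate(tool_calls):
            call = dict(call)  # shallow copy to avoid mutating the input
            call["id"] = f"functions.{call['function']['name']}:{idx}"
            normalized.append(call)
        return normalized

    # Example: an id like "call_abc123" becomes "functions.get_weather:0".
    print(normalize_tool_call_ids(
        [{"id": "call_abc123", "function": {"name": "get_weather", "arguments": "{}"}}]
    ))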

Contributing & Community

Contributions are welcome:

  • Add new vendor benchmarks.
  • Improve metric calculations.
  • Create better visualizations for the summary.

Open issues and pull requests are tracked on GitHub. For rapid feedback, community members can join the project’s Discord channel (link in the repo description).

Bottom Line

K2 Vendor Verifier is more than a curiosity—it's a critical audit tool for anyone deploying or using Kimi K2 in production. By quantifying both the trigger and schema quality of tool calls, it gives developers a clear, actionable path to improving reliability and user experience.

Try it today, compare your results to the published tables, and help push the Kimi K2 ecosystem towards standardized, trustworthy tool‑calling performance.
