ReaderLM-v2: The Next Evolution in HTML-to-Text Conversion

ReaderLM-v2 Project Summary

Project Description

ReaderLM-v2 is a 1.5B-parameter language model developed by Jina AI that specializes in converting raw HTML into well-formatted Markdown or JSON. Compared with its predecessor, it offers superior accuracy, improved handling of longer contexts (up to 512K tokens of combined input and output), and comprehensive multilingual support (29 languages), strengthening content extraction, HTML parsing, and transformation tasks.

What's New in ReaderLM-v2

  • Better Markdown Generation: Improved generation of complex elements like code fences, nested lists, tables, and LaTeX equations.
  • JSON Output: Direct HTML-to-JSON generation using predefined schemas.
  • Longer Context Handling: Supports up to 512K tokens with better performance on long content.
  • Multilingual Support: Expanded to 29 languages.
  • Enhanced Stability: Contrastive loss during training alleviates degeneration (e.g., repetition and looping) in long-sequence generation.

Model Overview

  • Model Type: Autoregressive, decoder-only transformer
  • Parameter Count: 1.54 Billion
  • Context Window: Up to 512K tokens (combined input and output)
  • Supported Languages: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 in total).

Usage Instructions

Via Reader API

ReaderLM-v2 is integrated into the Reader API. To use it, specify x-engine: readerlm-v2 in the request headers and set Accept: text/event-stream.

curl https://r.jina.ai/https://news.ycombinator.com/ -H 'x-engine: readerlm-v2' -H 'Accept: text/event-stream'
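The same request from Python, as a minimal sketch using the requests library (endpoint and headers exactly as in the curl command above):

import requests

# Stream the converted page from the Reader API as server-sent events.
response = requests.get(
    "https://r.jina.ai/https://news.ycombinator.com/",
    headers={"x-engine": "readerlm-v2", "Accept": "text/event-stream"},
    stream=True,
)
for line in response.iter_lines(decode_unicode=True):
    if line:
        print(line)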

On Google Colab

A Google Colab notebook is available to demonstrate HTML-to-Markdown conversion, JSON extraction, and instruction-following. It is optimized for Colab's free T4 GPU tier, requiring vllm and triton.
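The notebook itself is the authoritative example, but a minimal vLLM sketch of the same flow looks roughly like this (the sampling settings are assumptions borrowed from the Hugging Face example below, not the notebook's exact configuration):

from vllm import LLM, SamplingParams

llm = LLM(model="jinaai/ReaderLM-v2")  # fetches weights from Hugging Face
params = SamplingParams(temperature=0, repetition_penalty=1.08, max_tokens=1024)

html = "<html><body><h1>Hello, world!</h1></body></html>"
# clean_html / create_prompt are the helpers shown under "Local Usage" below.
prompt = create_prompt(clean_html(html), tokenizer=llm.get_tokenizer())
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)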

Local Usage

  1. Install dependencies:
    pip install transformers
    
  2. Load and run the model:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    device = "cuda" # or "cpu"
    tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
    model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
    
  3. (Optional) Pre-clean HTML: the provided Python functions remove scripts, styles, comments, and meta tags, and can replace base64 images and SVGs with placeholders. A condensed sketch:
    import re
    
    # A minimal sketch (the original provides fuller cleaning functions):
    # strip scripts, styles, meta/link tags, and comments; optionally
    # replace inline SVGs and base64 images with placeholders.
    def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False) -> str:
        flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
        for pattern in (r"<[ ]*script.*?/[ ]*script[ ]*>", r"<[ ]*style.*?/[ ]*style[ ]*>",
                        r"<[ ]*meta.*?>", r"<[ ]*!--.*?--[ ]*>", r"<[ ]*link.*?>"):
            html = re.sub(pattern, "", html, flags=flags)
        if clean_svg:
            html = re.sub(r"(<svg[^>]*>).*?(</svg>)", r"\1placeholder\2", html, flags=re.DOTALL)
        if clean_base64:
            html = re.sub(r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>', '<img src="#"/>', html)
        return html
    
  4. Create a prompt (a condensed sketch of the provided helper):
    def create_prompt(
        text: str, tokenizer=None, instruction: str = None, schema: str = None
    ) -> str:
        # Wrap the HTML (plus optional JSON schema) in an instruction, then apply the chat template.
        instruction = instruction or "Extract the main content from the given HTML and convert it to Markdown format."
        prompt = f"{instruction}\n```html\n{text}\n```"
        if schema:
            prompt += f"\nThe JSON schema is as follows:\n```json\n{schema}\n```"
        messages = [{"role": "user", "content": prompt}]
        return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
  5. HTML to Markdown Example:
    html = "<html><body><h1>Hello, world!</h1></body></html>"
    html = clean_html(html)
    input_prompt = create_prompt(html, tokenizer=tokenizer)
    inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08
    )
    print(tokenizer.decode(outputs[0]))  # prompt + completion; see the note after this list
    
  6. HTML to JSON Example:
    schema = """
    {
      "type": "object",
      "properties": {
        "title": {
          "type": "string"
        },
        ... (full schema in original text)
      }
    }
    """
    html = clean_html(html)
    input_prompt = create_prompt(html, tokenizer=tokenizer, schema=schema)
    inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08
    )
    print(tokenizer.decode(outputs[0]))
    
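In both examples above, tokenizer.decode(outputs[0]) returns the prompt and the completion together. A small sketch for printing only the generated portion:

response = tokenizer.decode(
    outputs[0][inputs.shape[1]:],  # skip the prompt tokens
    skip_special_tokens=True,
)
print(response)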

Key Features

  • Converts raw HTML to Markdown or JSON.
  • Supports 29 languages.
  • Handles up to 512K tokens (combined input and output).
  • Generates complex Markdown elements (code fences, nested lists, tables, LaTeX).
  • Direct JSON output using predefined schemas.
  • Enhanced stability for long sequence generation.
  • Outperforms larger models in HTML-to-Markdown (ROUGE-L: 0.84; Levenshtein Distance: 0.22, lower is better; Jaro-Winkler Similarity: 0.82).
  • Competitive performance in HTML-to-JSON (F1 Score: 0.81, Precision: 0.82, Recall: 0.81, Pass-Rate: 0.98).
  • Strong qualitative evaluation in Content Integrity (39/50), Structural Accuracy (35/50), and Format Compliance (36/50).

Target Users

  • Developers
  • Data scientists
  • Researchers
  • Individuals or organizations needing to parse and extract structured or Markdown-formatted content from HTML.

Application Scenarios

  • Content Extraction: Extracting main content from web pages for summarization, analysis, or archiving.
  • Data Structuring: Converting unstructured HTML data into structured JSON format for database ingestion or API consumption.
  • Web Scraping: Improving the efficiency and accuracy of data collection from websites.
  • Knowledge Base Creation: Transforming diverse web content into consistent Markdown for knowledge management systems.
  • Text Processing Pipelines: Acting as a pre-processing step for large language models (LLMs) by converting HTML into an LLM-friendly format.
