ReaderLM-v2: The Next Evolution in HTML-to-Text Conversion
June 04, 2025
ReaderLM-v2 Project Summary
Project Description
ReaderLM-v2 is a 1.5B parameter language model developed by Jina AI. It specializes in converting raw HTML into well-formatted Markdown or JSON. The model offers superior accuracy, improved handling of longer contexts (up to 512K tokens combined input/output), and comprehensive multilingual support (29 languages). It enhances content extraction, HTML parsing, and transformation tasks.
What's New in ReaderLM-v2
- Better Markdown Generation: Improved generation of complex elements like code fences, nested lists, tables, and LaTeX equations.
- JSON Output: Direct HTML-to-JSON generation using predefined schemas.
- Longer Context Handling: Supports up to 512K tokens with better performance on long content.
- Multilingual Support: Expanded to 29 languages.
- Enhanced Stability: Mitigates degeneration (e.g., repetition loops) on long sequences via a contrastive loss used during training.
Model Overview
- Model Type: Autoregressive, decoder-only transformer
- Parameter Count: 1.54 Billion
- Context Window: Up to 512K tokens (combined input and output)
- Supported Languages: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 languages in total).
Usage Instructions
Via Reader API
ReaderLM-v2 is integrated into the Reader API. To use it, specify `x-engine: readerlm-v2` in the request headers and enable streaming with `Accept: text/event-stream`:

```bash
curl https://r.jina.ai/https://news.ycombinator.com/ \
  -H 'x-engine: readerlm-v2' \
  -H 'Accept: text/event-stream'
```
On Google Colab
A Google Colab notebook is available to demonstrate HTML-to-Markdown conversion, JSON extraction, and instruction following. It is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton`.
Local Usage
- Install dependencies:

```bash
pip install transformers
```

- Load and run the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # or "cpu"
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
```
- (Optional) Pre-clean HTML: Provided Python functions can remove scripts, styles, comments, and meta tags, and handle base64-encoded images and SVGs.

```python
import re

# ... (cleaning functions provided in the original text)

def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
    # ... (implementation provided in the original text)
    pass
```
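The original cleaning functions are not reproduced here. As a rough sketch of the idea (the regex patterns below are assumptions of mine, not Jina AI's exact implementation), pre-cleaning could look like:

```python
import re

# Assumed patterns -- approximations, not Jina AI's exact implementation.
SCRIPT_PATTERN = r"<script[^>]*>.*?</script>"
STYLE_PATTERN = r"<style[^>]*>.*?</style>"
META_PATTERN = r"<meta[^>]*/?>"
COMMENT_PATTERN = r"<!--.*?-->"
SVG_PATTERN = r"<svg[^>]*>.*?</svg>"
BASE64_IMG_PATTERN = r'<img[^>]*src="data:image/[^"]*"[^>]*>'

def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False) -> str:
    """Strip noisy markup that wastes context tokens before prompting the model."""
    flags = re.IGNORECASE | re.DOTALL
    for pattern in (SCRIPT_PATTERN, STYLE_PATTERN, META_PATTERN, COMMENT_PATTERN):
        html = re.sub(pattern, "", html, flags=flags)
    if clean_svg:
        # Keep an empty tag as a placeholder for the removed vector graphic.
        html = re.sub(SVG_PATTERN, "<svg></svg>", html, flags=flags)
    if clean_base64:
        # Replace inline base64 image data with a stub src.
        html = re.sub(BASE64_IMG_PATTERN, '<img src="#"/>', html, flags=flags)
    return html
```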
- Create a prompt:

```python
def create_prompt(
    text: str, tokenizer=None, instruction: str = None, schema: str = None
) -> str:
    # ... (implementation provided in the original text)
    pass
```
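The implementation is elided above; a minimal sketch of how such a prompt builder might work, assuming the model uses a standard chat template (the default instruction wording here is my assumption, not the source's):

```python
def create_prompt(text, tokenizer=None, instruction=None, schema=None) -> str:
    """Wrap cleaned HTML (and an optional JSON schema) in a chat-style prompt."""
    if schema:
        # JSON-extraction mode: embed the schema after the HTML.
        instruction = instruction or (
            "Extract the specified information from the HTML and present it "
            "in a structured JSON format."
        )
        prompt = (
            f"{instruction}\n```html\n{text}\n```\n"
            f"The JSON schema is as follows:\n```json\n{schema}\n```"
        )
    else:
        # Default mode: HTML-to-Markdown conversion.
        instruction = instruction or (
            "Extract the main content from the given HTML and convert it "
            "to Markdown format."
        )
        prompt = f"{instruction}\n```html\n{text}\n```"
    messages = [{"role": "user", "content": prompt}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```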
- HTML to Markdown Example:

```python
html = "<html><body><h1>Hello, world!</h1></body></html>"
html = clean_html(html)
input_prompt = create_prompt(html, tokenizer=tokenizer)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(
    inputs,
    max_new_tokens=1024,
    temperature=0,
    do_sample=False,
    repetition_penalty=1.08,
)
print(tokenizer.decode(outputs[0]))
```
- HTML to JSON Example:

```python
schema = """
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    ... (full schema in original text)
  }
}
"""
html = clean_html(html)
input_prompt = create_prompt(html, tokenizer=tokenizer, schema=schema)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(
    inputs,
    max_new_tokens=1024,
    temperature=0,
    do_sample=False,
    repetition_penalty=1.08,
)
print(tokenizer.decode(outputs[0]))
```
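Note that decoding `outputs[0]` echoes the prompt along with the generated text. A small helper (the function name is mine, not from the source) to keep only the newly generated tokens:

```python
def strip_prompt(output_ids, prompt_len: int):
    """Drop the echoed prompt tokens so only the model's answer is decoded.

    output_ids: the full generated sequence (prompt + continuation).
    prompt_len: number of tokens in the input prompt, e.g. inputs.shape[1].
    """
    return output_ids[prompt_len:]

# Usage with the examples above (sketch):
#   generated = strip_prompt(outputs[0], inputs.shape[1])
#   print(tokenizer.decode(generated, skip_special_tokens=True))
```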
Key Features
- Converts raw HTML to Markdown or JSON.
- Supports 29 languages.
- Handles up to 512K tokens (combined input and output).
- Generates complex Markdown elements (code fences, nested lists, tables, LaTeX).
- Direct JSON output using predefined schemas.
- Enhanced stability for long sequence generation.
- Outperforms larger models in HTML-to-Markdown (ROUGE-L: 0.84; Levenshtein Distance: 0.22, lower is better; Jaro-Winkler Similarity: 0.82).
- Competitive performance in HTML-to-JSON (F1 Score: 0.81, Precision: 0.82, Recall: 0.81, Pass-Rate: 0.98).
- Strong qualitative evaluation in Content Integrity (39/50), Structural Accuracy (35/50), and Format Compliance (36/50).
Target Users
- Developers
- Data scientists
- Researchers
- Individuals or organizations needing to parse and extract structured or Markdown-formatted content from HTML.
Project Links
- Hugging Face Model Card: https://huggingface.co/jinaai/ReaderLM-v2
- Jina AI Blog: https://jina.ai/news
- Reader API: https://r.jina.ai/
- Google Colab Notebook: Linked from the Hugging Face model card for a hands-on experience.
Application Scenarios
- Content Extraction: Extracting main content from web pages for summarization, analysis, or archiving.
- Data Structuring: Converting unstructured HTML data into structured JSON format for database ingestion or API consumption.
- Web Scraping: Improving the efficiency and accuracy of data collection from websites.
- Knowledge Base Creation: Transforming diverse web content into consistent Markdown for knowledge management systems.
- Text Processing Pipelines: Acting as a pre-processing step for large language models (LLMs) by converting HTML into an LLM-friendly format.
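As a sketch of the pipeline scenario above, a small helper (function names and error handling are my own) that fetches LLM-ready output through the Reader API, using the headers described under Usage Instructions:

```python
from urllib.request import Request, urlopen

READER_ENDPOINT = "https://r.jina.ai/"

def build_reader_request(target_url: str, engine: str = "readerlm-v2") -> Request:
    """Build a Reader API request that routes conversion through ReaderLM-v2."""
    return Request(
        READER_ENDPOINT + target_url,
        headers={"x-engine": engine, "Accept": "text/event-stream"},
    )

def fetch_markdown(target_url: str) -> str:
    """Fetch the converted page (network call; sketch, not production code)."""
    with urlopen(build_reader_request(target_url)) as resp:
        return resp.read().decode("utf-8")
```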