ReaderLM-v2: The Next Evolution in HTML-to-Text Conversion

ReaderLM-v2 Project Summary

Project Description

ReaderLM-v2 is a 1.5B-parameter language model developed by Jina AI that specializes in converting raw HTML into well-formatted Markdown or JSON. Compared with its predecessor, it offers superior accuracy, improved handling of longer contexts (up to 512K tokens of combined input and output), and comprehensive multilingual support (29 languages), strengthening content extraction, HTML parsing, and transformation tasks.

What's New in ReaderLM-v2

  • Better Markdown Generation: Improved generation of complex elements like code fences, nested lists, tables, and LaTeX equations.
  • JSON Output: Direct HTML-to-JSON generation using predefined schemas.
  • Longer Context Handling: Supports up to 512K tokens with better performance on long content.
  • Multilingual Support: Expanded to 29 languages.
  • Enhanced Stability: Contrastive loss during training alleviates degeneration (e.g., repetition and looping) in long-sequence generation.

Model Overview

  • Model Type: Autoregressive, decoder-only transformer
  • Parameter Count: 1.54 Billion
  • Context Window: Up to 512K tokens (combined input and output)
  • Supported Languages: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 in total).

Usage Instructions

Via Reader API

ReaderLM-v2 is integrated into the Reader API. To use it, specify x-engine: readerlm-v2 in the request headers and set Accept: text/event-stream.

curl https://r.jina.ai/https://news.ycombinator.com/ -H 'x-engine: readerlm-v2' -H 'Accept: text/event-stream'
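The same request from Python, as a minimal sketch using the requests library (endpoint and headers exactly as in the curl command above):

import requests

# Stream the converted page from the Reader API as server-sent events.
response = requests.get(
    "https://r.jina.ai/https://news.ycombinator.com/",
    headers={"x-engine": "readerlm-v2", "Accept": "text/event-stream"},
    stream=True,
)
for line in response.iter_lines(decode_unicode=True):
    if line:
        print(line)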

On Google Colab

A Google Colab notebook is available to demonstrate HTML-to-Markdown conversion, JSON extraction, and instruction-following. It is optimized for Colab's free T4 GPU tier, requiring vllm and triton.
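The notebook itself is the authoritative example, but a minimal vLLM sketch of the same flow looks roughly like this (the sampling settings are assumptions borrowed from the Hugging Face example below, not the notebook's exact configuration):

from vllm import LLM, SamplingParams

llm = LLM(model="jinaai/ReaderLM-v2")  # fetches weights from Hugging Face
params = SamplingParams(temperature=0, repetition_penalty=1.08, max_tokens=1024)

html = "<html><body><h1>Hello, world!</h1></body></html>"
# clean_html / create_prompt are the helpers shown under "Local Usage" below.
prompt = create_prompt(clean_html(html), tokenizer=llm.get_tokenizer())
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)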

Local Usage

  1. Install dependencies:
    pip install transformers
    
  2. Load and run the model:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    device = "cuda" # or "cpu"
    tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
    model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
    
  3. (Optional) Pre-clean HTML: the provided Python functions remove scripts, styles, comments, and meta tags, and can replace base64 images and SVGs with placeholders. A condensed sketch:
    import re
    
    # A minimal sketch (the original provides fuller cleaning functions):
    # strip scripts, styles, meta/link tags, and comments; optionally
    # replace inline SVGs and base64 images with placeholders.
    def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False) -> str:
        flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
        for pattern in (r"<[ ]*script.*?/[ ]*script[ ]*>", r"<[ ]*style.*?/[ ]*style[ ]*>",
                        r"<[ ]*meta.*?>", r"<[ ]*!--.*?--[ ]*>", r"<[ ]*link.*?>"):
            html = re.sub(pattern, "", html, flags=flags)
        if clean_svg:
            html = re.sub(r"(<svg[^>]*>).*?(</svg>)", r"\1placeholder\2", html, flags=re.DOTALL)
        if clean_base64:
            html = re.sub(r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>', '<img src="#"/>', html)
        return html
    
  4. Create a prompt (a condensed sketch of the provided helper):
    def create_prompt(
        text: str, tokenizer=None, instruction: str = None, schema: str = None
    ) -> str:
        # Wrap the HTML (plus optional JSON schema) in an instruction, then apply the chat template.
        instruction = instruction or "Extract the main content from the given HTML and convert it to Markdown format."
        prompt = f"{instruction}\n```html\n{text}\n```"
        if schema:
            prompt += f"\nThe JSON schema is as follows:\n```json\n{schema}\n```"
        messages = [{"role": "user", "content": prompt}]
        return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
  5. HTML to Markdown Example:
    html = "<html><body><h1>Hello, world!</h1></body></html>"
    html = clean_html(html)
    input_prompt = create_prompt(html, tokenizer=tokenizer)
    inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08
    )
    print(tokenizer.decode(outputs[0]))  # prompt + completion; see the note after this list
    
  6. HTML to JSON Example:
    schema = """
    {
      "type": "object",
      "properties": {
        "title": {
          "type": "string"
        },
        ... (full schema in original text)
      }
    }
    """
    html = clean_html(html)
    input_prompt = create_prompt(html, tokenizer=tokenizer, schema=schema)
    inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08
    )
    print(tokenizer.decode(outputs[0]))
    
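In both examples above, tokenizer.decode(outputs[0]) returns the prompt and the completion together. A small sketch for printing only the generated portion:

response = tokenizer.decode(
    outputs[0][inputs.shape[1]:],  # skip the prompt tokens
    skip_special_tokens=True,
)
print(response)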

Key Features

  • Converts raw HTML to Markdown or JSON.
  • Supports 29 languages.
  • Handles up to 512K tokens (combined input and output).
  • Generates complex Markdown elements (code fences, nested lists, tables, LaTeX).
  • Direct JSON output using predefined schemas.
  • Enhanced stability for long sequence generation.
  • Outperforms larger models in HTML-to-Markdown (ROUGE-L: 0.84; Levenshtein Distance: 0.22, lower is better; Jaro-Winkler Similarity: 0.82).
  • Competitive performance in HTML-to-JSON (F1 Score: 0.81, Precision: 0.82, Recall: 0.81, Pass-Rate: 0.98).
  • Strong qualitative evaluation in Content Integrity (39/50), Structural Accuracy (35/50), and Format Compliance (36/50).

Target Users

  • Developers
  • Data scientists
  • Researchers
  • Individuals or organizations needing to parse and extract structured or Markdown-formatted content from HTML.

Application Scenarios

  • Content Extraction: Extracting main content from web pages for summarization, analysis, or archiving.
  • Data Structuring: Converting unstructured HTML data into structured JSON format for database ingestion or API consumption.
  • Web Scraping: Improving the efficiency and accuracy of data collection from websites.
  • Knowledge Base Creation: Transforming diverse web content into consistent Markdown for knowledge management systems.
  • Text Processing Pipelines: Acting as a pre-processing step for large language models (LLMs) by converting HTML into an LLM-friendly format.
