LLM Scraper: Turn Webpages Into Structured Data

Unleash the Power of LLMs for Web Scraping with LLM Scraper

Extracting structured information from webpages is a common challenge in web development and data analysis. LLM Scraper is an open-source project, written in TypeScript and built on the Playwright framework, that uses Large Language Models (LLMs) to convert a webpage into structured, usable data.

What is LLM Scraper?

LLM Scraper is a versatile TypeScript library designed to simplify the process of extracting data from web content. It leverages LLMs, through their function-calling capabilities, to interpret and structure information based on your defined schemas. Whether you're working with pre-processed HTML, raw HTML, Markdown, or even visual data from screenshots via multi-modal LLMs, LLM Scraper has you covered.

Key Features and Benefits

LLM Scraper's features cover a wide range of use cases:

  • Broad LLM Support: Integrates with popular LLM providers including GPT (OpenAI), Sonnet (Anthropic), Gemini (Google), Llama (Meta), and Qwen.
  • Flexible Schema Definition: Define your data structures using either Zod or JSON Schema, ensuring robust type-safety in your extracted data.
  • Playwright Foundation: Built on Playwright, a powerful tool for end-to-end testing and automation, ensuring reliable browser interactions.
  • Multiple Formatting Modes: Supports html, raw_html, markdown, text (via Readability.js), and image (for multi-modal LLMs) formats for versatile data input.
  • Streaming Objects: Retrieve data incrementally as it's processed, which is beneficial for large datasets or real-time applications.
  • Code Generation: A unique feature that allows you to generate reusable Playwright scripts based on your defined schemas, streamlining repetitive tasks.

Getting Started with LLM Scraper

Getting started with LLM Scraper takes four steps:

  1. Install Dependencies:

    npm i zod playwright llm-scraper
    

  2. Initialize Your LLM: Choose your preferred LLM provider and set it up. Examples are provided for OpenAI, Anthropic, Google, Groq, and Ollama.

    • OpenAI Example (the provider package is installed separately with npm i @ai-sdk/openai):
      import { openai } from '@ai-sdk/openai'
      const llm = openai.chat('gpt-4o')

  3. Create an LLMScraper Instance: Instantiate the scraper with your chosen LLM.

    import LLMScraper from 'llm-scraper'
    const scraper = new LLMScraper(llm)
    

  4. Run the Scraper: Define your desired schema using Zod and then execute the scraper on a Playwright page.

    // Top-level await requires an ES module context
    import { chromium } from 'playwright'
    import { z } from 'zod'
    import { openai } from '@ai-sdk/openai'
    import LLMScraper from 'llm-scraper'
    
    // Launch a browser instance
    const browser = await chromium.launch()
    
    // Initialize the LLM and wrap it in a scraper
    const llm = openai.chat('gpt-4o')
    const scraper = new LLMScraper(llm)
    
    // Open a new page and navigate to the target site
    const page = await browser.newPage()
    await page.goto('https://news.ycombinator.com')
    
    // Describe the data to extract: exactly five stories
    const schema = z.object({
      top: z.array(z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })).length(5).describe('Top 5 stories on Hacker News'),
    })
    
    // Run the scraper on the page using the pre-processed HTML format
    const { data } = await scraper.run(page, schema, { format: 'html' })
    console.log(data.top)
    
    // Release browser resources
    await page.close()
    await browser.close()
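
Schemas can be written in JSON Schema as well as Zod. As a rough illustration, the Hacker News shape above might look like this as a plain JSON Schema object. The name topStoriesJsonSchema is made up for this sketch, and how such an object is handed to the scraper is not covered here, so check the project's documentation before relying on it.

```typescript
// Hypothetical JSON Schema counterpart of the Zod schema above.
// The variable name is illustrative; it is not part of LLM Scraper's API.
const topStoriesJsonSchema = {
  type: 'object',
  properties: {
    top: {
      type: 'array',
      description: 'Top 5 stories on Hacker News',
      // z.array(...).length(5) corresponds to a fixed array size
      minItems: 5,
      maxItems: 5,
      items: {
        type: 'object',
        properties: {
          title: { type: 'string' },
          points: { type: 'number' },
          by: { type: 'string' },
          commentsURL: { type: 'string' },
        },
        // Zod object fields are required unless marked optional
        required: ['title', 'points', 'by', 'commentsURL'],
      },
    },
  },
  required: ['top'],
}
```

Zod has the advantage of giving you inferred TypeScript types for the result, while a JSON Schema object is easier to store or share outside of TypeScript code.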
    

To process results as they arrive, use the scraper's stream function, which yields data incrementally instead of returning it all at once.
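
The sketch below shows one way to consume incremental results. It does not call LLM Scraper itself: fakePartialStream is a hypothetical stand-in that assumes the real stream yields progressively more complete partial objects, and collectSnapshots and completeStories are helper names invented for this example.

```typescript
// Shape of one story, mirroring the Zod schema used earlier.
interface Story {
  title: string
  points: number
  by: string
  commentsURL: string
}

// A partial snapshot: fields may be missing while the model is still generating.
interface PartialResult {
  top?: Partial<Story>[]
}

// Hypothetical stand-in for the library's stream (an assumption, not its real API).
async function* fakePartialStream(): AsyncGenerator<PartialResult> {
  const first: Story = { title: 'First story', points: 120, by: 'alice', commentsURL: 'item?id=1' }
  const second: Story = { title: 'Second story', points: 45, by: 'bob', commentsURL: 'item?id=2' }
  yield { top: [{ title: first.title }] } // only the title has arrived
  yield { top: [first] }                  // first story is complete
  yield { top: [first, second] }          // second story is complete
}

// Collect every snapshot the stream yields, in arrival order.
async function collectSnapshots(stream: AsyncIterable<PartialResult>): Promise<PartialResult[]> {
  const snapshots: PartialResult[] = []
  for await (const partial of stream) {
    snapshots.push(partial)
  }
  return snapshots
}

// Keep only the stories whose four fields have all arrived.
function completeStories(result: PartialResult): Story[] {
  return (result.top ?? []).filter(
    (s): s is Story =>
      typeof s.title === 'string' &&
      typeof s.points === 'number' &&
      typeof s.by === 'string' &&
      typeof s.commentsURL === 'string',
  )
}
```

Filtering for complete stories before acting on them avoids displaying or storing half-filled records while the model is still generating the rest.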

Contributing to LLM Scraper

LLM Scraper is a community-driven project. The developers encourage contributions, bug reports, and feature requests through GitHub issues and pull requests, making it a truly collaborative open-source endeavor.

With its sophisticated capabilities and user-friendly design, LLM Scraper is an invaluable tool for developers, researchers, and anyone looking to efficiently extract valuable structured data from the web using the power of AI.
