LangExtract: LLM Text Structuring Made Easy

LangExtract: Streamlining Text Structuring with LLMs

LangExtract is an open-source Python library for extracting structured information from unstructured text using Large Language Models (LLMs). It processes diverse text, from clinical notes and reports to literary works, and organizes key details into a consistent, usable schema.

Why Choose LangExtract?

LangExtract stands out due to its unique set of features:

  • Precise Source Grounding: Every piece of extracted information is meticulously mapped back to its exact location in the original text. This allows for easy verification and visual highlighting within the source document, ensuring traceability.
  • Reliable Structured Outputs: By utilizing few-shot examples and controlled generation in supported models like Google Gemini, LangExtract enforces a consistent output schema, leading to robust and predictable structured results.
  • Optimized for Long Documents: The library handles large texts by splitting them into chunks, processing the chunks in parallel, and running multiple extraction passes, which improves recall.
  • Interactive Visualization: LangExtract generates self-contained, interactive HTML files, making it simple to visualize and review thousands of extracted entities within their original textual context.
  • Flexible LLM Support: Whether you prefer cloud-based LLMs like Gemini or local open-source models via Ollama, LangExtract offers broad compatibility and can be extended to other third-party APIs.
  • Adaptable to Any Domain: Define extraction tasks for any field using just a few examples, eliminating the need for model fine-tuning.
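
To make the idea of source grounding concrete, here is a minimal sketch of mapping extracted strings back to character offsets in the source text. It is an illustration only and not LangExtract's internal alignment logic; GroundedSpan and ground_extractions are hypothetical names, and the exact-match search is an assumption (a real aligner would also need to handle near-matches).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GroundedSpan:
    text: str
    start: Optional[int]  # inclusive character offset, None if not found
    end: Optional[int]    # exclusive character offset, None if not found


def ground_extractions(source: str, extracted: List[str]) -> List[GroundedSpan]:
    """Locate each extracted string in the source, scanning left to right."""
    spans = []
    cursor = 0
    for text in extracted:
        pos = source.find(text, cursor)
        if pos == -1:
            # Fall back to a search from the start for out-of-order items.
            pos = source.find(text)
        if pos == -1:
            # Not present verbatim: keep the item but mark it ungrounded.
            spans.append(GroundedSpan(text, None, None))
            continue
        spans.append(GroundedSpan(text, pos, pos + len(text)))
        cursor = pos + len(text)
    return spans


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
spans = ground_extractions(source, ["Lady Juliet", "aching", "Romeo", "Mercutio"])
for span in spans:
    print(span)
```

A span whose offsets are None was not found verbatim in the source, which is a useful signal that the model paraphrased or invented the text.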

Getting Started with LangExtract

Installation is straightforward via pip:

pip install langextract

Cloud-based models generally require an API key, which LangExtract can read from an environment variable or a .env file so the key stays out of your code. Supported options include Gemini models via AI Studio or Vertex AI, and OpenAI models.
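
As a minimal sketch of the environment-variable route, the helper below reads the key and fails early with a clear message if it is missing. The variable name LANGEXTRACT_API_KEY follows the project's README at the time of writing and should be treated as an assumption to verify; resolve_api_key is a hypothetical helper, not part of the library.

```python
import os


def resolve_api_key(env=None, var_name="LANGEXTRACT_API_KEY"):
    """Return the API key from the environment, or raise if it is unset.

    var_name is an assumption based on the project's README; adjust it
    if your setup uses a different variable.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it or add it to a .env file."
        )
    return key


# Demonstration with a fake environment so no real key is needed.
print(resolve_api_key({"LANGEXTRACT_API_KEY": "demo-key"}))
```

Failing at startup is usually easier to debug than an authentication error raised in the middle of an extraction run.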

Basic usage involves defining your extraction task with a prompt and providing a few guiding examples:

import langextract as lx
import textwrap

# Define the extraction task
prompt = textwrap.dedent("""
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context."""
)

# Provide guiding examples
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"})
        ]
    )
]

# Input text
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

# Run the extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

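# Save results and visualize
# output_dir="." writes the JSONL to the current directory so lx.visualize can find it
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    # In notebooks, visualize may return a display object that exposes the HTML as .data
    f.write(html_content.data if hasattr(html_content, "data") else html_content)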

This process generates an interactive HTML file, allowing for easy review and analysis of extracted entities.

Advanced Capabilities

LangExtract excels with longer documents, supporting direct processing from URLs (e.g., Project Gutenberg) and offering parameters like extraction_passes and max_workers to optimize performance and recall. The library also showcases specialized applications like Medication Extraction and RadExtract, a demo for structuring radiology reports available on Hugging Face Spaces.
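
The long-document strategy can be sketched in plain Python as follows. This is a conceptual illustration under stated assumptions, not LangExtract's implementation: chunk_text, stub_extract, and extract_long are hypothetical names, the regex stub stands in for an LLM call, and the passes and max_workers arguments only mirror the role of extraction_passes and max_workers.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars):
    """Split text into contiguous (offset, chunk) pairs, breaking at spaces."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so words are not cut in half.
            space = text.rfind(" ", start, end)
            if space > start:
                end = space + 1
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def stub_extract(offset, chunk):
    """Stand-in for an LLM call: returns capitalized words with global offsets."""
    return {
        (offset + m.start(), offset + m.end(), m.group())
        for m in re.finditer(r"\b[A-Z][a-z]+\b", chunk)
    }


def extract_long(text, max_chars=40, passes=2, max_workers=4):
    """Run the extractor over all chunks in parallel, merging results per pass."""
    chunks = chunk_text(text, max_chars)
    merged = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(passes):
            for found in pool.map(lambda c: stub_extract(*c), chunks):
                merged |= found  # set union drops duplicates across passes
    return sorted(merged)


text = "Romeo met Juliet at the feast. Later Tybalt challenged Romeo in the square."
results = extract_long(text)
for start, end, word in results:
    print(start, end, word)
```

With a deterministic stub, extra passes add nothing new. With a real LLM, each pass can surface entities that an earlier pass missed, which is where the recall gain comes from, at the cost of additional model calls.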

Contribute to the project via its GitHub repository, and explore the detailed examples for advanced usage and insights into its powerful capabilities for natural language processing and data structuring tasks.
