Building a GDPR-compliant, agentic extraction pipeline using Pydantic AI and LangGraph for high-precision medical data processing.

Introduction

European healthcare providers manage a vast volume of unstructured data, ranging from handwritten prescriptions and laboratory reports to complex discharge summaries. The technical challenge is twofold: first, the data is highly sensitive and subject to strict GDPR Article 9 requirements; second, the documents often lack a standardized format, varying significantly between different EU member states and regional health authorities.

Traditional Optical Character Recognition (OCR) systems often fail in this context because they lack semantic understanding. A traditional system might extract text accurately but fail to distinguish between a patient's current medication and a discontinued one. To solve this, we move beyond simple OCR to Intelligent Document Processing (IDP).

This tutorial demonstrates how to build an agentic IDP pipeline. We utilize Pydantic AI for structured data extraction with rigorous validation and LangGraph to manage the workflow state, including a human-in-the-loop (HITL) mechanism for low-confidence extractions. This architecture ensures that the system remains robust, verifiable, and compliant with the high-stakes requirements of the healthcare sector.

Objectives

By the end of this tutorial, you will:

  1. Define complex medical schemas using Pydantic for type-safe data extraction.
  2. Implement an extraction agent using Pydantic AI that handles multi-modal inputs.
  3. Construct a stateful workflow using LangGraph to manage document processing stages.
  4. Integrate a human-in-the-loop validation step for data quality assurance.
  5. Deploy the system within a containerized environment suitable for private cloud hosting.

Prerequisites

To follow this tutorial, you require the following tools and environment:

  • Python 3.12 or higher: Python Downloads
  • Docker and Docker Compose: Docker Documentation
  • An LLM API Key (e.g., OpenAI GPT-4o or Google Gemini 1.5 Pro).
  • Basic familiarity with asynchronous Python (async/await).
  • Pydantic AI and LangGraph libraries installed via pip.

Implementation

Step 1: Environment Configuration

The first step involves setting up a clean development environment. We use a virtual environment to manage dependencies and ensure reproducibility.

$ mkdir healthcare-idp-automation
$ cd healthcare-idp-automation
$ python -m venv venv
$ source venv/bin/activate
$ pip install pydantic-ai langgraph fastapi uvicorn motor python-multipart

In a European healthcare context, data residency is critical. While this tutorial uses cloud-based LLMs for demonstration, the architecture is designed to be compatible with local deployments of models like Llama 3 or Mistral via vLLM or Ollama, ensuring data never leaves the sovereign boundary if required.

Step 2: Defining the Medical Schema

Precision in healthcare automation starts with the schema. We use Pydantic to define the structure of the data we expect to extract. This provides immediate validation; if the LLM attempts to return a malformed date or an invalid medication dosage, the Pydantic model will raise a validation error, which Pydantic AI can then use to self-correct.

from datetime import date
from typing import List, Optional
from pydantic import BaseModel, Field, validator

class Medication(BaseModel):
    name: str = Field(description="Generic or brand name of the medication")
    dosage: str = Field(description="The amount of medication (e.g., 500mg)")
    frequency: str = Field(description="How often the medication is taken (e.g., BID, daily)")
    route: str = Field(description="Administration route (e.g., oral, intravenous)")

class Diagnosis(BaseModel):
    icd10_code: str = Field(description="The ICD-10 code for the condition")
    description: str = Field(description="The textual description of the diagnosis")
    status: str = Field(description="Current status: active, resolved, or suspected")

class PatientRecord(BaseModel):
    patient_name: str
    date_of_birth: date
    document_date: date
    diagnoses: List[Diagnosis]
    medications: List[Medication]
    confidence_score: float = Field(
        description="The agent's self-assessed confidence score from 0.0 to 1.0",
        ge=0.0,
        le=1.0
    )

    @validator("document_date")
    def date_not_in_future(cls, v):
        if v > date.today():
            raise ValueError("Document date cannot be in the future")
        return v

The use of Field descriptions is not merely for documentation; Pydantic AI passes these descriptions to the LLM as part of the tool definition, significantly improving extraction accuracy by providing semantic context.

Step 3: Implementing the Extraction Agent with Pydantic AI

Pydantic AI simplifies the process of structured extraction by wrapping the LLM call in a way that enforces the schema. It handles the retry logic automatically if the model produces invalid JSON.

from pydantic_ai import Agent, RunContext
from pydantic_ai.models.openai import OpenAIModel
import os

# Define the model - in production, use environment variables for keys
model = OpenAIModel('gpt-4o', api_key=os.getenv("OPENAI_API_KEY"))

# Initialize the agent with the PatientRecord schema as the result type
extraction_agent = Agent(
    model,
    result_type=PatientRecord,
    system_prompt=(
        "You are a specialized medical coding assistant. "
        "Extract patient information from the provided document text. "
        "Ensure all ICD-10 codes are accurate. "
        "If information is missing, do not hallucinate; leave the field null. "
        "Assign a confidence score based on the clarity of the source text."
    )
)

async def extract_medical_data(text: str) -> PatientRecord:
    """
    Triggers the agent to process the unstructured text.
    """
    result = await extraction_agent.run(text)
    # The result.data is already a validated PatientRecord instance
    return result.data

The architectural choice here is to use Pydantic AI's result_type. This forces the LLM to use function calling (or tool calling) to return data, which is significantly more reliable than asking for raw JSON in a standard prompt.

Step 4: Orchestrating the Workflow with LangGraph

Healthcare workflows are rarely linear. A document might need OCR, then extraction, then a confidence check, and potentially a human review. LangGraph allows us to define this as a state machine.

We define a State object that tracks the document's progress through the pipeline.

from typing import TypedDict, Annotated, Union
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    raw_text: str
    extracted_data: Optional[PatientRecord]
    requires_review: bool
    review_complete: bool
    error: Optional[str]

def extraction_node(state: AgentState):
    """
    Node to handle the LLM extraction logic.
    """
    try:
        # In a real scenario, we would use await here
        # For the graph, we wrap the async call
        import asyncio
        data = asyncio.run(extract_medical_data(state['raw_text']))
        
        # Determine if human review is needed based on confidence
        requires_review = data.confidence_score < 0.85
        
        return {
            "extracted_data": data,
            "requires_review": requires_review,
            "error": None
        }
    except Exception as e:
        return {"error": str(e)}

def review_decision(state: AgentState):
    """
    Conditional edge logic to route to human review or end.
    """
    if state["error"]:
        return "error_handler"
    if state["requires_review"]:
        return "human_review"
    return END

# Initialize the graph
workflow = StateGraph(AgentState)

# Add nodes
workflow.add_node("extract", extraction_node)
workflow.add_node("human_review", lambda state: {"review_complete": True})
workflow.add_node("error_handler", lambda state: {"error": "Processing Failed"})

# Define edges
workflow.set_entry_point("extract")
workflow.add_conditional_edges(
    "extract",
    review_decision,
    {
        "human_review": "human_review",
        "error_handler": "error_handler",
        END: END
    }
)
workflow.add_edge("human_review", END)
workflow.add_edge("error_handler", END)

# Compile the graph
app = workflow.compile()

Why LangGraph over a simple loop?

In healthcare, auditability is paramount. LangGraph provides a persistent state at every step of the process. If a system fails midway through processing a 500-page medical record, the state can be recovered. Furthermore, LangGraph's support for "breakpoints" allows the execution to pause and wait for external input (the human-in-the-loop) before resuming, which is a requirement for clinical safety.

Step 5: Human-in-the-Loop (HITL) Integration

The HITL component is implemented by pausing the graph execution when requires_review is true. In a production FastAPI application, this would involve saving the state to a database (like MongoDB or PostgreSQL) and notifying a medical coder via a frontend dashboard.

| Feature | Traditional OCR | Agentic IDP (Azura Stack) | | : | : | : | | Extraction Method | Pattern matching / RegEx | Semantic LLM Reasoning | | Validation | Manual or basic type checks | Pydantic Schema Enforcement | | Error Handling | Silent failures common | Self-correcting via Agent retries | | Multi-lingual | Requires specific models | Native multi-lingual support | | Context Awareness | None | High (understands medical intent) | | Compliance | Hardcoded logic | Traceable, stateful workflows |

Step 6: Deployment and Scalability

For European enterprises, deployment usually happens on-premises or within a specific Azure/AWS region (e.g., germanywestcentral). Using Docker ensures that the entire stack—including the OCR engine (like Tesseract or Azure Read API), the FastAPI gateway, and the worker agents—is portable.

# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - DATABASE_URL=mongodb://mongo:27017
    depends_on:
      - mongo
  mongo:
    image: mongo:latest
    volumes:
      - mongo_data:/data/db
volumes:
  mongo_data:

Technical Considerations for European Healthcare

When implementing this system for EU-based entities, several technical adjustments are mandatory:

  1. PII Redaction: Before sending data to a cloud-based LLM, use a local Presidio or custom NER model to redact Personally Identifiable Information (PII) if the LLM provider is not covered by a Data Processing Agreement (DPA).
  2. DPIA Alignment: The stateful nature of LangGraph allows for detailed logging of how a decision was reached, which is a core requirement for a Data Protection Impact Assessment (DPIA).
  3. Terminology Mapping: Use Pydantic AI's RunContext to inject local medical terminologies (like German-specific ICD-10-GM