Building a GDPR-compliant, agentic extraction pipeline using Pydantic AI and LangGraph for high-precision medical data processing.
Introduction
European healthcare providers manage a vast volume of unstructured data, ranging from handwritten prescriptions and laboratory reports to complex discharge summaries. The technical challenge is twofold: first, the data is highly sensitive and subject to strict GDPR Article 9 requirements; second, the documents often lack a standardized format, varying significantly between different EU member states and regional health authorities.
Traditional Optical Character Recognition (OCR) systems often fail in this context because they lack semantic understanding. A traditional system might extract text accurately but fail to distinguish between a patient's current medication and a discontinued one. To solve this, we move beyond simple OCR to Intelligent Document Processing (IDP).
This tutorial demonstrates how to build an agentic IDP pipeline. We utilize Pydantic AI for structured data extraction with rigorous validation and LangGraph to manage the workflow state, including a human-in-the-loop (HITL) mechanism for low-confidence extractions. This architecture ensures that the system remains robust, verifiable, and compliant with the high-stakes requirements of the healthcare sector.
Objectives
By the end of this tutorial, you will:
- Define complex medical schemas using Pydantic for type-safe data extraction.
- Implement an extraction agent using Pydantic AI that handles multi-modal inputs.
- Construct a stateful workflow using LangGraph to manage document processing stages.
- Integrate a human-in-the-loop validation step for data quality assurance.
- Deploy the system within a containerized environment suitable for private cloud hosting.
Prerequisites
To follow this tutorial, you require the following tools and environment:
- Python 3.12 or higher (see Python Downloads)
- Docker and Docker Compose (see Docker Documentation)
- An LLM API Key (e.g., OpenAI GPT-4o or Google Gemini 1.5 Pro).
- Basic familiarity with asynchronous Python (async/await).
- Pydantic AI and LangGraph libraries installed via pip.
Implementation
Step 1: Environment Configuration
The first step involves setting up a clean development environment. We use a virtual environment to manage dependencies and ensure reproducibility.
$ mkdir healthcare-idp-automation
$ cd healthcare-idp-automation
$ python -m venv venv
$ source venv/bin/activate
$ pip install pydantic-ai langgraph fastapi uvicorn motor python-multipart
In a European healthcare context, data residency is critical. While this tutorial uses cloud-based LLMs for demonstration, the architecture is designed to be compatible with local deployments of models like Llama 3 or Mistral via vLLM or Ollama, ensuring data never leaves the sovereign boundary if required.
Step 2: Defining the Medical Schema
Precision in healthcare automation starts with the schema. We use Pydantic to define the structure of the data we expect to extract. This provides immediate validation; if the LLM attempts to return a malformed date or an invalid medication dosage, the Pydantic model will raise a validation error, which Pydantic AI can then use to self-correct.
from datetime import date
from typing import List, Optional

from pydantic import BaseModel, Field, field_validator


class Medication(BaseModel):
    name: str = Field(description="Generic or brand name of the medication")
    dosage: str = Field(description="The amount of medication (e.g., 500mg)")
    frequency: str = Field(description="How often the medication is taken (e.g., BID, daily)")
    route: str = Field(description="Administration route (e.g., oral, intravenous)")


class Diagnosis(BaseModel):
    icd10_code: str = Field(description="The ICD-10 code for the condition")
    description: str = Field(description="The textual description of the diagnosis")
    status: str = Field(description="Current status: active, resolved, or suspected")


class PatientRecord(BaseModel):
    patient_name: str
    date_of_birth: date
    document_date: date
    diagnoses: List[Diagnosis]
    medications: List[Medication]
    confidence_score: float = Field(
        description="The agent's self-assessed confidence score from 0.0 to 1.0",
        ge=0.0,
        le=1.0,
    )

    # Pydantic v2 validator syntax (Pydantic AI depends on Pydantic v2)
    @field_validator("document_date")
    @classmethod
    def date_not_in_future(cls, v: date) -> date:
        if v > date.today():
            raise ValueError("Document date cannot be in the future")
        return v
The use of Field descriptions is not merely for documentation; Pydantic AI passes these descriptions to the LLM as part of the tool definition, significantly improving extraction accuracy by providing semantic context.
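To see both effects concretely, the sketch below uses a reduced, hypothetical stand-in for PatientRecord (`DocumentStub` is not part of the tutorial's schema). It prints a field description as it appears in the generated JSON schema, which is the kind of metadata a tool definition can carry, and then shows that a future document date and an out-of-range confidence score are rejected together.

```python
from datetime import date, timedelta

from pydantic import BaseModel, Field, ValidationError, field_validator


class DocumentStub(BaseModel):
    """Reduced, hypothetical stand-in for PatientRecord."""

    dosage: str = Field(description="The amount of medication (e.g., 500mg)")
    document_date: date
    confidence_score: float = Field(ge=0.0, le=1.0)

    @field_validator("document_date")
    @classmethod
    def date_not_in_future(cls, v: date) -> date:
        if v > date.today():
            raise ValueError("Document date cannot be in the future")
        return v


# Field descriptions are part of the generated JSON schema.
schema = DocumentStub.model_json_schema()
print(schema["properties"]["dosage"]["description"])

# A future date and an out-of-range score are both reported in one error.
tomorrow = date.today() + timedelta(days=1)
try:
    DocumentStub(dosage="500mg", document_date=tomorrow, confidence_score=1.4)
except ValidationError as exc:
    failed_fields = sorted(err["loc"][0] for err in exc.errors())
    print(failed_fields)
```

Because Pydantic collects every failure in a single ValidationError, the feedback returned to the model can list all problems at once rather than one per retry.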
Step 3: Implementing the Extraction Agent with Pydantic AI
Pydantic AI simplifies structured extraction by wrapping the LLM call so that the response must conform to the schema. If the model's output fails validation, Pydantic AI sends the validation error back to the model and retries, up to a configurable retry limit.
from pydantic_ai import Agent, RunContext
from pydantic_ai.models.openai import OpenAIModel

# Define the model. OpenAIModel reads OPENAI_API_KEY from the environment,
# so the key never needs to appear in code.
model = OpenAIModel('gpt-4o')

# Initialize the agent with the PatientRecord schema as the result type.
# Note: newer Pydantic AI releases rename result_type to output_type and
# result.data to result.output; match the names to your installed version.
extraction_agent = Agent(
    model,
    result_type=PatientRecord,
    system_prompt=(
        "You are a specialized medical coding assistant. "
        "Extract patient information from the provided document text. "
        "Ensure all ICD-10 codes are accurate. "
        "If information is missing, do not hallucinate; leave the field null. "
        "Assign a confidence score based on the clarity of the source text."
    ),
)


async def extract_medical_data(text: str) -> PatientRecord:
    """Run the agent on the unstructured document text."""
    result = await extraction_agent.run(text)
    # result.data is already a validated PatientRecord instance
    return result.data
The architectural choice here is to use Pydantic AI's result_type. This forces the LLM to use function calling (or tool calling) to return data, which is significantly more reliable than asking for raw JSON in a standard prompt.
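The validate-then-retry idea can be illustrated without calling a real model. The following is a conceptual sketch only, not Pydantic AI's internal implementation: `extract_with_retries`, the `Score` model, and `fake_model` are hypothetical names, and the fake model simply returns a corrected answer once it sees validation feedback in its prompt.

```python
import json
from typing import Callable

from pydantic import BaseModel, Field, ValidationError


class Score(BaseModel):
    confidence_score: float = Field(ge=0.0, le=1.0)


def extract_with_retries(
    call_model: Callable[[str], str], prompt: str, max_retries: int = 2
) -> Score:
    """Ask for JSON; on a validation failure, retry with the errors appended."""
    message = prompt
    for _ in range(max_retries + 1):
        raw = call_model(message)
        try:
            return Score.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            # Feed the validation errors back so the model can correct itself.
            message = f"{prompt}\n\nYour previous answer was invalid:\n{exc}\nPlease correct it."
    raise last_error


def fake_model(message: str) -> str:
    """Hypothetical model: wrong at first, correct after seeing feedback."""
    if "invalid" in message:
        return json.dumps({"confidence_score": 0.9})
    return json.dumps({"confidence_score": 9.0})


result = extract_with_retries(fake_model, "Extract the confidence score.")
print(result.confidence_score)
```

The retry limit matters in a clinical setting: once it is exhausted, the error should surface to the workflow rather than be swallowed, so that the document can be routed to a human.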
Step 4: Orchestrating the Workflow with LangGraph
Healthcare workflows are rarely linear. A document might need OCR, then extraction, then a confidence check, and potentially a human review. LangGraph allows us to define this as a state machine.
We define a State object that tracks the document's progress through the pipeline.
from typing import Optional, TypedDict

from langgraph.graph import StateGraph, END


class AgentState(TypedDict):
    raw_text: str
    extracted_data: Optional[PatientRecord]
    requires_review: bool
    review_complete: bool
    error: Optional[str]


async def extraction_node(state: AgentState):
    """Node to handle the LLM extraction logic."""
    try:
        # LangGraph accepts async nodes; run the compiled graph with
        # `await app.ainvoke(...)` instead of nesting asyncio.run() here.
        data = await extract_medical_data(state["raw_text"])
        # Determine if human review is needed based on confidence
        requires_review = data.confidence_score < 0.85
        return {
            "extracted_data": data,
            "requires_review": requires_review,
            "error": None,
        }
    except Exception as e:
        return {"error": str(e)}


def review_decision(state: AgentState):
    """Conditional edge logic: route to error handling, human review, or end."""
    if state.get("error"):
        return "error_handler"
    if state.get("requires_review"):
        return "human_review"
    return END


# Initialize the graph
workflow = StateGraph(AgentState)

# Add nodes
workflow.add_node("extract", extraction_node)
# Placeholder: a real review node would wait for a medical coder's input
workflow.add_node("human_review", lambda state: {"review_complete": True})
# Keep the original error message so the failure remains auditable
workflow.add_node(
    "error_handler",
    lambda state: {"error": f"Processing failed: {state['error']}"},
)

# Define edges
workflow.set_entry_point("extract")
workflow.add_conditional_edges(
    "extract",
    review_decision,
    {
        "human_review": "human_review",
        "error_handler": "error_handler",
        END: END,
    },
)
workflow.add_edge("human_review", END)
workflow.add_edge("error_handler", END)

# Compile the graph
app = workflow.compile()
Why LangGraph over a simple loop?
In healthcare, auditability is paramount. When the graph is compiled with a checkpointer, LangGraph persists the state after every step, so if the system fails midway through processing a 500-page medical record, processing can resume from the last saved state. (The compile() call above omits a checkpointer for brevity.) Furthermore, LangGraph's support for "breakpoints" (the interrupt_before and interrupt_after compile options) allows the execution to pause and wait for external input (the human-in-the-loop) before resuming, which is a requirement for clinical safety.
Step 5: Human-in-the-Loop (HITL) Integration
The HITL component is implemented by pausing the graph execution when requires_review is true. In a production FastAPI application, this would involve saving the state to a database (like MongoDB or PostgreSQL) and notifying a medical coder via a frontend dashboard.
| Feature | Traditional OCR | Agentic IDP (Azura Stack) |
| :--- | :--- | :--- |
| Extraction Method | Pattern matching / RegEx | Semantic LLM Reasoning |
| Validation | Manual or basic type checks | Pydantic Schema Enforcement |
| Error Handling | Silent failures common | Self-correcting via Agent retries |
| Multi-lingual | Requires specific models | Native multi-lingual support |
| Context Awareness | None | High (understands medical intent) |
| Compliance | Hardcoded logic | Traceable, stateful workflows |
Step 6: Deployment and Scalability
For European enterprises, deployment usually happens on-premises or within a specific EU cloud region (e.g., Azure germanywestcentral or AWS eu-central-1). Using Docker ensures that the entire stack, including the OCR engine (like Tesseract or Azure Read API), the FastAPI gateway, and the worker agents, is portable.
# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - DATABASE_URL=mongodb://mongo:27017
    depends_on:
      - mongo
  mongo:
    image: mongo:latest
    volumes:
      - mongo_data:/data/db
volumes:
  mongo_data:
Technical Considerations for European Healthcare
When implementing this system for EU-based entities, several technical adjustments are mandatory:
- PII Redaction: Before sending data to a cloud-based LLM, use a local Presidio or custom NER model to redact Personally Identifiable Information (PII) if the LLM provider is not covered by a Data Processing Agreement (DPA).
- DPIA Alignment: The stateful nature of LangGraph allows for detailed logging of how a decision was reached, which is a core requirement for a Data Protection Impact Assessment (DPIA).
- Terminology Mapping: Use Pydantic AI's RunContext to inject local medical terminologies (like German-specific ICD-10-GM