This guide demonstrates how to build a production-grade KYC pipeline using LangGraph and Pydantic AI to automate document verification and data extraction.
Introduction
Know Your Customer (KYC) processes are a fundamental requirement for financial institutions, fintech startups, and any service provider operating under Anti-Money Laundering (AML) regulations. Traditionally, KYC involved manual review of identity documents (passports, national ID cards, and utility bills) to verify a user's identity. This manual approach does not scale, is prone to human error, and adds significant latency to user onboarding.
The emergence of Large Language Models (LLMs) with multimodal capabilities, such as GPT-4o and Gemini 1.5 Pro, has shifted the paradigm from simple Optical Character Recognition (OCR) to Intelligent Document Processing (IDP). Unlike traditional OCR, which merely converts images to text, IDP leverages LLMs to understand the context, structure, and validity of the data extracted.
This tutorial explores the implementation of an automated KYC verification system. We will utilize LangGraph for workflow orchestration and Pydantic AI for structured data extraction. This architecture ensures that the system is not only autonomous but also deterministic where necessary, allowing for human-in-the-loop (HITL) interventions when confidence scores fall below a defined threshold.
Objectives
By the end of this tutorial, you will:
- Design a multi-stage KYC workflow using LangGraph to manage state and transitions.
- Implement structured data extraction from identity documents using Pydantic AI.
- Integrate validation logic to cross-reference extracted data against Machine Readable Zone (MRZ) standards.
- Configure a human-in-the-loop mechanism for exception handling in high-stakes compliance environments.
- Deploy the solution using Docker for consistent environment management.
Prerequisites
To follow this tutorial, you require the following tools and accounts:
- Python 3.12+: A recent stable release is recommended for better type hinting and performance (see Python Downloads).
- Docker: For containerization and local deployment (see Docker Documentation).
- OpenAI API Key or Google AI Studio Key: To access GPT-4o or Gemini 1.5 Pro models.
- Poetry: For dependency management (see Poetry Installation).
Architectural Overview
A robust KYC system cannot rely on a single LLM prompt. It requires a stateful orchestration layer to handle various edge cases, such as blurry images, expired documents, or mismatched data.
We use LangGraph because it lets us define the KYC process as a directed graph in which each node represents a specific task (extraction, validation, risk scoring) and each edge encodes the control flow. The graph built in this tutorial is acyclic, but LangGraph also permits cycles, which is useful for retry loops. Pydantic AI is used within these nodes to enforce schema adherence, so that the LLM output conforms to the models our database expects.
Comparison of Extraction Technologies
| Feature | Traditional OCR (Tesseract) | Cloud OCR (AWS Textract) | Agentic AI (Pydantic AI + LLM) |
| :--- | :--- | :--- | :--- |
| Accuracy | Low (requires high contrast) | Moderate | High (context-aware) |
| Data Structuring | Manual Regex required | Key-Value pairs | Native Pydantic Objects |
| Validation | None | Basic | Logical (e.g., Checksum validation) |
| Handling Unstructured Data | Poor | Fair | Excellent |
| Cost per Document | Very Low | Low | Moderate |
Implementation Step-by-Step
Step 1: Project Initialization
Initialize a new Python project and install the necessary dependencies.
$ mkdir kyc-automation-ai
$ cd kyc-automation-ai
$ poetry init --no-interaction
$ poetry add langgraph pydantic-ai pydantic fastjsonschema python-dotenv pillow
$ poetry add --group dev pytest
Step 2: Defining the Data Schema
The first step in any IDP project is defining the target schema. We need to extract specific fields from an identity document. We use Pydantic to define these models, which provides built-in validation for data types.
from datetime import date
from typing import List, Optional

from pydantic import BaseModel, Field


class IdentityDocument(BaseModel):
    """Schema for extracted identity document data."""

    first_name: str = Field(description="The given names of the individual.")
    last_name: str = Field(description="The surname of the individual.")
    date_of_birth: date = Field(description="The individual's date of birth.")
    document_number: str = Field(description="The unique identifier of the document.")
    expiry_date: date = Field(description="The date the document expires.")
    issuing_country: str = Field(description="The ISO 3166-1 alpha-3 country code.")
    document_type: str = Field(description="Type of document: Passport, ID_Card, or Driver_License.")
    mrz_code: Optional[str] = Field(None, description="The Machine Readable Zone string if present.")


class VerificationResult(BaseModel):
    """Schema for the final verification status."""

    is_valid: bool
    confidence_score: float = Field(ge=0, le=1)
    flags: List[str] = Field(default_factory=list, description="List of potential issues found.")
    requires_manual_review: bool
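To see what this built-in validation buys us, the following minimal sketch feeds two hand-written payloads (stand-ins for LLM output, not real data) into a trimmed copy of the schema. The `DemoDocument` model, the `try_parse` helper, and both payloads are assumptions made for illustration only.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, ValidationError


class DemoDocument(BaseModel):
    """Trimmed, hypothetical copy of IdentityDocument for illustration."""

    last_name: str
    expiry_date: date
    mrz_code: Optional[str] = None


def try_parse(payload: dict) -> Optional[DemoDocument]:
    """Return a validated model, or None when the payload violates the schema."""
    try:
        return DemoDocument(**payload)
    except ValidationError:
        return None


# An ISO 8601 date string is coerced into a datetime.date object.
good = try_parse({"last_name": "ERIKSSON", "expiry_date": "2030-04-15"})

# A free-text date is rejected instead of being stored as a string.
bad = try_parse({"last_name": "ERIKSSON", "expiry_date": "next spring"})
```

Returning `None` keeps the sketch short; a real node would more likely record the validation error so that a reviewer can see which field failed.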
Step 3: Implementing the Extraction Node
We use Pydantic AI to interface with the LLM. Pydantic AI excels at "Structured Generation," which is critical for KYC to ensure that dates are formatted correctly and mandatory fields are not missing.
import mimetypes
from pathlib import Path

from dotenv import load_dotenv
from pydantic_ai import Agent, BinaryContent
from pydantic_ai.models.openai import OpenAIModel

load_dotenv()  # loads OPENAI_API_KEY from a local .env file

# Define the model and agent.
# Note: newer Pydantic AI releases rename `result_type` to `output_type`
# and `result.data` to `result.output`; adjust to your installed version.
model = OpenAIModel('gpt-4o')
extraction_agent = Agent(
    model,
    result_type=IdentityDocument,
    system_prompt=(
        "You are a specialized KYC extraction agent. "
        "Extract all relevant information from the provided identity document image. "
        "If a field is illegible, do not guess; leave it null if optional or flag it. "
        "Ensure dates are in ISO 8601 format."
    ),
)


async def extract_document_data(image_path: str) -> IdentityDocument:
    """Send the image to the LLM and return a structured IdentityDocument."""
    # Attach the image bytes themselves; a file path embedded in the prompt
    # text would never reach the model as an image.
    media_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
    image = BinaryContent(data=Path(image_path).read_bytes(), media_type=media_type)
    result = await extraction_agent.run(["Process this document.", image])
    return result.data
Step 4: Orchestrating the Workflow with LangGraph
The power of this system lies in the orchestration. We define a State object that tracks the progress of the KYC check and a StateGraph to manage the transitions.
from datetime import date
from typing import Optional, TypedDict

from langgraph.graph import StateGraph, END


class KYCState(TypedDict):
    """The state maintained throughout the KYC process."""

    image_path: str
    extracted_data: Optional[IdentityDocument]
    verification_result: Optional[VerificationResult]
    error: Optional[str]


async def extraction_node(state: KYCState):
    """Node for data extraction."""
    try:
        # extract_document_data is a coroutine, so the node must await it
        data = await extract_document_data(state['image_path'])
        return {"extracted_data": data}
    except Exception as e:
        return {"error": str(e)}


def validation_node(state: KYCState):
    """Node for business logic validation (e.g., checking expiry)."""
    data = state.get('extracted_data')
    flags = []
    if data is None:
        # Extraction failed upstream, so there is nothing to validate
        flags.append("EXTRACTION_FAILED")
    elif data.expiry_date < date.today():
        flags.append("DOCUMENT_EXPIRED")
    # Logic for MRZ checksum validation would go here
    is_valid = len(flags) == 0
    requires_review = len(flags) > 0
    verification = VerificationResult(
        is_valid=is_valid,
        # Placeholder scores; derive these from real signals in production
        confidence_score=0.95 if is_valid else 0.4,
        flags=flags,
        requires_manual_review=requires_review,
    )
    return {"verification_result": verification}


def should_continue(state: KYCState):
    """Conditional edge to determine if manual review is needed."""
    if state.get("error") or state["verification_result"].requires_manual_review:
        return "manual_review"
    return END


# Define the Graph
workflow = StateGraph(KYCState)
workflow.add_node("extract", extraction_node)
workflow.add_node("validate", validation_node)
workflow.add_node("manual_review", lambda state: {"error": "Pending manual review"})
workflow.set_entry_point("extract")
workflow.add_edge("extract", "validate")
workflow.add_conditional_edges("validate", should_continue)
workflow.add_edge("manual_review", END)
app = workflow.compile()
# extraction_node is async, so run the graph with `await app.ainvoke(...)`
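The graph above routes on the `requires_manual_review` flag alone. The confidence threshold mentioned in the introduction could be folded into that decision roughly as follows. This is a sketch, not part of the workflow above: the `REVIEW_THRESHOLD` value of 0.8, the helper names, and the `"approve"` step are all assumptions chosen for illustration.

```python
from typing import List

REVIEW_THRESHOLD = 0.8  # hypothetical cut-off; tune it to your risk appetite


def needs_manual_review(confidence_score: float, flags: List[str],
                        threshold: float = REVIEW_THRESHOLD) -> bool:
    """Escalate when any flag is raised or confidence falls below the threshold."""
    if not 0.0 <= confidence_score <= 1.0:
        raise ValueError("confidence_score must be within [0, 1]")
    return bool(flags) or confidence_score < threshold


def route(confidence_score: float, flags: List[str]) -> str:
    """Return the name of the next step, mirroring a conditional edge."""
    if needs_manual_review(confidence_score, flags):
        return "manual_review"
    return "approve"
```

Keeping the rule in a plain function makes the threshold easy to unit test and to change without touching the graph definition.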
Step 5: Handling Machine Readable Zones (MRZ)
For passports and many ID cards, the MRZ provides a deterministic way to verify the data extracted from the visual zone. A production implementation should include a validation step that compares the LLM's visual extraction with a parsed version of the MRZ string.
The MRZ contains check digits calculated with a repeating weighting (7, 3, 1), summed modulo 10. Recomputing these check digits in Python gives a deterministic signal that a document number or date was misread or hallucinated.
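def verify_mrz_checksum(mrz_string: str) -> bool:
    """
    Validates one MRZ field against its trailing check digit (ICAO Doc 9303).

    Expects the field immediately followed by its check digit, e.g. "L898902C36".
    """
    if not mrz_string or len(mrz_string) < 2:
        return False
    field, check_char = mrz_string[:-1], mrz_string[-1]
    if not '0' <= check_char <= '9':
        return False
    weights = [7, 3, 1]  # repeating weights
    total = 0
    for i, char in enumerate(field):  # the check digit itself is excluded
        if char == '<':
            val = 0  # filler character
        elif '0' <= char <= '9':
            val = int(char)
        elif 'A' <= char <= 'Z':
            val = ord(char) - 55  # A=10, B=11, ..., Z=35
        else:
            return False  # not a valid MRZ character
        total += val * weights[i % 3]
    return total % 10 == int(check_char)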
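The comparison between the visual zone and the MRZ described above can be sketched as follows for the second line of a passport (TD3 format, 44 characters). The slice positions follow the TD3 layout in ICAO Doc 9303. The `MrzFields` container and the helper names are hypothetical, and the sample line is the commonly reproduced ICAO specimen, not real personal data.

```python
from dataclasses import dataclass


def check_digit(field: str) -> int:
    """ICAO Doc 9303 check digit: weights 7, 3, 1 repeating, sum modulo 10."""
    total = 0
    for i, char in enumerate(field):
        if char == "<":
            value = 0
        elif "0" <= char <= "9":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - 55  # A=10 ... Z=35
        else:
            raise ValueError(f"invalid MRZ character: {char!r}")
        total += value * (7, 3, 1)[i % 3]
    return total % 10


@dataclass
class MrzFields:
    document_number: str
    date_of_birth: str  # YYMMDD
    expiry_date: str    # YYMMDD
    checks_ok: bool


def parse_td3_line2(line: str) -> MrzFields:
    """Slice line 2 of a TD3 MRZ and verify three field check digits."""
    if len(line) != 44:
        raise ValueError("a TD3 MRZ line has exactly 44 characters")
    number, number_cd = line[0:9], line[9]
    birth, birth_cd = line[13:19], line[19]
    expiry, expiry_cd = line[21:27], line[27]
    pairs = ((number, number_cd), (birth, birth_cd), (expiry, expiry_cd))
    checks_ok = all(
        "0" <= cd <= "9" and check_digit(field) == int(cd)
        for field, cd in pairs
    )
    return MrzFields(number.rstrip("<"), birth, expiry, checks_ok)


def matches_visual_zone(mrz: MrzFields, document_number: str,
                        expiry_yymmdd: str) -> bool:
    """True when LLM-extracted values agree with a checksum-valid MRZ."""
    return (mrz.checks_ok
            and mrz.document_number == document_number.upper()
            and mrz.expiry_date == expiry_yymmdd)


# Line 2 of the ICAO specimen passport
SPECIMEN = "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
fields = parse_td3_line2(SPECIMEN)
```

A mismatch between `fields` and the LLM's output is a natural candidate for a flag in `VerificationResult.flags`. The sketch skips the personal-number and composite check digits, which a complete validator would also verify.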
Security and Compliance Considerations
When building KYC systems, data privacy is paramount. In the European Union, GDPR mandates strict controls over Personally Identifiable Information (PII).
Data Minimization and Retention
The system should be designed to delete images immediately after extraction and validation. Only the structured data and the verification status should be persisted in the primary database.
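One way to make this deletion hard to forget is to tie the image's lifetime to a context manager, so the file is removed whether processing succeeds or raises. The sketch below uses only the standard library; the `ephemeral_image` helper is a hypothetical name, and the byte string is dummy data rather than a real image.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def ephemeral_image(image_bytes: bytes, suffix: str = ".jpg") -> Iterator[str]:
    """Write an upload to a temp file and guarantee its deletion afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(image_bytes)
        yield path
    finally:
        # Runs on success and on failure, so the image never outlives the check
        if os.path.exists(path):
            os.remove(path)


# Extraction and validation would run inside the `with` block.
with ephemeral_image(b"dummy image bytes") as path:
    existed_during = os.path.exists(path)
exists_after = os.path.exists(path)
```

Deleting the local file does not cover copies held elsewhere, such as object storage, backups, or the LLM provider's own retention, which need their own policies.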
Private Cloud Deployment
For enterprise-grade KYC, using public LLM endpoints may be restricted by compliance policies. Deploying models like Llama 3 or Mistral on private infrastructure using vLLM or TGI (Text Generation Inference) ensures that PII never leaves the controlled environment.
Audit Trails
Every transition in the LangGraph workflow should be logged. LangGraph's built-in persistence layer (Checkpointers) allows for full traceability of how a decision was reached, which is a requirement for regulatory audits.
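Alongside checkpointers, transitions can also be written to an application-level audit log. The following sketch wraps a synchronous node function in a decorator that emits one JSON record per invocation. The `audited` decorator, the `kyc.audit` logger name, and the demo node are assumptions for illustration; an async node would need an async wrapper.

```python
import json
import logging
from datetime import datetime, timezone
from functools import wraps
from typing import Callable

audit_logger = logging.getLogger("kyc.audit")


def audited(node_name: str) -> Callable:
    """Wrap a node function so each invocation emits one JSON audit record."""
    def decorator(node: Callable[[dict], dict]) -> Callable[[dict], dict]:
        @wraps(node)
        def wrapper(state: dict) -> dict:
            update = node(state)
            audit_logger.info(json.dumps({
                "ts": datetime.now(timezone.utc).isoformat(),
                "node": node_name,
                # Record which keys changed, not their values, to keep PII out of logs
                "updated_keys": sorted(update),
            }))
            return update
        return wrapper
    return decorator


@audited("validate")
def demo_validation_node(state: dict) -> dict:
    """Hypothetical node used only to demonstrate the decorator."""
    return {"verification_result": "ok"}
```

Logging key names instead of values is a deliberate choice here: the audit trail shows what each step touched without duplicating personal data into log storage.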
Scaling the System
To handle thousands of verifications per hour, the system should be deployed as a microservice.
- FastAPI Wrapper: Wrap the LangGraph logic in a FastAPI application to provide an asynchronous REST API.
- Task Queue: Use a task queue such as Celery (backed by a broker like RabbitMQ) to handle document processing out-of-band. Document extraction is an I/O-bound task that can take several seconds; it should never block the main request thread.
- Horizontal Scaling: Containerize the application using Docker and deploy on Kubernetes. The extraction nodes can be scaled independently based on the length of the processing queue.
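The out-of-band pattern behind these points can be illustrated in-process with `asyncio.Queue`. This is only a stand-in for a real task queue and broker, meant to show the shape of the producer and worker split; `fake_extract`, the job IDs, and the file names are all hypothetical.

```python
import asyncio
from typing import Dict, List


async def fake_extract(image_path: str) -> str:
    """Hypothetical stand-in for the slow, I/O-bound extraction call."""
    await asyncio.sleep(0.01)
    return f"processed:{image_path}"


async def worker(queue: asyncio.Queue, results: Dict[str, str]) -> None:
    """Pull jobs until cancelled; each job runs off the request path."""
    while True:
        job_id, image_path = await queue.get()
        try:
            results[job_id] = await fake_extract(image_path)
        finally:
            queue.task_done()


async def run_batch(paths: List[str], concurrency: int = 3) -> Dict[str, str]:
    """Enqueue every path, let a small worker pool drain the queue."""
    queue: asyncio.Queue = asyncio.Queue()
    results: Dict[str, str] = {}
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(concurrency)]
    for index, path in enumerate(paths):
        queue.put_nowait((f"job-{index}", path))  # enqueueing returns immediately
    await queue.join()  # wait until every job has been marked done
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


results = asyncio.run(run_batch(["a.jpg", "b.jpg", "c.jpg", "d.jpg"]))
```

In a deployed system the queue would live in a broker so that jobs survive restarts and workers can scale across machines; the `concurrency` parameter here plays the role of the worker replica count.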
Dockerization Strategy
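FROM python:3.12-slim
WORKDIR /app
RUN apt-get update && apt-get install -y \
    build-essential \
    libmagic1 \
    && rm -rf /var/lib/apt/lists/*
COPY pyproject.toml poetry.lock ./
# --without dev replaces the deprecated --no-dev flag; --no-root skips installing
# the project itself, whose source has not been copied yet
RUN pip install poetry && poetry install --without dev --no-root
COPY . .
# Assumes main.py exposes an ASGI app named `app` and that uvicorn is a project dependency
CMD ["poetry", "run", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]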
Error Handling and Edge Cases
In production, AI-driven KYC systems encounter several common failure modes:
- Low Image Quality: If the LLM cannot identify the document type, the workflow should immediately route to a "Request Re-upload" state rather than attempting extraction.
- Unsupported Documents: Use a classification node at the start of the graph to identify the document type. If it is not a supported ID, terminate the process.
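- Model Hallucinations: By using Pydantic AI's result_type, we enforce schema validation. If the LLM returns a string where a date is expected, Pydantic will raise a ValidationError, which the LangGraph workflow can catch and handle via retry logic or manual escalation. Schema validation only catches malformed output; values that are well-formed but wrong still need the deterministic MRZ cross-check.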
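The routing and retry behaviour described in the list above can be sketched with two small functions. The step names (`request_reupload`, `reject_unsupported`, `manual_review`), the three-attempt limit, and the use of `ValueError` as a stand-in for a schema validation error are assumptions for illustration, not fixed parts of the workflow.

```python
from typing import Callable, Optional, Tuple

# Document types named in the IdentityDocument schema
SUPPORTED_TYPES = {"Passport", "ID_Card", "Driver_License"}


def route_after_classification(document_type: Optional[str]) -> str:
    """Pick the next step from the classifier's verdict (None = unreadable image)."""
    if document_type is None:
        return "request_reupload"
    if document_type not in SUPPORTED_TYPES:
        return "reject_unsupported"
    return "extract"


def extract_with_retries(extract: Callable[[], dict],
                         max_attempts: int = 3) -> Tuple[str, Optional[dict]]:
    """Retry on validation errors, then escalate instead of failing silently."""
    for _ in range(max_attempts):
        try:
            return "validate", extract()
        except ValueError:  # stand-in for a schema validation error
            continue
    return "manual_review", None


attempts = {"count": 0}


def flaky_extract() -> dict:
    """Hypothetical extractor that fails twice before returning valid data."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("expiry_date: invalid date")
    return {"expiry_date": "2030-04-15"}


next_step, data = extract_with_retries(flaky_extract)
```

Capping the number of attempts bounds both latency and API cost, and the explicit `manual_review` fallback ensures a document is never dropped after repeated failures.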
Conclusion
Automating KYC with AI moves beyond simple character recognition into the realm of cognitive understanding. By combining LangGraph's stateful orchestration with Pydantic AI's structured