When you're onboarding 50 new employees a week, manual document verification becomes a serious bottleneck. HR teams spend hours checking driver's licenses, work permits, SIN cards, and compliance documents — work that's repetitive, error-prone, and doesn't require human judgment for the vast majority of cases.
We automated it with AWS Textract and a custom LLM verification layer. Here's exactly how it works.
The Architecture
The verification pipeline has five stages:
- Step 1 — Document ingestion: Documents arrive as PDFs or images. PDFs are converted to PNG using PyMuPDF — one page at a time, up to 2 pages to control Textract API costs. Each page is processed independently.
- Step 2 — Textract analysis: Each page is sent to AWS Textract's AnalyzeDocument API with the IDENTITY_DOCUMENT feature type. Textract extracts structured fields: name, date of birth, document number, expiry date, and more.
- Step 3 — LLM screening: The extracted data is passed to AWS Bedrock with a custom screening prompt. The LLM validates the content against the expected document type and flags anomalies. Each field type can have its own screening prompt.
- Step 4 — SIN validation: For Canadian SIN cards, we run a Luhn algorithm check on the extracted number. This catches OCR errors and fraudulent documents that pass visual inspection.
- Step 5 — Result merging: For multi-page documents, we merge results across pages — the page with the highest confidence score wins, but extracted data is unioned across all pages so nothing is lost.
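As a rough illustration of Step 5, the merge could look something like the sketch below. The `PageResult` type, its field names, and the tie-breaking rule for fields that appear on more than one page (keep the value with the higher field-level confidence) are assumptions made for this example, not details of the production pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class PageResult:
    """Hypothetical per-page result; names are illustrative only."""
    page: int
    confidence: float  # page-level confidence, 0-100
    passed: bool
    # field name -> (value, field-level confidence)
    fields: dict = field(default_factory=dict)


def merge_pages(pages):
    """Take the verdict from the highest-confidence page; union fields across pages."""
    if not pages:
        raise ValueError("no pages to merge")
    best = max(pages, key=lambda p: p.confidence)
    merged = {}
    for p in pages:
        for name, (value, conf) in p.fields.items():
            # Assumption: when two pages report the same field,
            # keep the value with the higher field-level confidence.
            if name not in merged or conf > merged[name][1]:
                merged[name] = (value, conf)
    return PageResult(best.page, best.confidence, best.passed, merged)


front = PageResult(1, 96.0, True, {"name": ("JANE DOE", 98.0), "expiry": ("2027-01-01", 60.0)})
back = PageResult(2, 81.0, True, {"expiry": ("2027-01-01", 91.0), "address": ("1 MAIN ST", 88.0)})
merged = merge_pages([front, back])
print(merged.page, sorted(merged.fields))
```

Here the front page supplies the overall verdict and confidence, while the address that only appears on the back page still ends up in the merged record.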
The SIN Luhn Check
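The Luhn algorithm is a simple checksum formula, and Canadian Social Insurance Numbers are built so that they pass it. Counting from the left with 1-indexed positions, the digits at even positions (2, 4, 6, 8) are doubled; if a doubled value exceeds 9, subtract 9. The SIN is valid if the sum of the nine resulting values is divisible by 10. For example, 046 454 286 becomes 0 + 8 + 6 + 8 + 5 + 8 + 2 + 7 + 6 = 50, which passes.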
This catches two important failure modes: OCR errors where a digit is misread (any single wrong digit changes the checksum), and fraudulent documents where the SIN was invented without a valid check digit. We run this check after Textract extraction, before the document is marked as verified.
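A minimal sketch of the check described above, assuming the extracted SIN arrives as a string that may contain spaces or hyphens. The function name `is_valid_sin` is hypothetical.

```python
import re


def is_valid_sin(raw: str) -> bool:
    """Luhn check for a 9-digit Canadian SIN; spaces and hyphens are ignored."""
    cleaned = re.sub(r"[\s\-]", "", raw)
    if not re.fullmatch(r"[0-9]{9}", cleaned):
        return False  # wrong length or a non-digit character (e.g. an OCR misread)
    total = 0
    for position, ch in enumerate(cleaned, start=1):
        digit = int(ch)
        if position % 2 == 0:  # even positions, 1-indexed from the left
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(is_valid_sin("046 454 286"))  # True: the sum is 50
print(is_valid_sin("046 454 236"))  # False: one digit changed, the sum is 49
```

The format check runs before the arithmetic, so an extraction that dropped a digit or read a letter in place of a number is rejected without raising an exception.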
Handling Edge Cases
- Low confidence (<70%): Document goes to manual review queue with the specific failure reason
- Failed verification: Employee is notified and asked to resubmit with a clearer photo
- Escalation: HR is notified with the specific failure reason and the extracted data for review
- Audit trail: Every verification decision is logged with timestamp, confidence score, and extracted data
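The routing rules above could be sketched roughly as follows. The outcome labels, the `decide` and `audit_record` helpers, and the choice to test confidence before the pass/fail result are assumptions for illustration; only the 70% threshold and the logged fields come from the list above.

```python
import json
from datetime import datetime, timezone

LOW_CONFIDENCE_THRESHOLD = 70.0  # percent; below this goes to manual review


def decide(confidence, checks_passed, failure_reason=None):
    """Return (outcome, reason). Outcome labels are hypothetical."""
    # Assumption: low confidence is checked first, since a shaky extraction
    # cannot be trusted to pass or fail on its own.
    if confidence < LOW_CONFIDENCE_THRESHOLD:
        reason = failure_reason or (
            f"confidence {confidence:.1f}% below {LOW_CONFIDENCE_THRESHOLD:.0f}%"
        )
        return "manual_review", reason
    if not checks_passed:
        return "resubmit", failure_reason or "verification failed"
    return "verified", None


def audit_record(document_id, confidence, extracted, outcome, reason, now=None):
    """Build one JSON audit line: timestamp, confidence score, extracted data."""
    now = now or datetime.now(timezone.utc)
    return json.dumps(
        {
            "timestamp": now.isoformat(),
            "document_id": document_id,
            "confidence": confidence,
            "extracted": extracted,
            "outcome": outcome,
            "reason": reason,
        },
        sort_keys=True,
    )


outcome, reason = decide(92.0, False, "SIN checksum failed")
print(audit_record("doc-1", 92.0, {"sin": "046 454 236"}, outcome, reason))
```

Writing the audit record for every decision, including the successful ones, keeps the log complete without depending on which branch was taken.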
Results
- 97%+ accuracy on standard identity documents
- Average verification time: 3.2 seconds per document
- Manual review rate: less than 8% of documents
- Fraud detection rate significantly higher than manual review