Talk "Efficient Medical Data Extraction: Automating ICD-10 Code Identification from Documents"

Efficient Medical Data Extraction: Automating ICD-10 Code Identification from Documents by Alexander Hoeweler was presented at Munich Datageeks - March Edition 2025

Abstract

In this talk, we will explore how we automate the extraction of complex medical data from documents, focusing on the identification of ICD-10 codes. We will also delve into the architecture of our extraction pipeline and discuss how this scalable solution can be adapted for various stakeholders across other Allianz Lines of Business.

About the speaker

Alexander Hoeweler is a Senior Data Scientist at Allianz, specializing in building end-to-end AI services. He is passionate about sharing his experience on how scalable AI projects can be implemented in large-scale companies to create real business value.

Transcript summary

Business Problem and Use Case Context

The presentation addresses the challenge of extracting medical data from insurance documents at Allianz. The current manual process for claim handlers is cumbersome, time-consuming, and inefficient. When an accident occurs and an insured party is injured, doctors create medical documentation containing diagnoses and other relevant information. Claim handlers must manually extract this information from documents to process claims efficiently and help both the policyholder and the injured party quickly.

The core challenge is extracting medical information from documents and mapping it to ICD-10 codes. ICD-10 (International Statistical Classification of Diseases and Related Health Problems, 10th revision) is maintained by the World Health Organization and provides standardized codes for diagnoses. For example, code S13.4 represents sprain and strain of the cervical spine. These codes are structured hierarchically with chapters (like injuries), categories (injuries of the neck), and subcategories (specific injury types).
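The layout of a code such as S13.4 can be made concrete with a small parser. This is a minimal sketch, not part of the presented system: it assumes the simplified layout "one letter, two digits, optional dot and one or two digits", and the names `Icd10Code` and `parse_icd10` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed simplified layout: letter + two digits, optional ".d" or ".dd" suffix.
ICD10_PATTERN = re.compile(r"^([A-Z])(\d{2})(?:\.(\d{1,2}))?$")


class Icd10Code(NamedTuple):
    letter: str                 # leading letter, e.g. "S"
    category: str               # three-character category, e.g. "S13"
    subcategory: Optional[str]  # full code with suffix, e.g. "S13.4", or None


def parse_icd10(code: str) -> Icd10Code:
    """Split an ICD-10 style code into its hierarchical parts."""
    match = ICD10_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not an ICD-10 style code: {code!r}")
    letter, digits, suffix = match.groups()
    category = f"{letter}{digits}"
    subcategory = f"{category}.{suffix}" if suffix else None
    return Icd10Code(letter, category, subcategory)
```

For example, `parse_icd10("S13.4")` yields the letter `S`, the category `S13`, and the subcategory `S13.4`.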

Technical Architecture and Workflow

The solution operates in AWS cloud infrastructure using multiple services. The workflow begins with the claims management system containing millions of documents across thousands of claims. For a specific claim, the system identifies relevant documents through a JSON-based structured data extraction process.

The technical stack includes Amazon Textract for OCR processing to extract raw text from documents, Amazon Bedrock with Claude 3.5 Sonnet as the large language model, a Cohere embedding model for vector representations, and FAISS (Facebook AI Similarity Search) as the vector index for similarity search. The solution is served through AWS.

The process flow moves from document retrieval through OCR, then uses the large language model with carefully crafted prompts to understand document content. The system can perform either direct extraction when ICD-10 codes are already present in documents, or indirect mapping when only diagnosis descriptions exist. After extraction and mapping, results are formatted as JSON and returned to the claims management system.
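The split between direct extraction and indirect mapping can be illustrated with a simple routing function. In the talk this step is handled by the large language model. The regex below is only an assumed stand-in to show the two branches and the JSON-shaped result, and the function name and result keys are hypothetical.

```python
import json
import re

# Assumed code shape: letter + two digits, optional dot and one or two digits.
CODE_IN_TEXT = re.compile(r"\b[A-Z]\d{2}(?:\.\d{1,2})?\b")


def route_diagnosis(text: str) -> dict:
    """Return a JSON-serialisable record saying which branch applies."""
    codes = CODE_IN_TEXT.findall(text)
    if codes:
        # Direct extraction: the document already states the code(s).
        return {"mode": "direct", "codes": codes, "diagnosis_text": text}
    # Indirect mapping: only a description exists, so a lookup is needed later.
    return {"mode": "indirect", "codes": [], "diagnosis_text": text}


def to_json(record: dict) -> str:
    """Serialise the record as it might be returned to a calling system."""
    return json.dumps(record, ensure_ascii=False)
```

A line such as "Sprain of cervical spine (S13.4)" takes the direct branch, while "head contusion" falls through to indirect mapping.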

OCR and Initial Extraction Challenges

Document processing presents several challenges. While the demonstration uses a relatively straightforward example, real-world documents can be far more complex, including 200-page documents written entirely in handwriting. Even simpler documents contain complications like doctor's handwriting, which can be difficult or impossible for humans to read but which Textract can sometimes successfully interpret.

The OCR process extracts text line by line with confidence scores for each line. However, extraction errors do occur, as demonstrated in the example where the first line was incorrectly captured. The large language model compensates for these errors through context understanding during the subsequent processing step.
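Line-level confidence scores can be used to flag text that the later steps should treat with caution. The sketch below works on a dictionary shaped like a Textract text-detection response (a `Blocks` list whose `LINE` entries carry `Text` and `Confidence`). The sample response, the threshold of 80, and the function name are assumptions for illustration.

```python
from typing import Any


def lines_with_confidence(
    response: dict[str, Any], low_threshold: float = 80.0
) -> list[dict[str, Any]]:
    """Collect LINE blocks and mark those below the (assumed) threshold."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        confidence = float(block.get("Confidence", 0.0))
        lines.append(
            {
                "text": block.get("Text", ""),
                "confidence": confidence,
                "low_confidence": confidence < low_threshold,
            }
        )
    return lines


# Hand-written sample in the assumed response shape, not real OCR output.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "D1agnosis: cervical spine sprain", "Confidence": 62.5},
        {"BlockType": "LINE", "Text": "Head contusion", "Confidence": 98.7},
        {"BlockType": "WORD", "Text": "Head", "Confidence": 99.0},
    ]
}
```

Flagged lines are not discarded here. They are passed on with their score so that a downstream step can weigh them against the surrounding context.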

Diagnosis Extraction with Large Language Models

Claude 3.5 Sonnet processes the OCR output using custom prompts to extract structured medical information. The model successfully identifies diagnoses from unstructured text and distinguishes between primary and secondary diagnoses. This distinction is critical because the system must identify only diagnoses directly related to the accident, excluding any pre-existing conditions mentioned in the medical documentation.

The model extracts diagnosis information with context, including details like head contusions, cervical spine injuries, and high-velocity trauma from car accidents. It also provides reasoning for its extractions and maintains awareness of diagnosis hierarchy and relevance to the accident event.
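One way to work with such model output is to request JSON and parse it into typed records before filtering. The schema below (`text`, `kind`, `accident_related`, `reasoning`) is a hypothetical example, not the schema used in the talk, and the sample output is invented for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Diagnosis:
    text: str
    kind: str               # assumed values: "primary" or "secondary"
    accident_related: bool  # False for pre-existing conditions
    reasoning: str


def parse_llm_output(raw: str) -> list[Diagnosis]:
    """Parse the assumed JSON structure into Diagnosis records."""
    items = json.loads(raw)["diagnoses"]
    return [
        Diagnosis(
            text=item["text"],
            kind=item["kind"],
            accident_related=bool(item["accident_related"]),
            reasoning=item.get("reasoning", ""),
        )
        for item in items
    ]


def accident_related_only(diagnoses: list[Diagnosis]) -> list[Diagnosis]:
    """Keep only diagnoses attributed to the accident."""
    return [d for d in diagnoses if d.accident_related]


# Invented sample output in the assumed schema.
SAMPLE_OUTPUT = """
{"diagnoses": [
  {"text": "head contusion", "kind": "primary",
   "accident_related": true, "reasoning": "listed after the car accident"},
  {"text": "cervical spine sprain", "kind": "secondary",
   "accident_related": true, "reasoning": "described as a result of the collision"},
  {"text": "arterial hypertension", "kind": "secondary",
   "accident_related": false, "reasoning": "noted as a known prior condition"}
]}
"""
```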

Indirect Mapping: The Core Innovation

The most sophisticated component addresses cases where documents contain diagnosis descriptions but no ICD-10 codes. Simply asking a large language model to provide the appropriate code is insufficient because the system requires adherence to a specific official list of ICD-10 codes. This list, which is the officially implemented version in Allianz's claim system, contains approximately 84,000 different codes and changes annually.

The solution employs a separate embedding model, the Cohere model, to create vector representations of all 84,000 ICD-10 codes in a 1,024-dimensional space. Each extracted diagnosis description is converted by the same model to a vector in this space. Using FAISS for efficient similarity search, the system runs a k-nearest-neighbor search to find the closest matching ICD-10 code for each extracted diagnosis.

For example, when the system extracts "head contusion" from a document, it embeds this string, searches for the nearest neighbor in the vector space, and identifies S00.95 (head contusion) as the matching ICD-10 code. The distance between the two vectors is very small, indicating high similarity. This approach searches the full space of 84,000 codes efficiently and can map even complex medical diagnoses.
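The embed-then-search idea can be sketched with the standard library alone. This is a toy illustration under stated assumptions: `toy_embed` (hashed character trigrams in 64 dimensions) stands in for the Cohere embedding model, a brute-force scan stands in for FAISS, and the four-entry catalog stands in for the official code list.

```python
import math
import zlib

DIM = 64  # toy dimension; the talk mentions 1,024 for the real model


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding: hashed character trigrams, L2-normalised."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(trigram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def squared_l2(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_index(catalog: dict[str, str]) -> list[tuple[str, list[float]]]:
    """Embed every code description once, up front."""
    return [(code, toy_embed(description)) for code, description in catalog.items()]


def nearest_codes(
    query: str, index: list[tuple[str, list[float]]], k: int = 1
) -> list[tuple[str, float]]:
    """Brute-force k-nearest-neighbour search by squared L2 distance."""
    q = toy_embed(query)
    scored = sorted((squared_l2(q, vec), code) for code, vec in index)
    return [(code, distance) for distance, code in scored[:k]]


# Tiny illustrative catalog; the real list has about 84,000 entries.
CATALOG = {
    "S00.95": "head contusion",
    "S13.4": "sprain and strain of cervical spine",
    "S06.0": "concussion",
    "S52.5": "fracture of lower end of radius",
}
INDEX = build_index(CATALOG)
```

Calling `nearest_codes("head contusion", INDEX)` returns `S00.95` first, because the query and the catalog description produce the same vector. With a real embedding model, paraphrased or misspelled descriptions would also land close to the right code, which a trigram hash cannot promise.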

Additional Medical Context Extraction

Beyond basic diagnosis mapping, the system extracts important contextual information. It identifies laterality - whether injuries affect the left or right side of the body, which is crucial for accurate medical documentation. It also recognizes temporal relationships, particularly "status post" conditions that indicate diagnoses resulting from the accident rather than pre-existing conditions.

The system demonstrates robustness against OCR errors. Even when extraction produces malformed text, the large language model's contextual understanding allows it to correctly interpret the intended medical information and proceed with accurate mapping.

Evaluation Challenges and Ground Truth

Evaluating this system presents significant challenges because establishing ground truth is difficult. When asking multiple doctors to provide the correct ICD-10 code for a specific diagnosis, responses vary considerably - asking ten doctors yields eleven different answers. This variation exists because ICD-10 codes inherently leave space for interpretation.

To establish a baseline, the team had a corporate doctor review a sample of diagnoses and assign corresponding ICD-10 codes. While this doesn't eliminate interpretation variability, it provides a necessary starting point for evaluation. The evaluation then compares model performance against both this medical ground truth and human claim handler performance.
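A comparison against such a reference set comes down to counting matches. The sketch below computes exact-match accuracy and, as an additional assumption not described in the talk, a more lenient score that only compares the three-character category. The sample labels are invented for illustration.

```python
def exact_accuracy(predicted: dict[str, str], truth: dict[str, str]) -> float:
    """Share of reference items whose predicted code matches exactly."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    hits = sum(1 for key, code in truth.items() if predicted.get(key) == code)
    return hits / len(truth)


def category_accuracy(predicted: dict[str, str], truth: dict[str, str]) -> float:
    """Lenient variant: compare only the first three characters of each code."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    hits = sum(
        1 for key, code in truth.items() if predicted.get(key, "")[:3] == code[:3]
    )
    return hits / len(truth)


# Invented sample: diagnosis id -> code assigned by the reviewing doctor / the model.
DOCTOR = {"d1": "S13.4", "d2": "S00.95", "d3": "S06.0", "d4": "S52.5"}
MODEL = {"d1": "S13.4", "d2": "S00.85", "d3": "S06.0", "d4": "S62.3"}
```

On this sample the exact score is 0.5 and the category-level score is 0.75. The same functions can be applied to claim handler labels to obtain a human baseline.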

Performance Metrics and Results

The system demonstrates strong performance across multiple metrics. For injured person identification, ensuring documents correspond to the correct individual, accuracy reaches 100 percent. Diagnosis extraction achieves 94 percent accuracy, significantly outperforming the human baseline.

For direct ICD-10 code mapping - cases where codes are explicitly written in documents - the model achieves 97 percent accuracy compared to 95 percent for human claim handlers. The most challenging task, indirect ICD-10 code mapping from diagnosis descriptions, shows the most significant improvement. The model achieves 68 percent accuracy compared to only 43 percent for human claim handlers, representing a 25 percentage point improvement.

This performance gap is particularly important given the time difference in processing. Human claim handlers require several minutes to read documents and identify indirect ICD-10 codes, while the model completes the same task in seconds. This speed improvement, combined with higher accuracy, substantially accelerates claim processing, allowing faster assistance to both policyholders and injured parties.

Current Development and Future Improvements

The solution is scheduled to go live in September. Current development focuses on improving the indirect ICD-10 code mapping, which at 68 percent accuracy still has room for improvement despite exceeding human performance.

The main optimization approach involves narrowing the 84,000-code search space to only relevant categories. For instance, the injuries chapter (designated by "S" codes) is highly relevant for accident-related claims, while other medical categories may be irrelevant for this use case. By restricting the vector search to pertinent code categories, the system has fewer irrelevant near-matches to choose from, which may increase mapping accuracy. This targeted approach should shrink the search space (the number of candidate codes, not the embedding dimension) while maintaining coverage of all accident-relevant medical conditions.
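Restricting the candidates can be as simple as filtering the code list by prefix before the index is built. This is a minimal sketch with a hypothetical function name and an invented five-entry catalog. The default prefix "S" follows the example above, and which prefixes to allow in practice is an open design choice.

```python
def restrict_catalog(
    catalog: dict[str, str], prefixes: tuple[str, ...] = ("S",)
) -> dict[str, str]:
    """Keep only codes that start with one of the allowed prefixes."""
    return {
        code: description
        for code, description in catalog.items()
        if code.startswith(prefixes)
    }


# Invented mini catalog mixing injury codes with unrelated chapters.
FULL_CATALOG = {
    "S00.95": "head contusion",
    "S13.4": "sprain and strain of cervical spine",
    "I10": "essential (primary) hypertension",
    "J45.9": "asthma, unspecified",
    "S06.0": "concussion",
}

INJURY_CATALOG = restrict_catalog(FULL_CATALOG)
```

Only the filtered entries would then be embedded and searched, so a diagnosis from an accident report can no longer be matched to, say, a respiratory code.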