Munich Datageeks e.V.
Talk "Massively Multimodal Input Management in CIB"

Talk "Massively Multimodal Input Management in CIB"

Felix Reuthlinger

Massively Multimodal Input Management in CIB flow by Konrad Grosser was presented at Munich Datageeks - January Edition 2025

Abstract

Documents are rich in both linguistic and visual information, playing a critical role in many business processes. The automatic understanding of such documents is an evolving research field, requiring robust and advanced multimodal approaches. Document classification, a key aspect of document understanding, demands detailed, context-dependent features that range from task-specific to general-purpose. In this presentation, we will demonstrate how CIB leverages sophisticated feature engineering to train a massively multimodal document classification network. We will showcase the use of expert AI features and explain how these can be tailored to address complex customer challenges, even with limited amounts of labelled data. Finally, we will illustrate how our classification network integrates seamlessly into the CIB flow business process engine, working in tandem with our OCR and data extraction tools to deliver a comprehensive, end-to-end solution for document processing.

About the speaker

Konrad Grosser is currently employed as a Senior Data Scientist in the AI team of CIB. He studied mathematics at ETH Zurich and holds an MSc degree in population genetics from the University of Vienna. Before joining the AI team at CIB, he worked as a research scientist in statistical genetics at the LMU, focusing on the population genetics of hematopoietic neoplasms. He has broad interests and experience in machine learning and statistics, with a focus on automated document lifecycle management.

Transcript summary

Introduction to Document Input Management at CIB

The presentation focuses on CIB's document input management process, a critical component of their workflow solution called CIB Flow. Companies receive documents from various sources including email, physical mail, and mobile apps. CIB has developed an automated document preparation system that incorporates two main AI modules: CIB Recognize (a classification system) and specialized extraction tools that work based on the document type identified.

The Challenge with Traditional AI Approaches

Traditional document classification methods typically operate on raw, high-dimensional inputs such as the full document image or the complete text, training AI systems on millions of documents so that abstract features emerge. CIB initially attempted this approach, using semantic vectors for classification, but achieved only 80% accuracy. This meant that in 20% of cases human workers had to correct the AI's decisions, resulting in significant manual labor and critical internal feedback.

Learning from Human Classification Methods

Humans classify documents more effectively than AI because they leverage social learning and rely on specialized cues rather than reading documents completely. Humans recognize patterns through fonts, layouts, and other visual cues learned from others. The speaker decided to apply this principle to AI by engineering expert features that abstract a document into a compact representation.

Feature Engineering: Converting Documents to Vectors

The core concept involves converting document properties into numerical vectors that AI can process. Multiple features can be combined by concatenating vectors. Key features developed include:

Font Recognition: A CNN detects fonts for each segmented word in a document, counting font types to create a vector (e.g., 100 words in Arial, 200 words in Times New Roman).
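As a rough illustration (not CIB's actual code), the counting step could look like the following Python sketch, assuming a CNN has already predicted one font label per segmented word; the font vocabulary here is a made-up example.

```python
from collections import Counter
import numpy as np

# Fixed font vocabulary; its order defines the vector layout (illustrative choice).
FONT_VOCAB = ["Arial", "Times New Roman", "Courier New", "Helvetica", "Other"]

def font_count_vector(word_fonts):
    """Turn per-word font predictions (e.g. from a CNN) into a count vector."""
    counts = Counter(word_fonts)
    return np.array([counts.get(f, 0) for f in FONT_VOCAB], dtype=np.float32)

# Example: 100 words in Arial, 200 in Times New Roman.
vec = font_count_vector(["Arial"] * 100 + ["Times New Roman"] * 200)
print(vec)  # [100. 200.   0.   0.   0.]
```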

Variational Autoencoder: A network trained to reduce document size by representing each tile with a single integer, then reconstructing the original image. This creates a highly compressed, abstract representation of the document.
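The talk did not describe the architecture in detail; the PyTorch sketch below only illustrates the general idea of compressing each image tile into a single integer code and reconstructing it. A real variational or vector-quantized autoencoder would train this discrete bottleneck differentiably.

```python
import torch
import torch.nn as nn

class TileAutoencoder(nn.Module):
    """Compress each image tile to one integer code, then reconstruct the tile."""
    def __init__(self, tile_px=32, n_codes=256, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(tile_px * tile_px, hidden), nn.ReLU(),
            nn.Linear(hidden, n_codes),          # logits over discrete codes
        )
        self.codebook = nn.Embedding(n_codes, hidden)
        self.decoder = nn.Sequential(
            nn.Linear(hidden, tile_px * tile_px), nn.Sigmoid(),
        )

    def forward(self, tiles):                    # tiles: (N, tile_px*tile_px) in [0, 1]
        logits = self.encoder(tiles)
        codes = logits.argmax(dim=1)             # one integer per tile (non-differentiable;
                                                 # real training uses a VAE/VQ-style trick)
        recon = self.decoder(self.codebook(codes))
        return codes, recon

tiles = torch.rand(4, 32 * 32)                   # four dummy tiles
codes, recon = TileAutoencoder()(tiles)
print(codes.shape, recon.shape)                  # torch.Size([4]) torch.Size([4, 1024])
```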

Semantic Features: Counting occurrences of specific important words or names that help classify documents.
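A minimal sketch of such a keyword-counting feature, using a purely hypothetical keyword list:

```python
import re
import numpy as np

# Hypothetical keyword list; in practice these would be domain terms chosen per project.
KEYWORDS = ["Rechnung", "Vertrag", "Gehalt", "Kündigung", "IBAN"]

def keyword_count_vector(text):
    """Count occurrences of each keyword to form a small semantic feature vector."""
    tokens = re.findall(r"\w+", text.lower())
    return np.array([tokens.count(k.lower()) for k in KEYWORDS], dtype=np.float32)

print(keyword_count_vector("Rechnung Nr. 42: Bitte überweisen Sie auf die IBAN ..."))
```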

BERT Network: Trained on text with missing words, the network learns to predict correct words based on context. Later layers produce vectors that can be extracted and combined with other features.
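One common way to obtain such vectors is to mean-pool the last hidden layer of a pretrained BERT model; the checkpoint named below is an assumption, since the talk did not specify which model CIB uses.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-german-cased"   # placeholder checkpoint, not CIB's model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def bert_feature_vector(text):
    """Mean-pool the last hidden layer of BERT into one fixed-size text vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()      # (768,)

print(bert_feature_vector("Sehr geehrte Damen und Herren, ...").shape)
```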

These various vectors are fused together to create a comprehensive 4,000-dimensional vector representing most aspects of a document in abstract form.
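The fusion itself is plain vector concatenation; the sketch below uses made-up dimensions that only roughly mirror the numbers mentioned in the talk.

```python
import numpy as np

# Hypothetical per-document feature vectors produced by the modules above.
font_vec     = np.zeros(5,   dtype=np.float32)   # font counts
layout_vec   = np.zeros(512, dtype=np.float32)   # autoencoder tile codes (flattened)
keyword_vec  = np.zeros(5,   dtype=np.float32)   # keyword counts
semantic_vec = np.zeros(768, dtype=np.float32)   # BERT embedding

# Fusion is simple concatenation into one long document vector.
fingerprint = np.concatenate([font_vec, layout_vec, keyword_vec, semantic_vec])
print(fingerprint.shape)   # (1290,) here; around 4,000 dimensions in the real system
```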

Visualizing High-Dimensional Data

To understand these 4,000-dimensional vectors, the team used dimension reduction techniques (UMAP) to project them into 2D space. Testing on a customer dataset with limited labeled documents revealed that different document types naturally formed distinct clusters without any classification training. For example, ID documents (Ausweis), salary statements (Gehaltsnachweise), and car registration documents (ZBII) separated into different regions. Interestingly, salary statements formed two clusters, which upon investigation revealed two distinct types of salary statements in their data.
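A minimal example of this kind of projection with the umap-learn package, using random data in place of the (non-public) customer fingerprints:

```python
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

# X: (n_documents, ~4000) fingerprint matrix; labels: document types (dummy data here).
X = np.random.rand(300, 4000).astype(np.float32)
labels = np.random.randint(0, 3, size=300)

# Project the high-dimensional fingerprints into 2D for visual inspection.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5, cmap="tab10")
plt.title("Document fingerprints projected with UMAP")
plt.show()
```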

Customer Project Results

For a customer with only 631 pages (mostly contracts), the team first separated contracts from other documents using the vector approach combined with algorithmic post-processing, achieving 100% classification accuracy on the test set. For the remaining document types, they achieved approximately 94% classification accuracy without extensive training, using confusion matrices to validate the results.

Fraud Detection Application

An interesting project involved detecting fraud in physician billing. Different physicians produce invoices with distinct "fingerprints" based on their document creation patterns. Even if someone creates a fraudulent invoice with correct numbers, the document would look different from authentic ones if the fraudster lacks a good template. Using document feature vectors (now called "fingerprints"), the system achieved nearly 100% separation between different physicians' billing documents, with each physician's documents forming distinct clusters.

Topology-Based Fraud Detection

The system uses topological features of document layouts, where word coordinates and their spatial relationships create unique patterns. By varying scales and analyzing how these topological features change, the system vectorizes layout characteristics, significantly improving separation between different physicians' document layouts.
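The exact construction was not spelled out in the talk; the sketch below shows one common multi-scale idea under that assumption: connect word centroids that lie within a given distance, count the connected components, and repeat over increasing scales so that the resulting counts form a layout feature vector.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def layout_topology_vector(word_centers, scales=(20, 40, 80, 160, 320)):
    """Count connected components of the word-position graph at several distance scales."""
    dists = squareform(pdist(word_centers))          # pairwise distances between words
    features = []
    for scale in scales:
        adjacency = csr_matrix(dists <= scale)       # connect words closer than `scale`
        n_components, _ = connected_components(adjacency, directed=False)
        features.append(n_components)
    return np.array(features, dtype=np.float32)

# Example: word centroids in pixel coordinates (hypothetical layout).
centers = np.array([[50, 40], [120, 42], [130, 300], [400, 305], [410, 330]])
print(layout_topology_vector(centers))
```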

Applying the System to CIB's Own Mailbox

CIB's internal mailbox proved more challenging than customer projects due to high variability, multilingual content across different countries, and diverse document systems. Simple clustering achieved only 88% accuracy. Analysis revealed that a catch-all document class called correspondence (Schriftverkehr) - essentially meaning "whatever doesn't fit elsewhere" - intermingles with all other clusters, making direct clustering difficult. However, UMAP visualization showed that related documents (like contracts or invoices) still clustered together, indicating useful signal despite the chaos.

Neural Network Enhancement

Training a simple five-layer MLP (Multi-Layer Perceptron) on the fingerprint features significantly improved results. Comparison of approaches showed:

  • Simple feature vectors (like prototype embeddings): Poor performance, lacking sufficient document information
  • Fingerprints with clustering: Better, with features slightly outperforming basic parameters
  • Fingerprints plus neural network: Best results at 94% accuracy

The improvement from 80% to 94% accuracy cuts the remaining error from 20% to 6%, meaning workers need to correct the AI only about a third as often - a substantial productivity gain.
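For orientation, a five-layer MLP over the fingerprints could look like the PyTorch sketch below; the layer sizes and class count are assumptions, not CIB's actual configuration.

```python
import torch
import torch.nn as nn

# A five-layer MLP over the ~4,000-dimensional fingerprints (sizes are assumptions).
n_classes = 12
mlp = nn.Sequential(
    nn.Linear(4000, 1024), nn.ReLU(),
    nn.Linear(1024, 512),  nn.ReLU(),
    nn.Linear(512, 256),   nn.ReLU(),
    nn.Linear(256, 64),    nn.ReLU(),
    nn.Linear(64, n_classes),          # class logits
)

fingerprints = torch.rand(8, 4000)     # a dummy batch of document fingerprints
logits = mlp(fingerprints)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, n_classes, (8,)))
loss.backward()                        # standard supervised training step
print(logits.shape)                    # torch.Size([8, 12])
```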

Entity Extraction System

Following classification, different document types require different information extraction approaches. CIB developed a three-pillar mixture-of-experts system (illustrative sketches for each pillar follow the list):

1. Mask-Based Form Extraction: For documents with static entity positions (like standardized forms), knowing exact coordinates beats any LLM approach. The system includes dewarping capabilities to handle curved or moved documents, achieving over 95% extraction accuracy even on poor-quality documents like salary statements.

2. In-House BERT-Based System: Uses an older but lightweight architecture with BERT embeddings for named entity recognition. Highly adaptable with only 10 examples per entity needed for training. The key advantage is on-premise deployment capability on CPU or small GPU, making it suitable for sensitive data and customer-specific adaptations without external dependencies.

3. Large Language Models: Models like GPT-4 or DeepSeek can be prompted directly for data extraction without any training data. CIB uses specialized templates for LLM extraction in cases where this approach is most effective.
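Pillar 1, sketched under assumptions: the field coordinates below are hypothetical, and pytesseract stands in for whatever OCR engine CIB actually uses; dewarping would run before the crop step.

```python
from PIL import Image
import pytesseract                      # requires a local Tesseract installation

# Hypothetical mask for a standardized form: field name -> pixel box (left, top, right, bottom).
SALARY_FORM_MASK = {
    "employee_name": (120, 210, 620, 250),
    "gross_salary":  (640, 830, 900, 870),
}

def extract_by_mask(image_path, mask):
    """Crop each known field region and OCR it; dewarping would happen before this step."""
    page = Image.open(image_path)
    return {field: pytesseract.image_to_string(page.crop(box)).strip()
            for field, box in mask.items()}

# print(extract_by_mask("salary_statement.png", SALARY_FORM_MASK))  # hypothetical file
```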
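Pillar 2, sketched with the Hugging Face transformers pipeline; the public model named below is only a placeholder for CIB's in-house BERT-based extractor.

```python
from transformers import pipeline

# Placeholder NER model; CIB's lightweight in-house model is not public.
ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

text = "Invoice issued by Dr. Meier, Munich, on 2025-01-14 for 431.50 EUR."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```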
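Pillar 3, sketched as a prompt template sent to an LLM; the OpenAI client and model name are just one possible choice, since the talk names GPT-4 and DeepSeek only as examples.

```python
from openai import OpenAI               # any LLM provider would work; OpenAI shown as one option

EXTRACTION_TEMPLATE = """Extract the following fields from the document below
and answer only with JSON: invoice_number, total_amount, due_date.

Document:
{document_text}
"""

client = OpenAI()                       # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",                     # model choice is an assumption
    messages=[{"role": "user", "content": EXTRACTION_TEMPLATE.format(document_text="...")}],
)
print(response.choices[0].message.content)
```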

Conclusion

The highly multimodal document fingerprints enable a classification system that outperforms classical AI approaches on CIB's datasets. The system is highly adaptable, with the input management process tailored to specific needs through the combination of classification and expert extraction systems. The presentation concluded with an apocryphal quote attributed to Manfred Eigen: "In a sufficiently high dimension vector space, every raindrop finds its way into the ocean."