In the routine operations of all our clients, dealing with incoming invoices and ensuring their accurate reconciliation with ERP system records is a constant challenge. The messiness of real-world data, with its diverse formats, imperfections and variations, poses significant obstacles for both traditional OCR and even the latest VisionLLM systems. At PwC, we have developed a revolutionary method to create high-fidelity synthetic invoice data that addresses these challenges head-on, providing the essential labelled data needed for training, fine-tuning and testing these advanced systems.
Invoices in the real world come in all shapes and sizes, literally and figuratively. They vary widely in:
Each supplier may use a different template, leading to a plethora of formats that an OCR system must recognise.
Real invoices often feature smudges, creases, handwritten notes or other marks that can confuse OCR systems.
Fields such as dates, amounts and supplier information are dynamic, changing with each invoice.
The logical flow and relationship between different fields must be accurately interpreted for effective reconciliation.
Handling this complexity requires extensive, high-quality labelled data to train and test OCR and VisionLLM systems. However, obtaining and annotating real-world data is not only challenging but also fraught with privacy concerns and logistical issues.
Labelled synthetic data offers a solution to these challenges. By generating artificial data that mimics real-world invoices, we can create a diverse, extensive and secure dataset for training AI systems. Here’s how labelled synthetic data transforms OCR and VisionLLM system development:
Synthetic data can be generated in large volumes, enabling comprehensive testing of OCR and VisionLLM systems under various scenarios.
Labelled synthetic data ensures that the AI models receive consistent and accurately annotated data, which is crucial for effective learning.
With synthetic data, it's easy to introduce a wide range of variations and imperfections, helping AI models learn to handle the messiness of real-world invoices.
At PwC, we leverage a combination of latest Generative AI models, traditional typesetting tools and sophisticated image manipulation techniques to produce high-fidelity synthetic invoice data. Here’s a glimpse into our innovative process:
We use advanced GenAI models to create diverse invoice templates that reflect the wide range of formats and layouts found in real-world data. These models introduce natural variations in the content and style of the invoice, ensuring a rich dataset.
Utilising the best-in-class typesetting tools, we generate printed invoices with precise control over formatting and structure. This approach allows us to produce large amounts of invoices with multipage tables and other elements often found in real invoices that are challenging for OCR-based records linkage systems.
To enhance the realism of our synthetic data, we apply various image manipulation techniques to simulate real-world imperfections. This includes adding smudges, creases, handwritten notes and other artefacts that are common in scanned documents. These manipulations ensure that our synthetic data closely mirrors the challenges an OCR system will face in practice.
The entire technology stack is built to mass-produce synthetic data in an unsupervised fashion and ensures the highest fidelity of the produced data. By analysing our client's existing invoices and statistically modelling distribution of specific features, we can produce synthetic datasets that are virtually indistinguishable from the real thing.
Our high-fidelity synthetic invoice data offers the ultimate convenience for businesses adopting AI-based OCR and VisionLLM systems. By providing a robust, varied and realistic dataset, we enable:
AI models trained on our synthetic data are better equipped to handle the diversity and complexity of real-world invoices.
With precise and consistent labelling, our synthetic data ensures that AI models learn to accurately extract and interpret invoice data.
Extensive testing with our synthetic data allows businesses to validate their OCR and VisionLLM systems thoroughly, ensuring reliability and performance prior to deployment.
We are committed to pushing the boundaries of what’s possible with AI in financial operations. Our innovative approach to creating high-fidelity synthetic invoice data is just one example of how we help businesses overcome real-world challenges with cutting-edge technology. By ensuring that our clients’ OCR and VisionLLM systems are trained, fine-tuned and tested with the best possible data, we pave the way for more efficient, accurate and reliable invoice reconciliation processes.
Partner and Forensic Services and Financial Crime Leader, Zurich, PwC Switzerland
+41 58 792 17 60