Revolutionising record linkage with high-fidelity synthetic invoice data

Invoice handling with AI

Revolutionising record linkage with high-fidelity synthetic invoice data
  • Insight
  • 10 minute read
  • 11/07/24
Sebastian Ahrens

Sebastian Ahrens

AI Center of Excellence Leader, PwC Switzerland

In the routine operations of all our clients, dealing with incoming invoices and ensuring their accurate reconciliation with ERP system records is a constant challenge. The messiness of real-world data, with its diverse formats, imperfections and variations, poses significant obstacles for both traditional OCR and even the latest VisionLLM systems. At PwC, we have developed a revolutionary method to create high-fidelity synthetic invoice data that addresses these challenges head-on, providing the essential labelled data needed for training, fine-tuning and testing these advanced systems.
 

The real-world data challenge

Invoices in the real world come in all shapes and sizes, literally and figuratively. They vary widely in:

Formats and layouts

Each supplier may use a different template, leading to a plethora of formats that an OCR system must recognise.

Imperfections

Real invoices often feature smudges, creases, handwritten notes or other marks that can confuse OCR systems.

Dynamic data

Fields such as dates, amounts and supplier information are dynamic, changing with each invoice.

Contextual complexity

The logical flow and relationship between different fields must be accurately interpreted for effective reconciliation.

Handling this complexity requires extensive, high-quality labelled data to train and test OCR and VisionLLM systems. However, obtaining and annotating real-world data is not only challenging but also fraught with privacy concerns and logistical issues.

The need for labelled synthetic data

Labelled synthetic data offers a solution to these challenges. By generating artificial data that mimics real-world invoices, we can create a diverse, extensive and secure dataset for training AI systems. Here’s how labelled synthetic data transforms OCR and VisionLLM system development:

Extensive testing

Synthetic data can be generated in large volumes, enabling comprehensive testing of OCR and VisionLLM systems under various scenarios.

Consistency and accuracy

Labelled synthetic data ensures that the AI models receive consistent and accurately annotated data, which is crucial for effective learning.

Adaptability

With synthetic data, it's easy to introduce a wide range of variations and imperfections, helping AI models learn to handle the messiness of real-world invoices.

Our innovative approach to creating synthetic data

At PwC, we leverage a combination of latest Generative AI models, traditional typesetting tools and sophisticated image manipulation techniques to produce high-fidelity synthetic invoice data. Here’s a glimpse into our innovative process:

Generative AI

We use advanced GenAI models to create diverse invoice templates that reflect the wide range of formats and layouts found in real-world data. These models introduce natural variations in the content and style of the invoice, ensuring a rich dataset.

Typesetting

Utilising the best-in-class typesetting tools, we generate printed invoices with precise control over formatting and structure. This approach allows us to produce large amounts of invoices with multipage tables and other elements often found in real invoices that are challenging for OCR-based records linkage systems.

Sophisticated image manipulation

To enhance the realism of our synthetic data, we apply various image manipulation techniques to simulate real-world imperfections. This includes adding smudges, creases, handwritten notes and other artefacts that are common in scanned documents. These manipulations ensure that our synthetic data closely mirrors the challenges an OCR system will face in practice.

The entire technology stack is built to mass-produce synthetic data in an unsupervised fashion and ensures the highest fidelity of the produced data. By analysing our client's existing invoices and statistically modelling distribution of specific features, we can produce synthetic datasets that are virtually indistinguishable from the real thing.

Providing ultimate comfort in AI-based systems

Our high-fidelity synthetic invoice data offers the ultimate convenience for businesses adopting AI-based OCR and VisionLLM systems. By providing a robust, varied and realistic dataset, we enable:

Robust training

AI models trained on our synthetic data are better equipped to handle the diversity and complexity of real-world invoices.

Enhanced accuracy

With precise and consistent labelling, our synthetic data ensures that AI models learn to accurately extract and interpret invoice data.

Comprehensive validation

Extensive testing with our synthetic data allows businesses to validate their OCR and VisionLLM systems thoroughly, ensuring reliability and performance prior to deployment.

Interested to hear more? Please get in touch.

We are committed to pushing the boundaries of what’s possible with AI in financial operations. Our innovative approach to creating high-fidelity synthetic invoice data is just one example of how we help businesses overcome real-world challenges with cutting-edge technology. By ensuring that our clients’ OCR and VisionLLM systems are trained, fine-tuned and tested with the best possible data, we pave the way for more efficient, accurate and reliable invoice reconciliation processes.

Questions? We are here for you.

Contact us

Sebastian Ahrens

AI Center of Excellence Leader, PwC Switzerland

+41 58 792 16 28

Email

Gianfranco Mautone

Partner and Forensic Services and Financial Crime Leader, Zurich, PwC Switzerland

+41 58 792 17 60

Email

Tobias Scheiwiller

Partner, Risk & Regulatory, PwC Switzerland

+41 79 740 25 04

Email