Overview

Vertesia Semantic DocPrep is a generative AI-powered service that transforms documents into structured XML files so that LLMs can process and understand the document's content more accurately.

Annotated PDF with XML Viewer

What problem does it solve?

When Large Language Models (LLMs) are given a plain-text conversion of a PDF or PowerPoint file, critical information conveyed by the document's layout and formatting is lost. A heading signals the start of a new topic, and the rows and columns of a table define how its data is organized; this structure provides essential context for understanding. Without it, LLMs can struggle to grasp the meaning and relationships within complex documents, which leads to less accurate processing and difficulty extracting specific details. For instance, an LLM might not associate a paragraph with its corresponding heading, or might misinterpret the data presented in a multi-layered table.

Vertesia Semantic DocPrep intelligently transforms your documents, starting with PDFs, into structured XML files, creating a richer, semantically aware representation of your content. By explicitly encoding the document's layout and formatting within the XML structure, Vertesia preserves the crucial contextual cues that are otherwise lost in plain text. This enhanced structure acts as a clear roadmap for LLMs, dramatically improving their ability to understand the document's content, especially when dealing with intricate layouts and large tables.
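As a rough illustration of why explicit structure helps, the sketch below parses a small XML fragment and recovers two relationships that plain text would lose: which paragraph belongs to which heading, and which table value belongs to which column. The element names (`document`, `section`, `heading`, `paragraph`, `table`, `row`, `cell`) are hypothetical and chosen only for this example; they are not the actual Semantic DocPrep output schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: element names are illustrative assumptions,
# not the actual Vertesia Semantic DocPrep output schema.
SAMPLE_XML = """
<document>
  <section>
    <heading>Quarterly Results</heading>
    <paragraph>Revenue grew in every region.</paragraph>
    <table>
      <row><cell>Region</cell><cell>Revenue</cell></row>
      <row><cell>EMEA</cell><cell>120</cell></row>
      <row><cell>APAC</cell><cell>95</cell></row>
    </table>
  </section>
</document>
"""


def paragraphs_by_heading(xml_text):
    """Map each section heading to the paragraphs inside that section."""
    root = ET.fromstring(xml_text)
    result = {}
    for section in root.iter("section"):
        heading = section.findtext("heading", default="").strip()
        result[heading] = [
            (p.text or "").strip() for p in section.findall("paragraph")
        ]
    return result


def table_records(xml_text):
    """Turn the first table into a list of dicts keyed by its header row."""
    root = ET.fromstring(xml_text)
    table = root.find(".//table")
    rows = [
        [(cell.text or "").strip() for cell in row.findall("cell")]
        for row in table.findall("row")
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, values)) for values in body]


if __name__ == "__main__":
    print(paragraphs_by_heading(SAMPLE_XML))
    print(table_records(SAMPLE_XML))
```

In a plain-text conversion, the same content would be a flat run of words and numbers; here the heading-to-paragraph and column-to-value associations are explicit and can be passed to an LLM or queried directly.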

Vertesia directly addresses the limitations of processing plain text, paving the way for more accurate, reliable, and insightful interaction and automation with your documents:

Accurate Information Extraction: Knowing the structure helps LLMs extract specific pieces of information more reliably, especially from tables or complex structures.
Deep Linking and Referencing: Precise layout information (like bounding boxes) enables the creation of deep links to specific locations within a document.
Reduced Hallucinations: Grounding the LLM in a structured representation of the document reduces the chance that it generates inaccurate information, "hallucinates" content that is not actually present, or misinterprets content for lack of structural understanding.
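The deep-linking benefit above can be sketched as follows, assuming each XML element carries a page number and a bounding box. The `page` and `bbox` attribute names, the comma-separated coordinate order, and the example URL are assumptions made for this illustration, not the documented output format. The `#page=N` fragment is a convention that many PDF viewers honor; how the bounding box is used to highlight the region depends on the viewer.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: the "page" and "bbox" attribute names and the
# left,top,right,bottom ordering are assumptions for this sketch only.
BLOCK_XML = (
    '<paragraph page="3" bbox="72,540,523,612">'
    "Total revenue was 215."
    "</paragraph>"
)


def deep_link(base_url, block_xml):
    """Build a page-level link plus the region to highlight on that page."""
    block = ET.fromstring(block_xml)
    page = int(block.get("page"))
    left, top, right, bottom = (
        float(value) for value in block.get("bbox").split(",")
    )
    return {
        "url": f"{base_url}#page={page}",
        "bbox": (left, top, right, bottom),
    }


if __name__ == "__main__":
    print(deep_link("https://example.com/report.pdf", BLOCK_XML))
```

A viewer could open the returned URL and draw a highlight over the returned box, so that an answer produced by an LLM points back to the exact passage it came from.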

Main features

Vertesia Semantic DocPrep includes the following features:

OCR images to extract text content
Transform PDF files into structured XML files which can be easily leveraged in all your interactions and workflows
Generate annotated renditions of pages
Extract tables contained in documents using your own specific target schemas

Supported format

As of today, the service only supports the PDF format.