Semantic DocPrep
The /objects/:objectId/analyze endpoint exposes a service that transforms content into XML. XML provides a richer, more structured representation than plain-text formats. This structure helps LLMs process and understand the document, particularly large tables and complex layouts. Currently, only PDF files are supported.
Besides improving LLM "understanding" of complex documents, the XML format also enables additional capabilities:
- Deep linking. LLM responses can include not just the page of a reference, but also its position on that page. This can be used to build richer user experiences.
- Extracting pieces of content, such as long tables, verbatim, without LLM rewriting or hallucination.
Analyze
Use this endpoint to trigger a content analysis for an Object.
Endpoint: /objects/:objectId/analyze
Method: POST
Requirements: User must have content:write permission on the object.
Headers
| Header | Value |
|---|---|
| Authorization | Bearer <YOUR_JWT_TOKEN> |
Path Parameters
| Parameter | Data Type | Description |
|---|---|---|
| objectId | string | (Required) The ID of the Object to analyze. The object's MIME type must be application/pdf. |
Input Parameters
| Parameter | Data Type | Description |
|---|---|---|
| features | string[] | The features to activate for the analysis. Currently not used. |
Example Request Payload
{
"features": []
}
Example Response
{
"workflow_id": "workflow_execution_request:abcdef1234567890abcdef123456",
"workflow_run_id": "abcdef12-abcd-1234-abcd-1234567890ab",
"status": 1
}
Code Example
Analyze
curl --location --request POST \
'https://api.vertesia.io/api/v1/objects/:objectId/analyze' \
--header 'Authorization: Bearer <YOUR_JWT_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
"features": []
}'
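The same request can be issued from Python. The following is a minimal sketch using only the standard library; the names `BASE_URL`, `build_analyze_request`, and `trigger_analysis` are illustrative and are not part of any official SDK. The base URL is taken from the curl example above.

```python
import json
import urllib.request

# Base URL copied from the curl example; adjust it if your deployment differs.
BASE_URL = "https://api.vertesia.io/api/v1"


def build_analyze_request(object_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that triggers an analysis."""
    body = json.dumps({"features": []}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/objects/{object_id}/analyze",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def trigger_analysis(object_id: str, token: str) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(build_analyze_request(object_id, token)) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before any network call is made.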
Get Status
Use this endpoint to get the status of a previously requested analysis.
Endpoint: /objects/:objectId/analyze/status
Method: GET
Requirements: User must have content:read permission on the object.
Headers
| Header | Value |
|---|---|
| Authorization | Bearer <YOUR_JWT_TOKEN> |
Path Parameters
| Parameter | Data Type | Description |
|---|---|---|
| objectId | string | (Required) The ID of the object to retrieve the status for. |
Example Response
{
"workflow_id": "652d77:workflow_execution_request:67ef301df9967ef50104d652",
"workflow_run_id": "0195fe53-f5db-7420-b161-9882945d2339",
"status": 1,
"progress": {
"pages": {
"total": 1,
"processed": 1,
"success": 1,
"failed": 0
},
"tables": {
"total": 1,
"processed": 1,
"success": 1,
"failed": 0
},
"images": {
"total": 1,
"processed": 0,
"success": 0,
"failed": 0
},
"visuals": {
"total": 2,
"processed": 0,
"success": 0,
"failed": 0
},
"started_at": 1743728670181,
"percent": 40
}
}
Code Example
Get Status
curl --location --request GET \
'https://api.vertesia.io/api/v1/objects/:objectId/analyze/status' \
--header 'Authorization: Bearer <YOUR_JWT_TOKEN>'
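Once the status response is decoded, the per-category counters in `progress` can be summarized with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, and this reference does not state how `percent` is computed. The `overall_percent` function simply divides processed items by total items, which happens to reproduce the value 40 for the example response.

```python
from typing import Dict, Tuple

# Trimmed copy of the example status response shown above.
SAMPLE_STATUS = {
    "status": 1,
    "progress": {
        "pages": {"total": 1, "processed": 1, "success": 1, "failed": 0},
        "tables": {"total": 1, "processed": 1, "success": 1, "failed": 0},
        "images": {"total": 1, "processed": 0, "success": 0, "failed": 0},
        "visuals": {"total": 2, "processed": 0, "success": 0, "failed": 0},
        "started_at": 1743728670181,
        "percent": 40,
    },
}


def summarize_progress(status: dict) -> Dict[str, Tuple[int, int]]:
    """Return {category: (processed, total)} for each counter group."""
    summary = {}
    for name, value in status.get("progress", {}).items():
        # Skip scalar entries such as `started_at` and `percent`.
        if isinstance(value, dict) and "total" in value:
            summary[name] = (value.get("processed", 0), value["total"])
    return summary


def overall_percent(status: dict) -> float:
    """Processed items as a percentage of all items (an assumed formula)."""
    counts = list(summarize_progress(status).values())
    total = sum(t for _, t in counts)
    processed = sum(p for p, _ in counts)
    return 100 * processed / total if total else 0.0
```

A client polling this endpoint could log the summary on each iteration to show which category is still being processed.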
Get Results
Use this endpoint to retrieve the result of an analysis once it is completed. The response will contain the XML conversion of the object.
Endpoint: /objects/:objectId/analyze/results
Method: GET
Requirements: User must have content:read permission on the object.
Headers
| Header | Value |
|---|---|
| Authorization | Bearer <YOUR_JWT_TOKEN> |
Path Parameters
| Parameter | Data Type | Description |
|---|---|---|
| objectId | string | (Required) The ID of the object to retrieve the results for. |
Example Response
{
"document": "<xml>...</xml>",
"tables": [],
"images": [],
"annotated": "https://..."
}
Code Example
Get Results
curl --location --request GET \
'https://api.vertesia.io/api/v1/objects/:objectId/analyze/results' \
--header 'Authorization: Bearer <YOUR_JWT_TOKEN>'
Get XML Text
Use this endpoint to fetch the object's corresponding XML string once the analysis is completed.
Endpoint: /objects/:objectId/analyze/xml
Method: GET
Requirements: User must have content:read permission on the object.
Headers
| Header | Value |
|---|---|
| Authorization | Bearer <YOUR_JWT_TOKEN> |
Path Parameters
| Parameter | Data Type | Description |
|---|---|---|
| objectId | string | (Required) The ID of the object to retrieve the XML text for. |
Example Response
<document>...</document>
Code Example
Get XML Text
curl --location --request GET \
'https://api.vertesia.io/api/v1/objects/:objectId/analyze/xml' \
--header 'Authorization: Bearer <YOUR_JWT_TOKEN>'
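The returned XML string can be loaded with any standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` to count the element types in a document. The element names in `SAMPLE_XML` (`page`, `p`) are purely hypothetical, since this reference does not describe the XML schema beyond the `<document>` root.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; the real schema is not documented in this reference.
SAMPLE_XML = "<document><page><p>Hello</p><p>World</p></page></document>"


def count_tags(xml_text: str) -> Counter:
    """Count how many times each element tag occurs, including the root."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())
```

Running `count_tags` on a real response is a quick way to discover which element types the analysis produced before writing more specific extraction code.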
Get Tables
Use this endpoint to fetch the object's table content once the analysis is completed.
Endpoint: /objects/:objectId/analyze/tables
Method: GET
Requirements: User must have content:read permission on the object.
Headers
| Header | Value |
|---|---|
| Authorization | Bearer <YOUR_JWT_TOKEN> |
Path Parameters
| Parameter | Data Type | Description |
|---|---|---|
| objectId | string | (Required) The ID of the object to retrieve the tables for. |
Query Parameters
| Parameter | Data Type | Description |
|---|---|---|
| format | 'csv' \| 'json' | The format of the tables to return. Defaults to json. |
Example Response
[
{
"page_number": 1,
"table_number": 10,
"data": [
{
"Column 1": "Row 1, Cell 1 content",
"Column 2": "Row 1, Cell 2 content"
},
{
"Column 1": "Row 2, Cell 1 content",
"Column 2": "Row 2, Cell 2 content"
}
],
"format": "application/json"
}
]
Code Example
Get Tables
curl --location --request GET \
'https://api.vertesia.io/api/v1/objects/:objectId/analyze/tables?format=json' \
--header 'Authorization: Bearer <YOUR_JWT_TOKEN>'
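If you already hold the JSON form of a table and want CSV without a second request, the `data` rows can be converted locally. This is a minimal sketch; `table_to_csv` is a hypothetical helper, and it assumes each entry in `data` is a flat object keyed by column name, as in the example response.

```python
import csv
import io

# One table entry, copied from the example response above.
SAMPLE_TABLE = {
    "page_number": 1,
    "table_number": 10,
    "data": [
        {"Column 1": "Row 1, Cell 1 content", "Column 2": "Row 1, Cell 2 content"},
        {"Column 1": "Row 2, Cell 1 content", "Column 2": "Row 2, Cell 2 content"},
    ],
    "format": "application/json",
}


def table_to_csv(table: dict) -> str:
    """Serialize a table's `data` rows to CSV text with a header row."""
    rows = table["data"]
    if not rows:
        return ""
    # Preserve first-seen column order across all rows.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because the sample cell values contain commas, the `csv` module quotes them automatically, so the output remains parseable.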
Get Images
Use this endpoint to retrieve information about the images that are embedded into the source PDF file.
Endpoint: /objects/:objectId/analyze/images
Method: GET
Requirements: User must have content:read permission on the object.
Headers
| Header | Value |
|---|---|
| Authorization | Bearer <YOUR_JWT_TOKEN> |
Path Parameters
| Parameter | Data Type | Description |
|---|---|---|
| objectId | string | (Required) The ID of the object to retrieve the images for. |
Example Response
[
{
"id": "55",
"page_number": 1,
"description": "This is an image.",
"width": 100,
"height": 100,
"is_meaningful": false
}
]
Code Example
Get Images
curl --location --request GET \
'https://api.vertesia.io/api/v1/objects/:objectId/analyze/images' \
--header 'Authorization: Bearer <YOUR_JWT_TOKEN>'
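The `is_meaningful` flag in each entry can be used to filter the list on the client side. The sketch below is illustrative only: `meaningful_images` is a hypothetical helper, and the second sample entry is invented to show the filter keeping something.

```python
from typing import Dict, List

SAMPLE_IMAGES = [
    # First entry copied from the example response above.
    {"id": "55", "page_number": 1, "description": "This is an image.",
     "width": 100, "height": 100, "is_meaningful": False},
    # Second entry is invented for illustration.
    {"id": "56", "page_number": 2, "description": "A chart.",
     "width": 640, "height": 480, "is_meaningful": True},
]


def meaningful_images(images: List[Dict]) -> List[Dict]:
    """Keep only the entries flagged as meaningful."""
    return [image for image in images if image.get("is_meaningful")]
```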
Get Annotated
Use this endpoint to get a rendition of the PDF file annotated with block outlines and IDs.
Endpoint: /objects/:objectId/analyze/annotated
Method: GET
Requirements: User must have content:read permission on the object.
Headers
| Header | Value |
|---|---|
| Authorization | Bearer <YOUR_JWT_TOKEN> |
Path Parameters
| Parameter | Data Type | Description |
|---|---|---|
| objectId | string | (Required) The ID of the object to retrieve the annotated PDF for. |
Example Response
{
"url": "https://storage.googleapis.com/.../annotated.pdf"
}
Code Example
Get Annotated
curl --location --request GET \
'https://api.vertesia.io/api/v1/objects/:objectId/analyze/annotated' \
--header 'Authorization: Bearer <YOUR_JWT_TOKEN>'
Adapt Tables
Use this endpoint to transform tables contained in the source PDF files into a format of your choice. The service identifies the relevant tables and maps their columns to the requested format.
For example, this endpoint can be used to extract line items from an invoice document. The document may contain other tables, such as a packing list. The service first identifies the relevant tables, ignores the others, and then maps the columns to fit the requested format.
Endpoint: /objects/:objectId/analyze/adapt_tables
Method: POST
Requirements: User must have content:write permission on the object.
Headers
| Header | Value |
|---|---|
| Authorization | Bearer <YOUR_JWT_TOKEN> |
Path Parameters
| Parameter | Data Type | Description |
|---|---|---|
| objectId | string | (Required) The ID of the object whose tables should be adapted. |
Input Parameters
| Parameter | Data Type | Description |
|---|---|---|
| item_name | string | (Required) The name of the item to extract. |
| target_schema | string | (Required) The target schema of the tables. For example, a JSON Schema serialized as a string. |
| instructions | string | (Required) The instructions for adapting the tables. Typically a description of what a row should contain. |
| environment | string | The environment to use for the workflow. |
| notify_endpoints | string[] | The endpoints to notify when the workflow is complete. |
Example Request Payload
{
"item_name": "invoice line item",
"target_schema": "{\r\n \"$schema\": \"http:\/\/json-schema.org\/draft-07\/schema#\",\r\n \"type\": \"object\",\r\n \"title\": \"Invoice line item schema\",\r\n \"description\": \"A line item\",\r\n \"properties\": {\r\n \"line_item_number\": {\r\n \"type\": \"string\",\r\n \"description\": \"A simple identifier number for the line item which is unique and incremental\"\r\n },\r\n \"product_code\": {\r\n \"type\": \"string\"\r\n },\r\n \"description\": {\r\n \"type\": \"string\"\r\n },\r\n \"quantity\": {\r\n \"type\": \"number\",\r\n \"minimum\": 0\r\n },\r\n \"unit_price\": {\r\n \"type\": \"number\",\r\n \"minimum\": 0\r\n },\r\n \"amount\": {\r\n \"type\": \"number\",\r\n \"minimum\": 0\r\n }\r\n }\r\n}",
"instructions": "A valid invoice line item table features rows such as description, quantity, unit price, and amount columns.",
"environment": "environmentID",
"notify_endpoints": []
}
Example Response
{
"workflow_id": "123456:content_object.workflow_execution_request:abcdef1234567890abcdef123456",
"workflow_run_id": "abcdef12-abcd-1234-abcd-1234567890ab",
"status": "running"
}
Code Example
Adapt Tables
curl --location --request POST \
'https://api.vertesia.io/api/v1/objects/:objectId/analyze/adapt_tables' \
--header 'Authorization: Bearer <YOUR_JWT_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
"item_name": "invoice line item",
"target_schema": "{\r\n \"$schema\": \"http:\/\/json-schema.org\/draft-07\/schema#\",\r\n \"type\": \"object\",\r\n \"title\": \"Invoice line item schema\",\r\n \"description\": \"A line item\",\r\n \"properties\": {\r\n \"line_item_number\": {\r\n \"type\": \"string\",\r\n \"description\": \"A simple identifier number for the line item which is unique and incremental\"\r\n },\r\n \"product_code\": {\r\n \"type\": \"string\"\r\n },\r\n \"description\": {\r\n \"type\": \"string\"\r\n },\r\n \"quantity\": {\r\n \"type\": \"number\",\r\n \"minimum\": 0\r\n },\r\n \"unit_price\": {\r\n \"type\": \"number\",\r\n \"minimum\": 0\r\n },\r\n \"amount\": {\r\n \"type\": \"number\",\r\n \"minimum\": 0\r\n }\r\n }\r\n}",
"instructions": "A valid invoice line item table features rows such as description, quantity, unit price, and amount columns.",
"environment": "environmentID",
"notify_endpoints": []
}'
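Because `target_schema` is a string rather than a nested object, hand-escaping the schema is error-prone. One option is to keep the schema as a native data structure and serialize it when building the payload. The sketch below does this in Python; `build_adapt_tables_payload` is a hypothetical helper, and the schema mirrors the one in the example request.

```python
import json
from typing import List, Optional

# Same schema as in the example payload, expressed as a Python dict.
LINE_ITEM_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "title": "Invoice line item schema",
    "description": "A line item",
    "properties": {
        "line_item_number": {
            "type": "string",
            "description": "A simple identifier number for the line item "
                           "which is unique and incremental",
        },
        "product_code": {"type": "string"},
        "description": {"type": "string"},
        "quantity": {"type": "number", "minimum": 0},
        "unit_price": {"type": "number", "minimum": 0},
        "amount": {"type": "number", "minimum": 0},
    },
}


def build_adapt_tables_payload(
    item_name: str,
    schema: dict,
    instructions: str,
    environment: Optional[str] = None,
    notify_endpoints: Optional[List[str]] = None,
) -> dict:
    """Build the request body; the schema is serialized into a string."""
    payload = {
        "item_name": item_name,
        "target_schema": json.dumps(schema, indent=2),
        "instructions": instructions,
    }
    # Optional parameters are only included when provided.
    if environment is not None:
        payload["environment"] = environment
    if notify_endpoints is not None:
        payload["notify_endpoints"] = notify_endpoints
    return payload
```

Calling `json.dumps` on the returned payload then produces the correctly escaped request body, with no manual `\"` or `\r\n` sequences.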
Get Adapted Tables
Use this endpoint to retrieve the adapted tables when processing is complete.
Endpoint: /objects/:objectId/analyze/adapt_tables/:runId
Method: GET
Requirements: User must have content:read permission on the object.
Headers
| Header | Value |
|---|---|
| Authorization | Bearer <YOUR_JWT_TOKEN> |
Path Parameters
| Parameter | Data Type | Description |
|---|---|---|
| objectId | string | (Required) The ID of the object to retrieve the adapted tables for. |
| runId | string | (Required) The ID of the workflow run to retrieve the adapted tables for. |
Query Parameters
| Parameter | Data Type | Description |
|---|---|---|
| format | 'csv' \| 'json' | The format of the adapted tables to return. Defaults to json. |
Example Response
description,quantity,price
"Row 1, Cell 1",1,10
"Row 2, Cell 1",2,20
Code Example
Get Adapted Tables
curl --location --request GET \
'https://api.vertesia.io/api/v1/objects/:objectId/analyze/adapt_tables/:runId?raw=false&format=csv' \
--header 'Authorization: Bearer <YOUR_JWT_TOKEN>'
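A CSV response can be turned into a list of row dictionaries with Python's `csv` module. This is a minimal sketch; `parse_adapted_csv` is a hypothetical helper, and the sample assumes that fields containing commas are quoted, as standard CSV requires.

```python
import csv
import io
from typing import Dict, List

# Sample CSV with the same columns as the example response.
# Descriptions are quoted because they contain commas.
SAMPLE_CSV = (
    "description,quantity,price\n"
    '"Row 1, Cell 1",1,10\n'
    '"Row 2, Cell 1",2,20\n'
)


def parse_adapted_csv(text: str) -> List[Dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text)))
```

All values come back as strings, so numeric columns such as `quantity` need an explicit conversion (for example `int(row["quantity"])`) before arithmetic.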
