Standardization 🏒

An intro to structuring your documents.

Updated over 2 months ago

Overview

DocuPanda has two fundamental components: parsing, and standardization. These two pieces are the bread and butter of DocuPanda, if you will. Parsing is the initial step, which ingests a raw document and outputs a good internal text representation of that document, including handling of OCR, tables, checkmarks, etc. But what you are left with is still essentially unstructured. It is not immediately useful or clear what to do with that text. That is where standardization comes in.

Bringing Order to Chaos

Standardization is the process of using advanced generative AI (think LLMs, like ChatGPT) to examine a document and convert it into a structured format. The most generic structured format is JSON, a flexible way to represent data: it supports nesting and lists, and handles text strings, numbers, dates, etc. DocuPanda can also convert these output JSONs into an Excel file for more convenient viewing.

There are two fundamental types of standardization: with or without a schema. Now, what is a schema you ask? A schema is also a JSON (the more proper name is JSON schema), which defines the exact structure of the output you expect. It is like the template / blueprint for what you are looking to extract from a document. If we were to make an analogy to cooking, then the document is the ingredients, the schema is the recipe, and the final JSON output (what we call a standardization) is the finished meal.
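To make the recipe analogy concrete, here is a minimal sketch of what a schema and a matching standardization might look like. The invoice fields and values are hypothetical, chosen only for illustration, and the small check function is a toy, not DocuPanda's own validation logic.

```python
import json

# Hypothetical JSON schema for an invoice (field names are illustrative only).
invoice_schema = {
    "type": "object",
    "properties": {
        "vendorName": {"type": "string"},
        "invoiceDate": {"type": "string"},
        "lineItems": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
}

# A standardization that follows the schema above (made-up values).
standardization = {
    "vendorName": "Acme Corp",
    "invoiceDate": "2024-01-15",
    "lineItems": [
        {"description": "Widgets", "amount": 120.0},
        {"description": "Shipping", "amount": 15.5},
    ],
}


def top_level_keys_match(schema: dict, output: dict) -> bool:
    """Toy check: every top-level key of the output is declared in the schema."""
    return set(output) <= set(schema["properties"])


print(json.dumps(standardization, indent=2))
print(top_level_keys_match(invoice_schema, standardization))
```

Note that lineItems is an array, which is the kind of field that matters for splitting large documents.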

Most standardization is done with a schema, because this is what allows for a consistent and repeatable process to extract the same structure from a large number of documents. But it is also possible to not provide a schema, and let the AI improvise whatever output structure makes the most sense for that specific document.

How Does it Work?

We're not going to tell you all our trade secrets! 😂

Just kidding! We'll tell you everything. In general, our latest version of Standardization (Version 2) has roughly the following flow:


1. Determine the Modes

First we determine which mode of operation to use on your document. There are two main modes that we consider:

  1. Display Mode: This controls how we show your document to the AI. There are three alternatives:

    1. Spatial: This shows the document to the AI in a format that is similar to how a human sees it: words appear in their appropriate location in a page, but it is text and not an image. Imagine a sort of "ascii art" like formatting of each page.

    2. Sections: This format is the same as the one you can see on DocuPanda's website in the Document Viewer, that you can download as a text file, and that you get from the document upload API endpoint. It is basically a top-to-bottom list of sections, which attempts to represent the document as best as possible without getting bogged down in spatial locations. Imagine you had to load the PDF into your Kindle and just read it in order - this is the way you would do it. Tables are rendered as markdown in a clean format.

    3. Image: this uses the pixels of the document itself, in addition to the text. This is best for cases where there is information besides the text, for instance table lines, images, signatures, handwritten marks that are not captured by the OCR, etc.

  2. Split Mode: Since large documents are very challenging for the AI, it is very important to split documents into more manageable chunks, process each sub document separately, and combine the results later. This is only relevant for schemas that have arrays in them, because arrays are unbounded in the number of instances we may need to extract. There are two split modes you can choose explicitly:

    1. All: Split as much as possible, with each page being its own sub document. This makes lots of sense for things like bank statements, where each page is just a list of charge line items, each being totally independent.

    2. Never: Do not split this document at all, because all the fields are too interconnected and it would be incorrect to not view the document as a whole. This makes sense sometimes, but can lead to poor performance if there are arrays and we need to extract a large number of instances, because the AI struggles to do this all at once for a large document without getting lazy.

    If you set the parameters for these modes to auto, the AI will decide which is best in the given situation. Otherwise, you can choose yourself and we will respect your selection.
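As a rough illustration of the two explicit split modes, here is a minimal sketch that groups a list of page texts into sub documents. The function name and the representation of pages as strings are assumptions made for this example. It does not model auto, since that choice is made by the AI, and it is not DocuPanda's actual implementation.

```python
from typing import List


def split_document(pages: List[str], split_mode: str) -> List[List[str]]:
    """Group pages into sub documents according to the split mode.

    Hypothetical helper for illustration only.
    """
    if split_mode == "all":
        # Every page becomes its own sub document.
        return [[page] for page in pages]
    if split_mode == "never":
        # The whole document stays together as a single sub document.
        return [list(pages)]
    raise ValueError(f"Unsupported split mode in this sketch: {split_mode!r}")


pages = ["page 1 text", "page 2 text", "page 3 text"]
print(split_document(pages, "all"))
print(split_document(pages, "never"))
```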

2. Standardize

Now is the part where we take the schema, your document in its optimal representation, and let the AI do the extraction. As we mentioned, array fields are given special handling, and applied to the document after being split into sub documents, because otherwise the performance tends to be poor.

After all the extractions are complete, we combine the results back together into the complete JSON standardization.
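The combine step can be pictured with a small sketch like the one below. It assumes each sub document yields a partial JSON with the same keys, concatenates array fields in page order, and keeps the first non-empty value for every other field. Those merge rules and the bank statement field names are assumptions for illustration, not a description of DocuPanda's internal logic.

```python
from typing import Any, Dict, List


def combine_results(partials: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-sub-document extractions into one result (illustrative rules)."""
    combined: Dict[str, Any] = {}
    for partial in partials:
        for key, value in partial.items():
            if isinstance(value, list):
                # Array fields: append the instances found in this sub document.
                combined.setdefault(key, []).extend(value)
            elif key not in combined and value is not None:
                # Other fields: keep the first non-empty value we see.
                combined[key] = value
    return combined


# Hypothetical partial results from two pages of a bank statement.
page_results = [
    {"accountHolder": "Jane Doe", "transactions": [{"amount": 12.5}]},
    {"accountHolder": None, "transactions": [{"amount": 40.0}, {"amount": 3.2}]},
]
print(combine_results(page_results))
```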

3. More Power

There is a third parameter that you can optionally set called Effort Level. By default it is set to standard, but if you set it to high, we will let the AI make another error correcting pass. For now this feature comes at no additional charge, but possibly in the future it will cost more credits. The trade-off is that the process will take longer, and cost more compute - but with better overall performance.

How to Standardize on the Website

OK, now on a technical level, how do we actually standardize with DocuPanda? On the dashboard on the website, you can either select it in the left-hand menu nested under Documents → Standardize, or if you are in the Standardization tab you will see a button in the upper right-hand corner.

Click here to standardize

Step 1: Select Documents and Schema

This will open a window which lets you select a schema and multiple documents. Let's assume for now that you have a schema already prepared. If not, see this article for more info on how to make one. If you want to standardize with no schema, go to Schemas → Create Basic. That flow will let you pick a document, standardize it on-the-fly with no schema, and then even infer a new schema from that output so you can use it again in the future.

So select your schema, select as many documents as you want, and hit Standardize at the bottom left. You can also see how many credits this will cost (two credits per page).
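If you want to estimate the cost before opening the window, the arithmetic is simple. Here is a minimal sketch using the two-credits-per-page figure quoted above; the function name is made up for this example, and the rate may change, so treat the default as an assumption.

```python
from typing import Iterable


def estimate_credits(page_counts: Iterable[int], credits_per_page: int = 2) -> int:
    """Estimate the credits for a run, given the page count of each document."""
    return credits_per_page * sum(page_counts)


# Three documents with 3, 10 and 1 pages: 14 pages in total.
print(estimate_credits([3, 10, 1]))  # 28
```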

Step 2: Choose Parameters

Now we arrive at the final confirmation window.

As we went over earlier, there are multiple parameters you can set:

  • Display Mode: Spatial, Sections, Image, or Auto. By default it is set to Auto, which lets the AI decide what is best, but you can experiment with different options and see empirically what gives the best results for your use-case. If you happen to know that spatial information is very important for your documents, then set that option and don't leave it up to the AI.

  • Split Mode: All, Never, or Auto. By default it is set to Auto, which lets the AI decide what is best, but you can experiment with different options and see empirically what gives the best results for your use-case. If you happen to know that it is totally fine to treat each page independently, then set to All. Conversely, if you know that any splitting here would cause trouble, set to Never.

  • Effort Level: Standard or High. By default it is set to Standard, but you can set it to High for no additional fee - just expect it to take a bit longer.

  • Version: 1 or 2. The previous version 1 is much more naive, with no option to split - hence for large documents with lots of data to extract, results can be poor, or it might just downright fail. The new version 2 is still experimental, but it is the future.

There is also something called Custom Instructions, also known as Guidelines. These are additional instructions that accompany your schema, and help the AI understand things that might be unintuitive, or things that you have noticed it makes mistakes on.

By default, these instructions are prefilled with the value taken from the schema, but you are free to override it for this specific run. Guidelines are a superpower that let you quickly iterate by spotting what mistakes the AI has made, and adding a comment that might help it avoid that misunderstanding again.
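If you drive the same parameters from code, a small helper that validates them before sending can catch typos early. The sketch below is hypothetical: the field names match the Python example in the API section of this article, but the lowercase option strings other than "auto", "all" and "standard" are assumptions, so check the API Docs for the exact accepted values.

```python
from typing import List, Optional

# Assumed option strings; only "auto", "all" and "standard" appear in this article's API example.
DISPLAY_MODES = {"auto", "spatial", "sections", "image"}
SPLIT_MODES = {"auto", "all", "never"}
EFFORT_LEVELS = {"standard", "high"}


def build_payload(
    document_ids: List[str],
    schema_id: str,
    guidelines: Optional[str] = None,
    display_mode: str = "auto",
    split_mode: str = "auto",
    effort_level: str = "standard",
) -> dict:
    """Build a standardization request body, rejecting unknown option values."""
    if display_mode not in DISPLAY_MODES:
        raise ValueError(f"Unknown display mode: {display_mode!r}")
    if split_mode not in SPLIT_MODES:
        raise ValueError(f"Unknown split mode: {split_mode!r}")
    if effort_level not in EFFORT_LEVELS:
        raise ValueError(f"Unknown effort level: {effort_level!r}")
    payload = {
        "documentIds": list(document_ids),
        "schemaId": schema_id,
        "displayMode": display_mode,
        "splitMode": split_mode,
        "effortLevel": effort_level,
    }
    if guidelines is not None:
        # Only send guidelines when overriding them for this run.
        payload["guidelines"] = guidelines
    return payload


print(build_payload(["INSERT_DOC_ID_1"], "INSERT_SCHEMA_ID", split_mode="all"))
```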

Step 3: Let us Cook

Hit the big green button, and let DocuPanda cook. You can see the progress in the Jobs tab. Typically it can take from 5 seconds to a minute or more, depending on how large the document is and how demanding the schema is.

Once the job is done, you can view the results in the Standardization tab. You can view the results in the viewer, or download to your computer as a JSON or Excel file. In the viewer, you can expand or collapse the various sections of the JSON.

How to Standardize with the API

To do the same thing via code, you can use our handy API. Most of the heavy lifting on DocuPanda is done this way, calling us at scale. Check out our API Docs for more details. Specifically the endpoint for calling a standardization run is here.

Here is some Python code for calling standardization V2:

import requests

url = "https://app.docupanda.io/v2/standardize/batch"

payload = {
    # Each document ID is its own string in the list
    "documentIds": ["INSERT_DOC_ID_1", "INSERT_DOC_ID_2"],
    "schemaId": "INSERT_SCHEMA_ID",
    "guidelines": "Dates appear in the upper right corner, don't take the one from the bottom left corner!",
    "displayMode": "auto",
    "splitMode": "all",
    "effortLevel": "standard",
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "X-API-Key": "INSERT_API_KEY",
}

# Kick off the standardization job
response = requests.post(url, json=payload, headers=headers)
assert response.status_code == 200
res_json = response.json()
# Keep the job ID so you can check on its progress later
job_id = res_json["jobId"]

Afterwards, you can check on the job status using this endpoint. Here is some Python code for checking on the job ID you get back inside the previous response:

import requests

url = "https://app.docupanda.io/job/INSERT_JOB_ID"

headers = {
"accept": "application/json",
"X-API-Key": "INSERT_API_KEY"
}

response = requests.get(url, headers=headers)
assert response.status_code == 200
res_json = response.json()
if res_json["status"] == "completed":
    print("DocuPanda has finished standardizing!")
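Since a job can take anywhere from a few seconds to over a minute, you will usually want to check the status in a loop rather than once. Here is a minimal polling sketch. The helper name, the timeout values, and the "error" status are assumptions for illustration; only "completed" is taken from the example above. The status lookup is passed in as a function, so you can plug in the GET request shown above.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a final status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        # "error" is an assumed failure status; check the API Docs for the real values.
        if status in ("completed", "error"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Job did not finish before the timeout")
        sleep(poll_seconds)


# Simulated status sequence so the sketch runs without network access.
fake_statuses = iter(["processing", "processing", "completed"])
print(wait_for_job(lambda: next(fake_statuses), poll_seconds=0.0))
```

In real use, fetch_status would wrap the requests.get call above and return res_json["status"].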

When the job status indicates that the standardization has finished, you can retrieve the output using this endpoint. Here is some Python code for fetching the output.

import requests

url = "https://app.docupanda.io/standardization/INSERT_STD_ID"

headers = {
"accept": "application/json",
"X-API-Key": "INSERT_API_KEY"
}

response = requests.get(url, headers=headers)
assert response.status_code == 200
res_json = response.json()
print("Here is the output:")
print(res_json["data"])

And that's it! Now you know what standardization is 😃
