Walkthrough: Using DocuPanda visual review to locate and redact PII

Follow along with a demo where we redact sensitive user information, like tax identifiers and ID numbers, from a PDF using DocuPanda's visual review

Updated over 3 months ago

Overview

This tutorial uses DocuPanda to identify and redact personally identifiable information, also known as PII. Examples of PII include account numbers and social security numbers.

The guide covers the following parts:

  1. How to define a complex schema.

  2. How to improve the schema when results are unsatisfactory.

  3. How to leverage DocuPanda's review endpoint to find the exact coordinates where PII is located, in order to redact it.

We recommend watching the video, which contains all the information in this guide.

Our Plan

Here's how we will accomplish our goal.

  1. Upload a bunch of documents that contain PII. I'm using my own data for this tutorial; you can either use your own or search online for fake PII examples.

  2. Generate a schema where we carefully explain what sort of information we want to find.

  3. Use the DocuPanda review capabilities to retrieve x, y, width, height coordinates that localize exactly where each value lives in the actual document.

  4. Overlay the rectangles on the input PDF.

Once we have the coordinates of each value, overlaying a rectangle to hide it isn't too hard. It can be done with a bit of coding, or by just asking GPT to do it.

Uploading Documents

Uploading documents is as easy as selecting them and hitting the upload button. I've chosen to upload 8 documents from various sources: a picture of my social security card, a few bank statements, a doctor visit and some tax forms.

DocuPanda is HIPAA-compliant and fully encrypted end to end, and subject to the most rigorous terms and conditions - your data will never be shared with any 3rd party under any conditions, so it is a good place to store such highly sensitive information.

Here's one example document. Note I'm manually redacting the PII here - of course it's uploaded to DocuPanda without any redactions.

Defining a Schema

To create a schema, simply go to your documents tab, select all the documents, and hit the "Create Schema" button. This will take you to a wizard where you specify what you want to extract from the documents. In this demo, I've given a straightforward instruction that you can read below - basically just telling DocuPanda's AI to extract any personally identifiable information, along with its value, type, and page number.

We're also giving it an exhaustive list of PII types - card numbers, account numbers, addresses, etc.

Hit submit, and wait until the results are ready.

Examining Results and Improving the Schema

Let's go to the standardizations tab and take a look at the results:

Success! Kind of.

We did in fact recover my personal address and account numbers, which I've redacted manually by overlaying red rectangles on them to protect my privacy. But DocuPanda also recovered Bank of America's phone number, which isn't PII, so that result is a bit overzealous.

The reason this happens is that our schema doesn't include much clarification about the desired behavior. You have multiple tools at your disposal to improve results:

  1. Edit the schema itself, changing definitions, examples, etc. For example we may remove phone numbers from the schema definition, or edit a description field to better explain what sort of phone numbers count as PII.

  2. Keep the current schema but add standardization guidelines. These are instructions that explain what logic we want to apply when standardizing the results into the existing schema.

Let's try the second approach: stick to the schema, but explain more carefully how to map documents onto that schema. Select the schema and hit "edit".

Add some instructions that clarify the desired behavior: only extract information that belongs to the person, do not extract merchants' or banks' phone numbers, etc.

That's it. Now we can run standardize again and see whether the results improve now that we've specified the desired behavior more thoroughly.

Awesome. I've redacted all the results except my full name, which I'm comfortable sharing in spite of it being PII.

Finding Exact Locations of PII and Redacting

DocuPanda has a unique visual review capability, which locates where in the document each standardized finding appears. This is often useful for review purposes when a result is high-stakes. A common use pattern for our users is invoice processing or insurance claims, where it may be necessary to put a human in the loop to verify that a stated dollar amount is correct.

The same capability can be used here, however - not for review, but for redaction.

Let's first see what the review capability looks like.

Launch a visual review

Go to the standardizations tab in your dashboard, mark one or more standardization results (like my bank statement), and hit review.

After about a minute, you will see the review result populate in the review tab.

If you click on the review object, you will find that every result shown on the left is clickable and scrolls to the right part of the document. Again, I've redacted all my private information. We're intentionally showing the first two digits of my account number on the left (highlighted in yellow), and you can see the corresponding value highlighted on the right.

Writing code to redact the PDF Using Review Results (developers only)

To tie this all together, we'll show how to actually redact the PDF.

This consists of two steps:

Get the review object from our API. Here's an example call using Python:

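import requests

# Replace yourReviewID and YOUR_API_KEY with your own values.
url = "https://app.docupanda.io/review?review_id=yourReviewID"

headers = {
    "accept": "application/json",
    "X-API-Key": "YOUR_API_KEY",
}
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop here if the API returned an HTTP error
review = response.json()  # the review object as a Python dict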

This object contains a nested structure like

{
  "piiExtraction": [
    {
      "type": {
        "value": "bankAccountNumber",
        "review": {
          "page": 1,
          "confidence": "high",
          "boundingBox": [0.15, 0.2, 0.02, 0.02]
        }
      }
    }
  ]
}

Of course the real object contains a longer list, but you get the idea - the bank account number extraction is localized to the page and bounding box. Page is self-explanatory. Bounding box is simply the x, y, width, height of the rectangle that contains the bank account number, given as fractions of the total page width and height (so an x of 0.1 means 10% of the page width, for example).
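To make the bounding box arithmetic concrete, here is a minimal sketch that walks a review object shaped like the sample above and converts each fractional box into absolute page coordinates. The helper names iter_findings and to_absolute are hypothetical, and the 612 x 792 page size (US Letter, in PDF points) is only an assumption for illustration. The real review object may nest differently, so treat this as a starting point rather than the official client code.

```python
from typing import Any, Iterator


def iter_findings(node: Any) -> Iterator[dict]:
    """Recursively yield every leaf that carries a "review" with a bounding box."""
    if isinstance(node, dict):
        review = node.get("review")
        if isinstance(review, dict) and "boundingBox" in review:
            yield {
                "value": node.get("value"),
                "page": review["page"],
                "bbox": review["boundingBox"],
            }
        else:
            # Not a leaf: keep descending into nested fields.
            for child in node.values():
                yield from iter_findings(child)
    elif isinstance(node, list):
        for child in node:
            yield from iter_findings(child)


def to_absolute(bbox, page_width, page_height):
    """Convert fractional [x, y, width, height] into absolute (x0, y0, x1, y1)."""
    x, y, w, h = bbox
    return (
        x * page_width,
        y * page_height,
        (x + w) * page_width,
        (y + h) * page_height,
    )


# Sample object copied from the structure shown above.
review_object = {
    "piiExtraction": [
        {
            "type": {
                "value": "bankAccountNumber",
                "review": {
                    "page": 1,
                    "confidence": "high",
                    "boundingBox": [0.15, 0.2, 0.02, 0.02],
                },
            }
        }
    ]
}

findings = list(iter_findings(review_object))
# Assumed page size: US Letter, 612 x 792 points.
rect = to_absolute(findings[0]["bbox"], 612, 792)
print(findings[0]["value"], "on page", findings[0]["page"], "->", rect)
```

Working in fractions means the same box can be applied to a page rendered at any resolution: multiply by the page size in points for a PDF, or by the pixel dimensions for an image.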

Opening a PDF, overlaying a black rectangle on it, and saving it to disk isn't too hard. We could figure it out but it's 2024, so we'll just ask GPT. Here's how you can generate a complete working example just by prompting ChatGPT.

If you're a developer and interested in seeing a complete example where we integrate with the GET review API endpoint and redact a PDF end to end, here's the full code. Also here's a short video that shows how to develop this code end to end.
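For readers who want a feel for the redaction step itself, here is a minimal sketch using PyMuPDF, a third-party library that this guide does not otherwise depend on (install it with pip install pymupdf). It is not the full code referenced above. The function names, the input format (a list of dicts with "page" and "bbox" keys), and the file names are hypothetical. The sketch also assumes that page numbers in the review object start at 1 and that box fractions are measured from the top-left corner of the page; verify both against a real review result.

```python
from collections import defaultdict


def group_boxes_by_page(findings):
    """Map a 0-based page index to the list of fractional [x, y, w, h] boxes on it."""
    pages = defaultdict(list)
    for finding in findings:
        # Assumption: the review object's page numbers start at 1.
        pages[finding["page"] - 1].append(finding["bbox"])
    return dict(pages)


def redact_pdf(in_path, out_path, findings):
    """Black out every finding and remove the content underneath it."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    doc = fitz.open(in_path)
    for page_index, boxes in group_boxes_by_page(findings).items():
        page = doc[page_index]
        page_w, page_h = page.rect.width, page.rect.height
        for x, y, w, h in boxes:
            # Assumption: fractions are measured from the top-left corner.
            rect = fitz.Rect(
                x * page_w, y * page_h, (x + w) * page_w, (y + h) * page_h
            )
            page.add_redact_annot(rect, fill=(0, 0, 0))
        # Applying the redactions deletes the covered content,
        # instead of only painting a rectangle over it.
        page.apply_redactions()
    doc.save(out_path)
    doc.close()


# Example usage with hypothetical file names:
# redact_pdf(
#     "statement.pdf",
#     "statement_redacted.pdf",
#     [{"page": 1, "bbox": [0.15, 0.2, 0.02, 0.02]}],
# )
```

The sketch uses redaction annotations rather than plain drawn rectangles because a rectangle painted on top of a page leaves the original text in the file, where it can still be selected or extracted.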


​
