Extracting HOA Transaction Data from a PDF using Indexify¶

In this notebook, we're going to learn how we can extract transactional data from a PDF using Indexify. For that, we'll be using a sample PDF that contains transactional data from a Home Owners Association (HOA).

We will explore several way to extract this data from the PDF using Indexify Extractor into a structured format that we can use further for RAG pipeline for example.

Preview data

Install the Indexify Extractor SDK, Langchain Retriever and the Indexify Client¶

In [2]:

Copied!

%%capture
!pip install indexify-extractor-sdk indexify virtualenv
%%capture
!pip install indexify-extractor-sdk indexify virtualenv

Trying out different extractors offered¶

We have several PDF and Invoice Extractor. Here are a few extractors that worked really well to get various fields from my HOA receipt.

First, get a taste of playing with these extractors locally.

PDFExtractor & SchemaExtractor¶

First, we will try PDFExtractor with SchemaExtractor. By default the SchemaExtractor uses OpenAI and works on the Content of chained extractor as data for JSON extraction from schema, however we can manually overwrite both the schema and the data. It can extract all the values from text in one shot.

Download the PDF extractor and Schema extractor:

In [ ]:

Copied!

!indexify-extractor download hub://pdf/pdf-extractor
!indexify-extractor download hub://text/schema
!indexify-extractor download hub://pdf/pdf-extractor
!indexify-extractor download hub://text/schema

Load the PDF extractor and the file:

In [1]:

Copied!

import requests
req = requests.get("https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/Statement_HOA.pdf")

with open('Statement_HOA.pdf','wb') as f:
    f.write(req.content)

from indexify_extractor_sdk import load_extractor, Content

pdfextractor, pdfconfig_cls = load_extractor("pdf-extractor.pdf_extractor:PDFExtractor")
content = Content.from_file("Statement_HOA.pdf")
import requests
req = requests.get("https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/Statement_HOA.pdf")

with open('Statement_HOA.pdf','wb') as f:
    f.write(req.content)

from indexify_extractor_sdk import load_extractor, Content

pdfextractor, pdfconfig_cls = load_extractor("pdf-extractor.pdf_extractor:PDFExtractor")
content = Content.from_file("Statement_HOA.pdf")

/Users/rishiraj/miniconda3/envs/tensorlake/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Extract the data and find the content with content_type 'text/plain'::

In [21]:

Copied!

pdf_result = pdfextractor.extract(content)
text_content = next(content.data.decode('utf-8') for content in pdf_result if content.content_type == 'text/plain')
text_content
pdf_result = pdfextractor.extract(content)
text_content = next(content.data.decode('utf-8') for content in pdf_result if content.content_type == 'text/plain')
text_content

100%|██████████| 13/13 [00:02<00:00,  5.51it/s]

Max number of columns: 6

Out[21]:

'Axis\nSTATEMENT\nAccount Number:\nOwner:\nProperty:SUMMAR\nY OF ACCOUNT\nLast Month Balance:\nCurrent Amount Due:$653.03\n$653.03\nAccount details on back.\nProfessionally\nprepared by:\nSTATEMENT MESSAGE\nWelcome to Action Property Management! We are excited to be\nserving your community. Our Community Care team is more than\nhappy to assist you with any billing questions you may have. For\ncontact options, please visit www.actionlife.com/contact. Visit the\nAction Property Management web page at: www.actionlife.com.BILLING QUESTIONS\nScan the QR code to\ncontact our\nCommunity Care\nteam.\nactionlife.com/contact\nCommunityCare@actionlife.com\nTo learn more about issues facing HOAs, say "Hey Siri, search the web for The Uncommon Area by Action Property Management."\nMake checks payable to:\nAxis\nPLEA\nSE REMIT PAYMENT TO:\n** AUTOPAY SCHEDULED **\n** NO REMITTANCE NECESSARY **CURRENT AMOUNT DUE\n$653.03\nDUE DATE\n5/1/202492287435600\nJohn Doe\n200 Park Avenue, \nManhattan, New York 10166\nAccount Number:  92287435600\nOwner:  John DoeInvoice Number:  918273-64500\nDate:  4/19/2024\nRegistration Key/ID:\nFLOWR2U'

Load the Schema extractor and extract the JSON using the Schema extractor:

In [ ]:

Copied!





from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    date: str
    account_number: str
    owner: str
    address: str
    last_month_balance: str
    current_amount_due: str
    registration_key: str
    due_date: str
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    date: str
    account_number: str
    owner: str
    address: str
    last_month_balance: str
    current_amount_due: str
    registration_key: str
    due_date: str

In [12]:

Copied!

schema = Invoice.model_json_schema()
schema = Invoice.model_json_schema()

In [13]:

Copied!





schemaextractor, schemaconfig_cls = load_extractor("schema.schema_extractor:SchemaExtractor")

config = schemaconfig_cls(service="openai", schema=schema)
result = schemaextractor.extract(Content.from_text(text_content), config)
llm_content = next(content.data.decode('utf-8') for content in result if content.content_type == 'text/plain')
llm_content
schemaextractor, schemaconfig_cls = load_extractor("schema.schema_extractor:SchemaExtractor")

config = schemaconfig_cls(service="openai", schema=schema)
result = schemaextractor.extract(Content.from_text(text_content), config)
llm_content = next(content.data.decode('utf-8') for content in result if content.content_type == 'text/plain')
llm_content

Out[13]:

'{\n  "invoice_number": "918273-64500",\n  "date": "4/19/2024",\n  "account_number": "92287435600",\n  "owner": "John Doe",\n  "address": "200 Park Avenue, Manhattan, New York 10166",\n  "last_month_balance": "$653.03",\n  "current_amount_due": "$653.03",\n  "registration_key": "FLOWR2U",\n  "due_date": "5/1/2024"\n}'

PDFExtractor & LLMExtractor¶

Next, for more control, we will try PDFExtractor with LLMExtractor. The PDFExtractor can extract all the values from text as well as tables in one shot and passes it to the chained LLMExtractor which can be used for question answering.

Download the LLM extractor:

In [ ]:

Copied!

!indexify-extractor download hub://text/llm
!indexify-extractor download hub://text/llm

Load the LLM extractor and extract the JSON using the LLM extractor:

In [18]:

Copied!

query = "by when do I have to make the payment and what amount? also what was the EV charge amount?"
prompt = """Extract information according to this schema and return json in this format {"Invoice No.": "", "Date": "", "Account Number": "", "Owner": "", "Address": "", "Registration Key": "", "Last Month Balance": "", "Current Amount Due": "", "Due Date": ""}:
Axis\nSTATEMENTInvoice No. "Invoice No."\nDate: 4/19/2024\nAccount Number:\nOwner:\nProperty:"Account Number"\n"Owner"\n"Property"\n"Owner"\n"Property"\n"Address"SUMMARY OF ACCOUNT\nLast Month Balance:\nCurrent Amount Due:"Last Month Balance"\n"Current Amount Due"\nAccount details on back.\nProfessionally\nprepared by:\nSTATEMENT MESSAGE\nWelcome to Action Property Management! We are excited to be\nserving your community. Our Community Care team is more than\nhappy to assist you with any billing questions you may have. For\ncontact options, please visit www.actionlife.com/contact. Visit the\nAction Property Management web page at: www.actionlife.com.BILLING QUESTIONS\nScan the QR code to\ncontact our\nCommunity Care\nteam.\nactionlife.com/contact\nCommunityCare@actionlife.com\nRegister your Resident\nPortal account now!\nRegistration Key/ID:\n"Registration Key"\nresident.actionlife.com\nTo learn more about issues facing HOAs, say "Hey Siri, search the web for The Uncommon Area by Action Property Management."\nMake checks payable to:\nAxisAccount Number: "Account Number"\nOwner: "Owner"\nPLEASE REMIT PAYMENT TO:\n** AUTOPAY SCHEDULED **\n** NO REMITTANCE NECESSARY **CURRENT AMOUNT DUE\n"Current Amount Due"\nDUE DATE\n"Due Date"\n0049 00008330 0000922000203826 7 00065303 00000000 9"""
query = "by when do I have to make the payment and what amount? also what was the EV charge amount?"
prompt = """Extract information according to this schema and return json in this format {"Invoice No.": "", "Date": "", "Account Number": "", "Owner": "", "Address": "", "Registration Key": "", "Last Month Balance": "", "Current Amount Due": "", "Due Date": ""}:
Axis\nSTATEMENTInvoice No. "Invoice No."\nDate: 4/19/2024\nAccount Number:\nOwner:\nProperty:"Account Number"\n"Owner"\n"Property"\n"Owner"\n"Property"\n"Address"SUMMARY OF ACCOUNT\nLast Month Balance:\nCurrent Amount Due:"Last Month Balance"\n"Current Amount Due"\nAccount details on back.\nProfessionally\nprepared by:\nSTATEMENT MESSAGE\nWelcome to Action Property Management! We are excited to be\nserving your community. Our Community Care team is more than\nhappy to assist you with any billing questions you may have. For\ncontact options, please visit www.actionlife.com/contact. Visit the\nAction Property Management web page at: www.actionlife.com.BILLING QUESTIONS\nScan the QR code to\ncontact our\nCommunity Care\nteam.\nactionlife.com/contact\nCommunityCare@actionlife.com\nRegister your Resident\nPortal account now!\nRegistration Key/ID:\n"Registration Key"\nresident.actionlife.com\nTo learn more about issues facing HOAs, say "Hey Siri, search the web for The Uncommon Area by Action Property Management."\nMake checks payable to:\nAxisAccount Number: "Account Number"\nOwner: "Owner"\nPLEASE REMIT PAYMENT TO:\n** AUTOPAY SCHEDULED **\n** NO REMITTANCE NECESSARY **CURRENT AMOUNT DUE\n"Current Amount Due"\nDUE DATE\n"Due Date"\n0049 00008330 0000922000203826 7 00065303 00000000 9"""

In [19]:

Copied!





llmextractor, llmconfig_cls = load_extractor("llm.llm_extractor:LLMExtractor")

config = llmconfig_cls(service="openai", prompt=prompt)
result = llmextractor.extract(Content.from_text(text_content), config)
llm_content = next(content.data.decode('utf-8') for content in result if content.content_type == 'text/plain')
llm_content
llmextractor, llmconfig_cls = load_extractor("llm.llm_extractor:LLMExtractor")

config = llmconfig_cls(service="openai", prompt=prompt)
result = llmextractor.extract(Content.from_text(text_content), config)
llm_content = next(content.data.decode('utf-8') for content in result if content.content_type == 'text/plain')
llm_content

Out[19]:

'{\n    "Invoice No.": "918273-64500",\n    "Date": "4/19/2024",\n    "Account Number": "92287435600",\n    "Owner": "John Doe",\n    "Address": "200 Park Avenue, Manhattan, New York 10166",\n    "Registration Key": "FLOWR2U",\n    "Last Month Balance": "$653.03",\n    "Current Amount Due": "$653.03",\n    "Due Date": "5/1/2024"\n}'

Table Extraction¶

The document also has tables in it so let's find the data from tables with content_type 'application/json' and get it in a dataframe:

In [22]:

Copied!





import json
import pandas as pd

json_content = next(content.data for content in pdf_result if content.content_type == 'application/json')

# Convert the JSON string to a Python dictionary
data_dict = json.loads(json_content)

# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame.from_dict(data_dict, orient='index')

# Print the DataFrame
print(df)
import json
import pandas as pd

json_content = next(content.data for content in pdf_result if content.content_type == 'application/json')

# Convert the JSON string to a Python dictionary
data_dict = json.loads(json_content)

# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame.from_dict(data_dict, orient='index')

# Print the DataFrame
print(df)

             0                           1                                2  \
0         Date                 Description                       Assessment   
1   02/01/2024                      Charge                 Storage Fee 2024   
2   02/01/2024   Monthly Assessment- A12H2  Monthly Assessment- A12 H2 2024   
3   02/06/2024                      Charge               EV Charge Fee 2024   
4   02/11/2024                      eCheck                           eCheck   
5   03/01/2024  Monthly Assessment- A12 H2  Monthly Assessment- A12 H2 2024   
6   03/01/2024                      Charge                  StorageFee 2024   
7   03/11/2024                      eCheck                           eCheck   
8   04/01/2024                      Charge                 Storage Fee 2024   
9   04/01/2024  Monthly Assessment- A12 H2  Monthly Assessment- A12 H2 2024   
10  04/11/2024                      eCheck                           eCheck   
11  05/01/2024  Monthly Assessment- A12 H2  Monthly Assessment- A12 H2 2024   
12  05/01/2024                      Charge                  StorageFee 2024   

         3           4          5  
0     Type      Amount    Balance  
1   Charge      $50.00    $100.00  
2   Charge     s603.03    $703.03  
3   Charge     8599.55  $1,302.58  
4   eCheck  $-1,302.58      S0.00  
5   Charge     S603.03    $603.03  
6   Charge      $50.00    $653.03  
7   eCheck    $-653.03      So.00  
8   Charge      S50.00     S50.00  
9   Charge     S603.03    $653.03  
10  eCheck    $-653.03      So.0o  
11  Charge     $603.03    $603.03  
12  Charge      s50.00    S653.03

Question answering with extracted content:

In [25]:

Copied!





config = llmconfig_cls(service="openai", prompt=str(data_dict) + str(llm_content))
result = llmextractor.extract(Content.from_text(query), config)
llm_content = next(content.data.decode('utf-8') for content in result if content.content_type == 'text/plain')
llm_content
config = llmconfig_cls(service="openai", prompt=str(data_dict) + str(llm_content))
result = llmextractor.extract(Content.from_text(query), config)
llm_content = next(content.data.decode('utf-8') for content in result if content.content_type == 'text/plain')
llm_content

Out[25]:

'You have to make the payment by May 1, 2024, and the amount due is $653.03. The EV charge amount was $8599.55.'

LayoutLMDocumentQA¶

Next we try LayoutLMDocumentQA. It can't extract all the values in one shot, but can answer to single questions.

Download the extractor:

In [ ]:

Copied!

!indexify-extractor download hub://pdf/layoutlm_document_qa
!indexify-extractor download hub://pdf/layoutlm_document_qa

Load the extractor and the file:

In [17]:

Copied!

from indexify_extractor_sdk import load_extractor, Content
extractor, config_cls = load_extractor("layoutlm_document_qa.layoutlm_document_qa:LayoutLMDocumentQA")
from indexify_extractor_sdk import load_extractor, Content
extractor, config_cls = load_extractor("layoutlm_document_qa.layoutlm_document_qa:LayoutLMDocumentQA")

Ask question to the extractor:

In [9]:

Copied!

config = config_cls(query="What's the due date?")
result = extractor.extract(content, config)
result
config = config_cls(query="What's the due date?")
result = extractor.extract(content, config)
result

Out[9]:

[Feature(feature_type='metadata', name='metadata', value={'query': "What's the due date?", 'answer': '5/1/2024', 'page': 0, 'score': 0.9999791383743286}, comment=None)]

Start the Indexify Server¶

To make this extractor continously extract -

Download the Indexify Server
Start it in development mode on your laptop
Create extraction policies with questions that extracts the fields from the PDF
Finally, you can get all the extracted value for a document by making an API call

Download the Server¶

curl https://tensorlake.ai | sh

In [ ]:

Copied!

!./indexify server -d
!./indexify server -d

Create the Extraction Policies¶

In [2]:

Copied!

from indexify import IndexifyClient
client = IndexifyClient()
from indexify import IndexifyClient
client = IndexifyClient()

In [3]:

Copied!

client.add_extraction_policy(extractor='tensorlake/layoutlm-document-qa-extractor', name="hoa-fees-due-date", input_params={"query": "Whats the due date?"})
client.add_extraction_policy(extractor='tensorlake/layoutlm-document-qa-extractor', name="hoa-fees-outstanding", input_params={"query": "Whats the outstanding amount?"})
client.add_extraction_policy(extractor='tensorlake/layoutlm-document-qa-extractor', name="hoa-fees-due-date", input_params={"query": "Whats the due date?"})
client.add_extraction_policy(extractor='tensorlake/layoutlm-document-qa-extractor', name="hoa-fees-outstanding", input_params={"query": "Whats the outstanding amount?"})

Upload Files¶

In [4]:

Copied!

content_id = client.upload_file("Statement_HOA.pdf")
content_id = client.upload_file("Statement_HOA.pdf")

In [6]:

Copied!

client.get_structured_data(content_id)
client.get_structured_data(content_id)

Out[6]:

[{'id': '3Ie8VXVxfNTPAL5L',
  'content_id': 'efcf0931508836d3',
  'metadata': {'answer': '5/1/2024',
   'page': 0,
   'query': 'Whats the due date?',
   'score': 0.9999799728393556},
  'extractor_name': 'tensorlake/layoutlm-document-qa-extractor'},
 {'id': 'VmCTqMFR-m7IG0nn',
  'content_id': 'efcf0931508836d3',
  'metadata': {'answer': '$603.03',
   'page': 1,
   'query': 'Whats the outstanding amount?',
   'score': 0.9992976188659668},
  'extractor_name': 'tensorlake/layoutlm-document-qa-extractor'}]

In [36]:

Copied!

client.sql_query("select * from ingestion;")
client.sql_query("select * from ingestion;")

Out[36]:

SqlQueryResult(result=[{'answer': '$603.03', 'content_id': 'd8ec685dd9cc3505', 'page': 1, 'query': 'Whats the outstanding amount?', 'score': 0.9992976188659668}, {'answer': '5/1/2024', 'content_id': 'd8ec685dd9cc3505', 'page': 0, 'query': 'Whats the due date?', 'score': 0.9999799728393556}])