Extracting HOA Transaction Data from a PDF using Indexify¶
In this notebook, we're going to learn how we can extract transactional data from a PDF using Indexify. For that, we'll be using a sample PDF that contains transactional data from a Home Owners Association (HOA).
We will explore several way to extract this data from the PDF using Indexify Extractor into a structured format that we can use further for RAG pipeline for example.
Install the Indexify Extractor SDK, Langchain Retriever and the Indexify Client¶
%%capture
!pip install indexify-extractor-sdk indexify virtualenv
Trying out different extractors offered¶
We have several PDF and Invoice Extractor. Here are a few extractors that worked really well to get various fields from my HOA receipt.
First, get a taste of playing with these extractors locally.
PDFExtractor & SchemaExtractor¶
First, we will try PDFExtractor with SchemaExtractor. By default the SchemaExtractor uses OpenAI and works on the Content of chained extractor as data for JSON extraction from schema, however we can manually overwrite both the schema and the data. It can extract all the values from text in one shot.
Download the PDF extractor and Schema extractor:
!indexify-extractor download hub://pdf/pdf-extractor
!indexify-extractor download hub://text/schema
Load the PDF extractor and the file:
import requests
req = requests.get("https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/Statement_HOA.pdf")
with open('Statement_HOA.pdf','wb') as f:
f.write(req.content)
from indexify_extractor_sdk import load_extractor, Content
pdfextractor, pdfconfig_cls = load_extractor("pdf-extractor.pdf_extractor:PDFExtractor")
content = Content.from_file("Statement_HOA.pdf")
/Users/rishiraj/miniconda3/envs/tensorlake/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
Extract the data and find the content with content_type 'text/plain'::
pdf_result = pdfextractor.extract(content)
text_content = next(content.data.decode('utf-8') for content in pdf_result if content.content_type == 'text/plain')
text_content
100%|██████████| 13/13 [00:02<00:00, 5.51it/s]
Max number of columns: 6
'Axis\nSTATEMENT\nAccount Number:\nOwner:\nProperty:SUMMAR\nY OF ACCOUNT\nLast Month Balance:\nCurrent Amount Due:$653.03\n$653.03\nAccount details on back.\nProfessionally\nprepared by:\nSTATEMENT MESSAGE\nWelcome to Action Property Management! We are excited to be\nserving your community. Our Community Care team is more than\nhappy to assist you with any billing questions you may have. For\ncontact options, please visit www.actionlife.com/contact. Visit the\nAction Property Management web page at: www.actionlife.com.BILLING QUESTIONS\nScan the QR code to\ncontact our\nCommunity Care\nteam.\nactionlife.com/contact\nCommunityCare@actionlife.com\nTo learn more about issues facing HOAs, say "Hey Siri, search the web for The Uncommon Area by Action Property Management."\nMake checks payable to:\nAxis\nPLEA\nSE REMIT PAYMENT TO:\n** AUTOPAY SCHEDULED **\n** NO REMITTANCE NECESSARY **CURRENT AMOUNT DUE\n$653.03\nDUE DATE\n5/1/202492287435600\nJohn Doe\n200 Park Avenue, \nManhattan, New York 10166\nAccount Number: 92287435600\nOwner: John DoeInvoice Number: 918273-64500\nDate: 4/19/2024\nRegistration Key/ID:\nFLOWR2U'
Load the Schema extractor and extract the JSON using the Schema extractor:
from pydantic import BaseModel
class Invoice(BaseModel):
invoice_number: str
date: str
account_number: str
owner: str
address: str
last_month_balance: str
current_amount_due: str
registration_key: str
due_date: str
schema = Invoice.model_json_schema()
schemaextractor, schemaconfig_cls = load_extractor("schema.schema_extractor:SchemaExtractor")
config = schemaconfig_cls(service="openai", schema=schema)
result = schemaextractor.extract(Content.from_text(text_content), config)
llm_content = next(content.data.decode('utf-8') for content in result if content.content_type == 'text/plain')
llm_content
'{\n "invoice_number": "918273-64500",\n "date": "4/19/2024",\n "account_number": "92287435600",\n "owner": "John Doe",\n "address": "200 Park Avenue, Manhattan, New York 10166",\n "last_month_balance": "$653.03",\n "current_amount_due": "$653.03",\n "registration_key": "FLOWR2U",\n "due_date": "5/1/2024"\n}'
PDFExtractor & LLMExtractor¶
Next, for more control, we will try PDFExtractor with LLMExtractor. The PDFExtractor can extract all the values from text as well as tables in one shot and passes it to the chained LLMExtractor which can be used for question answering.
Download the LLM extractor:
!indexify-extractor download hub://text/llm
Load the LLM extractor and extract the JSON using the LLM extractor:
query = "by when do I have to make the payment and what amount? also what was the EV charge amount?"
prompt = """Extract information according to this schema and return json in this format {"Invoice No.": "", "Date": "", "Account Number": "", "Owner": "", "Address": "", "Registration Key": "", "Last Month Balance": "", "Current Amount Due": "", "Due Date": ""}:
Axis\nSTATEMENTInvoice No. "Invoice No."\nDate: 4/19/2024\nAccount Number:\nOwner:\nProperty:"Account Number"\n"Owner"\n"Property"\n"Owner"\n"Property"\n"Address"SUMMARY OF ACCOUNT\nLast Month Balance:\nCurrent Amount Due:"Last Month Balance"\n"Current Amount Due"\nAccount details on back.\nProfessionally\nprepared by:\nSTATEMENT MESSAGE\nWelcome to Action Property Management! We are excited to be\nserving your community. Our Community Care team is more than\nhappy to assist you with any billing questions you may have. For\ncontact options, please visit www.actionlife.com/contact. Visit the\nAction Property Management web page at: www.actionlife.com.BILLING QUESTIONS\nScan the QR code to\ncontact our\nCommunity Care\nteam.\nactionlife.com/contact\nCommunityCare@actionlife.com\nRegister your Resident\nPortal account now!\nRegistration Key/ID:\n"Registration Key"\nresident.actionlife.com\nTo learn more about issues facing HOAs, say "Hey Siri, search the web for The Uncommon Area by Action Property Management."\nMake checks payable to:\nAxisAccount Number: "Account Number"\nOwner: "Owner"\nPLEASE REMIT PAYMENT TO:\n** AUTOPAY SCHEDULED **\n** NO REMITTANCE NECESSARY **CURRENT AMOUNT DUE\n"Current Amount Due"\nDUE DATE\n"Due Date"\n0049 00008330 0000922000203826 7 00065303 00000000 9"""
llmextractor, llmconfig_cls = load_extractor("llm.llm_extractor:LLMExtractor")
config = llmconfig_cls(service="openai", prompt=prompt)
result = llmextractor.extract(Content.from_text(text_content), config)
llm_content = next(content.data.decode('utf-8') for content in result if content.content_type == 'text/plain')
llm_content
'{\n "Invoice No.": "918273-64500",\n "Date": "4/19/2024",\n "Account Number": "92287435600",\n "Owner": "John Doe",\n "Address": "200 Park Avenue, Manhattan, New York 10166",\n "Registration Key": "FLOWR2U",\n "Last Month Balance": "$653.03",\n "Current Amount Due": "$653.03",\n "Due Date": "5/1/2024"\n}'
Table Extraction¶
The document also has tables in it so let's find the data from tables with content_type 'application/json' and get it in a dataframe:
import json
import pandas as pd
json_content = next(content.data for content in pdf_result if content.content_type == 'application/json')
# Convert the JSON string to a Python dictionary
data_dict = json.loads(json_content)
# Convert the dictionary to a pandas DataFrame
df = pd.DataFrame.from_dict(data_dict, orient='index')
# Print the DataFrame
print(df)
0 1 2 \ 0 Date Description Assessment 1 02/01/2024 Charge Storage Fee 2024 2 02/01/2024 Monthly Assessment- A12H2 Monthly Assessment- A12 H2 2024 3 02/06/2024 Charge EV Charge Fee 2024 4 02/11/2024 eCheck eCheck 5 03/01/2024 Monthly Assessment- A12 H2 Monthly Assessment- A12 H2 2024 6 03/01/2024 Charge StorageFee 2024 7 03/11/2024 eCheck eCheck 8 04/01/2024 Charge Storage Fee 2024 9 04/01/2024 Monthly Assessment- A12 H2 Monthly Assessment- A12 H2 2024 10 04/11/2024 eCheck eCheck 11 05/01/2024 Monthly Assessment- A12 H2 Monthly Assessment- A12 H2 2024 12 05/01/2024 Charge StorageFee 2024 3 4 5 0 Type Amount Balance 1 Charge $50.00 $100.00 2 Charge s603.03 $703.03 3 Charge 8599.55 $1,302.58 4 eCheck $-1,302.58 S0.00 5 Charge S603.03 $603.03 6 Charge $50.00 $653.03 7 eCheck $-653.03 So.00 8 Charge S50.00 S50.00 9 Charge S603.03 $653.03 10 eCheck $-653.03 So.0o 11 Charge $603.03 $603.03 12 Charge s50.00 S653.03
Question answering with extracted content:
config = llmconfig_cls(service="openai", prompt=str(data_dict) + str(llm_content))
result = llmextractor.extract(Content.from_text(query), config)
llm_content = next(content.data.decode('utf-8') for content in result if content.content_type == 'text/plain')
llm_content
'You have to make the payment by May 1, 2024, and the amount due is $653.03. The EV charge amount was $8599.55.'
LayoutLMDocumentQA¶
Next we try LayoutLMDocumentQA. It can't extract all the values in one shot, but can answer to single questions.
Download the extractor:
!indexify-extractor download hub://pdf/layoutlm_document_qa
Load the extractor and the file:
from indexify_extractor_sdk import load_extractor, Content
extractor, config_cls = load_extractor("layoutlm_document_qa.layoutlm_document_qa:LayoutLMDocumentQA")
Ask question to the extractor:
config = config_cls(query="What's the due date?")
result = extractor.extract(content, config)
result
[Feature(feature_type='metadata', name='metadata', value={'query': "What's the due date?", 'answer': '5/1/2024', 'page': 0, 'score': 0.9999791383743286}, comment=None)]
Start the Indexify Server¶
To make this extractor continously extract -
- Download the Indexify Server
- Start it in development mode on your laptop
- Create extraction policies with questions that extracts the fields from the PDF
- Finally, you can get all the extracted value for a document by making an API call
Download the Server¶
curl https://tensorlake.ai | sh
!./indexify server -d
Create the Extraction Policies¶
from indexify import IndexifyClient
client = IndexifyClient()
client.add_extraction_policy(extractor='tensorlake/layoutlm-document-qa-extractor', name="hoa-fees-due-date", input_params={"query": "Whats the due date?"})
client.add_extraction_policy(extractor='tensorlake/layoutlm-document-qa-extractor', name="hoa-fees-outstanding", input_params={"query": "Whats the outstanding amount?"})
Upload Files¶
content_id = client.upload_file("Statement_HOA.pdf")
client.get_structured_data(content_id)
[{'id': '3Ie8VXVxfNTPAL5L', 'content_id': 'efcf0931508836d3', 'metadata': {'answer': '5/1/2024', 'page': 0, 'query': 'Whats the due date?', 'score': 0.9999799728393556}, 'extractor_name': 'tensorlake/layoutlm-document-qa-extractor'}, {'id': 'VmCTqMFR-m7IG0nn', 'content_id': 'efcf0931508836d3', 'metadata': {'answer': '$603.03', 'page': 1, 'query': 'Whats the outstanding amount?', 'score': 0.9992976188659668}, 'extractor_name': 'tensorlake/layoutlm-document-qa-extractor'}]
client.sql_query("select * from ingestion;")
SqlQueryResult(result=[{'answer': '$603.03', 'content_id': 'd8ec685dd9cc3505', 'page': 1, 'query': 'Whats the outstanding amount?', 'score': 0.9992976188659668}, {'answer': '5/1/2024', 'content_id': 'd8ec685dd9cc3505', 'page': 0, 'query': 'Whats the due date?', 'score': 0.9999799728393556}])