SEC Form 10-K Documents
Install the Indexify Extractor SDK, Langchain Retriever and the Indexify Client¶
%%capture
!pip install indexify-extractor-sdk indexify-langchain indexify
Start the Indexify Server¶
!./indexify server -d
Download an Embedding Extractor¶
In another terminal, we'll download and start the embedding extractor, which we'll use to index text from the Form 10-K PDF document.
!indexify-extractor download hub://embedding/minilm-l6
!indexify-extractor join-server
Download a Chunking Extractor¶
In another terminal, we'll download and start the chunking extractor, which will split the extracted text into chunks for embedding.
!indexify-extractor download hub://text/chunking
!indexify-extractor join-server
Download the PDF Extractor¶
In another terminal, we'll install the necessary dependencies and start the PDF extractor, which we'll use to get text, bytes, or JSON out of Form 10-K PDF documents.
Install Poppler on your machine
!sudo apt-get install -y poppler-utils
Download and start the PDF extractor
!indexify-extractor download hub://pdf/pdf-extractor
!indexify-extractor join-server
Test the extractors¶
We will try the PDFExtractor first. The PDFExtractor extracts values from both text and tables in one shot and passes them to the next chained extractors, which can be used for question answering.
from indexify_extractor_sdk import load_extractor, Content
pdfextractor, pdfconfig_cls = load_extractor("pdf-extractor.pdf_extractor:PDFExtractor")
content = Content.from_file("uber-20231231.pdf")
pdf_result = pdfextractor.extract(content)
text_content = next(item.data.decode('utf-8') for item in pdf_result if item.content_type == 'text/plain')
text_content
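The next(...) expression above pulls the first plain-text payload out of the extractor's mixed output. A minimal stdlib-only sketch of the same selection pattern, with a default to avoid StopIteration when no match exists (the Item class and sample data here are hypothetical stand-ins for the SDK's Content objects):

```python
from dataclasses import dataclass

@dataclass
class Item:
    content_type: str
    data: bytes

# Hypothetical sample mimicking a PDF extractor's mixed output
results = [
    Item("image/png", b"\x89PNG..."),
    Item("text/plain", "FORM 10-K".encode("utf-8")),
    Item("application/json", b"{}"),
]

# Pull the first plain-text payload; None if no text item is present
text = next(
    (item.data.decode("utf-8") for item in results if item.content_type == "text/plain"),
    None,
)
print(text)  # FORM 10-K
```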
[tqdm progress-bar output from page and table extraction omitted]
'UNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C.\xa020549\n____________________________________________\xa0\nFORM 10-K\n____________________________________________\xa0\n(Mark One)\n☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d)\xa0OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the fiscal year ended December 31, 2023\nOR\n☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d)\xa0OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the transition period from_____ to _____\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\nCommission File Number:\xa0001-38902\n____________________________________________\xa0\nUBER TECHNOLOGIES, INC.\n(Exact name of registrant as specified in its charter)\n____________________________________________\xa0\nDelaware\n45-2647441\n(State or other jurisdiction of incorporation or organization)\n(I.R.S. Employer Identification No.)\n1725 3rd Street\nSan Francisco, California 94158\n(Address of principal executive offices, including zip code)\n(415)\xa0612-8582\n(Registrant’s telephone number, including area code)\n\xa0____________________________________________\nSecurities registered pursuant to Section 12(b) of the Act:\nTitle\xa0of\xa0each\xa0class\nTrading Symbol(s)\nName\xa0of\xa0each\xa0exchange\non\xa0which\xa0registered\nCommon Stock, par value $0.00001 per share\nUBER\nNew York Stock Exchange\nSecurities registered pursuant to Section 12(g) of the Act:\xa0None\nIndicate by check mark whether the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.\xa0Yes\xa0 ☒ No ☐\nIndicate by check mark whether the registrant\xa0is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. 
Yes\xa0 ☐ No\xa0\n☒\nIndicate by check mark whether the registrant\xa0 (1)\xa0 has filed all reports required to be filed by Section\xa0 13 or 15(d) of the Securities\nExchange Act of 1934 during the preceding 12\xa0months (or for such shorter period that the registrant was required to file such reports),\nand\xa0(2)\xa0has been subject to such filing requirements for the past 90\xa0days. Yes\xa0\xa0☒ No ☐\nIndicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant\nto Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant\nwas required to submit such files). Yes\xa0\xa0☒ No ☐\nIndicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting\ncompany, or an emerging growth company. See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting\ncompany,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.\n'
Create Extraction Policies¶
Instantiate the Indexify Client
from indexify import IndexifyClient
client = IndexifyClient()
Extraction Graph Setup¶
Import the ExtractionGraph class from the indexify package.

Define the extraction graph specification in YAML format:
- Set the name of the extraction graph to "pdfqa".
- Define the extraction policies:
  - Use the "tensorlake/pdf-extractor" extractor for PDF parsing and name it "pdf-extraction".
  - Use the "tensorlake/chunk-extractor" for text chunking and name it "chunks".
    - Set the input parameters for the chunker:
      - chunk_size: 1000 (size of each text chunk)
      - overlap: 100 (overlap between chunks)
      - content_source: "pdf-extraction" (source of content for chunking)
  - Use the "tensorlake/minilm-l6" extractor for embedding, name it "get-embeddings", and set its content source to "chunks".

Create an ExtractionGraph object from the YAML specification using ExtractionGraph.from_yaml().

Create the extraction graph on the Indexify client using client.create_extraction_graph().
from indexify import ExtractionGraph
extraction_graph_spec = """
name: 'pdfqa'
extraction_policies:
- extractor: 'tensorlake/pdf-extractor'
name: 'pdf-extraction'
- extractor: 'tensorlake/chunk-extractor'
name: 'chunks'
input_params:
chunk_size: 1000
overlap: 100
content_source: 'pdf-extraction'
- extractor: 'tensorlake/minilm-l6'
name: 'get-embeddings'
content_source: 'chunks'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
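To make the chunker's chunk_size and overlap parameters concrete, here is a fixed-size chunking sketch. This is an illustration only, not the actual tensorlake/chunk-extractor implementation, and the small parameter values are chosen just to make the overlap visible:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of up to chunk_size characters, each sharing
    `overlap` characters with the previous chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("abcdefghijklmnopqrstuvwxy", chunk_size=10, overlap=3)
print(chunks[0])  # abcdefghij
print(chunks[1])  # hijklmnopq
```

Note how the last three characters of each chunk reappear at the start of the next one; that overlap keeps sentences that straddle a chunk boundary retrievable from either side.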
Upload a FORM 10-K PDF File¶
import requests
req = requests.get("https://www.sec.gov/files/form10-k.pdf")
with open('form10-k.pdf', 'wb') as f:
    f.write(req.content)
client.upload_file(path="form10-k.pdf")
What is happening behind the scenes¶
Indexify is designed to seamlessly respond to ingestion events by assessing all existing policies and triggering the necessary extractors for extraction. Once the PDF extractor completes the process of extracting texts, bytes, and JSONs from the document, it automatically initiates the embedding extractor to chunk the content, extract embeddings, and populate an index.
With Indexify, you can upload hundreds of PDF files simultaneously, and the platform will handle the extraction and indexing of their contents without manual intervention. To speed up extraction, you can deploy multiple instances of each extractor, and Indexify's built-in scheduler will transparently distribute the workload among them.
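The batch-upload scenario can be sketched as a simple loop over a directory. The helper below is an assumption-laden sketch, not part of the Indexify API: the filings/ directory name is hypothetical, and client is the IndexifyClient created earlier (a running server is required for real uploads):

```python
from pathlib import Path

def upload_directory(client, directory: str, pattern: str = "*.pdf") -> int:
    """Upload every file matching `pattern`; each upload triggers the
    extraction policies that apply to its content. Returns the count."""
    count = 0
    for path in sorted(Path(directory).glob(pattern)):
        client.upload_file(path=str(path))
        count += 1
    return count

# e.g. upload_directory(client, "filings")  # 'filings/' is a hypothetical directory
```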
Perform RAG¶
Initialize the Langchain Retriever.
from indexify_langchain import IndexifyRetriever
params = {"name": "pdfqa.get-embeddings.embedding", "top_k": 3}
retriever = IndexifyRetriever(client=client, params=params)
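The top_k parameter controls how many of the most similar chunks the retriever returns for each query. A minimal sketch of that ranking over an in-memory index, using cosine similarity (the vectors and texts below are made up; Indexify performs this lookup server-side against the real embedding index):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve_top_k(query_vec, index, k=3):
    """Return the texts of the k index entries most similar to the query."""
    ranked = sorted(index, key=lambda e: cosine(query_vec, e["embedding"]), reverse=True)
    return [e["text"] for e in ranked[:k]]

# Hypothetical 2-d embeddings standing in for MiniLM's 384-d vectors
index = [
    {"text": "revenue grew", "embedding": [1.0, 0.1]},
    {"text": "risk factors", "embedding": [0.0, 1.0]},
    {"text": "revenue by segment", "embedding": [0.9, 0.2]},
]
print(retrieve_top_k([1.0, 0.0], index, k=2))  # ['revenue grew', 'revenue by segment']
```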
Now create a chain to prompt OpenAI with data retrieved from Indexify, forming a simple Q&A bot.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI()
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
Now ask any question related to the ingested Form 10-K PDF document.
chain.invoke("What are the disclosure with respect to Foreign Subsidiaries?")
# It may be omitted to the extent that the required disclosure would be detrimental to the registrant.