Installation and Setup¶
- Install the indexify-extractor-sdk package using pip:
pip install -q indexify-extractor-sdk
- Download the required extractors:
  - hub://embedding/minilm-l6: an embedding extractor based on the MiniLM-L6 model.
  - hub://text/chunking: a text chunking extractor.
  - hub://pdf/marker: a PDF marker extractor.
!indexify-extractor download hub://embedding/minilm-l6
!indexify-extractor download hub://text/chunking
!indexify-extractor download hub://pdf/marker
- Start the Indexify Extractor server in a separate terminal using the indexify-extractor join-server command:
!indexify-extractor join-server
- Install the indexify package using pip:
pip install -q indexify
Indexify Client Setup¶
Import the IndexifyClient class from the indexify package, and create an instance of IndexifyClient called client.
from indexify import IndexifyClient
client = IndexifyClient()
Extraction Graph Setup¶
Import the ExtractionGraph class from the indexify package, and define the extraction graph specification in YAML format:
- Set the name of the extraction graph to "pdfqa".
- Define the extraction policies:
  - Use the "tensorlake/marker" extractor for PDF marking and name it "mdextract".
  - Use the "tensorlake/chunk-extractor" for text chunking and name it "chunker".
    - Set the input parameters for the chunker:
      - chunk_size: 1000 (size of each text chunk)
      - overlap: 100 (overlap between consecutive chunks)
    - Set the content source for chunking to "mdextract".
  - Use the "tensorlake/minilm-l6" extractor for embedding, name it "pdfembedding", and set its content source to "chunker".
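The chunk_size and overlap parameters behave like a sliding window over the text: each chunk is chunk_size characters long, and each new chunk starts chunk_size − overlap characters after the previous one. A minimal sketch of that arithmetic (an illustration only, not Indexify's actual chunker implementation):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100):
    """Split text into windows of chunk_size characters overlapping by `overlap`."""
    stride = chunk_size - overlap  # with the defaults, each chunk starts 900 chars later
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 2500)
# 2500 characters with the default parameters yield three chunks:
# text[0:1000], text[900:1900], and text[1800:2500]
```

The 100-character overlap means a sentence split by one chunk boundary is still present whole in the neighboring chunk, which helps retrieval later.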
Create an ExtractionGraph object from the YAML specification using ExtractionGraph.from_yaml(), then create the extraction graph on the Indexify client using client.create_extraction_graph().
from indexify import ExtractionGraph

extraction_graph_spec = """
name: 'pdfqa'
extraction_policies:
  - extractor: 'tensorlake/marker'
    name: 'mdextract'
  - extractor: 'tensorlake/chunk-extractor'
    name: 'chunker'
    input_params:
      chunk_size: 1000
      overlap: 100
    content_source: 'mdextract'
  - extractor: 'tensorlake/minilm-l6'
    name: 'pdfembedding'
    content_source: 'chunker'
"""
extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
Document Ingestion¶
- Add the PDF document to the "pdfqa" extraction graph using client.upload_file().
client.upload_file("pdfqa", "chess.pdf")
Context Retrieval Function¶
Define a function called get_context that takes a question, an index name, and top_k as parameters. It searches the specified index using client.search_index() with the given question and top_k, concatenates the retrieved passages into a single context string, and returns that string.
def get_context(question: str, index: str, top_k=3):
    # Pass top_k through to the search instead of hardcoding 3.
    results = client.search_index(name=index, query=question, top_k=top_k)
    context = ""
    for result in results:
        context = context + f"content id: {result['content_id']} \n\n passage: {result['text']}\n"
    return context
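The concatenation logic itself can be exercised without a running Indexify server. Assuming search_index returns dicts with 'content_id' and 'text' keys, as the loop above expects, a stubbed result list (hypothetical data, for illustration only) shows the shape of the context string:

```python
# Hypothetical stubbed search results, mimicking what the loop above consumes.
fake_results = [
    {"content_id": "a1", "text": "Kasparov held the top rating for many years."},
    {"content_id": "b2", "text": "Carlsen's peak rating is 2882."},
]

context = ""
for result in fake_results:
    context += f"content id: {result['content_id']} \n\n passage: {result['text']}\n"

print(context)
```

Each passage is prefixed with its content id, so the final prompt lets the model (or a human reading the logs) trace an answer back to the source chunk.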
Prompt Creation Function¶
Define a function called create_prompt that takes a question and context as parameters, builds a prompt string containing both, and returns it.
def create_prompt(question, context):
    return f"Answer the question, based on the context.\n question: {question} \n context: {context}"
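Since create_prompt is plain string formatting, it can be sanity-checked in isolation (the definition is repeated here so the snippet is self-contained; the question and context values are made up for illustration):

```python
def create_prompt(question, context):
    return f"Answer the question, based on the context.\n question: {question} \n context: {context}"

# Hypothetical inputs, just to show the resulting prompt shape.
prompt = create_prompt(
    "Who is the greatest player of all time?",
    "content id: a1 \n\n passage: Kasparov held the top rating for many years.\n",
)
print(prompt)
```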
Question Answering¶
Define a question string, then call the get_context function with the question, the index name ("pdfqa.pdfembedding.embedding"), and top_k (default 3) to retrieve the relevant context.
question = "Who is the greatest player of all time and what is his record?"
context = get_context(question, "pdfqa.pdfembedding.embedding")
OpenAI Client Setup¶
Import the OpenAI class from the openai package, and create an instance of the OpenAI client called client_openai with your API key.
from openai import OpenAI
client_openai = OpenAI(api_key="")  # add your OpenAI API key here
Question Answering with OpenAI¶
Call the create_prompt function with the question and retrieved context to generate the prompt, then use the client_openai.chat.completions.create() method to send the prompt to the OpenAI API:
- Set the model to "gpt-3.5-turbo".
- Pass the prompt as a message with the "user" role.
Finally, print the generated answer from the API response.
prompt = create_prompt(question, context)
chat_completion = client_openai.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
)
print(chat_completion.choices[0].message.content)