Chat with a 300-Page PDF

Intermediate · 10 min · By Ana Brainiall

Why PDFs Are a Special Challenge

PDFs are tricky because they combine three kinds of content:

1. Structured text: paragraphs, lists, footnotes
2. Visual layout: columns, tables, diagrams, charts
3. Images: photos, logos, embedded screenshots

PDF is a visual-first format: it is designed to preserve a document's appearance on any device. The text is stored as positioned glyphs rather than as paragraphs, so recovering the original semantic content isn't always straightforward.

At Brainiall, when you upload a PDF:
- Raw text is extracted (pdfplumber or pdfium)
- Tables are detected (camelot or tabula)
- Pages are converted to images
- OCR (Whisper-OCR or Mistral-OCR) is applied to pages where text can't be extracted directly
- Hierarchical structure is identified (headings, sections)
- Optionally, the content is summarized and vectorized for RAG
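
The routing between direct extraction and OCR in the steps above can be sketched roughly as follows. This is a minimal sketch, not Brainiall's actual pipeline: the function names and the `min_chars` threshold are assumptions made for illustration. Only `pdfplumber.open` and `page.extract_text()` are the library's real calls.

```python
from typing import List, Optional


def extract_page_texts(path: str) -> List[str]:
    """Return the embedded text of each page ('' when a page has none)."""
    import pdfplumber  # third-party; imported lazily so the helper below runs without it

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None or '' for image-only pages
        return [(page.extract_text() or "") for page in pdf.pages]


def pages_needing_ocr(page_texts: List[Optional[str]], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is too short to trust.

    min_chars is a hypothetical threshold: a page with fewer non-whitespace
    characters than this is treated as a scan and routed to OCR.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join((text or "").split())  # drop all whitespace
        if len(visible) < min_chars:
            flagged.append(number)
    return flagged


# Illustrative input: page 1 has real text, pages 2 and 3 have almost none
sample = ["Chapter 1. Introduction to the agreement between the parties.", "", "  \n 12 "]
print(pages_needing_ocr(sample))
```

In practice you would call `pages_needing_ocr(extract_page_texts("contract.pdf"))` and send only the flagged pages through the image and OCR steps.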

[Illustration: a PDF being broken down into 4 layers: text, tables, images]

Conversation Flow: RAG vs Full Context

Two strategies depending on document size:

PDF < 50 pages (~100k tokens):
- Send the full text in the prompt to Claude Sonnet or Gemini Pro
- The model "sees" everything and responds based on complete context
- Advantage: no information is lost
- Disadvantage: costly for multiple questions (each request reprocesses the PDF)

PDF > 50 pages:
- Use RAG (Retrieval-Augmented Generation)
- Split the PDF into chunks of ~500 tokens
- Vectorize each chunk
- For each user question, retrieve the 5–10 most semantically relevant chunks
- Send ONLY those chunks in the prompt
- Advantage: affordable and scalable
- Disadvantage: if the model needs to connect information from distant sections, context may be lost
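
The RAG steps above (chunk, vectorize, retrieve) can be illustrated with a small self-contained sketch. It rests on two loud assumptions: tokens are approximated by whitespace-separated words, and `embed` is a toy word-count vector standing in for a real embedding model. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def chunk_text(text: str, chunk_size: int = 500) -> List[str]:
    """Split text into chunks of about chunk_size whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], k: int = 5) -> List[str]:
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "The lessee shall pay rent on the first day of each month.",
    "Either party may terminate this agreement with ninety days notice.",
    "The equipment must be returned in working condition.",
]
top = retrieve("When can a party terminate the agreement?", chunks, k=1)
print(top[0])
```

Only the retrieved chunks would then be placed in the prompt. Word-count similarity misses synonyms and paraphrases, which is why production systems use learned embeddings instead.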

Brainiall automatically decides which strategy to use based on the PDF size.
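
As a rough illustration of that decision and of the cost difference between the two strategies, consider the sketch below. It is not Brainiall's actual logic: the function names are hypothetical, a PDF of exactly 50 pages is assumed to go to RAG, and the token figures simply extrapolate the article's own numbers (about 100k tokens per 50 pages, 500-token chunks, up to 10 chunks per question).

```python
def choose_strategy(page_count: int, page_threshold: int = 50) -> str:
    """Return 'full_context' for short PDFs and 'rag' otherwise."""
    return "full_context" if page_count < page_threshold else "rag"


def input_tokens(strategy: str, doc_tokens: int, questions: int,
                 chunk_tokens: int = 500, top_k: int = 10) -> int:
    """Rough input-token total for a session (the question text itself is ignored)."""
    if strategy == "full_context":
        per_question = doc_tokens  # the whole document is resent every time
    else:
        per_question = chunk_tokens * top_k  # only the retrieved chunks are sent
    return per_question * questions


# A 300-page PDF at ~2,000 tokens per page is ~600,000 tokens (assumed ratio)
doc_tokens = 300 * 2_000
print(choose_strategy(300))
print(input_tokens("full_context", doc_tokens, questions=20))
print(input_tokens("rag", doc_tokens, questions=20))
```

Under these assumptions, 20 questions against the full document cost about 12 million input tokens, versus about 100,000 with RAG.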

Practical Use Cases

Legal documents: chat with an 80-page contract to find specific clauses
Academic papers: "what are the main arguments against the author's thesis?"
Financial reports: "compare Q3 vs Q4 growth in this 10-K"
Technical manuals: "what's the procedure to reset the equipment?"
Textbooks: private tutoring on any topic
Legal proceedings: search for dates, parties, and key facts across 500+ page case files

Common Pitfalls

Complex tables: nested or merged tables can come out garbled in extracted text; use image OCR as a fallback
Mathematical formulas: formulas typeset with LaTeX often come out as unreadable text when extracted; vision models handle them much better
Old scanned documents: PDFs that are image-only (no embedded text) require OCR, which can misread words
Less common languages: OCR accuracy tends to be lower for low-resource languages
Password-protected PDFs: encrypted or copy-protected PDFs can block extraction, so the password is required before the text can be read

Questions That Work Well vs. Poorly

Work well:
- "What is the central argument of chapter 3?"
- "List all dates mentioned in this report"
- "Compare the conclusions from section 4 and section 7"
- "What was the net revenue in 2025?"

Work poorly:
- "Summarize this entire PDF in 2 paragraphs" (requires full context that may be lost in RAG)
- "What is the author's emotional tone at the end?" (nuance that's hard to capture across chunks)
- "What's in the image on page 45?" (requires dedicated vision processing)

[Illustration: two-column visual comparison, with "questions that work" marked with green checkmarks]

Integrating via API

```python
import httpx

API_KEY = "brnl-xxx"  # placeholder; replace with your own key
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Upload the PDF first
with open("contract.pdf", "rb") as f:
    r = httpx.post(
        "https://api.brainiall.com/v1/files",
        files={"file": f},
        headers=HEADERS,
    )
r.raise_for_status()  # stop here if the upload failed
file_id = r.json()["id"]

# Then, chat referencing the file
r = httpx.post(
    "https://api.brainiall.com/v1/chat/completions",
    json={
        "model": "claude-sonnet-4-6",
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": "List all parties in this contract"},
                {"type": "file", "file_id": file_id},
            ]}
        ],
    },
    headers=HEADERS,
    timeout=60.0,  # model responses can take longer than httpx's default timeout
)
r.raise_for_status()
print(r.json())
```
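
If you plan to ask several questions about the same document, the request body can be built by a small helper so the file is uploaded only once. This is a sketch that mirrors the payload shape shown above. The helper name and the `file-123` id are hypothetical, and it assumes a `file_id` can be reused across requests, which the article does not confirm.

```python
from typing import Any, Dict, List


def build_chat_payload(file_id: str, question: str,
                       model: str = "claude-sonnet-4-6") -> Dict[str, Any]:
    """Build a chat request body that pairs one question with an uploaded file."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": question},
                {"type": "file", "file_id": file_id},
            ]}
        ],
    }


questions: List[str] = [
    "List all parties in this contract",
    "What is the termination notice period?",
]
# "file-123" is a made-up id; use the one returned by the upload call
payloads = [build_chat_payload("file-123", q) for q in questions]
print(len(payloads))
```

Each payload can then be passed as the `json=` argument of the chat request.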

Try It Right Now

In the Brainiall chat, drag a PDF into the input area and start asking questions. Files can be up to 10 MB each. The Pro plan at $29 allows generous uploads; Business adds batch processing and 30-day file retention.