Images, PDFs & Files
Your agent can see and process files that visitors upload. Images, PDFs, Word documents — visitors can share them directly in the chat and your agent will understand and respond to them.
Supported Formats
Images:
- JPG, PNG, WebP, GIF
- Sent as base64 data URL to the vision model
- Drag & drop or paste from clipboard
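
As a rough illustration of what "sent as base64 data URL" means, the sketch below turns raw image bytes into a data URL. The helper name `toDataUrl` and the extension-to-MIME table are hypothetical and not taken from the platform's code. Accepting `.jpeg` alongside `.jpg` is also an assumption.

```typescript
// Hypothetical helper for illustration; not the platform's actual implementation.
// Maps the supported image extensions to MIME types (".jpeg" is an assumed alias).
const IMAGE_MIME_TYPES: Record<string, string> = {
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  png: "image/png",
  webp: "image/webp",
  gif: "image/gif",
};

function toDataUrl(filename: string, bytes: Uint8Array): string {
  const ext = (filename.split(".").pop() ?? "").toLowerCase();
  const mime = IMAGE_MIME_TYPES[ext];
  if (!mime) {
    throw new Error("Unsupported image format: " + ext);
  }
  // Build a binary string one byte at a time, then base64-encode it.
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return "data:" + mime + ";base64," + btoa(binary);
}
```

For example, `toDataUrl("note.png", new Uint8Array([72, 105]))` yields `data:image/png;base64,SGk=`. The resulting string is what a vision model would receive in place of the file itself.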
PDFs:
- Extracted using unpdf + pdfjs-dist
- Image-only PDFs (scanned documents) are processed with sharp for OCR
- Up to 10MB file size
DOCX:
- Extracted using mammoth (converts Word to clean text)
- Handles formatting, tables, lists
How Visitors Upload Files
Three ways:
- Drag and drop — drag a file onto the chat widget
- Paste — Ctrl/Cmd+V to paste an image from clipboard
- Upload button — click the attachment icon in the chat input
How It Works Under the Hood
When a visitor uploads a file:
- The file is converted to the appropriate format (base64 for images, text extraction for PDFs/DOCX)
- The content is included in the message sent to the LLM
- The LLM processes the content alongside the text message (if any)
- The agent responds based on both the file content and the conversation context
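
The steps above can be pictured as assembling a list of content parts for a single message. The sketch below is an assumption about the general shape, not the platform's real message format: the `Upload` and `ContentPart` types, the `buildMessageContent` function, and the "[Attached file: ...]" label are all hypothetical.

```typescript
// Illustrative only: these type and function names are hypothetical.
type Upload =
  | { kind: "image"; dataUrl: string }
  | { kind: "document"; filename: string; extractedText: string };

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; url: string };

function buildMessageContent(userText: string, upload?: Upload): ContentPart[] {
  const parts: ContentPart[] = [];
  if (upload) {
    if (upload.kind === "image") {
      // Images travel as a base64 data URL.
      parts.push({ type: "image", url: upload.dataUrl });
    } else {
      // PDFs and DOCX files travel as extracted text.
      parts.push({
        type: "text",
        text: "[Attached file: " + upload.filename + "]\n" + upload.extractedText,
      });
    }
  }
  // The visitor's own text is optional, so skip it when empty.
  if (userText.trim() !== "") {
    parts.push({ type: "text", text: userText });
  }
  return parts;
}
```

A message with a resume and a question would produce two text parts, while an image sent with no text would produce a single image part.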
Important: During multimodal processing (when an image is included in the message), tool use is disabled. This means your agent can't run web searches or call other tools while also processing an image. This is a technical limitation of multimodal processing: for that message, the agent works only from the visual content and the conversation context.
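
To make the rule concrete, here is a minimal sketch of how a request builder might drop tools whenever a message contains an image. The `ContentPart` type, the `selectTools` function, and the tool name `web_search` are hypothetical and used only for illustration.

```typescript
// Hypothetical sketch of the "no tools with images" rule; not the platform's code.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; url: string };

function selectTools(parts: ContentPart[], configuredTools: string[]): string[] {
  const hasImage = parts.some((part) => part.type === "image");
  // With an image present, the request goes out with no tools for that message.
  return hasImage ? [] : configuredTools;
}
```

Under this sketch, a text-only message keeps `["web_search"]`, while the same message with an image attached is sent with an empty tool list. The next text-only message gets its tools back.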
Soul Patterns for Image-Heavy Agents
If your agent regularly receives images (portfolio reviews, product photos, diagrams), add specific handling in your soul:
<knowledge>
## IMAGE HANDLING
- When a visitor shares an image, describe what you see first,
then provide your analysis
- For portfolio images: focus on composition, technique, and style
- For product photos: identify the product and provide relevant information
- For diagrams: read and explain the structure, then answer questions
</knowledge>
<output_format>
## IMAGE RESPONSES
- Start with a brief description of the image (1 sentence)
- Then provide your analysis or answer
- If the image is unclear, say so — don't guess
</output_format>
PDF Processing Details
PDFs go through a multi-stage pipeline:
- Text extraction: unpdf extracts text from standard PDFs
- Image detection: if the PDF is mostly images (scanned documents), sharp processes each page
- Content assembly: extracted text is combined and sent to the LLM
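
The decision logic in this pipeline can be sketched as follows. This is a simplified model under stated assumptions, not the actual implementation: the "mostly images" check is approximated by an average-characters-per-page threshold (the value 50 is invented), and the per-page OCR step is passed in as a hypothetical `ocrPage` callback instead of calling any real library.

```typescript
// Assumed heuristic: a PDF whose pages yield very little text is treated as scanned.
const MIN_CHARS_PER_PAGE = 50; // illustrative threshold, not a documented value

function looksScanned(pageTexts: string[]): boolean {
  if (pageTexts.length === 0) return false; // nothing to process
  let totalChars = 0;
  for (const text of pageTexts) {
    totalChars += text.trim().length;
  }
  return totalChars / pageTexts.length < MIN_CHARS_PER_PAGE;
}

// ocrPage is a hypothetical stand-in for the image-based fallback step.
function assemblePdfContent(
  pageTexts: string[],
  ocrPage: (pageIndex: number) => string
): string {
  const pages = looksScanned(pageTexts)
    ? pageTexts.map((_, index) => ocrPage(index))
    : pageTexts;
  // Combine non-empty pages into one block of text for the LLM.
  return pages
    .map((page) => page.trim())
    .filter((page) => page !== "")
    .join("\n\n");
}
```

A text-based PDF passes straight through to assembly, while a PDF whose pages extract to almost nothing is routed through the fallback for every page.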
What works well: Resumes, reports, articles, documentation, contracts
What's challenging: Complex layouts with tables/charts (content is extracted but layout may be lost), heavily designed PDFs (such as marketing brochures with text embedded in images)
Use Cases
Resume review agents: Visitors paste a job description and upload their resume. The agent compares the two and provides fit analysis.
Document analysis agents: Visitors upload contracts, reports, or articles. The agent reads and answers questions about the content.
Visual feedback agents: Visitors share images of their work (design, photography, code screenshots). The agent provides feedback based on its expertise.
Product identification agents: Visitors share photos of products. The agent identifies the product and provides information (pricing, availability, alternatives).
Limitations
- File size: Maximum 10MB per upload
- No tool use: When processing images, web search and other tools are disabled for that message
- One file per message: Best results with one file per message — multiple files in a single message can dilute the agent's attention
- No video: Video files are not supported
- Image quality: Low-resolution or blurry images may produce poor results — the LLM can only work with what it can see
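
If you pre-check files before they reach the chat (for example, in your own upload form), the size and format limits above could be expressed roughly like this. The function name `validateUpload` is hypothetical, treating 10MB as 10 × 1024 × 1024 bytes is an assumption, and so is accepting `.jpeg` as an alias for `.jpg`.

```typescript
// Hypothetical pre-check mirroring the documented limits; not the platform's code.
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024; // assumes 1MB = 1024 * 1024 bytes
const SUPPORTED_EXTENSIONS = ["jpg", "jpeg", "png", "webp", "gif", "pdf", "docx"];

// Returns null when the file is acceptable, otherwise a human-readable reason.
function validateUpload(filename: string, sizeBytes: number): string | null {
  const ext = (filename.split(".").pop() ?? "").toLowerCase();
  if (SUPPORTED_EXTENSIONS.indexOf(ext) === -1) {
    return "Unsupported file type: ." + ext;
  }
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    return "File exceeds the 10MB limit";
  }
  return null;
}
```

Under these assumptions, `validateUpload("resume.pdf", 1024)` returns `null`, while a video such as `clip.mp4` or an 11MB PNG returns a reason string.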