Images, PDFs & Files

4 min read · Updated March 23, 2026

Your agent can see and process files that visitors upload. Images, PDFs, Word documents — visitors can share them directly in the chat and your agent will understand and respond to them.

Supported Formats

Images:

  • JPG, PNG, WebP, GIF
  • Sent as a base64 data URL to the vision model
  • Drag & drop or paste from clipboard
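
As a rough illustration of the base64 data URL step, the sketch below encodes raw image bytes for the four supported formats. The helper name `toImageDataUrl` and the MIME type list are assumptions made for this example; the widget's actual encoding code is not shown in this article.

```typescript
// Hypothetical helper, not the widget's real implementation.
// MIME types assumed for JPG, PNG, WebP, and GIF.
const SUPPORTED_IMAGE_TYPES: string[] = [
  "image/jpeg",
  "image/png",
  "image/webp",
  "image/gif",
];

function toImageDataUrl(bytes: Uint8Array, mimeType: string): string {
  if (!SUPPORTED_IMAGE_TYPES.includes(mimeType)) {
    throw new Error(`Unsupported image type: ${mimeType}`);
  }
  // btoa expects a "binary string" with one character per byte.
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}
```

Base64 adds roughly a third to the payload size, which is one reason to keep uploaded images reasonably small.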

PDFs:

  • Extracted using unpdf + pdfjs-dist
  • Image-only PDFs (scanned documents) are preprocessed with sharp before OCR
  • Up to 10MB file size

DOCX:

  • Extracted using mammoth (converts Word to clean text)
  • Handles formatting, tables, lists

How Visitors Upload Files

Three ways:

  1. Drag and drop — drag a file onto the chat widget
  2. Paste — Ctrl/Cmd+V to paste an image from clipboard
  3. Upload button — click the attachment icon in the chat input

How It Works Under the Hood

When a visitor uploads a file:

  1. The file is converted to the appropriate format (base64 for images, text extraction for PDFs/DOCX)
  2. The content is included in the message sent to the LLM
  3. The LLM processes the content alongside the text message (if any)
  4. The agent responds based on both the file content and the conversation context
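
The four steps above can be sketched as a single function that turns a prepared upload plus the visitor's text into message content. The `PreparedUpload` and `ContentPart` shapes are hypothetical and do not correspond to any specific LLM provider's API; they only illustrate how file content and text travel together.

```typescript
// Hypothetical shapes for illustration only.
type PreparedUpload =
  | { kind: "image"; dataUrl: string }
  | { kind: "text"; source: "pdf" | "docx"; text: string };

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; url: string };

function buildUserMessage(
  visitorText: string,
  upload?: PreparedUpload,
): ContentPart[] {
  const parts: ContentPart[] = [];
  if (upload?.kind === "text") {
    // PDFs and DOCX files arrive as extracted text.
    parts.push({
      type: "text",
      text: `[Attached ${upload.source.toUpperCase()}]\n${upload.text}`,
    });
  } else if (upload?.kind === "image") {
    // Images arrive as a base64 data URL.
    parts.push({ type: "image_url", url: upload.dataUrl });
  }
  // The visitor's text message is optional.
  if (visitorText.trim() !== "") {
    parts.push({ type: "text", text: visitorText });
  }
  return parts;
}
```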

Important: When a message includes an image (multimodal processing), tool use is disabled for that message. This means your agent can't run web searches or call other tools while it processes the image. This is a technical limitation of image processing: the agent gives its full attention to the visual content.
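
One way to picture this rule is as a per-message check. In this minimal sketch, `OutgoingMessage` and `toolsForMessage` are invented names, and the platform's real gating logic may differ.

```typescript
// Hypothetical message shape; not the platform's actual type.
interface OutgoingMessage {
  text: string;
  imageDataUrls: string[];
}

// Returns the tools available for this one message.
function toolsForMessage(
  message: OutgoingMessage,
  configuredTools: string[],
): string[] {
  // Any attached image switches tools off for this message only.
  return message.imageDataUrls.length > 0 ? [] : configuredTools;
}
```

Because the check is per message, a visitor can send an image first and ask a question that needs web search in a follow-up text message.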

Soul Patterns for Image-Heavy Agents

If your agent regularly receives images (portfolio reviews, product photos, diagrams), add specific handling in your soul:

<knowledge>
## IMAGE HANDLING
- When a visitor shares an image, describe what you see first,
  then provide your analysis
- For portfolio images: focus on composition, technique, and style
- For product photos: identify the product and provide relevant information
- For diagrams: read and explain the structure, then answer questions
</knowledge>
<output_format>
## IMAGE RESPONSES
- Start with a brief description of the image (1 sentence)
- Then provide your analysis or answer
- If the image is unclear, say so — don't guess
</output_format>

PDF Processing Details

PDFs go through a multi-stage pipeline:

  1. Text extraction: unpdf extracts text from standard PDFs
  2. Image detection: if the PDF is mostly images (scanned documents), sharp processes each page
  3. Content assembly: extracted text is combined and sent to the LLM
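
The image-detection step could be approximated with a simple heuristic on the text extracted per page. This is only a sketch: the function name, the 20-character threshold, and the "more than half the pages" rule are all assumptions, since the article does not state how the pipeline decides that a PDF is mostly images.

```typescript
// Hypothetical heuristic; the real detection rule is not documented here.
// pageTexts holds the text extracted from each page, in order.
function looksScanned(pageTexts: string[], minCharsPerPage = 20): boolean {
  // A PDF with no pages of text at all is treated as image-only.
  if (pageTexts.length === 0) return true;
  const sparsePages = pageTexts.filter(
    (text) => text.replace(/\s+/g, "").length < minCharsPerPage,
  ).length;
  // "Mostly images" is read as: more than half the pages have almost no text.
  return sparsePages > pageTexts.length / 2;
}
```

A per-page check like this avoids misclassifying a long text document that happens to include a few full-page figures.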

What works well: Resumes, reports, articles, documentation, contracts

What's challenging: Complex layouts with tables/charts (content is extracted but layout may be lost), heavily designed PDFs (think marketing brochures with text embedded in images)

Use Cases

Resume review agents: Visitors paste a job description and upload their resume. The agent compares the two and provides fit analysis.

Document analysis agents: Visitors upload contracts, reports, or articles. The agent reads and answers questions about the content.

Visual feedback agents: Visitors share images of their work (design, photography, code screenshots). The agent provides feedback based on its expertise.

Product identification agents: Visitors share photos of products. The agent identifies the product and provides information (pricing, availability, alternatives).

Limitations

  • File size: Maximum 10MB per upload
  • No tool use: When processing images, web search and other tools are disabled for that message
  • One file per message: Best results with one file per message — multiple files in a single message can dilute the agent's attention
  • No video: Video files are not supported
  • Image quality: Low-resolution or blurry images may produce poor results — the LLM can only work with what it can see
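
If you accept files on your own page before handing them to the chat, a pre-check along these lines can surface the size and format limits early. This is a sketch under assumptions: `checkUpload` is a hypothetical helper, "10MB" is read as 10 × 1024 × 1024 bytes, and the MIME types are the standard ones for the formats listed above.

```typescript
// Assumption: "10MB" means 10 MiB; the exact byte limit is not documented.
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024;

// Standard MIME types for JPG, PNG, WebP, GIF, PDF, and DOCX.
const ACCEPTED_TYPES = new Set<string>([
  "image/jpeg",
  "image/png",
  "image/webp",
  "image/gif",
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]);

type UploadCheck = { ok: true } | { ok: false; reason: string };

function checkUpload(mimeType: string, sizeBytes: number): UploadCheck {
  if (mimeType.startsWith("video/")) {
    return { ok: false, reason: "Video files are not supported" };
  }
  if (!ACCEPTED_TYPES.has(mimeType)) {
    return { ok: false, reason: `Unsupported file type: ${mimeType}` };
  }
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    return { ok: false, reason: "File exceeds the 10MB limit" };
  }
  return { ok: true };
}
```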