Image Analysis and Vision
Triggerfish supports image input across all interfaces. You can paste images from your clipboard in the CLI or browser, and the agent can analyze image files on disk. When your primary model does not support vision, a separate vision model can automatically describe images before they reach the primary model.
Image Input
CLI: Clipboard Paste (Ctrl+V)
Press Ctrl+V in the CLI chat to paste an image from your system clipboard. The image is read from the OS clipboard, base64-encoded, and sent to the agent as a multimodal content block alongside your text message.
Clipboard reading supports:
- Linux --
xcliporxsel - macOS --
pbpaste/osascript - Windows -- PowerShell clipboard access
Tidepool: Browser Paste
In the Tidepool web interface, paste images directly into the chat input using your browser's native paste functionality (Ctrl+V / Cmd+V). The image is read as a data URL and sent as a base64-encoded content block.
image_analyze Tool
The agent can analyze image files on disk using the image_analyze tool.
| Parameter | Type | Required | Description |
|---|---|---|---|
path | string | yes | Absolute path to the image file |
prompt | string | no | Question or prompt about the image (default: "Describe this image in detail") |
Supported formats: PNG, JPEG, GIF, WebP, BMP, SVG
The tool reads the file, base64-encodes it, and sends it to a vision-capable LLM provider for analysis.
Vision Model Fallback
When your primary model does not support vision (e.g., Z.AI glm-5), you can configure a separate vision model to automatically describe images before they reach the primary model.
How It Works
- You paste an image (Ctrl+V) or send multimodal content
- The orchestrator detects image content blocks in the message
- The vision model describes each image (you see an "Analyzing image..." spinner)
- Image blocks are replaced with text descriptions:
[The user shared an image. A vision model described it as follows: ...] - The primary model receives a text-only message with the descriptions
- A system prompt hint tells the primary model to treat the descriptions as if it can see the images
This is completely transparent -- you paste an image and get a response, regardless of whether the primary model supports vision.
Configuration
Add a vision field to your models config:
yaml
models:
primary:
provider: zai
model: glm-5 # Non-vision primary model
vision: glm-4.5v # Vision model for image description
providers:
zai:
model: glm-5The vision model reuses credentials from the primary provider's keychain entry. In this example, the primary provider is zai, so glm-4.5v uses the same API key stored in the OS keychain for the zai provider.
| Key | Type | Description |
|---|---|---|
models.vision | string | Optional vision model name for automatic image description |
When Vision Fallback Activates
- Only when
models.visionis configured - Only when the message contains image content blocks
- String-only messages and text-only content blocks skip the fallback entirely
- If the vision provider fails, the error is handled gracefully and the agent continues
Events
The orchestrator emits two events during vision processing:
| Event | Description |
|---|---|
vision_start | Image description begins (includes imageCount) |
vision_complete | All images described |
These events drive the "Analyzing image..." spinner in the CLI and Tidepool interfaces.
If your primary model already supports vision (e.g., Anthropic Claude,
OpenAI GPT-4o, Google Gemini), you do not need to configure models.vision. Images will be sent directly to the primary model as multimodal content. :::
