Image To Audio
Overview
The Image To Audio node accepts an image and produces a spoken audio description in two stages: a multimodal model generates a caption or narrative for the image, and that text is then synthesized to speech. Use it in battle workflows where visual content must be accessible via audio, or as a pre-processing step before audio-based judging. A valid multimodal model credential and a TTS voice selection are required. If the model call or synthesis fails, the node emits on the error port rather than halting the workflow. Output audio is returned as a base64-encoded buffer in the configured output format, with a matching MIME type.
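The two-stage flow and its error routing can be sketched as follows. This is a minimal illustration, not the node's actual implementation: the `describe` and `synthesize` callables stand in for the model and TTS calls, and the error `code` values are hypothetical.

```python
import base64
from typing import Callable


def image_to_audio(
    image: bytes,
    describe: Callable[[bytes], str],    # stand-in for the multimodal model call
    synthesize: Callable[[str], bytes],  # stand-in for the TTS call
    mime_type: str = "audio/mpeg",
) -> tuple[str, dict]:
    """Return (port_name, payload) instead of raising, mirroring the
    documented behavior of emitting on the error port."""
    try:
        description = describe(image)
    except Exception as exc:
        # "DESCRIPTION_FAILED" is a made-up code for illustration only.
        return "error", {
            "message": str(exc),
            "code": "DESCRIPTION_FAILED",
            "stage": "description",
        }

    try:
        audio_bytes = synthesize(description)
    except Exception as exc:
        # "SYNTHESIS_FAILED" is likewise hypothetical.
        return "error", {
            "message": str(exc),
            "code": "SYNTHESIS_FAILED",
            "stage": "synthesis",
        }

    return "audio", {
        "data": base64.b64encode(audio_bytes).decode("ascii"),
        "mimeType": mime_type,
        "description": description,
    }
```

Returning a port name with the payload keeps both failure modes on the same path as success, so a caller can branch without wrapping the call in its own exception handling.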
Configuration
| Field | Type | Required | Description |
|---|---|---|---|
| model | enum | Yes | Multimodal model used to generate the image description (e.g. gpt-4o, gemini-1.5-pro, claude-3-5-sonnet). |
| ttsVoice | enum | Yes | Text-to-speech voice used to synthesize the generated description (e.g. alloy, nova, echo). |
| descriptionPrompt | string | No | Custom system prompt appended to the model request to steer the style or focus of the image description. Defaults to a neutral accessibility-style caption prompt. |
| outputFormat | enum | No | Audio encoding format for the synthesized output: mp3, opus, or wav. Defaults to mp3. |
| maxDescriptionTokens | number | No | Maximum tokens allowed for the generated description before TTS synthesis. Limits cost and audio length. Defaults to 300. |
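The required fields and defaults in the table can be checked before a workflow runs. The sketch below is one way to do that. The function name and error messages are illustrative rather than part of the node's API, and treating maxDescriptionTokens as a positive integer is an assumption (the table only says "number").

```python
from typing import Any

# Values taken from the Configuration table above.
ALLOWED_FORMATS = {"mp3", "opus", "wav"}
DEFAULTS = {"outputFormat": "mp3", "maxDescriptionTokens": 300}
REQUIRED = ("model", "ttsVoice")


def resolve_config(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of config with the documented defaults filled in.

    Raises ValueError when a required field is missing or a value is
    outside the documented options.
    """
    missing = [name for name in REQUIRED if not config.get(name)]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")

    resolved = {**DEFAULTS, **config}

    if resolved["outputFormat"] not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported outputFormat: {resolved['outputFormat']!r}")

    tokens = resolved["maxDescriptionTokens"]
    # Assumption: the limit must be a positive integer (bool is excluded
    # explicitly because it is a subclass of int in Python).
    if isinstance(tokens, bool) or not isinstance(tokens, int) or tokens <= 0:
        raise ValueError("maxDescriptionTokens must be a positive integer")

    return resolved
```

The default for descriptionPrompt is deliberately left out, since the text of the built-in accessibility-style prompt is not given here.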
Inputs
| Port | Type | Description |
|---|---|---|
| image | string or binary buffer | The source image. Accepts a URL, base64-encoded data URI, or raw binary buffer. Must be a supported format (JPEG, PNG, WebP, GIF). |
| promptOverride | string | Optional runtime override for the description prompt, taking precedence over the static descriptionPrompt config field. |
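The three accepted image forms and the prompt precedence rule can be illustrated with two small helpers. Both function names are hypothetical. Treating an empty promptOverride as "not provided" is an assumption, and the default prompt is passed in as a parameter because its text is not documented here.

```python
from typing import Optional, Union


def classify_image_input(image: Union[str, bytes, bytearray]) -> str:
    """Classify the image input as 'buffer', 'data_uri', or 'url'."""
    if isinstance(image, (bytes, bytearray)):
        return "buffer"
    if image.startswith("data:"):
        return "data_uri"
    if image.startswith(("http://", "https://")):
        return "url"
    raise ValueError("image must be a URL, a data URI, or a binary buffer")


def effective_prompt(
    prompt_override: Optional[str],
    description_prompt: Optional[str],
    default_prompt: str,
) -> str:
    """Apply the documented precedence: runtime override, then the static
    config field, then the built-in default."""
    if prompt_override:      # assumption: empty string counts as absent
        return prompt_override
    if description_prompt:
        return description_prompt
    return default_prompt
```

For example, `effective_prompt("Focus on text in the image.", "Describe the scene.", "...")` returns the override, while passing `None` as the first argument falls back to the static config value.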
Outputs
| Port | Type | Description |
|---|---|---|
| audio | audio object | Emitted on success. Includes the base64-encoded audio (data), its MIME type (mimeType, e.g. audio/mpeg), and the model-generated caption text that was synthesized (description). |
| error | error object | Emitted when the model call or TTS synthesis fails. Includes message, code, and stage (description or synthesis) so downstream nodes can branch on the failing step. |
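A downstream consumer might handle the two ports roughly as below. The field names (data, stage) come from the table above. The helper names and the returned action strings are purely illustrative, since the recovery strategy is up to the workflow author.

```python
import base64
from typing import Any


def decode_audio(audio: dict[str, Any]) -> bytes:
    """Decode the base64 payload carried in the audio object's data field."""
    return base64.b64decode(audio["data"], validate=True)


def next_step_for_error(error: dict[str, Any]) -> str:
    """Pick a follow-up action based on which stage failed.

    The action names are hypothetical placeholders for whatever branches
    the surrounding workflow defines.
    """
    stage = error.get("stage")
    if stage == "description":
        return "retry_with_other_model"
    if stage == "synthesis":
        return "retry_with_other_voice"
    return "abort"
```

Branching on stage rather than on message text keeps the handler stable if the wording of error messages changes.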
Example
json
{
  "nodeType": "image_to_audio",
  "config": {
    "model": "gpt-4o",
    "ttsVoice": "nova",
    "descriptionPrompt": "Describe this image concisely for a visually impaired listener. Focus on subject, composition, and mood.",
    "outputFormat": "mp3",
    "maxDescriptionTokens": 250
  }
}